Article pubs.acs.org/IECR
Phase Partition and Phase-Based Process Monitoring Methods for Multiphase Batch Processes with Uneven Durations Lijia Luo,*,†,‡ Shiyi Bao,† Jianfeng Mao,† and Di Tang† †
College of Mechanical Engineering, Zhejiang University of Technology, Engineering Research Center of Process Equipment and Remanufacturing, Ministry of Education, Hangzhou, China, 310014 ‡ Department of Chemical and Biomolecular Engineering, University of Delaware, Newark, Delaware 19716, United States ABSTRACT: Integrated phase partition, online phase identification, and phase-based monitoring methods are proposed for multiphase batch processes with uneven durations. A new phase partition method is developed based on the warped K-means (WKM) clustering algorithm, which divides the entire batch into several operation phases by clustering the trajectory data of phase-sensitive process variables. This WKM-based phase partition method can efficiently cope with the sequentiality of batch data and, thus, ensures a reasonable phase partition result. Besides, because only phase-sensitive variables are used for phase partition, the phase partition accuracy is improved. An online phase identification method is proposed to identify the corresponding operation phase of a new sample according to a phase identification combination index (PICI). PICI quantifies the correlation of a new sample with each operation phase by calculating distance and time difference between the sample and the phase center. The PARAFAC2 and unfolded principal component analysis (uPCA) methods are applied to build monitoring models from the uneven-length batch data in each phase. T2 and SPE statistics are constructed for fault detection. The contribution plot of T2 statistic is used for fault diagnosis. The effectiveness and advantages of proposed methods are illustrated by the case study in a fed-batch penicillin fermentation process.
1. INTRODUCTION Batch processes are a very popular production mode in pharmaceutical, biochemistry, and food industries.1,2 To ensure a safe and profitable operation, lots of batch process monitoring methods have been proposed based on process mechanism models, process knowledge, or process data.3−5 Among them, data-driven monitoring methods, also known as multivariate statistical process monitoring (MSPM) methods, attract more and more research attention, thanks to the quick development of data acquisition and analysis techniques. MSPM methods build monitoring models from historical batch data using statistical modeling techniques, such as multiway principal component analysis (MPCA),1 multiway locality preserving projections (MLPP),6 multiway Gaussian mixture model (MGMM), 7 tensor global-local preserving projections (TGLPP),8 and so on. In comparison with those monitoring methods based on mechanism models and prior knowledge, MSPM methods are much easier to implement because of their data-driven nature. Process behaviors of a batch process may change along the operation time due to transitions of processing units or operation modes. As a result, many batch processes present the characteristics of multiple operation phases/stages. Variable correlations may significantly differ between phases. Such multiphase characteristics make conventional single-phase-based monitoring methods, which treat the entire batch as a single operation phase, ill-suited and lead to a poor monitoring performance. To overcome the drawbacks of single-phasebased monitoring methods, several phase partition methods and multiphase monitoring strategies have been proposed for multiphase batch processes.7,9−13 For instance, Yu and Qin7 applied the Gaussian mixture model (GMM) to identify © 2015 American Chemical Society
multiple operation phases and build monitoring models for a multiphase batch process. Camacho and Pico10 proposed a multiphase (MP) algorithm for modeling and monitoring a multiphase batch process. It divides a complete PCA model into several submodels along the operation time to improve the prediction accuracy and each submodel is considered as an independent operation phase. However, this method is very time-consuming, because it needs to search over all sampling instants even more than once to find appropriate division points. Lu et al.11 proposed a stage-based sub-PCA method to deal with the multiple operation stages in a batch process. It divides the entire batch into different stages by clustering loading matrices of time-slice PCA models using the K-means (KM) algorithm and then models each stage separately. However, the above methods have a common drawback that they are not applicable for a batch process with uneven durations. Moreover, their phase partition results are sometimes difficult to be interpreted physically, because identified phase boundaries may not exactly correspond to transition points of real operation phases. A possible reason is that they only focus on changes in underlying variable correlations probably caused by phase transitions, while ignore obvious and quick variations in trajectories of some key process variables that are sensitive to phase transitions. In addition, phase partition methods based on sub-PCA and GMM cannot handle the sequentiality of batch data directly, which may generate operation phases with discontinuous operation time, leading to Received: Revised: Accepted: Published: 2035
October 26, 2015 December 8, 2015 December 10, 2015 December 10, 2015 DOI: 10.1021/acs.iecr.5b03993 Ind. Eng. Chem. Res. 2016, 55, 2035−2048
Article
Industrial & Engineering Chemistry Research
uneven-length data using the tensor decomposition technique. Two conventional monitoring statistics, namely T2 and SPE, are constructed for fault detection. The contribution plot of T2 statistic is used for fault diagnosis. A case study is carried out in a benchmark fed-batch penicillin fermentation process to illustrate the effectiveness and advantages of proposed methods.
unreasonable and inexplicable phase partition results. These drawbacks degrade the monitoring performance. Most batch process monitoring methods assume that all batches have the same duration for easily modeling. However, the duration of a real batch process is difficult to be fixed because of unavoidable disturbances, fluctuations in initial conditions, as well as changes of operation conditions.13,14 Moreover, multiphase characteristics of batch processes could aggravate the uneven duration problem, because the duration of the same operation phase also varies with batches. Several batch trajectory synchronization techniques can be used for data alignment,13 such as cutting all batch data to the minimum length,15 replacing the time index with a surrogate indicator variable,16 as well as using dynamic time warping (DTW)17 or correlation optimization warping (COW).18 However, these methods are not applicable for multiphase batch processes as they do not take multiphase characteristics into consideration. Besides, fault patterns may be distorted after trajectory synchronization in that original data structures are altered, which reduces the fault detection and diagnosis efficiency. Research work has been carried out to simultaneously deal with multiphase characteristics and uneven duration problem of batch processes. For instance, Lu et al.19 extended their stagebased sub-PCA method for monitoring uneven-length multiphase batch processes. Zhao et al.20 developed a process trend based statistical modeling method to identify different operation phases and track inner-phase evolution as well as handle the uneven-length problem. Ge and Song21 developed a monitoring method for uneven-length multiphase batch processes by combining the data unfolding technique with the support vector data description (SVDD) method. A common feature of above methods is that they need to unfold the uneven-length batch data along the variable direction for modeling. However, because the data unfolding operation destroys the inherent three-way data structure, the resulting monitoring models fail to reveal three-way interactions in batch data set, which may decrease the monitoring efficiency.22 In this paper, integrated phase partition, online phase identification, and phase-based monitoring methods are proposed for multiphase batch processes with uneven durations. At first, a phase partition method is proposed based on the warped K-means (WKM) clustering algorithm. This WKM-based phase partition method divides the entire batch into several phases by clustering the trajectory data of phase-sensitive variables. It has three outstanding advantages. First, the WKM algorithm can deal with the sequentiality of batch data directly, which ensures a reasonable phase partition result. Second, the phase partition accuracy is improved by only using phase-sensitive variables. Third, it is easy to choose an optimal phase number according to a partition performance combination index (PPCI). Then, to identify the corresponding phase of a new sample, an online phase identification method is proposed based on a phase identification combination index (PICI). This PICI calculates both space distance and time difference between the new sample and each phase center, and then combines them by a weight coefficient. The new sample is assigned into the phase with the smallest PICI. Finally, for comparison, PARAFAC2 and unfolded PCA (uPCA) methods are used to model the uneven-length batch data in each phase. The uPCA method rearranges the uneven-length data into a data matrix via variablewise data unfolding, and then performs ordinary PCA on the data matrix to build a monitoring model. PARAFAC2 directly builds a monitoring model from the
2. PRELIMINARIES 2.1. PARAFAC2 Model. PARAFAC223 is developed to model an uneven three-way data array with varying sizes in one dimension. Let the matrix Xi(Ki × J), i = 1, ..., I, be the ith slice of an uneven three-way data array X̲ = {X1, ..., XI}, where Ki varies with the index i. The PARAFAC2 model for X̲ is23,24 T X i = FD i i A + Ei
(1)
where Fi(Ki × R) is a score matrix, A(J × R) is a weight matrix, Di(R × R) is a diagonal matrix containing weights for the ith slice of X̲ , R is the number of latent variables, and Ei(Ki × J) is a residual matrix. Note that the score matrix Fi(Ki × R) is not invariable for all the slices Xi but varies with index i. Figure 1
Figure 1. Schematic diagram of PARAFAC2 model.
illustrates a PARAFAC2 model of an uneven three-way data array. However, because the score matrix is not constrained to be equal for all slices, the PARAFAC2 model is not unique without additional constraints.24,25 Thus, to improve the uniqueness, a constraint that the cross-product matrix FTi Fi is constant over i (i.e., FTmFm = FTn Fn = Φ for all m, n = 1, ..., I) is imposed on the PARAFAC2 model.23−25 This constraint is equivalent to requiring that Fi = HiF for some columnwise orthonormal matrices Hi(Ki × R) and an invariable matrix F(R × R), because FTi Fi = FTHTi HiF = FTF, i = 1, ..., I. Thus, with this constraint, the PARAFAC2 model can be rewritten as X i = HiFDi AT + Ei
(2)
A direct fitting algorithm has been developed for PARAFAC2.24 It finds Hi, F, Di, and A by minimizing I
σ(F , D1, ..., DI , H1, ..., HI , A) =
∑
X i − HiFDi AT
2
i=1
(3)
HTi Hi
subject to two constraints: = I and all the Di are diagonal matrices. Detailed descriptions of this fitting algorithm can be found in ref 24. 2.2. Warped K-Means Algorithm. Based on the conventional K-means (KM) clustering algorithm, Leiva and Vidal26 proposed a warped K-means (WKM) algorithm to cluster sequentially generated data, namely data that follow some sort of sequentiality. To cope with the sequential structure of data, WKM imposes a hard sequentiality constraint at the classification step of KM algorithm. Let X = {x1, ..., xn} be a data set 2036
DOI: 10.1021/acs.iecr.5b03993 Ind. Eng. Chem. Res. 2016, 55, 2035−2048
Article
Industrial & Engineering Chemistry Research
which is helpful for improving the accuracy of phase partition and modeling. Because the batch data set has a three-way data structure, containing three-way information on batches, variables and time, conventional normalization methods for twoway data are usually combined with a data unfolding procedure to normalize the three-way batch data set. Two common methods include variablewise normalization and batchwise normalization. Variablewise normalization performs the normalization procedure on a matrix that is unfolded from the three-way batch data set along the variable direction. It has two advantages. First, the uneven batch duration problem is spontaneously solved after data unfolding. Second, the trajectory information on all process variables is well retained after normalization. These two advantages make variablewise normalization be a preferred data pretreatment method for phase partition. However, variablewise normalization is not suitable for process monitoring in that it highlights the data variation along the variable direction rather than the batch direction, resulting in a limited monitoring efficiency. Batchwise normalization unfolds the three-way batch data set along the batch direction first and then normalizes the unfolded data matrix. After normalization, the average batch trajectory is removed from all batches, and thus, the batch-to-batch variation is highlighted. Because of this advantage, batchwise normalization is a preferred pretreatment method for process monitoring. Note that batchwise data unfolding is not appropriate for online monitoring, because the resulting monitoring model requires data of the entire batch to compute monitoring statistics at every sampling time. However, only a part of batch data is available before the finish of a new batch, and thus the remaining data need to be predicted, which reduces the monitoring efficiency and accuracy. Moreover, batchwise normalization suffers from two problems when applied to the uneven-length batch data from a multiphase batch process. The first problem is caused by multiple operation phases. Batch data from different phases should be normalized separately, because of their different characteristics. However, boundaries between different operation phases are unknown before phase partition. The second problem is related to uneven durations. Except that batches have different total durations, the duration of the same operation phase also varies from batch to batch. Thus, data extracted from the same phase in different batches have varying data lengths. How should these uneven-length batch data be normalized? Similar to ref 19, combining advantages of variablewise and batchwise normalization, a two-step normalization scheme is adopted, as shown in Figure 2. Denote the historical data set of an uneven-length multiphase batch process as X̲ = {X1, ..., XI} with Xi(Ki × J) (i = 1, ..., I) being the data of ith batch, where I and J are numbers of batches and variables respectively, and Ki is the number of samples in ith batch. First, X̲ is unfolded into X(N × J) with N = ∑Ii=1Ki, and the variablewise normalization is used to pretreat X for phase partition, as shown in Figure 2a. After phase boundaries of every batch are available, batch data in the same phase are extracted and normalized using batchwise normalization for building monitoring models. The batchwise normalization procedures are illustrated in Figure 2b. To solve the uneven-length problem, phase data from different batches are expanded to have the same length by duplicating the last sample. Suppose the batch data set in lth phase is {Xi,l(J × Ki,l)}Ii=1, where Ki,l denotes the number of samples of ith batch in lth phase. Each data matrix Xi,l(J × Ki,l) is expanded to
that consists of n sequential samples x i ∈ 9 d , i = 1, ..., n. WKM seeks to optimally divide X into a set of L (1 < L ≪ n) sequential clusters {C1, ..., CL} by minimizing the sum of quadratic error (SQE) criterion.26 Denote the jth cluster of X as
Cj = {x bj , ···, x bj + nj − 1}
(4)
where bj is the left boundary of cluster Cj and nj is the number of samples in cluster Cj. The SQE criterion that WKM tries to minimize is26 L
G=
L
bj + 1− 1
∑ Gj =
∑(
∑
j=1
j=1
i = bj
1 μj = nj
x i − μj
2
) (5)
bj + 1− 1
∑
xi
i = bj
(6)
where Gj represents the heterogeneity of cluster Cj, and μj is the cluster center. To find an optimal partition, WKM reallocates samples from its current cluster to a potentially better one if and only if that reallocation decreases G. The variation in G caused by moving a sample x from cluster j to cluster i is calculated by26 nj ni ΔG(x , j , i) = x − μi 2 − x − μj 2 ni + 1 nj − 1 (7)
If ΔG(x, j, i) is negative, the sample x is reallocated. Meanwhile, new cluster centers μ̂j and μ̂i as well as Ĝ are computed by26 μĵ = μj − μî = μi +
x − μj nj − 1 x − μi ni + 1
Ĝ = G + ΔG(x , j , i)
(8)
In order to preserve the sequentiality of data, WKM imposes a hard sequentiality constraint at each reallocation step. The first half of samples in cluster j are only allowed to move to the previous cluster j − 1, and the last half of samples are only allowed to move to the next cluster j + 1. Owing to this constraint, along with the sequentiality of samples within each cluster, only those samples close to cluster boundaries are reallocated.26 Finally, an optimal partition is obtained by iterating the reallocation process over the data set X until no sample transfer is performed. It is worth pointing out that the cluster number L should be explicitly specified as an input parameter of WKM. Moreover, a suitable initial partition, namely the initialization of cluster boundaries, is required to start the WKM algorithm. The boundary initialization can be achieved using the trace segmentation (TS) method.26 The WKM algorithm has been described in detail by Leiva and Vidal.26
3. METHODOLOGY 3.1. Normalization of Uneven-Length Batch Data. Data normalization is an important step before phase dividing and modeling. The purpose of data normalization is to highlight data variations and eliminate effects of variable units, 2037
DOI: 10.1021/acs.iecr.5b03993 Ind. Eng. Chem. Res. 2016, 55, 2035−2048
Article
Industrial & Engineering Chemistry Research
Figure 2. Two-step normalization scheme for the uneven-length batch data set: (a) variablewise normalization, (b) batchwise normalization.
Xi,l(J × Kl*) by duplicating the last sample Kl* − Ki,l times. Kl* is the alignment length of lth phase, which can be set as the largest Ki,l or even a larger integer. The value of K*l should be large enough in case that the new batch has a longer data length than any of training batches. A large Kl* dose not change the data normalization result but slightly increases the computation load. After data alignment, the resulting data set {Xi,l(J × K*l )}Ii=1 is unfolded into a matrix Xl(I × JK*l ) and gets normalized. The normalized data matrix X̃ l(I × JKl*) is reorganized into the form of a three-way data array {X̃ i,l(J × Kl*)}Ii=1. Then, the phase data of each batch are truncated to their original length by removing the duplicated data at the tail. Finally, a normalized data set {X̃ i,l(J × Kil)}Ii=1 is obtained, which is used for modeling. 3.2. Phase Partition and Online Phase Identification. In a multiphase batch process, each operation phase has unique process characteristics that differ from other phases. Therefore, a multiphase batch process should be divided into different operation phases along the operation time for better understanding, analyzing and modeling the process. The task of datadriven phase partition is to identify potential phases from batch data. Compared with an even-length batch process, performing phase partition in an uneven-length batch process is more difficult, not only for historical batches but also for the new batch. In an even-length batch process, the same operation phase in different batches is often assumed to have equal durations, as shown in Figure 3a. Based on this assumption, each time-slice matrix in the historical batch data set can be considered as a unified unit as it contains batch data in the same operation phase. Thus, it is easy to find common phase division points of all batches by classifying all time-slice matrices into different clusters according to some kind of similarities among them. However, in an uneven-length batch process, the duration of the same phase varies with batches, regardless of whether total durations of batches are equal or not, as shown in Figure 3b. In such cases, there exist two types of irregular timeslice matrices. The first type is those time-slice matrices in time
Figure 3. Illustration of operation phases in a multiphase batch process with (a) even and (b) uneven durations.
span B that consist of data from different batches in different operation phases. The second type is those time-slice matrices at the tails of batches in time span C that have varying sizes. These irregular time-slice matrices should be treated carefully and differently from the others to guarantee the high phase partition accuracy for every batch. Moreover, exact positions of the first type irregular time-slice matrices are unknown before phase partition. All these bring great difficulties for the phase partition. On the other hand, the current operation phase of a new batch should be online identified. In an even-length batch process, it is easy to determine the current operation phase according to the operation time. However, in an uneven-length batch process, accurately identifying the current operation phase is very difficult due to the uneven duration problem. So far, it still lacks of efficient phase partition and online phase identification methods for uneven-length multiphase batch processes. Because trajectories of process variables directly reflect changes in process operating status, they can be used for 2038
DOI: 10.1021/acs.iecr.5b03993 Ind. Eng. Chem. Res. 2016, 55, 2035−2048
Article
Industrial & Engineering Chemistry Research phase partition. Particularly, in most cases, transition points in trajectories of some process variables are well consistent with those between real operation phases. This not only ensures reasonable phase partition results but also facilitates the physical interpretation of identified phases. Note that not all process variables are sensitive to phase changes, because some of them are controlled to be stable or show irregular fluctuations. Thus, to further improve the accuracy of phase partition, only those phase-sensitive process variables, whose trajectories exhibit obvious trend variations during the batch run, are concerned in phase partition. Moreover, because durations of the same phase vary with batches, phase partition should be carried out for each batch separately. In addition, choosing the total number of phases is an important step of phase partition. A large number increases the partition and modeling complexity, while a small number fails to capture varying process characteristics. However, it is difficult to determine a proper phase number in advance. On the basis of above discussions, a WKM-based phase partition method is developed in this paper. Let X s = {X̅ i(Ki × Js)}Ii=1 denote the batch data set containing phase-sensitive variables, where Js is the number of phase-sensitive variables. At first, the raw data set X s is normalized to X ŝ = {X̂ i(Ki × Js)}Ii=1 using the variablewise normalization. Then, the WKM clustering algorithm is performed on each matrix X̂ i(Ki × Js) to divide each batch into different operation phases. This phase partition process is carried out using a set of phase numbers from 1 to L*, and then an optimal phase number is selected according to both partition performance as well as partition and modeling complexity. The final SQE value G in eq 5 can be used to evaluate the partition performance. A smaller G indicates a better partition performance. The number of phases is a direct measurement of the partition and modeling complexity. Thus, a partition performance combination index (PPCI) is defined as PPCIL = γGŝ , L + (1 − γ )L̂
with tk̃ , l − min(tk̃ , l) , max(tk̃ , l) − min(tk̃ , l)
tk̂ , l =
dk̂ , l =
dk̃ , l − min(dk̃ , l) max(dk̃ , l) − min(dk̃ , l)
I
tk̃ , l =
I
∑ ti ,k ,l ,
dk̃ , l =
i=1
ti , k , l =
(13)
∑ di ,k ,l
(14)
i=1
|k − (bi , l + 1 + bi , l)/2| bi , l + 1 − bi , l
,
di , k , l = x̅k − μi , l (15)
where k = 1, ..., K is the index of sampling time, l = 1, ..., L is the phase index, i = 1, ..., I is the batch index, xk̅ (1 × Js) denotes a subsample containing phase-sensitive variables of sample xk, di,k,l is the distance of xk from the center μi,l(1 × Js) of lth phase of ith batch, ti,k,l represents the scaled time difference between sample xk and the time center of lth phase of ith batch, tk,l̃ and d̃k,l are summations of time differences and distances over all batches, tk,l̂ and d̂k,l are normalized time difference and distance, and β ∈ [0, 1] is a weight coefficient to adjust the trade-off between time difference and distance. According to PICI, the sample xk is allocated to the phase with the smallest φk,l. Figure 4 illustrates phase partition and online phase identification procedures.
(9)
with Gŝ , L =
Gs̃ , L − mean(G̑ s) , std(G̑ s)
L − mean(L̃) L̂ = std(L̃)
(10)
I
Gs̃ , L = log(Gs , L),
Gs , L =
∑ Gi ,L i=1
(11)
where L = 1, ..., L* is the phase number, i = 1, ..., I is the batch index, Ĝ s,L and L̂ are normalized values of G̃ s,L and L, mean(•) and std(•) denote mean value and standard variance of a vector, G̑ s = [G̃ s,1, ..., G̃ s,L*] and L̃ = [1, ..., L*], log(•) denotes the natural logarithm, Gs,L is the summation of final SQE (SFSQE) values of all batches for the phase number L, and γ ∈ (0, 1) is a weight coefficient to adjust the trade-off between partition performance and complexity. The optimal phase number corresponds to the smallest PPCI. In online applications, each sample of a new batch needs to be assigned into a corresponding phase. An online phase identification method is thus developed. The distance and time difference between a sample and each phase center reflect the correlation of a sample with a phase. By combining them together, a phase identification combination index (PICI) of a sample xk(1 × J) for lth phase is defined as φk , l = βdk̂ , l + (1 − β)tk̂ , l
Figure 4. Illustration of phase partition and online phase identification procedures.
Table 1 compares the WKM-based method with several existing phase partition methods. Most existing phase partition methods divide a batch into different phases by detecting changes in variable correlations, belonging to correlation-based methods, such as MP,10 sub-PCA11 and GMM.7 The proposed WKM-based method carries out phase partition according to changes in variable trajectories, termed as a trajectory-based method. Correlation-based methods need to build some models to reveal underlying variable correlations before phase
(12) 2039
DOI: 10.1021/acs.iecr.5b03993 Ind. Eng. Chem. Res. 2016, 55, 2035−2048
Article
Industrial & Engineering Chemistry Research Table 1. Comparison of Data-Driven Phase Partition Methodsa method I
II a
MP10 GMM7 Sub-PCA11 revised sub-PCA19 in this work
require modeling method?
cope with uneven-length problem?
can detect real phase division points?
yes yes yes yes
no no no yes
may may may may
no
yes
usually yes
how to handle the sequentiality of samples?
universality
computation complexity
other comments
directly postprocessing postprocessing postprocessing
low low low low
very high high high high
hard to determine the phase number generate soft partition results hard to determine those user defined parameters and the phase number
directly
high
low
not not not not
I: correlation-based methods. II: trajectory-based methods.
partition. As a result, their performance totally depends on the modeling method. Phase partition results for the same batch process may be different when using different modeling methods. One modeling method may be appropriate for some batch processes, while not for the others. The user defined tuning parameters in some methods are also hard to choose. These three problems significantly restrict the universality of correlation-based methods. Moreover, the modeling procedure increases the computation complexity of correlation-based methods. Obviously, the WKM-based method has better universality and higher computation efficiency as well as is much easier to implement, as no modeling method and no user defined tuning parameters are required for phase partition. Besides, its phase partition results can be easily combined with many modeling methods for process monitoring. On the other hand, the variable correlations revealed by correlation-based methods can not accurately reflect the real process mechanism. Moreover, in many cases, the changes in variable correlations lag behind the changes in operation modes, because of the time delay caused by energy and material transformations. These two problems may reduce the phase partition accuracy. This may be a reason why correlation-based methods sometimes cannot accurately detect real phase division points.13 Because phase-sensitive variables, especially those indicator variables for switching operation modes, can timely indicate changes in operation modes, the WKM-based method has higher phase partition accuracy. It may be difficult to choose phase-sensitive variables in some processes. In this case, data of all process variables can be used for phase partition, while the partition accuracy may decrease slightly. Most correlation-based methods, such as subPCA and GMM, cannot directly cope with the sequentiality of process data. Because of this drawback, their raw phase partition results may contain some outlier time instants that destroy the continuity of operation phase.7 Thus, to ensure consecutive operation phases, some postprocessing techniques are required to assign the outlier time instants into appropriate phases. This postprocessing procedure apparently degrades the phase partition accuracy and efficiency. On the contrary, the WKM-based method can directly handle the sequentiality of process data, guaranteeing higher partition efficiency and accuracy. 3.3. Online Phase-Based Process Monitoring. After phase partition, boundaries of all phases in each batch are available. Data belonging to the same phase are then extracted from all batches to form phase data sets. Suppose data in ith batch are divided into L phases as Xi(Ki × J) = {Xi,l(Ki,l × J)}Ll=1. The data set in lth phase can thus be denoted as X̲ l = {Xi,l(Ki,l × I . The batchwise normalization method described in J)}i=1 ∼ section 3.1 is used to normalize X̲ l to X l = {X̃ i,l(Ki,l × J)}Ii=1. For comparison, two different monitoring models are built
Figure 5. Illustration of offline modeling and online monitoring procedures.
Table 2. Process Variables in the Fed-Batch Penicillin Fermentation Process no.
process variables
unit
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13
aeration rate agitator power input substrate feeding rate substrate feeding temperature substrate (glucose) concentration DO saturation biomass concentration penicillin concentration culture volume CO2 concentration pH bioreactor temperature generated heat
L/h W L/h K g/L % g/L g/L L mmol/L K kcal/h
∼ from X l, including unfolded PCA (uPCA) model and PARAFAC2 model. The uPCA performs the ordinary PCA on a data matrix unfolded from the three-way batch data set. Different from uPCA, which models the batch data set via data unfolding, PARAFAC2 well preserves the inherent three-way 2040
DOI: 10.1021/acs.iecr.5b03993 Ind. Eng. Chem. Res. 2016, 55, 2035−2048
Article
Industrial & Engineering Chemistry Research Table 3. Fault Batches in the Fed-Batch Penicillin Fermentation Process no.
fault variable
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
v1 v2 v1 v2 v3 v1 v2 v3 v1 v2 v3 v1 v2 v3 v1 v2 v3 v1 v2 v3 v1 v2 v3 v1 v2 v3
magnitude
fault type
start time (h)
end time (h)
batch duration (h)
−2 L/h +5% +5% −5% −5% +3% +5 W +0.005 L/h +2 L/h +3% −10% −5% −3% +5% +2% −5 W −5% −3 L/h +4% +0.02 L/h −2 L/h +5% −10% −3% +10 W +10%
ramp step step step step step ramp ramp ramp step step step step step step ramp step ramp step ramp ramp step step step ramp step
20 20 20 20 45 20 20 45 38 38 45 38 38 45 60 60 60 60 60 60 100 100 100 150 150 150
40 40 100 100 100 150 150 150 100 100 100 150 150 150 150 150 150 300 300 300 200 200 200 250 250 250
360 380 355 363 397 360 371 382 377 360 376 351 355 392 360 365 382 373 362 393 356 384 397 364 378 398
representation of batch data set by using the tensor decomposition technique. As a result, PARAFAC2 not only avoids potential disadvantages caused by data unfolding, but also reveals three-way interactions in the data set. Therefore, PARAFAC2 is expected to have better monitoring performance than uPCA. To build the phase-based uPCA model, the normalized phase ∼ data set X l = {X̃ i,l(Ki,l × J)}Ii=1 is unfolded along the variable direction into a data matrix X̆ l(Nl × J) with Nl = ∑Ii=1Ki,l. Then, performing PCA on X̆ l(Nl × J) gives the following uPCA model for lth phase Rl
X̆ l =
∑
Figure 6. Effects of phase numbers on the phase partition performance: (a) SFSQE, (b) PPCI.
the kth sample xi,l,k(1 × J) of ith training batch in lth phase are given by Ti2, l , k = (ti , l , k − tl̅ )S−l 1(ti , l , k − tl̅ )T J
SPEi , l , k = ei , l , keiT, l , k =
∑ ei2,l ,k , j (19)
j=1
tr , lprT, l
where ti,l,k(1 × Rl) and ei,l,k(1 × J) are row vectors of matrices Tl(Nl × Rl) and El(Nl × J) in eq 16, tl̅ = ∑i,kti,l,k/Nl is the mean of all row vectors in Tl(Nl × Rl), Sl(Rl × Rl) is the covariance matrix of T̆ l(Nl × Rl), i.e., Sl = T̆ Tl T̆ l/(Nl − 1), with T̆ l being the mean centered Tl. Similarly, T2 and SPE statistics for PARAFAC2 models are also computed via eqs 18 and 19 by ̃ and eĩ ,l,k of matrices replacing ti,l,k and ei,l,k with row vectors ti,l,k T̃ i,l(Ki,l × Rl) and Ẽ i,l(Ki,l × J) defined in eq 17, respectively. The kernel density estimation (KDE) method27 is used to compute control limits of T2 and SPE statistics. The distribution of T2 statistics of samples in lth phase in all training batches is estimated by27
T
+ El = TP l l + El
r=1
(16)
where Rl denotes the number of principal components (PCs), tr,l(Nl × 1) and pr,l(J × 1) are score and loading vectors that constitute a score matrix Tl(Nl × Rl) and a loading matrix Pl(J × Rl) respectively, El(Nl × J) is a residual matrix. On the ∼ other hand, the PARAFAC2 model of X l = {X̃ i,l(Ki,l × J)}Ii=1 is built as X̃ i , l = Fi , l Di , l A lT + Eĩ , l = Tĩ , l A lT + Eĩ , l
(18)
(17)
where i = 1, ..., I and l = 1, ..., L denote indexes of batches and phases respectively, Fi,l(Ki,l × Rl) is a score matrix, Di,l(Rl × Rl) is a diagonal weight matrix, T̃ i,l(Ki,l × Rl) = Fi,lDi,l is a combined score matrix, Al(J × Rl) is a weight matrix, Ẽ i,l(Ki,l × J) is a residual matrix, and Rl is the number of latent variables. The conventional T2 and SPE statistics are constructed for fault detection. For the uPCA model, T2 and SPE statistics of
f ̂ (Tl2) =
T2i,l,k
1 Nlθ
⎛ T2 − T2 ⎞ l i,l ,k ⎟⎟ θ ⎝ ⎠
∑ Γ⎜⎜ i,k
(20)
∑Ii=1Ki,l
where is defined in eq 18, Nl = is the total number of samples in lth phase in all training batches, θ is a smoothing 2041
DOI: 10.1021/acs.iecr.5b03993 Ind. Eng. Chem. Res. 2016, 55, 2035−2048
Article
Industrial & Engineering Chemistry Research parameter, and Γ(•) is a kernel function. The control limit of T2 statistic at significant level α (normally α = 0.01 or 0.05) can be easily computed by the inverse cumulative distribution ̂ 2l ). The control limit of SPE statistic is obtained function of f(T in the similar way as T2 statistic. In online applications, after a sample xnew(1 × J) of a new batch is collected and assigned into a phase, it can be projected onto the corresponding phase-based uPCA model by
Table 4. Ranges of Boundaries between Five Phases in 80 Reference Batchesa
a
boundary
P(1)/P(2)
P(2)/P(3)
P(3)/P(4)
P(4)/P(5)
range
30−33
43−46
76−92
162−188
P(i)/P(j) denotes the boundary between phases i and j.
t new, l = x newPl x̂ new = t new, lPlT enew, l = x new − x̂ new
(21)
or onto the phase-based PARAFAC2 model by ̃ l = x newA l(A lTA l)−1 t new, ̃ lA lT x̂ new = t new, enew, ̃ l = x new − x̂ new
(22)
̃ (1 × Rl) are score vectors of xnew, where tnew,l(1 × Rl) and tnew,l and enew,l(1 × J) and eñ ew,l(1 × J) are residual vectors. Then, T2 and SPE statistics of the new sample are computed by substituting its score and residual vectors into eqs 18 and 19, respectively. A fault is detected if either the T2 or SPE statistic exceeds the control limit. After detected a fault, the contribution plot of T2 statistic is used for fault diagnosis. For the uPCA model, the T2 statistic of a new sample xnew(1 × J) can be expressed as 2 T −1 Tnew, l = (t new, l − tl̅ )Sl (t new, l − tl̅ )
= t new, lS−l 1t Tnew, l − 2 tl̅ S−l 1t Tnew, l + tl̅ S−l 1 t ̅ lT = t new, lS−l 1(x newPl )T − 2 tl̅ S−l 1(x newPl )T + tl̅ S−l 1 t ̅ lT = [t new, l − 2 tl̅ ]S−l 1(∑ x new, jpj , l)T + tl̅ Sl−1 t ̅ lT j
=
∑ x new,j[t new,l − 2 tl̅ ]S−l 1pTj ,l +
tl̅ S−l 1 t ̅ lT
j
(23)
Figure 7. Phase partition results of a typical reference batch: (a) variable trajectories with phase boundaries, (b) phase number vs sampling time.
where Pl(J × Rl) is the loading matrix defined in eq 16 and T pj,l(1 × Rl) is the jth row vector of Pl. Because tl̅ S−1 l tl̅ is a constant for a given training data set, the contribution of jth variable xnew,j to the T2 statistic is defined as c jT = x new, j[t new, l − 2 tl̅ ]S−l 1p Tj , l
to validate the effectiveness of proposed methods. It is a typical multiphase batch process with two distinct operation modes: a preculture mode and a fed-batch mode.28 At first, the system runs in the preculture operation mode to promote biomass growth. When the glucose concentration in initial culture substrate reaches a threshold value due to the consumption of growing cells, the system quickly switches to the fed-batch operation mode, and meantime new substrate is fed into the fermentor from outside. The fed-batch mode continues to the end of each batch, during which the penicillin is produced. A simulator (PenSim v2.0) developed by Birol et al.28 is employed to simulate the penicillin fermentation process. A training data set containing 80 reference batches is generated under normal conditions with small random variations in initial settings. The durations of all reference batches vary between 350 and 400 h, comprising a preculture phase of about 45 h and a fed-batch phase longer than 305 h. Thirteen process variables
(24)
Similarly, for the PARAFAC2 model, the contribution to the T2 statistic is defined as ̃ l − 2tl̃ ]S̃−l 1[a j , l(A lTA l)−1]T c j̃T = x new, j[t new,
(25)
where Al(J × Rl) is the loading matrix defined in eq 17, and aj,l(1 × Rl) is the jth row vector of Al. The process variable with the largest contribution is diagnosed as the fault variable. Figure 5 illustrates offline modeling and online monitoring procedures.
4. CASE STUDY 4.1. Fed-Batch Penicillin Fermentation Process. A benchmark fed-batch penicillin fermentation process is used 2042
DOI: 10.1021/acs.iecr.5b03993 Ind. Eng. Chem. Res. 2016, 55, 2035−2048
Article
Industrial & Engineering Chemistry Research Table 5. Physical Interpretation of Five Operation Phases operation mode
operation phase
preculture mode
phase 1 phase 2
fed-batch mode
phase 3 phase 4 phase 5
physical interpretation early biomass growth phase exponential biomass growth phase early exponential penicillin production phase later exponential penicillin production phase the end phase of fermentation
process feature cells begin to grow by consuming glucose and oxygen the greatest part of the biomass is formed; glucose and oxygen concentrations decrease to a very low level; very little penicillin production quick production of penicillin; a small increase in biomass; oxygen concentration comes back to its initial level; glucose concentration is stable penicillin production rate starts to decrease; a small increase in biomass; oxygen and glucose concentration are stable penicillin production gradually tends to stop due to cell senescence and autolysis
Figure 8. Phase partition results of the revised sub-PCA method: L = (a) 2; (b) 3; (c) 4; (d) 5.
used to quantify the phase partition performance, are plotted in Figure 6a. With increasing the phase number, SFSQE decreases sharply at first, and then tends to be stable after the phase number exceeds 5. Choosing γ = 0.6 in eq 9, the PPCI of different phase numbers are calculated, as shown in Figure 6b. Obviously, the smallest PPCI corresponds to the phase number of 5. Thus, the optimal phase number is chosen to be 5. Ranges of boundaries between 5 phases in 80 reference batches are listed in Table 4. Figure 7 shows the phase partition results of a typical reference batch. The normalized trajectories of biomass concentration (v7) and generated heat (v13) are almost overlapped in Figure 7a. Note that the boundary between phases 2 and 3 at around 45 h is exactly the switch point from preculture operation mode to fed-batch operation mode. Thus, phase 1 and phase 2 can be viewed as two subphases of the preculture phase. Similarly, phases 3−5 are subphases of the
are measured every 1 h during the batch run, as shown in Table 2. Moreover, 26 fault batches are generated with varying fault variables, magnitudes as well as start and end time. Detailed information about fault batches is summarized in Table 3. 4.2. Phase Partition Results. By analyzing historic trajectories of all 13 process variables, 5 process variables (v5, v6, v7, v8, and v13) are chosen as phase-sensitive variables, because their trajectories show obvious multiphase characteristics. Trajectory data of five phase-sensitive variables in 80 reference batches constitute a data subset and then are pretreated with variablewise normalization for phase partition. Performing the WKM-based phase partition method on the normalized data subset, each batch is divided into several operation phases. To select an optimal phase number, a set of phase numbers ranged from 1 to 10 are tested. The SFSQE values corresponding to different phase numbers, which are 2043
DOI: 10.1021/acs.iecr.5b03993 Ind. Eng. Chem. Res. 2016, 55, 2035−2048
Article
Industrial & Engineering Chemistry Research
fed-batch phase. The possible physical interpretations of 5 phases are given in Table 5. Process features in different phases can be observed from trajectories of five phase-sensitive variables in Figure 7. As a result, the WKM-based phase partition method successfully identifies not only two real operation phases, but also potential subphases in each real operation phase. To illustrate the advantages of WKM-based method, the revised sub-PCA19 phase partition method is also applied to the fed-batch penicillin fermentation process for comparison. According to procedures of the revised sub-PCA method,19 a group of time-slice matrices X̆ k(80 × 13) (k = 1, ..., K* with K* being the shortest batch length) is extracted from the variablewise normalized data set of 80 reference batches. Then, PCA is carried out on each time-slice matrix X̆ k(80 × 13), generating K* loading matrices P̆ k(13 × 13) (k = 1, ..., K*). Each P̆ k is converted to a weighted loading matrix P̆ k = [P̆ 1,k·g1,k, ···, P̆ j,k, P̆ j,k·gj,k], where P̆ j,k is the jth column of P̆ k and gj,k = λj,k/∑Jj=1λj,k with λj,k being the jth eigenvalue of the covariance matrix of X̆ k. Finally, to identify each operation phase, the K-means clustering algorithm is used to divide all matrices P̆ k (k = 1, ..., K*) into L clusters. Figure 8 shows the partition results when the cluster number L is 2, 3, 4, or 5. Obviously, no matter which cluster number is used, the phase partition results of the revised sub-PCA method are very bad. The identified operation phases are discontinuous and heavily intertwined with each other. It is impossible to detect two real
Table 6. Mean Online Phase Identification Errors of 20 Test Batches phase
P(1)
P(2)
P(3)
P(4)
P(5)
mean ABIE mean PBIE
0 0
1 7%
0 0
1 1%
3 2%
Figure 9. Online phase identification results for two fault batches.
Table 7. Fault Detection Rates (%) of Four Monitoring Models with 80 Reference Batchesa SPuPCA
MPuPCA
SPPARAFAC2
MPPARAFAC2
fault no.
T2
SPE
T2
SPE
T2
SPE
T2
SPE
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 MFDR (%) MFAR (%) MFDD (h)
95 90 94 90 89 95 89 87 94 89 89 96 88 87 96 88 87 96 89 100 96 91 92 91 92 100 91.9 0 0.5
95 90 94 90 89 95 94 98 95 89 96 96 88 92 96 88 95 99 89 100 99 91 100 91 98 100 94.1 0 0.5
100 95 100 99 98 100 99 99 100 100 98 100 67 97 100 97 99 100 87 99 96 92 95 100 93 99 96.5 0 0.5
100 95 100 99 98 100 99 99 100 100 98 100 96 99 100 99 100 100 100 99 99 100 100 100 95 100 99.0 0 0.4
95 90 94 90 89 95 89 98 94 89 96 96 88 87 96 88 87 99 89 100 96 91 100 91 92 100 93.0 0 0.5
95 90 94 90 89 95 94 94 95 89 89 96 88 87 96 88 87 99 89 100 99 91 100 91 98 100 93.2 0 0.5
100 95 100 99 98 100 99 99 100 100 98 100 68 99 100 98 100 100 89 99 98 97 100 100 93 100 97.3 0 0.4
100 95 100 99 98 100 99 99 100 100 98 100 100 99 100 98 100 100 98 99 99 100 100 100 97 100 99.2 0 0.4
a
MFDR: mean fault detection rate. MFAR: mean false alarm rate. MFDD: mean fault detection delay. Bold values present better results of MPPARAFAC2 than MPuPCA. 2044
DOI: 10.1021/acs.iecr.5b03993 Ind. Eng. Chem. Res. 2016, 55, 2035−2048
Article
Industrial & Engineering Chemistry Research Table 8. Fault Detection Rates (%) of Four Monitoring Models with 20 Reference Batchesa SPuPCA
MPuPCA
SPPARAFAC2
MPPARAFAC2
fault no.
T2
SPE
T2
SPE
T2
SPE
T2
SPE
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 MFDR (%) MFAR (%) MFDD (h)
38 81 47 85 84 53 82 90 48 84 93 55 81 73 58 80 70 84 85 99 61 85 99 60 90 100 75.6 0 1
38 81 47 85 96 53 92 98 71 84 98 55 81 96 58 82 99 95 85 100 90 85 100 60 96 100 81.7 0 1
67 90 90 96 98 94 98 99 98 98 98 99 78 99 100 97 100 100 98 100 98 100 100 100 95 100 95.8 1.6 1
67 90 90 96 98 94 98 99 98 98 98 99 98 99 100 98 100 100 100 100 99 100 100 100 97 100 96.8 0 0.9
38 81 47 85 95 53 89 98 57 84 98 55 81 95 58 80 98 94 85 100 82 85 100 60 95 100 80.5 0 1
38 81 47 85 84 53 95 97 73 84 98 55 81 73 58 86 70 96 85 100 93 85 100 60 97 100 79.8 0 1
67 90 90 96 98 94 98 99 98 98 98 99 77 99 100 98 100 100 98 100 98 100 100 100 95 100 95.8 0 1
67 90 90 96 98 94 98 99 98 98 98 99 98 99 100 98 100 100 100 100 100 100 100 100 97 100 96.8 0 0.9
a
MFDR: mean fault detection rate. MFAR: mean false alarm rate. MFDD: mean fault detection delay. Bold values present better results of multiphase models than single-phase models.
identification error for boundaries of phases 1 and 3. For phases 2 and 4, the mean ABIE is around 1, which means that only one sample may be assigned into a wrong phase. The mean ABIE of phase 5 is slightly larger because the left boundary of phase 5 varies in a large range (see Table 4). Nevertheless, most of samples are correctly allocated into phase 5 because of the very small mean PBIE. Figure 9 illustrates online phase identification results for two fault batches. Although process faults occurred during the batch run, the online phase identification method still can assign all samples into reasonable phases. 4.4. Online Process Monitoring Results. After obtaining boundaries of 5 operation phases in 80 reference batches, multiphase uPCA (MPuPCA) and multiphase PARAFAC2 (MPPARAFAC2) monitoring models are built from phase data. Besides, single-phase uPCA (SPuPCA) and single-phase PARAFAC2 (SPPARAFAC2) monitoring models are built for comparison. In single-phase models, the entire batch is regarded as a single operation phase and the undivided batch data are used for modeling. Numbers of PCs in MPuPCA and SPuPCA models are selected via the cumulative percent variance (CPV) method, where retained PCs explain more than 85% of the total data variance. Numbers of latent variables in MPPARAFAC2 and SPPARAFAC2 models are chosen empirically. The 99% control limits of T2 and SPE statistics are used in online monitoring. As in ref 2, fault detection rate (FDR), false alarm rate (FAR), and fault detection delay (FDD) are used to quantify the monitoring performance.
operation phases based on partition results of the revised subPCA method. As discussed in section 3.2, the revised sub-PCA method has poor performance because of two drawbacks. This first one is that the time-slice weighted loading matrices P̆ k cannot accurately reflect process characteristics at each sampling time. The second one is that the K-means clustering algorithm cannot cope with the sequentiality of P̆ k. However, the WKM-based method avoids these drawbacks, and therefore its performance is significantly improved. 4.3. Online Phase Identification Results. To validate the effectiveness of the online phase identification method, 80 reference batches are randomly divided into 2 data subsets. A data subset of 60 batches serves as training data, which is used for phase partition to obtain centers and boundaries of 5 operation phases. The other data subset of 20 batches is used to test the online phase identification method. Meanwhile, reference boundaries of 5 phases in 20 test batches are also obtained by the WKM-based phase partition method. For each test batch, online identified phase boundaries are compared with its reference boundaries. Their differences are quantified with two indexes named absolute boundary identification error (ABIE) and percent boundary identification error (PBIE). ABIE and PBIE are defined as ABIEi,l = bi,l − bi,l* and PBIEi,l = (bi,l − b*i,l )/(b*i,l+1 − b*i,l ), where bi,l is the online identified left boundary of lth phase in ith test batch, and b*i,l is the corresponding reference boundary. The mean ABIE and mean PBIE of 20 test batches are listed in Table 6. There is no 2045
DOI: 10.1021/acs.iecr.5b03993 Ind. Eng. Chem. Res. 2016, 55, 2035−2048
Article
Industrial & Engineering Chemistry Research
reference batches. As shown in Table 8, FDRs of SPuPCA and SPPARAFAC2 are roughly a half of those of MPuPCA and MPPARAFAC2 for some faults related to aeration rate (v1), such as faults 1, 3, 6, 9, 12, 15, and 24. For some faults caused by agitator power input (v2), including faults 2, 4, 10, 16, 19, and 22, FDRs of MPuPCA and MPPARAFAC2 increase by more than 10% as compared to SPuPCA and SPPARAFAC2. As a result, mean FDRs of MPuPCA and MPPARAFAC2 are much higher than those of SPuPCA and SPPARAFAC2. Mean FDRs and mean FDDs of MPPARAFAC2 are equal to those of MPuPCA. The T2 statistic of MPuPCA gives a mean FAR of about 1.6%, while MPPARAFAC2 almost has no false alarms, i.e., FAR ≈ 0. Therefore, the performance of MPPARAFAC2 is slightly better than MPuPCA. Figures 10 and 11 compare monitoring charts of MPPARAFAC2 and SPPARAFAC2 for fault batch 6 in two
Figure 10. Monitoring charts for fault batch 6 using 80 reference batches: (a) MPPARAFAC2, (b) SPPARAFAC2.
Table 7 compares FDRs of four monitoring models for 26 fault batches in Table 3. Only the mean values of FAR and FDD are listed in Table 7 because they are equal to zero for most faults. Four models have almost no FAR and very small mean FDDs. For most of fault batches, two multiphase models provide better monitoring performance than single-phase models, owing to their higher FDRs in T2 and SPE statistics. Particularly, for fault batches 4, 10, 14, 16, 17, 22 and 24, both T2 and SPE statistics of multiphase models have much larger FDRs than those of single-phase models. As a result, mean FDRs of T2 and SPE statistics of MPuPCA increase by 4.6% and 4.9% as compared to SPuPCA. Similarly, in comparison with SPPARAFAC2, mean FDRs of T2 and SPE statistics of MPPARAFAC2 increase by 4.3% and 6.0%, respectively. Two PARAFAC2-based models have better monitoring abilities than two uPCA-based models, because of their slightly higher mean FDRs in T2 statistic. To further demonstrate advantages of multiphase models, the monitoring performance of above four models are investigated in the case of decreasing the batch number in the training data set from 80 to 20. Monitoring results are listed in Table 8. Comparing Table 8 with Table 7, it can be found that the monitoring performance of SPuPCA and SPPARAFAC2 significantly degrade after decreasing the number of reference batches. However, MPuPCA and MPPARAFAC2 are less sensitive to the decrease in reference batch number, and they maintain well monitoring abilities in the case of using fewer
Figure 11. Monitoring charts for fault batch 6 using 20 reference batches: (a) MPPARAFAC2, (b) SPPARAFAC2.
cases of using 80 and 20 reference batches, respectively. When the number of reference batches is 80, both MPPARAFAC2 and SPPARAFAC2 have well monitoring abilities, but MPPARAFAC2 outperforms SPPARAFAC2 in that its FDR is 100% (see Figure 10). After decreasing the number of reference batches to 20, SPPARAFAC2 only detects a half of fault samples, while MPPARAFAC2 still can capture most of fault samples (see Figure 11). Another important difference between monitoring charts of MPPARAFAC2 and SPPARAFAC2 lies in 2046
DOI: 10.1021/acs.iecr.5b03993 Ind. Eng. Chem. Res. 2016, 55, 2035−2048
Article
Industrial & Engineering Chemistry Research
Figure 12. Contribution plots of T2 statistic for fault batch 6: (a) MPPARAFAC2, (b) SPPARAFAC2. Figure 13. Contribution plots of T2 statistic for fault batch 20: (a) MPPARAFAC2, (b) SPPARAFAC2.
the control limits of T2 and SPE statistics. MPPARAFAC2 presents varying control limits for T2 and SPE statistics in different phases, while SPPARAFAC2 has the same control limits over the entire batch. In fact, SPPARAFAC2 can only reveal common characteristics of all operation phases because it simply builds a single model based on all batch data. Because it loses phase-specific characteristics, the SPPARAFAC2 model may perform poorly in some phases. Therefore, its monitoring ability is limited. By modeling each operation phase separately, MPPARAFAC2 is able to extract specific characteristics of each phase, and further builds more accurate monitoring models and gives reasonable control limits for different phases. Such advantages make MPPARAFAC2 have a very well monitoring performance. Contribution plots of MPPARAFAC2 and SPPARAFAC2 for fault batches 6 and 20 are shown in Figures 12 and 13, which are obtained in the case of using 20 reference batches. As shown in Figure 12a, the contribution plot of MPPARAFAC2 indicates that the variable 1, i.e., aeration rate, is the largest contributor in the time period between 20 and 150 h. Thus, aeration rate is identified as the fault variable for fault batch 6, which is exactly the preset fault variable in Table 3.
The contribution plot of SPPARAFAC2 shows that both variables 1 and 6 contribute a lot to the T2 statistic under the fault condition (see Figure 12b), and thus it is difficult to determine which one is the real fault variable. In Figure 13, contribution plots of MPPARAFAC2 and SPPARAFAC2 both correctly indicate that variable 3, i.e., substrate feeding rate, is the fault variable for fault batch 20. However, the contribution plot of SPPARAFAC2 suffers from disturbances from variable 6. Such disturbances cannot be observed in the contribution plot of MPPARAFAC2. These results prove that multiphasebased monitoring methods are more powerful than singlephase-based methods in fault diagnosis.
5. CONCLUSIONS New phase partition, online phase identification and phasebased monitoring methods are proposed for multiphase batch processes with uneven durations. The new phase partition method is developed based on the WKM clustering algorithm. It performs the WKM algorithm on the trajectory data of 2047
DOI: 10.1021/acs.iecr.5b03993 Ind. Eng. Chem. Res. 2016, 55, 2035−2048
Article
Industrial & Engineering Chemistry Research
(6) Hu, K. L.; Yuan, J. Q. Multivariate statistical process control based on multiway locality preserving projections. J. Process Control 2008, 18, 797−807. (7) Yu, J.; Qin, S. J. Multiway Gaussian mixture model based multiphase batch process monitoring. Ind. Eng. Chem. Res. 2009, 48, 8585−8594. (8) Luo, L. J.; Bao, S. Y.; Gao, Z. L.; Yuan, J. Q. Tensor global-local preserving projections for batch process monitoring. Ind. Eng. Chem. Res. 2014, 53, 10166−10176. (9) Ge, Z. Q.; Zhao, L. P.; Yao, Y.; Song, Z. H.; Gao, F. R. Utilizing transition information in online quality prediction of multiphase batch processes. J. Process Control 2012, 22, 599−611. (10) Camacho, J.; Pico, J. Online monitoring of batch processes using multi-phase principal component analysis. J. Process Control 2006, 16, 1021−1035. (11) Lu, N. Y.; Gao, F. R.; Wang, F. L. Sub-PCA modeling and online monitoring strategy for batch processes. AIChE J. 2004, 50, 255−259. (12) Zhao, C. H.; Wang, F. L.; Lu, N. Y.; Jia, M. X. Stage-based softtransition multiple PCA modeling and on-line monitoring strategy for batch processes. J. Process Control 2007, 17, 728−741. (13) Yao, Y.; Gao, F. R. A survey on multistage/multiphase statistical modeling methods for batch processes. Annu. Rev. Control 2009, 33, 172−183. (14) Luo, L. J.; Bao, S. Y.; Gao, Z. L.; Yuan, J. Q. Batch process monitoring with GTucker2 model. Ind. Eng. Chem. Res. 2014, 53, 15101−15110. (15) Rothwell, S. G.; Martin, E. B.; Morris, A. J. Comparison of methods for dealing with uneven length batches. In Proceedings of the 7th International Conference on Computer Application in Biotechnology (CAB7); Elsevier Science: New York, 1998; pp 387−392. (16) Nomikos, P.; MacGregor, J. F. Multivariate SPC charts for monitoring batch processes. Technometrics 1995, 37, 41−59. (17) Kassidas, A.; MacGregor, J. F.; Taylor, P. A. Synchronization of batch trajectories using dynamic time warping. AIChE J. 1998, 44, 864−875. (18) Fransson, M.; Folestad, S. Real-time alignment of batch process data using COW for on-line process monitoring. Chemom. Intell. Lab. Syst. 2006, 84, 56−61. (19) Lu, N. Y.; Gao, F. R.; Yang, Y.; Wang, F. L. PCA-based modeling and on-line monitoring strategy for uneven-length batch processes. Ind. Eng. Chem. Res. 2004, 43, 3343−3352. (20) Zhao, L. P.; Zhao, C. H.; Gao, F. R. Inner-phase analysis based statistical modeling and online monitoring for uneven multiphase batch processes. Ind. Eng. Chem. Res. 2013, 52, 4586−4596. (21) Ge, Z. Q.; Song, Z. H. Online monitoring and quality prediction of multiphase batch processes with uneven length problem. Ind. Eng. Chem. Res. 2014, 53, 800−811. (22) Luo, L. J.; Bao, S. Y.; Gao, Z. L. Quality prediction based on HOPLS-CP for batch processes. Chemom. Intell. Lab. Syst. 2015, 143, 28−39. (23) Harshman, R. A. PARAFAC2: Mathematical and technical notes. UCLA Working Papers in Phonetics; 1972; Vol. 22, pp 30−47. (24) Kiers, H. A. L.; Ten Berge, J. M. F.; Bro, R. PARAFAC2-Part I. A direct fitting algorithm for the PARAFAC2Model. J. Chemom. 1999, 13, 275−294. (25) Kolda, T.; Bader, B. Tensor decompositions and applications. SIAM Rev. 2009, 51, 455−500. (26) Leiva, L. A.; Vidal, E. Warped K-Means: An algorithm to cluster sequentially-distributed data. Inf. Sci. 2013, 237, 196−210. (27) Martin, E. B.; Morris, A. J. Non-parametric confidence bounds for process performance monitoring charts. J. Process Control 1996, 6, 349−358. (28) Birol, G.; Undey, C.; Cinar, A. A modular simulation package for fed-batch fermentation: penicillin production. Comput. Chem. Eng. 2002, 26, 1553−1565.
phase-sensitive variables to divide the entire batch into different operation phases. The WKM-based phase partition method has three main advantages. First, the sequential structure of batch data is taken into account, which ensures a reasonable phase partition result. Second, only phase-sensitive variables are used for phase partition, improving the phase partition accuracy. Third, it is easy to choose an optimal phase number according to a partition performance combination index (PPCI). The online phase identification method allocates a new sample into a proper phase according to a phase identification combination index (PICI), which is defined by combining space distance and time difference between the new sample and each phase center. Two multiphase monitoring methods are developed, termed as MPuPCA and MPPARAFAC2, respectively. Two monitoring statistics, i.e., T2 and SPE statistics, are constructed for fault detection. The contribution plot of T2 statistic is used for fault diagnosis. Proposed methods are tested in a fed-batch penicillin fermentation process. The WKM-based phase partition method successfully identifies two real operation phases as well as potential subphases in each real operation phase, and thus gives a reasonable and interpretable phase partition result. The online phase identification method can assign all samples into different phases with a high accuracy. Multiphase monitoring methods, i.e., MPuPCA and MPPARAFAC2, significantly outperform single-phase monitoring methods, i.e., SPuPCA and SPPARAFAC2, especially when the number of reference batches is small. The reason is that multiphase monitoring models can capture phase-specific characteristics of the batch process by dividing it into several operation phases and modeling each operation phase separately. MPPARAFAC2 has better monitoring performance than MPuPCA. The contribution plot of MPPARAFAC2 is able to accurately identify the fault variable.
■
AUTHOR INFORMATION
Corresponding Author
*Tel.: +86 (0571) 88320349. E-mail address:
[email protected] (L. Luo). Notes
The authors declare no competing financial interest.
■
ACKNOWLEDGMENTS This study was supported by the National Natural Science Foundation of China (No. 61304116) and the Zhejiang Provincial Natural Science Foundation of China (No. LQ13B060004).
■
REFERENCES
(1) Nomikos, P.; MacGregor, J. F. Monitoring batch processes using multiway principal component analysis. AIChE J. 1994, 40, 1361− 1375. (2) Luo, L. J.; Bao, S. Y.; Gao, Z. L.; Yuan, J. Q. Batch process monitoring with tensor global-local structure analysis. Ind. Eng. Chem. Res. 2013, 52, 18031−18042. (3) MacGregor, J. F.; Cinar, A. Monitoring, fault diagnosis, faulttolerant control and optimization: Data driven methods. Comput. Chem. Eng. 2012, 47, 111−120. (4) Venkatasubramanian, V.; Rengaswamy, R.; Kavuri, S. N.; Yin, K. W. A review of process fault detection and diagnosis Part III: Process history based methods. Comput. Chem. Eng. 2003, 27, 327−346. (5) Ge, Z. Q.; Song, Z. H.; Gao, F. R. Review of recent research on data-based process monitoring. Ind. Eng. Chem. Res. 2013, 52, 3543− 3562. 2048
DOI: 10.1021/acs.iecr.5b03993 Ind. Eng. Chem. Res. 2016, 55, 2035−2048