Article pubs.acs.org/IECR
Multiway Interval Partial Least Squares for Batch Process Performance Monitoring Shallon Stubbs,† Jie Zhang, and Julian Morris* Centre for Process Analytics and Control Technology, School of Chemical Engineering and Advanced Materials, Newcastle University, Newcastle upon Tyne NE1 7RU, U.K. S Supporting Information *
ABSTRACT: The method of interval Partial Least Squares (iPLS) is combined with multiway partial least-squares (MPLS) to allow the building of enhanced statistical process performance monitoring models. A novel algorithm is proposed for segmenting batch duration, or spectral data in applications employing spectroscopy data into several subintervals for which independent PLS models can be constructed. The approach deviates from the method of using subintervals of equal length and the practice of choosing only a subset of these subintervals for prediction and/or monitoring. The proposed approach provides dramatic reduction in the number of subintervals required and subsequently the number of PLS models required to give improved prediction and monitoring performance. The proposed method, MiPLS, is applied to the well-known benchmark fed-batch penicillin production simulator, Pensim, for quality variable prediction and fault detection. published papers in process analytics in two key ways. The first is that the subintervals are not equally spaced, and the interval slicing is optimally selected based upon an algorithm to minimize the overall root mean squared error (RMSE) of prediction. Second, all subintervals are employed as the aim of the model is to support the detection of fault conditions developing at any stage in the batch run which may have adverse impact on the target product quality. The proposed iPLS model is developed and applied to batch-wise unfolded data to provide a Multiway iPLS (MiPLS) monitoring model. The model’s fault detection performance and quality variable prediction capabilities are demonstrated using data generated from the well-known benchmark fed-batch penicillin fermentation simulator20 based on a detailed mechanistic model.21 It is shown that optimal splitting of the intervals can dramatically minimize the number of subintervals required, and subsequently the number of PLS models required, to represent the full duration of the batch process. The paper is structured as follows. Section 2 reviews several data unfolding methods that have been proposed in the literature and introduces the concept of batch duration interval splicing. The proposed method of combining the two concepts is explained in Section 2.2. Section 3 provides a descriptive overview of the fed-batch penicillin process simulator and describes the development of the MiPLS process monitoring and predictive models employed in the study. Section 4 presents the results and analysis obtained using the proposed process monitoring and prediction models. Section 5 presents some conclusions.
1. INTRODUCTION Multivariate statistical process control, or process performance monitoring, methods based on multiway principal component analysis (MPCA) and projection to latent structures (PLS) have proven their significant value in batch analysis and monitoring.1−6 These methods have been widely studied and extensively applied across many industries.7−14 They are essentially the only powerful approaches for the analysis and monitoring of batch processes when mechanistic models are not available. This work focuses on the MPCA approach proposed by Nomikos and MacGregor2 as a basis for the extension proposed using interval partial least-squares. Interval Partial Least Squares (iPLS) is a local modeling methodology quite widely used in the analysis and modeling of spectroscopic data and variable selection to optimize the predictive power of PLS regression models as well as to aid in data interpretation. iPLS has been applied successfully to the spectroscopic monitoring and prediction of quality variables, in particular using mid-infrared and near-infrared spectra. The technique has been reported in applications ranging from pharmaceutical medicinal explorations - measuring the content of flavone an active ingredient in a rare medicinal plant called snow lotus using near-infrared spectroscopy,15 determination of total polyphenols content in green tea using FT-NIR spectroscopy,16 for simultaneous active substance detection in pharmaceutical formulations,17 to the detection of quality parameters in biodiesel/diesel blends,18 and for the detection of contaminants in lubricating oil.19 The principle of the method involves splitting a spectrum into several equidistance subintervals or regions and then developing a PLS model for each subinterval using a selected number of latent variables. To the authors’ knowledge there have not been any studies into how iPLS might contribute to the analysis and interpretation of process malfunctions and faults within a multivariate statistical process performance monitoring scheme. In this paper the iPLS implementation differs from previously © XXXX American Chemical Society
Special Issue: John MacGregor Festschrift Received: December 21, 2012 Revised: April 26, 2013 Accepted: April 26, 2013
A
dx.doi.org/10.1021/ie303562t | Ind. Eng. Chem. Res. XXXX, XXX, XXX−XXX
Industrial & Engineering Chemistry Research
Article
2. COMBINING DATA UNFOLDING AND INTERVAL SPLICING TECHNIQUES 2.1. Three-Dimensional Data Unfolding. MPLS is an extension of PLS to deal with data in three dimension arrays. By three dimensions we refer to variables sampled over a fixed time duration collected across several batches (repeat runs of the same process) as shown in Figures 1 and 2. The MPLS is
the same dimension (KI × J) but with different ordering of the rows. 2.2. Combining Data Unfolding and Interval Splicing. The process involves first splitting the batches along identical time slice intervals. Each subinterval is then unfolded according to the unfolding approach (iv) in Figure 2. Data blocks extracted from identical intervals across each batch are combined into r subinterval grouping, and a PLS model is constructed for each subinterval as shown in Figure 3.
Figure 1. Time-mode unfolding methods (i) and (ii).
Figure 3. Batch-wise interval segmented unfolding of data set.
Critical to the success of iPLS is the process of arriving at the ideal interval subdivision across all batches or the selection of a subset of the intervals which will minimize the overall RMSE. Chen et al.15 used a genetic algorithm to select a subset of the intervals that most efficiently contributes to the prediction of the quality parameter. In this subsection we describe a novel procedure which has been effectively applied to derive the efficient interval subdivision across all batches. The approach renders the selection of subintervals unnecessary as it will be demonstrated that if the batch, or spectrum, duration is optimally spliced then the number of subintervals required is significantly less than choosing an arbitrary fixed-length subinterval. The algorithm was inspired by an algorithm proposed by Kegl et al.22 for constructing principal curves using polygonal line segments. The principal curve algorithm was initialized by using the first linear principal component of the data set. The algorithm then iteratively identifies new segment vertices by adding one vertex to polygonal curve in each iteration step. The new position of each vertex was determined by minimizing an average squared distance criterion penalized by a measure of the local curvature. The stopping criterion is based on a heuristic complexity measure, determined by the number of segments k, the dimension of data matrix, and the average squared distance of the polygonal line from the data points. Figure 4 shows a flowchart of the proposed algorithm, which is initialized by defining first a minimum subinterval length or step size Smin. All subsequently selected interval lengths are integer multiples of this minimum interval spacing. The process is best described as an iterative binary subdivision process that terminates when either a predefined maximum number of subdivisions is reached or the algorithm deems that further subdivision of the intervals would not achieve any significant reduction in the RMSE of prediction. To evaluate the degree of success or failure of a given subdivision stage, we define two
Figure 2. Batch-mode unfolding (iii) and variable-mode unfolding (iv).
essentially the equivalent of PLS applied to the unfolded three dimension data set X (I × J × K) where I is the number of batches, J is the number of variables, and K represents the number of measurement samples for the duration of the batch run. As shown in Figures 1 and 2 the data can be unfolded in one of three different modes: (i) and (ii) time, (iii) batch, and (iv) variable. Each of the three general unfolding categories (time, batch, and variable) has two alternative implementations making the total number of two-dimensional structure representations equal to six. However, due to the equivalence of some of the unfolded structures, the number of unique 2-dimensional data set structures reduces to three. For instance, (ii) and (iv) are of B
dx.doi.org/10.1021/ie303562t | Ind. Eng. Chem. Res. XXXX, XXX, XXX−XXX
Industrial & Engineering Chemistry Research
Article
Figure 4. Flowchart of binary interval splitting algorithm.
best binary split of any given segment is a sequential process of evaluating all split scenarios as illustrated in the flowchart (Figure 4) and the example shown in Figure 5. The partitioning is adjustment in discrete step size defined by the minimum subinterval. The binary division algorithm compares the segment RMSE value (RMSEi) before a split with the weighted sum of the split version of the interval to determine whether splitting the segment into two parts achieves any improvement over keeping the segment as a single block. The impact of a given subinterval binary splitting on the total RMSE (RMSET) is also factored-in using a combined weighted criterion computed as the weighted sum of two ratios
measures - the total and segment RMSE, given by eq 1 and eq 2, respectively r
RMSET =
ni RMSEi N
(1)
∑ (y ̂ − y)2 ni
(2)
∑ i=1
RMSEi =
where ni is the length (number of samples) of the ith subinterval, N is the full length of the batch duration (total number of samples), r is the total number of segments the batch is currently divided into, and ŷ is the estimated/predicted value of the quality variable measurement y based upon the PLS model of the ith subinterval. It is noted that the total RMSE is computed as a weighted sum of the interval RMSE such that the largest subinterval will have the most influence on the overall value. Identifying the
γi = α
RMSEi+ RMSET+ − + (1 − α) RMSET RMSEi−
(3)
RMSE+T
where is the evaluated total RMSE after a given subinterval is split, and RMSE−T is the total RMSE before splitting. Likewise, RMSE−i is the segment RMSE of the selected C
dx.doi.org/10.1021/ie303562t | Ind. Eng. Chem. Res. XXXX, XXX, XXX−XXX
Industrial & Engineering Chemistry Research
Article
Figure 5. An illustrative example of binary division algorithm for interval splitting.
The binary division interval splicing algorithm was first evaluated using simulated data defined by eq 5 where the coefficients a1 and a2 were designed to take on different parameter values over the different time intervals indicated.
interval before it is subdivided, and RMSE+i is the average segment RMSE after the segment is split into two subparts. The weighting parameter α takes on values over the range from 0 to 1, setting the parameter closer to 1 allows more emphasis to be placed on improving the total RMSE of the fit. By design, therefore, γ will take on values less than 1 if a splitting of the data set along a given time line reduces the overall RMSE and will be equal to or greater than 1 if an experimental splitting of the batch fails to improve the prediction fit. Based upon the criterion of eq 3, the algorithm is biased to favor splitting of the largest subinterval in any given stage since a registered improvement in the segment RMSE resulting from binary splitting exercise will impact on the RMSET value more if that subinterval represents a large proportion of the total batch duration. Therefore, as indicated in the flowchart (Figure 4), the algorithm will first target the largest subinterval and may either revert to the previous state if the split is rejected or proceed, in either case targeting the next largest subinterval for splitting. After the optimal subdivision of the batch duration is computed, the final stage of the process involves developing a PLS model to fit each interval k
Ek + 1 = X −
i=1
+ n(t ) 0.15;
0 < t ≤ 100
0.75; 100 < t ≤ 160 0.35; 160 < t ≤ 225 1.25; 225 < t ≤ 300
(5)
Subsequently, the linear relationship between the response variable y and predictor variables x1 and x2 is dependent upon the current time or sample instance. The data are corrupted with random noise n(t) with zero mean and variance equal to 10% of the true measurement variances. Three different scenarios were considered and are summarized in tabular form which is available as Supporting Information (simulation results evaluating the capability and accuracy of proposed interval division algorithm) where Smin is the minimum subinterval length. Ten batches of data were generated each spanning a batch duration of 300 samples. A value for Smin = 15 was selected. Multiway unfolding and interval splitting algorithm applied. Figure 6 shows the prediction fitting evolution for the response variable y as the algorithm discovers the optimal splicing of batch data set range. The RMSE reduction with discovery of the intervals progresses are as follows: (a) Total RMSE = 0.28403, (b) Total RMSE = 0.13022, (c) Total RMSE = 0.050297, and (d) Total RMSE = 0.028955. The vertical blue lines in each of the plots from (a) to (d) indicate the intervals in the order they are discovered (identified) by the proposed binary division algorithm.
k
∑ Tp i i and Fk + 1 = Y − ∑ Uq i i i=1
y(t ) = a1x1(t ) + a 2x 2(t ) ⎧ a1 = 0.75; a 2 = ⎪ ⎪ a1 = 0.10; a 2 = ⎨ ⎪ a1 = 0.35; a 2 = ⎪ ⎩ a1 = 0.50; a 2 =
(4)
where k is the number of latent variables used, X is the process measurements, and Y is the process quality variables. The process measurement and quality variable are therefore simultaneously decomposed into a set of scores T and U using a matrix of loading vectors p and q, respectively. A linear inner relation between the scores ensures that the scores of X are generated to be most correlated while being highly predictive of Y. D
dx.doi.org/10.1021/ie303562t | Ind. Eng. Chem. Res. XXXX, XXX, XXX−XXX
Industrial & Engineering Chemistry Research
Article
heating/cooling water flow rates; the internal state variables are culture volume, generated heat, carbon dioxide, dissolved oxygen, biomass, penicillin, and substrate feed concentrations; and the controlled variables used are pH and bioreactor temperature with glucose addition being carried out in openloop. The typical offline measurements of biomass and penicillin concentration were assigned as the process quality variable for which the predictive model was developed. The relationship between these variables is characterized by nonlinear dynamic equations, and the process is multistage in nature. In the initial preculture stage most of the necessary cell mass is usually generated after which penicillin production commences at the exponential growth phase. Penicillin production then continues until cell growth reaches the stationary phase. To ensure high penicillin productivity a minimum cell growth rate is maintained by feeding glucose continuously into the system during cultivation instead of applying it all at once. The duration of each batch is 400 h, comprising a preculture stage (about 45 h) followed by a final fed-batch stage. All batches are of the same duration, and measurement samples are collected with a sampling interval of 0.5 h making each batch essentially comprising of 800 samples (400 h). The set points for the controlled variables and initial conditions of the input variables are tabulated and available in the Supporting Information. The proposed MiPLS model was then used to detect faults in a simulated fed-batch penicillin cultivation process. Twenty batches were used for each of model training and validation and five ‘unseen’ batches for testing and evaluation of each of the various simulated fault conditions. 3.2. Prediction and Process Monitoring. The input/ output structures of the predictive and process monitoring models are shown in Figure 7. The PLS2 predictive model estimates the quality variables − biomass and penicillin concentrations using eleven measured process variables. The same set of input variables was employed to develop the process monitoring model. The process monitoring model is built using bootstrap aggregation regression of PLS1 models (see later discussion) with each PLS1 model providing an output that is representative of the estimation of one of the six selected online measured variables (substrate concentration, dissolved oxygen concentration, CO2 concentration, pH, reactor temperature, and generated heat) from the eleven predictor variables. Estimation or reconstruction of a given
Figure 6. Plot of predicted response versus actual response through different stages of the interval splitting algorithm.
The algorithm gave the best results when the parameters α and γmax were assigned values in the ranges 0.6 < α < 1.0 and 0.8 < γmax < 0.9, respectively. If γmax is chosen too close to the value of 1.0 and/or α chosen to be 0.5 or less, the binary division algorithm can be too sensitive to provide reduction in RMSEi and had the tendency of overshooting the actual number of segment intervals required in some cases.
3. FED-BATCH PENICILLIN SIMULATOR PREDICTION AND FAULT MONITORING 3.1. Fed-Batch Penicillin Production Process Simulator Overview. Commercial quantities of secondary metabolites are produced using filamentous micro-organisms. Formation of the target product (here penicillin) occurs after the cells have grown, and it is common practice to first grow the micro-organisms in a batch culture and then to promote the synthesis of the antibiotic by means of a fed-batch operation. Process data were generated using the PenSim simulator.20 The input variables are substrate feed rate and substrate feed temperature; the manipulated variables are acid/base and
Figure 7. Predictive model based on PLS2 (left) and fault detection models based on PLS1 (right). E
dx.doi.org/10.1021/ie303562t | Ind. Eng. Chem. Res. XXXX, XXX, XXX−XXX
Industrial & Engineering Chemistry Research
Article
Table 1. Specifications of the iPLS Models Used To Implement Process Monitoring Strategy model outputs process variables
predictor variables
1. aeration rate 2. agitator power 3. substrate feed rate 4. substrate feed temperature 5. substrate concentration 6. dissolved O2 concentration 7. volume 8. CO2 concentration 9. H+ concentration (pH) 10. reactor temperature 11. generated heat
not used not used not used not used 1, 2, 4, 6, 7, 11 1, 3, 5, 9, 10, 11 not used 1, 10, 11 1, 3, 5, 6, 7, 10, 11 1−9, 11 1−10
intervals (samples)
interval variance
0−90, 91−801 0−90, 91−600, 601−801
2.72 × 10−4, 1.59 × 10−5 8.39 × 10−6, 2.63 × 10−6, 1.79 × 10−6
0−801 0−90, 91−270, 271−801 1−330, 331−420, 421−801 1−90, 91−210, 211−300, 301−801
1.9 × 10−3 3.54 × 10−5, 7.39 × 10−5, 1.06 × 10−5 1.7 × 10−3, 1.2 × 10−3, 5.2 × 10−3 2.1 × 10−3, 11 × 10−2, 1.2 × 10−3, 14.2 × 10−3
In the development of the PLS models, both with and without variable selection, a bootstrapped PLS regression algorithm24 was employed. The algorithm incorporation within an iterative variable selection procedure is shown in Table 2,
output variable is based on the remaining variables or subset thereof consisting of input variables, control, and manipulated variables. Along with employing an interval PLS approach, the model development also incorporates variable selection to identify the minimum subset of the ten remaining process variables that is most reflective of the particular output variable being estimated. Table 1 shows details of each of the iPLS1 process monitoring models, indicating the interval segmentation evolved by the proposed binary splitting interval algorithm. The selected variables used by each PLS1 model as inputs are indicated by their index number in column 2 of the table ‘predictor variables’. The process variables, for which ‘not used’ appears in the column, are the five external input variables to the process, and as such no PLS models were constructed to reconstruct these variables. The method of variable selection incorporated with the interval PLS model identifies the minimum subset of predictor variables to apply in the specific model without significantly degrading the prediction RMSE for the particular output variable being estimated. Figure 8 illustrates the design of the
Table 2. Bootstrap PLS Algorithm24 Integrated within Variable Selection Algorithm step
calculation
Step 1
Initiate outermost loop iteration to start at first column c = 1; set j = 1; Mean center and scale X and Y; Remove a predictor variable (column c from X) and save the remaining variables as Xjc, set E0 = Xjc and F0 = Y and k = 1. Set uk to the first column of Fk‑1 wk = ETk−1uk/(uTk uk) Scale wk to unit length: wk = wk/∥wk∥ tk = Ek−1wk qk = FTk−1tk/(tTk tk) Scale qk to unit length: qk = qk∥qk∥ ũk = Fk‑1qk Check for convergence. Set uk = ũk If converges go to Step 13, else return to Step 5. Ek‑1 loadings: pk = ETk−1tk/(tTk tk) Regress uk onto tk: bk = uTk tk/(tTk tk) Compute Residuals: Ek = Ek − tkpTk and Fk = Fk − bktkpTk Apply cross-validation to determine whether to generate additional latent structure if required then return to Step 5, else move to Step 17. Execute interval splicing algorithm, repeating Steps 3 to 16 to identify optimal interval splitting; save best iPLS model. If c is less than the number of columns of X then set c = c + 1 and return to Step 2, else go to Step 19; Identify the best PLS model based on RMSE and set new X = Xjc (best case); If j > 1 compare the current best case PLS model with the previous one. If RMSE is lower than set j = j + 1 and return Step 1 else terminate outermost loop.
Step 2 Step 3 Step Step Step Step Step Step Step Step Step Step Step Step Step
4 5 6 7 8 9 10 11 12 13 14 15 16
Step 17 Step 18 Step 19 Step 20
Figure 8. Variable selection search tree algorithm.
variable selection process. The technique uses a search tree algorithm inspired by Bratley’s Algorithm23 employed in a periodic task scheduling applications to identify a feasible schedule from a set of non-pre-emptive tasks. The process starts out by using all n predictor variables and then branches into n nodes. Each branch node represents a case where one of the predictor variables is removed. The node with the minimum RMSE value is then selected, the other branches are terminated, and the process is repeated again for the selected node. The tree terminates when there is no node among the set of new branches provides any significant improvement in the RMSE value over that of the previous level.
essentially the subsection spanning from Steps 3 to 16. The selection of the number of latent variables A to employ was in all cases done via cross-validation, Step 16, which decides the number of iterations of the innermost loop. The outermost loop in each stage eliminates one predictor variable by identifying the best performing PLS model based on the subset of predictor variables Xjc. In a given stage j one PLS model is constructed for each unique set of predictor variables Xjc which differ from each other based on the selected variable that is removed. When removal of predictor variables fails to F
dx.doi.org/10.1021/ie303562t | Ind. Eng. Chem. Res. XXXX, XXX, XXX−XXX
Industrial & Engineering Chemistry Research
Article
Figure 9. Plot of root-mean square error RMSE reduction during discovery of the MiPLS model (sample) intervals.
yield further reduction in RMSE the algorithm terminates. The overall algorithm is a nested three loop structure. It was found that in this particular application study no more than four segments were needed to create the best MiPLS1 model. The PLS1 fault detection model residuals, computed as the difference between the estimated and the actual measured outputs, were then combined to form the monitoring metric. Both the squared prediction error (SPE) statistics and Hotelling’s T2 were experimented with as possible monitoring metrics m
SPEk =
∑ eik2 i=1
m
Tk2 =
∑ i=1
θik2 σk
(6) Figure 10. Predictions of penicillin and biomass concentrations using standard PLS model (upper plots) and the proposed MiPLS model (lower plots). The vertical blue lines indicate the intervals applied by the interval splicing algorithm.
(7)
where k is the sample instance, m is the number of variables for which the residuals are computed, and σk is a time varying covariance computed and applied as proposed in ref 25. The same monitoring statistics have been used by other process monitoring studies using the Pensim simulator.20,25 It is noted that in this study it was necessary to apply a time dependent variance to compute the T2 statistics due to a difference in variance characterizing the different intervals of the iPLS model.
converted into batch times). The results were obtained using five ‘unseen validation batches’ not used in the initial model building. It is interesting to observe that the intervals discovery has located intervals that reflect the phase nature of the fermentation, namely the preculture phase of around 45 h but which can vary from batch to batch, followed by about 355 h of the fed-batch stage. 4.2. Process Performance Monitoring. In order to investigate the potential of MiPLS to the detection of both abrupt and incipient (subtle) faults, the three process malfunctions were simulated, and the detection capabilities of the SPE and Hotelling’s T2 statistics monitoring statistics were evaluated. For each type of fault and occurrence (abrupt and incipient), four batches were generated with different fault commencement times (50, 100, 150, and 200 h). Fault 1 simulated a sudden pH controller failure, and fault 2 simulated both a sudden decrease in substrate flow rate of 15% and a gradual fault with a rate of decrease of 0.002% of the nominal value of the variable per sample (0.5 h). Fault 3 simulated both sudden and incipient faults using the same magnitudes as for fault 2 for the case of decrease in agitator power. The abrupt faults were all readily detectable performing equally as well as conventional multivariate statistical process control technologies studies by others using the Pensim simulator, and consequently only the results of the incipient subtle faults are discussed here. Figure 11 compares the fault detection performance of incipient faults 2 and 3 initiated at time t = 150 h. The left
4. PREDICTION AND MONITORING RESULTS 4.1. Predictive Model Performance. For the PLS2 predictive model alternatives considered for predicting biomass and penicillin concentration, the variable selection algorithm employed predictor variables 3 to 7 and 9 to 11. The evolution and reduction in the overall RMSE through the stages of the binary division algorithm used to identify the boundaries of the intervals employed by MiPLS model is shown in Figure 9. It can be observed that the initial RMSE corresponding to a noniPLS model, that is, a standard MPLS approach, is dramatically improved when the final MiPLS model is evolved. In all cases studied the interval splicing achieved reduction of the initial RMSE to less than 30% of its initial value. The plots in Figure 10 show the reconstruction/prediction of biomass concentration and penicillin concentration when the final MiPLS2 model is employed (stage 5 of Figure 9), in comparison to a conventional PLS model employing no interval segmentation (stage 1 of Figure 9). The vertical blue lines highlight the intervals identified and employed by the MiPLS algorithm (note that the identified ‘sample’ intervals are G
dx.doi.org/10.1021/ie303562t | Ind. Eng. Chem. Res. XXXX, XXX, XXX−XXX
Industrial & Engineering Chemistry Research
Article
immediately after fault iniation. Detection of fault 2 was not as quick and for the most part went undetected by the SPE statistics. A similar detection performance for this fault was also reported in Lee et al.25 The early detection of incipient (subtle) faults is crucial as the impact of the fault on the process quality variable would not be detectable in actual operation given that these are typically off-line measurements. Overall the Hotelling’s T2 statistics performed more reliably than the SPE statistics. It is conjectured that this is because it accounts for the difference in the residual variance across the different intervals of the PLS model fit. To prevent false alarm conditions for the SPE statistics a variable control limit over different interval time spans of the batch duration, similar to the control limits employed by Lee et al.,25 could be used. The studies presented demonstrate the potential of the proposed Multiway interval PLS multiple model approach for a new process performance monitoring method to that using a single PLS model through the increased flexibility and degrees of freedom inherent to the iPLS method. Employing subinterval dividing and the development of independent PLS models for each interval segment provides for more effective capturing and representation of the process dynamics and nonlinearities. Instead of just one general PLS model, several models are implemented with the freedom to select the best number of latent variables for each specific model to most accurately capture the dominant process features characterizing the specific interval within the batch duration.
Figure 11. Detection of incipient faults 2 and 3 using Hotelling’s T2 statistics.
vertical bar indicates the time of fault initiation, the right vertical bar indicates the time of fault detection, and the horizontal line represents the control limit. It can be observed that the detection of incipient fault 3 is achieved more readily than incipient fault 2. The horizontal dashed line represents the 99% control limit. Figure 12 illustrates the detection of incipient fault 2 for three different fault initiation times 50 h, 100 h, and 200 h. The
5. CONCLUSIONS The iPLS implementation presented differs from previously published papers in process analytics in two key ways. The first is that the subintervals are not equally spaced and the interval slicing is optimally selected based upon an algorithm to minimize the overall root mean squared error (RMSE) of prediction. Second, all subintervals are employed as the aim of the model is to support the detection of fault conditions developing at any stage in the batch run and which may have adverse impact on the target product quality. The proposed multiway iPLS (MiPLS) modeling approach combined with the proposed algorithm for identifying the optimal interval subdivision is shown to produce significant reduction in the RMSE over the stand MPLS approach. The usually adopted fixed-interval length iPLS approach typically involves the use of a genetic algorithm to select a subset of the intervals that are best predictive of the quality variable. Employing the binary algorithm has facilitated the use of a much smaller set of define segments/intervals to implement the iPLS model. It was demonstrated that optimal selection can significantly reduce the number of segments/intervals needed. The proposed binary division algorithm offers an efficient alternative to employing a fix size interval selection approach to the implementation of an iPLS model. It is noted that in the study presented the fault detection capabilities of the proposed MiPLS process monitoring scheme did not provide as dramatic an improvement in comparison to other more conventional approaches and those where wavelet decompositions have been used26−29 but does demonstrate that the MiPLS approach is arguably just as effective and deserving further study. Overall, both the Hotelling’s T2 and the SPE statistics were observed to be capable of detecting both incipient and abrupt faults arising at different batch run times. In practice, batches are not likely to be equal length, and methodologies such as dynamic time warping and its variants
Figure 12. Detection of incipient Fault 2 using T2 statistics: a) fault initiated at t = 50 h; b) fault initiated at t = 100 h; c) fault initiated at t = 200 h.
point of detection of the incipient faults and the detection delay time are indicated by the vertical dotted-lines. For the case of this particular incipient fault the results indicated a trend of reduced detection delay time for fault initiated later in the process. For the other incipient fault simulated the detection delay time was relatively constant. In comparison to fault detection performance of MPCA models presented by Birol et al.20 and Lee et al.,25 the proposed model results in much more distinctive detection of faults 1 and 3 with the monitoring statistics significantly exceeding the control limits almost H
dx.doi.org/10.1021/ie303562t | Ind. Eng. Chem. Res. XXXX, XXX, XXX−XXX
Industrial & Engineering Chemistry Research
Article
and other batch equalization methods such as ‘indicator variable approach’ have been widely explored. Only a few articles address the online real time issues although a recent article30 discusses the challenges and potential solutions. Such solutions could provide the batch data structures required for advanced process monitoring strategies including MiPLS. For batch processes with natural occurring intervals as determined by the batch recipe, one approach would be to build a separate MiPLS model for each recipe-based phase. An alternative approach is where the batch trajectories are subdivided into operating regions, which could be recipe-driven, with a MiPLS model being fitted within each region. These individual models are spliced together to provide an overall nonlinear global model.31
■
(10) Schlags, C. E.; Popule, M. P. Industrial application of SPC to batch polymerization. Process. Proc. Am. Control Conf. (Arlington, VA) 2001, 25. (11) Tates, A. A.; Louwerse, D. J.; Smilde, A. K.; Koot, G. L. M.; Berndt, H. Monitoring a PVC batch process with multivariate statistical control charts. Ind. Eng. Chem. Res. 1999, 38, 4769. (12) Nomikos, P.; MacGregor, J. F. Monitoring batch processes using multiway principal component analysis. AIChE J. 1994, 40, 1361. (13) Kosanovich, K. A.; Piovoso, M. J.; Dahl, K. S.; MacGregor, J. F.; Nomikos, P. Multiway PCA applied to an industrial batch process. Proc. 1994 Am. Control Conf. 1994, 1, 1294. (14) Hu, K. L.; Yuan, J. Q. Statistical monitoring of fed-batch process using dynamic multiway neighborhood preserving embedding. Chemom. Intell. Lab. Syst. 2008, 90, 195. (15) Chen, Q.; Jiang, P.; Zhao, J. Measurement of total flavone content in snow lotus (Saussurea involucrate) using near infrared spectroscopy combined with interval PLS and genetic algorithm. Spectrochim. Acta, Part A 2010, 76, 50. (16) Chen, Q. S.; Zhao, J. W.; Liu, M. H.; Cai, J. R.; Liu, J. H. Determination of total polyphenols content in green tea using FT-NIR spectroscopy and different PLS algorithms. J. Pharm. Biomed. Anal. 2008, 46, 568. (17) Silva, F. E. B.; Ferrao, M. F.; Parisotto, G.; Muller, E. I.; Flores, E. M. M. Simultaneous determination of sulphamethoxazole and trimethoprim in powder mixtures by attenuated total reflectionFourier transform infrared and multivariate calibration. J. Pharm. Biomed. Anal. 2009, 49, 800. (18) Ferrao, M. F.; Viera, M. D.; Pazos, R. E. P.; Fachini, D.; Gerbase, A. E.; Marder, L. Simultaneous determination of quality parameters of biodiesel/diesel blends using HATR-FTIR spectra and PLS, iPLS or siPLS regressions. Fuel 2011, 90, 701. (19) Borin, A.; Poppi, R. J. Application of mid infrared spectroscopy and iPLS for the quantification of contaminants in lubricating oil. Vib. Spectrosc. 2005, 37, 27. (20) Birol, G.; Undey, C.; Cinar, A. A modular simulation package for fed-batch fermentation: penicillin production. Comput. Chem. Eng. 2002, 26, 1553. (21) Bajpai, R.; Reuss, M. A mechanistic model for penicillin production. J. Chem. Technol. Biotechnol. 1980, 30, 330. (22) Kegl, B.; Krzyzak, A.; Linder, T.; Zeger, K. A polygonal line algorithm for constructing principal curves. In Advances in Neural Information Processing Systems; Kearns, M. S., Solla, S. A., Cohn, D. A., Eds.; 1999; Vol. 11, p 501. (23) Bratley, P.; Florian, M.; Robillard, P. Scheduling with earliest start and due date constraints. Naval Res. Q. 1971, 18, (4). (24) Zhao, S. J.; Zhang, J.; Xu, Y. M. Performance monitoring of processes with multiple operating modes through multiple PLS models. J. Process Control 2006, 16 (7), 763. (25) Lee, J. M.; Yoo, C. K.; Lee, I. B. Enhanced process monitoring of fed-batch penicillin cultivation using time-varying and multivariate statistical analysis. J. Biotechnol. 2004, 110 (2), 119. (26) Bakshi, B. R. Multiscale PCA with applications to multivariate statistical process monitoring. AIChE J. 1998, 44 (7), 1596. (27) Misra, M.; Yue, H. H.; Qin, S. J.; Ling, C. Multivariate process monitoring and fault diagnosis by multi-scale PCA. Comput. Chem. Eng. 2002, 26, 1281. (28) Yoon, S.; MacGregor, J. F. Principal-component analysis of multiscale data for process monitoring and fault diagnosis. AIChE J. 2004, 50, 2891. (29) Alawi, A.; Morris, A. J. Multiscale fault detection and diagnosis in fed-batch fermentation, 10th International IFAC Symposium on Computer Applications in Biotechnology Proceedings; 2007; p 37. (30) González-Martínez, J. M.; Ferrer, A.; Westerhuis, J. A. Real-time synchronization of batch trajectories for on-line multivariate statistical process control using dynamic time warping. Chemom. Intell. Lab. Syst. 2011, 105, 195. (31) Fletcher, N. M.; Morris, A. J.; Montague, G.; Martin, E. B. Local dynamic partial least squares approaches for the modelling of batch processes. Can. J. Chem. Eng. 2008, 86, 960.
ASSOCIATED CONTENT
S Supporting Information *
Tables related to simulation results evaluating the capability and accuracy of proposed interval division algorithm and initial conditions and set points used in the simulation. This material is available free of charge via the Internet at http://pubs.acs.org.
■
AUTHOR INFORMATION
Corresponding Author
*E-mail:
[email protected]. Present Address †
CAE Group, R&D Centre, OLED Business, Samsung Display Co., Ltd., South Korea. E-mail:
[email protected]. Author Contributions
The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript. Notes
The authors declare no competing financial interest.
■
ACKNOWLEDGMENTS The work was supported by Dorothy Hodgkin Postgraduate Award and BP International Ltd. The authors acknowledge the School of Chemical Engineering and Advanced Materials and the Centre for Process Analytics and Control Technology (CPACT).
■
REFERENCES
(1) Wold, S.; Geladi, P.; Esbensen, K.; Ohman, J. Multi-way principal components and PLS-analysis. J. Chemom. 1987, 1, 41. (2) Nomikos, P.; MacGregor, J. F. Monitoring batch processes using multiway principal components. AIChE J. 1994, 40, 1361. (3) Kourti, T.; Nomikos, P.; MacGregor, J. F. Analysis, monitoring and fault diagnosis of batch processes using multiblock and multiway PLS. J. Process Control 1995, 5, 277. (4) Nomikos, P.; MacGregor, J. F. Multi-way partial least squares in monitoring batch processes. Chemom. Intell. Lab. Syst. 1995, 30, 97. (5) Nomikos, P.; MacGregor, J. F. Multivariate SPC charts for monitoring batch processes. Technometrics 1995, 37, 41. (6) Nomikos, P.; MacGregor, J. F. Multi-way partial least squares in monitoring batch processes. Chemom. Intell. Lab. Syst. 1995, 30, 97. (7) Nomikos, P. Detection and diagnosis of abnormal batch operations based on multi-way principal component analysis. ISA Trans. 1996, 35, 259. (8) Kosanovich, K.; Dahl, K. S.; Piovoso, M. J. Improved process understanding using multiway principal component analysis. Ind. Eng. Chem. Res. 1996, 35, 138. (9) Ramaker, H. J.; van Sprang, E. N. M.; Gurden, S. P.; Westerhuis, J. A.; Smilde, A. K. Improved monitoring of batch processes by incorporating external information. J. Process Control 2002, 12, 569. I
dx.doi.org/10.1021/ie303562t | Ind. Eng. Chem. Res. XXXX, XXX, XXX−XXX