Improved Knowledge Extraction and Phase-Based Quality Prediction

Ind. Eng. Chem. Res. 2008, 47, 825-834


Improved Knowledge Extraction and Phase-Based Quality Prediction for Batch Processes

Chunhui Zhao, Fuli Wang,* Zhizhong Mao, Ningyun Lu,† and Mingxing Jia

School of Information Science and Engineering, Northeastern University, Shenyang, Liaoning Province, People’s Republic of China

* Corresponding author. Tel.: +86-024-83687434. Fax: +86-024-23890912. E-mail: [email protected].
† College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu Province, P. R. China.

This paper develops a process analysis and quality prediction scheme to improve quality estimation performance in batch processes. Combined with a prior phase division algorithm, correlation measure criteria are employed to identify critical phases and key variables with respect to quality prediction. As an effective process understanding and knowledge extraction tool, correlation analysis focusing on each phase helps reveal the phase-specific effect of process operation on product quality prediction without requiring prior expertise. The detrimental influence on quality inferential models caused by variable redundancy and autocorrelation is alleviated by the phase-based variablewise unfolding technique and the key variable selection procedure. Moreover, the proposed method does not require estimating unavailable future process observations when used for online quality prediction. Applications of the proposed scheme to injection molding, based on phase-specific analyses and variable selection, show its effectiveness and feasibility.

1. Introduction

Rapidly changing market competition and demand for consistent, high-quality products have spurred the development of quality-related investigations for batch processes. This is especially true for processes involved in the production and processing of low-volume, high-value-added products, including certain polymers, specialty chemicals, pharmaceuticals, and biochemicals. However, because of the absence of online quality measurements and the high-dimensional, correlated, and redundant process variables, online quality prediction and control in batch processes suffer from a lack of reproducibility caused by batch-to-batch variations. It is, thus, necessary to develop methods for quality prediction that allow us to measure quality variables in a simpler, faster, and more accurate way.

Recently, multivariate statistical methods,1,2 such as PCA and PLS, have been widely developed. Multiway principal component analysis (MPCA) and multiway partial least-squares (MPLS) modeling, pioneered by Nomikos and MacGregor,3,4 are applied in batch processes to directly extract useful underlying information from process measurements with little prior process knowledge. However, conventional MPLS modeling uses process variables over the entire batch course as the input, which requires estimating the unknown future process measurements in an evolving batch. Furthermore, when the three-way batch data structure is arranged in batchwise form, the number of unfolded generalized variables increases dramatically, with highly complex autocorrelation and cross-correlation. Although MPLS is well-known as an effective data-compression technique, one cannot expect superior feature extraction from such a large volume of data, and the performance of quality prediction deteriorates accordingly. In real industrial processes, the final product quality is mostly determined by some critical time regions and closely related with only a small portion of process measurement variables. Therefore, dimensionality reduction should be performed prior to PLS so that a reasonable number of pertinent variables can be chosen as the candidate variables in PLS projection pursuit. Previous works have shown the time-specific effects of process variables on the final product quality in real batch processes.5-8 They extend PLS/MPLS modeling for prediction improvement by selecting the core process variables or focusing on the critical-to-quality time periods to enhance process interpretation and analysis. Duchesne and MacGregor5 proposed a new pathway multiblock PLS algorithm. They incorporate information provided by intermediate quality measurements to help identify the time-specific effects of trajectory features on quality associated with phases of operation in which different physical phenomena dominate. However, for most industrial processes, online measurements of intermediate quality are rarely available. Bootstrapping-based generalized variable selection is used by Chu et al.6 to extract quality-related variables from typical batchwise unfolded batch data with limited samples and isolate the local effects of process variables on the final quality. Considering that multiplicity of phases is inherent in many batch processes and that process measurements of different phases may have different effects on the final product quality, a phase-based process analysis strategy has been developed by Lu et al.,7,8 in which the concept of a critical-to-quality phase is introduced and quality prediction focuses only on the critical phases.
The phase-based sub-PLS model obtained by averaging time-slice PLS models within the same phase, however, focuses only on the variation along the batch axis isolated at each time interval; it does not capture the correlations along the time direction within the same phase and cannot provide sufficiently stable and reliable regression models. In the present article, we present a new phase-based process analysis and online quality prediction method for batch processes with online measurements of process variables and off-line measurements of the final product quality. Combined with the modified phase clustering algorithm,12 the correlation measure index is introduced to check the critical phases to quality

10.1021/ie0707063 CCC: $40.75 © 2008 American Chemical Society Published on Web 01/15/2008


Ind. Eng. Chem. Res., Vol. 47, No. 3, 2008

prediction without requiring prior process knowledge. Then, in each critical phase, considering that different variables may have different effects on the end product quality, variables with significant contributions to quality variations are identified using a variable-selection procedure that combines correlation analysis with analysis of the mean prediction square error on validation data sets. Regarded as a prior data-compression strategy, the checking of critical-to-quality phases and the selection of key process variables in each critical phase can effectively exclude redundant data and extract critical information in advance. They reduce the complexity of the subsequent PLS analysis, namely, the second data compression, and help make the regression models more accurate and concise. Consequently, the final phase-based PLS regression models are developed and applied to online quality prediction.

Compared with other prediction methods, the scheme proposed here has the following advantages and disadvantages:

• Advantages:
(a) Combined with the prior phase clustering algorithm, it allows one to better understand the phase-specific nature of quality prediction, classify phases that bear different relations with quality, and, accordingly, judge the critical phases, which helps to locate the local effects of process variable trajectories on final quality.
(b) Using correlation analysis based on variablewise data arrangement, the identification of critical-to-quality phases and key process variables in each critical phase requires no heavy computation and has a more direct meaning with a parsimonious expression.
(c) In each phase, variablewise unfolding of the three-way array extends the batch samples with time series. PLS modeling based on generalized samples covers the time-varying variance information and eliminates the corrupting influences of process variable autocorrelations and cross-correlations on the regression equation. Moreover, it avoids the prediction of future process data required by batchwise unfolding in conventional MPLS-based methods.

• Disadvantages:
(a) The approach is more complex than the straightforward approach of PLS with batchwise unfolding, which is the price to pay for the better results of the proposed method.
(b) The method could become very convoluted if there is more than one quality variable, in which case one needs to perform the analysis separately for each quality variable. Thus, the information on the correlation among quality variables might be lost compared with the original PLS approach based on batchwise unfolding.

This paper is organized as follows. First, the details of the proposed method are described in Section 2. Then, the effectiveness and feasibility of the proposed prediction method are illustrated by applying it to injection molding in Section 3. Finally, conclusions are drawn in Section 4.

2. Methodology

In each batch run, assume that $J$ variables are measured at $k = 1, 2, \ldots, K$ time instances throughout the batch and one quality variable is measured at the end of each batch. Then the process data collected from $I$ similar batches can be organized as a three-way array $X(I \times J \times K)$ and a corresponding quality variable vector $Y$ of dimension $I \times 1$, as shown in Figure 1a. Here, it should be noted that, in the present paper, we treat only the case with a univariate dependent variable; for the multivariate case, one just needs to treat each of the dependent variables separately. In the present work, unless otherwise stated, the batches are of equal length, so that the specific process time can be used as an indicator for data preprocessing, modeling, and online application.

Figure 1. Illustration of the phase-based PLS modeling scheme in cth phase: (a) batchwise unfolding and data normalization and (b) variablewise unfolding and regression modeling.
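The two data arrangements in Figure 1 can be illustrated with a minimal Python sketch. It assumes the three-way array is stored as nested lists indexed `X[i][j][k]` (batch, variable, time) and that a phase is given as a list of time indices; the function names are hypothetical and are not taken from the paper.

```python
def batchwise_time_slices(X):
    """Cut X[i][j][k] into K time slices, each an I x J matrix X_k."""
    n_batches, n_vars, n_times = len(X), len(X[0]), len(X[0][0])
    return [
        [[X[i][j][k] for j in range(n_vars)] for i in range(n_batches)]
        for k in range(n_times)
    ]


def variablewise_unfold(X, y, phase):
    """Stack the time slices of one phase beneath one another.

    Returns X_c with K_c * I rows and J columns, and y_c, the end-of-batch
    quality vector y duplicated K_c times so that the row counts match.
    `phase` is the list of time indices belonging to the phase (assumed known).
    """
    slices = batchwise_time_slices(X)
    X_c = [row for k in phase for row in slices[k]]
    y_c = [y_i for _ in phase for y_i in y]
    return X_c, y_c
```

With I = 2 batches, J = 2 variables, and a phase covering two time points, `variablewise_unfold` returns a 4 × 2 matrix together with a quality vector of length 4, which is the shape used by the phase-based correlation analysis and regression.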

2.1. Phase Division Algorithm. For the multiphase batch processes, each phase has its own underlying characteristics. The effects of process behaviors on quality are similar within the same phase. In contrast, different influences on quality are exhibited over different phases. Generally, the product quality is only determined by several critical phases and explained by some key process variables. It is, therefore, better to partition the whole batch process into different phases, then locate the critical phases, and identify the key process variables. It is well-known that there have been various techniques employed to get phase identification.7,9-12 Zhao et al.9,10 investigated multiple PCA/PLS models for different operating modes based on metrics in the form of principal angles to measure the similarities of any two models. Doan et al.11 developed an approach to model the dynamics of nonstationary processes based on DPCA. In their method, phase transitions produce distinct patterns in the DPCA scores, which can be identified as singular points. A variant of k-means clustering algorithm has been developed by our earlier work.7,12 Different from the other phase identification algorithms, the clustering algorithm is developed assuming that the alternation between phases can be reflected by checking the changing trend of underlying process characteristics. Generally, all the alternative ways provide diverse solution concepts from different views


and aspects with different advantages and specific applicability under some conditions. Here, the basic clustering unit employed in the present work is loading matrix Pk(J × J), which is obtained from PLS algorithm rather than from the PCA12 or PLS regression parameter matrix.7 On the basis of this idea, those consecutive samples with similar underlying characteristics related to quality are collected together as a phase and dissimilar sample patterns are classified into different phases. The clustering algorithm used for the determination of phases is simple and straightforward without increasing comprehension and treatment complexity. Moreover, the time-slice score matrices obtained from the clustering analysis will be readily employed in the following identification of critical phase to reveal the correlations between process measurements and quality variable. Loading matrices obtained from PLS, Pk(J × J), revealing the process correlations related to quality at K time intervals, are employed as the basic unit in our clustering algorithm. It should be noted that, in the present work, they are obtained by performing PLS algorithm on normalized time-slice data sets {Xk(I × J), Y(I × 1)} instead of PCA algorithm in our previous work.12 In this way, those process variations that are more correlated with the quality variables would be used, rather than all the common behaviors in the process variables. Then the loading matrices derived from PLS analysis, Pk(J × J), are transformed into a weighted form in view of the importance of each column, Pk,j:

$$\tilde{P}_k = [P_{k,1} \cdot g_{k,1},\; P_{k,2} \cdot g_{k,2},\; \ldots,\; P_{k,J} \cdot g_{k,J}] = P_k \cdot \mathrm{diag}(g_{k,1}, g_{k,2}, \ldots, g_{k,J}) \tag{1}$$

where $g_{k,j} = \lambda_{jk} / \sum_{j=1}^{J} \lambda_{jk}$ and $\lambda_{jk}$ is exactly the variance of the associated $j$th principal component at each time $k$. Here, it should be noted that $\lambda_{jk}$ differs from the eigenvalue of the covariance matrix in PCA. The phase-based modeling begins with analyzing and clustering these weighted loading matrices so that the process duration is properly divided into different phases. Different phases bear different correlations with the final quality. Consequently, some strategy should be developed to identify the critical-to-quality phases.

2.2. Critical Phase Identification. In the phase division procedure, the score matrix at each time instance, $T_k(I \times J)$, has already been obtained in addition to the loading matrix, $P_k(J \times J)$. After the phase division, $K_c$ score matrices $T_k(I \times J)$ ($K_c$ is the phase duration) can be collected together for each phase. In fact, they carry most of the important process variation information along the time direction within each phase. Moreover, scores are also used in the regression model to extract the main process information. Therefore, a phase-representative principal component should be formed, which can be employed to reveal the correlation between the process measurements and the quality variable. First, for each time-slice score matrix, $T_k(I \times J)$, it is necessary to consider the different importance of each principal component, $T_{k,j}(I \times 1)$, and give them different weights,

$$\tilde{T}_k = [T_{k,1} \cdot g_{k,1},\; T_{k,2} \cdot g_{k,2},\; \ldots,\; T_{k,J} \cdot g_{k,J}] = T_k \cdot \mathrm{diag}(g_{k,1}, g_{k,2}, \ldots, g_{k,J}) \tag{2}$$

where $g_{k,j}$ is the same as that in eq 1. Then, using variablewise data rearrangement, $T_c(K_c I \times J)$ is formed for the $c$th phase by putting all the weighted time-slice scores within the same phase, $\tilde{T}_k(I \times J)$, beneath one another in turn. In this way, $T_c(K_c I \times J)$ covers more of the detailed dissimilarity of the process nature varying over time within the same phase and reflects the different weights of the principal components, $T_{cj}(K_c I \times 1)$. Consequently, the synthetical phase-specific process variation information closely related to the final quality can be extracted naturally by the simple sum of the principal components:

$$\tilde{t}_c(K_c I \times 1) = \sum_{j=1}^{J} T_{cj}(K_c I \times 1) \tag{3}$$

In addition, unlike the process variables $X$ collected throughout the duration of each batch run, the quality variable $Y$ is only measured at the end of each run. In order to carry out the following correlation analysis and the later PLS regression, the number of rows in the quality variable should be arranged to be consistent in size with $\tilde{t}_c(K_c I \times 1)$. The $I$ batches of the normalized quality variable are duplicated $K_c$ times, as shown in Figure 1b. The data are structured this way because the quality at the end of the batch run is accepted if the operating condition of the data slice $X_k$ at the current time point $k$ is normal. Then, we can simply perform correlation analysis between the phase-representative principal score $\tilde{t}_c(K_c I \times 1)$ and the restructured quality variable $y_c(K_c I \times 1)$ in the $c$th phase to check the critical phases, without having to hunt through every time interval of the $c$th phase. The squared simple correlation coefficient, $CP_c^2$, can be defined to represent how good the correlation between the predictor variables and the response variable is,

$$CP_c^2(\tilde{t}_c, y_c) = \left( \frac{\mathrm{cov}(\tilde{t}_c, y_c)}{\sqrt{D(\tilde{t}_c) \cdot D(y_c)}} \right)^2 = \frac{\mathrm{cov}^2(\tilde{t}_c, y_c)}{D(\tilde{t}_c) \cdot D(y_c)} \tag{4}$$

where the function cov( ) is the covariance of two vectors and D( ) denotes the variance of a vector. The meaning of the criterion is clear. PLS regression extracts the principal component scores from the predictor variable data set, which should represent most of the process information and, meanwhile, have strong correlations with the response variable. So, in the calculation of $CP_c^2$, it is simple and sufficient to reveal the phase-specific correlation significance to quality based on the representative principal score instead of the process measurements. Obviously, $CP_c^2$ ranges from 0 to 1 since $CP_c$ itself varies between -1 and 1. Different phases result in different $CP_c^2$, revealing the change of the underlying correlation between process behaviors and quality over different phases. A larger $CP_c^2$ indicates higher correlation between the predictor variables and the response variable; that is, the phase is more critical and reliable for the prediction of quality. Naturally, a critical limit should be defined as the reference standard. It is well-known that the squared correlation coefficient is actually the coefficient of determination in simple linear regression analysis.13-15 Therefore, the F-test13-15 is used here to test the significance of the phase-specific explanation ability of process behaviors with respect to the final quality,

$$F = \frac{CP_c^2}{(1 - CP_c^2)/(I - 2)} \sim F_\alpha(1, I - 2) \tag{5}$$

where $I$ is the number of observation batches and $\alpha$ is the significance level. The critical limit of $CP_c^2$ can be conversely calculated by the above equation, where the critical value of F-statistic with


significance factors ($\alpha = 0.01$ or $0.05$) can be looked up in the statistical table of the F-distribution. If the squared correlation coefficient $CP_c^2$ is larger than the critical limit, the corresponding time region is defined as a critical phase with pivotal effects on the quality concerned. This information is extremely helpful for further understanding the different phase-specific effects of process variables on the end product quality and for establishing the corresponding modeling guideline. It can also be used to improve the process, either by redesigning the process equipment or by redesigning the control logic, with the aim of minimizing the variability currently present.

2.3. Key Variable Selection Based on Variablewise Unfolding. It is common that the product quality is determined by the key variables of several critical phases. Degradation of the prediction performance of the quality estimation model caused by inclusion of unimportant process variables is a serious problem. It is, thus, important to focus only on those variables that show the highest correlation with, or contribute most to, the quality variations in each phase. The importance of variable selection has been reported in numerous papers.16-19 The variable selection technique can also be used as a knowledge extraction tool. Generally speaking, although data compression techniques such as MPLS can handle the redundancy problem of batch process data, acceptable prediction accuracy cannot be expected without advance removal of nuisance variables. Because those variables have no correlation with the quality variables, they cannot be good predictors in the regression model and only degrade the data compression performance. If those variables are removed prior to modeling, the prediction performance of the quality-estimation model can be effectively improved because of the increased causal relationship between the predictors and the quality variables.
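Looking back at the critical-phase test of Section 2.2 (eqs 2 to 5), the computation can be sketched as follows. This is a rough illustration under assumptions, not the authors' implementation: the time-slice score matrices are taken as given (each an I × J list of rows), the population variance of each score column stands in for $\lambda_{jk}$ (only the ratio $g_{k,j}$ matters), and the F critical value is passed in by the user from a table. Solving eq 5 for the limit gives $CP^2_{\lim} = F_\alpha / (F_\alpha + I - 2)$. All function names are hypothetical.

```python
from statistics import fmean, pvariance


def weight_scores(T_k):
    """Eq 2: scale each score column of one time slice by g_{k,j}."""
    columns = list(zip(*T_k))                      # J columns of length I
    variances = [pvariance(col) for col in columns]
    total = sum(variances)
    g = [v / total for v in variances]             # weights sum to 1
    return [[t * g_j for t, g_j in zip(row, g)] for row in T_k]


def phase_score(score_slices, y):
    """Eq 3: stack the weighted slices of one phase and sum over components.

    Returns (t_c, y_c), both of length K_c * I; y is duplicated K_c times.
    """
    t_c, y_c = [], []
    for T_k in score_slices:
        for row, y_i in zip(weight_scores(T_k), y):
            t_c.append(sum(row))
            y_c.append(y_i)
    return t_c, y_c


def squared_correlation(t, y):
    """Eq 4: cov^2(t, y) / (D(t) * D(y))."""
    mean_t, mean_y = fmean(t), fmean(y)
    cov = fmean([(a - mean_t) * (b - mean_y) for a, b in zip(t, y)])
    return cov * cov / (pvariance(t) * pvariance(y))


def cp2_critical_limit(f_crit, n_batches):
    """Eq 5 solved for CP^2: limit = F / (F + I - 2), F taken from a table."""
    return f_crit / (f_crit + n_batches - 2)
```

A phase would then be flagged as critical when `squared_correlation(*phase_score(slices, y))` exceeds `cp2_critical_limit(f_crit, I)`.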
As mentioned previously, effects of process variables on quality remain similar within one phase, i.e., they indeed have similar correlations with quality. This provides a theory basis for the proposed correlation analysis based on variablewise unfolding. After the phase division, the normalized three-way array X(I × J × Kc) (Kc is the phase duration) is formed in each critical phase, which may be unfolded in different ways, deducing various two-dimensional matrices. Moreover, considering the underlying nature that each observation variable tends to show a phase-specific explanation ability to the final quality, that is, within the same phase, they will remain the similar effects on quality with slight variation, it is advantageous and reasonable to conduct the index analysis using the form of phase-based variablewise unfolding. On the one hand, by means of variablewise unfolding Xc(KcI × J) for each phase, generalized samples are composed of batches and time series from the same phase. They take the characteristics of the overall phase into account, which can offer more sufficient system information and get more stable and comprehensive analysis results. On the other hand, the complex interactions of autocorrelations and cross-correlations between process variables often impact badly on the selection results, which accordingly induce biased regression models with worse prediction performance. On the basis of variablewise unfolding of the entire phase, variable correlations between different time intervals are no longer a serious matter, since only the initial J physical process variables are considered simultaneously, which gives more explicit and reliable results. Therefore, the phase-specific implementation based on variable-unfolding avoids imposing heavy calculation burden and complexity focusing on each time interval and alleviates the influence of process variables collinearity along time within the same phase.

Any procedure for variable selection must contain two components: a selection criterion and a search procedure. The selection criterion is the measure used to rank feature subsets. Most variable selection search algorithms must balance accuracy against computational speed, so variables are commonly selected according to the increase in statistical indices such as $R^2$, adjusted $R^2$, AIC, Mallows' $C_p$, and others.14,15 In practice, these procedures often lead to overfitting because they pursue the ability to model the training data by measuring the degree of fit, rather than optimizing the selection of variables by specifically taking the generalized prediction ability of the model into account. Here, the correlation contribution rate of process variable $X_{c,j}(K_c I \times 1)$ to quality $y_c(K_c I \times 1)$ with respect to all candidate variables is defined to identify quality-related variables in each phase,

$$CV_{jc} = \frac{r(X_{c,j}, y_c)}{\sqrt{\sum_{j=1}^{J} r^2(X_{c,j}, y_c)}} \tag{6}$$

where $j$ and $c$ are indices of process variables and critical phases, respectively; $X_{c,j}(K_c I \times 1)$ is the generalized vector of the $j$th variable derived from variablewise unfolding in the $c$th phase; and $y_c(K_c I \times 1)$ is the duplicated quality variable within the same phase. It is clear that $CV_{jc}$ varies between -1 and 1 since it is transformed from the correlation coefficient. The moderate level of the contribution rate, $\pm\sqrt{1/J}$, can be easily derived by supposing that all the process variables possess identical correlation with quality. These values are employed as reference boundary limits for ranking the significance of variables in the following variable selection steps. Generally, the larger the magnitude of $CV_{jc}$, the larger the rate that the $j$th process variable accounts for relative to the other candidate variables, i.e., the more significantly the process variable contributes to the variation of quality in the $c$th phase. Variables with small $CV_{jc}$ values can be provisionally judged to have weak correlations with quality. It should be remarked, however, that correlation alone cannot exactly represent the prediction ability of a model. Process variables with small absolute $CV_{jc}$ may still improve the prediction performance if they enter the regression model. So, besides the prior correlation analysis, it is necessary to further evaluate these variables to verify whether they should be kept in the final regression models. Thus, the mean prediction square error is defined in eq 7.

$$MSE_c = \frac{1}{K_c I} \sum_{i=1}^{K_c I} (y_i - \hat{y}_i^c)^2 \tag{7}$$

where $c$ and $i$ are indices of phases and generalized observations, respectively, and $y_i$ and $\hat{y}_i^c$ are the measured and predicted quality, respectively. Here, it should be noted that $MSE_c$ is computed on the validation batches rather than on the calibration data used in calculating the $CV_{jc}$ index. The two parameters, $CV_{jc}$ and $MSE_c$, thus make full use of the training and testing data subsets, respectively, so that their combination ensures the generalization capability and prediction performance, overcoming the overfitting that arises when attention is devoted only to the prediction errors of the training data set. Commonly, when the number of observations is sufficient, the samples are split randomly into two parts: one used to calibrate the model (i.e., to estimate the parameter values) and the other used to validate the model (i.e., to test the predictive performance). It is often recommended to use more data to calibrate than to validate, usually a two-thirds to one-third split of the sample. The latter set is sometimes known as a hold-out sample. After fitting the model using the data from the calibration subset, we obtain the regression parameters and then use them to calculate the predicted values for the observations in the hold-out sample. Thus, we can better assess the generalized predictive accuracy of the model on the validation sample rather than the fitting ability on the calibration sample. Moreover, if the data set is too small to split, an alternative is to use jackknife validation. Generally, in previous work, the selection procedure is conducted by searching candidate variables one by one, as in the backward elimination, forward selection, and stepwise regression procedures.14,15 However, these require a large quantity of calculations, because the number of calculation steps grows exponentially with the number of candidate variables. From any set of $J$ predictors, $2^J$ alternative models can be constructed, since each predictor can be either included in or excluded from the model. In most circumstances, therefore, it is impractical to make a detailed examination of all possible regression models. In order to simplify the calculation and avoid evaluating all the possible modeling variables one by one, we first divide the candidate variables into different intervals and impose an order of priority on them according to the aforementioned $CV_{jc}$ criterion. Then, instead of individual variables, these intervals are searched step by step.
In this way, the number of possible models to be evaluated is greatly reduced compared with single-variable selection methods, which simplifies the analysis, avoids the blindness of random stepwise selection, and typically increases reliability. Certainly, if the number of variables is not large, all variables can be searched directly one by one. The variable selection steps we suggest are as follows:
(a) Set $\pm\sqrt{1/J}$ as the two-sided critical limits of the correlation relationship, since they are the moderate level of the contribution coefficient, $CV_{jc}$. Retain the process variables whose $CV_{jc}$ lies beyond the critical values $\pm\sqrt{1/J}$, and build the initial PLS regression model using them.
(b) Find the highest absolute value of $CV_{jc}$ among the remaining process variables and divide the range from zero to this highest absolute value into several smaller intervals. The number of intervals can be chosen according to the specific circumstances: more intervals give a more detailed variable selection analysis, whereas fewer intervals reduce the amount of calculation but give a coarser key variable search. The intervals are given a descending order of priority as they approach zero.
(c) Find the variables contained in the interval with the highest priority, add them to the PLS regression model, and compare the prediction square error with that of the former model using the validation data set. If the prediction error decreases, keep the variables in the new regression model and continue with the next interval. Otherwise, stop, exclude all the other variables, and use the current regression model as the final one.
By the above procedure, which is illustrated in detail in the later simulation, we can find a reasonable set of variables for use in the PLS regressions.
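Steps (a) to (c), together with the contribution rate of eq 6, can be sketched in Python as below. This is only one reading of the procedure: equal-width intervals are assumed, "the current regression model" at the stopping point is taken to be the last accepted one, and `validation_mse` is a hypothetical user-supplied callback that fits a PLS model on the calibration data with the given variable indices and returns $MSE_c$ on the validation data.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def contribution_rates(X_c, y_c):
    """Eq 6: CV_j = r_j / sqrt(sum_j r_j^2) for each column of X_c."""
    r = [pearson(col, y_c) for col in zip(*X_c)]
    norm = math.sqrt(sum(v * v for v in r))
    return [v / norm for v in r]


def select_variables(cv, validation_mse, n_intervals=3):
    """Interval-based search over candidate variables, steps (a) to (c)."""
    n_vars = len(cv)
    limit = math.sqrt(1.0 / n_vars)
    # (a) keep variables whose contribution rate lies beyond +/- sqrt(1/J)
    selected = [j for j in range(n_vars) if abs(cv[j]) > limit]
    rest = [j for j in range(n_vars) if abs(cv[j]) <= limit]
    # (b) split [0, max |CV| of the rest] into equal-width intervals
    top = max((abs(cv[j]) for j in rest), default=0.0)
    if top == 0.0:
        return selected
    best = validation_mse(selected) if selected else math.inf
    width = top / n_intervals
    groups = [[] for _ in range(n_intervals)]
    for j in rest:
        groups[min(n_intervals - 1, int(abs(cv[j]) / width))].append(j)
    # (c) try intervals from highest priority down; stop at the first failure
    for group in reversed(groups):
        if not group:
            continue
        mse = validation_mse(selected + group)
        if mse >= best:
            break
        selected, best = selected + group, mse
    return sorted(selected)
```

Because whole intervals are accepted or rejected, the callback is invoked at most `n_intervals + 1` times, compared with the $2^J$ models of an exhaustive search.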
The quality prediction ability of the regression model can be improved when one eliminates variables that probably bring only noise or idle information. The selection

strategy emphasizes the fact that, when predicting quality, we should have optimized the appropriate variables to work with so that the current regression model has the least average squared prediction errors for test sets, i.e., the best prediction ability. In this way, the predictor variables used in the phase regression models are refined so that one can get more stable and parsimonious PLS regression models, which show increased correlations between the predictors and quality variable. 2.4. Phase-Based PLS Modeling Algorithm. According to the aforementioned phase-specific process behaviors, observed variables should have similar explanation and contribution to quality within the same phase, which provides a reasonable basis for phase-based variables identifying, regression modeling, and quality predicting. Then, the regression parameters, containing the information of relationships between process variables and quality, should remain similar in the same phase and show differences over different phases. Therefore, it is preferable to locate the phase-specific effects of process variables on the final product quality by phase-based regression models, demanding no prior process knowledge. As previously mentioned, each of the batch data rearrangements corresponds to looking at a different type of variability information.20-24 Batchwise unfolding focuses on the variability among different batches, while variablewise rearrangement stresses the system variances extracted along both batches and time. When working with batchwise data, the unfolded matrix X(I × KJ) has a great number of generalized variables KJ opposite the initial comparatively small quantity of observed batches. On the one hand, only limited system information can be extracted from less batches; on the other hand, repetitions of the same process variables at different times cause the models to be complex and confused. 
Another major problem in the online predicting of quality is that the dataset of the new batch has the process measurements only up to the current time point. So it is inevitable that the unknown future observations should be estimated appropriately when online quality prediction is performed. Four approaches have been proposed for complementing the future observations.25,26 However, the existing methods have been criticized for not being sensitive enough to process dynamic changes or for being time-consuming. The estimated values may also distort the process information, potentially leading to worse quality prediction, since predicting performance of regression models directly depends on the estimation accuracy of missing future values. Here, modeling based on variablewise unfolding overcomes the above shortcomings and provides the following solution. As mentioned in key variables selection procedure, we have well-prepared a representative regression parameter matrix, Bc(Jc × 1) by performing PLS algorithm on {Xc(KcI × Jc), Yc(KcI × 1)} for each critical phase (where Xc(KcI × Jc) is the reference predictor dataset, Jc is the retained number of key variables, and Yc(KcI × 1) is the response dataset),

$$B_c = W_c \left( P_c^{T} W_c \right)^{-1} Q_c^{T} \qquad (8)$$

where Wc and Pc are, respectively, the weighting matrix and the loading matrix for Xc and Qc is the loading matrix for Yc. Then, the phase-based PLS regression model for quality prediction can be formulated as

$$\hat{y}_k = x_k B_c, \quad c = 1, 2, \ldots, C; \; k = 1, 2, \ldots, K_c \qquad (9)$$

where xk(1 × Jc) is the vector of selected process predictors at time k within the cth phase and ŷk is the predicted value of the final product quality.
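Equations 8 and 9 can be illustrated with a minimal sketch, assuming a single quality variable and the standard NIPALS form of PLS1. The function name `pls1_nipals` and the synthetic data are hypothetical and are not taken from the paper; the sketch only shows how W, P, and Q combine into Bc and how Bc is then applied to a row of predictors.

```python
import numpy as np

def pls1_nipals(X, y, n_components):
    """Fit PLS1 with NIPALS and return W, P, q and B = W (P^T W)^{-1} Q^T (eq 8)."""
    X = np.array(X, dtype=float)          # working copy, deflated in place
    y = np.array(y, dtype=float).ravel()  # working copy of the single response
    n_vars = X.shape[1]
    W = np.zeros((n_vars, n_components))  # weight vectors for X
    P = np.zeros((n_vars, n_components))  # loading vectors for X
    q = np.zeros(n_components)            # loadings for y
    for a in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)
        t = X @ w                         # score vector
        tt = t @ t
        p = X.T @ t / tt
        q[a] = (y @ t) / tt
        X -= np.outer(t, p)               # deflate X
        y -= q[a] * t                     # deflate y
        W[:, a] = w
        P[:, a] = p
    # Solve (P^T W) z = Q^T instead of forming the inverse explicitly.
    B = W @ np.linalg.solve(P.T @ W, q.reshape(-1, 1))
    return W, P, q, B

# Synthetic, autoscaled data standing in for {Xc, Yc}; values are made up.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
y = X @ np.array([0.5, -1.0, 0.0, 2.0]) + 0.01 * rng.normal(size=60)
y = y - y.mean()

W, P, q, B = pls1_nipals(X, y, n_components=4)
y_hat = X @ B   # eq 9 applied to every row x_k at once
```

With as many latent variables as predictors, the PLS coefficients coincide with ordinary least squares, which gives a simple sanity check; in practice fewer components would be retained.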


In conclusion, the procedure for the new phase-based modeling method is outlined as follows: First, the three-way array X(I × J × K) is unfolded batchwise into K time slices, as shown in Figure 1a. The time slices, as well as the final quality variable vector Y(I × 1), are then normalized to zero mean and unit variance. Subtracting the mean of each column approximately eliminates the main nonlinearity due to the dynamic behavior of the process and enables us to model the deviations from the mean process trajectory. Each variable is scaled to unit variance to handle different measurement units, thus giving each variable equal weight. Second, the PLS algorithm is performed on these normalized time-slice datasets {Xk(I × J), Y(I × 1)} to extract the loading matrices and score matrices, which tend to reflect the phase-specific process properties most relevant to quality variation rather than the common process behaviors. Then, the phase clustering algorithm is applied to the weighted loading matrices, P̃k, to partition the operation duration into different phases. Third, in each phase, the synthetic phase-representative principal score t̃c(KcI × 1) is derived using eq 3, and the quality variable is duplicated correspondingly as yc(KcI × 1). Then, critical phases are identified by analyzing the phase-specific correlation between t̃c(KcI × 1) and yc(KcI × 1). Finally, in each key phase, as shown in Figure 1b, the normalized batch data X(I × J × Kc) (where Kc is the time duration of the cth phase) are rearranged into the form X(KcI × J). Then, the Jc key variables closely related to quality are selected on the basis of the corresponding correlation analysis and cross-validation of the squared prediction errors on the validation dataset. Thus, the final representative regression parameter matrix, Bc(Jc × 1), for the cth phase is generated from the reduced dataset {Xc(KcI × Jc), yc(KcI × 1)}.
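The two data arrangements used in this procedure can be sketched as follows. This is a minimal illustration with random numbers; the helper names `autoscale`, `time_slices`, and `variablewise_phase`, the array sizes, and the phase boundaries are all hypothetical and only show the shapes involved.

```python
import numpy as np

def autoscale(M):
    """Scale each column to zero mean and unit variance across batches."""
    return (M - M.mean(axis=0)) / M.std(axis=0, ddof=1)

def time_slices(X3):
    """Cut X(I x J x K) into K normalized time slices Xk(I x J)."""
    return [autoscale(X3[:, :, k]) for k in range(X3.shape[2])]

def variablewise_phase(slices, y, phase_times):
    """Stack the slices of one phase into Xc(Kc*I x J) and repeat y as yc(Kc*I)."""
    Xc = np.vstack([slices[k] for k in phase_times])
    yc = np.tile(y, len(phase_times))
    return Xc, yc

rng = np.random.default_rng(1)
I, J, K = 6, 3, 5                       # batches, variables, time points (made up)
X3 = rng.normal(size=(I, J, K))         # three-way array X(I x J x K)
y = rng.normal(size=I)
y = (y - y.mean()) / y.std(ddof=1)      # normalized quality vector Y(I x 1)

slices = time_slices(X3)
# Hypothetical phase covering time indices 1..3, so Kc = 3.
Xc, yc = variablewise_phase(slices, y, phase_times=[1, 2, 3])
print(Xc.shape, yc.shape)
```

Stacking the slices time after time keeps the number of columns at J, so the phase model has J (or, after selection, Jc) coefficients regardless of how long the phase lasts.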
In the outlined procedure, the two types of data unfolding, batchwise and variablewise, are used for different functions. Although there is a growing debate about which unfolding is better, the present work shows that both are useful as long as researchers know what they want to achieve with the two different data arrangements. An illustration of the phase-based PLS modeling scheme is shown in Figure 1.

2.5. Online Quality Predicting. As mentioned in our previous work,7 quality variables in a batch run can be divided into two types: quality determined by only one specific phase and quality determined by more than one phase. It is easy to see that only critical-phase PLS models can give reliable and stable quality prediction, while the other phases are not directly related to the quality index, so there is no need to spend extra effort on predicting the product quality in those unimportant time regions. Moreover, since different critical phases explain different parts of the quality variation, a strategy has to be implemented to combine the cumulative effects of the multiphase PLS regressions on quality. Without loss of generality, assume that the quality variable has two critical phases, phase 1 and phase 2, whose starting times are K1s and K2s and whose retained numbers of key variables are J1 and J2, respectively. The current online quality prediction can then be obtained as

$$\hat{y}_k = \begin{cases} \dfrac{1}{k - K_{1s} + 1} \displaystyle\sum_{i=K_{1s}}^{k} \left( x_i(1 \times J_1)\, B_1(J_1 \times 1) \right), & k \in \text{the 1st critical phase} \\[2ex] w_1 \hat{y}_1 + w_2 \dfrac{1}{k - K_{2s} + 1} \displaystyle\sum_{i=K_{2s}}^{k} \left( x_i(1 \times J_2)\, B_2(J_2 \times 1) \right), & k \in \text{the 2nd critical phase} \\[2ex] \text{null}, & \text{others} \end{cases} \qquad (10)$$

where w1 and w2 are weights for critical phases 1 and 2, respectively. ŷ1 is the quality prediction at the end of critical phase 1. The phase weights wc are calculated directly from the previous correlation metric, CPc, by simple ratio algorithm without increasing the heavy calculation burden and complexity: wc = CPc/(CP1 + CP2). In this way, the explanations to quality variation throughout different phases are stacked with different significance weights. From eq 10, we can clearly see that, for every sample interval, there will be corresponding real-time quality prediction. In detail, for critical phase 1, the real-time predicted quality ŷk is indeed the average value from the starting time of phase 1 up to the current time k. Therefore, with the time evolvement, the end-of-phase quality prediction, ŷ1, will be readily obtained when the first dominant phase completes. For critical phase 2, the online prediction at each time k is actually a weighted combination of ŷ1 and the average value up to the current time k starting from the beginning of key phase 2. In the end, the quality prediction value, ŷ2, obtained at the end of critical phase 2 is naturally regarded as the final quality prediction result of the whole process.

3. Illustration and Discussion

3.1. Injection Molding Process Description. Injection molding,27-29 a key process in polymer processing, transforms polymer materials into various shapes and types of products. A typical injection molding process consists of three operation phases, injection of molten plastic into the mold, packing-holding of the material under pressure, and cooling of the plastic in the mold until the part becomes sufficiently rigid for ejection. Besides, plastication takes place in the barrel in the early cooling phase, where polymer is melted and conveyed to the barrel front by screw rotation, preparing for next cycle.27 It is a typical process for the application and verification of the proposed phase-based quality prediction algorithm, where key process variables can be collected online from measurements with a set of sensors, while quality variable is only available at the end of each batch run. The material used in this work is high-density polyethylene (HDPE). The 12 process variables selected for modeling can be listed in sequence as follows: cavity temperature (C.T.); nozzle pressure (N.P.); stroke; injection velocity (I.V.); hydraulic pressure (H.P.); plastication pressure (P.P.); cavity pressure (C.P.); screw rotation speed (S.R.S); SV1 opening (SV1); SV2 opening (SV2); barrel temperature (B.T.); mold temperature (M.T.). The quality variable is product length (mm). In the simulation illustration, we focus on the prediction of product length, the dimension quality, whose real measurements can be directly obtained by instruments. The operating conditions are set as follows: injection velocity is 8-40 mm/s; packing pressures are set to be 150, 300, and 450 bar; barrel temperatures are 180, 200, and 220 °C; and mold temperatures are set to be 15, 35, and 55 °C. In total, 33 normal batch runs are conducted under various operation conditions by the design of experiment (DOE) method. Because of different filling times in the injection


Figure 2. (a) Phase division result for injection molding process and (b) fitness of regression models.

Table 1. Critical Phase Analysis Result Using CP Metric for Four Main Phases

main phase no.    duration    CP metric    critical value (α = 0.01)    estimating result
1                 77-251      0.2557       0.5052                       noncritical
2                 252-541     0.6147       0.5052                       critical
3                 542-793     0.5349       0.5052                       critical
4                 902-1300    0.3983       0.5052                       noncritical

phase induced by different injection velocities, reference batch runs, therefore, have varying operation lengths. Using injection stroke as an indicator variable, we can fix the injection phase to a unified duration by data interpolation. Moreover, by controlling the packing-holding and cooling times at 6 and 15 s, respectively, we obtain the final data matrices X(33 × 12 × 1300) and Y(33 × 1), of which the first 25 batches are used for modeling, postbatch process analysis, and knowledge extraction, while the other 8 cycles are used for model validation.

3.2. Illustration of Correlation Analysis. The weighted time-slice loading matrices calculated from the PLS analysis are fed to the clustering algorithm. The clustering result is shown in Figure 2a. It shows that, without using any prior process knowledge, the trajectories of the injection molding process are automatically divided into six phases, among which four long phases (marked with shaded circles) agree well with the four main physical operation phases of the process, i.e., the injection, packing-holding, plastication, and cooling phases, plus a few short transitional time periods. These temporary time regions, corresponding to dynamic transient periods with unstable process states, form separate short phases that have little impact on quality prediction. Dividing a batch process in detail into "steady" and "transient" phases not only improves quality prediction performance but also enhances process analysis and understanding. Without loss of generality, each phase implies a different effect on the final product as well as different correlations between process variables and product quality. Combined with the phase clustering result, Table 1 lists the analysis result of the CP metric over the different phases. It shows that, in phases 2 and 3, the CP values are above the critical value derived from eq 5, so these two phases can be regarded as critical phases.
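The screening in Table 1 amounts to comparing each CP value with the critical value and keeping the phases that exceed it. The short sketch below reproduces that comparison with the tabulated numbers and, as an assumption, applies the ratio rule wc = CPc/(sum of CP over the critical phases) from section 2.5 to the two phases that pass; the resulting weights are not reported in the paper and are shown only for illustration.

```python
# CP values and critical value copied from Table 1.
cp_metric = {1: 0.2557, 2: 0.6147, 3: 0.5349, 4: 0.3983}
critical_value = 0.5052

# A phase is flagged as critical when its CP value exceeds the critical value.
critical_phases = {c: cp for c, cp in cp_metric.items() if cp > critical_value}

# Assumed use of the ratio rule: normalize the CP values of the critical phases.
total_cp = sum(critical_phases.values())
weights = {c: cp / total_cp for c, cp in critical_phases.items()}

for c in sorted(weights):
    print(f"phase {c}: CP = {critical_phases[c]:.4f}, weight = {weights[c]:.4f}")
```

Because the weights are normalized CP values, they sum to 1 and the phase with the stronger correlation to quality contributes more to the stacked prediction.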
To further confirm that the above analyses are reasonable, the goodness-of-fit of the phase-specific regression model is evaluated, where all the input process variables are retained,

Figure 3. CV criterion of key variable selection: (a) process variable at phase 2 and (b) process variable at phase 3.

using the multiple coefficient of determination,13-15 $R_k^2$, shown in Figure 2b. The larger $R_k^2$ is, the better the fit of the corresponding phase-specific PLS model. Normally, the regression model in a critical-to-quality phase is more accurate and reliable for the prediction of the quality variable, so it should have a larger $R_k^2$. From the plot, we can see that a larger and more stable fit appears in two phases, i.e., phases 2 and 3, especially the second phase. This gives the same identification of critical phases as Table 1, which demonstrates from another aspect that it is reasonable to determine the critical phases by correlation analysis based on the CP index. In phases 2 and 3 (i.e., the packing-holding and plastication phases), the CV values can be calculated for each process variable based on the variablewise arrangement in each phase, as illustrated in Figure 3. Here, for comparison, the phase regression coefficients including all process variables are also plotted


Figure 5. Online quality prediction result for a test batch in phase 2 and phase 3: (a) online predicted quality and (b) online predicted error rate (%).

Figure 4. Regression parameters for each process variable in critical phases: (a) process variable at phase 2 and (b) process variable at phase 3.

Table 2. Variable Selection in PLS Regression

(a) In Phase 2
no.    variables in regression model       MSE
1      2, 3, 5, 7, 9                       0.038287
2      1, 2, 3, 5, 7, 9, 11, 12            0.0059491
3      1, 2, 3, 5, 6, 7, 9, 11, 12         0.0045318
4      1, 2, 3, 4, 5, 6, 7, 9, 11, 12      0.0038673
5      all                                 0.0038756

(b) In Phase 3
no.    variables in regression model       MSE
1      2, 3, 7                             0.084991
2      2, 3, 6, 7, 11, 12                  0.035243
3      2, 3, 6, 7, 9, 11, 12               0.034587
4      1, 2, 3, 6, 7, 9, 11, 12            0.033589
5      all                                 0.037243

in Figure 4, which can numerically explain how these variables affect the product quality. Similar to the situation shown in Figure 3, pressure variables are positively related to the quality, while temperature variables are negatively correlated with it, which indicates that higher pressures and lower temperatures generally result in larger product length. Also, a longer screw displacement results in more material being packed into the cavity, which produces a longer product. Now, taking phase 3 (plastication phase) as an example, we explain the variable selection guideline further. In Figure 3b, we place the two-sided horizontal critical limits ($\pm\sqrt{1/J}$) so that we can pick out the variables beyond the limits, namely, the pressure variables (nozzle pressure and cavity pressure) and the displacement variable (stroke), which are defined as the most quality-correlated variables. Moreover, they are all positively related to the quality. Using these three variables, we carry out the initial PLS regression and calculate the MSE value by eq 7. Next, we search the remaining variables and find the one with the highest absolute CV value; the range from this value down to zero is divided into 10 equal intervals, giving 9 intermediate lower limits. For each interval, we find the variables whose absolute CV values lie above the corresponding lower limit, add them to the regression model, and compute the corresponding MSE value using the current prediction model. For comparison, Table 2b lists the cases considered in the variable selection procedure and the corresponding MSE values for the test data. It shows that the

appropriate variables used in the regression are variable nos. 1, 2, 3, 6, 7, 9, 11, and 12. If we use more or fewer process variables, we get worse results, which implies that it is worth the extra computation to find a good set of variables for the PLS regression. Similarly, in phase 2 (i.e., the packing-holding phase), the initial correlation analysis shown in Figure 3a reveals that the dominant process variables may include pressure variables (nozzle pressure, hydraulic pressure, and cavity pressure), temperature variables (cavity temperature, barrel temperature, and mold temperature), the displacement variable (stroke), and a manipulated variable (SV1). The further variable selection results presented in Table 2a indicate that all the process variables except S.R.S and SV2 should be kept in the regression, which demonstrates that variable correlation analysis and the MSE-based variable selection strategy should be closely combined to provide more reliable and concise modeling variables. Without the use of prior process knowledge, the above phase-based analysis agrees well with the real physical process. It can be useful for quality improvement because it suggests which process variables in the critical phases should be emphasized and better controlled. If we can extract enough information during a period that is very important for product quality prediction, it will be possible to weight that period more heavily and, thus, to track any quality deviations from the average over that period more quickly. Accurate online quality prediction then becomes available to reflect the quality status and to perform process adjustments in time.

3.3. Performance Illustration of Quality Prediction. According to the above process analysis and understanding, phases 2 and 3 are indicated as critical phases. The product length is closely related to both phases.
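For a quality variable with two critical phases, the stacking rule of eq 10 (a running average of the phase-model outputs, combined across phases by the weights w1 and w2) can be sketched as follows. The function name, the tuple layout used to describe a phase, and all numbers are hypothetical; in particular, the sketch assumes that the end time of each phase is known from the phase division.

```python
import numpy as np

def online_quality_prediction(k, phase1, phase2, w1, w2):
    """Evaluate eq 10 at time k.

    Each phase is a tuple (start, end, X, B): X holds one row x_i of key
    variables per sampling time from start to end, and B is the phase
    regression vector. Returns None outside the critical phases.
    """
    s1, e1, X1, B1 = phase1
    s2, e2, X2, B2 = phase2
    if s1 <= k <= e1:
        # Average of x_i B1 from the start of phase 1 up to the current time k.
        return float(np.mean(X1[: k - s1 + 1] @ B1))
    if s2 <= k <= e2:
        y1_end = float(np.mean(X1 @ B1))                   # end-of-phase-1 prediction
        running = float(np.mean(X2[: k - s2 + 1] @ B2))    # running average in phase 2
        return w1 * y1_end + w2 * running
    return None

# Made-up example: phase 1 spans times 10..12, phase 2 spans times 20..21.
phase1 = (10, 12, np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]), np.array([2.0, 4.0]))
phase2 = (20, 21, np.array([[1.0], [3.0]]), np.array([2.0]))

for k in (10, 11, 12, 15, 20, 21):
    print(k, online_quality_prediction(k, phase1, phase2, w1=0.25, w2=0.75))
```

Only measurements up to the current time k enter the calculation, which is why no estimate of future observations is needed.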
Therefore, for online prediction, it is necessary to stack the cumulative effects over the phases. The online quality prediction is performed at each sampling time of critical phases 2 and 3, as shown in Figure 5a for a test batch. The figure shows that the real-time predicted quality gradually approaches the real quality measurement as time evolves. The maximum online prediction error rates, shown in Figure 5b, are less than 0.06% in phase 2 and 0.07% in phase 3, which is a well-accepted prediction precision in industry. The accurate quality prediction can be successfully applied to demonstrate the phase-specific effects of process


Figure 6. Off-line quality prediction results for the reference batches using (a) the proposed method, (b) the phase-based sub-PLS method, and (c) the conventional MPLS method.

variables to quality, explore process running information, and evaluate product quality performance in advance. Moreover, a comparison of off-line prediction performance is conducted for the training and testing batches, respectively, using the proposed method, the phase-based sub-PLS method by Lu et al.,7 and the conventional MPLS method. Here, the multiple coefficient of determination,13-15 R², is used to quantitatively measure a model's prediction performance. Generally, R² ranges between 0 and 1. The larger the R², the better the fit of the prediction model. Usually, the R² of a calibration set is larger than that of a validation set, because calibration models can easily overfit the data. We may say that an inferential model is reliable if it works well for a new sample, so the R² values of the training and testing batches should be considered together to quantify the model's fitting capability and generalization performance. The R² values of the training batches are 0.8566, 0.8041, and 0.9969 for the three methods, respectively, where the fitting ability of the MPLS model is the best. The R² values of the testing batches are 0.8420, 0.7210, and -8.8993, respectively, where the R² for the MPLS method has gone beyond its normal range, indicating worse generalization ability for quality prediction. From the R² values for the training and testing batches taken together, it is clear that the proposed method yields better quality prediction performance. The comparison results are also shown in Figures 6 and 7, which visualize the prediction performance of the three methods. From Figure 6, it can be seen that the conventional MPLS model is superior in fitting the training batches, where the predicted quality is well fitted to the real quality measurement.
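A negative R² on test batches, as reported above for MPLS, is possible under the usual definition R² = 1 - SSE/SST, which is assumed here because the formula itself is not restated in this section. The sketch below uses made-up numbers to show that R² equals 1 for a perfect fit, 0 for a model no better than the mean, and falls below 0 when the prediction errors exceed the spread of the measurements.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SSE/SST (assumed definition)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_true - y_pred) ** 2)          # squared prediction errors
    sst = np.sum((y_true - y_true.mean()) ** 2)   # spread of the measurements
    return 1.0 - sse / sst

y_true = [1.0, 2.0, 3.0]                     # hypothetical quality measurements
print(r_squared(y_true, [1.0, 2.0, 3.0]))    # perfect prediction
print(r_squared(y_true, [2.0, 2.0, 2.0]))    # predicting the mean everywhere
print(r_squared(y_true, [3.0, 2.0, 1.0]))    # errors larger than the data spread
```

This is why a test-set R² well below zero signals that the model predicts new batches worse than simply using their mean quality.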
In contrast, Figure 7 shows that, for test batches 3 and 7, the trained MPLS regression model fails to give a sufficiently reliable quality prediction. The comparison illustrates the disadvantages of the conventional MPLS algorithm. Although MPLS is well-known as an effective data-compression and feature-extraction technique, one cannot expect it to pick out the pivotal information from a large amount of redundant candidate data

Figure 7. Off-line quality prediction results for the test batches using (a) the proposed method, (b) the phase-based sub-PLS method, and (c) the conventional MPLS method.

information. Thus, conventional MPLS may not clearly distinguish the system information from normal stochastic noise, which has little relation to the final quality. Therefore, overfitting to common-cause process variations may spoil the model's ability to capture the useful predictor information, causing it to lose adaptability to new batches with normal stochastic dynamics. Lu's sub-PLS method also compromises the reliability of the predicted quality, since the phase-based regression relationships are extracted by focusing only on each individual time slice, which may not capture sufficiently stable prediction information. By comparison, the proposed method gives a satisfactory overall prediction trend for both the training and testing batches, demonstrating the model's fitting ability and prediction adaptability.

4. Conclusions

A batch quality prediction method has been proposed for the improvement of quality estimation performance in multiphase batch processes. In each phase, based on detailed correlation analysis, pivotal phase-specific process information related to product quality can be extracted in advance without any requirement of prior knowledge. This allows us to identify the critical-to-quality phases and, combined with a proper variable selection strategy, to determine the dominant process variables. The variablewise unfolding alleviates the harmful effect of variable redundancy and overcomes the dynamic autocorrelation of process variables along the time series that arises in the batchwise unfolded form. PLS regression models are then developed for each phase according to the process analysis, which not only simplifies the inferential model but also provides a more stable predictive relationship. The proposed scheme can give a valid quality prediction earlier, without online estimation of unavailable future process variables.
Moreover, it can help one to better understand process characteristics and find out the critical factors to the concerned


quality. All the above provide the potential for the quality improvement. The application to the injection molding process illustrates that the proposed method is effective. Especially if the proposed method is applied to batch processes bearing superfluous measurement variables, its superiority will be more obvious. The conclusion suggests possibilities for the continuation of this work.

Acknowledgment

The authors would like to acknowledge Professor Furong Gao's group at Hong Kong University of Science and Technology (HKUST) for providing the injection molding data and the anonymous reviewers for their helpful comments. The project was supported in part by the National Natural Science Foundation of China (No. 60374003 and No. 60774068) and Project 973 (No. 2002CB312200), China.

Literature Cited

(1) Jackson, J. E. A User's Guide to Principal Components; Wiley: New York, 1991.
(2) Geladi, P.; Kowalski, B. R. Partial least squares regression: A tutorial. Anal. Chim. Acta 1986, 185, 1.
(3) Nomikos, P.; MacGregor, J. F. Monitoring batch processes using multiway principal component analysis. AIChE J. 1994, 40, 1361.
(4) Nomikos, P.; MacGregor, J. F. Multi-way partial least squares in monitoring batch processes. Chemom. Intell. Lab. Syst. 1995, 30, 97.
(5) Duchesne, C.; MacGregor, J. F. Multivariate analysis and optimization of process variable trajectories for batch processes. Chemom. Intell. Lab. Syst. 2000, 51, 125.
(6) Chu, Y.-H.; Lee, Y.-H.; Han, C. Improved Quality Estimation and Knowledge Extraction in a Batch Process by Bootstrapping-Based Generalized Variable Selection. Ind. Eng. Chem. Res. 2004, 43, 2680.
(7) Lu, N. Y.; Gao, F. R. Stage-Based Process Analysis and Quality Prediction for Batch Processes. Ind. Eng. Chem. Res. 2005, 44, 3547.
(8) Lu, N. Y.; Gao, F. R. Stage-Based Online Quality Control for Batch Processes. Ind. Eng. Chem. Res. 2006, 45, 2272.
(9) Zhao, S. J.; Zhang, J.; Xu, Y. M. Monitoring of Processes with Multiple Operating Modes through Multiple Principle Component Analysis Models. Ind. Eng. Chem. Res. 2004, 43, 7025.
(10) Zhao, S. J.; Zhang, J.; Xu, Y. M. Performance monitoring of processes with multiple operating modes through multiple PLS models. J. Process Control 2006, 16, 763.
(11) Doan, X.-T.; Srinivasan, R.; Bapat, P. M.; Wangikar, P. P. Detection of phase shifts in batch fermentation via statistical analysis of the online measurements: A case study with rifamycin B fermentation. J. Biotechnol. 2007, 132, 156-166.
(12) Lu, N. Y.; Gao, F. R.; Wang, F. L. A sub-PCA modeling and online monitoring strategy for batch processes. AIChE J. 2004, 50, 255.

(13) Wang, H. Partial Least-Squares Regression-Method and Applications; National Defence Industry Press: Beijing, China, 1999.
(14) Kleinbaum, D. G.; Kupper, L. L.; Muller, K. E.; Nizam, A. Applied Regression Analysis and Other Multivariable Methods, third ed.; China Machine Press: Beijing, China, 2003.
(15) Kutner, M. H.; Nachtsheim, C. J.; Neter, J. Applied Linear Regression Models, fourth ed.; Higher Education Press: Beijing, China, 2005.
(16) Walmsley, A. D. Improved variable selection procedure for multivariate linear regression. Anal. Chim. Acta 1997, 354, 225.
(17) Pudil, P.; Novovičová, J.; Kittler, J. Floating search methods in feature selection. Pattern Recognit. Lett. 1994, 15, 1119.
(18) Höskuldsson, A. Variable and subset selection in PLS regression. Chemom. Intell. Lab. Syst. 2001, 55, 23.
(19) Eklöv, T.; Lundström, I. Selection of variables for interpreting multivariate gas sensor data. Anal. Chim. Acta 1999, 381, 221.
(20) Wold, S.; Kettaneh, N.; Friden, H.; Holmberg, A. Modelling and diagnosis of batch processes and analogous kinetic experiments. Chemom. Intell. Lab. Syst. 1998, 44, 331.
(21) Wise, B. M. A comparison of multiway principal components analysis, tri-linear decomposition and parallel factor analysis for fault detection in a semiconductor etch process. Presented at International Chemometrics Research Meeting, ICRM98, Veldhoven, The Netherlands, 1998.
(22) Westerhuis, J. A.; Kourti, T.; MacGregor, J. F. Comparing alternative approaches for multivariate statistical analysis of batch process data. J. Chemom. 1999, 13, 397.
(23) van Sprang, E. N. M.; Ramaker, H.-J. Critical evaluation of approaches for on-line batch process monitoring. Chem. Eng. Sci. 2002, 57, 3979.
(24) Lee, J.-M.; Yoo, C. K.; Lee, I.-B. Enhanced process monitoring of fed-batch penicillin cultivation using time-varying and multivariate statistical analysis. J. Biotechnol. 2004, 110, 119.
(25) Nomikos, P.; MacGregor, J. F. Multivariate SPC Charts for Monitoring Batch Processes. Technometrics 1995, 37, 41.
(26) Cho, H.-W.; Kim, K.-J. A method for predicting future observations in the monitoring of a batch process. J. Qual. Technol. 2003, 35, 59.
(27) Yang, Y.; Gao, F. Cycle-to-cycle and within-cycle adaptive control of nozzle pressures during packing-holding for thermoplastic injection molding. Polym. Eng. Sci. 1999, 39, 2042.
(28) Yang, Y.; Gao, F. Adaptive control of injection velocity of thermoplastic injection molding. Control Eng. Pract. 2000, 8, 1285.
(29) Zhao, C. H.; Wang, F. L.; Gao, F. R.; Lu, N. Y.; Jia, M. X. Adaptive Monitoring Method for Batch Processes Based on Phase Dissimilarity Updating with Limited Modeling Data. Ind. Eng. Chem. Res. 2007, 46, 4943.

Received for review May 17, 2007. Revised manuscript received November 8, 2007. Accepted November 16, 2007. IE0707063