Research Note pubs.acs.org/IECR
Self-Training Statistical Quality Prediction of Batch Processes with Limited Quality Data
Zhiqiang Ge,*,† Zhihuan Song,† and Furong Gao‡,§
† State Key Laboratory of Industrial Control Technology, Department of Control Science and Engineering, Zhejiang University, Hangzhou, China
‡ Department of Chemical and Biomolecular Engineering, The Hong Kong University of Science and Technology, Hong Kong, China
§ Center for Polymer Processing and Systems, Fok Ying Tung Graduate School, Hong Kong University of Science and Technology, Hong Kong, China
ABSTRACT: Because of high cost or long time delays, quality data are difficult to obtain in many batch processes, whereas the ordinary process variables are measured online and recorded frequently. This paper builds a statistical quality prediction model for batch processes under limited quality data. In particular, the self-training strategy is introduced and combined with the partial least-squares (PLS) regression model. For multiphase batch processes, a phase-based self-training PLS model is developed for quality prediction in each phase of the process. The feasibility and effectiveness of the developed method are evaluated on an industrial injection molding process.
1. INTRODUCTION
In modern industry, batch processes play an important role in producing low-volume, high-value-added products. Competition and requirements for high-quality batch products have received particular research attention in recent years. However, online quality control of the batch process is difficult, mainly because online measurements of the quality variable are lacking. Therefore, significant efforts have been made to develop methods for quality prediction. Compared to the first-principle model-based method, the data-based method has been much more widely used, among which the multivariable statistical analysis method may be the most popular. While the basic partial least-squares (PLS) model has been widely used for soft sensing in continuous processes, its multiway counterpart has been introduced for quality prediction and monitoring of batch processes.1 In the past years, various related multivariable statistical modeling approaches have been developed for the same purpose in the batch process.2−9 Later, with the exploration of the multiphase characteristics of the batch process, many phase-based statistical methods have been proposed for quality prediction.10−14

To develop a quality prediction model, datasets of both process and quality variables are required. While the data of ordinary process variables are recorded daily in batch processes, quality data are very difficult to acquire, since they are often obtained through expensive instruments, laboratory analyses, and other additional efforts. Therefore, it is very costly and sometimes even impossible to get enough quality data for modeling. As a result, we may obtain only a limited number of quality data, while having a large amount of process variable data. Conventionally, the prediction model is built on training data samples that consist of both process and quality variables. However, because of the limited number of quality data, the modeling performance of the batch process may not be guaranteed, which will also influence the result of the quality prediction.

Although the number of quality data is limited, we have plenty of data samples for process variables. How to use these additional process variable data to improve the modeling performance is of particular interest, and is the focus of the present work. In this case, we may only need a small portion of quality data for prediction modeling, which can save a lot of resources, time, and effort. In some extreme situations, where it is impossible to obtain enough quality data in time (for example, when the batch process is changed to produce a new product grade), we can only build the model under limited quality data. Here, we denote a data sample that consists of both process and quality variables as a labeled data sample, and one that contains only the process variables as an unlabeled data sample. Therefore, our task is to improve the modeling performance by incorporating the additional unlabeled data samples. Modeling under both labeled and unlabeled data is termed “semisupervised learning” in the machine learning area, and has received much attention in recent years.15−20 Traditional semisupervised learning methods include self-training approaches, probabilistic generative model-based methods, cotraining methods, graph-based methods, etc.15,16 In the present work, we focus only on the self-training based method. The main advantages of the self-training method are its simplicity and the fact that it is a wrapper method, which means that the inner structure of the model is not changed during the iterative self-training process. In the self-training method, the choice of the model structure is left completely open: it could be a very simple algorithm, such as the nearest-neighbor method, or a very complicated model, such as a neural network or a kernel-learning method.

Received: March 7, 2012
Revised: October 14, 2012
Accepted: November 21, 2012
© XXXX American Chemical Society
dx.doi.org/10.1021/ie300616s | Ind. Eng. Chem. Res. XXXX, XXX, XXX−XXX
Industrial & Engineering Chemistry Research
In this paper, the self-training algorithm is combined with the widely used PLS model for quality prediction of multiphase batch processes. In each phase of the batch process, starting from a limited number of labeled training data samples, the similarity between each unlabeled data sample and the labeled training data samples is calculated first. Second, the nearest-neighbor method is employed to label the unlabeled data sample. Then, the labeled dataset is updated with the newly labeled data sample from the nearest-neighbor method. Finally, the PLS model is formulated after enough labeled data samples have been obtained. The remainder of this paper is organized as follows. In section 2, a detailed description of the phase-based self-training PLS method is provided, followed by an industrial case study of the injection molding process in section 3. Finally, conclusions are made.

2. SELF-TRAINING PLS MODEL FOR MULTIPHASE BATCH PROCESSES
In many batch processes, the entire operation region can be divided into several different phases, in which different data characteristics may be exhibited. To divide the batch process into different phases, many phase division methods have been developed, such as expert knowledge-based methods, process analysis techniques, data-based methods, etc.14 In the present paper, it is assumed that the batch process has already been divided into different phases. Consider a batch process dataset X(I × J × K), where I is the number of batches, J the number of process variables, and K the duration of each batch. Along the batch direction, the three-way data matrix can be unfolded into a two-dimensional data matrix X(I × JK). Suppose the entire batch process has been divided into S phases; X(I × JK) can then be partitioned into

$$ X(I \times JK) = [\, X_1(I \times JK_1) \;\; X_2(I \times JK_2) \;\; \cdots \;\; X_S(I \times JK_S) \,] \qquad (1) $$

where K1, K2, ..., KS are the numbers of time slices in the different phases.

2.1. Phase-Based Self-Training PLS Model (STPLS) Development. Suppose we have obtained only a small portion of the quality data, given as Y(Iy × Jy), Iy ≪ I. The aim of the self-training model is to build a calibration model between the process dataset X(I × JK) and the limited quality dataset Y(Iy × Jy). Generally, the main idea of the self-training approach is that the learning process uses its own predictions to teach itself. Its major advantage is its simplicity: the choice of the learner for the prediction model is left completely open, so any simple or complicated modeling method can be incorporated into the self-training framework. The modeling procedure itself is also quite simple and remains the same for different data-based models. A common modeling process, here incorporated with the PLS model, is illustrated as follows. By representing the initial labeled and unlabeled datasets as $\{X_L \in R^{I_y \times J},\ Y_L \in R^{I_y \times J_y}\}$ and $\{X_U \in R^{(I-I_y) \times J}\}$, the detailed derivation of the self-training PLS model for multiphase batch processes is given as follows:

Input: Labeled dataset {XL, YL}, unlabeled dataset {XU}
Repeat the following steps 1−3 until the unlabeled dataset is empty.

Step 1: For each unlabeled data sample $x_u \in R^J$, calculate the similarity value corresponding to each data sample in the labeled dataset, and rearrange them in descending order:

$$ \mathrm{Sim}_k(u,i) = e^{-\| x_u^k - x_{L,i}^k \|^2}, \qquad \mathrm{RSim}_k = \mathrm{descend}\{\mathrm{Sim}_k\}, \qquad k = 1, 2, \ldots, K \qquad (2) $$

where $x_{L,i}^k$ is the ith data sample in the labeled dataset, which corresponds to the kth sampling time point.

Step 2: Calculate the value of the quality variable yu for the current unlabeled data sample by using the Q-nearest-neighbor-based method, where Q is the number of labeled training data samples used in the nearest-neighbor method, given as follows:

$$ y_u^k = \frac{\sum_{q=1}^{Q} \left[ \mathrm{RSim}_k(q)\, y^{L}_{\mathrm{RSim}_k(q)} \right]}{\sum_{q=1}^{Q} \mathrm{RSim}_k(q)}, \qquad k = 1, 2, \ldots, K \qquad (3) $$

where $\mathrm{RSim}_k(q)$ is the qth element of the rearranged similarity vector, and $y^{L}_{\mathrm{RSim}_k(q)}$ is the quality value of the labeled data sample to which it corresponds.

Step 3: After the quality value of the unlabeled data sample has been obtained, add the newly labeled data sample {xu, yu} to the labeled dataset {XL, YL}, and update the labeled dataset for the next training step.

Step 4: Based on the final labeled dataset {XL, YL}, build a PLS model for each time slice of the batch process, given as follows (ref 1):

$$ \begin{aligned} X_L^k &= T_L^k P_L^{kT} + E_L^k \\ Y_L &= T_L^k Q_L^{kT} + F_L^k \\ R_L^k &= W_L^k \left( P_L^{kT} W_L^k \right)^{-1} Q_L^{kT} \end{aligned} \qquad (4) $$

where k = 1, 2, ..., K, $P_L^k$ and $Q_L^k$ are the loading matrices, $T_L^k$ is the principal component (score) matrix, $E_L^k$ and $F_L^k$ are the residual matrices, $W_L^k$ is the weighting matrix of each PLS model, and $R_L^k$ is the regression matrix of the corresponding PLS model. Note that the same quality dataset YL is used for all of the time slices during the batch, because one batch yields only one quality value. The prediction aim at each time slice is to reduce the error between the predicted value and the one obtained at the end of the batch. Therefore, it is reasonable to use the same quality dataset for modeling at different time slices during the entire batch.

Step 5: After the PLS model has been built for each time slice, a phase-representative PLS model can be constructed in each phase, the regression matrix of which is determined as follows:

$$ R_s^* = \frac{1}{K_s} \sum_{k = \sum_{i=1}^{s-1} K_i + 1}^{\sum_{i=1}^{s} K_i} R_L^k \qquad (5) $$

where s = 1, 2, ..., S is the phase number of the batch process.

2.2. Online Quality Prediction and Performance Evaluation. After the STPLS model has been constructed for each phase, online quality prediction can be made for new batches at each time point. Given a new batch with its data information obtained at time point k, denoted as $x_{new,k}$, the online prediction value of the quality variable can be calculated as follows:

$$ \hat{y}_{new,k} = R_{sc}^{*T} x_{new,k} \qquad (6) $$
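The labeling loop of Steps 1−3 in section 2.1 (eqs 2 and 3) can be illustrated by a minimal sketch for a single time slice and a single quality variable. The function names `similarity`, `label_sample`, and `self_train` are hypothetical and do not come from the paper. The sketch assumes that the process variables have been scaled beforehand, since exp(−‖·‖²) underflows quickly for unscaled data, and the fallback to a plain mean when all similarities underflow to zero is likewise an assumption of this sketch.

```python
import math


def similarity(x_u, x_l):
    """Eq 2: Sim = exp(-||x_u - x_l||^2) for two samples of one time slice."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x_u, x_l)))


def label_sample(x_u, X_L, y_L, Q=3):
    """Eq 3: similarity-weighted mean of the Q most similar labeled samples."""
    if not X_L:
        raise ValueError("the labeled dataset must not be empty")
    # Step 1: similarity to every labeled sample, sorted in descending order.
    ranked = sorted(
        ((similarity(x_u, x_l), y) for x_l, y in zip(X_L, y_L)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    top = ranked[:Q]
    total = sum(s for s, _ in top)
    if total == 0.0:
        # Assumption of this sketch: if every similarity underflows,
        # fall back to the unweighted mean of the Q neighbors.
        return sum(y for _, y in top) / len(top)
    # Step 2: weighted average of the neighbors' quality values.
    return sum(s * y for s, y in top) / total


def self_train(X_L, y_L, X_U, Q=3):
    """Steps 1-3: label each unlabeled sample and add it to the labeled set."""
    X_L, y_L = list(X_L), list(y_L)  # copy so the inputs stay unchanged
    for x_u in X_U:
        y_u = label_sample(x_u, X_L, y_L, Q)
        # Step 3: the newly labeled sample can serve as a neighbor later on.
        X_L.append(x_u)
        y_L.append(y_u)
    return X_L, y_L


if __name__ == "__main__":
    X_lab = [[0.0, 0.0], [1.0, 0.0]]
    y_lab = [1.0, 3.0]
    X_unl = [[0.5, 0.0]]
    print(self_train(X_lab, y_lab, X_unl, Q=2))
```

The enlarged labeled set returned by `self_train` would then be passed to the PLS fit of Step 4. How the slice-wise labels of eq 3 are combined into one label per batch is not spelled out above, which is why the sketch is restricted to one time slice.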
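The phase-representative regression of eq 5 and the online prediction of eq 6 can be sketched in the same spirit. The sketch assumes a single quality variable, so that each slice-wise regression matrix reduces to a vector of length J. The names `phase_regression`, `current_phase`, and `predict_online`, as well as the 0-based indexing of time slices and phases, are illustrative choices and not part of the original method description.

```python
def phase_regression(R_slices, phase_lengths):
    """Eq 5: average the slice-wise regression vectors within each phase.

    R_slices[k] is the regression vector of time slice k (length J);
    phase_lengths holds K_1, ..., K_S.
    """
    if sum(phase_lengths) != len(R_slices):
        raise ValueError("phase lengths must add up to the number of slices")
    phase_R, start = [], 0
    for K_s in phase_lengths:
        block = R_slices[start:start + K_s]
        # Element-wise mean over the K_s slices of this phase.
        phase_R.append([sum(column) / K_s for column in zip(*block)])
        start += K_s
    return phase_R


def current_phase(k, phase_lengths):
    """Return the 0-based phase index that contains the 0-based time slice k."""
    end = 0
    for s, K_s in enumerate(phase_lengths):
        end += K_s
        if k < end:
            return s
    raise IndexError("time slice lies outside the batch duration")


def predict_online(x_new_k, k, phase_R, phase_lengths):
    """Eq 6: inner product of the current phase's regression vector and x."""
    r = phase_R[current_phase(k, phase_lengths)]
    return sum(r_j * x_j for r_j, x_j in zip(r, x_new_k))


if __name__ == "__main__":
    # Three time slices, two process variables, two phases of lengths 2 and 1.
    R = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
    lengths = [2, 1]
    R_star = phase_regression(R, lengths)
    print(R_star)
    print(predict_online([1.0, 1.0], 1, R_star, lengths))
```

Averaging within a phase means that one regression vector serves all time slices of that phase, so only S vectors, rather than K, have to be stored for online use.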
In eq 6, the subscript “sc” denotes the current phase of the new data sample $x_{new,k}$, $\hat{y}_{new,k}$ is the estimated value of the quality variable, and k denotes the kth time point. Suppose the testing dataset contains Ite batches; the root-mean-square error (RMSE) criterion can then be used to evaluate the quality prediction performance in each sampling interval of the batch process. It is defined as follows:

$$ \mathrm{RMSE}(k) = \sqrt{ \frac{\sum_{j=1}^{I_{te}} \| y_j - \hat{y}_j^k \|^2}{I_{te}} } \qquad (7) $$

where k = 1, 2, ..., K, j = 1, 2, ..., Ite, $y_j$ is the measured quality of testing batch j, and $\hat{y}_j^k$ is the predicted quality of testing batch j at time interval k.

3. INDUSTRIAL APPLICATION STUDY
In this section, the injection molding process, which is a typical multiphase batch process, is used to evaluate the performance of the self-training PLS model. In this process, the weight of the final product is selected as the quality variable. For prediction of this quality variable, some online measured process variables are utilized, such as temperature, pressure, and the screw velocity. A schematic flowchart of the injection molding process is shown in Figure 1, and all of the process variables are tabulated in Table 1. A dataset that contains 150 batches has been collected from the injection molding process, among which 100 batches are used for model training in the present study and the remaining 50 batches are used for testing. The duration of each batch is 635 sampling points, with a sampling interval of 0.025 s. Based on a previous phase division method, this process can be divided into seven different phases.

Figure 1. Simplified schematic of the injection molding machine.11

Table 1. Selected Variables for Quality Prediction
No.   variable               unit
1     valve 1                %
2     valve 2                %
3     screw stroke           mm
4     screw velocity         mm/s
5     ejector stroke         mm
6     mold stroke            mm
7     mold velocity          mm/s
8     injection pressure     bar
9     barrel temperature 1   °C
10    barrel temperature 2   °C
11    temperature 1          °C

For each batch, the quality variable is available only at the end of the batch. Therefore, for performance evaluation during the entire batch time, it is assumed that the corresponding quality value in each time slice is equal to that obtained at the end of the batch. To simulate the case of limited quality data, only a small portion of batches is randomly selected from the training dataset. Initially, 20 training batches, which consist of both process and quality variables, are used as the basic labeled dataset. The quality variables of the remaining 80 training batches are artificially removed to simulate the unlabeled batches. Based on the self-training PLS modeling procedures given in section 2, a representative STPLS model can be constructed in each phase of this process. Here, the number of nearest neighbors is selected as three. For comparison, the conventional representative PLS model is also developed in each phase. To evaluate the prediction performance of these two methods, 50 testing batches are used. The RMSE values of the testing batches for the two prediction models at each sampling interval of the process are shown together in Figure 2. It can be seen that the phase-STPLS model performs much better than the traditional phase-PLS model during the entire batch time. The comparison results of the two methods at the end of each phase are tabulated in Table 2, where the superiority of the self-training model can also be observed. In addition, detailed quality prediction results of both the STPLS and PLS methods are plotted together in Figure 3, with the corresponding measured value of the quality variable. To examine whether the number of nearest neighbors is important for modeling and prediction, different values are used for self-training, and their prediction performances are evaluated on the same testing dataset. The values of RMSE at the end of each phase under different numbers of nearest neighbors are given in Figure 4.

Based on these results, it can be inferred that the number of nearest neighbors is not very important for self-training modeling in this example, since the RMSE values of the testing batches are quite similar under various nearest-neighbor (NN) numbers. This is because all of the training and testing batches have been obtained under the same operation conditions. In this case, the data characteristics of the labeled and unlabeled batches are similar to each other; therefore, the self-training process does not change significantly if different NN numbers are used. However, if the modeling batches are acquired from several different batch operation conditions, the data characteristics of the labeled and unlabeled batches may be quite different. In this situation, the NN number may be important for self-training model development. If a small NN number is selected, the quality variable of the unlabeled batch will be determined within a small region of the operation condition. In contrast, if a large NN number is used, the labeled training batches that are utilized to calculate the quality variable of the unlabeled batch may come from several different operation conditions, which will deteriorate the labeling performance. Hence, through the iterative self-training process, the performance of the final regression model may be severely distorted by the unreliable labeling of the unlabeled batches. Next, the quality prediction performance of the STPLS model is evaluated under different numbers of labeled training batches. Detailed end-of-batch RMSE values of the STPLS and PLS methods under labeled numbers between 10 and 100 are shown in Figure 5. Generally, the quality prediction performances of both methods are improved with the
Figure 2. Prediction results of the testing batches by STPLS and PLS.
Table 2. End-of-Phase Prediction Results of Two Methods
end-of-phases   #1       #2       #3       #4       #5       #6       #7
STPLS           0.0455   0.0444   0.0441   0.0439   0.0446   0.0446   0.0434
PLS             0.0528   0.0502   0.0473   0.0494   0.0527   0.0508   0.0459
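The entries of Table 2 are RMSE values in the sense of eq 7, taken at the last time slice of each phase. A minimal sketch of how such values could be computed for a single quality variable is given below. The function name `rmse_per_slice` and the nested-list data layout are assumptions made for illustration. As stated in section 3, the end-of-batch quality value serves as the reference at every time slice.

```python
import math


def rmse_per_slice(y_true, y_pred):
    """Eq 7 for a single quality variable.

    y_true[j]    : measured end-of-batch quality of testing batch j
    y_pred[j][k] : predicted quality of testing batch j at time slice k
    Returns one RMSE value per time slice.
    """
    n_batches = len(y_true)
    if n_batches == 0 or len(y_pred) != n_batches:
        raise ValueError("y_true and y_pred must cover the same batches")
    n_slices = len(y_pred[0])
    rmse = []
    for k in range(n_slices):
        squared_error = sum(
            (y_true[j] - y_pred[j][k]) ** 2 for j in range(n_batches)
        )
        rmse.append(math.sqrt(squared_error / n_batches))
    return rmse


if __name__ == "__main__":
    # Two testing batches and two time slices with made-up numbers.
    measured = [1.0, 2.0]
    predicted = [[1.0, 2.0], [2.0, 4.0]]
    print(rmse_per_slice(measured, predicted))
```

Picking the entries of the returned list at the last slice of each phase would give one row of a table such as Table 2.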
Figure 3. Quality prediction results of STPLS and PLS for a testing batch.
Figure 4. End-of-phase prediction results of STPLS under different NN numbers.
Figure 5. End-of-batch quality prediction results of two methods under different labeled numbers.

(3) Chen, J.; Liu, K. C. On-line batch process monitoring using dynamic PCA and dynamic PLS models. Chem. Eng. Sci. 2003, 57, 63−75.
(4) Facco, P.; Olivi, M.; Rebuscini, C.; Bezzo, F.; Barolo, M. Multivariate statistical estimation of product quality in the industrial batch production of a resin. In Proceedings of DYCOPS 2007, 8th IFAC Symposium on Dynamics and Control of Process Systems, Cancun, Mexico, June 6−8, 2007; Foss, B., Alvarez, J., Eds.; 2007; Vol. 2, pp 93−98.
(5) Doan, X. T.; Srinivasan, R. Online monitoring of multi-phase batch processes using phase-based multivariate statistical process control. Comput. Chem. Eng. 2008, 32, 230−243.
(6) Yu, J.; Qin, S. J. Multiway Gaussian mixture model based multiphase batch process monitoring. Ind. Eng. Chem. Res. 2009, 48, 8585−8594.
(7) Chen, T.; Zhang, J. On-line multivariate statistic monitoring of batch processes using Gaussian mixture model. Comput. Chem. Eng. 2010, 34, 500−507.
(8) Boonkhao, B.; Li, R. F.; Wang, X. Z.; Tweedie, R. J.; Primrose, K. Making use of process tomography data for multivariate statistical process control. AIChE J. 2011, 57, 2360−2368.
(9) He, Q. P.; Wang, J. Statistic pattern analysis: A new process monitoring framework and its application to semiconductor batch processes. AIChE J. 2011, 57, 107−121.
(10) Undey, C.; Cinar, A. Statistical monitoring of multistage, multiphase batch processes. IEEE Control Syst. Mag. 2002, 22, 53−63.
(11) Lu, N. Y.; Gao, F. R. Stage-based process analysis and quality prediction for batch processes. Ind. Eng. Chem. Res. 2005, 44, 3547−3555.
(12) Muthuswamy, K.; Srinivasan, R. Phase-based supervisory control for fermentation process development. J. Process Control 2003, 13, 367−382.
(13) Camacho, J.; Pico, J.; Ferrer, A. Multi-phase analysis framework for handling batch process data. J. Chemom. 2008, 22, 632−643.
(14) Yao, Y.; Gao, F. R. A survey on multistage/multiphase statistical modeling methods for batch processes. Ann. Rev. Control 2009, 33, 172−183.
(15) Zhu, X. Semi-supervised Learning in Literature Survey, Technical Report 1530; Computer Sciences, University of Wisconsin−Madison, Madison, WI, 2005.
(16) Chapelle, O.; Zien, A.; Scholkopf, B. Semi-Supervised Learning; MIT Press: Cambridge, MA, 2006.
(17) Song, Y. Q.; Nie, F. P.; Zhang, C. S.; Xiang, S. M. A unified framework for semi-supervised dimensionality reduction. Pattern Recogn. 2008, 41, 2789−2799.
(18) Mallapragada, P. K.; Jin, R.; Jain, A. K.; Liu, Y. SemiBoost: Boosting for semi-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 2000−2014.
increased number of labeled batches. This is expected, because more and more reliable information is incorporated for modeling. It can be seen that the quality prediction performances of the two methods are comparable after more than 30 batches are labeled. However, when the labeled batches are limited, which is