Ind. Eng. Chem. Res. 2004, 43, 2680-2690
Improved Quality Estimation and Knowledge Extraction in a Batch Process by Bootstrapping-Based Generalized Variable Selection

Young-Hwan Chu,† Young-Hak Lee,† and Chonghun Han*,‡

Department of Chemical Engineering, Pohang University of Science and Technology, Hyoja-dong, Nam-gu, Pohang, Kyungbuk 790-784, Korea, and School of Chemical Engineering and Institute of Chemical Processes, Seoul National University, Shillim-dong, Kwanak-gu, Seoul 151-742, Korea
This paper proposes a novel variable selection method for the improvement of the quality estimation performance and knowledge extraction in a batch process. The quality estimation method is an effective alternative to the costly and time-consuming quality measurement. However, degradation of the prediction performance of the quality estimation model caused by inclusion of insignificant variables is a serious problem. The preprocessing of variable selection is thus important to improve the prediction accuracy by removing the variables uncorrelated with quality variables. The variable selection technique can also be used as a knowledge-extraction tool. The technique allows us to identify the process characteristics related to product quality. The problem of inaccurate variable selection results caused by a large number of variables and a limited number of samples of batch process data is solved by the bootstrapping technique. Despite the increased computational load, the combination of bootstrapping with variable selection enhances the reliability of the variable selection result. An industrial poly(vinyl chloride) polymerization process is used as a case study to show the improved performance of the proposed method compared with multiway partial least squares (MPLS). The proposed method shows better accuracy than MPLS in both detecting the quality-related variables and estimating the real values of quality variables.

1. Introduction

Growing global competition has forced chemical industries to produce products of higher quality. This atmosphere has resulted in a boom of quality-related activities such as the 6σ campaign and a transition of production systems from continuous processes to batch processes. The production of better quality products relies on a reliable and prompt evaluation of the product quality. Quality is usually measured in a separate analysis laboratory during or after the end of operation.
In general, quality measurement is expensive, time-consuming, and cumbersome. There is a need for a new quality evaluation method that allows us to measure quality variables in a simpler and faster, yet still accurate, way. For this purpose, the quality estimation approach is an effective alternative to the direct measurement of quality variables. In this approach, quality estimation models are built between process and quality variables based on the fact that process variables can explain the change of quality variables. Although the models can be built either empirically or mechanistically, the extreme complexity of chemical processes has led to a preference for empirical models in many cases. The empirical method has advantages in the modeling procedure because we can easily obtain the models by regression of historical data without prior process knowledge. In addition to its simplicity, the method ensures acceptable estimation accuracy provided that reliable data can be gathered. Once the quality estimation models are built, the values of quality variables can be estimated from the output values of the models. This empirical quality estimation method has been frequently applied to composition estimation in a distillation column and showed successful results.1-3 In addition to the distillation columns, there are many cases where the quality estimation was successfully applied.4-6

* To whom correspondence should be addressed. Tel.: +822-880-1887. Fax: +82-2-888-7295. E-mail: [email protected].
† Pohang University of Science and Technology.
‡ Seoul National University.
10.1021/ie0341552 CCC: $27.50 © 2004 American Chemical Society. Published on Web 04/13/2004
Ind. Eng. Chem. Res., Vol. 43, No. 11, 2004 2681

This paper focuses on the improvement of the prediction accuracy of the quality estimation model and statistical knowledge extraction in batch processes based on a novel variable selection method. When batch process data that have a three-way structure are unfolded to analyze batch-to-batch variation and the autocorrelation effect, the number of unfolded variables dramatically increases. However, because the final product quality is mostly determined at some core stage during a batch run, only a small portion of the unfolded variables are correlated with the quality variables. Previous works have shown that time-specific effects of process variables on the final product quality can be identified by designed experiments on manipulated trajectories.7,8 Therefore, these critical-to-quality variables can be systematically found, and the quality estimation models should include only those variables as predictors. Although data compression techniques such as multiway partial least squares (MPLS)6,9 can handle the redundancy problem of batch process data, acceptable prediction accuracy cannot be expected without removal of nuisance variables. Because those variables have no correlation with the quality variables, they cannot be good predictor variables in the partial least squares (PLS) model. They only cause deterioration of the data compression performance. If those variables are removed prior to modeling, the prediction performance of the quality estimation model can be significantly improved because of the increased causal relationship between the predictor and quality variables. Furthermore, the selected variables provide process knowledge about key process variables and their critical stages in determining the values of quality variables. The information is extremely helpful for understanding the dynamic characteristics of a batch process and establishing an operating guideline. Figure 1 illustrates with a simple example how the dynamic characteristics of a batch process can be identified from the variable selection. Note that the importance of variable selection has been reported in numerous papers.10-12

Figure 1. Identification of the dynamic characteristics of a batch process based on a variable selection.

Generally, batch process data are used after being unfolded and thus have a “fat” form, where the number of variables
is significantly larger than that of samples. In this case, the variable selection is especially important to use only essential information in the quality estimation models. However, a lack of samples causes a serious problem in the variable selection. First of all, because sufficient variance information on the behavior of the batch process cannot be reflected in the dataset, the variable selection result may be biased. This means that the result can be different if we use other kinds of datasets. Therefore, a variable selection strategy that derives the correct result from a dataset with a limited number of samples needs to be developed. This paper proposes the bootstrapping-based variable selection method for this purpose. In this method, selection frequencies resulting
Figure 2. Principle of the bootstrapping.
from repetitive applications of a variable selection algorithm to various bootstrap datasets are used to extract true quality-related variables from unfolded batch data. The proposed method is applied to an industrial polymerization process producing poly(vinyl chloride) (PVC) to evaluate its performance compared with the existing MPLS method. This paper is organized as follows. A detailed explanation on the proposed bootstrapping-based generalized variable selection (BBGVS) method is given in section 2. In this section, the concept and effectiveness of the bootstrapping, the devised variable selection algorithm, and features of the proposed method are described in a sequential manner. In section 3, results obtained by applying the proposed method and MPLS to the data of an industrial PVC polymerization process are shown. Their performances are compared from the viewpoint of identification accuracy of process characteristics and prediction accuracy of the quality estimation models. The last section provides conclusions including limitations and prospects of the proposed method. 2. BBGVS Method 2.1. Bootstrapping: Derivation of True Information Based on Random Resampling. When the data size is small, the parameter estimation using statistical methods may lead to unreasonable results far from reality. This problem is caused by a lack of true variance information. For this case, bootstrapping can be a good solution. Bootstrapping is a generalization method to obtain true parameter values by individually applying a statistical method to a number of bootstrap datasets and then examining the distribution of the results. If the distribution shows a high frequency at a specific value, the value is probably the true parameter value. The generalization feature of the bootstrapping, which considers various kinds of datasets in parameter estimation, allows us to overcome the problem of serious error in parameter estimation caused by limitation in data. 
The principle of the bootstrapping is briefly summarized in Figure 2. For more information on the bootstrapping, refer to the work by Efron and Tibshirani.13 The bootstrapping is a computationally intensive method because it requires repeated application of a statistical method to numerous bootstrap datasets. Therefore, an increase in the computational load is unavoidable despite its great generalization effect. Nevertheless, given the computational speed of current machines, the estimation accuracy gained by the bootstrapping is attractive in many cases.
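As a rough illustration of the resampling principle described above, the following Python sketch applies a statistic to many datasets drawn with replacement from one small sample and collects the resulting distribution. The function name `bootstrap_estimates`, the made-up measurements, and the choice of 200 resamples are assumptions for illustration only; they are not taken from the paper.

```python
import random
import statistics


def bootstrap_estimates(sample, statistic, n_boot=200, seed=0):
    """Apply `statistic` to `n_boot` resamples drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    estimates = []
    for _ in range(n_boot):
        # Each bootstrap dataset has the same size as the original sample.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    return estimates


# Hypothetical measurements (not data from the paper).
data = [4.1, 5.3, 4.8, 5.9, 5.1, 4.6, 5.4, 5.0]
estimates = bootstrap_estimates(data, statistics.mean)

# The spread of the estimates indicates how stable the statistic is.
print(min(estimates), statistics.median(estimates), max(estimates))
```

Examining where the estimates concentrate is the step that corresponds to "examining the distribution of the results" in the text.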
Figure 3. Variable selection algorithm based on the SFFS method and the RMSEP minimization criterion.
The advantage of the bootstrapping, derivation of true information from small-sized data, is effectively utilized in the proposed method to obtain truly quality-related variables from batch process data. Batch process data are typical “fat-type” data, meaning a small number of samples (batches) and a large number of variables. This feature of the batch process data comes from the unfolding process usually performed to generate a two-way structure from a three-way structure. In the proposed method, critical-to-quality variables are extracted from unfolded batch data based on selection frequencies of unfolded variables that are calculated from repetitive applications of a variable selection algorithm to different bootstrap datasets. The generalization effect of the bootstrapping enables us to extract the variables even from data with limited samples. Because these variables are used as predictor variables in quality estimation models, the models show significantly improved prediction accuracy.

2.2. Variable Selection Based on Sequential Forward Floating Selection (SFFS) Algorithm and Root-Mean-Square Error in Prediction (RMSEP) Minimization Criterion. To identify quality-related variables in each bootstrap dataset, a variable selection method, which systematically finds meaningful variables, is used. The variable selection method is performed based on a search algorithm and a selection criterion. In this study, an SFFS algorithm14 has been chosen as the search algorithm.
This algorithm offers outstanding performance in both accuracy and computation time, while most variable selection algorithms have unbalanced performance between accuracy and computational speed.15-18 The remarkable performance of the SFFS algorithm comes from its flexibility in determining significant variables.19 Although the variable selection process is sequentially performed in the SFFS algorithm, the variables previously selected can be discarded when they become insignificant by adding a new variable into the existing variable subset. It is another advantage of the SFFS algorithm that it guarantees reproducibility of the variable selection results, unlike the genetic algorithm or the simulated annealing, which produces different results depending on the initial values.
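The floating behavior described above, where a previously selected variable can be discarded after a new one is added, can be sketched in Python as follows. This is a minimal sketch of a generic SFFS search under stated assumptions, not the exact procedure of Figure 3: the stopping rule (`max_size`), the function names, and the toy criterion are hypothetical. In the setting of this paper the criterion would be the RMSEP of an MLR model on validation samples.

```python
def sffs(n_features, criterion, max_size):
    """Sequential forward floating selection that minimises `criterion`.

    `criterion(subset)` takes a tuple of feature indices and returns a
    score where lower is better. Returns (best_score, best_subset).
    """
    selected = []
    best = {}  # subset size -> (score, subset) for the best subset seen
    while len(selected) < max_size:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        # Forward step: add the feature that gives the lowest score.
        score, f_add = min(
            (criterion(tuple(selected + [f])), f) for f in candidates
        )
        selected = selected + [f_add]
        size = len(selected)
        if size not in best or score < best[size][0]:
            best[size] = (score, tuple(selected))
        # Floating step: drop a feature while doing so beats the best
        # subset already recorded for the smaller size.
        while len(selected) > 2:
            score, f_drop = min(
                (criterion(tuple(x for x in selected if x != f)), f)
                for f in selected
            )
            if score < best[len(selected) - 1][0]:
                selected = [x for x in selected if x != f_drop]
                best[len(selected)] = (score, tuple(selected))
            else:
                break
    return min(best.values())


# Toy criterion (an assumption): distance from a known "true" subset.
target = {1, 4}


def toy_criterion(subset):
    return len(set(subset) ^ target)


best_score, best_subset = sffs(6, toy_criterion, max_size=3)
print(best_score, best_subset)
```

Because the search is deterministic for a given criterion, repeated runs on the same data return the same subset, which is the reproducibility property noted in the text.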
Figure 4. Procedure of the BBGVS method.
As a selection criterion, the minimization of the RMSEP of a multiple linear regression (MLR) model is used. By adoption of the RMSEP minimization criterion, the prediction capability of a variable can be considered. If the variable selection procedure is performed so that the RMSEP is minimized, only the variables that contribute to the accurate prediction of quality variables are selected. Use of the MLR model significantly reduces the computation time in searching for the variable subset that can correctly explain the behavior of quality variables. Although the MLR model cannot handle a nonlinearity and multicollinearity problem, its simplicity is attractive in the time-consuming variable search procedure. Furthermore, the multicollinearity problem is automatically solved because the variable that causes a severe multicollinearity is repelled by a high RMSEP value of the MLR model including the variable. The detailed variable selection procedure based on the SFFS algorithm and the RMSEP minimization criterion is shown in Figure 3. 2.3. Proposed Method. In this paper, the BBGVS method is proposed to correctly extract quality-related variables used for predictor variables of quality estimation models. The procedure of the proposed method is shown in Figure 4. From steps 3-5, variable selection is implemented to N datasets generated by bootstrapping, and then a selection frequency of each variable is measured as an index to evaluate the significance of a
variable. As a general guideline for the choice of the value N, N ≥ 30 is suggested to guarantee a statistical significance. In the remaining steps, quality-related variables are extracted by examining the selection frequencies window by window for batch operating time after returning unfolded variables and their selection frequencies to their original positions in corresponding process variables. In the proposed method, the selection frequency of a variable is an important index to judge the significance of the variable related to the product quality. If a variable is selected many times in various kinds of bootstrap datasets, it is highly probable that the variable is really critical to the quality. Therefore, variable extraction based on the selection frequency has a better generalization effect than variable selection for one dataset. Furthermore, the proposed method considers time windows (variable sets) of suitable size rather than time instants (individual variables). As step 8 in Figure 4 shows, if the sum of the selection frequencies in each time window of size l is larger than the predetermined constant, p, then the time window is considered to be significant in relation to the quality and the average time instant weighted by the selection frequencies in the window is extracted as a quality-related variable (step 9). This window approach allows us to consider even the case where selection frequencies are broadly spread over a time window as important to the product
Figure 5. Conceptual illustration for the window approach of the proposed method.
quality. Actually, selection frequencies of quality-related variables are more likely to be broadly distributed around the most important time instant than concentrated on specific time instants. If we consider that the product quality is determined not at a time instant but through a time stage, this approach is reasonable. In the algorithm, three parameters, m, l, and p, are used. For m and l, if one of them is specified, the other one is automatically determined. Therefore, two parameters, l and p, are actually used in the algorithm. The parameters critically affect the overall accuracy of the proposed method because the extracted variables can be significantly different depending on the values of the two parameters. If l is too large, the number of extracted variables becomes too small and important information may be missed. On the other hand, if l is too small, the number of extracted variables becomes too large and the effect of the window approach may be ignored. The threshold value for extraction of a variable, p, should also be cautiously determined. If p is too large, most of the windows are rejected and few variables are extracted. On the contrary, if p is too small, most of the windows are considered to be significant and too many variables are extracted. Therefore, the correct choice of l and p is a prerequisite for the successful application of the proposed method. A conceptual illustration for the proposed method is shown in Figure 5. In this figure, the selection frequencies of unfolded variables that belong to a process variable of a batch process data are plotted on their original time instants. The whole batch time has been divided into 40 windows of size 5 min. For each window, the selection frequencies are summed to examine whether they are larger than the predetermined constant, 5, or not. 
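The window rule described above (sum the selection frequencies in each window of size l, compare the sum with p, and extract the frequency-weighted average time instant) can be sketched in Python as below. This is an illustrative sketch only: the function name, the 1-based time indexing, the non-overlapping window layout, and the example frequencies are assumptions and are not the values behind Figure 5.

```python
def extract_time_instants(freq, l, p):
    """Return the extracted time instants for one process variable.

    `freq[t - 1]` is the selection frequency at minute t (1-based time).
    A window of `l` consecutive minutes is significant when its summed
    frequency is larger than `p`; the frequency-weighted average minute
    of that window, rounded half up, is then extracted.
    """
    extracted = []
    for start in range(0, len(freq), l):
        window = freq[start:start + l]
        total = sum(window)
        if total > p:
            weighted = sum((start + i + 1) * f for i, f in enumerate(window))
            # int(x + 0.5) rounds half up for the positive values used here.
            extracted.append(int(weighted / total + 0.5))
    return extracted


# Hypothetical frequencies for minutes 1-15 (three windows of 5 min).
freq = [1, 0, 6, 1, 0,   # one dominant instant at minute 3
        1, 1, 2, 1, 1,   # frequencies spread over the window
        0, 1, 0, 0, 1]   # too few selections, window rejected
print(extract_time_instants(freq, l=5, p=5))
```

In this made-up example the second window is accepted even though no single instant stands out, which mirrors the point made in the text about broadly distributed selection frequencies.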
For instance, the time instant of 3 min after the start of a batch is judged to have a close relationship with the quality because of its high selection frequency. However, we consider all time instants in the window rather than the time instant alone to extract the significant time instant (variable). According to step 9, if we round the average time instant weighted by selection frequencies in the window to the nearest integer, a time instant of 3 min after the start of a batch run is obtained as a quality-related variable. With the same method, the second and final two windows are judged as significant and time instants of 8, 192, and 198 min after the start of a batch are extracted as
Figure 6. Process flow diagram of an industrial PVC polymerization process.
quality-related variables. Note that the second window was considered to be significant by the proposed algorithm, although no variable in the window shows a remarkably high selection frequency. 3. Case Study 3.1. PVC Polymerization Process. To validate the improved performance of the proposed method, an industrial PVC polymerization process was used as a case study. The purpose of this process is to safely produce PVC products with uniform quality by optimally controlling an operating condition during a batch run. The most important thing in this operation is that the inner temperature of the reactor should follow a specified trajectory devised for small operating costs and acceptable product quality. The temperature trajectory consists of three stages: heat-up, main reaction, and cooling. In the heat-up stage, the temperature of the reactants is increased to a specific level to initiate a polymerization reaction. Once the reaction starts, heat is continuously generated from the exothermic reaction. Therefore, a cooling jacket and chilled water are used to maintain the inner temperature of the reactor at a constant level until the end of the main reaction. After that, the temperature sharply drops because the exothermic reaction finishes and the cooling system still works. The process flow diagram of the PVC polymerization process is shown in Figure 6. The whole process condition is monitored online with 15 sensors, denoted as black circles. These sensors measure various physical states such as temperature, pressure, flow rate, electric current, and rotations per minute of the agitator at each minute for 350 min. They all have their own typical trajectories and deviate from them depending on the process conditions. Product quality is inspected based on three quality variables: Y1, Y2, and Y3. Because these three quality variables are measured offline in a separate laboratory after each batch run, considerable time and costs are consumed in this step. 
A description for the 15 process variables and the 3 quality variables is given in Table 1.

Table 1. Description for the Process and Quality Variables

variable   description

Process Variables
T1         inner temperature of the reactor
T2         input temperature of cooling water
T3         output temperature of cooling water
F4         input flow rate of cooling water
P5         inner pressure of the reactor
R6         rotations per minute of the agitator
A7         electric current of the reactor
P8         pressure of the agitator
F9         input flow rate of chilled water (side)
F10        input flow rate of chilled water (top)
F11        input flow rate of chilled water (bottom)
T12        input temperature of cooling water for the condenser
T13        output temperature of cooling water for the condenser
F14        input flow rate of cooling water for the condenser
T15        top temperature of the condenser

Quality Variables
Y1         mass fraction of PVC particles below 35 µm
Y2         average size of the PVC particles
Y3         average number of flaws generated when the PVC mass is spread

Figure 7. Description for the dataset.

3.2. Data. In our study, data from 40 batches were collected, and the dataset had a three-way structure, as is usual for batch data. The collected dataset was unfolded into a general two-way form composed of 5250 unfolded variables (15 process variables × 350 time instants) and 40 samples. To consider all variables on the same scale around zero, mean centering and scaling to unit variance were performed on the unfolded variables. For the 40 samples, division into three parts after random reordering was performed 30 times to generate 30 different sets of bootstrap data. The first part, made of 20 samples, is used for MLR modeling. The second part, made of 12 samples, is used to calculate the RMSEP for the MLR model as a criterion for variable selection. The remaining eight samples were set aside to evaluate the prediction performances of quality estimation models. Note that this work is equivalent to the bootstrapping because different kinds of modeling and test data for variable
Figure 8. Data validation using a PCA score plot. All of the unfolded variables were used for PCA.
selection are generated. A visual description for the datasets used for this study is provided in Figure 7. 3.3. Identification of Batch Process Characteristics Based on Variable Selection. Before application of the proposed method, we examined all of the 40 samples by principal component analysis (PCA) to check whether abnormal batches exist. Figure 8 shows that there is no significant outlier, although the 17th and 18th samples are somewhat out of the ellipsoid representing 95% of the confidence limit. This figure also shows that the 40 samples are divided into two groups composed of the first 16 batches (left cluster) and the last 22 batches (right cluster). From this fact, we can infer that two different operational recipes were used during the 40 batch runs. Later, we will use one group as a modeling dataset and the other group as a test
Figure 9. Selection frequencies plotted over typical trajectories of T1, F4, P5, and F9.
Figure 10. Accumulated selection frequencies for the 15 process variables. Note that the variable numbers are the same as those in Table 1.
dataset to compare the prediction accuracy of quality estimation models built with the proposed method with that built with MPLS in the worst case. The results of variable selection using the proposed method are shown in Figures 9 and 10. Figure 9 was made by applying the proposed method to the given dataset until step 6. This figure shows that the period around 250 min after start-up is the quality-determining stage because selection frequencies in the period are high and dense for the four representative process variables. Figure 10 shows the sum of the selection frequencies of the unfolded variables for each process variable. In this figure, it is obviously shown that the selection frequencies that belong to F4, F9, and F10 are much larger than the other process variables. Therefore, it can be inferred that the variables relevant to the flow rates of cooling and chilled water have a strong impact on the product quality.
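The accumulation behind Figure 10, where the selection frequencies of the unfolded variables are returned to their process variables and summed, might be sketched in Python as follows. The column ordering of the unfolded data is not stated in the text, so the time-major layout used here (column index = time index × number of process variables + variable index) is an assumption, as are the function names and the small example.

```python
def fold_frequencies(unfolded_freq, n_vars, n_times):
    """Return freq[j][k]: frequency of process variable j at time index k.

    Assumes the unfolded column index equals k * n_vars + j.
    """
    if len(unfolded_freq) != n_vars * n_times:
        raise ValueError("length must equal n_vars * n_times")
    return [
        [unfolded_freq[k * n_vars + j] for k in range(n_times)]
        for j in range(n_vars)
    ]


def accumulated_frequencies(unfolded_freq, n_vars, n_times):
    """Sum the selection frequencies over time for each process variable."""
    folded = fold_frequencies(unfolded_freq, n_vars, n_times)
    return [sum(row) for row in folded]


# Hypothetical example: 3 process variables observed at 4 time instants.
unfolded = [0, 2, 0,
            1, 3, 0,
            0, 5, 1,
            0, 0, 0]
print(accumulated_frequencies(unfolded, n_vars=3, n_times=4))
```

For the case study the same call would use 15 process variables and 350 time instants, giving one accumulated value per sensor.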
The analysis results agree with real phenomena on the process behavior. It is notable that these variable selection results nearly perfectly explain actual process characteristics without additional process knowledge. The main PVC polymerization reaction is initiated from 100 min after the start-up and lasts for about 200 min. The main reaction stage goes through three periods during 200 min: particle generation, particle growth, and finalizing periods. During the particle generation period, the initial seeds for later polymerization are formed. After the period, seeds are no longer generated and the particles start to grow through a polymerization reaction. When the particle size becomes sufficiently large, the rate of the polymerization reaction slows down and whole reactions are finalized. From this observation, the result that the variables around 250 min have been selected many times is reasonable. Because this time corresponds to the particle growth period, the particle size of the final product is determined at the period. If we consider that the final values of the three quality variables are closely related to the particle size, the importance of this period to the product quality can be understood. High selection frequencies of the process variables relevant to cooling and chilled water flow rates can be interpreted as the importance of flow-rate variables on reactor temperature control. As mentioned in section 3.1, the most important thing in the operation is to control the inner temperature of the reactor so that it follows a given trajectory. In controlling the temperature, manipulation of flow rate variables is more facile and leads to a better tracking effect than temperature variables. Therefore, the variation of cooling and chilled water flow rates may be the main cause for deviation of the reactor temperature from its typical trajectory. 
Because the product quality is affected by whether the reactor temperature correctly follows the trajectory or not, a high selection frequency of the flow-rate-related
Figure 11. VIP values resulting from application of MPLS along with the selection frequencies from the proposed method for T1, F4, P5, and F9. The selection frequencies are denoted as bars, and the VIP values are denoted as dotted lines.
process variables can be explained by their impact on the variation of the temperature trajectory.

3.4. Determination of Critical-to-Quality Stages Based on the Variable Importance in the Projection (VIP) of MPLS. To compare the performance of the proposed method for information extraction with that of MPLS, MPLS was applied to the same data. Note that MPLS is different from the proposed method in that it uses all unfolded variables and depends on only the projection technique. VIP20 was used as an index for identification of the quality-determining stages because VIP reflects the overall importance of each variable on all of the response variables based on principal components (PCs). Note that only the first two parts (32 samples) of the three divided parts in each of the 30 bootstrap datasets were used in this analysis for fair comparison with the proposed method. In cross-validation to determine the optimal number of PCs, the prediction error sum of squares (PRESS) was minimized when four PCs were chosen. With the four PCs, MPLS was applied and VIP values for all unfolded variables were obtained. Figure 11 shows the VIP values of unfolded variables that belong to the four process variables together with the selection frequencies already obtained by the proposed method. In this figure, the VIP values do not clearly indicate the quality-determining stage, while the selection frequencies contrast the stage with the other ones. Besides, the stages judged as important by the VIP values do not agree with those by the selection frequencies. Therefore, it is concluded that VIP values do not provide reliable information on the critical-to-quality stage. The conclusion is supported by superiority in the prediction accuracy of quality estimation models built with the
proposed method to that with MPLS, which will be shown in section 3.5. The reason for the disadvantage of MPLS in information extraction can be interpreted from two aspects. First, the VIP values resulting from MPLS might be biased because of the small size of the dataset. Unlike the proposed method considering various kinds of datasets by bootstrapping, only one dataset was considered in applying MPLS. Therefore, incorrect correlation information caused by disturbance or noise may be reflected in the VIP values. Second, the variables uncorrelated with the quality variables might corrupt the compression performance of the MPLS. Despite its great data compression capability, the performance of the MPLS can be poor if correlations among variables are weak. Therefore, poor data compression caused by the inclusion of nuisance variables might lead to inaccurate VIP values. From this fact, we can see that the removal of nuisance variables as a preprocessing enhances the accuracy of the projection-based modeling methods when handling batch process data in an unfolded form. 3.5. Comparison of the Quality Estimation Performance between MPLS and the Proposed Method. Using the third part (eight samples), which was set aside in each of the 30 different bootstrap datasets for evaluation of the prediction performance of quality estimation models, the means of the RMSEP values obtained by three quality estimation models were computed with two modeling methods: PLS via BBGVS and MPLS. First of all, we extracted the quality-related variables from unfolded variables by further applying steps 7-10 of the proposed BBGVS method to the results of Figure 9. From a trial-and-error search to find
Figure 12. Means of the RMSEP values of three quality estimation models built by the proposed method and MPLS for 30 different bootstrap datasets.

Table 2. Description for the 23 Extracted Variables

no.   involved process variable   time instant (min)
1     T1                          214
2     T1                          229
3     F4                          32
4     F4                          78
5     F4                          184
6     F4                          243
7     P5                          143
8     P5                          238
9     F9                          27
10    F9                          103
11    F9                          178
12    F9                          248
13    F9                          253
14    F10                         24
15    F10                         39
16    F10                         46
17    F10                         88
18    F10                         93
19    F10                         259
20    F10                         288
21    F11                         218
22    F14                         208
23    F14                         262
the best values for the three parameters (l, m, and p), 5, 70, and 4 were found to be the optimum values for l, m, and p, respectively. On the basis of these values, we extracted 23 variables from the 5250 unfolded variables and used them as predictor variables for the three quality estimation models. Table 2 shows the process variables and the time instants to which the extracted variables belong. The table also shows that most of the extracted variables are related to the cooling and chilled water flow rates and to the stage around 250 min after start-up. Figure 12 shows the means of the RMSEP values obtained by the three quality estimation models (Y1, Y2, and Y3) for 30 different bootstrap datasets. White circles denote the results obtained from MPLS, and black circles denote those from the proposed method. The figure shows that the mean RMSEP values resulting from the proposed method are smaller than those resulting from MPLS for all 30 bootstrap datasets. Furthermore, for some datasets the mean RMSEP is reduced dramatically by the proposed method because of the effect of variable selection based on bootstrapping.

Figure 13. Estimation results of the three quality estimation models built by MPLS and the proposed method.

Table 3. Performance Comparison between MPLS and BBGVS-PLS

                        MPLS      BBGVS-PLS   relative improvement (%)
RMSEP, Y1               1.0877    0.7932      27.08
RMSEP, Y2               1.1735    0.8880      24.33
RMSEP, Y3               1.3562    1.3279      2.09
computation time (s)    4         1351

To compare the quality estimation performance of the two methods in the worst case, the two most heterogeneous data groups (the last 22 and the first 16 samples of the original dataset) were used as modeling and test data, respectively. Note that these two data groups were completely separated from each other in the PCA score plot already shown in Figure 8. Figure 13 shows the estimation results of the three quality estimation models built by both methods for the 16 test data. The values predicted by the proposed method agree better with the actual values of the quality variables than those predicted by MPLS. Table 3 shows the RMSEP values, computation times, and degrees of relative improvement of the proposed method obtained from the worst-case analysis. Except for Y3, for which the improvement is only 2.09%, the prediction accuracy is improved by more than 20% by using PLS via BBGVS in building the quality estimation models. This result demonstrates the effectiveness of bootstrap-based variable selection in building quality estimation models from unfolded batch process data with a limited number of samples. The computation times in Table 3 were obtained by running Matlab code on a personal computer with a Pentium 4 1.7 GHz CPU. The values show that the proposed method requires a much larger computational cost than MPLS (1351 s versus 4 s). Therefore, the proposed method is effective when a more accurate prediction performance is required.
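The relative improvements reported in Table 3 can be checked with a short calculation. The sketch below is written in Python rather than the Matlab used for the reported timings, and it assumes the usual definition of RMSEP as the square root of the mean squared prediction error over the test samples. The function names rmsep and relative_improvement are illustrative and do not come from the paper.

```python
import math


def rmsep(actual, predicted):
    """Root-mean-square error of prediction over paired test samples.

    Assumes the common definition sqrt(mean((y - y_hat)^2)).
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_improvement(baseline, proposed):
    """Percentage reduction of an error measure relative to the baseline."""
    return 100.0 * (baseline - proposed) / baseline


# RMSEP values copied from Table 3: (MPLS, BBGVS-PLS) for each quality variable.
TABLE3_RMSEP = {
    "Y1": (1.0877, 0.7932),
    "Y2": (1.1735, 0.8880),
    "Y3": (1.3562, 1.3279),
}

improvements = {
    name: round(relative_improvement(mpls, bbgvs), 2)
    for name, (mpls, bbgvs) in TABLE3_RMSEP.items()
}

# Ratio of the computation times in Table 3 (1351 s versus 4 s).
time_ratio = 1351 / 4

print(improvements)
print(time_ratio)
```

Under these assumptions the computed percentages agree with the last column of Table 3, and the time ratio puts a number on the trade-off between accuracy and computational cost discussed here.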
When the computation time is a main cost rather than the prediction error, MPLS may be the better method.

4. Conclusions

We proposed the BBGVS method to correctly extract quality-related variables from unfolded batch process data with limited samples. In the method, the bootstrapping technique generalizes the results of variable selection obtained from different bootstrap datasets based on the selection frequency. Therefore, the proposed method allows us to accurately extract the variables that have an actual relationship with the quality variables from typical “fat-type” batch process data. The quality-related variables extracted by the proposed method provide useful information for understanding the process characteristics and behavior. The quality estimation models built with the extracted variables show excellent prediction performance because of the increased correlation between the predictor and the quality variables.

When the proposed method was applied to an industrial PVC polymerization process as a case study, the quality-determining stage and key process variables were identified based on the selection frequencies and proved to be correct by comparison with the real process behavior. The quality estimation results showed that the models built by PLS via BBGVS outperform those built by MPLS from the viewpoint of prediction accuracy. Because the proposed method employs time-consuming methods such as variable selection and bootstrapping, MPLS may be the solution when the computation time is a main cost.

From the practical viewpoint, the accurate prediction capability obtained from the proposed method allows us to replace expensive physical sensors with soft sensors. Furthermore, the knowledge extraction ability of the proposed method provides chances to analyze and understand the dynamic characteristics of a new process and to build reasonable operating guidelines. Indeed, PVC productivity at LG Chem. has been improved by reducing the batch time by 5% based on the analysis results obtained with the proposed method.

Because the improvement of the quality estimation accuracy comes from the removal of the variables uncorrelated with the quality variables, the proposed method may not show a better performance for a strongly correlated dataset, in which few variables are uncorrelated with the quality variables. For such datasets, satisfactory results can be obtained by MPLS alone without the variable selection preprocessing. In addition, the bootstrapping technique may not be required if sufficiently large amounts of historical data can be gathered. It is effective when the modeling dataset cannot reflect all information on the process behavior because of its small size. Finally, it should be noted that, as a prerequisite for successful application of the bootstrapping technique, the dataset should contain statistically meaningful variation and enough samples should be gathered to guarantee statistical significance.

Acknowledgment

The authors acknowledge the support for fulfillment of this work by the IMT2000 project (00015993) fund of the Ministry of Information and Communication and the Brain Korea 21 program issued from the Ministry of Education and Human Resource Development, Korea.
Nomenclature

A = process variable A
B = process variable B
C = process variable C
g = original dataset
g* = bootstrap dataset
l = window size
m = number of windows
n = number of process variables
N = number of bootstrap datasets
p = threshold to judge the significance of a window
SFijk = selection frequency at the ith time instant in the jth window of the kth process variable
timeijk = value of the ith time instant in the jth window of the kth process variable
vnew,s = most newly added variable in a variable subset, Vs
vx = xth variable column
Vs = variable subset composed of s selected variables
X = X block composed of predictor variables
Y = Y block composed of quality variables

Subscripts

i = index for the time instant number in a window
j = index for the window number
k = index for the process variable number
s = index for the number of selected variables
x = index for the variable number

Greek Letter

 = tolerance for the change of RMSEP

Abbreviations

BBGVS = bootstrapping-based generalized variable selection
MLR = multiple linear regression
MPLS = multiway partial least squares
PC = principal component
PCA = principal component analysis
PLS = partial least squares
PRESS = prediction error sum of squares
PVC = poly(vinyl chloride)
RMSEP = root-mean-square error in prediction
SFFS = sequential forward floating selection
VCM = vinyl chloride monomer
VIP = variable importance in the projection
Received for review September 29, 2003
Revised manuscript received February 9, 2004
Accepted March 5, 2004
IE0341552