Article pubs.acs.org/IECR

Multiset Independent Component Regression (MsICR) Based Statistical Data Analysis and Calibration Modeling

Chunhui Zhao*,†,‡ and Furong Gao†

† State Key Laboratory of Industrial Control Technology, Department of Control Science and Engineering, Zhejiang University, Hangzhou, 310007 China
‡ Key Laboratory of System Control and Information Processing, Ministry of Education, Shanghai, 200240

ABSTRACT: In the present work, multiple mixture spectra data sets are given an integrated analysis for the estimation of substance compositions. Different spectral profiles are collected for the same mixtures resulting from the influences of “interference factors”, such as the variations in environmental, instrumental, and sample conditions. A multiset independent component regression (MsICR) algorithm is developed, where the relationship across multiple spectral spaces is studied from a quality-relevant viewpoint for both source extraction and regression modeling. The common systematic information shared by all spectral data sets is separated from the part that is influenced by “interference factors”. Since the common structure reflects information that is not influenced by “interference factors”, it is more informative and responsible to estimate substance compositions in the mixtures. A calibration model is then developed based on the decomposition result for composition estimation. This approach presents a new viewpoint of spectral data analysis and provides comprehensive statistical explanations of inherent information of multiple mixture spectral data. The feasibility and performance of the proposed method are illustratively compared with other methods based on spectral data from laboratory experiments.

Industrial & Engineering Chemistry Research. dx.doi.org/10.1021/ie3023302 | Ind. Eng. Chem. Res. 2013, 52, 2917−2924
Received: August 30, 2012. Revised: January 1, 2013. Accepted: January 27, 2013. Published: January 28, 2013. © 2013 American Chemical Society

1. INTRODUCTION

During the past decades, the use of spectroscopic information1−8 has received much attention and has emerged as an important technique that is widely encouraged and practiced for different purposes. To analyze the substance composition of mixture samples, a calibration model is constructed from the mixture spectra together with the reference concentrations of the constituents to form a quantitative prediction relationship. Common multivariate calibration methods9−13 used for spectral analysis include principal component regression (PCR) and partial least-squares (PLS) regression. They are based on the fact that variable collinearity is typical in spectral data and that its presence can result in unsatisfactory prediction performance if the original spectral data space is used directly. To deal with high data dimensionality and redundancy, PCR and PLS reduce the number of spectral variables by means of feature extraction. The original spectral data space is thus shrunk to a subspace of smaller dimension, where the extracted underlying features are used for regression modeling.

Because the NIR spectra of a set of mixtures are often linear combinations of the spectra of their constituent species, the independent component analysis (ICA) algorithm14−18 has been used to recover the component spectra from the mixtures. After the source decomposition, the resulting mixing relationships are related to the concentrations for prediction, which is called independent component regression (ICR) here. Chen and Wang15 applied ICA to near-infrared spectral data and demonstrated the effectiveness of ICA for recovering the components of interest from mixture spectra. Gustafsson16 made some useful comparisons between different regression algorithms. He demonstrated that, although PLS could give more accurate quality predictions, the advantage of ICR was its ability to yield more chemically interpretable latent variables (LVs) and retrieve a more meaningful LV model.

However, the above analyses inherently restrict regression modeling to the case of a single population sample, here called single-set ICR (SsICR). Regression modeling and composition estimation raise new issues and demand specific solutions when multiple spectral data sets are involved.19,20 The spectral profiles may be different although they are collected for the same mixtures. The differences are caused by “interference factors”19,20 resulting from the use of different instruments or from different environmental or sample conditions, which add undesired variations to the quality-concerned information. The constituent separation result will thus be distorted and will differ from one set of mixtures to another. That is, if a separate single-set analysis is performed, the estimated independent components (ICs) may be far from the real constituent substances. This will result in confusing mixing relationships and may disturb the calibration modeling performance.

From another viewpoint, these different spectral profiles provide more information for calibration modeling and make it possible to decompose the underlying information more meaningfully. For the same compositions/mixtures, without the influences of “interference factors”, the spectral profiles should be quite similar to each other. In contrast, under the influences of “interference factors”, multiset spectral data may differ to a certain extent. Therefore, the part shared by different sets reveals the common systematic information that is immune to “interference factors”. It tends to be composition-related and more useful for the prediction of composition concentrations. The other part, resulting from the influences of “interference factors”, is thus less informative for the estimation of substance compositions in the mixtures. For more reliable prediction, it is thus desirable to capture the shared/common regression information across sets. The SsICR algorithm, as a single-set analytic method, may lose its efficiency: a calibration model developed under one specific circumstance usually does not perform well under another,19,20 since the spectral profiles may mismatch to a certain extent due to the influences of external parameters. Its extension to the multiset case is thus desired.

Recently, a set of different tools has been presented to extend single-set statistical analysis methods by giving multiple data sets an integrative consideration. Following the multiset PCA modeling method,21 which tries to extract the common variable mode or variation mode shared by different data sets, a multiset regression modeling method, termed MsRA,22 was proposed to relate the inherent quality-related predictor variation or correlations across multiple data sets. The two methods look into the cross-set relationship and focus on decomposing the cross-set underlying information without and with the quality-relevant consideration, respectively. Following the development of the theoretical algorithms and their property analysis, the potential of these algorithms for solving meaningful practical problems has also been reported,23−25 generating more interesting statistical explanations of the inherent characteristics underlying multiple data sets. Calibration modeling and composition prediction for multiple spectral data sets is an important problem that is drawing increasing attention. The main purpose of this study is to investigate ways of extending the ICR modeling method to the multiset case (MsICR) for concentration prediction when more than two sets of spectral profiles are collected for the same mixtures.
With multiple predictor spaces prepared, it is of interest to look for the cross-set interrelations and reveal their Y-related common structures for regression analysis. The advantage of such a simultaneous consideration of multiple data sets is mainly that it allows more useful information to be extracted and improves the interpretation of the substance compositions in the mixtures. The two multiset statistical analysis algorithms21,22 from our previous work can be employed here as the basic modeling methods. They extract the common sources of multiple predictor spaces and relate different mixing relationships to the same response data space, respectively. For readability, the two algorithms are comparatively summarized in Table 1, and the relevant details can be found in the algorithm development in previous work.21,22 For both algorithms, the solutions lead to two-step eigenvalue decomposition procedures. Note that the original MsPCA algorithm in ref 21 is described based on the variable correlations; the MsPCA-score version is easily obtained by following a similar two-step calculation procedure. Here, using the two algorithms, on the one hand, the common sources are extracted based on the relationships across spectral data sets, which may result in different mixing coefficients for different sets; on the other hand, the common mixing relationships corresponding to the same sources are extracted by considering the cross-set relationships. Each underlying predictor space can therefore be decomposed into two parts, the common cross-set part and the specific within-set part. This is achieved by looking at each data set and at the relationships across sets so as to obtain local and global information simultaneously from the data. Regression modeling then determines how, and how well, the predictors and responses relate to this “consensus”.
The multiset common structures, which are extracted from multiple predictor spaces and closely related with each other, can represent the underlying regression information (i.e., quality-relevant systematic variations) and will result in a more robust regression model. Analyses are conducted to further comprehend the proposed solution.

Table 1. Summary of MsPCA-Score and MsRA-Score Modeling Algorithms

| step no. | item | MsPCA-score | MsRA-score |
|---|---|---|---|
| | analysis subject | systematic scores for $X_i$ | regression scores for $X_i$−$Y$ |
| 1 | objective | $\max \sum_{i=1}^{C} (w_i^T X_i^T t_g)^2$ | $\max \sum_{i=1}^{C} (v^T Y^T X_i a_i)^2$ |
| 1 | constraint | s.t. $t_g^T t_g = 1$, $w_i^T w_i = 1$ | s.t. $v^T v = 1$, $a_i^T a_i = 1$ |
| 1 | solution | $t_i = X_i w_i$ and $R$ scores are collected by $\bar{T}_i = \{t_{i,1}, t_{i,2}, ..., t_{i,R}\}$ | $t_i = X_i a_i$ and $R$ scores are collected by $\bar{T}_i = \{t_{i,1}, t_{i,2}, ..., t_{i,R}\}$ |
| 2 | objective | $\max \sum_{i=1}^{C} (w_i^T \bar{T}_i^T t_g)^2$ | $\max (v^T Y^T \bar{T}_i a_i)^2$ |
| 2 | constraint | $t_g^T t_g = 1$, $w_i^T \bar{T}_i^T \bar{T}_i w_i = 1$ | $v^T Y^T Y v = 1$, $a_i^T \bar{T}_i^T \bar{T}_i a_i = 1$ |
| 2 | solution | $t_i = \bar{T}_i w_i$ and $R$ systematic scores are collected by $T_i = \{t_{i,1}, t_{i,2}, ..., t_{i,R}\}$ | $t_i = \bar{T}_i a_i$ and $R$ regression scores are collected by $T_i = \{t_{i,1}, t_{i,2}, ..., t_{i,R}\}$ |
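As an illustration of why the objectives in Table 1 lead to eigenvalue problems, the sketch below solves only the first component of the step-1 MsPCA-score objective. For a fixed $t_g$, the unit-norm $w_i$ that maximizes $(w_i^T X_i^T t_g)^2$ is $X_i^T t_g$ normalized, so the objective reduces to $t_g^T (\sum_i X_i X_i^T) t_g$, which is maximized by the leading eigenvector. The function names and the synthetic data are hypothetical; the extraction of further components and the second step follow refs 21 and 22 and are not reproduced here.

```python
import numpy as np

def mspca_step1(X_list):
    """First global score t_g and per-set weights w_i for the step-1 objective.

    Each X in X_list has the same number of rows (the length of t_g).
    """
    M = sum(X @ X.T for X in X_list)           # sum_i X_i X_i^T (symmetric)
    eigvals, eigvecs = np.linalg.eigh(M)       # eigenvalues in ascending order
    t_g = eigvecs[:, -1]                       # unit-norm leading eigenvector
    W = [X.T @ t_g / np.linalg.norm(X.T @ t_g) for X in X_list]
    return t_g, W, eigvals[-1]

def objective(X_list, W, t_g):
    """Value of sum_i (w_i^T X_i^T t_g)^2."""
    return sum(float(w @ X.T @ t_g) ** 2 for X, w in zip(X_list, W))

# Synthetic illustration (assumed data, not from the article).
rng = np.random.default_rng(0)
X_list = [rng.standard_normal((30, 8)) for _ in range(3)]
t_g, W, lam_max = mspca_step1(X_list)
best = objective(X_list, W, t_g)

# Any other unit vector, paired with its own optimal weights, scores no higher.
t_rand = rng.standard_normal(30)
t_rand /= np.linalg.norm(t_rand)
W_rand = [X.T @ t_rand / np.linalg.norm(X.T @ t_rand) for X in X_list]
other = objective(X_list, W_rand, t_rand)
```

The objective value at the solution equals the largest eigenvalue of $\sum_i X_i X_i^T$, which gives a quick numerical check on any implementation.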

2. METHODOLOGY

2.1. Single Set Independent Component Regression. For a set of spectra with $J$ wavelengths acquired on $N$ samples, $X^T (J \times N)$, and the quality data (concentration matrix) $Y (N \times J_y)$, where $J_y$ is the number of quality variables, a conventional SsICR model15 can be mathematically formulated as

$$ X^T = S A + E, \qquad \hat{Y} = A^T B \tag{1} $$

where $S (J \times R)$ are the independent components estimated from the observed variables, which are in fact estimates of the spectra of the pure constituents in the mixtures, and $R$ is the number of ICs. Ideally, if the estimated ICs exactly match the pure substances constituting the mixtures, the mixing matrix $A (R \times N)$ will agree well with the concentrations of the substances in the mixtures. In practice, they do not match each other very well, and it therefore cannot be taken for granted that the elements of $A (R \times N)$ are concentrations. Therefore, as in PCR, regression analysis is performed between the estimated mixing relationships and the real concentrations to derive the regression model $B$. $E$ are the residuals caused by normal random measurement noise, which can be well controlled by routine laboratories. That is, if the same sample is scanned multiple times by the same instrument under the same condition, the spectra will differ slightly and stochastically.

However, when each spectral data set is considered separately, the source estimation results using ICA are influenced by the interference factors. In that case, on the one hand, the calibration models are different for different sets. On the other hand, the accuracy of the prediction model may suffer, since the estimated ICs may differ from the real substance compositions, or the mixing coefficients of the ICs may differ from the real concentrations of the substance compositions in the mixtures.
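The single-set model of eq 1 can be sketched in NumPy as follows. This is a minimal illustration under several assumptions that the article does not state: the spectra are column-centered, whitening is done by a truncated SVD, and the ICs are estimated with a conventional symmetric FastICA iteration using a tanh contrast. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def whiten(XT, R):
    """Column-center X^T (J x N) and return it with R whitened vectors Z (J x R)."""
    Xc = XT - XT.mean(axis=0, keepdims=True)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    Z = np.sqrt(XT.shape[0]) * U[:, :R]        # uncorrelated columns, unit variance
    return Xc, Z

def fastica_rotation(Z, n_iter=200, seed=0):
    """Symmetric FastICA (tanh contrast) on whitened Z; returns an orthogonal R x R matrix."""
    J, R = Z.shape
    rng = np.random.default_rng(seed)
    W, _ = np.linalg.qr(rng.standard_normal((R, R)))
    for _ in range(n_iter):
        G = np.tanh(Z @ W.T)                   # nonlinearity applied to current sources
        W_new = (G.T @ Z) / J - np.diag((1.0 - G ** 2).mean(axis=0)) @ W
        U, _, Vt = np.linalg.svd(W_new)        # symmetric decorrelation
        W = U @ Vt
    return W

def fit_ssicr(XT, Y, R):
    """Eq 1: X^T = S A + E, followed by least-squares regression Y_hat = A^T B."""
    Xc, Z = whiten(XT, R)
    S = Z @ fastica_rotation(Z).T                      # estimated ICs (J x R)
    A = np.linalg.lstsq(S, Xc, rcond=None)[0]          # mixing matrix (R x N)
    B = np.linalg.lstsq(A.T, Y, rcond=None)[0]         # regression model (R x Jy)
    return S, A, B, Xc

# Noise-free synthetic mixtures (assumed data): Y holds the true mixing proportions.
rng = np.random.default_rng(1)
J, N, R = 200, 30, 3
S_true = rng.laplace(size=(J, R))
A_true = rng.uniform(0.0, 1.0, size=(R, N))
XT = S_true @ A_true
Y = A_true.T
S, A, B, Xc = fit_ssicr(XT, Y, R)
Y_hat = A.T @ B
```

In this noise-free setting the estimated mixing matrix is an invertible linear transform of the true one, so the regression step recovers the concentrations even when the individual ICs do not coincide with the pure spectra. This is the reason eq 1 regresses $Y$ on $A$ rather than reading concentrations directly from $A$.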

Figure 1. Source extraction results using (a) MsICA and (b) SsICA and MICA methods for case study 1.

2.2. Multiset Independent Component Regression. In the present work, multiple spectral profiles are measured, corresponding to the same mixtures. As mentioned before, the quality-relevant systematic information should in general be the same across sets, since the spectral profiles are for the same mixtures. Under the influences of the external interferences, the spectra differ to some extent. Moreover, the resulting undesired variations may distort the original systematic information and jeopardize the estimation of the constituent concentrations of interest. The major problem is thus how to extract the systematic parameter of interest from multiple spectral data sets so that calibration analysis can exclude the influence of interference factors. Taking into account the effectiveness of ICR for spectral analysis, our work extends the single-set ICR (SsICR) algorithm to its multiset version (MsICR), where the cross-set common quality-relevant systematic information is modeled for calibration analysis. This is achieved by a three-step modeling strategy: the first step extracts the common ICs, the second step extracts the common mixing relationships, and the third step relates the common mixing relationships to the quality indices.

Let $\{X_1^T, X_2^T, ..., X_C^T\}$ be $C$ $(J \times N)$ matrices of spectral data with the same $N$ samples observed under the influence of interference factors, which all point to the same concentrations of constituents in the mixtures, $Y$. The MsPCA-score algorithm shown in Table 1 is used for the whitening preprocessing in ICA estimation, where the $(J \times R_w)$-dimensional global scores $T_g$ are further processed and serve as the white vectors $Z_g$ prior to ICA estimation. That is, their components are uncorrelated and their variances equal unity. $R_w$ is the number of retained white components. In general, $R_w$ can be chosen based on cross-validation to achieve the best prediction performance.

$R_g$ global ICs ($S_g (J \times R_g)$) are then extracted from $Z_g$ using the modified ICA algorithm,26 which can produce a unique and repeatable ICA solution by fixing the initialization instead of using the random initialization of conventional ICA.14 Note that the number of global ICs retained in a model is an important parameter. A higher dimension results in more systematic information being used for calibration modeling, so that less information is considered to be noise. A lower dimension leaves more variation in the residual subspace, which is deemed to be noise. When no prior information exists, as a rule of thumb, the dimension of a statistical model is in general determined by trial and error to achieve good performance for the study objective, such as prediction performance in the present work. The choice of this parameter is inevitably affected by experience with the specific process and by subjective factors. The mixing relationships for different sets are then calculated by

$$ A_i = (S_g^T S_g)^{-1} S_g^T X_i^T \tag{3} $$

where $S_g (J \times R_g)$ denote the $R_g$ systematic parameters of interest in the mixtures, which should in general be the same across different sets. It is the common part across sets. $A_i (R_g \times N)$ are the mixing relationships, i.e., the contributions of the global ICs to the mixtures, which reveal how the systematic sources $S_g$ influence the mixture spectra. $A_i$ may differ across sets as a result of the influences of interference factors. The common part should be further extracted for regression modeling and quality prediction.

From eq 3, multiple sets of mixing coefficients are collected, preparing an integrated predictor space, $\{A_1^T, A_2^T, ..., A_C^T\}$, corresponding to the same substance compositions ($Y$). The MsRA algorithm is then used for the extraction of the common part ($A_{c,i}^T$) across different sets

$$ A_{c,i}^T = A_i^T W_{c,i} \tag{4} $$

where $A_{c,i} (R_c \times N)$ are the common part of the mixing relationships. $W_{c,i} (R_g \times R_c)$ are calculated by the MsRA-score algorithm shown in Table 1, which can calculate the common regression LVs directly from the original mixing coefficients. The number of retained common regression LVs extracted from $A_i$ is indicated by $R_c$, which may be different from $R_g$. The global regression scores from all sets are then calculated by

$$ A_g^T = \frac{1}{C} \sum_{i=1}^{C} A_{c,i}^T \tag{5} $$

The regression model ($\Theta_i$) in each data set for the estimation of substance compositions is thus developed by least-squares algebra27 for quality prediction

$$ \Theta_i = (A_{c,i} A_{c,i}^T)^{-1} A_{c,i} Y \tag{6} $$

Also, the global regression model can be calculated as

$$ \Theta_g = (A_g A_g^T)^{-1} A_g Y \tag{7} $$

The concentration can then be predicted using either local information or global information

$$ \hat{Y} = \begin{cases} A_g^T \Theta_g \\ A_{c,i}^T \Theta_i \end{cases} \tag{8} $$

$$ F = Y - \hat{Y} \tag{9} $$

where $\hat{Y}$ is the estimation, which can be the global prediction ($\hat{Y}_g$) or the local prediction ($\hat{Y}_{c,i}$), and $F$ is the corresponding prediction error. For new samples collected by different measurement devices or under different conditions, $x_i (J \times 1)$, the multiset analysis is performed and the estimation of concentration ($\hat{y}$) can be made as follows:

$$ \begin{aligned} a_i &= (S_g^T S_g)^{-1} S_g^T x_i \\ a_{c,i}^T &= a_i^T W_{c,i} \\ a_g^T &= \frac{1}{C} \sum_{i=1}^{C} a_{c,i}^T \\ \hat{y}^T &= \begin{cases} a_g^T \Theta_g \\ a_{c,i}^T \Theta_i \end{cases} \end{aligned} \tag{10} $$

Figure 2. (a) Common regression scores extracted from mixing coefficients using MsICR algorithm and (b) the mixing relationships extracted using SsICR algorithm for case study 1.

Figure 3. Regression model coefficients for four quality indices using MsICR algorithm for case study 1.
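The calculations in eqs 3−10 can be sketched in NumPy as below. The global ICs $S_g$ and the MsRA weights $W_{c,i}$ are treated as given inputs, since their computation (modified ICA and the MsRA-score algorithm) is not reproduced here. The weights used in the demonstration are a simple placeholder and are not MsRA output, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def mixing_coefficients(S_g, XT_i):
    """Eq 3: least-squares mixing coefficients for one set (works for a matrix or one spectrum)."""
    return np.linalg.lstsq(S_g, XT_i, rcond=None)[0]

def fit_msicr(S_g, XT_list, W_list, Y):
    """Eqs 3-7, given global ICs S_g and one weight matrix W_{c,i} (Rg x Rc) per set."""
    A_c = [(mixing_coefficients(S_g, XT).T @ W).T            # eq 4, each (Rc x N)
           for XT, W in zip(XT_list, W_list)]
    A_g = np.mean(A_c, axis=0)                               # eq 5
    Theta_local = [np.linalg.lstsq(A.T, Y, rcond=None)[0]    # eq 6, one model per set
                   for A in A_c]
    Theta_g = np.linalg.lstsq(A_g.T, Y, rcond=None)[0]       # eq 7
    return A_c, A_g, Theta_local, Theta_g

def predict_new(S_g, x_list, W_list, Theta_g):
    """Eq 10, global branch: x_list holds one new spectrum of length J per set."""
    a_c = [W.T @ mixing_coefficients(S_g, x) for x, W in zip(x_list, W_list)]
    a_g = np.mean(a_c, axis=0)
    return a_g @ Theta_g

# Synthetic illustration (assumed data): three sets that differ by small perturbations.
rng = np.random.default_rng(0)
J, N, Rg, Rc, C = 50, 20, 3, 2, 3
S_g = rng.standard_normal((J, Rg))
A_true = rng.uniform(0.0, 1.0, (Rg, N))
XT_list = [S_g @ (A_true + 0.01 * rng.standard_normal((Rg, N))) for _ in range(C)]
W_list = [np.eye(Rg)[:, :Rc] for _ in range(C)]   # placeholder weights, NOT MsRA output
Theta_true = np.array([[1.0, 0.5], [-0.5, 2.0]])
Y = A_true[:Rc].T @ Theta_true

A_c, A_g, Theta_local, Theta_g = fit_msicr(S_g, XT_list, W_list, Y)
Y_hat_g = A_g.T @ Theta_g                                        # eq 8, global prediction
Y_hat_local = [A.T @ Th for A, Th in zip(A_c, Theta_local)]      # eq 8, local predictions
F_g = Y - Y_hat_g                                                # eq 9
y_new = predict_new(S_g, [XT[:, 0] for XT in XT_list], W_list, Theta_g)
```

The sketch uses `np.linalg.lstsq` rather than forming $(A A^T)^{-1}$ explicitly; the two agree when the score matrices have full row rank, and the least-squares routine is the numerically safer choice.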


Table 2. Quality Prediction Performance Based on RMSE Index in Case Study 1 Using Three Different Methods for (a) Training Data and (b) Testing Data

For MsICRc, the three entries per quality index are the predictions by X1 / X2 / X3; the other methods list one value per quality index.

(a) Training Data

| method | y1 | y2 | y3 | y4 | average (mean ± std) |
|---|---|---|---|---|---|
| MsICRc (a) | 0.07 / 0.13 / 0.14 | 0.08 / 0.09 / 0.09 | 0.20 / 0.18 / 0.20 | 0.52 / 0.46 / 0.47 | 0.22 ± 0.17 |
| MsICRg (a) | 0.11 | 0.08 | 0.19 | 0.47 | 0.21 ± 0.18 |
| SICR | 0.04 | 0.02 | 0.09 | 0.45 | 0.15 ± 0.20 |
| MICR | 0.01 | 0.01 | 0.05 | 0.17 | 0.06 ± 0.08 |

(b) Testing Data

| method | y1 | y2 | y3 | y4 | average (mean ± std) |
|---|---|---|---|---|---|
| MsICRc (a) | 0.13 / 0.27 / 0.26 | 0.12 / 0.15 / 0.19 | 0.30 / 0.34 / 0.31 | 0.70 / 0.65 / 0.63 | 0.34 ± 0.21 |
| MsICRg (a) | 0.21 | 0.15 | 0.30 | 0.66 | 0.33 ± 0.23 |
| SICR | 0.07 | 0.03 | 0.38 | 1.49 | 0.49 ± 0.68 |
| MICR | 0.03 | 0.07 | 0.73 | 3.41 | 1.06 ± 1.60 |

(a) Subscripts c and g denote prediction using local and global information, respectively.
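The “average (mean ± std)” column of Table 2 can be recomputed from the per-index entries. The sketch below does this for the testing data, with the values transcribed by hand from the table. Using the sample standard deviation (`ddof=1`) is an assumption; it appears to be the convention that reproduces the reported figures to two decimals.

```python
import numpy as np

# Testing-data entries of Table 2 (case study 1), transcribed from the table.
testing = {
    "MsICRc": [0.13, 0.27, 0.26, 0.12, 0.15, 0.19, 0.30, 0.34, 0.31, 0.70, 0.65, 0.63],
    "MsICRg": [0.21, 0.15, 0.30, 0.66],
    "SICR": [0.07, 0.03, 0.38, 1.49],
    "MICR": [0.03, 0.07, 0.73, 3.41],
}

def mean_std(values):
    """Mean and sample standard deviation (ddof=1 is an assumed convention)."""
    v = np.asarray(values, dtype=float)
    return v.mean(), v.std(ddof=1)

summary = {name: mean_std(values) for name, values in testing.items()}
for name, (m, s) in summary.items():
    print(f"{name}: {m:.2f} ± {s:.2f}")
```

The spread across quality indices is what separates the methods here: the multiset variants stay within a narrow band, whereas the SICR and MICR averages are dominated by their errors on y3 and y4.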

Figure 4. (a) Source extraction results using MsICA method and (b) source extraction results using SsICA and MICA methods for case study 2.

The index of mean squared error (MSE)28 is formulated to evaluate the prediction errors as follows:

$$ \mathrm{MSE}_{i,j} = \frac{1}{N} \sum_{n=1}^{N} (y_{i,n,j} - \hat{y}_{i,n,j})^2 \tag{11} $$

where subscripts $n$ and $j$ index the samples and quality indices, respectively; subscript $i$ is the index of the multiple data sets. $\hat{y}_{i,n,j}$ is the quality prediction corresponding to the measurement $y_{i,n,j}$. These MSE values for each quality index from all sets can be averaged to quantitatively evaluate the general prediction performance.

3. SIMULATIONS AND DISCUSSIONS

3.1. Case Study 1. The first data set consists of spectra from 80 samples of corn with wavelengths ranging from 1100 to 2498 nm at 2 nm intervals (700 channels), in which each sample is scanned on three different NIR spectrometers (m5, mp5, and mp6). Therefore, we can collect 80 × 3 (samples × instruments) spectral observations in all. They all correspond to the same concentrations, which are collected in the response matrix $Y (80 \times 4)$, referring to four constituents: moisture, oil, protein, and starch. The corn data are available at the eigenvector research homepage: http://www.eigenvector.com/DATA/Corn. Here, the first 40 samples are used for model development, and the others are used for model testing.

First, three spectral data sets, $\{X_1^T (700 \times 40), X_2^T (700 \times 40), X_3^T (700 \times 40)\}$, are collected, corresponding to the same concentration matrix, $Y (40 \times 4)$. The common ICs are then extracted using the MsPCA-score algorithm and illustrated in Figure 1 for the first four ICs. For comparison, the first four white scores are also shown, which are different from the common ICs. Two other source extraction methods are also used. The single-set ICA (SsICA) modeling method focuses on each set of mixtures separately. The other method puts all mixtures together, $X (700 \times 120)$, and extracts the ICs from them; it is called the multiway ICA (MICA) method here. The source decomposition results of the two methods are compared in Figure 1b; they differ to some extent, especially for the latter two ICs. It is also observed that the fourth SIC for $X_1$ differs more from those for $X_2$ and $X_3$. Moreover, they are different from the MsICA results shown in Figure 1a.

Second, the mixing relationships ($A_i$) for each set of mixtures are calculated based on the global ICs ($S_g$). Then the common part ($A_{c,i}$) is extracted using the MsRA-score algorithm, and the global one is calculated based on eq 5. They are comparatively shown in Figure 2a. Using the two-step MsRA calculation, the common scores for different sets are quite similar to each other and closely related to the global scores ($A_g$). Also, using SsICA extraction, the mixing relationships for each IC and each set are shown in Figure 2b, where those for $X_1$ differ to a certain extent from those for $X_2$ and $X_3$.

Third, based on the extraction of the global mixing relationships, regression models are developed and used for concentration estimation. The regression model coefficients are calculated based on eqs 6 and 7 for the local common scores ($A_{c,i}$) and the global common scores ($A_g$), respectively, and are shown in Figure 3 for the four quality indices. As the figure indicates, the local regression models are quite similar across sets and are also close to the global model. The prediction performance is comparatively summarized in Table 2 based on both training and testing data for the MsICRc and MsICRg models, which use local and global common information, respectively. In general, the predictions based on local common information in each data set ($X_i$) are similar to those using global common information, with a certain extent of dissimilarity, indicating that the proposed algorithm can extract the information useful for quality prediction in each data set.

Predictions are also made using two other methods, SsICR and multiway ICR (MICR). For the SsICR algorithm, each set of mixtures is modeled separately and in isolation from the others. For the MICR algorithm, all data sets are put together and given an integrated view for both ICA source extraction and regression modeling. All prediction results are comparatively shown in Table 2. In general, using the proposed method, the accuracy is good for all quality indices, as evaluated by the average results, which are calculated using the mean and standard deviation (std) values. The prediction accuracy for testing data does not decrease significantly compared with that for training data using the proposed method (MsICR). By comparison, the results of the SsICR and MICR algorithms for training data are in general better than those of the MsICR algorithm, with the MICR algorithm showing the best fitting performance. However, the testing results are much worse for the third and fourth quality indices for both the SsICR and MICR methods, revealing more significant prediction variability across quality indices.

3.2. Case Study 2. In this case study, the spectral data obtained from the literature20 are ternary mixtures of water, ethanol, and 2-propanol recorded in a 1 cm cuvette over the wavelength range 580−1091 nm. The short-wave NIR spectra of 22 samples are taken at three different temperatures (30, 50, and 70 °C), and the temperature of the samples in each set is controlled (only about 0.2 degree variation). Besides the spectral measurements, the quality matrix $Y (22 \times 3)$ describes the different contents of ethanol, water, and isopropanol. Here, the first twelve samples are used for model identification, and the remaining samples are used for model testing. Three spectral data sets, $\{X_1^T (512 \times 12), X_2^T (512 \times 12), X_3^T (512 \times 12)\}$, are collected, corresponding to the same concentration matrix, $Y (12 \times 3)$. The common ICs are then extracted and illustrated in Figure 4 for the first four ICs, along with the first four MsPCA white scores. The source decomposition results by the SsICA and MICA methods are compared in Figure 4b; they differ to some extent, especially for the latter two ICs, which is similar to the scenario shown in case study 1. The common part ($A_{c,i}$) of the mixing relationships ($A_i$) for each set of mixtures is calculated using the MsRA-score algorithm and shown in Figure 5a in comparison with the global one ($A_g$). The common scores for different sets are quite similar to each other and closely related to the global scores ($A_g$), which is different from the scenario shown in Figure 5b using SsICA. By multiset analysis, the common ICs and the common mixing relationships are extracted and collected, and they are used for the development of the estimation model. The prediction performance is then comparatively summarized in Table 3 based on both training and testing data for three different methods, MsICR (including MsICRc and MsICRg), SsICR, and MICR. For the MsICR algorithm, the predictions based on local common information are similar to those using global common information, although they show a certain extent of dissimilarity, indicating that the proposed algorithm can extract the information useful for quality prediction in each data set. A conclusion similar to that in case study 1 is also drawn for the SsICR and MICR algorithms.

Figure 5. (a) Common regression scores extracted from mixing coefficients using MsICR algorithm and (b) the mixing relationships extracted using SsICR algorithm for case study 2.

Table 3. Quality Prediction Performance Based on RMSE Index in Case Study 2 Using Three Different Methods for (a) Training Data and (b) Testing Data

For MsICRc, the three entries per quality index are the predictions by X1 / X2 / X3; the other methods list one value per quality index.

(a) Training Data

| method | y1 | y2 | y3 | average (mean ± std) |
|---|---|---|---|---|
| MsICRc (a) | 0.007 / 0.01 / 0.009 | 0.003 / 0.004 / 0.003 | 0.005 / 0.01 / 0.01 | 0.007 ± 0.003 |
| MsICRg (a) | 0.008 | 0.003 | 0.007 | 0.006 ± 0.003 |
| SICR | 0.001 | 0.001 | 0.002 | 0.001 ± 0.0006 |
| MICR | 0.0001 | 0.0001 | 0.0002 | 0.0001 ± 0.0006 |

(b) Testing Data

| method | y1 | y2 | y3 | average (mean ± std) |
|---|---|---|---|---|
| MsICRc (a) | 0.04 / 0.05 / 0.05 | 0.05 / 0.05 / 0.04 | 0.04 / 0.05 / 0.04 | 0.05 ± 0.01 |
| MsICRg (a) | 0.04 | 0.04 | 0.03 | 0.04 ± 0.01 |
| SICR | 0.04 | 0.07 | 0.05 | 0.05 ± 0.02 |
| MICR | 0.08 | 0.33 | 0.10 | 0.17 ± 0.14 |

(a) Subscripts c and g denote prediction using local and global information, respectively.

The simulations illustrate how the proposed calibration modeling strategy performs on two case studies, revealing prediction performance comparable to or even better than that of the other two methods, together with more meaningful statistical explanations. The SsICR modeling method focuses only on local insight. Since the mixture spectra are subject to the influences of interferences, the source separation results (ICs or mixing relationships) differ across sets, resulting in different regression models. For the MICR modeling method, only the global view is taken into consideration, and all mixture information is necessary for model development and application. By comparison, the MsICR modeling method gives both local and global insights. Therefore, predictions can be obtained when new samples from only one mixture set are available.
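The MSE index of eq 11, and its averaging across sets, can be sketched as follows. The function name and the small numerical example are hypothetical and serve only to show the array layout: one prediction matrix per data set, with samples in rows and quality indices in columns.

```python
import numpy as np

def mse_per_set_and_index(Y, Y_hat_sets):
    """Eq 11: returns an array of shape (C, Jy) holding MSE_{i,j}.

    Y is the (N x Jy) reference matrix; Y_hat_sets holds one (N x Jy)
    prediction matrix per data set i.
    """
    Y = np.asarray(Y, dtype=float)
    return np.stack([np.mean((Y - np.asarray(Y_hat, dtype=float)) ** 2, axis=0)
                     for Y_hat in Y_hat_sets])

# Tiny hand-made example (assumed numbers): two samples, two quality indices, two sets.
Y = np.array([[1.0, 2.0], [3.0, 4.0]])
Y_hat_set1 = Y + 1.0                               # every prediction off by 1
Y_hat_set2 = Y + np.array([2.0, 0.0])              # first index off by 2, second exact
mse = mse_per_set_and_index(Y, [Y_hat_set1, Y_hat_set2])
mse_avg_over_sets = mse.mean(axis=0)               # average per quality index across sets
```

Keeping the per-set values before averaging makes it possible to compare local predictions set by set, which is how the MsICRc entries in Tables 2 and 3 are laid out.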

4. CONCLUSION

In this article, the multiset independent component regression (MsICR) algorithm is developed to decompose the underlying systematic information from multiple spectral data sets. By considering the relationship across different spectral spaces, the systematic information is decomposed in a more meaningful way. The influences of interference factors are excluded, and the common subspace is extracted from the original spectral data space for regression modeling and quality prediction. The simulations illustrate how the proposed calibration modeling strategy performs on two experimental spectral data sets, revealing that the proposed algorithm can provide more informative statistical explanations and more chemically interpretable characteristics for the analysis of multiset spectral data. It also provides a basis for future work. For example, since wavelength regions differ in significance, the important wavelength ranges can provide different information and also a platform for multiset analysis. How to put them to effective use deserves further investigation, which can help improve multiset spectral data analysis and calibration modeling performance.



AUTHOR INFORMATION

Corresponding Author

*Tel: 86-571-87951879. Fax: 86-571-87951879. E-mail: [email protected]. Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

This work is supported by the National Natural Science Foundation of China (No. 61273166), National Program on Key Basic Research Project (973 Program) under grant 2012CB720505, the Fundamental Research Funds for the Central Universities (2012QNA5012), Project of Education Department of Zhejiang Province (Y201223159), Technology Foundation for Selected Overseas Chinese Scholar of Zhejiang Province, and the Foundation of Key Laboratory of System Control and Information Processing, Ministry of Education, P.R. China.

REFERENCES

(1) Gurden, S. P.; Westerhuis, J. A.; Smilde, A. K. Monitoring of batch processes using spectroscopy. AIChE J. 2002, 48 (10), 2283−2297.
(2) Abrahamsson, C.; Johansson, J.; Sparen, A.; Lindgren, F. Comparison of different variable selection methods conducted on NIR transmission measurements on intact tablets. Chemom. Intell. Lab. Syst. 2003, 69 (1−2), 3−12.
(3) Gusnanto, A.; Pawitan, Y.; Huang, J.; Lane, B. Variable selection in random calibration of near-infrared instruments: Ridge regression and partial least squares regression settings. J. Chemom. 2003, 17 (3), 174−185.
(4) Othman, N. S.; Fevotte, G.; Peycelon, D.; Egraz, J. B.; Suau, J. M. Control of polymer molecular weight using near infrared spectroscopy. AIChE J. 2004, 50 (3), 654−664.
(5) Gabrielsson, J.; Jonsson, H.; Trygg, J.; Airiau, C.; Schmidt, B.; Escott, R. Combining process and spectroscopic data to improve batch modeling. AIChE J. 2006, 52 (9), 3164−3172.
(6) Ye, S. F.; Wang, D.; Min, S. G. Successive projections algorithm combined with uninformative variable elimination for spectra variable selection. Chemom. Intell. Lab. Syst. 2008, 91, 194−199.
(7) Xu, H.; Liu, Z.; Cai, W.; Shao, X. A wavelength selection method based on randomization test for near-infrared spectral analysis. Chemom. Intell. Lab. Syst. 2009, 97, 89−193.
(8) Zhao, C.; Gao, F.; Wang, F. Phase-based joint modeling and spectroscopy analysis for batch processes monitoring. Ind. Eng. Chem. Res. 2010, 49, 669−681.
(9) Geladi, P.; Kowalski, B. R. Partial least-squares regression - a tutorial. Anal. Chim. Acta 1986, 185, 1−17.
(10) Brereton, R. G. Introduction to multivariate calibration in analytical chemistry. Analyst 2000, 125 (11), 2125−2154.
(11) Kleinbaum, D. G., Kleinbaum, D. G. Applied Regression Analysis and Other Multivariable Methods, 4th ed.; Thomson Brooks/Cole: Australia, 2008; p 906.
(12) Kutner, M. H., Nachtsheim, C, Neter, J. Applied Linear Regression Models, 4th ed.; McGraw-Hill/Irwin: Boston, 2004; p 701.
(13) Ergon, R. Reduced PCR/PLSR models by subspace projections. Chemom. Intell. Lab. Syst. 2006, 81 (1), 68−73.
(14) Hyvarinen, A.; Oja, E. Independent component analysis: Algorithms and applications. Neural Networks 2000, 13 (4−5), 411−430.
(15) Chen, J.; Wang, X. Z. A new approach to near-infrared spectral data analysis using independent component analysis. J. Chem. Inf. Comput. Sci. 2001, 41 (4), 992−1001.
(16) Gustafsson, M. G. Independent component analysis yields chemically interpretable latent variables in multivariate regression. J. Chem. Inf. Model. 2005, 45, 1244−1255.
(17) Westad, F. Independent component analysis and regression applied on sensory data. J. Chemom. 2005, 19 (3), 171−179.
(18) Shao, X. G.; Wang, W.; Hou, Z. Y.; Cai, W. S. A new regression method based on independent component analysis. Talanta 2006, 69 (3), 676−680.
(19) Andrew, A.; Fearn, T. Transfer by orthogonal projection: Making near-infrared calibrations robust to between-instrument variation. Chemom. Intell. Lab. Syst. 2004, 72 (1), 51−56.
(20) Wulfert, F.; Kok, W. T.; Smilde, A. K. Influence of temperature on vibrational spectra and consequences for the predictive ability of multivariate models. Anal. Chem. 1998, 70 (9), 1761−1767.
(21) Zhao, C. H.; Gao, F. R.; Niu, D.; Wang, F. A two-step basis vector extraction strategy for multiset variable correlation analysis. Chemom. Intell. Lab. Syst. 2011, 107 (1), 147−154.
(22) Zhao, C. H.; Gao, F. R. A Two-step Multiset Regression Analysis (MsRA) Algorithm. Ind. Eng. Chem. Res. 10.1021/ie201608f.
(23) Zhao, C. H.; Gao, F. R. Between-phase based Statistical Analysis and Modeling for Transition Monitoring in Multiphase Batch Processes. AIChE J. 10.1002/aic.12783.
(24) Zhao, C. H.; Yao, Y.; Gao, F. R.; Wang, F. L. Statistical Analysis and Online Monitoring for Multimode Processes with Between-mode Transitions. Chem. Eng. Sci. 2010, 65 (22), 5961−5975.
(25) Zhao, C. H.; Mo, S. Y.; Gao, F. R.; Lu, N. Y.; Yao, Y. Statistical Analysis and Online Monitoring for Handling Multiphase Batch Processes with Varying Durations. J. Process Control 2011, 21 (6), 817−829.
(26) Lee, J.; Qin, S. J.; Lee, I. Fault detection and diagnosis based on modified independent component analysis. AIChE J. 2006, 52, 3501−3514.
(27) Johnson, R. A., Wichern, D. W. Applied Multivariate Statistical Analysis, 2nd ed.; Prentice-Hall: Englewood Cliffs, NJ, 1988; p 607.
(28) Kutner, M. H., Nachtsheim, C, Neter, J. Applied Linear Regression Models, 4th ed.; McGraw-Hill/Irwin: Boston, 2004; p 701.