Ind. Eng. Chem. Res. 2006, 45, 3843-3852


Nonlinear Projection to Latent Structures Method and Its Applications

Shi Jian Zhao,† Jie Zhang,*,‡ Yong Mao Xu,† and Zhi Hua Xiong†

Department of Automation, Tsinghua University, Beijing 100084, China, and School of Chemical Engineering and Advanced Materials, University of Newcastle, Newcastle upon Tyne NE1 7RU, U.K.

* To whom correspondence should be addressed. Tel.: +44-191-2227240. Fax: +44-191-2225292. E-mail: [email protected]. † Tsinghua University. ‡ University of Newcastle.

Projection to latent structures or partial least squares (PLS) is one of the most powerful linear regression techniques for dealing with noisy and highly correlated data. To address the inherent nonlinearity present in the data from many industrial applications, a number of approaches have been proposed to extend it to the nonlinear case, and most of them rely on the nonlinear approximation capability of neural networks. However, these methods either merely introduce nonlinearities into the inner relationship model within the linear PLS framework or suffer from training a complicated network. In this paper, starting from an equivalent presentation of PLS, both nonlinear latent structures and nonlinear reconstruction are obtained straightforwardly through two consecutive steps. First, a radial basis function (RBF) network is utilized to extract the latent structures through linear algebra methods without the need for nonlinear optimization. This is followed by developing two feedforward networks to reconstruct the original predictor variables and response variables. Extraction of multiple latent structures can be achieved in either sequential or parallel fashion. The proposed methodology thus possesses the capability of explicitly characterizing the nonlinear relationship between the latent structures and the original predictor variables while exhibiting a fast convergence speed. This approach is appealing in diverse applications such as developing soft sensors and statistical process monitoring. It is assessed through both a mathematical example and the monitoring of a simulated batch polymerization reactor.

1. Introduction

Projection to latent structures or partial least squares (PLS) is a popular linear regression method based on projecting the information in a high-dimensional space down onto a low-dimensional space defined by some latent variables. It is successfully applied in diverse fields to deal with noisy and highly correlated data with, quite often, only a limited number of observations available.1-3 When dealing with nonlinear systems, this approach assumes that the underlying nonlinear relationship between predictor variables (X) and response variables (Y) can be locally approximated through linear regression models. However, data from practical situations such as industrial plants generally exhibit strong nonlinear behavior. To address the inherent nonlinearity, a number of approaches have been proposed to extend PLS to the nonlinear case.

The simplest route to incorporating nonlinearities into PLS is to extend the input data matrix X with squares or cross terms and then project onto this extended space, corresponding to the projection of the original X onto a quadratic surface.4 This method is practical only for a limited number of independent X variables, whereas it is unwieldy when many collinear X variables are present. Furthermore, because the overall linear PLS decomposition can be regarded as the process of calculating the outer model and inner model consecutively, many researchers have suggested handling the nonlinear behavior in the data through introducing nonlinear terms into the inner relationship as an alternative. For example, if the projections of X and Y are forced to be quadratic or, more generally, spline, then nonlinear PLS with quadratic (QPLS) or spline inner relations5,6 are obtained. These methods have the advantages of simple computations and algorithms but exhibit limited approximation power; they were generalized by Qin and McAvoy7 through adopting a neural network in the inner model while keeping the determination of the outer model unaltered.

Since a neural network with one hidden layer of sigmoidal units can model any nonlinear function, this neural network PLS (NNPLS) algorithm combines the universal approximation capabilities of feed-forward neural networks with the robustness of PLS regression, and it is, thus, unnecessary to specify a particular form of nonlinear function.8 The appropriate number of latent variables is generally identified through cross-validation.9 Recognizing that both QPLS and NNPLS merge nonlinear regression techniques within the framework of the linear PLS algorithm and that the input weight updating procedures are obscure, Baffi et al.10,11 proposed new QPLS and NNPLS methods with error-based updating algorithms, which were shown to provide improved modeling capabilities. In addition, Wilson et al.12 substituted the sigmoidal function of NNPLS with radial basis function (RBF) neural networks to implement the inner mapping and pointed out that creating a nonlinear outer model to complement the inner model is their future direction. Besides, Holcomb and Morari13 and Malthouse et al.14 presented two nonlinear PLS algorithms based upon global network structures. These approaches are only qualitatively comparable to the above-reviewed algorithms and require nonlinear optimization algorithms. More recently, based on a neural network structure that preserves the orthogonality properties of nonlinear principal component analysis (NLPCA),15 Doymaz et al.16 proposed an orthogonal version of the nonlinear PLS (O-NLPLS) of Malthouse et al.14

The current nonlinear PLS algorithms can be roughly classified into two categories. Methodologies such as QPLS5 and NNPLS7 merely introduce nonlinearities into the inner relationship model within the linear PLS framework, with the outer model unaltered. As a consequence, they lack the ability to characterize a nonlinear relationship between the original predictor variables and the latent structures. On the other hand, approaches such as NLPLS14 suffer from the requirement of training a complex network with three hidden layers.


On the basis of this, an equivalent presentation of PLS in the form of an optimization problem is first presented in this paper, and it is illustrated that linear PLS can actually be performed within two consecutive steps. First, only the latent structure t is obtained by maximizing its covariance with the response variables Y, while the generally adopted calculation of u is totally unnecessary. The retained t is then employed to reproduce the original data X and Y. This formulation is then naturally extended to the nonlinear case along the same line through two separate stages. In the first stage, one radial basis function (RBF) network is developed to extract the latent structures through linear algebra methods without the need for nonlinear optimization. This is followed by the second stage, where two feedforward networks are devised to reconstruct the original predictor variables and response variables, respectively.

It is apparent that there exist various possible nonlinear extensions of linear PLS from different points of view. The proposed method distinguishes itself from the other existing nonlinear PLS methods in several aspects:

• A novel, straightforward, and intuitive presentation and interpretation for both linear and nonlinear PLS can be obtained within the proposed framework. Actually, it is shown that the proposed approach, at its extreme when all activation functions of the neural nodes are linear, becomes the linear PLS decomposition.
• It possesses the capability of characterizing the nonlinear relationship between the latent structures and the original predictor variables while simultaneously reconstructing the original data nonlinearly.
• Extraction of multiple latent structures can be achieved in either sequential or parallel fashion. When trained in the parallel manner, the orthogonality between the latent structures in linear PLS can still be preserved.

This paper is organized as follows. After illustrating an equivalent presentation of linear PLS in Section 2, this formulation is extended to the nonlinear case in Section 3 and some practical considerations are also discussed, together with a brief comparison with the existing methodologies. In Section 4, the potential benefits of the proposed approach are assessed through both a mathematical simulation example and the monitoring of a simulated batch polymerization reactor. Section 5 concludes this paper.

2. Equivalent Optimization Presentation of PLS

2.1. PLS in the Form of Outer and Inner Models. PLS was originally characterized as an iterative bootstrap process, and its related mathematical properties can be found in refs 1, 3, 17, and 18. Assume that both X = [x_1, ..., x_p] ∈ R^{n×p} and Y = [y_1, ..., y_q] ∈ R^{n×q} assemble the variables in their columns and are of zero mean and unit variance. As shown in Figure 1, for the first latent variable, PLS employs an outer model containing score vectors t_1 for X and u_1 for Y:

X = t_1 p_1^T + E_1, \quad Y = u_1 q_1^T + \tilde{F}_1

where p_1 and q_1 are the corresponding loading vectors. A corresponding inner model is then constructed to relate these score vectors:

u_1 = f(t_1) + \tilde{r}_1 = \hat{u}_1 + \tilde{r}_1

For linear PLS, f(t_1) = b_1 t_1, where b_1 is a coefficient determined through the ordinary least-squares method. For QPLS and NNPLS, f(t_1) is chosen as the corresponding quadratic and "generic" neural network based nonlinear regression relationship, respectively.

Figure 1. Outer and inner model presentation for the linear PLS decomposition.

After both the outer and inner models are identified, the original X and Y are deflated as

E_1 = X - t_1 p_1^T
F_1 = Y - \hat{u}_1 q_1^T

Then the second latent variable is calculated based on the residuals E_1 and F_1 following the same procedure. This is repeated until the required number of latent variables is obtained.

2.2. Equivalent Optimization Presentation. For every pair of latent variables, the above PLS decomposition algorithm can be summarized as three sequential steps: (1) Calculate both score vectors, t and u. (2) Model the relationship between t and u. (3) Utilize the score t and the approximation of the score u, \hat{u}, to predict the original data X and Y.

Since the score vector u actually does not appear explicitly in the final model, it is reasonable to infer, from intuition, that the calculation of u is not essential. As a matter of fact, statisticians have presented perspectives of PLS in favor of a more interpretable version, such as those in refs 19-22. In all the following methodologies, the calculation of u is not required. It can be proven that the ith linear latent variable t_i = X w_i of the data matrices X and Y can be reformulated as an optimization problem (the corresponding proof is deferred to Appendix A for brevity):

w_i = \arg\max_{w} \sum_{j=1}^{q} \langle t_i, y_j \rangle^2 = \arg\max_{w} \sum_{j=1}^{q} \langle X w, y_j \rangle^2    (1)

subject to

w^T w = 1 \quad \text{and} \quad w^T \Sigma_X W_{i-1} = 0^T    (2)

where \Sigma_X = X^T X and W_{i-1} = [w_1, ..., w_{i-1}]. The number of required latent structures, A, can be determined through cross-validation.9 After all latent structures t_i, i = 1, ..., A, are extracted, setting E_0 = X and F_0 = Y, a regression performed between the obtained t_i and the original X and Y will result in the following similar decomposition:

E_i = E_{i-1} - t_i p_i^T \quad \text{with} \quad p_i = \frac{E_{i-1}^T t_i}{t_i^T t_i} = \frac{X^T t_i}{t_i^T t_i}    (3)

F_i = F_{i-1} - t_i r_i^T \quad \text{with} \quad r_i = \frac{F_{i-1}^T t_i}{t_i^T t_i} = \frac{Y^T t_i}{t_i^T t_i}    (4)
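Read this way, eqs 1-4 translate almost directly into code. The sketch below is a minimal numerical illustration of this equivalent presentation (it is not the authors' implementation, and all function and variable names are ours): each weight vector is the dominant eigenvector of X^T Y Y^T X restricted to the subspace permitted by eq 2, and the loadings then follow eqs 3 and 4.

```python
import numpy as np

def equivalent_pls(X, Y, n_components):
    """Linear PLS through the optimization form of eqs 1-4 (parallel version).

    Each w_i maximizes sum_j <Xw, y_j>^2 = w^T X^T Y Y^T X w subject to
    w^T w = 1 and w^T Sigma_X W_{i-1} = 0^T, with Sigma_X = X^T X.
    """
    p = X.shape[1]
    M = X.T @ Y @ Y.T @ X              # objective matrix of eq 1
    Sigma_X = X.T @ X
    W = np.zeros((p, 0))
    for _ in range(n_components):
        B = Sigma_X @ W                # constraint directions of eq 2
        if B.shape[1] > 0:
            # orthonormal basis N of the null space of B^T: any w = N z
            # automatically satisfies w^T Sigma_X W_{i-1} = 0^T
            _, s, Vt = np.linalg.svd(B.T)
            N = Vt[np.sum(s > 1e-10):].T
        else:
            N = np.eye(p)
        # maximize z^T (N^T M N) z with ||z|| = 1 -> dominant eigenvector
        _, vecs = np.linalg.eigh(N.T @ M @ N)
        w = N @ vecs[:, -1]
        W = np.column_stack([W, w])
    T = X @ W                              # latent structures t_i = X w_i
    P = X.T @ T / np.sum(T**2, axis=0)     # loadings for X, eq 3
    R = Y.T @ T / np.sum(T**2, axis=0)     # loadings for Y, eq 4
    return W, T, P, R

# usage sketch, for zero-mean data matrices X and Y:
# W, T, P, R = equivalent_pls(X, Y, n_components=2)
```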


Figure 2. Architecture of the networks used to reconstruct X and Y.

Note that here, as shown in ref 3, the latent variable t_i is defined with respect to the original data X rather than the residual E_{i-1}. Furthermore, in both eq 3 and eq 4, the regression coefficients p_i and r_i can also be computed based on either the original X and Y or the residuals E_{i-1} and F_{i-1}. Hence, the PLS decomposition can also be formulated as follows (parallel version):

(1) Obtain all the required A latent variables T = [t_1, ..., t_A] based on the original data X and Y through eqs 1 and 2.
(2) Reconstruct X and Y based on T using eqs 3 and 4 by minimizing the residuals E_i and F_i.

In addition, the linear PLS decomposition can also be achieved in a sequential fashion. It stems from the parallel version, but in this case, after the current latent variable t is obtained, it is employed immediately to deflate the original data. Similarly, this can be summarized as follows (serial version):

(1) Set i = 1, E_0 = X, and F_0 = Y.
(2) Calculate the ith latent variable t_i = E_{i-1} w_i,

w_{i,\text{opt}} = \arg\max_{w^T w = 1} \sum_{j=1}^{q} \langle E_{i-1} w, F_{i-1,j} \rangle^2    (5)

where F_{i-1,j} is the jth column of F_{i-1}.
(3) Reconstruct E_{i-1} and F_{i-1} based on t_i by minimizing the residuals E_i and F_i.
(4) If the required number of latent variables, which is usually determined through cross-validation,9 has been obtained, then stop; otherwise, set i = i + 1 and go to step 2.

According to the presented equivalent presentation, PLS can be regarded as, roughly speaking, two related but relatively independent steps:

(1) Projection step: calculate the latent variables that have maximum covariance with the response variables under certain regularization constraints.
(2) Reconstruction step: reconstruct the original data X and Y or, equivalently, deflate them, using the latent variables T.

3. Novel Nonlinear Projection to Latent Structures Algorithm

Although the different presentations of linear PLS are mathematically equivalent, extensions to the nonlinear case may perform differently if different methodologies are adopted. As pointed out in refs 14 and 16, the predictor variables can correlate nonlinearly with each other and with the response variables. Therefore, in this paper, the projection step is made nonlinear to identify the nonlinearities within the predictor variables, while the reconstruction step is also nonlinear in order to characterize the nonlinear relationship between the predictor variables and the response variables.

3.1. Nonlinear Optimization Objective Functions. As aforementioned, the main idea of PLS may be summarized as performing the projection step to attain the latent variables T and then the reconstruction step to reproduce the original data sets. Intuitively, if T is related to the original predictor variables X through a certain nonlinear mapping, the projection part of PLS is then extended to the nonlinear case. Analogously, if the original X and Y are reproduced from T nonlinearly, the reconstruction part will then also be nonlinear. The next sections clarify these two parts briefly.

3.1.1. Nonlinear Projection. Considering only the ith latent structure and proceeding analogously to eq 1, assume that t_i = g_i(E_{i-1}), where g_i(·) is a certain nonlinear mapping satisfying specific regularization constraints.

Figure 3. RBF network for extracting one nonlinear latent variable.

The search for t_i can then be formulated as

maximize J(t_i) = \sum_{j=1}^{q} \langle t_i, f_{i-1,j} \rangle^2    (6)

where F_{i-1} = [f_{i-1,1}, ..., f_{i-1,q}]. On the other hand, if it is required that A latent variables T = [t_1, ..., t_A] = g(X) should be extracted simultaneously, eq 6 can be modified correspondingly:

maximize J(t_i) = \sum_{k=1}^{q} \langle t_i, y_k \rangle^2, \quad \text{for } i = 1, ..., A    (7)

subject to the orthogonality constraint

t_i^T t_j = 0, \quad \text{for } j < i    (8)

It is natural that the finally obtained t_i depends on the selection of the mapping function g_i(·), and Section 3.2 illustrates one example implemented through RBF networks.

3.1.2. Nonlinear Reconstruction. This step aims at minimizing the reconstruction errors \tilde{X} = X - \hat{X} and \tilde{Y} = Y - \hat{Y}. Because a neural network with one hidden layer of sigmoidal units can model any nonlinear function with a bounded number of hidden units,23-25 two separate networks are employed to reconstruct X and Y from the obtained latent structures T, as shown in Figure 2. Here, hidden-layer neurons use the sigmoidal activation function, whereas output-layer neurons use the linear activation function.

3.2. RBF Network Implementation. Although the objective functions in eqs 6 and 7 can be approached by a certain feedforward neural network through nonlinear optimization techniques such as the gradient ascent algorithm, such techniques generally suffer from difficulties such as slow convergence and falling into local minima. Radial basis function (RBF) networks can overcome these problems elegantly. The methodology introduced in ref 26 for nonlinear PCA can be used for parametrizing the PLS method and incorporating nonlinearities into the PLS models to produce nonlinear latent factors. An RBF network, shown in Figure 3, is employed for the extraction of latent structures (for simplicity, only the extraction of one latent variable is illustrated here).


The centers C and widths σ of the employed Gaussian radial basis functions are determined independently prior to the calculation of the weights w. It should be noted that linear bypass connections from the input directly to the output layer are employed in order to ease the characterization of linear mappings.13 C and σ denote the centers and the corresponding widths of the typically employed Gaussian RBF function

G_k(x) = \exp\left( -\frac{|x - C_k|^2}{2\sigma_k^2} \right)

for the neurons in the hidden layer, respectively. For input patterns X ∈ R^{n×p}, let f(X) ∈ R^{n×h} be the output of the neurons in the RBF hidden layer, with h neurons in total. Extraction of the ith latent variable t_i can then be formulated as

t_i = g_i(X) = [X \; f(X)] w    (9)

Letting \tilde{f}(X) = [X \; f(X)] and substituting it into the objective function eq 6 gives

maximize J(t_i) = \sum_{j=1}^{q} \langle \tilde{f}(X) w, y_j \rangle^2    (10)

For simplicity, the regularization constraint is chosen as w^T w = 1. The matrix A can be denoted as follows:

A = \sum_{j=1}^{q} \tilde{f}^T(X) y_j y_j^T \tilde{f}(X) = \tilde{f}^T(X) Y Y^T \tilde{f}(X)

Equation 10 can, thus, be represented as

maximize J(t_i) = \sum_{j=1}^{q} \langle \tilde{f}(X) w, y_j \rangle^2 = w^T A w    (11)

subject to w^T w = 1. It is apparent that this is a principal component decomposition problem or, equivalently, the equation used to find the first linear latent variable between the data \tilde{f}(X) and Y. It can be solved easily using linear algebra methods. Furthermore, if A latent variables t_i = \tilde{f}(X) w_i, i = 1, ..., A, are required to be extracted simultaneously, eq 7 can be reformulated analogously,

maximize J(t_i) = \sum_{j=1}^{q} \langle \tilde{f}(X) w_i, y_j \rangle^2 = w_i^T A w_i    (12)

for i = 1, ..., A, subject to

w_i^T w_i = 1 \quad \text{and} \quad t_i^T t_j = 0, \quad \text{for } j < i    (13)
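To make this two-stage construction concrete, the sketch below (our own illustration, not the authors' code) forms the augmented regressor f̃(X) = [X f(X)] with Gaussian hidden units whose centers come from k-means and then extracts the weight vectors by plain linear algebra. The width heuristic and the use of scikit-learn's KMeans are assumptions — the paper itself favors rival penalized competitive learning for the centers — and the orthogonality of the scores is enforced here by deflating f̃(X), one standard way of realizing the constraint in eq 13.

```python
import numpy as np
from sklearn.cluster import KMeans   # stand-in for the clustering stage

def rbf_features(X, centers, widths):
    """Gaussian hidden-layer outputs G_k(x) = exp(-|x - C_k|^2 / (2 sigma_k^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * widths**2))

def nonlinear_projection(X, Y, n_components, n_hidden=8, seed=0):
    """Extract latent structures T = f~(X) W in the spirit of eqs 9-13."""
    centers = KMeans(n_clusters=n_hidden, n_init=10,
                     random_state=seed).fit(X).cluster_centers_
    # simple width heuristic (assumption): mean distance between centers
    dists = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
    widths = np.full(n_hidden, dists[dists > 0].mean())
    F = np.hstack([X, rbf_features(X, centers, widths)])   # f~(X) = [X f(X)]
    W, T = [], []
    Fd = F.copy()
    for _ in range(n_components):
        A = Fd.T @ Y @ Y.T @ Fd      # analogue of the A matrix in eq 12,
        _, vecs = np.linalg.eigh(A)  # built here from the deflated f~(X)
        w = vecs[:, -1]              # unit-norm dominant eigenvector
        t = Fd @ w
        W.append(w)
        T.append(t)
        Fd = Fd - np.outer(t, t @ Fd) / (t @ t)   # deflate so the next score is orthogonal to t
    return np.column_stack(T), (centers, widths), np.column_stack(W)
```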

This is equivalent to extracting the first A latent structures between \tilde{f}(X) and Y and can be easily worked out as well. Training of the employed RBF network is, thus, simplified into a two-step procedure, where the centers and widths of the Gaussian RBF network are predetermined through a clustering method while the weights can be obtained by a simple linear algebra calculation. Using this two-stage training strategy, the latent structures can be obtained without much computational effort while keeping their nonlinearity. Therefore, the proposed nonlinear PLS algorithm can be regarded as the combination of two steps:

(1) Obtain the latent variables by the RBF network through linear algebra without nonlinear optimization, provided that the parameters C and σ have been obtained, which can usually be achieved using a certain clustering method and a corresponding width-determination algorithm;
(2) Reproduce or reconstruct the original data X and Y using two independent neural networks, as shown in Figure 2, through nonlinear optimization methods such as error back-propagation27 and the Levenberg-Marquardt algorithm.28

3.3. Parallel Versus Sequential Version. Analogous to the autoassociative nonlinear principal component extraction by Kramer,29 the extraction of latent variables can be achieved in either a parallel or a sequential manner. For the sequential decomposition case, the reconstruction of X and Y by the first latent variable t_1 is first subtracted from the original data X and Y, and the residuals then serve as the original data for the extraction of the next latent variable. This procedure is repeated until a certain stopping criterion is satisfied and can be briefly summarized as follows.

Algorithm 1. Sequential nonlinear PLS:
(1) Scale X and Y to zero mean and unit variance. Let E_0 = X and F_0 = Y, and set i = 1.
(2) For inputs E_{i-1} and F_{i-1}, construct the ith RBF network and determine the corresponding centers and widths using a certain clustering algorithm. Calculate the ith nonlinear latent structure t_i using the linear PLS algorithm.
(3) Develop a new network to reproduce E_{i-1}, utilizing t_i as the inputs and denoting the residual as E_i.
(4) Set up a new network to reproduce F_{i-1}, utilizing t_i as the inputs and denoting the approximation error as F_i.
(5) If an additional latent structure is necessary, let i = i + 1 and go to step 2. Otherwise, stop.

However, this sequential algorithm has been argued to risk mistakenly extracting nonlinear terms from the predictor variables X when finding subsequent latent structures in nonlinear additive models,30 which is confirmed by the research of Malthouse et al. for their proposed methodology.14 Therefore, a parallel version of the proposed method is also presented (a sketch follows the steps below). It is worth noting that, compared with its sequential counterpart, fewer networks need to be trained, which significantly alleviates the computational cost.

Algorithm 2. Parallel nonlinear PLS:
(1) Scale X and Y to zero mean and unit variance. Specify the required number of latent structures, A.
(2) For the inputs X, initialize an RBF network and determine the corresponding centers and widths using a certain clustering algorithm. Calculate the nonlinear latent structures T using the linear PLS algorithm.
(3) Develop one network to reproduce X, utilizing T as the inputs and denoting the reconstruction residual as E_A.
(4) Set up one network to reproduce Y based on T and denote the fitting error as F_A.
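A compact sketch of Algorithm 2 is given below. It reuses the nonlinear_projection helper from the previous listing and employs scikit-learn's MLPRegressor (sigmoidal hidden layer, linear output) as a generic stand-in for the constructively built, Levenberg-Marquardt-trained reconstruction networks of Section 3.4; the network sizes and solver settings are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor   # stand-in reconstruction nets

def parallel_n2pls(X, Y, n_components, n_hidden_rbf=8, n_hidden_mlp=5, seed=0):
    """Parallel N2PLS (Algorithm 2): one projection, two reconstruction networks."""
    # (1) scale X and Y to zero mean and unit variance
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    # (2) nonlinear latent structures T via the RBF projection sketched earlier
    T, rbf_params, W = nonlinear_projection(Xs, Ys, n_components,
                                            n_hidden=n_hidden_rbf, seed=seed)
    # (3) and (4): two independent feedforward networks reproduce X and Y from T
    net_x = MLPRegressor(hidden_layer_sizes=(n_hidden_mlp,), activation="logistic",
                         solver="lbfgs", max_iter=2000, random_state=seed).fit(T, Xs)
    net_y = MLPRegressor(hidden_layer_sizes=(n_hidden_mlp,), activation="logistic",
                         solver="lbfgs", max_iter=2000, random_state=seed).fit(T, Ys)
    E_A = Xs - net_x.predict(T).reshape(Xs.shape)   # reconstruction residual for X
    F_A = Ys - net_y.predict(T).reshape(Ys.shape)   # fitting error for Y
    return T, rbf_params, W, net_x, net_y, E_A, F_A
```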
3.4. Implementation Issues. In the previous sections, a framework for nonlinear PLS was developed, but specific issues must be taken into account when applying this strategy in practice. These considerations are clarified as follows.

3.4.1. Training of the RBF Network for Projection. Unlike the generally presented supervised learning, the RBF network employed in the projection step is trained in an unsupervised manner. But, similar to the supervised case, this training is also performed through two separate stages rather than in one pass using nonlinear optimization algorithms. In the first stage, the centers and widths of the RBF functions are obtained through certain clustering methodologies. Generally, the k-means algorithm is adopted to allocate the centers for a given number of clusters.31

However, the rival penalized competitive learning method32 seems more appealing in practice because of its capability of determining both the number of clusters and the centers themselves automatically. As aforementioned, in the second stage, the latent structures can be obtained directly through the PLS algorithm without nonlinear optimization being involved.

3.4.2. Training of the Reconstruction Networks. After the RBF network is developed, the latent structures are then ready for fitting the predictor and response variables. Obviously, the configurations of these two networks do not interfere and can, thus, be determined independently. For each reconstruction network, only the number of units in the hidden layer is undetermined for the sequential version, which has only the currently concerned latent structure presented. This still holds for the parallel version, provided the retained number of latent structures is specified beforehand. The central objective is to find an appropriate network structure. A network that is too small generally fails to learn the inherent complexities in the data, whereas one that is too large is apt to "overfit" the data. This issue has been investigated by different researchers and can be readily addressed through methods such as constructive learning,33-35 pruning,36 and sequential learning.37

Another issue closely related to network topology selection is the choice of training algorithm. Generally, the learning process is performed in order to best fit the data or, equivalently, to minimize the sum of squared errors between the network outputs and the training data. Algorithms such as error back-propagation27 and its many varieties28,38,39 are generally adopted. Unfortunately, although many of them exhibit fast convergence and little computational requirement, they do not necessarily generate a network with good generalization ability and usually lead to overfitting when the network size is relatively large. To circumvent this difficulty, techniques such as early stopping40 and regularization41-43 should be used. In this paper, the reconstruction networks are developed through a constructive algorithm33 and are trained by the Levenberg-Marquardt method.28 Either the regularization technique with Bayesian methods to determine the regularization parameters automatically41-43 or the early-stopping algorithm40 is adopted to overcome possible overfitting.

3.4.3. Selection of the Number of Retained Latent Structures. This parameter is determined through cross-validation.9 The original data are split into three subsets for training, validation, and testing, respectively. The number of retained latent structures is chosen as the configuration which exhibits the best prediction ability, i.e., the minimum prediction error, for the validation data set. Since the original predictor variables are augmented through the RBF network, it is possible, in principle, for the number of latent structures to be greater than the number of the original predictor variables.

Finally, it is noteworthy that, although not included in the proposed approach, linear bypass connections13 from the input directly to the output layer of the networks in the reconstruction step are also, if desired, viable for easing the characterization of linear functionalities.

3.5. Discussion. In this section, the proposed approach and other existing techniques are compared for a better understanding of their similarities and differences.
Without loss of generality, among the existing algorithms, only some representative ones, including the NNPLS of Qin and McAvoy7 and the NLPLS of Malthouse et al.,14 are mainly considered. For brevity, the proposed novel nonlinear PLS approach will hereinafter be abbreviated as N2PLS.

3.5.1. Consistency with PLS. For the simplest case, the developed networks contain linear nodes only and no hidden layer.

Figure 4. Linear extreme case of NLPLS.

More precisely, for NLPLS, linear nodes in the bottleneck layer are preserved, while for the RBF network extracting the latent structures of N2PLS, only the linear bypass connections are adopted. All three approaches then represent linear models. It is trivial to verify that both N2PLS and NNPLS will, under this condition, perform PLS, whereas for NLPLS, this is not the case. Consider the linear extreme of NLPLS as shown in Figure 4, where y is univariate and only the first latent structure is pursued; the objective is to minimize the fitting error through the following:14

\min_{w,v} E = |X - \hat{X}|^2 + |y - \hat{y}|^2    (14)

By denoting Z = [X \; y] and \hat{Z} = [\hat{X} \; \hat{y}], eq 14 can be reformulated as

\min_{w,v} E = |Z - \hat{Z}|^2    (15)

where w ∈ R^{p×1}, v ∈ R^{(p+1)×1}, t = Xw, and \hat{Z} = t v^T. Note that, unlike PLS, which pursues latent structures that have maximum covariance with the response variables, NLPLS strives to obtain the one which can recover both the predictor and response variables simultaneously in a symmetric manner. Since these objectives are generally inconsistent, NLPLS, therefore, in some sense deviates from the initial motivation of regression. Actually, this linear network has been thoroughly investigated, and there is an explicit conclusion44,45 that E attains, up to equivalence, the unique local and global minimum when

w = C \Sigma_{XX}^{-1} \Sigma_{XZ} u \quad \text{and} \quad v = C^{-1} u    (16)

where C is a nonzero constant, \Sigma_{XX} = X^T X, \Sigma_{XZ} = X^T Z, \Sigma_{ZX} = Z^T X, and u is the eigenvector corresponding to the largest eigenvalue of \Sigma = \Sigma_{ZX} \Sigma_{XX}^{-1} \Sigma_{XZ}. It is apparent that the obtained w, which serves as the loading vector for the latent structure in PLS, is generally distinct from the solution of PLS, where w is the eigenvector corresponding to the largest eigenvalue of X^T Y Y^T X. This can be demonstrated through a simple example using data from ref 46. For PLS,

w_1 = [0.1900, -0.5371, -0.4138, 0.6664, 0.2375, -0.0615]^T

while the previous network produces

w_1 = C\,[0.4124, 0.4114, 0.4169, 0.4176, 0.3751, 0.4144]^T

These two vectors are almost orthogonal, since the angle between them is approximately 88.5°. It can be inferred intuitively that, if the predictor variables contain information that is not associated with the response variables, the solution given by NLPLS will generally behave poorly, because much of the effort has been devoted to the recovery of the predictor variables.
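This contrast is easy to reproduce numerically. The short sketch below (our illustration; the data set of ref 46 is not reproduced here, so randomly generated data are used instead) computes the first PLS weight as the dominant eigenvector of X^T y y^T X, the weight of the linear network according to eq 16, and the angle between the two.

```python
import numpy as np

def first_pls_weight(X, y):
    """Dominant eigenvector of X^T y y^T X, i.e., the first PLS weight."""
    _, vecs = np.linalg.eigh(X.T @ np.outer(y, y) @ X)
    return vecs[:, -1]

def linear_nlpls_weight(X, y):
    """w of eq 16 (up to the constant C): w = Sigma_XX^{-1} Sigma_XZ u, with u
    the leading eigenvector of Sigma_ZX Sigma_XX^{-1} Sigma_XZ and Z = [X y]."""
    Z = np.column_stack([X, y])
    Sxx, Sxz = X.T @ X, X.T @ Z
    _, vecs = np.linalg.eigh(Sxz.T @ np.linalg.solve(Sxx, Sxz))
    w = np.linalg.solve(Sxx, Sxz @ vecs[:, -1])
    return w / np.linalg.norm(w)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=100)
w_pls, w_lin = first_pls_weight(X, y), linear_nlpls_weight(X, y)
cosine = np.clip(abs(w_pls @ w_lin), 0.0, 1.0)
print(f"angle between the two weight vectors: {np.degrees(np.arccos(cosine)):.1f} deg")
```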


Figure 5. Network utilized by NNPLS to fit the inner relationship between t1 and u1.

3.5.2. Comparison of NNPLS and Sequential N2PLS. Consider a special case of sequential N2PLS where both the projection and the reconstruction of the predictor variables X are implemented through the aforementioned extreme linear networks. For the first latent structure, it is apparent that, in this case, the projection network will produce the same first latent structure t_1 as PLS, while X is approximated by \hat{X} = t_1 p_1^T, where p_1 is calculated in the same manner as for PLS. As far as NNPLS, shown in Figure 5, is concerned, since its first latent structure t_1 is computed in the outer model using the PLS algorithm and then also fits X as \hat{X} = t_1 p_1^T, this approach is identical to sequential N2PLS. However, a difference arises in the approximation of Y. For NNPLS, the response variable Y is approximated as \hat{Y} = \hat{u}_1 q_1^T, where \hat{u}_1 = \sigma(t_1 w_1^T + 1_n \beta_1^T) w_2 + 1_n \beta_2^T and q_1 is obtained prior to the inner relationship fitting as

q_1 = \arg\min_{|q| = 1} |Y - t_1 q^T|^2

where 1_n is an n-dimensional column vector with all elements equal to 1. This strategy obviously does not make adequate use of the capability of t_1 to predict Y. Actually, even if the reconstruction network with respect to Y in N2PLS employs the same topology and weights from the input t_1 to the hidden layer, while the weights from the hidden layer to the output are determined through regression between \sigma(t_1 w_1^T + 1_n \beta_1^T) and Y by the least-squares algorithm, it will still generally present a better prediction of Y as \hat{Y} = \sigma(t_1 w_1^T + 1_n \beta_1^T) \tilde{w}_2 + 1_n \tilde{\beta}_2^T, where

(\tilde{w}_2, \tilde{\beta}_2) = \arg\min_{\tilde{w}, \tilde{\beta}} \tilde{J} = |Y - \sigma(t_1 w_1^T + 1_n \beta_1^T) \tilde{w} - 1_n \tilde{\beta}^T|^2
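The refit above is an ordinary linear least-squares problem in the output-layer parameters. A small sketch, with the hidden activations H = σ(t_1 w_1^T + 1_n β_1^T) taken as given (the function and variable names are hypothetical):

```python
import numpy as np

def refit_output_layer(H, Y):
    """Solve min |Y - H w - 1_n b^T|^2 for the output weights and bias,
    with H the fixed hidden-layer activations."""
    H1 = np.hstack([H, np.ones((H.shape[0], 1))])   # append a bias column
    coef, *_ = np.linalg.lstsq(H1, Y, rcond=None)
    return coef[:-1], coef[-1]                      # w_tilde, beta_tilde
```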

This indicates that, even in this extreme case, N2PLS will generally present better prediction properties. It should be noted that this deficiency of NNPLS implies that, in order to produce the same approximation accuracy, NNPLS is generally liable to utilize more latent structures. In other words, it is expected that, as opposed to NNPLS, sequential N2PLS tends to provide a more parsimonious model.

3.5.3. Orthogonality and Computational Requirements. One backbone of PLS is the orthogonality of the latent structures, which indicates that each score captures unique or nonredundant information about the predictor variables. Furthermore, when these scores are used for fitting the original variables, fewer inputs will generally lead to a smaller network and speed up the training process. Unfortunately, neither NNPLS nor NLPLS produces orthogonal latent structures, although Doymaz et al.16 introduced an orthogonal version of NLPLS at the cost of both training speed and, sometimes, stability. As aforementioned, parallel N2PLS can extract orthogonal latent structures easily. Furthermore, although N2PLS reconstructs the response variables based on the retained latent structures, this does not cause any damage to its approximation ability.

Consider the extreme case where the RBF network contains linear bypass connections only, while all the latent structures are employed for the reproduction of Y. It is apparent that, in this case, X experiences a coordinate transformation only and no information is lost. Using all these latent structures to reconstruct Y is identical to a neural network which approximates Y from X directly. In other words, N2PLS is, in principle, capable of producing solutions at least not worse than those of a neural network.

From a practical perspective, another important aspect of the nonlinear methodologies is the computational requirement. For the projection part, N2PLS inherits the rapid convergence property of the RBF network, since no nonlinear optimization is necessary at all. For the reconstruction networks with one hidden layer, as for NNPLS, many algorithms for improving the training speed have been proposed, which renders them easy to implement. On the other hand, the performance of training a network with three hidden layers, as utilized by NLPLS, deteriorates as the number of hidden layers gets larger.47 It should also be noted that, compared with the parallel version of N2PLS, the sequential one requires many more reconstruction networks to be trained when more latent structures are retained. In accordance with the conclusion of Malthouse et al.14 and our empirical experience, the parallel version is preferred.

4. Application Examples

4.1. Case Study 1: A Nonlinear Numerical Simulation Problem. A nonlinear, multidimensional simulation example is devised based on the example in ref 26 to assess the efficiency of the proposed approach. It is defined as

x_1 = t^2 - t + 1 + \epsilon_1    (17)

x_2 = \sin t + \epsilon_2    (18)

x_3 = t^3 + t + \epsilon_3    (19)

y = x_1^2 + x_1 x_2 + 3 \cos x_3 + \epsilon_4    (20)

where t is uniformly distributed over [-1, 1] and \epsilon_i, i = 1, 2, 3, 4, are noise components uniformly distributed over [-0.1, 0.1]. The generated data set with 300 samples is segmented consecutively into training, validation, and testing data sets of equal length. The data for training and testing are illustrated in Figure 6. For this problem, it is apparent that the predictor variables are driven by one latent variable t only, while the response variable is nonlinearly correlated with the predictor variables. The mean squared prediction errors (MSE) of the response variable Y on the validation data set are used to determine the retained number of latent structures. For each network configuration, the centers and widths of the RBF network are determined through the aforementioned algorithms, while the reconstruction networks are trained through the Levenberg-Marquardt algorithm28 and the training process is terminated through early stopping.40 The number of neurons in the hidden layer of the RBF network with Gaussian functions is finally chosen as 8 through the rival penalized competitive learning method.32
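For reference, the simulated data of eqs 17-20 and the consecutive, equal-length split into training, validation, and testing sets described above can be generated as in the sketch below (the random seed and the particular noise realization are, of course, not those used in the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
t = rng.uniform(-1.0, 1.0, n)
eps = rng.uniform(-0.1, 0.1, (4, n))           # noise terms eps_1..eps_4
x1 = t**2 - t + 1 + eps[0]                     # eq 17
x2 = np.sin(t) + eps[1]                        # eq 18
x3 = t**3 + t + eps[2]                         # eq 19
y = x1**2 + x1 * x2 + 3 * np.cos(x3) + eps[3]  # eq 20
X, Y = np.column_stack([x1, x2, x3]), y.reshape(-1, 1)

# consecutive, equal-length split into training, validation, and testing sets
splits = [(X[i:i + 100], Y[i:i + 100]) for i in (0, 100, 200)]
(X_tr, Y_tr), (X_va, Y_va), (X_te, Y_te) = splits
```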


Figure 6. Data sets used for training (+) and testing (·).

Figure 7. Results of the proposed approach on the training (a) and testing (b) data sets.

Table 1. MSE for the Training and Testing Data Sets

                          training data set               testing data set
  method    MSE      1 LV      2 LVs     3 LVs       1 LV      2 LVs     3 LVs
  N2PLS      X      0.0027    0.0021    0.0010      0.0034    0.0120    0.0165
             Y      0.0433    0.0420    0.0403      0.0596    0.0824    0.0935
  NNPLS      X      0.0337    0.0127    0.000       0.0257    0.0115    0.000
             Y      0.0543    0.0458    0.0422      0.0668    0.0653    0.0671
  PLS        X      0.0337    0.0051    0.000       0.0257    0.0056    0.000
             Y      0.2984    0.0899    0.0818      0.2664    0.1258    0.1070

For the two feed-forward networks with sigmoidal functions in the hidden layer and linear functions in the output layer used to reconstruct X = [x_1, x_2, x_3] and Y = y, the numbers of hidden neurons are finally chosen as 3 and 2, respectively, using the constructive algorithm.33 One latent variable is retained through cross-validation, and the results are listed in Table 1. Figure 7 shows the results of the proposed approach on the training and testing data. The upper plots of Figure 7 show the actual (·) and predicted (+) values, whereas the lower plots show the prediction errors. Results using Qin and McAvoy's NNPLS7 and PLS are also presented for comparison. It can be seen that the proposed approach provides a better solution to the underlying problem than both NNPLS and linear PLS. In fact, a single latent structure extracted by N2PLS is capable of estimating Y well, which is in sharp contrast with the performances of NNPLS and PLS, which still provide inferior results even when two or all three latent structures are employed.

Figure 8. Values of t and the corresponding values of the extracted latent structure.

It is rather interesting to plot the value of t that generates x_1-x_3 in eqs 17-20 against the corresponding latent structure, represented by the output of the RBF network, for the training data set, as shown in Figure 8. It is apparent that there exists an almost one-to-one, simple linear mapping between them. Since t is nonlinearly associated with the predictor variables, this indicates that N2PLS captures this nonlinearity effectively, at the cost of only relatively more computational effort. This inherent nonlinear structure is also demonstrated by a similar presentation for the testing data.

4.2. Case Study 2: Monitoring of a Simulated Batch Polymerization Reactor. The control of final product quality in batch reactors is generally a difficult problem because the quality variables are usually measured only after the batch is completed and, in most cases, off-line. Furthermore, the reactors exhibit nonlinear kinetics, which renders model-based inference almost impractical.48 It is, hence, advantageous if the final product quality can be estimated using some readily available on-line measurements taken from the reactor during the early stage of a batch run. Here, a pilot-scale polymerization reactor, shown in Figure 9, which was developed at the Department of Chemical Engineering, Aristotle University of Thessaloniki, Greece, is investigated.


Figure 9. Pilot-scale batch polymerization reactor.

The free radical solution polymerization of methyl methacrylate (MMA) with water solvent and benzoyl peroxide initiator is considered in this paper. The jacketed reactor is equipped with a stirrer for thorough mixing of the reactants. Heating and cooling of the reaction mixture are achieved by circulating water at the appropriate temperature through the reactor jacket. The reactor temperature is controlled by a cascade control system consisting of a primary proportional-integral-derivative (PID) controller and two secondary PID controllers. The reactor temperature is fed back to the primary controller, whose output is taken as the setpoint of the two secondary controllers. The manipulated variables for the two secondary controllers are the hot and cold water flow rates. The hot and cold water streams are mixed before entering the reactor jacket and provide heating or cooling for the reactor. The jacket outlet temperature is fed back to the secondary controllers.

A detailed mathematical model covering reaction kinetics and heat and mass balances has been developed and comprehensively validated using the reactor operation data. On the basis of this model, a rigorous simulation program was developed and used to generate polymerization data under different operating conditions. In this study, the batch duration is set to 180 min. The reactor temperature, reactor jacket inlet and outlet temperatures, and coolant flow rate are chosen as the predictor variables, while the monomer conversion (X), the number-average molecular weight (Mn), and the weight-average molecular weight (Mw) are the response variables. The predictor variables measured at 5, 10, 15, 20, and 25 min from the start of the batch are used to predict the final polymer quality. To make the simulation close to reality, measurement noises are added to the polymer quality variables in the ranges [-3000 g/mol, 3000 g/mol], [-6000 g/mol, 6000 g/mol], and [-1%, 1%] for Mn, Mw, and X, respectively.

In total, 30 batches were simulated for training, where the batch recipes, including the reactor temperature setpoint and the initial initiator concentration, were generated through Monte Carlo simulation. Denoting a set of J process variables measured over K time points for I different batches as an array with dimensions I × J × K, this generates the arrays X(30 × 4 × 5) and Y(30 × 3). The proposed method is not directly applicable to such three-way data, and the original variables, X and Y, are, therefore, first unfolded time-wise,49 generating X(30 × (4 × 5)) and Y(30 × 3). The resulting X(30 × 20) and Y(30 × 3) are then utilized by the proposed approach to develop regression models representing the nominal operating condition (NOC) of the process. The number of neurons in the hidden layer of the RBF network with Gaussian functions is finally chosen as 6 through the rival penalized competitive learning method.32
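Time-wise unfolding of the three-way array is a single reshape. A small sketch, assuming the batch measurements are stored as an array of dimensions (I, J, K) = (30, 4, 5) as described above; the placeholder data and variable names are ours.

```python
import numpy as np

I, J, K = 30, 4, 5
X3 = np.random.default_rng(0).normal(size=(I, J, K))   # placeholder batch data
# time-wise unfolding: each row concatenates the J variables at the K time
# points of one batch, giving the X(30 x 20) matrix used by the model
X_unfolded = X3.transpose(0, 2, 1).reshape(I, K * J)
assert X_unfolded.shape == (30, 20)
```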

Figure 10. SPE for the training (1-30) and testing (31-43) batches.

For the two feed-forward networks with sigmoidal functions in the hidden layer and linear functions in the output layer used to reconstruct X and Y, the numbers of hidden neurons are finally chosen as 8 and 16, respectively, using the constructive algorithm.33 Three latent variables are retained through cross-validation. Noticing that a reconstruction model for the predictor variables is also obtained from the historical data, collected mostly under NOC, the SPE (squared prediction error) statistic can be calculated for every training sample x as

Q = |e|^2 = |x - \hat{x}|^2

where \hat{x} is the output of the reconstruction network for the predictor variables corresponding to its input x. Box50 showed that the distribution of the SPE statistic can be well-approximated by

\mathrm{SPE} \sim g \chi_h^2

where the weight g and the degrees of freedom h can be estimated by matching moments, with m and V being the mean and variance of the SPE values:49

g = \frac{V}{2m} \quad \text{and} \quad h = \frac{2m^2}{V}

The resulting upper control limit (UCL) can, thus, be calculated as

\mathrm{UCL}_{\mathrm{SPE}} = \frac{V}{2m}\, \chi^2_{2m^2/V,\, 1-\alpha}

where α is the predefined level of significance. In this study, α = 0.05 and α = 0.01 are chosen a priori for the warning limits and action limits, respectively. The SPE statistic for the training samples is shown in Figure 10; most training batches lie within the 95% UCL, while all the batches lie within the 99% UCL. In addition, the predictions of the end-product quality, together with the actual values, are illustrated in Figure 11. It can be seen that the developed model achieves satisfactory performance.
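The SPE limit can be computed directly from the training residuals. A sketch of the gχ² approximation and the resulting UCL is given below; it assumes a fitted reconstruction network net_x for the predictor variables (e.g., from the Algorithm 2 sketch in Section 3.3) and uses scipy for the χ² quantile, with the sample mean and variance standing in for the moment estimates.

```python
import numpy as np
from scipy.stats import chi2

def spe_values(net_x, T, Xs):
    """Squared prediction error Q = |x - x_hat|^2 for every sample."""
    residual = Xs - net_x.predict(T).reshape(Xs.shape)
    return np.sum(residual**2, axis=1)

def spe_control_limit(spe_train, alpha=0.05):
    """UCL from the g*chi2_h approximation with g = v/(2m) and h = 2m^2/v."""
    m, v = spe_train.mean(), spe_train.var()
    g, h = v / (2.0 * m), 2.0 * m**2 / v
    return g * chi2.ppf(1.0 - alpha, h)

# usage (illustrative): q_train = spe_values(net_x, T, Xs)
# warning_limit = spe_control_limit(q_train, alpha=0.05)
# action_limit = spe_control_limit(q_train, alpha=0.01)
```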


Figure 11. Prediction of the monomer conversion (X), number-average molecular weight (Mn), and weight-average molecular weight (Mw) for the training (1-30) and testing (31-43) batches. The actual values are denoted by circles (o) and the predictions by pluses (+).

To assess the validity of the developed model, 13 testing batches (batches 31-43) are simulated. The first 10 batches (batches 31-40) fall in the NOC region, whereas batch 41 suffers from initial initiator impurities that effectively reduce the initial initiator mass from its nominal value of 2.5 g to 1.5 g, batch 42 suffers from moderate reactor fouling, and batch 43 suffers from a combination of the two. The predictor variables are recorded as aforementioned during the early stage of each batch run, and the unfolded samples are then used by the developed model to obtain reconstructions of the predictor variables and to predict the end-product qualities. The corresponding SPE values and predictions are also shown in Figures 10 and 11. The SPE values for the normal batches (batches 31-40) all lie within the 99% UCL, as shown in Figure 10, while the predictions of the quality variables exhibit satisfactory accuracy. In addition, the SPE plot correctly identifies batches 41-43 as abnormal, because their SPE values all exceed the 99% UCL. On the other hand, although the predictions for batches 41 and 43 deviate from the measured values to a great extent, the prediction for batch 42 does not deteriorate significantly, as opposed to those for batches 41 and 43. This may be mainly due to the ability of the control system, which is capable of bringing the product quality back to target even when a certain degree of fouling is present. At the same time, this fouling apparently affects the correlation structure of the predictor variables and results in the SPE value violating its control limit.

5. Conclusions

On the basis of an equivalent presentation which treats PLS as the combination of a projection step to calculate the latent variables and a reconstruction step to reconstruct the original data, a novel nonlinear PLS approach is presented where both nonlinear latent structures and nonlinear reconstruction are obtained straightforwardly through two consecutive steps. First, an RBF network is developed to extract the latent structures through linear algebra methods, which can, in fact, be summarized as performing a PLS decomposition between the augmented predictor variables and the original response variables. The obtained latent structures are then utilized to reconstruct the original predictor variables and response variables by two feedforward networks. It, therefore, possesses the capability of explicitly characterizing the nonlinear relationship between the latent structures and the original predictor variables, while exhibiting a fast convergence speed. Extraction of multiple latent structures can be achieved in either sequential or parallel fashion. The proposed nonlinear PLS approach is assessed through both a numerical example and the monitoring of a simulated batch polymerization reactor. Because this approach provides a predictive model for the response variables, it can serve as a generic regression tool and is applicable for quality inference. In addition, since it presents a reproducing model for the predictor variables, it can also be applied for statistical process control through monitoring possible variations in the predictor variables relating to the quality information. On the other hand, this also indicates that the reconstruction of the predictor variables is unnecessary in some cases.

Acknowledgment

The authors thank Professor C. Kiparissides of the Aristotle University of Thessaloniki, Greece, for providing the polymerization reactor model and the simulation program. The financial support of the U.K. EPSRC under grants GR/R10875 and GR/S90461, the National High Technology Research and Development Program of China (863 Program; No. 2001AA413320), the National Science Foundation of China (NSFC 60404012), and the SRF for ROCS, SEM of China, is also greatly appreciated.

Appendix A. Proof of the Equivalent Presentation of PLS

To ease the proof, the presentation adopted by various researchers (cf. refs 19-22), which is probably not the most common description of PLS, is introduced as a lemma.

Lemma 1. Consider X ∈ R^{n×p} and Y = [y_1, ..., y_q] ∈ R^{n×q}, and denote the ith latent variable of PLS as t_i = X w_i and W_{i-1} = [w_1, ..., w_{i-1}]. Then

\{w_i, c\} = \arg\max_{w,c} \left\{ \frac{(w^T \Sigma_{XY} c)^2}{(w^T w)(c^T c)} \right\}    (A1)

subject to

w^T \Sigma_X W_{i-1} = 0^T    (A2)

where \Sigma_{XY} = X^T Y, \Sigma_X = X^T X, and c = \Sigma_{YX} w_i.

Proof of Equation 1. On the basis of Lemma 1, it can be proven that eq A1 is equivalent to (cf. ref 51)

\max_{w,c} \left\{ \frac{\mathrm{cov}^2(Xw, Yc)}{(w^T w)(c^T c)} \right\} = \max_{w} \left\{ \frac{w^T \Sigma_{XY} \Sigma_{YX} w}{w^T w} \right\} = \max_{w} \left\{ \frac{w^T X^T Y Y^T X w}{w^T w} \right\} = \max_{w} \left\{ \frac{\sum_{j=1}^{q} (w^T X^T y_j)^2}{w^T w} \right\}    (A3)

subject to the orthogonality constraint eq A2. Let \hat{w} = w / \sqrt{w^T w} and substitute it into eq A3; after simple algebraic transformations, eq A3 leads to eqs 1 and 2.

Noticing that E_{i-1} = E_{i-2} - t_{i-1} p_{i-1}^T = ... = X - t_{i-1} p_{i-1}^T - ... - t_1 p_1^T, it holds that

p_i = \frac{E_{i-1}^T t_i}{t_i^T t_i} = \frac{(X - t_{i-1} p_{i-1}^T - ... - t_1 p_1^T)^T t_i}{t_i^T t_i}

Since t_i^T t_j = 0 for j < i, this leads to eq 3. Equation 4 can be concluded analogously.

Literature Cited

(1) Wold, H. Nonlinear estimation by iterative least squares procedures. In Research Papers in Statistics, Festschrift for J. Neyman; David, F., Ed.; Wiley: New York, 1966; pp 411-444.
(2) Geladi, P.; Kowalski, B. R. Partial least-squares regression: A tutorial. Anal. Chim. Acta 1986, 185, 1-17.
(3) Höskuldsson, A. PLS regression methods. J. Chemom. 1988, 2, 211-228.
(4) Gnanadesikan, R. Models for Statistical Data Analysis of Multivariate Observations; Wiley: New York, 1977.
(5) Wold, S.; Kettaneh-Wold, N.; Skagerberg, B. Nonlinear PLS modeling. Chemom. Intell. Lab. Syst. 1989, 7 (1-2), 53-65.
(6) Wold, S. Nonlinear partial least squares modelling. II. Spline inner function. Chemom. Intell. Lab. Syst. 1992, 14, 71-84.
(7) Qin, S. J.; McAvoy, T. J. Nonlinear PLS modeling using neural networks. Comput. Chem. Eng. 1992, 16 (4), 379-391.
(8) Qin, S. J. A statistical perspective of neural networks for process modelling and control. In Proceedings of the 1993 International Symposium on Intelligent Control, Chicago, IL, 1993; pp 559-604.
(9) Wold, S. Cross-validatory estimation of the number of components in factor and principal components models. Technometrics 1978, 20 (4), 397-405.
(10) Baffi, G.; Martin, E. B.; Morris, A. J. Nonlinear projection to latent structures revisited (the neural network PLS algorithm). Comput. Chem. Eng. 1999, 23 (9), 1293-1307.
(11) Baffi, G.; Martin, E. B.; Morris, A. J. Nonlinear projection to latent structures revisited: The quadratic PLS algorithm. Comput. Chem. Eng. 1999, 23 (3), 395-411.
(12) Wilson, D. J. H.; Irwin, G. W.; Lightbody, G. Nonlinear PLS using radial basis functions. Trans. Inst. Meas. Control 1997, 19 (4), 211-220.
(13) Holcomb, T. R.; Morari, M. PLS/neural networks. Comput. Chem. Eng. 1992, 16 (4), 393-411.
(14) Malthouse, E. C.; Tamhane, A. C.; Mah, R. S. H. Nonlinear partial least squares. Comput. Chem. Eng. 1997, 21 (8), 875-890.
(15) Chessari, C. M. Studies in modeling and advanced control. Ph.D. Thesis, University of Sydney, Australia, 1995.
(16) Doymaz, F.; Palazoglu, A.; Romagnoli, J. A. Orthogonal nonlinear partial least-squares regression. Ind. Eng. Chem. Res. 2003, 42 (23), 5836-5849.
(17) Wold, H. Estimation of principal components and related models by iterative least squares. In Multivariate Analysis II; Krishnaiah, P. R., Ed.; Academic Press: New York, 1966; pp 391-420.
(18) Wold, H. Nonlinear iterative partial least squares (NIPALS) modelling: Some current developments. In Multivariate Analysis III; Krishnaiah, P. R., Ed.; Academic Press: New York, 1973; pp 383-407.
(19) Stone, M.; Brooks, R. J. Continuum regression: Cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal components regression. J. R. Stat. Soc., Ser. B (Methodological) 1990, 52 (2), 237-269.
(20) Frank, I. E.; Friedman, J. H. A statistical view of some chemometrics regression tools. Technometrics 1993, 35 (2), 109-148.
(21) Hinkle, J. Reciprocal components, reciprocal curves, and partial least squares. Ph.D. Thesis, Department of Statistics, University of Kentucky, Lexington, KY, 1995.
(22) Hinkle, J.; Rayens, W. S. Partial least squares and compositional data: Problems and alternatives. Chemom. Intell. Lab. Syst. 1995, 30 (1), 159-172.
(23) Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward neural networks are universal approximators. Neural Networks 1989, 2, 359-366.
(24) Hornik, K.; Stinchcombe, M.; White, H. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks 1990, 3, 551-560.
(25) Huang, S. C.; Huang, Y. F. Bounds on the number of hidden neurons in multilayer perceptrons. IEEE Trans. Neural Networks 1991, 2, 47-55.
(26) Wilson, D. J. H.; Irwin, G. W.; Lightbody, G. RBF principal manifolds for process monitoring. IEEE Trans. Neural Networks 1999, 10 (6), 1424-1434.
(27) Rumelhart, D. E.; Hinton, G. E.; Williams, R. J. Learning internal representations by error propagation. In Parallel Distributed Processing, Volume 1; Rumelhart, D. E., McClelland, J. L., Eds.; MIT Press: Cambridge, MA, 1986.
(28) Hagan, M. T.; Menhaj, M. B. Training feedforward networks with the Marquardt algorithm. IEEE Trans. Neural Networks 1994, 5 (6), 989-993.
(29) Kramer, M. A. Autoassociative neural networks. Comput. Chem. Eng. 1992, 16 (4), 313-328.
(30) Frank, I. NNPPSS: Neural networks based on PCR and PLS components nonlinearized by smoothers and splines. Presented at InCINC'94, The First International Chemometrics InterNet Conference, 1994.
(31) Moody, J.; Darken, C. J. Fast learning in networks of locally tuned processing units. Neural Comput. 1989, 1, 281-294.
(32) Xu, L.; Krzyzak, A.; Oja, E. Rival penalized competitive learning for clustering analysis, RBF net, and curve detection. IEEE Trans. Neural Networks 1993, 4 (4), 636-649.
(33) Kwok, T.-Y.; Yeung, D.-Y. Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Trans. Neural Networks 1997, 8 (3), 630-645.
(34) Kwok, T.-Y.; Yeung, D.-Y. Objective functions for training new hidden units in constructive neural networks. IEEE Trans. Neural Networks 1997, 8 (5), 1131-1148.
(35) Ma, L.; Khorasani, K. New training strategies for constructive neural networks with application to regression problems. Neural Networks 2004, 17 (4), 589-609.
(36) Castellano, G.; Fanelli, A. M.; Pelillo, M. An iterative pruning algorithm for feedforward neural networks. IEEE Trans. Neural Networks 1997, 8 (3), 519-531.
(37) Zhang, J.; Morris, A. J. A sequential learning approach for single hidden layer neural networks. Neural Networks 1998, 11 (1), 65-80.
(38) Riedmiller, M.; Braun, H. A direct adaptive method for faster back-propagation learning: The RPROP algorithm. In Proceedings of the IEEE International Conference on Neural Networks (ICNN); Ruspini, H., Ed.; San Francisco, CA, 1993; pp 586-591.
(39) Stager, F.; Agarwal, M. Three methods to speed up the training of feedforward and feedback perceptrons. Neural Networks 1997, 10 (8), 1435-1443.
(40) Amari, S.; Murata, N.; Müller, K.; Finke, M.; Yang, H. H. Asymptotic statistical theory of overtraining and cross-validation. IEEE Trans. Neural Networks 1997, 8, 985-996.
(41) MacKay, D. J. C. Bayesian interpolation. Neural Comput. 1992, 4 (3), 415-447.
(42) Thodberg, H. H. A review of Bayesian neural networks with an application to near-infrared spectroscopy. IEEE Trans. Neural Networks 1996, 7 (1), 56-72.
(43) Foresee, F. D.; Hagan, M. T. Gauss-Newton approximation to Bayesian learning. In Proceedings of the 1997 International Conference on Neural Networks, Houston, TX, June 9-12, 1997; Vol. 3, pp 1930-1935.
(44) Baldi, P.; Hornik, K. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks 1989, 2, 53-58.
(45) Baldi, P. F.; Hornik, K. Learning in linear neural networks: A survey. IEEE Trans. Neural Networks 1995, 6 (4), 837-858.
(46) Lindqvist, B. Multivariate Prediction and Regression; Technical Report 75554; Department of Mathematical Sciences, Norwegian University of Science and Technology: Trondheim, Norway, Nov 1996.
(47) Hertz, J.; Krogh, A.; Palmer, R. G. Introduction to the Theory of Neural Computation; Addison-Wesley: Redwood City, CA, 1991.
(48) Yabuki, Y.; MacGregor, J. F. Product quality control in semibatch reactors using midcourse correction policies. Ind. Eng. Chem. Res. 1997, 36 (4), 1268-1275.
(49) Nomikos, P.; MacGregor, J. F. Multivariate SPC charts for monitoring batch processes. Technometrics 1995, 37 (1), 41-59.
(50) Box, G. E. P. Some theorems on quadratic forms applied in the study of analysis of variance problems. I. Effect of inequality of variance in the one-way classification. Ann. Math. Stat. 1954, 25, 290-302.
(51) Rayens, W. S. The Art of Maximizing Covariance; Technical Report 383; Department of Statistics, University of Kentucky: Lexington, KY, June 30, 2000.

Received for review November 8, 2005. Revised manuscript received March 29, 2006. Accepted March 29, 2006.

IE0512340