
Prediction of Profiles in the Process Industries

Marco S. Reis* and Pedro M. Saraiva

CIEPQPF, Department of Chemical Engineering, University of Coimbra, Pólo II, Pinhal de Marrocos, 3030-790 Coimbra, Portugal

ABSTRACT: The characterization of products and processes by space- or time-related sets of measurements (profiles) is becoming an increasingly common situation in today's highly instrumented manufacturing systems. Even though many applications and methodologies in which profile measurements are employed as inputs to analysis tasks have already been proposed and described, problems in which they naturally appear as outputs are rarely addressed. Therefore, in this work, we present real-world applications in which profiles are the desired prediction targets and describe the methodologies followed to address the underlying analysis goals. In this context, we show how to properly establish sample-specific prediction intervals for profiles, in a simple and flexible way, through nonparametric resampling or noise-addition procedures. The added value of the various analyses carried out during the description of the case studies is also highlighted from the standpoint of the associated benefits for process improvement.

Received: March 19, 2011; Accepted: February 27, 2012; Published: February 27, 2012



INTRODUCTION

Products currently being manufactured are no longer characterized by single, well-defined, and isolated quantities that address specific aspects of their quality. Rather, they are described by data arrays, collected through automatic measurement systems, containing relevant information about how quality features are distributed in space and time. In fact, one can easily identify a variety of data arrays, with rather distinct structures, that are presently being acquired to characterize, monitor, and control the quality of products, such as near-infrared (NIR) spectra,1−3 nuclear magnetic resonance (NMR) spectra,4,5 chromatograms,6,7 surface profiles,8 gray-level images,9 color and hyperspectral images,10−14 and data from hyphenated instruments,15 among others. These examples are intentionally presented in a sequence that reflects the underlying dimensionality of their data structures: one-dimensional for spectra, two-dimensional for gray-level images, three-dimensional for color and hyperspectral images, and even higher dimensionalities for data collected with hyphenated instrumentation. We propose here to designate all such data arrays simply as profiles, considering a profile to be any array of data, indexed by time and/or space, that characterizes a given entity (product or process):

Profile: {Y(i_x, i_y, i_z, i_t)} ∈ ℛ^m, with (i_x, i_y, i_z, i_t) ∈ Ω_x × Ω_y × Ω_z × Ω_t    (1)

where Y represents a profile with dimensionality m, whose spatial or temporal features (if they exist) are indexed by the set of indices (i_x, i_y, i_z, i_t) ∈ Ω_x × Ω_y × Ω_z × Ω_t (x, y, and z stand for the spatial indices, and t stands for the time index). For instance, a red−green−blue (RGB) image is a three-dimensional profile in which two modes are indexed by the spatial indices (i_x, i_y) ∈ Ω_x × Ω_y, and the third mode is the intensity in the R, G, and B wavelength bands, Y(i_x, i_y) ∈ ℛ^3, which represents the color of a given pixel. As this image has a structure properly indexed in space, it is considered to be a profile. On the other hand, when one collects and saves measurements from a variety of sensors (pressure, temperature, flow, composition, etc.) in a database, in the form of a two-mode table of observations × variables, each line of such a table is not a profile. In fact, there is no natural indexing scheme related to space or time connecting the several elements of the one-dimensional vector of measurements for a given sampling time. Therefore, measurement vectors collected from a process are not profiles, according to the definition in eq 1.

The current importance placed on the analysis of profiles cannot be overstated; Woodall et al.16 described it as "the most promising area of research in statistical process control". We believe that this interest will certainly spread to other application fields as well. In this article, we address applications involving the prediction of profiles, a field that, from our perspective, still needs to evolve to meet the challenges found in real-world practice today. In the next few paragraphs, we provide a brief justification for this remark.

Exploring the available technical literature on the topic of profile analysis, one can quickly conclude that, in the large majority of reported works, profiles are used as inputs to data analysis. This is true irrespective of the particular nature of the profiles, and it is done with a variety of goals. For instance, the purpose could be to infer features or properties of the corresponding products, assess their quality, or classify compliance with requirements. We consider an input to a data analysis task to be a quantity that is fed to an algorithm or procedure to produce a given outcome, called the output. For instance, when a spectrum is used to predict the concentration of analytes in a sample, the spectrum has the role of an analysis input, and the predicted analyte concentrations are the respective outputs.15,17 When images are used to assess the quality of cereal flakes or snack foods, the inputs are the collected images, whereas the output is the quality level of the entities under analysis.11,14 Process monitoring is characterized as being one-dimensional8,16,18 or two-dimensional9 depending on the profiles used as inputs, whereas the output is the state of the process (normal or abnormal). Table 1 provides a list of some different situations in which profiles are used as inputs for different analysis tasks, together with some illustrative references, from which more information about the data structures and methods employed can be found.


Table 1. Illustrative Examples in Which Profiles Are Used as Inputs to Analysis Methods

  task                              | profile used as input                                               | analysis output                                            | refs
  process monitoring                | surface profiles, images                                            | operations status (normal/abnormal)                        | 1, 8−10, 16, and 18−24
  predictive modeling (regression)  | spectra, data from hyphenated instrumentation (multiway structure)  | composition, product quality, physicochemical parameters   | 3, 17, and 25−41
  classification                    | images, spectra                                                     | quality level/grade, product origin, product age           | 11 and 42−49

In fact, when we performed a search on a general platform oriented toward technical publications, using keywords such as "profiles", "multivariate statistical analysis", and "process data", and analyzed the first 100 results obtained, we found that 93% of the cases either considered applications in which profiles were used as inputs (∼80%) or did not fall within the scope of the definition of a profile employed in this work. The remaining 7% involved the prediction of end-of-batch properties in batch operations, for which the issue of predicting a profile can be considered to be tacitly present. However, in those cases, the entire interest lies in the final value of the profile, which is not true for the applications addressed in this article, for which the whole set of measurements is of interest. Therefore, the large body of literature available on applications of multivariate statistical frameworks to process and analytical data50−62 is focused on problems in which profiles are used as inputs to predict univariate measures related to the goals of analysis or on situations that do not comply with the notion of profiles as stated in eq 1.1,7−10,12,14−18,23,34,37,63−65 Even when multiple responses are considered simultaneously, through, for instance, multiresponse partial least squares (PLS2),17 profiles are still treated as inputs, with the desired correlated properties being the targeted responses.66 In this context, further work is required to develop methodologies for properly dealing with situations in which profiles are the natural analysis outputs. In this article, we address two such real-world situations and present the predictive methodologies employed. We also present and discuss several ways to compute the so-called sample-specific prediction intervals for the predicted profiles.

This article is organized as follows. In the next section, we introduce some of the basic methods currently in use for dealing with profile data, as well as the associated nomenclature employed in this work. The proposed flexible methodologies for constructing sample-specific prediction intervals are also presented in detail in that section. Then, after introducing the basic workflow followed in our analysis, we present case studies addressing the prediction of (i) the fiber orientation profiles in a paper and paperboard machine and (ii) the UV−vis spectra of a wine sample across its aging process. Each application is discussed, and the results are interpreted regarding their utility for process improvement activities. In the final section, we summarize the main contributions of this article and provide some remarks regarding future work.

METHODS

In this section, we introduce the methodologies used in this work to deal with profile prediction. These are divided into modeling frameworks [such as principal component analysis (PCA), PLS2, and an alternative model structure based on a PCA decomposition of the output profiles, called Y-PCA regression] and methods for constructing sample-specific prediction intervals (based on analytical expressions or on Monte Carlo simulations and resampling techniques).

Modeling Frameworks. In the following subsections, we provide short reviews of PCA and PLS2, mainly for the purposes of establishing the nomenclature employed and setting the basic theoretical background for this article. Furthermore, an alternative simple model structure that is used in one of the reported case studies is also introduced.

Principal Component Analysis. Principal component analysis (PCA)67,68 is a multivariate statistical analysis methodology aimed at extracting the main linear correlation structures present in an original data set. It operates by successively finding linear combinations of the original variables with maximal variance. The coefficients of the linear combinations are subjected to size constraints (their L2 norm must equal 1) and orthogonality constraints (each vector of coefficients must be orthogonal to all previous ones). Each linear combination corresponds to a different principal component (PC). The coefficients are called the loadings, and the values of the linear combinations are called the scores. The PCs are usually arranged according to a natural ordering, where the first PC (PC1) corresponds to the linear combination explaining the largest portion of the original variability present in the data, in the sense of the total variation, TV, defined as

TV = Σ_{j=1}^{m} s_j²    (2)

where s_j² represents the sample variance of the jth original variable. The subsequent PCs have the same meaning, but address only the variability not explained by the former ones (because of the imposed orthogonality constraints). More specifically, let us assume that the data are organized in a two-dimensional array, X, with m variables (as columns) and n observations (as rows). The PCA decomposition of X is given by

X = T·P^T    (3)

where T is the n × m matrix of PCs (scores), which are the new (latent) variables (each column corresponds to one such new variable), and P is the m × m orthonormal matrix of coefficients (loadings or loading vectors), which, when applied to X, results in T (each column of P corresponds to the coefficients of the linear combination associated with a given PC). (Note that it is assumed that the variables were previously centered at zero by subtracting their means and also scaled, if necessary.) As the last PCs explain very little of the original variability of X, which could also be due to noise and unstructured features, they can usually be discarded, and their contributions gathered into an n × m matrix of residuals, E (discarded, but not forgotten, as the residual matrix contains useful information regarding unexplained variation and outlying observations):




X = T·P^T + E    (4)

where T is now an n × a matrix (a being the number of retained components, also known as the pseudorank of the process) and P is the m × a matrix with the loading vectors of the retained components. PCA has been used in the analysis of one-dimensional,69 two-dimensional,9 and three-dimensional10,12,46 profiles, as well as in the prediction of important quantities, such as product properties, through principal component regression (PCR), where profiles are used as inputs. In this work, we explore PCA in the context of effectively compressing the output domain, or Y space, in prediction problems.
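As an illustration of eqs 3 and 4, the following minimal sketch (in Python with NumPy, a tool choice of ours that the article does not prescribe; variable names are illustrative) computes a PCA decomposition via the singular value decomposition and splits X into the part explained by a retained components and the residual matrix E:

```python
import numpy as np

def pca_decompose(X, a):
    """Return scores T (n x a), loadings P (m x a), and residuals E (eq 4).

    Assumes the columns of X have already been mean-centered (and scaled,
    if appropriate), as noted in the text.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:a].T               # m x a orthonormal loadings
    T = X @ P                  # n x a scores (T = X.P)
    E = X - T @ P.T            # n x m residual matrix
    return T, P, E

# Example with synthetic data: 50 observations, 10 variables, 2 components kept
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
X -= X.mean(axis=0)
T, P, E = pca_decompose(X, a=2)
print(T.shape, P.shape, np.linalg.norm(E))
```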

Partial Least Squares. In contrast to PCA, which is a methodology for analyzing a single block of data, partial least squares (PLS) is an algorithm that relates two blocks of data, X and Y.17,50,70 Using reasoning similar to that employed in PCA, PLS seeks linear combinations in X space that fulfill a given criterion. However, the criterion now employed to define the linear combinations aims to maximize not the explained X variability, but rather the covariance with the response.71 This is exactly so when there is only one output variable, in which case the PLS algorithm is not iterative. For multiple responses, the PLS algorithm, usually referred to as PLS2, becomes iterative and simultaneously searches for those linear combinations in X space and in Y space that present maximal mutual covariance. As in PCA, the first linear combinations of PLS concentrate the main contributions for attaining the analysis goal, which now consists of explaining the overall covariance between the two data sets. Even though PLS addresses the estimation of the following two-block latent-variable model

X = T·P^T + E
Y = T·Q^T + G    (5)

(where T is the matrix of PLS X scores, P and Q are two matrices of PLS loadings, and E and G are the residual matrices of the X and Y spaces), in the end, predictions can be computed as for any linear regression model, through an expression of the form

Y = X·B_PLS + G    (6)

The only difference between this formula and its counterpart for ordinary least squares (OLS) lies in the matrix of regression coefficients (B_PLS), which is estimated in a different way, leading to different parameter estimates (unless the number of dimensions used in PLS is equal to the number of variables, in which case B_PLS = B_OLS). One should note that multiple responses can also be handled in the scope of OLS, by applying the equation

B_OLS = (X^T X)^{−1} X^T Y    (7)

However, this equation corresponds to estimating simultaneously, but in an independent way, a model for each response, contrary to what happens in PLS, where all responses are taken into account simultaneously when deriving the coefficient matrix, latent variable by latent variable. Of course, expression 7 is valid only when X has full rank, a condition that is often not met in practice, in which case PLS usually constitutes a good alternative.
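For readers who want to experiment with eq 6, a PLS2 model can be fit, for instance, with scikit-learn (one possible tool; the article does not specify an implementation). PLSRegression handles a multivariate Y block and exposes a fitted coefficient matrix playing the role of B_PLS (note that the orientation of coef_ differs across scikit-learn versions):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 8))                       # 40 samples, 8 predictors
B_true = rng.normal(size=(8, 5))
Y = X @ B_true + 0.1 * rng.normal(size=(40, 5))    # 5 correlated responses

pls2 = PLSRegression(n_components=3)               # number of latent variables
pls2.fit(X, Y)
Y_hat = pls2.predict(X)                            # predictions as in eq 6
print(Y_hat.shape, pls2.coef_.shape)
```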

Y-PCA Regression. A simple alternative model structure was also developed here to address the problem of relating two blocks of data when the Y block is formed by profiles. It consists of first performing a PCA decomposition of the output profiles (Y) and then estimating linear regression models to predict the scores of the relevant PCs from the available predictors (e.g., process variables). We thus have, for this approach, the following model structure, which we call Y-PCA regression:

Scores prediction: t = X·B + ε₁, ε₁ ∼ iid N_a(0, Λ_ε1)
Profile reconstruction: Y = t·P^T + ε₂, ε₂ ∼ iid N_m(0_m, Σ_ε2)    (8)

where t and P are the PCA score vector and loading matrix for the profiles (Y block), respectively; ε₁ represents the regression residuals, with covariance Λ_ε1 (a diagonal matrix); and ε₂ represents the reconstruction error, modeled as an independent and identically distributed (iid) N_m process (m being the number of output variables considered), with zero mean and covariance Σ_ε2. In our work, the models for predicting each score were developed individually, using an OLS framework (see the Case Studies section). The overall prediction formula for Y-PCA regression is thus given by

Y = X·B·P^T + ε₁·P^T + ε₂ ⇔ Y = X·B·P^T + ε_T, with ε_T ∼ N_m(0_m, Σ_T)
Σ̂_T = P·Λ_ε1·P^T + Σ_ε2    (9)

The estimate of the covariance of the predicted profiles, Σ̂_T, corresponds to a zeroth-order approximation, obtained by neglecting the uncertainty of the estimated coefficients, such as P. The covariance matrices in system 9 can be estimated using classical point estimation approaches, such as maximum likelihood. However, we implemented a simpler methodology in this work, because our approach to the prediction problem relies on the flexible resampling procedures described in the next section, rather than on approximate analytical methods. (It basically consists of first estimating Λ_ε1 and Σ̂_T and then using these quantities to estimate Σ_ε2 through the last equation in system 9, retaining only the positive definite part of the matrix thus obtained; even though not optimal, this procedure performed well in the case studies and enabled the illustration of the analytical approach for incorporating both noise sources in the analysis.) This model structure allows for the prediction of profiles from process-related information through a latent-variables approach in which PCA is employed to describe Y space, contrary to its conventional use in compressing X space for regression (as in PCR). Y-PCA regression is an alternative to PLS2 when (i) the number of latent variables for describing and predicting the Y space is not necessarily the same as for the X space or (ii) different sets of input variables are to be used for predicting distinct aspects of the output profiles (i.e., a different set of variables is used for each score; see the first example in the Case Studies section).
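A minimal sketch of the Y-PCA regression structure of eqs 8 and 9 follows (Python/NumPy; our own illustration, not the authors' code). It compresses the Y block with PCA, fits one OLS model per retained score, and reconstructs the predicted profile from the predicted scores:

```python
import numpy as np

def fit_y_pca_regression(X, Y, a):
    """Fit the Y-PCA regression model of eq 8 (illustrative sketch)."""
    y_mean = Y.mean(axis=0)
    Yc = Y - y_mean                              # center the profiles
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    P = Vt[:a].T                                 # Y-space loadings (m x a)
    T = Yc @ P                                   # scores of the output profiles
    # One OLS model per score: B solves X.B ~ T, column by column
    B, *_ = np.linalg.lstsq(X, T, rcond=None)
    return B, P, y_mean

def predict_profile(x_new, B, P, y_mean):
    """Point prediction of a full profile: y = x.B.P' + mean (cf. eq 9)."""
    return x_new @ B @ P.T + y_mean

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 6))
Y = X @ rng.normal(size=(6, 30)) + 0.05 * rng.normal(size=(60, 30))
B, P, y_mean = fit_y_pca_regression(X, Y, a=2)
y_hat = predict_profile(X[:1], B, P, y_mean)     # predicted 30-point profile
```

In the paper's first case study the score models were built by stepwise variable selection rather than a single least-squares call; the sketch keeps only the overall structure.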


Sample-Specific Prediction Intervals. The assessment of the prediction ability of regression models is often based on single quantities that estimate some average performance of the methods over the range of interest in their application, that is, over the region in X space where the models are to be applied in the future (test conditions). Of course, this region should be covered, in a balanced way, by the samples forming the training set employed to estimate the model parameters. Examples of such quantities are the several estimates of the mean squared error of prediction (MSEP) found in the literature, computed, for instance, using a test set (MSEPtest), only the training set (apparent MSEP, MSEPapp), the K-fold cross-validation estimate (MSEPCV,K), or the adjusted K-fold cross-validation estimate (MSEPCVadj,K), among others.17,72 These single quantities, aside from providing an assessment of the model prediction accuracy, can play an active and important role in the selection of the model structure, namely, the set of variables to include in the model and/or the number of latent variables to consider (usually by minimization of the adopted metric or by identification of the onset of a plateau in the downward trend of its values).

Even though of some utility in the contexts mentioned above, these average measures of predictability are not able to take into account the location in X space where the prediction is to be made. This is an important point in multivariate problems and can create significant distortions in the uncertainty attributed to local predictions. The problem is more serious when the training samples are distributed heterogeneously in X space, as this necessarily leads to highly variable prediction uncertainties for the estimates, according to the region where they are located. A data collection scheme following a statistical experimental design procedure can ensure more even coverage of the input space. However, quite often this is not a possibility, as happens when data are passively collected by observational methods, such as during plant operation. In these cases, the training set is given, and the only thing the analyst can do is to use some sort of sampling, performed a posteriori, to obtain a more balanced representation of all of the conditions of interest. In all such situations, the most rigorous way of providing the uncertainty associated with a particular prediction is through the so-called sample-specific prediction intervals.

For OLS, the mathematical expression for sample-specific prediction intervals is well-known and provides the range in which, at a confidence level of (1 − α) × 100%, one expects to find the next observed value y_i (α being the significance level, which, in practice, often takes a value of 0.05 or 0.01)

y_i ∈ ŷ_i ± t_{α/2, n−m} √(σ̂²(1 + h_i))    (10)

where ŷ_i is the estimate provided by the regression model; t_{α/2, n−m} is the upper α/2 × 100% percentile point of the Student's t distribution with n − m degrees of freedom; σ̂² is the estimate of the residual variance; and h_i is the so-called leverage associated with observation i, given by

h_i = x_i^T (X^T X)^{−1} x_i    (11)

One should note that prediction intervals are not confidence intervals in the statistical sense, as the quantity under analysis is not a fixed quantity by hypothesis (a population parameter), but is rather intrinsically subject to random fluctuations (a random variable).

Analytical Expressions for Sample-Specific Prediction Intervals. Significant efforts have been made to develop analogous formulas for univariate PCR and PLS. They include analytical expressions derived from error propagation theory, taking into account uncertainties in both X and Y spaces,73−76 as well as resampling and Monte Carlo procedures.77 An example of such an expression, derived from error propagation theory for the variance of the predictions, V̆(ŷ_i), that is valid for OLS, PCR, and PLS (but, in the cases of PLS and PCR, only after the imposition of some additional simplifications) is75

V̆(ŷ_i) = (N^{−1} + h_i)(σ_Δy² + ||β||₂² σ_ΔX²) + ||β||₂² σ_ΔX²    (12)

where N stands for the number of samples in the training set, ||β||₂² is the squared Euclidean norm of the vector of regression parameters, σ_Δy² is the variance of the measurement errors in the response, and σ_ΔX² is the variance of the measurement errors in X space. The implementation of this analytical formula, however, raises some difficulties, not only in the estimation of some of its unknown quantities, but also in other technical issues, such as the problem of estimating the number of degrees of freedom involved (using a normal-based approximation), as well as the additional difficulty of having a correlation between the predictions and their associated errors (because of the presence of errors in X space).

An approximate analytical formula was derived for predictions made with the Y-PCA regression model structure. As several variables are involved in the output profiles, the prediction intervals must be constructed in such a way that they contain the entire set of observed profiles inside them (1 − α) × 100% of the times (α being the significance level). In other words, representing in parallel coordinates the set of observed profiles corresponding to a given process condition, (1 − α) × 100% of them should be entirely contained within the sample-specific prediction bands. This means that the prediction bands should consider all of the output profiles simultaneously and not one at a time. (This point will be further discussed and clarified in the next subsection.) The simultaneous prediction band for the ith observation with the Y-PCA regression model, at a given significance level (α), is given by the expression

y_ij ∈ ŷ_ij ± √(χ²_{α,m} × Σ̂_{T,jj})    (13)

where χ²_{α,m} is the upper α × 100% percentile point of the χ² distribution with m degrees of freedom and Σ̂_{T,jj} is the jth element of the diagonal of Σ̂_T. (Of the several possibilities available to accommodate this effect, the large-sample asymptotic expression was employed.78)

To circumvent the complexity and limitations of the analytical procedures for constructing sample-specific prediction intervals, other approaches were developed that are currently regarded as good alternatives. These include resampling and Monte Carlo computational approaches. In the following section, we present two methodologies, the first based on the noise addition (NA) method (also called Monte Carlo simulation of synthetic data sets79) and the other based on the bootstrap residuals (BR) approach.80 These approaches have already been employed, with good results, for estimating sample-specific prediction intervals for univariate responses.77
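For reference, eqs 10 and 11 take only a few lines to implement; the following sketch (Python with NumPy/SciPy; our own illustration, with arbitrary synthetic data) computes the leverage and the resulting sample-specific OLS prediction interval for a new point:

```python
import numpy as np
from scipy import stats

def ols_prediction_interval(X, y, x_new, alpha=0.05):
    """Sample-specific OLS prediction interval (eqs 10 and 11)."""
    n, m = X.shape                             # n observations, m coefficients
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - m)           # residual variance estimate
    h = x_new @ np.linalg.inv(X.T @ X) @ x_new # leverage of the new point
    y_hat = x_new @ beta
    half = stats.t.ppf(1 - alpha / 2, n - m) * np.sqrt(sigma2 * (1 + h))
    return y_hat - half, y_hat + half

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + 0.2 * rng.normal(size=30)
print(ols_prediction_interval(X, y, np.array([1.0, 0.3, -0.1])))
```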


Flexible Approaches for Constructing Sample-Specific Prediction Intervals. Given the complexity and difficulty of implementing analytical procedures in real-world scenarios, to which one can add the fact that no analogous sample-specific formula is currently available for PLS2 predictions, we employ in this work approaches that are flexible and easy to program and implement. These are based on resampling and Monte Carlo methodologies and consist of the sequence of stages illustrated in Figure 1.

Figure 1. Sequence of steps involved in the construction of sample-specific prediction intervals using the NA and BR methodologies.

First, a model is estimated using the training data set (training stage). The model structure must be able to predict profiles and fit the specific conditions and restrictions of the problem. In this work, we consider the model structures Y-PCA regression and PLS2. Then, a point estimate for a given profile is produced, corresponding to a given set of input conditions (X variables), together with the associated interval estimate, which, in this case, is the sample-specific prediction interval (testing stage). The process of building this prediction band consists of two steps: (i) generation of the prediction residuals for the sample under analysis, obtaining the corresponding empirical distribution, and (ii) construction of the simultaneous prediction interval with the desired coverage probability, (1 − α) × 100%, using the reference distribution of residuals. Two alternatives are explored here to generate the prediction residuals: noise addition and bootstrap residuals. They are explained in detail in the following subsections, along with the methodology for constructing the simultaneous prediction interval, which finally leads to the desired sample-specific prediction interval.

Generating the Empirical Distribution of Prediction Residuals: Noise Addition (NA) Approach. The NA approach consists of simulating the random phenomena present in the model (Y-PCA regression or PLS2) and generating the prediction errors that would result from their successive application to the same sample under analysis. This immediately leads to an empirical distribution of prediction errors with fewer simplifying assumptions than the analytical approach. Thus, in the NA approach, one basically generates noise following the assumed noise distributions and adds it to the error-free data, derived from the assumed model, to obtain simulated observed measurements. Afterward, these artificial observed measurements are used to estimate the corresponding model parameters, from which predictions for the set of conditions under analysis can be made. This procedure is repeated a large number of times, and for each repetition, the relevant prediction errors are computed and saved. In the end, this set of prediction errors is used to establish the empirical distribution of errors. This method is preferred when the available data are not very abundant but the structure of the model is well-defined. More details about the NA procedure are presented in the Appendix.

Generating the Empirical Distribution of Prediction Residuals: Bootstrap Residuals (BR) Approach. The NA approach is an option when data are scarce and the analysis has to rely heavily on the assumed model and error distributions. However, when the data set available for training is larger, one can use it to avoid making some structural assumptions a priori. The approach then becomes data-driven and can be implemented through a resampling framework, such as the bootstrap. In this work, we adapt the bootstrap approach to generate the distribution of prediction errors. Toward this end, a bootstrap residuals (BR) approach is followed, because the bootstrap observations approach80 does not offer enough flexibility to mimic all of the relevant random mechanisms to be considered in the present case. Therefore, we derived a BR procedure for generating an approximate empirical distribution of prediction errors, under a given set of conditions, for both model structures considered (Y-PCA regression and PLS2); its pseudoalgorithm is presented in detail in the Appendix, and a compact code sketch follows below.
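As a concrete illustration of the BR loop, the following sketch (our own Python code, written for a generic multiresponse linear model Y = X·B standing in for the Y-PCA regression or PLS2 structures; the number of trials is arbitrary) generates the empirical matrix of prediction errors used in the next subsection:

```python
import numpy as np

def bootstrap_prediction_errors(X, Y, x_new, n_boot=2000, rng=None):
    """Empirical distribution of prediction errors at x_new (BR approach).

    Illustrative sketch for a multiresponse linear model; the paper applies
    the same loop to Y-PCA regression and PLS2 (see the Appendix).
    """
    rng = rng or np.random.default_rng()
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    R = Y - X @ B                          # in-sample residuals
    y_ref = x_new @ B                      # prediction from the original model
    n = X.shape[0]
    errors = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample residual rows
        Y_star = X @ B + R[idx]            # synthetic "observed" outputs
        B_star, *_ = np.linalg.lstsq(X, Y_star, rcond=None)
        # prediction error = reference prediction - bootstrap prediction
        # + a randomly drawn output residual (to mimic future observation noise)
        errors.append(y_ref - x_new @ B_star + R[rng.integers(0, n)])
    return np.asarray(errors)              # n_boot x m error matrix E
```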
Computing the Simultaneous Prediction Interval. The above-mentioned procedures (NA and BR) are able to generate empirical distributions of prediction errors associated with a given point estimate of an output profile. However, this is only part of what is needed to derive the final sample-specific prediction intervals; the problem of how to process such empirical distributions to finally arrive at the desired prediction interval bounds, for a confidence level established a priori, remains to be solved. An easy, but incomplete, way of dealing with this problem consists of estimating the lower and upper α/2 percentiles (for the level of significance employed, α) for each output variable in the profile, in an individual and independent way, using the empirical distributions of residuals obtained with the NA or BR methods. This type of interval focuses on the prediction uncertainty of each response, regardless of the others, and for this reason such intervals are also known as one-at-a-time intervals. However, they do not comply with the definition of a sample-specific prediction interval at a given confidence level. A sample-specific prediction interval corresponds to the envelope associated with the profile predictions, for a given set of values of the X variables, that completely covers the observed profiles (represented graphically in parallel coordinates) (1 − α) × 100% of the times, on average. Therefore, instead of being considered individually, as in the one-at-a-time approach, all of the output profile entries must be considered simultaneously to provide the required probability coverage. As the one-at-a-time approach misses this point, it ends up providing overly optimistic prediction uncertainty bands for the estimated output profiles.


In more rigorous terms, consider an n × m data set with the profile predictions Y̆ (n corresponds to the number of NA simulations or BR resampling trials, and m is the number of outputs in the profile) computed from the errors generated by the NA or BR approach according to

Y̆ = 1·Ŷ + E    (14)

where 1 is an n × 1 column vector, Ŷ is the profile predicted using a given model structure, such as PLS2 or Y-PCA regression (a 1 × m row vector), and E is the n × m matrix of prediction errors generated by the NA or BR approach (as detailed in the algorithms presented in the Appendix). One aims to derive the prediction band for the point estimate Ŷ = (ŷ₁, ŷ₂, ..., ŷ_m) of the form C^α = C₁^α × C₂^α × ... × C_m^α, where each C_j^α (j = 1, ..., m) is the prediction interval of the jth output, but with a simultaneous coverage level of (1 − α) × 100%

Pr(∩_{j=1}^{m} {y_j ∈ C_j^α}) ≥ 1 − α    (15)

These simultaneous prediction intervals can be constructed by adapting the approaches proposed by Mandel and Betensky,81 presented in the Appendix. The first approach (SPI I) provides a simple and flexible solution to this problem, although slightly conservative, a feature that is corrected by the second approach (SPI II), also detailed in the Appendix.
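The exact SPI I and SPI II procedures are given in the Appendix (only partially reproduced in this section). The following sketch conveys the underlying idea of simultaneous coverage in a deliberately simple form of our own choosing: widen the one-at-a-time percentile limits by a common factor until the required fraction of simulated profiles lies entirely inside the band.

```python
import numpy as np

def simultaneous_band(E, alpha=0.05):
    """Widen one-at-a-time percentile limits of the prediction-error matrix
    E (n x m) by a common factor until at least (1 - alpha) of its rows lie
    entirely inside the band. Assumes errors roughly centered at zero.

    A simple illustration of the requirement of eq 15, not the exact
    SPI I / SPI II algorithms adapted from Mandel and Betensky.
    """
    lo = np.percentile(E, 100 * alpha / 2, axis=0)
    hi = np.percentile(E, 100 * (1 - alpha / 2), axis=0)
    for k in np.arange(1.0, 5.01, 0.01):          # common inflation factor
        inside = np.all((E >= k * lo) & (E <= k * hi), axis=1)
        if inside.mean() >= 1 - alpha:
            return k * lo, k * hi                 # add to the point prediction
    return lo, hi                                 # fallback if 5x is not enough
```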



GENERAL ANALYSIS WORKFLOW

In the analysis of the real-world case studies that follow, a common workflow was adopted. Even though the methodologies and specific procedures employed in each stage are described in detail in the next section, we present here, for the sake of clarity, the general sequence of steps considered:
(i) Collect a data set containing process variables (X data) and profiles (Y data).
(ii) Study the variability patterns in the X and Y blocks, specifically by deriving and analyzing PCA models for each of them, especially for the Y block.
(iii) Depending on the nature of the profiles and the underlying processes, establish a model structure for describing the system (Y-PCA regression or PLS2; see details in the next section).
(iv) Estimate the parameters of the selected model structure.
(v) Explore the estimated model to extract relevant knowledge about the underlying process, specifically for process improvement activities.
(vi) Evaluate the prediction ability of the estimated model, as follows: (a) Select samples from the data set. (b) Predict the response profiles for these samples. (c) Compute the associated sample-specific prediction intervals, using the methodologies proposed in this article. (d) Compare the results obtained with the observed profiles.
(vii) Summarize the final conclusions and learning outcomes.

CASE STUDIES

In this section, we present two real-world case studies in which profiles appear as the natural outputs of the analysis. The models required to achieve the analysis goals were constructed using data collected from the processes through passive collection (observational studies). For the sake of clarity, the presentation of each case study is organized in four parts: (i) case description, in which the process and the problem under analysis are briefly introduced; (ii) data organization, summarizing the structure of the data tables used; (iii) methods and results, where the methods employed are described in detail and the main results of their implementation are presented; and (iv) case study summary, which provides an overview and discussion of the results obtained.

Case Study 1: Prediction of Fiber Orientation Profiles in a Paper and Paperboard Machine. Case Description. Paper and paperboard are examples of products with complex structural features, both at their surface8 and in their inner layers.82 One of the most important structural parameters of paper is fiber orientation, given its strong influence on a wide variety of properties that are critical for end users, such as elastic properties, tensile strength, hygroexpansivity, and curl. Fiber orientation results from the combined effects of several operating variables and conditions, in particular those located in the so-called wet-end section of the paper machine, that is, the stage where a fiber suspension with low consistency is thrown over a permeable web (or between two permeable webs, according to the type of paper machine) to begin the formation of the future paper sheet.

One of the most convenient ways to infer the principal direction of fiber orientation is through the tensile stiffness orientation (TSO) angle. This parameter provides an approximate value for fiber orientation, even though other factors can also interfere, such as the drying tensions imposed in the final drying stages of the paper machine. Therefore, modern paper and paperboard mills are equipped with devices that periodically acquire a profile of TSO angles across the paper web, from which the operation status concerning this important structural parameter can be inferred. These profiles consist of TSO angles taken at different positions across the transverse direction of the paper machine (Figure 2).

Figure 2. TSO profiles collected from a real paper machine.

These profiles are routinely analyzed by plant personnel and used to tune machine parameters, to drive the process under proper conditions.

Data Organization. In this case study, an extensive database of TSO profiles was analyzed, to locate variables that might be inducing undesirable variability in the profiles, as well as to find better ways to control the fiber orientation profile in the paper machine. The data set had the structure presented in Table 2.


Table 2. Summary of the Data Structure Analyzed in Case Study 1a

  block   | rows (production units)                                               | columns
  X block | values for the process variables relative to a given production unit  | values for a process variable regarding all units under analysis
  Y block | TSO angles (profile) measured for a given production unit             | values for the TSO angle regarding a given transversal position, for all paper units under analysis

a Prediction of fiber orientation profiles in a paper and paperboard machine.

Methods and Results. A PCA of all of the TSO profiles available was first conducted (Figure 3).

Figure 3. Some results of the PCA conducted over the TSO angle profiles: (a) eigenvalue plot (or scree plot), (b) scores plot for the first two PCs (the first and last observations are indicated with circles), (c) loadings plot for PC1, (d) loadings plot for PC2.

The first two PCs explained the majority of the variability present; the third had an eigenvalue lower than 1, an indication that its contribution to explaining the overall variability was, in this case, quite weak (the data were previously autoscaled). Therefore, the subsequent analysis focused on modeling the variability of the first two PCs, which, taken together, explained 82.5% of the whole profile variability (Figure 3a). The scores of these two PCs (Figure 3b), which represent the projection of the multidimensional observations onto the two-dimensional PCA plane, clearly indicate an operation that has experienced some drift. One can notice the existence of three operating conditions, which justifies a careful analysis to scrutinize what is causing their occurrence in terms of process-related variables. Analyzing the loadings for PC1 (Figure 3c), it is possible to verify that this component represents a contrast between the TSO angles at the two borders of the machine, being useful for detecting any imbalance in TSO across the two halves of the paper machine. On the other hand, PC2 gives more weight to the central measurements, indicating the presence of a bias in the TSO angle in this region (Figure 3d). (The central area is quite important, since it usually provides the best conditions for consolidating the fiber suspension into a paper sheet, from which better paper grades are often selected.) In this context, it is apparent that PC1 and PC2 are truly conveying information about different aspects of the profile variability. They represent the two building blocks, or components, of the TSO profiles and can depend on different process variables.

To analyze which process variables or conditions might be driving the variability conveyed by each PC, models were developed for each of them (PC1 and PC2). The models were estimated following a constructive and data-driven approach, namely, a forward stepwise variable selection procedure. This procedure consists essentially of successively including variables in the model if they indeed provide a statistically significant contribution to its improvement. On the other hand, variables selected for incorporation in the model in earlier stages can be discarded in later steps if they become redundant or insignificant after the introduction of others83,84 (a sketch of such a selection loop is given below). Following this approach, two parsimonious models were derived: a model for PC1 that incorporates variables FH1, JT2, and PH2, explaining 90.44% of its variability, and another model for PC2, including variables AH2, R62, and R52, that explains 78.13% of the PC2 variability. Even though the variable codes cannot be revealed, for confidentiality reasons, the influence of these variables can be rationally interpreted using process knowledge. The quality of the models derived not only allows important variables affecting different aspects of the TSO profile to be pinpointed, but is also good enough for predicting the impact of future modifications of their levels (as long as such modifications are introduced within the typical range of normal operating actions).
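A minimal sketch of a forward stepwise selection loop of this kind follows (our own Python illustration using statsmodels; the entry/exit thresholds and column names are placeholders, not the plant's confidential codes):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_stepwise(X, y, p_enter=0.05, p_remove=0.10, max_iter=100):
    """Forward stepwise OLS selection with backward elimination (sketch).

    X: DataFrame of candidate predictors; y: response (e.g., a PC score).
    The thresholds are illustrative, not the values used in the paper.
    """
    selected = []
    for _ in range(max_iter):                  # iteration cap guards cycling
        candidates = [c for c in X.columns if c not in selected]
        pvals = {c: sm.OLS(y, sm.add_constant(X[selected + [c]]))
                        .fit().pvalues[c] for c in candidates}
        if not pvals or pvals[min(pvals, key=pvals.get)] >= p_enter:
            break                              # nothing significant left to add
        selected.append(min(pvals, key=pvals.get))
        # drop previously selected variables that became insignificant,
        # keeping the variable that has just entered the model
        fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
        selected = [c for c in selected
                    if c == selected[-1] or fit.pvalues[c] <= p_remove]
    return selected

# Example with synthetic data (column names are hypothetical)
rng = np.random.default_rng(4)
X = pd.DataFrame(rng.normal(size=(80, 5)), columns=list("ABCDE"))
y = 2 * X["A"] - X["C"] + 0.1 * rng.normal(size=80)
print(forward_stepwise(X, y))                  # expected: ['A', 'C']
```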

We now analyze the problem of predicting a profile for a given set of operating conditions. The two models developed for PC1 and PC2 can be incorporated within the structure of Y-PCA regression, eqs 8 and 9. Following the general analysis workflow, this model can now be employed to predict profiles, as well as the associated simultaneous prediction intervals, using either the approximate asymptotic analytical formula or the flexible methodologies based on the NA and BR approaches.

Figure 4. Observed and predicted profiles, together with the associated prediction uncertainty bands, obtained with the analytical formula (shaded region) and with the several methods discussed, for the case where the residual vectors and matrices were derived from the NA approach (i.e., the one-at-a-time percentile approach), SPI I, and SPI II.

Figures 4 and 5 present the results obtained for the prediction of the TSO profile corresponding to a given set of process conditions, relative to a given observation in the data set under analysis (so that the observed profile is also known). In Figure 4, the approaches for estimating the prediction intervals were based on residuals generated using the NA method, whereas in Figure 5, they were obtained from residuals produced by the BR approach.


Figure 5. Observed and predicted profiles, together with the associated prediction uncertainty bands, obtained with the analytical formula (shaded region) and with the several methods discussed, for the case where the residual vectors and matrices were derived from the BR approach (i.e., the one-at-a-time percentile approach), SPI I, and SPI II.

In both cases, it can be seen that the analytical approach, consisting of an asymptotic approximation to the inference problem, provides the most conservative solution; that is, it leads to the widest prediction uncertainty band for a given confidence level (95% in this case). On the other hand, the one-at-a-time percentile approach provides the narrowest bands, as it discards the simultaneous consideration of all inference problems regarding the m output variables, therefore being overoptimistic in nature. Approaches SPI I and SPI II correct for these effects, providing uncertainty bands with the desired coverage probability.

Table 3 presents the expected coverage of future profiles for the process conditions considered in Figures 4 and 5, based on the artificial samples generated by the NA and BR approaches. In other words, the percentages presented in this table reflect the relative numbers of samples that fall beyond the estimated simultaneous prediction intervals (in at least one part of the profile). Analyzing these results, it is possible to verify that both approaches respect the 5% significance level employed in the study and that the interval computed by SPI II is less conservative than that computed using SPI I.

Table 3. Percentage of Artificial Samples (Profiles) Generated Using the NA and BR Methodologies (Regarding Figures 4 and 5, Respectively) That Exceed, for at Least One Point of the Profile, the Prediction Interval Bounds Computed Using SPI I and SPI II

  method of generating residuals | SPI I (%) | SPI II (%)
  noise addition (NA)            | 0.50      | 1.19
  bootstrap residuals (BR)       | 0.57      | 0.87
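The percentages in Table 3 amount to counting how many simulated profiles leave the band at one or more positions; a small sketch of that check (our own code, where band_lo and band_hi are the absolute bound vectors of the simultaneous band):

```python
import numpy as np

def exceedance_percentage(Y_sim, band_lo, band_hi):
    """Percentage of simulated profiles (rows of Y_sim) falling outside the
    simultaneous band [band_lo, band_hi] at one or more positions."""
    outside = np.any((Y_sim < band_lo) | (Y_sim > band_hi), axis=1)
    return 100.0 * outside.mean()
```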

Case Study Summary. The modeling approach developed showed good predictive performance in estimating the TSO profiles from process variables and also enabled the induction of operating information regarding the subsets of variables interfering with different profile features. This information is very useful for plant engineers interested in finding the main sources of process variation and drift and in knowing which variables can be more effective for correcting deviations from the intended shape of the profiles.

Case Study 2: Prediction of the Wine UV−Vis Spectrum across Wine Aging. Case Description. Wine is a complex high-value product whose production process is not yet fully understood, despite centuries of experience and empirically accumulated knowledge. Using analytical instrumentation and proper experimental protocols, it is possible to measure key analytes present in the wine matrix and their evolution across wine aging, in order to infer, at a second stage, which elementary processes and reactions take place during the aging process (a process that occurs in casks, normally made from a certain kind of wood, such as oak42). One important quality feature of wine is its color. In this section, we address the problem of predicting the UV−vis spectrum of wine from its chemical composition, obtained through high-performance liquid chromatography with diode-array detection (HPLC-DAD). A total of 25 compounds were selected for deriving the model, including gallic acid, furfural, and protocatechuic acid. Our study focused on Portuguese Madeira wine, given its high market value and economic importance for the production regions. Wine samples from a given producer were collected at different aging stages, covering a total of 20 years (Figure 6). The grape variety analyzed was "Malmsey".

Figure 6. Sampling plan followed in the study, indicating the samples taken (A−C) corresponding to each aging period.


Data Organization. The problem addressed in this case study can be stated as the prediction of a full UV−vis spectrum (Y variables) from the chemical composition of a given wine sample (X variables). Thus, the data tables analyzed have the structure summarized in Table 4.

Table 4. Summary of the Data Structure Analyzed in Case Study 2a

  block   | rows (samples)                                   | columns
  X block | composition of a given wine sample               | values for the amounts of a given analyte, in all wine samples
  Y block | UV−vis spectrum relative to a given wine sample  | absorbance values at a given wavelength in the UV−vis spectrum, for all wine samples

a Prediction of the wine UV−vis spectrum across wine aging.

Methods and Results. In this case, the PLS2 model structure (eq 5) was employed, as the underlying latent sources of variation structuring the variability in the chemical components and in the UV−vis spectrum are expected to be the same, as assumed in this model. Estimating a PLS2 model for the available data set [using six latent variables (LVs), as determined by a preliminary cross-validation study] and analyzing first the subspace of X scores (Figure 7), it becomes clear that there is indeed an aging trend that reflects the complex physicochemical processes going on during wine aging.

Figure 7. Evolution of wine chemical composition in the three-dimensional space provided by the first three LVs of our PLS2 model (X scores), where the labels correspond to the year of wine making (from 1988 to 2006).

One can now use the model to investigate which chemical components play an active role in such a pattern, using, for instance, contribution plots, and conjecture reactions and other phenomena that might be consistent with such trends. We do not pursue this topic further here, but it has been addressed in detail elsewhere.7,42

On the other hand, the PLS2 model can also be employed to predict the full UV−vis spectrum from knowledge of the chemical composition, as well as to establish the corresponding prediction uncertainty bands. An example of such an analysis is presented in Figure 8, where, for a given sample in the data set, one can verify that the predicted spectrum follows the observed one quite closely; the observed spectrum also lies within the sample-specific prediction interval computed with method SPI II applied to prediction errors generated by the NA approach.

Figure 8. Predicted and observed UV−vis spectra for a wine sample, along with the associated sample-specific prediction interval (prediction errors generated by the NA approach; prediction interval estimated using methodology SPI II). The vertical lines indicate the wavelengths currently used for analyzing different aspects associated with the color of wine, according to established standards.

Furthermore, it is also possible to analyze, for each section or individual wavelength of the spectrum, the chemical components that are most relevant in explaining its variability, using the variable importance in the projection (VIP) facility of this algorithm.85 Figure 9 presents several such VIP plots (the chemical components are represented by an index that goes from 1 to 25). One can see that different parts of the spectrum can be affected by different chemical components, but their roles are quite similar in some regions, namely, the visible part of the spectrum. This information is very useful in studying the impact of the chemical components present, and the reactions taking place, on wine color, improving our understanding of the fundamental mechanisms affecting wine quality.

Figure 9. VIP scores for each component of the profile, indicating the chemical components that are most relevant in its prediction. The compounds are identified on the x axis by an index that goes from 1 to 25.

Case Study Summary. In summary, one can conclude from this case study that the proposed approach not only enables a proper prediction of the whole UV−vis spectrum, but also makes possible an in-depth understanding of the underlying complex mechanisms, identifying aging trends and clarifying the importance associated with the various chemical components in the establishment of color patterns and the predicted absorbances for different regions of the spectrum.
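The VIP scores mentioned above can be computed from the quantities a fitted PLS2 model exposes. A common formulation (one of several in the literature, shown here for scikit-learn's PLSRegression as an illustration, not the authors' exact implementation):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls: PLSRegression):
    """VIP score per X variable for a fitted PLSRegression model.

    Common formulation: VIP_j = sqrt(p * sum_a ssy_a * (w_ja / ||w_a||)^2
    / sum_a ssy_a), where ssy_a is the Y sum of squares explained by LV a.
    """
    T = pls.x_scores_                     # n x A latent scores
    W = pls.x_weights_                    # p x A weight vectors
    Q = pls.y_loadings_                   # q x A Y loadings
    p, A = W.shape
    ssy = np.sum(T**2, axis=0) * np.sum(Q**2, axis=0)   # per-LV explained SS
    Wn = W / np.linalg.norm(W, axis=0)                  # normalized weights
    return np.sqrt(p * (Wn**2 @ ssy) / ssy.sum())       # length-p VIP vector
```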


CONCLUSIONS

In this article, we have addressed the problem of predicting profiles from process-related data. Profiles are increasingly being acquired for product characterization tasks. Therefore, the problem of predicting properties as isolated scalar quantities is rapidly evolving into one of predicting profiles of properties. This trend requires the development of appropriate tools and inference methods to address the emerging applications in a consistent and adequate way. In this article, we have illustrated how different profile prediction problems can be addressed with simple and flexible tools. The gains are not centered only on prediction capability; we have also underlined the added value achieved at the level of interpretation and analysis of the results obtained, leading to a better understanding of the underlying processes and, ultimately, setting the basis for consistent process improvement initiatives.

Flexible approaches were proposed for computing sample-specific prediction intervals. These should be used in any prediction activity, as they provide proper uncertainty bands for future observations (profiles, in the current situation). To the best of our knowledge, no other methodology has so far been proposed in the engineering technical literature that provides a similarly parsimonious solution to the simultaneous inference problem. Nonlinear input−output relationships can also, to a certain extent, be accommodated within the frameworks presented, without further adaptation, through the incorporation of transformed variables in the X-block data set. Future work will address prediction problems involving higher-order profiles, exploring, for instance, the multiway methods available15 and developing the associated statistical inference tools.

APPENDIX

In this appendix, we present the pseudocodes underlying our computational procedures for generating the empirical distributions of the sample-specific prediction errors, namely, the noise addition and bootstrap residuals approaches, and for computing the associated sample-specific simultaneous prediction intervals, namely, SPI I and SPI II. The methodologies followed for generating an empirical distribution of the sample-specific prediction errors using the NA and BR approaches, respectively, are summarized in the following two pseudocodes.

Noise Addition Approach Applied to Y-PCA Regression and PLS2
(1) Estimate the model parameters.
(2) Using the estimated model, compute the residual vectors and matrices, based on the model structure formulas (Y-PCA regression or PLS2; usually, they are computed from the difference between measured and estimated quantities). The residuals thus obtained provide an empirical distribution for the random errors in the models.
(3) Estimate the parameters of the probability density functions assumed in the model for the residual terms.
(4) Set the process variable values (X space) whose output profile (Y space) is to be predicted, along with the associated sample-specific simultaneous prediction interval.
(5) For i = 1, ..., n, perform the following steps:
    (a) Generate a realization of the residual probability distribution functions for each observation in the original data set.
    (b) With these artificially generated residuals and the model estimated from the original data set, compute the corresponding observed output profiles.
    (c) Estimate new model parameters for this simulated data set.
    (d) Predict the output profile for the conditions under analysis.
    (e) Compute the prediction error as the difference between the value provided by the model estimated from the original data and the value obtained in step (d), plus a residual vector generated from another realization of the probability density function for the output profile (because one is addressing the inference of a prediction interval; if the interest were in the corresponding confidence interval, this residual addition would not be necessary).
    (f) Save the prediction error and return to step (a) until the predefined number of simulations (n) is reached.
(6) Apply SPI I or SPI II to the n × m residuals matrix, say, E, and obtain the sample-specific simultaneous prediction interval for the conditions under analysis.

Bootstrap Residuals Approach Applied to Y-PCA Regression and PLS2
(1) Estimate the model parameters.
(2) Using the estimated model, compute the residual vectors and matrices, based on the model structure formulas. The residuals thus obtained provide an empirical distribution for the random errors in the models.
(3) Set the process variable values whose output profile is to be predicted, along with the associated sample-specific simultaneous prediction interval.
(4) For i = 1, ..., n, perform the following steps:
    (a) Generate bootstrap samples of the residuals.
    (b) With the bootstrapped residuals and the model estimated from the original data set, compute the corresponding observed output profiles.
    (c) Estimate the new model parameters for these bootstrapped data.
    (d) Predict the output profile for the conditions under analysis.
    (e) Compute the prediction error as the difference between the value provided by the model estimated from the original data and the value obtained in step (d), plus a residual vector randomly selected from the set of output residuals.
    (f) Save the prediction error and return to step (a) until the predefined number of bootstrap samples (n) is reached.
(5) Apply SPI I or SPI II to the n × m residuals matrix (E) and obtain the sample-specific simultaneous prediction interval for the conditions under analysis.

The two methodologies followed for computing the sample-specific simultaneous prediction intervals associated with a given profile prediction are summarized in the following pair of pseudocodes. The first methodology corresponds to a more conservative algorithm, whereas the second corrects this feature (namely, the fact that, for some observations, the output


Algorithm SPI II

(1) Perform steps 1−3 of SPI I.
(2) Define the relative ranks, r*(i,j) = |r(i,j) − (n + 1)/2|, and their signs, s*(i,j) = sign{r(i,j) − (n + 1)/2}.
(3) Define r*(i) = max_j r*(i,j) as the highest relative rank of the ith sample, and let s*(i) be the corresponding sign. If the maximum occurs at several values of j, the sign is chosen arbitrarily among them.
(4) Set r(i) = (n + 1)/2 + r*(i) × s*(i), the rank corresponding to the most discrepant coordinate of the ith sample.
(5) For all i and j, replace Y̆(i,j) by Y̆*(r(i),j).
(6) Run algorithm SPI I on the values generated in step 5.
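SPI II admits an equally compact transcription, reusing spi_one from the previous sketch. The arbitrary sign choice of step 3 is resolved here by taking the first maximizing coordinate (and a positive sign for exactly central ranks); any other convention would serve equally well.

```python
def spi_two(Y, alpha=0.05):
    """Algorithm SPI II: SPI I applied after replacing each sample by the
    column order statistics at the rank of its most discrepant coordinate."""
    n, m = Y.shape
    Y_ord = np.sort(Y, axis=0)
    r = np.argsort(np.argsort(Y, axis=0), axis=0) + 1.0    # ranks r(i,j)
    c = r - (n + 1) / 2.0                                  # centered ranks
    rel = np.abs(c)                                        # step 2: relative ranks r*(i,j)
    j_star = rel.argmax(axis=1)                            # step 3: most discrepant coordinate
    s = np.sign(c[np.arange(n), j_star])
    s[s == 0] = 1.0                                        # arbitrary sign if exactly central
    r_i = np.rint((n + 1) / 2.0 + rel[np.arange(n), j_star] * s).astype(int)  # step 4
    Y_rep = Y_ord[np.clip(r_i, 1, n) - 1, :]               # step 5: replace the samples
    return spi_one(Y_rep, alpha)                           # step 6
```

Under the stand-in model of the earlier sketches, a 95% sample-specific simultaneous prediction interval for the profile at x_new would then be obtained as, e.g., E, y_hat = br_prediction_errors(X, Y, x_new); lo, up = spi_two(E), giving the band (y_hat − up, y_hat − lo) under the error sign convention of steps 5e/4e above.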



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

The authors acknowledge financial support through Project PTDC/EQU-ESI/108597/2008, cofinanced by the Portuguese FCT and the European Union's FEDER through "Eixo I do Programa Operacional Factores de Competitividade (POFC)" of QREN (FCOMP-01-0124-FEDER-010398).



REFERENCES

(1) Trygg, J.; Kettaneh-Wold, N.; Wallbäcks, L. 2D Wavelet Analysis and Compression of On-Line Industrial Process Data. J. Chemom. 2001, 15, 299−319.
(2) Brink, M.; Mandenius, C.-F.; Skoglund, A. On-line predictions of the aspen fibre and birch bark content in unbleached hardwood pulp, using NIR spectroscopy and multivariate data analysis. Chemom. Intell. Lab. Syst. 2010, 103, 53−58.
(3) Sahni, N. S.; Aastveit, A. H.; Naes, T. In-Line Process and Product Control Using Spectroscopy and Multivariate Calibration. J. Qual. Technol. 2005, 37 (1), 1−20.
(4) Bernstein, M. A.; Štefinović, M.; Sleigh, C. J. Optimising reaction performance in the pharmaceutical industry by monitoring with NMR. Magn. Reson. Chem. 2007, 45, 564−571.
(5) Larsen, F. H.; van den Berg, F.; Engelsen, S. B. An exploratory chemometric study of 1H NMR spectra of table wine. J. Chemom. 2006, 20, 198−208.
(6) Ballabio, D.; Skov, T.; Leardi, R.; Bro, R. Classification of GC-MS measurements of wines by combining data dimension reduction and variable selection techniques. J. Chemom. 2008, 22, 457−463.
(7) Pereira, A. C.; Reis, M. S.; Saraiva, P. M.; Marques, J. C. Aroma ageing trends in GC/MS profiles of liqueur wines. Anal. Chim. Acta 2010, 660, 8−21.
(8) Reis, M. S.; Saraiva, P. M. Multiscale Statistical Process Control of Paper Surface Profiles. Qual. Technol. Quant. Manage. 2006, 3 (3), 263−282.
(9) Reis, M. S.; Bauer, A. Wavelet texture analysis of on-line acquired images for paper formation assessment and monitoring. Chemom. Intell. Lab. Syst. 2009, 95 (2), 129−137.
(10) Bharati, M. H.; MacGregor, J. F. Multivariate Image Analysis for Real-Time Process Monitoring and Control. Ind. Eng. Chem. Res. 1998, 37, 4715−4724.
(11) Pereira, A. C.; Reis, M. S.; Saraiva, P. M. Quality control of food products using image analysis and multivariate statistical tools. Ind. Eng. Chem. Res. 2009, 48 (2), 988−998.
(12) Geladi, P.; Grahn, H. Multivariate Image Analysis; Wiley: Chichester, U.K., 1996.
(13) Yu, H.; MacGregor, J. F. Monitoring Flames in an Industrial Boiler Using Multivariate Image Analysis. AIChE J. 2004, 50 (7), 1474−1483.
(14) Yu, H.; MacGregor, J. F.; Haarsma, G.; Bourg, W. Digital Imaging for Online Monitoring and Control of Industrial Snack Food Processes. Ind. Eng. Chem. Res. 2003, 42, 3036−3044.
(15) Smilde, A. K.; Bro, R.; Geladi, P. Multi-way Analysis: Applications in the Chemical Sciences; Wiley: Chichester, U.K., 2004.
(16) Woodall, W. H.; Spitzner, D. J.; Montgomery, D. C.; Gupta, S. Using Control Charts to Monitor Process and Product Quality Profiles. J. Qual. Technol. 2004, 36 (3), 309−320.
(17) Martens, H.; Naes, T. Multivariate Calibration; Wiley: Chichester, U.K., 1989.
(18) Kang, L.; Albin, S. L. On-Line Monitoring When the Process Yields a Linear Profile. J. Qual. Technol. 2000, 32 (4), 418−426.
(19) Staudhammer, C.; Maness, T. C.; Kozak, R. A. Profile charts for monitoring lumber manufacturing using laser range sensor data. J. Qual. Technol. 2007, 39 (3), 224−240.
(20) Chicken, E.; Pignatiello, J. J.; Simpson, J. R. Statistical process monitoring of nonlinear profiles using wavelets. J. Qual. Technol. 2009, 41 (2), 198−212.
(21) Jin, J.; Shi, J. Feature-Preserving Data Compression of Stamping Tonnage Information Using Wavelets. Technometrics 1999, 41 (4), 327−339.
(22) Lada, E. K.; Lu, J.-C.; Wilson, J. R. A Wavelet-Based Procedure for Process Fault Detection. IEEE Trans. Semicond. Manuf. 2002, 15 (1), 79−90.
(23) De Anda, J. C.; Wang, X. Z.; Roberts, K. J. Multi-Scale Segmentation Image Analysis for the In-Processing Monitoring of Particle Shape with Batch Crystallisers. Chem. Eng. Sci. 2005, 60, 1053−1065.
(24) Liu, J. J.; MacGregor, J. F.; Duchesne, C.; Bartolacci, G. Flotation Froth Monitoring Using Multiresolution Multivariate Image Analysis. Min. Eng. 2005, 18, 65−76.
(25) Durante, C.; Cocchi, M.; Grandi, M.; Marchetti, A.; Bro, R. Application of N-PLS to gas chromatographic and sensory data of traditional balsamic vinegars of Modena. Chemom. Intell. Lab. Syst. 2006, 83, 54−65.


(26) Bro, R. Exploratory study of sugar production using fluorescence spectroscopy and multi-way analysis. Chemom. Intell. Lab. Syst. 1999, 46, 133−147.
(27) Gurden, S. P.; Westerhuis, J. A.; Bro, R.; Smilde, A. K. A comparison of multiway regression and scaling methods. Chemom. Intell. Lab. Syst. 2001, 59, 121−136.
(28) Mortensen, P. P.; Bro, R. Real-time monitoring and chemical profiling of a cultivation process. Chemom. Intell. Lab. Syst. 2006, 84, 106−113.
(29) Pasamontes, A.; Callao, M. P. Sequential injection analysis linked to multivariate curve resolution with alternating least squares. Trends Anal. Chem. 2006, 25 (1), 77−85.
(30) Salau, J. S.; Honing, M.; Tauler, R.; Barceló, D. Resolution and quantitative determination of coeluted pesticide mixtures in liquid chromatography−thermospray mass spectroscopy by multivariate curve resolution. J. Chromatogr. A 1998, 795, 3−12.
(31) Brink, M.; Mandenius, C.-F.; Skoglund, A. On-line predictions of the aspen fibre and birch bark content in unbleached hardwood pulp, using NIR spectroscopy and multivariate data analysis. Chemom. Intell. Lab. Syst. 2010, 103, 53−58.
(32) Flumignan, D. L.; Ferreira, F. O.; Tininis, A. G.; Oliveira, J. E. Multivariate calibrations in gas chromatographic profiles for prediction of several physicochemical parameters of Brazilian commercial gasoline. Chemom. Intell. Lab. Syst. 2008, 92, 53−60.
(33) Balabin, R. M.; Safieva, R. Z.; Lokamina, E. I. Comparison of linear and nonlinear calibration models based on near infrared (NIR) spectroscopy data for gasoline properties prediction. Chemom. Intell. Lab. Syst. 2007, 88, 183−188.
(34) Trygg, J.; Wold, S. PLS Regression on Wavelet Compressed NIR Spectra. Chemom. Intell. Lab. Syst. 1998, 42, 209−220.
(35) Meloun, M.; Čapek, J.; Mikšík, P.; Brereton, R. G. Critical Comparison of Methods Predicting the Number of Components in Spectroscopic Data. Anal. Chim. Acta 2000, 423, 51−68.
(36) Valderrama, P.; Poppi, R. J. Bilinear least squares (BLLS) and molecular fluorescence in the quantification of the propranolol enantiomers. Anal. Chim. Acta 2008, 623 (1), 38−45.
(37) Bharati, M. H.; MacGregor, J. F.; Champagne, M. Using Near-Infrared Multivariate Image Regression to Predict Pulp Properties. Tappi J. 2004, 3 (5), 8−14.
(38) Björk, A.; Danielsson, L.-G. Spectra of Wavelet Scale Coefficients from Process Acoustic Measurements as Input for PLS Modelling of Pulp Quality. J. Chemom. 2002, 16, 521−528.
(39) Burger, J.; Geladi, P. Hyperspectral NIR Image Regression Part I: Calibration and Correction. J. Chemom. 2005, 19, 355−363.
(40) Burger, J.; Geladi, P. Hyperspectral NIR Imaging for Calibration and Prediction: a Comparison between Image and Spectrometer Data for Studying Organic and Biological Samples. Analyst 2006, 131, 1152−1160.
(41) Delwiche, S. R.; Graybosch, R. A. Examination of spectral pretreatments for partial least-squares calibrations for chemical and physical properties of wheat. Appl. Spectrosc. 2003, 57 (12), 1517−1527.
(42) Pereira, A. C.; Reis, M. S.; Saraiva, P. M.; Marques, J. C. Analysis and assessment of Madeira wine ageing over an extended time period through GC−MS and chemometric analysis. Anal. Chim. Acta 2010, 659, 93−101.
(43) Antonelli, A.; Cocchi, M.; Fava, P. Automated Evaluation of Food Colour by Means of Multivariate Image Analysis Coupled to a Wavelet-based Classification Algorithm. Anal. Chim. Acta 2004, 515, 3−13.
(44) van Mispelaar, V. G.; Smilde, A. K.; de Noord, O. E.; Blomberg, J.; Schoenmakers, P. J. Classification of highly similar crude oils using data sets from comprehensive two-dimensional gas chromatography and multivariate techniques. J. Chromatogr. A 2005, 1096 (1−2), 156−164.
(45) Pierce, K. M.; Hope, J. L.; Johnson, K. J.; Wright, B. W.; Synovec, R. E. Classification of gasoline data obtained by gas chromatography using a piecewise alignment algorithm combined with feature selection and principal component analysis. J. Chromatogr. A 2005, 1096 (1−2), 101−110.
(46) Bharati, M. H.; MacGregor, J. F.; Tropper, W. Softwood Lumber Grading through On-line Multivariate Image Analysis Techniques. Ind. Eng. Chem. Res. 2003, 42, 5345−5353.
(47) Cocchi, M.; Durante, C.; Foca, G.; Manzini, D.; Marchetti, A.; Ulrici, A. Application of a Wavelet-Based Algorithm on HS-SPME/GC Signals for the Classification of Balsamic Vinegars. Chemom. Intell. Lab. Syst. 2004, 71, 129−140.
(48) Reis, M. S.; Saraiva, P. M. Analysis and Classification of the Paper Surface. Ind. Eng. Chem. Res. 2010, 49 (5), 2493−2502.
(49) Vannucci, M.; Sha, N.; Brown, P. J. NIR and Mass Spectra Classification: Bayesian Methods for Wavelet-Based Feature Selection. Chemom. Intell. Lab. Syst. 2005, 77, 139−148.
(50) Geladi, P.; Kowalski, B. R. Partial Least-Squares Regression: a Tutorial. Anal. Chim. Acta 1986, 185, 1−17.
(51) Höskuldsson, A. Prediction Methods in Science and Technology; Thor Publishing: New York, 1996.
(52) Jackson, J. E. Quality Control Methods for Several Related Variables. Technometrics 1959, 1 (4), 359−377.
(53) Jackson, J. E.; Mudholkar, G. S. Control Procedures for Residuals Associated with Principal Component Analysis. Technometrics 1979, 21 (3), 341−349.
(54) Jaeckle, C.; MacGregor, J. F. Product Design through Multivariate Statistical Analysis of Process Data. AIChE J. 1998, 44 (5), 1105−1118.
(55) Kourti, T.; MacGregor, J. F. Process Analysis, Monitoring and Diagnosis, Using Multivariate Projection Methods. Chemom. Intell. Lab. Syst. 1995, 28, 3−21.
(56) Kresta, J. V.; MacGregor, J. F.; Marlin, T. E. Multivariate Statistical Monitoring of Process Operating Performance. Can. J. Chem. Eng. 1991, 69, 35−47.
(57) MacGregor, J. F.; Kourti, T. Statistical Process Control of Multivariate Processes. Control Eng. Pract. 1995, 3 (3), 403−414.
(58) MacGregor, J. F.; Kourti, T. Multivariate Statistical Treatment of Historical Data for Productivity and Quality Improvements. In Foundation of Computer Aided Process Operations (FOCAPO 98); CACHE/CAST Division of AIChE: New York, 1998; pp 31−41.
(59) Saraiva, P. M.; Stephanopoulos, G. Continuous Process Improvement through Inductive and Analogical Learning. AIChE J. 1992, 38 (2), 161−183.
(60) Wise, B. W.; Gallagher, N. B. The Process Chemometrics Approach to Process Monitoring and Fault Detection. J. Process Control 1996, 6 (6), 329−348.
(61) Wold, S. Cross-Validatory Estimation of the Number of Components in Factor and Principal Components Models. Technometrics 1978, 20 (4), 397−405.
(62) Wold, S.; Sjöström, M.; Carlson, R.; Lundstedt, T.; Hellberg, S.; Skagerberg, B.; Wikström, C.; Öhman, J. Multivariate Design. Anal. Chim. Acta 1986, 191, 17−32.
(63) Kim, K.; Mahmoud, M. A.; Woodall, W. H. On the Monitoring of Linear Profiles. J. Qual. Technol. 2003, 35 (3), 317−328.
(64) Eriksson, L.; Trygg, J.; Johansson, E.; Bro, R.; Wold, S. Orthogonal Signal Correction, Wavelet Analysis, and Multivariate Calibration of Complicated Process Fluorescence Data. Anal. Chim. Acta 2000, 420, 181−195.
(65) Nomikos, P.; MacGregor, J. F. Multivariate SPC Charts for Monitoring Batch Processes. Technometrics 1995, 37 (1), 41−59.
(66) Pedro, A. M. K.; Ferreira, M. M. C. Simultaneously calibrating solids, sugars and acidity of tomato products using PLS2 and NIR spectroscopy. Anal. Chim. Acta 2007, 595, 221−227.
(67) Jackson, J. E. A User's Guide to Principal Components; Wiley: New York, 1991.
(68) Jolliffe, I. T. Principal Component Analysis, 2nd ed.; Springer: New York, 2002.
(69) Arkun, Y.; Kayihan, F. A Novel Approach to Full CD Profile Control of Sheet-Forming Processes Using Adaptive PCA and Reduced-Order IMC Design. Comput. Chem. Eng. 1998, 22 (7−8), 945−962.
(70) Wold, S.; Sjöström, M.; Eriksson, L. PLS-Regression: A Basic Tool of Chemometrics. Chemom. Intell. Lab. Syst. 2001, 58, 109−130.
(71) Höskuldsson, A. PLS Regression Methods. J. Chemom. 1988, 2, 211−228.
(72) Mevik, B.-H.; Cederkvist, H. R. Mean Squared Error of Prediction (MSEP) Estimates for Principal Component Regression (PCR) and Partial Least Squares Regression (PLSR). J. Chemom. 2004, 18, 422−429.
(73) Denham, M. C. Prediction Intervals in Partial Least Squares. J. Chemom. 1997, 11, 39−52.
(74) Faber, K. Comparison of Two Recently Proposed Expressions for Partial Least Squares Regression Prediction Error. Chemom. Intell. Lab. Syst. 2000, 52, 123−134.
(75) Faber, K.; Kowalski, B. R. Propagation of Measurement Errors for the Validation of Predictions Obtained by Principal Component Regression and Partial Least Squares. J. Chemom. 1997, 11, 181−238.
(76) Phatak, A.; Reilly, P. M.; Penlidis, A. An Approach to Interval Estimation in Partial Least Squares Regression. Anal. Chim. Acta 1993, 277, 495−501.
(77) Pierna, J. A. F.; Jin, L.; Wahl, F.; Faber, N. M.; Massart, D. L. Estimation of Partial Least Squares Regression Prediction Uncertainty When the Reference Values Carry a Sizeable Measurement Error. Chemom. Intell. Lab. Syst. 2003, 65, 281−291.
(78) Johnson, R. A.; Wichern, D. W. Applied Multivariate Statistical Analysis, 5th ed.; Prentice Hall: Upper Saddle River, NJ, 2002.
(79) Press, W. H.; Teukolsky, S. A.; Vetterling, W. T.; Flannery, B. P. Numerical Recipes in Fortran: The Art of Scientific Computing, 2nd ed.; Cambridge University Press: Cambridge, U.K., 1992.
(80) Efron, B.; Tibshirani, R. J. An Introduction to the Bootstrap; Chapman & Hall/CRC: Boca Raton, FL, 1998; Vol. 57.
(81) Mandel, M.; Betensky, R. A. Simultaneous Confidence Intervals based on the Percentile Bootstrap Approach. Comput. Stat. Data Anal. 2008, 52, 2158−2165.
(82) Kortschot, M. T. The Role of the Fibre in the Structural Hierarchy of Paper. In The Fundamentals of Papermaking Materials: Transactions of the 11th Fundamental Research Symposium; Baker, C. F., Ed.; Pira International: Leatherhead, Surrey, U.K., 1997; pp 351−399.
(83) Draper, N. R.; Smith, H. Applied Regression Analysis, 3rd ed.; Wiley: New York, 1998.
(84) Montgomery, D. C.; Peck, E. A.; Vining, G. G. Introduction to Linear Regression Analysis, 4th ed.; Wiley: New York, 2006.
(85) Eriksson, L.; Johansson, E.; Kettaneh-Wold, N.; Wold, S. Multi- and Megavariate Data Analysis. Part I: Basic Principles and Applications; Umetrics AB: Umeå, Sweden, 2001.
