Article pubs.acs.org/IECR
Challenges in the Specification and Integration of Measurement Uncertainty in the Development of Data-Driven Models for the Chemical Processing Industry

Marco S. Reis,*,† Ricardo Rendall,† Swee-Teng Chin,‡ and Leo Chiang‡

†CIEPQPF, Department of Chemical Engineering, University of Coimbra, Rua Sílvio Lima, 3030-790 Coimbra, Portugal
‡Analytical Technology Center, The Dow Chemical Company, Freeport, Texas 77541, United States
ABSTRACT: Obtaining and handling measurement uncertainty information is still a challenge in the chemical processing industries (CPI). From our experience, among the variables most affected by uncertainty, one typically finds the process outputs, comprising concentrations (main product and subproducts, reactants, etc.), measurements of quality properties (mechanical, chemical, etc.), or other relevant information about the end use of the product. With the increasing flexibility of processing units, these quantities can easily span different orders of magnitude and present rather different uncertainties associated with their measurements. This means that heteroscedasticity in the process outputs is a rather prevalent feature in CPI, which must be properly managed and integrated in all the tasks that make use of process data. In this article we address this critical problem, by considering two challenges: (i) specifying measurement uncertainty in heteroscedastic contexts; (ii) integrating uncertainty information on the outputs in model development. We present a critical review of the relevant literature regarding both challenges, and we address them through industrial case studies as well as an extensive Monte Carlo simulation study. For the first challenge, we provide solutions to handle the problem of specifying measurement uncertainties near physical limits (namely at very low concentrations, where intervals can easily include negative concentrations) and how to develop uncertainty models for heteroscedastic measurement devices spanning several orders of magnitude of the measurand. Regarding the second challenge, we conclude that prediction performance can be significantly improved by replacing tacitly homoscedastic modeling methodologies (ordinary least squares, partial least squares, and principal components regression) by alternative approaches that are able to explicitly handle the uncertainty in the outputs. 
Several important aspects of model development under heteroscedastic noisy conditions in the outputs are also highlighted and discussed.
1. INTRODUCTION
Chemical processing industries (CPI) typically generate huge amounts of data containing potentially useful information for conducting process operations (monitoring, control, optimization, scheduling, troubleshooting, etc.), systems improvement, and product development. These tasks require models, which are quite often developed by taking advantage of existing process databases (where the inputs, or X variables, are essentially located) together with product quality databases (where the target responses, Y variables or outputs, are obtained). Frequently, the relevant responses are affected by non-negligible amounts of measurement uncertainty associated with the complexity of the sampling process (sampling error and sampling bias), operator variability, complex analytical protocols, and instrumentation uncertainty, especially when measurements involve the quantification of trace levels of analytes. The importance of these sources of measurement uncertainty can be quantified through carefully designed repeatability and reproducibility studies1−3 and, from our experience, they typically lead to lower signal-to-noise ratios (SNR) than those found in process sensors for measuring temperature, flow, and pressure. Acknowledging the existence of uncertainty in the Y’s and its impact on all process analysis tasks is not yet a shared concern among all engineers and analysts handling process data. However, the metrology field has been pushing forward the importance of correctly specifying measurement uncertainty, which has been gaining importance among
standardization organizations (all of them now require the specification of the measurement uncertainties for all measurement devices) and in the scientific community (for instance, some scientific journals do not accept measurement reports lacking the specification of the associated uncertainty). We believe it is now opportune to embed its use in the daily process data analysis practices in industry, making measurement uncertainty data one additional source of information regarding the object under study that should be used together with the raw measurements.

Received: November 21, 2014
Revised: August 4, 2015
Accepted: August 31, 2015
Published: August 31, 2015

DOI: 10.1021/ie504577d
Ind. Eng. Chem. Res. 2015, 54, 9159−9177

There are, however, some significant challenges to be addressed while making this transition, which can be grouped in two categories: (i) measurement uncertainty specification; (ii) measurement uncertainty integration. The first category regards the problem of assigning to each measurement (the final result of the measurement process) its associated uncertainty. There are currently well-established standard procedures that can be followed in order to rigorously estimate and convey measurement uncertainty. The celebrated Guide to the Expression of Uncertainty in Measurement (usually known as GUM) defines measurement
uncertainty as a “parameter, associated with the result of a measurement, that characterizes the dispersion of the values that could reasonably be attributed to the measurand”4 and provides clear guidelines on the way it can be obtained using either a series of repeated measurements (type A evaluation) or other appropriate means (type B). The first approach is clearly aligned with a frequentist statistical estimation process, whereas the second one has an inherently Bayesian nature.5 Associated with the specification of the uncertainty is the construction of measurement uncertainty intervals of the type x ± u(x). These can be obtained using the so-called standard uncertainty, u(x), which simply corresponds to the standard deviation of repeated measurements made under repeatability conditions in a type A evaluation of uncertainty. Similar intervals can also be constructed with a prespecified coverage probability (as in confidence intervals), through the so-called expanded uncertainty, U = k × u(x), leading to the corresponding expanded uncertainty interval, x ± U (k is called the coverage factor and depends on the desired coverage probability). A number of publications report works where the principles of estimating and specifying the uncertainty of measurements are applied, further discussed, and extended to different scenarios.6−8 Other authors have also addressed the issue of estimating the measurement uncertainty structure under more complex situations, namely when the uncertainty sources are expected to present covariance.9,10 The effective use of all these procedures facilitates the uncertainty specification problem in a wide variety of situations, making raw values and the associated uncertainties the two sets of measurement outputs that are actually available for analysis. However, several situations escape the frame of these standard procedures.
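As a concrete illustration of the type A evaluation just described, the sketch below computes a standard uncertainty and the expanded uncertainty interval x ± U from a set of replicate readings. This is our own minimal rendering, not the authors' code; the function name and the replicate values are hypothetical. Following the text, u(x) is taken as the standard deviation of the replicates, and a coverage factor k ≈ 2 corresponds roughly to 95% coverage for near-Gaussian measurement errors.

```python
import statistics

def type_a_uncertainty(replicates, k=2.0):
    """Type A evaluation sketch: standard uncertainty as the standard
    deviation of repeated measurements, plus the expanded uncertainty
    interval x ± U with U = k*u(x)."""
    x = statistics.mean(replicates)
    u = statistics.stdev(replicates)   # standard uncertainty u(x)
    U = k * u                          # expanded uncertainty
    return x, u, (x - U, x + U)

# hypothetical replicate readings of a concentration (ppm)
x, u, interval = type_a_uncertainty([5.1, 4.9, 5.3, 5.0, 5.2])
print(x, u, interval)
```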
Several of these situations are, furthermore, rather common in chemical processing industries, and therefore particularly relevant for process engineers and analysts. We will address here the two such situations of highest priority for these professionals (see section 2).

1. Estimating measurement uncertainty near physical limits. Consider for instance the problem of measuring very low concentrations of some components in process streams. This situation is very common in the chemical, pharmaceutical, semiconductor, and food industries, where trace concentrations are frequently found. As one approaches a physical limit, measurement distributions become asymmetrical, and the usual procedures for specifying uncertainty intervals lead to regions that contain unfeasible values of the measurand. This raises a specification problem for which a solution must be found. We will address this topic in section 2.1 and illustrate the analysis of an industrial case study in section 4.1.

2. Estimating measurement uncertainty in heteroscedastic scenarios. The use of replicates (type A evaluation in the GUM4) or repeatability and reproducibility (R&R) studies2 provides protocols for rigorously estimating the uncertainty of measurements and the components contributing to it, under certain conditions. However, if the variation inherent to the measurement system depends on the level of the measurand, for instance in a proportional way, these procedures are only able to specify the uncertainty at a given level. The specification problem of assigning the uncertainty to future measurements at other levels therefore remains unaddressed in such reference methodologies. This situation is common in analytical equipment, such as spectrophotometers, chromatography columns, weighing devices, etc. We will propose, assess, and compare different approaches to address this issue in sections 2.2 and 4.2.

The second challenge regards the effective integration of Y-uncertainty information (i.e., information regarding the measurement uncertainty affecting the response variables) in data analysis, once it becomes available. Some works have already addressed the potential added value of including uncertainty information in different data analysis tasks, such as in modeling,11−14 process monitoring,15 and process design and optimization.12,16 Multivariate least-squares (MLS) and its univariate version, bivariate least-squares (BLS), are extensions of ordinary least squares (OLS) that are able to explicitly incorporate measurement uncertainty information in the estimation of a linear regression model when both the X’s and the Y’s carry sizable measurement uncertainties. These methods are quite prone to collinearity, a ubiquitous feature in industry, which is moderately mitigated with the use of maximum likelihood MLS11,12 or through ridge regularization.11,12 However, a better solution to deal with collinearity in large dimensional problems is to adopt latent variable frameworks. Wentzell et al.13 developed a maximum likelihood approach for estimating principal component models under general (and assumedly known) error structures affecting the measurements, called maximum likelihood principal component analysis (MLPCA). This method was then applied to multivariate calibration,14 extending the consideration of measurement uncertainties to input/output modeling approaches closely related to PCA, such as principal components regression (PCR) and latent root regression (LRR), giving rise to their maximum likelihood versions: MLPCR (maximum likelihood principal component regression) and MLLRR (maximum likelihood latent root regression). Improved algorithms for MLPCR were then proposed by Martínez et al.17 and Reis and Saraiva.12 Bro et al.18 also proposed a general framework for integrating data uncertainties in the maximum likelihood fitting of empirical models. Analyzing all the contributions referred to above, one can verify that none of them specifically covered the case under discussion in this article, namely, situations where significant uncertainties are located only in the response variables. This is a relevant practical scenario shared by many CPI, where daily laboratory analyses are conducted to assess product quality: important properties are first measured (e.g., concentration, different quality attributes) and then their associated measurement uncertainties are also determined. In this context, it is common that the measurement uncertainty of the output variables is readily available and well characterized, as a consequence of legislation and other constraints in place to ensure the operation of quality laboratories and the conformance of products with their specifications. On the other hand, there are process operations where the uncertainty of the input variables (e.g., temperature, flow, and pressure readings) is relatively small compared to the changes in the process, and it has also received less emphasis in the literature than the uncertainty in the outputs. Focusing once again on the aforementioned literature, one can notice that there is no extensive comparison study contemplating a variety of model structures, noise scenarios, and complexities of interest to practitioners. Furthermore, these studies have not contemplated different levels of sophistication in the integration of measurement uncertainty information in data-driven model development, each level implying a distinct investment overhead or magnitude of effort involved in
Figure 1. Deriving measurement uncertainty intervals near physical limits: (a) truncation of the interval; (b) Bayesian approach.
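The two interval constructions of Figure 1 can be sketched numerically. The snippet below is a minimal illustration under our own naming, not the authors' implementation: truncated_interval clips the expanded interval x ± k·u(x) to the feasible region [0, ∞) (Figure 1a), while bayesian_interval combines a uniform prior on [0, ∞) with a Gaussian likelihood and numerically searches for the shortest (maximum-density) interval of the resulting truncated normal posterior (Figure 1b).

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def trunc_cdf(t, m, u):
    """CDF at t of a N(m, u^2) distribution truncated to [0, inf)."""
    p0 = phi(-m / u)                      # mass below the physical limit
    return (phi((t - m) / u) - p0) / (1.0 - p0)

def inv_trunc_cdf(p, m, u):
    """Inverse of trunc_cdf by bisection on [0, m + 10u]."""
    lo, hi = 0.0, m + 10.0 * u
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if trunc_cdf(mid, m, u) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def truncated_interval(m, u, k=2.0):
    """Figure 1a: clip the expanded interval m ± k*u to the feasible region."""
    return max(0.0, m - k * u), max(0.0, m + k * u)

def bayesian_interval(m, u, coverage=0.95, steps=2000):
    """Figure 1b: shortest interval with the requested coverage under a
    uniform-on-[0, inf) prior and Gaussian likelihood, i.e. the
    maximum-density interval of a truncated normal posterior."""
    best = None
    for i in range(steps):
        p_lo = (1.0 - coverage) * i / steps  # posterior mass left of the interval
        a = inv_trunc_cdf(p_lo, m, u)
        b = inv_trunc_cdf(p_lo + coverage, m, u)
        if best is None or b - a < best[1] - best[0]:
            best = (a, b)
    return best

# a measurement close to the physical limit: m = 0.5, u(x) = 1.0
print(truncated_interval(0.5, 1.0))   # (0.0, 2.5)
print(bayesian_interval(0.5, 1.0))    # a shorter interval, starting at (essentially) 0
```

Note that the Bayesian interval is never shorter than necessary: by construction it carries exactly the requested posterior mass, and it remains strictly positive even when the raw measurement is negative, which is one of the drawbacks discussed in section 2.1.
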
obtaining the uncertainty information. This is an important piece of information when the goal is to find the adequate balance between such extra work and the actual benefits achieved. In this article, we provide solutions to the aforementioned gaps, with particular relevance for industry. Even though new methodologies will be proposed, we would like to point out that the novelty of this work is not confined to this particular aspect. Instead, we provide useful guidelines on how to treat measurement uncertainty information, starting from its proper specification under difficult but pervasive industrial contexts (the specification challenges), up to the study of the added value of its explicit integration in data-driven model development (the integration challenge). The Monte Carlo study, in particular, also brings new aspects and concerns to comparison studies, such as a better control of conditions that guarantee a fair and informative comparison of the methods and a systematic way to summarize a large amount of rigorous results (obtained in the framework of statistical hypothesis testing) that facilitates the comparison of multiple methods in a variety of testing scenarios. This article is organized as follows. Section 2 is dedicated to the specification problem, where the following two challenges are addressed:
• handling measurement variation in the neighborhood of physical limits (section 2.1, with results presented in section 4.1)
• estimating measurement uncertainty in heteroscedastic scenarios (section 2.2, with results presented in section 4.2)
Then, in section 3, we address the integration challenge and present the regression methods considered in the study (the respective results are presented in section 4.3). The results obtained for each challenge are presented in section 4 as described above, and finally, in section 5 we further discuss all the results obtained and summarize the main conclusions of this work.

2. CURRENT CHALLENGES IN THE SPECIFICATION OF MEASUREMENT UNCERTAINTY IN THE CPI AND RELATED INDUSTRIES
In this section we address two particular specification challenges that must be properly dealt with when analyzing data from chemical processing industries, problems that are shared by other industrial sectors as well. The first problem regards the treatment of measurement variation near physical limits in a coherent and effective way. The second problem involves estimating the uncertainty figures when they change together with the mean level of the measurements (heteroscedastic measurement variation). Both aspects are pervasive in industrial data analysis and raise difficulties in gathering the necessary measurement uncertainty information. They will be addressed in sections 2.1 and 2.2 and then applied in the analysis of industrial data sets and in a simulation study, as described in section 4.

2.1. Handling Measurement Variation in the Neighborhood of Physical Limits. It is not rare for the process variable of interest to be measured near its physical limit, i.e., close to the boundary beyond which no feasible values exist for it. A typical example is measuring near-zero concentrations, the so-called trace concentrations, of the order of a few parts per million or even less. In these conditions, applying the conventional procedures to estimate measurement uncertainty often leads to intervals that extend into the infeasible region. This may be regarded as unacceptable and in conflict with common sense; in fact, it is also in conflict with the actual definition adopted by the main reference on the specification of measurement uncertainty, the GUM, according to which it “characterizes the dispersion of the values that could reasonably be attributed to the measurand”. However, a more in-depth analysis of the way this problem is treated in the literature reveals that this is not the only perspective at stake. In fact, some authors state that, near the physical limits, where the uncertainty is of about the same order of magnitude as the measurements, the GUM principles can no longer be accurately applied (Eurachem/CITAC guide, Appendix F.119). We believe that the right procedure to adopt critically depends on the underlying goal of the analysis: (i) reporting a final result; (ii) assigning the measurement uncertainty to an intermediate result, which will be the object of further processing. In the last case (ii), the existence of an intermediate unfeasible region does not represent any particular problem, as it is just an algorithmic step in the computation of the final uncertainty interval. The conventional procedures can thus be straightforwardly applied and remain valid in this case. In fact, even reporting a negative concentration and the associated measurement uncertainty interval is a legitimate expression of the outcome of the measurement system, which only reflects the measurement process and not the physical meaning and constraints with which the measurand should comply. However, the first situation (i) often requires conformance with physical feasibility. Several approaches can be considered for enforcing feasibility in the construction of measurement uncertainty intervals. The
alternatives available are quite diverse in quality, and one should first clearly define performance criteria in order to set the ground for selecting the best method(s) to adopt. The main criteria usually considered are the following:20 (i) feasibility of the uncertainty interval; (ii) bias of the estimate; (iii) consistency with the predefined probability coverage for the expanded uncertainty interval. Among the methods available for this purpose, we will analyze in this article data censoring, data shifting, truncation of the interval, and a Bayesian approach. The first two approaches are highly biased and are not recommended (even though they are sometimes used in practice). On the other hand, the last two lead to interesting solutions to the specification problem near physical limits.

Data Censoring. Data censoring consists of simply rejecting, from all the calculations, measurement records falling outside the physical limits. This procedure introduces serious bias, and the expanded measurement uncertainty interval may still contemplate a region with negative concentrations.

Data Shifting. In this method, values falling outside the physical limits are shifted toward their closest limit. This procedure suffers from the same drawbacks as data censoring, although presenting less bias.

Truncation of the Interval. This is a very simple and rather effective approach.19−21 It consists of first computing expanded uncertainty intervals using the current statistical methods (frequentist approach). Then, the interval is truncated in order to cover only the values in the feasible region (Figure 1a). This approach is found to be adequate for most of the situations found in practice.

Bayesian Intervals. This method formally introduces the requirement of respecting the feasibility domain, through a prior distribution presenting finite density only in the feasible region. This is usually achieved through a uniform distribution, which is combined with a t-distribution (or Gaussian distribution) estimated from measurement data (the likelihood term in the Bayesian formulation), leading to an a posteriori truncated t-distribution (or truncated normal distribution) (Figure 1b). The expanded measurement uncertainty interval is finally obtained as the maximum density interval of the a posteriori distribution, i.e., the shortest interval with the specified probability coverage.20,22 Even though theoretically consistent, the Bayesian approach presents some drawbacks, such as biased estimates, the need to establish a prior distribution, and the fact that it provides a positive interval even when all measurement records are negative. In section 4.1, the above-mentioned procedures will be critically compared regarding conformance to criteria (ii) and (iii), through a Monte Carlo study conducted under realistic conditions. They will also be used to establish measurement uncertainty intervals using industrial process measurements.

2.2. Estimating Measurement Uncertainty in Heteroscedastic Scenarios. The procedures for specifying the uncertainty affecting measurements under homoscedastic conditions are well-established. They are extensively described in reference documents and standards, such as the GUM and other sources produced by national and international organizations, such as the National Institute of Standards and Technology (NIST). However, they have limited applicability in industrial scenarios where measurements may span several orders of magnitude, as happens for instance in concentration measurements. In these conditions, the measurement uncertainty depends on the mean analyte level of the measurements, and it is this functional dependency that should be specified instead of its realization at some arbitrarily defined level. In industry, such dependency is often handled by considering a constant ratio between measurement uncertainty and analyte level, i.e., through a constant relative standard deviation. In a more in-depth analysis, the level dependency of uncertainty can be conceptualized as being composed of three regions (Figure 2). At low magnitudes of the measurand, the
Figure 2. Two-stage behavior of the relationship between measurement uncertainty and measurement level.
measurement uncertainty is more or less constant and dependent on features related to the sensitivity and selectivity of the measurement sensing device. At higher levels, the uncertainty becomes a function of level with a given functional shape (e.g., linear or quadratic). Between these two regions there is a transition stage from the low-magnitude regime to the high-magnitude regime, with mixed features. Several model structures for describing the heteroscedasticity of measurement variation have been proposed. One example is based on the following error model:23,24
x = μe^η + ε    (1)

where μ represents the true value of the measurand, μe^η is the proportional error term with η ∼ independently and identically distributed (iid) N(0, σ_η²), and ε is the additive error term, with ε ∼ iid N(0, σ_ε²). From this expression the following measurement uncertainty dependency can be derived:

u(x) = √[σ_ε² + μ²e^(σ_η²)(e^(σ_η²) − 1)]    (2)
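A quick Monte Carlo check, under assumed parameter values, confirms that eq 2 reproduces the dispersion generated by the error model of eq 1. The following sketch uses illustrative names and parameter values of our own choosing, not figures from the article:

```python
import math
import random

def simulate_measurements(mu, sigma_eta, sigma_eps, n=200_000, seed=0):
    """Draw measurements from the error model of eq 1:
    x = mu*exp(eta) + eps, eta ~ iid N(0, sigma_eta^2), eps ~ iid N(0, sigma_eps^2)."""
    rng = random.Random(seed)
    return [mu * math.exp(rng.gauss(0.0, sigma_eta)) + rng.gauss(0.0, sigma_eps)
            for _ in range(n)]

def uncertainty_eq2(mu, sigma_eta, sigma_eps):
    """Closed-form standard uncertainty of eq 2."""
    v = sigma_eps**2 + mu**2 * math.exp(sigma_eta**2) * (math.exp(sigma_eta**2) - 1.0)
    return math.sqrt(v)

# illustrative parameter values (not from the article)
mu, s_eta, s_eps = 10.0, 0.05, 0.2
xs = simulate_measurements(mu, s_eta, s_eps)
mean = sum(xs) / len(xs)
sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))
print(sd, uncertainty_eq2(mu, s_eta, s_eps))  # the two values agree closely
```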
A more parsimonious solution can be found in the Eurachem/CITAC guide (Appendix E.5),19 which consists of

u(x) = √[s0² + (x·s1)²]    (3)

where s0 and s1 are the parameters to be estimated; the first one represents the constant contribution to the overall uncertainty (dominant in the low-magnitude regime) and the second is the proportionality constant for the heteroscedastic dependency (more noticeable in the high-magnitude region). This formula is based on the additive property of variances from independent sources of variation. It requires a relatively large number of values at different levels for calibrating the model (>10 levels). Most often, practical constraints force the replacement of such a parabolic expression with the following linear approximation, which simply requires uncertainty estimates at at least four different levels of the measurand:

u(x) = s0* + x·s1*    (4)
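As a sketch of how eq 4 can be calibrated in practice (the UM-1 approach discussed below), the snippet fits s0* and s1* by ordinary least squares to hypothetical pairs of replicate averages and standard uncertainties; the function name and data are illustrative assumptions:

```python
def fit_um1(levels, uncertainties):
    """UM-1 sketch: ordinary least-squares fit of eq 4, u(x) = s0* + s1*·x,
    from replicate averages (levels) and their standard uncertainties."""
    n = len(levels)
    mx = sum(levels) / n
    my = sum(uncertainties) / n
    sxx = sum((x - mx) ** 2 for x in levels)
    sxy = sum((x - mx) * (y - my) for x, y in zip(levels, uncertainties))
    s1 = sxy / sxx          # slope: proportional (heteroscedastic) term
    s0 = my - s1 * mx       # intercept: constant contribution
    return s0, s1

# hypothetical calibration data: four levels, uncertainty growing with level
s0, s1 = fit_um1([1.0, 10.0, 50.0, 100.0], [0.11, 0.30, 1.1, 2.1])
print(s0, s1)  # s0 ≈ 0.095, s1 ≈ 0.020
```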
s0* and s1* maintain the same interpretation as s0 and s1, respectively. As this work addresses the specification and use of uncertainty information under practical constraints, we will adopt model eq 4 as one of the most reliable descriptions of measurement uncertainty patterns available in CPI.25 The model parameters, s0* and s1*, can be easily estimated through ordinary least squares (OLS), given the knowledge of {x̄i, u(xi)}i=1:N, obtained through replicated analysis at N different levels of the measurand (N ≥ 4, with at least 10 replicates per level; x̄i stands for the average of the replicates at the ith level, and u(xi) is the standard uncertainty, given by the standard deviation of the same group of replicates). We will call this approach UM-1 (uncertainty model 1). A subtle assumption of UM-1 is that it tacitly assumes homoscedasticity in the relationship between uncertainty and measurement level. It would be wise to challenge this hypothesis and consider that, possibly, the uncertainty affecting the heteroscedastic model for the measurement uncertainty would also be, in its turn, heteroscedastic. Therefore, we propose in this article that the model structure given by eq 4 should be estimated through weighted least squares (WLS), instead of OLS (these methods will be reviewed in section 3). However, the implementation of this procedure implies the availability of the following triplet for each observation in the training set: {x̄i, u(xi), var[u(xi)]}i=1:N. The first two components are the same as for UM-1, but the third one, the variance (or squared uncertainty), var[u(xi)], associated with the uncertainty estimate, u(xi), must now be provided in order to implement WLS. We have tested two procedures for obtaining this quantity: (i) nonparametric estimation using the bootstrap; (ii) the asymptotic expression for the variance of the sample standard deviation (S) from iid normal samples. The latter procedure consists of using the following formula:

var[S] = σ²{1 − (2/(n − 1))·[Γ(n/2)/Γ((n − 1)/2)]²}    (5)

where σ is the population standard deviation, Γ(·) is the gamma function, and n is the number of replicates used to estimate the standard deviation (or standard uncertainty). In this article we report only the results obtained with the nonparametric approach, as it allows for more flexibility in the type of distributions to be considered (1000 bootstrap samples were used in the computations). Therefore, the final data table used to estimate the measurement uncertainty model is composed of {x̄i, u(xi), varboot[u(xi)]}i=1:N. We will call the resulting uncertainty model UM-2. Another approach for developing the measurement uncertainty models was also proposed and tested in this article. As the uncertainty models should hold for concentrations spanning different orders of magnitude, we also modeled the relationship of measurement uncertainty vs level in the log−log domain. This approach has the potential to describe the measurement uncertainty behavior over a wide range of concentrations and assumes an error of the proportional type in the functional relationship (the additive error in the regression model estimated in the log−log domain translates into a proportional error term in the original domain of the variables). Therefore, by estimating the model in the log−log domain using OLS, one is tacitly considering a heteroscedastic error structure in the formulation, even though its explicit form is unknown. This is another new model considered for the first time in measurement uncertainty modeling, which will be designated as UM-3. These three models will be tested in section 4.2 with an industrial data set, along with some slightly modified versions described therein. In particular, in that section the accuracy performance of the new approaches UM-2 and UM-3 will be comparatively assessed against the current practice, UM-1.

2.3. More on the Use of Measurement Uncertainty Models: Important Practical Aspects. The uncertainties affecting all response measurements must be specified in order to implement any of the uncertainty-based methods contemplated in this work (see section 3.1). For heteroscedastic conditions, this implies that either one has replicates available for each condition in the training set, enabling the estimation of the measurement uncertainty (type A evaluation according to the GUM), or one has a properly estimated uncertainty model, such as UM-1, UM-2, or UM-3. The first option requires collecting multiple samples from the same condition and repeating the measurement protocol for each one of them. This encompasses a working load that is not always feasible in industry. The second approach, on the other hand, requires some work overhead in a preliminary stage, but once it is concluded, the uncertainties for the different levels can be promptly estimated from the models developed, without further testing efforts. This solution is more aligned with industrial practice and therefore will be considered here more fully. A measurement uncertainty model for the response has the following general form, of which those presented above are particular cases:

û(y) = f(y; θ)    (6)

In this model, y represents the level at which the measurement uncertainty is to be estimated, û(y), and θ is a vector of constants that parametrize the model. When a new measurement of the response is made, y becomes available and can be inserted directly into the model to obtain the associated estimate of the uncertainty. This is the simplest approach and the one that may seem most appropriate to apply. However, a deeper analysis reveals that such a measurement is affected by an uncertainty, which will then be passed on to its estimate, deteriorating the quality of the prediction. We therefore propose an alternative approach, which we have found to be substantially better. It consists of using the output of a preliminary prediction model for the response, ŷ, such as the one referred to in Table 1 (step 1), instead of the

Table 1. Alternative Procedure To Derive an Uncertainty Model, Based Only on Data from the Training Set (TR)
1. Estimate a simple regression model involving the response, Y, and all the inputs, X’s, using the training set. This can be conducted with OLS, if no singularity or severe ill-conditioning is present; otherwise use PLS.
2. Compute and save the residuals for the training set: RESY.
3. Define a grid for binning the observations in the training set (Y) with nbin bins.
4. For each bin i = 1:nbin do the following:
   a. Save the observed values of Y falling in bin i, ...
   b. ... as well as the associated residuals for the predictions made by the model (both in the same bin i).
   c. Compute the standard deviation of the RESY values in the bin: si.
   d. Compute the mean of the responses, Y, in the bin: ȳi.
5. Build a WLS model of si versus ȳi using the number of observations in each bin as the weighting parameter.
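The procedure of Table 1 can be sketched as follows. This is an illustrative rendering under our own naming, assuming the residuals of the preliminary regression model (step 2) are already available; it bins the training responses, computes a residual standard deviation per bin, and fits the si vs ȳi line by weighted least squares with the bin counts as weights (steps 3–5):

```python
import random
import statistics

def uncertainty_model_from_residuals(y_obs, residuals, nbin=5):
    """Sketch of the Table 1 procedure (our own naming): bin the training
    responses, take the standard deviation of the model residuals in each
    bin as a local uncertainty estimate, then fit s_i vs mean(y_i) by
    weighted least squares, with the bin counts as weights."""
    lo, hi = min(y_obs), max(y_obs)
    width = (hi - lo) / nbin
    bins = [[] for _ in range(nbin)]
    for y, r in zip(y_obs, residuals):
        i = min(int((y - lo) / width), nbin - 1)
        bins[i].append((y, r))
    ybar, s, w = [], [], []
    for b in bins:
        if len(b) < 2:
            continue  # need at least two points for a standard deviation
        ybar.append(statistics.mean(y for y, _ in b))
        s.append(statistics.stdev(r for _, r in b))
        w.append(len(b))
    # weighted least squares for s = a + b*ybar, weights = bin counts
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, ybar)) / sw
    my = sum(wi * y for wi, y in zip(w, s)) / sw
    b_hat = (sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, ybar, s))
             / sum(wi * (x - mx) ** 2 for wi, x in zip(w, ybar)))
    a_hat = my - b_hat * mx
    return a_hat, b_hat

# synthetic check: residual spread grows linearly with the response level
rng = random.Random(1)
ys = [rng.uniform(0.0, 100.0) for _ in range(2000)]
res = [rng.gauss(0.0, 0.5 + 0.02 * yi) for yi in ys]
a_hat, b_hat = uncertainty_model_from_residuals(ys, res)
print(a_hat, b_hat)  # should be close to the generating values 0.5 and 0.02
```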
Figure 3. Comparing the estimates of measurement uncertainty (a) obtained by inserting the measured values of the measurand directly into the uncertainty model and (b) obtained after using a simple model to estimate the values of the measurand.
Figure 4. Different approaches for deriving a measurement uncertainty model: (a) using replicated data collected from a preliminary study (RR approach); (b) using the training data set.
measured value, y, as input to the uncertainty model. Such a predictive model has the ability to suppress some of the noise affecting the variable, as exemplified in Figure 3, where a comparison is made between the estimates of measurement uncertainty obtained by inserting directly the measured values of the measurand in the uncertainty model (Figure 3a) with those obtained after using a simple model to estimate the values of the measurand (Figure 3b). Each point represented in the plots concerns the pair of values regarding the uncertainty estimated for a given output observation, yi, made with resort to an expression of the type û(yi) = f (yi;θ), versus the actual value of the uncertainty of yi. The “true” values of y (i.e., ỹ) are simulated with resort to one of the base models presented in
section 4.3 (please refer to this section for more details on how they are implemented), and the observed values of y are obtained by adding an iid N(0,u(yi)) random term to the “true” value (in this expression, u(yi) stands for the “true” uncertainty model of y, which is used by the simulator, but of course is assumed to be unknown in the estimation of the data-driven model). Both plots present linear 1:1 trends, but the scatter obtained in the plot of Figure 3b, where an estimate, ŷi, is used instead of yi in f(yi;θ), is much lower than that obtained in Figure 3a, where the actual observed values, yi, are employed. Therefore, if the model quality is reasonably good, its prediction will lead to better estimates of the uncertainty when inserted into eq 6.
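The effect described above can be reproduced with a small simulation. The uncertainty model u(y) = 0.2 + 0.05y below is a made-up stand-in for the article's eq 6, and the linear base model is likewise an assumption of this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical linear uncertainty model (stand-in for eq 6).
u = lambda y: 0.2 + 0.05 * y

# "True" responses from a linear base model, observed with
# heteroscedastic measurement noise of standard deviation u(y_true).
x = rng.uniform(0, 10, (2000, 3))
y_true = 1.0 + x @ np.array([0.8, -0.3, 0.5])
y_obs = y_true + rng.normal(0.0, u(y_true))

# Option (a): plug the noisy observation into the uncertainty model.
u_a = u(y_obs)
# Option (b): first denoise y with a simple OLS model, then plug in
# the (smoother) prediction.
Xt = np.column_stack([np.ones(len(y_obs)), x])
y_hat = Xt @ np.linalg.lstsq(Xt, y_obs, rcond=None)[0]
u_b = u(y_hat)

# Mean squared deviation from the true uncertainties: the
# prediction-based estimates scatter much less (Figure 3b vs 3a).
err_a = np.mean((u_a - u(y_true)) ** 2)
err_b = np.mean((u_b - u(y_true)) ** 2)
```

Because the OLS prediction averages out most of the observation noise, err_b is far smaller than err_a, mirroring the reduced scatter seen in Figure 3b.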
of approaches, and some of them are theoretically able to lead to optimal or close-to-optimal solutions under the testing scenarios considered (section 4.3.1). Other methods, not considered in this work, are not expected to bring significant added value to the analysis. An additional criterion encompasses the selection of methodologies that are natural counterparts regarding the use of measurement uncertainty information, such as OLS/WLS, for instance. This is important in order to assess the expected added value of engaging in the additional work of obtaining measurement uncertainty information, by looking at its impact on the final performance of related methods. The methods tested in this work are presented in sections 3.1.1 and 3.1.2. 3.1.1. Ordinary Least Squares (OLS) and Weighted Least Squares (WLS). Ordinary least squares (OLS) is a basic approach for deriving an input/output model from process/quality data:
Another important practical aspect regarding measurement uncertainty models is the type of data used to derive them. Until now it was assumed that uncertainty models can only be derived if multiple replicates of measurements are available. This is indeed a suitable way to collect the data necessary to derive a good measurement uncertainty model, which can then be estimated through one of the approaches presented above (namely, UM-1, UM-2). We will refer to it, for short, as RR, as such data may be collected using the general methodology known as gage repeatability and reproducibility2 (even though the protocol can now be much simpler). This method is schematically represented in Figure 4a. Another possibility consists of using the same input/output raw data available to develop the model also for estimating the uncertainty model, without any extra effort to collect replicates. This procedure is based on a binning approach26−29 and is described in Table 1 and schematically presented in Figure 4b. As it uses only training data, we will refer to it simply as TR. The TR procedure circumvents the need to collect replicated data, but the measurement uncertainty models derived typically have poorer quality than those obtained with the RR approach, due to the usually uneven distribution of samples in the training set across the interval of interest of the response, and also because of the granularity imposed by the binning process.
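A rough sketch of the RR/UM-2 route (replicates → type A uncertainties → bootstrap variances → WLS fit) might look as follows. The linear u(x) = a + b·x structure, the function name, and the bootstrap size are assumptions made for illustration:

```python
import numpy as np

def um2_from_replicates(levels_replicates, n_boot=1000, seed=0):
    """UM-2 sketch: from replicate measurements at each concentration
    level, compute the type A standard uncertainty, estimate its
    variance by nonparametric bootstrap, and fit a WLS line
    u(x) = a + b*x weighted by the inverse bootstrap variances."""
    rng = np.random.default_rng(seed)
    x_bar, u, var_u = [], [], []
    for reps in levels_replicates:          # reps: 1-D array of replicates
        reps = np.asarray(reps, dtype=float)
        x_bar.append(reps.mean())
        u.append(reps.std(ddof=1))          # type A standard uncertainty
        # Bootstrap the standard-uncertainty estimate itself.
        boot = [rng.choice(reps, reps.size).std(ddof=1)
                for _ in range(n_boot)]
        var_u.append(np.var(boot, ddof=1))
    x_bar, u = np.array(x_bar), np.array(u)
    w = 1.0 / np.array(var_u)               # WLS weights
    A = np.column_stack([np.ones_like(x_bar), x_bar])
    W = np.diag(w)
    a, b = np.linalg.solve(A.T @ W @ A, A.T @ W @ u)
    return a, b
```

The inverse-variance weights keep the precisely estimated low-level uncertainties from being swamped by the large (and noisier) high-level ones, which is exactly the heteroscedastic problem UM-2 is designed to handle.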
$$\hat{\mathbf{b}}_{\text{OLS}} = \underset{\mathbf{b} = [b_0 \cdots b_m]^T}{\arg\min} \left\{ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \right\} \qquad (7)$$

$$\text{s.t.} \quad \hat{y}_i = b_0 + b_1 x_{1,i} + \cdots + b_m x_{m,i} \qquad (i = 1, \ldots, n)$$
The least-squares estimate for the parameter vector, b, is given by

$$\hat{\mathbf{b}}_{\text{OLS}} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T \mathbf{y} \qquad (8)$$
3. INTEGRATION OF Y-UNCERTAINTY IN MODEL BUILDING Even when information regarding measurement uncertainty is available, it is not trivial to incorporate it to improve the quality of data analysis. In fact, most of the frameworks developed over the years assume only the existence of raw data tables, and little attention has been devoted to situations where two tables are available: one for the raw data and another, of equal dimensions, for the associated measurement uncertainties.9,12,14,15,18,30 Over the years, several approaches were proposed and studies were conducted where uncertainty information about the X’s13,14,18,31−33 or the X’s and the Y’s11,12,17,34−37 was incorporated in model development. However, the specific situation of handling heteroscedasticity only in the responses for large-scale systems remains largely unexplored. This case is particularly relevant in the development of predictive approaches for key quality variables (Y’s) measured in quality control laboratories, based on process variables (X’s). It is common in practical applications for responses to present significant amounts of measurement uncertainty due to the sampling process, operator variability, measurement system variation, and limitations in the measurement protocol. On the other hand, in some industries, process variables such as flow, temperature, and pressure readings present comparatively very small amounts of measurement uncertainty. In section 3.1, the methods considered in this study to handle this scenario are presented. 3.1. Methods Considered in This Work. The methods considered in this study were selected according to the following main goals. The first criterion is that the methods’ model structures should be flexible enough to adapt to the most recurrent situations found in practice.
The purpose is not to conduct an exhaustive study of all possible modeling approaches, but to focus on those methodologies that handle specifically the important case of heteroscedasticity in the responses. However, the methods covered span different classes
where X̃ is the extended regressors’ matrix and y is the (n × 1) column vector of responses:

$$\tilde{X} = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{m,1} \\ 1 & x_{1,2} & \cdots & x_{m,2} \\ \vdots & \vdots & & \vdots \\ 1 & x_{1,n} & \cdots & x_{m,n} \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \qquad (9)$$
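Equations 8 and 9 translate directly into code. A minimal sketch (using a least-squares solver rather than forming the explicit inverse, for numerical robustness):

```python
import numpy as np

def ols_fit(X, y):
    """Compute eq 8, b = (X~^T X~)^{-1} X~^T y, with X~ the regressor
    matrix extended by a column of ones (eq 9)."""
    Xt = np.column_stack([np.ones(X.shape[0]), X])
    # lstsq solves the least-squares problem without forming the
    # inverse, which is numerically safer for ill-conditioned X~.
    b, *_ = np.linalg.lstsq(Xt, y, rcond=None)
    return b
```

On noiseless data the coefficients are recovered exactly; with additive iid Gaussian noise in y, eq 8 gives the maximum likelihood estimates.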
The OLS model structure considers that each variable brings some predictive power to the model. In other words, the predictive element for this class of linear regression models is the “variable”. When the amount of shared information between the regressors is high, i.e., when they are correlated, this assumption no longer holds and some care must be exercised in order to construct a proper model. This is the well-known multicollinearity problem of OLS. The solution that maintains the model structure unchanged consists of selecting the p (p < m) regressors that bring most of the predictive capability to the model, and leaving aside the others that are redundant or correlated with them. This type of solution is generically called variable selection. Among the many methods to select the relevant regressors (such as forward addition, backward removal, forward stepwise, best subsets, genetic algorithms, etc.), we employed the forward stepwise regression protocol. This choice balances the performance achieved, which is among the best of all classical variable selection methods, and the feasibility of integration in a Monte Carlo study, which requires a fast implementation and no supervised fine-tuning in each trial. Forward stepwise regression38,39 consists of successively selecting variables, as long as they are able to bring a statistically significant contribution to the amount of y-variability explained by the model (evaluated with resort to a partial-F test). On the other hand, variables that were selected before and stop being
response using OLS. The way the scores are selected can vary, leading to different implementations of PCR. In this work we selected the scores using forward stepwise selection, for the reasons referred to above. We will call this approach PCR1. PLS, on the other hand, searches for the linear combinations of the X’s and Y’s, the X- and Y-scores, with maximal covariance and satisfying some orthogonality constraints among themselves.54 The number of dimensions to retain in PLS was established using the adjusted Wold’s R criterion (AWRC). Even though no method exists that always provides the best solution, some evidence indicates that this criterion leads to superior results with some consistency.55 A major distinctive feature of LV models, relative to the methods presented in the previous section, is that the predictive entities are the latent variables, which in practice are estimated as linear combinations of the regressors, also called variates. Thus, with LV models the predictive elements are variates, which reflect underlying and unmeasured sources of variation, and not the individual variables. Both PCR and PLS tacitly assume homoscedastic error structures for the variables. For PCR, some approaches have been proposed for handling the presence of heteroscedastic noise in the X’s14 and in the X’s and Y’s.11,12,17 For PLS this situation is less well explored, and only a few methods have been proposed in the literature for the case of heteroscedastic noise in the X’s and Y’s.11,12 However, these methods did not present a consistently superior performance over the regular PLS formulation (using NIPALS), and therefore were not incorporated in this study. In order to integrate the information on the measurement uncertainty in the responses in PCR, the PCA scores will be regressed onto the responses using WLS. This is the optimal way to handle uncertainty information in the Y’s within the PCR framework, which we will call PCR2.
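A minimal sketch of PCR2, assuming a plain SVD-based PCA and retention of the leading components (a simplification relative to the stepwise score selection described in the text):

```python
import numpy as np

def pcr2_fit(X, y, u_y, n_comp):
    """PCR2 sketch: PCA scores of the mean-centered X are regressed on
    y by WLS, weighting each observation by 1/u(y_i)^2 (eq 11 form)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_comp]                              # PCA loadings
    T = Xc @ P.T                                 # PCA scores
    Tt = np.column_stack([np.ones(len(y)), T])   # add intercept
    Winv = np.diag(1.0 / np.asarray(u_y) ** 2)   # WLS weights
    b = np.linalg.solve(Tt.T @ Winv @ Tt, Tt.T @ Winv @ y)
    return b, P

def pcr2_predict(X_new, X_train_mean, b, P):
    """Predict responses for new observations, centering with the
    training mean before projecting onto the loadings."""
    T = (X_new - X_train_mean) @ P.T
    return b[0] + T @ b[1:]
```

PCR1 corresponds to the same pipeline with an ordinary (unweighted) regression of the scores, so the two differ only in the score-regression step, as intended for the comparison.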
Therefore, PCR1 and PCR2 are the two counterpart methods that only differ in the use of measurement uncertainty information on the Y’s. Both PCR and PLS were implemented with autoscaling as data preprocessing, and the associated uncertainties were transformed accordingly, using the properties of the variance. 3.2. Preventing Overfitting Using Measurement Uncertainty Information. Another situation where the use of measurement uncertainty information on the responses can be integrated in the development of predictive models occurs in the selection of the model structure or complexity. A number of methods encompass a preliminary stage of model structure definition (also called system identification), after which the parameters are estimated. Examples of such a preliminary stage include the selection of variables to incorporate in OLS and WLS models, the number of dimensions in latent variable models such as PCR and PLS, or the topology of artificial neural networks (ANN), to mention only a few. For each class of methodologies, different approaches were put forward to define the optimal level of complexity. Some of them are based on stepwise procedures that assess the significance of the additional amount of Y-variability explained by new entries into the model structure (such as variable selection schemes based on the partial-F test26,38), others evaluate the bias−variance trade-off using cross-validation approaches (k-fold, leave-one-out, etc.),43,56 while still others seek to optimize predefined criteria of optimality (such as information-theoretic criteria, AIC, BIC, or statistical criteria, such as Mallows’ Cp). All these different methodologies aim to
relevant given the more recent entries in the model may also be discarded, if their contribution becomes redundant or not significant. This procedure is fast and usually leads to good models, despite the underlying criteria being centered not on prediction ability but on the quality of fit of the estimated model. The OLS model assumes that only the response carries a sizable error, and that this error is homoscedastic. This justifies its selection, as the scenarios considered in this work only contemplate uncertainty in the Y’s. However, if such an error is heteroscedastic, namely if it depends on the level of the response, the OLS estimates are no longer optimal in a maximum likelihood sense, as the probabilistic model structure is different in this case. Weighted least squares (WLS) is the counterpart of OLS that is able to explicitly incorporate knowledge of the heteroscedastic behavior of the response in model development. It assumes that information is available on the inputs, the responses, and the uncertainty associated with the responses, {xi, yi, u(yi)}i=1:n, and estimates the model parameter vector, b, according to the following optimization formulation:

$$\hat{\mathbf{b}}_{\text{WLS}} = \underset{\mathbf{b} = [b_0 \cdots b_m]^T}{\arg\min} \left\{ \sum_{i=1}^{n} \frac{(y_i - \hat{y}_i)^2}{u(y_i)^2} \right\} \qquad (10)$$

$$\text{s.t.} \quad \hat{y}_i = b_0 + b_1 x_{1,i} + \cdots + b_m x_{m,i} \qquad (i = 1, \ldots, n)$$
The closed-form solution is

$$\hat{\mathbf{b}}_{\text{WLS}} = (\tilde{X}^T W^{-1} \tilde{X})^{-1} \tilde{X}^T W^{-1} \mathbf{y} \qquad (11)$$
where W is the (n × n) diagonal matrix containing the squared uncertainty of the ith response, u(yi)², in position (i,i). 3.1.2. Latent Variable Methods (LV). Latent variable (LV) methods form a class of approaches that have found great success in exploring, describing, and predicting industrial and analytical data.40−43 The basic model structure of a LV model is the following:

$$X = TP + E, \qquad Y = TQ + G \qquad (12)$$
where X and Y are the (n × m) predictor and (n × r) response matrices (for the general case where r responses are to be handled simultaneously); T is the n × p matrix of latent variables. P and Q are coefficient matrices of appropriate size, and E and G are residual matrices that describe mutually independent unstructured sources of variation. This model structure can be estimated using different approaches. Among the most well-known are principal components regression (PCR)43,44 and partial least squares (PLS).43−50 Besides handling the presence of multicollinearity in a natural and coherent way as well as situations where there are more variables than observations, these methods have the ability to accommodate the presence of moderate amounts of missing data by taking advantage of the existing correlation structure among variables,51−53 and they are also known to be quite robust to the presence of moderate amounts of noise. Therefore, they are natural candidates to be included in this study. PCR consists of estimating first a PCA model from the X variables and then regressing a subset of its scores onto the
correct and the random term η is absent, then the observation residuals behave as εi ∼ iid N(0, σ²εi), where σεi is the standard uncertainty of the ith measurement. This means that the sum of the squared weighted residuals, with ε̃i = εi/σεi, should follow approximately a χ²n−m−1 distribution, where n is the number of measurements and m is the number of variables in the model (this is so because the weighted residuals are standard normal variables, whose sum of squares follows a χ²ν distribution, with ν the number of degrees of freedom). One consequence of this result is the following. When constructing a model by successively inserting components (variables in the case of WLS, or scores in the case of PCR2), the residuals decrease, and so does the weighted sum of squared residuals (WSSR). Eventually we may reach a point where, for a given number of model components, say m, the WSSR is consistent with a χ²n−m−1 distribution. At this point the model has a proper size given the uncertainty information available, and no more components should be added, as doing so would lead to overfitting. In order to assess the consistency of the WSSR with the χ²ν distribution with the appropriate number of degrees of freedom reflecting the model complexity, we propose a procedure with two nested bootstrap sampling cycles: (i) the inner cycle performs bootstrap on the training set observations and computes the WSSR for each trial; at the end of each inner cycle the mean of the computed WSSR values is saved; (ii) the outer cycle replicates this process a large number of times, in order to estimate the bootstrap distribution of the mean WSSR. From this bootstrap distribution, one can compute a point estimate for the mean and the associated bootstrap confidence intervals.
As the mean of the χ²ν distribution is simply ν, it is possible to identify the point where the estimated mean of the WSSR crosses n − m − 1, after which overfitting takes place. Note once again that this crossing does not have to exist; it only occurs when the random term η is zero and all unstructured variation is due to measurement noise. Therefore, this criterion is a necessary condition to prevent overfitting in parametrized regression models, but not a sufficient one. This procedure is illustrated below for the WLS case. Data were generated from a model containing 10 variables with nonzero coefficients and 10 variables with zero coefficients. A stepwise selection procedure was conducted, adding one variable at a time to the model, leading to the sequence of confidence intervals for the mean of the WSSR presented in Figure 5. As can be seen from Figure 5, the crossing occurs after the 10th addition to the model. The fact that the confidence interval (CI) for this level of model complexity is located below the reference line for overfitting may be due to the approximate nature of the cutoff, which is an asymptotic limit assuming, among other things, perfect knowledge of the model and the uncertainties. As the model is actually estimated from data, this behavior is expected. Even with this limitation, however, the procedure still provides useful support for model building under heteroscedastic noise conditions.
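The nested bootstrap for the mean WSSR can be sketched as follows. This is a simplified illustration: the numbers of inner/outer replicates and the percentile-type confidence interval are assumptions of this sketch.

```python
import numpy as np

def wssr(X, y, u_y, b):
    """Weighted sum of squared residuals for a linear model with
    intercept, with residuals scaled by the response uncertainties."""
    Xt = np.column_stack([np.ones(len(y)), X])
    r = (y - Xt @ b) / u_y
    return np.sum(r ** 2)

def bootstrap_wssr_ci(X, y, u_y, b, n_inner=50, n_outer=200,
                      alpha=0.05, seed=0):
    """Nested bootstrap for the mean WSSR: the inner cycle resamples
    observations with replacement and computes the WSSR; the outer
    cycle repeats this to build a distribution for the mean, from
    which a percentile CI is taken."""
    rng = np.random.default_rng(seed)
    n = len(y)
    means = []
    for _ in range(n_outer):
        vals = []
        for _ in range(n_inner):
            idx = rng.integers(0, n, n)
            vals.append(wssr(X[idx], y[idx], u_y[idx], b))
        means.append(np.mean(vals))
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

In a stepwise building loop one would recompute this CI after each added component and stop once the interval crosses the n − m − 1 reference, as in Figure 5.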
construct parsimonious, robust, and accurate predictive models, but no method provides the best overall solution for all classes of problems. Therefore, additional rules that can be easily integrated in the implementation of every structure definition method in order to enhance its performance are important, especially if they are easy to implement and safeguard against overfitting. The availability of information regarding Y-uncertainty provides the ground for one such rule. Let us consider the following general model as the one representing the true underlying relationship connecting the predictors, x, and the measured responses, y. This model contemplates a functional relationship, possibly nonlinear, parametrized by the vector θ. The additive term, η, represents the random contribution of nonstructured sources of process variation; this term may or may not be present, depending on the type of system. The other additive term, ε, regards the dispersion in y strictly caused by measurement uncertainty.

$$y = f(\mathbf{x}, \boldsymbol{\theta}) + \eta + \varepsilon \qquad (13)$$
When developing an empirical model for the relationship between x and y, one tries to estimate the deterministic structure f(x,θ) using input/output raw data. With measurement uncertainty information additionally available, it is possible not only to improve the estimation accuracy of the deterministic backbone of the model, f(x,θ), using a suitable optimization framework, but also to use such knowledge to determine effective bounds that guide the process of model building. In particular, it is clear from eq 13 that the expected residual variance, assuming independence of the two sources of variation (process and measurement), is given by

$$\sigma_T^2 = \sigma_\varepsilon^2 + \sigma_\eta^2 \qquad (14)$$
Therefore, no estimated model can present an estimate of the residual variance falling below σT². In particular, this implies, as a corollary, that the mean square error (MSE) estimates should be higher than σε², a quantity that is known if Y-uncertainty information is available. Under this condition, the proper level of model complexity to be adopted, say CompLev (e.g., the number of latent variables, or of selected variables), should be the minimum between the level suggested by the application of a methodology based on a predefined optimization criterion (oc), CompLev(oc), and the level of complexity for which the lower bound is attained:

$$\text{CompLev} = \min\{\text{CompLev(oc)},\ \text{CompLev}(\text{MSE} = \sigma_\varepsilon^2)\} \qquad (15)$$
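Equation 15 amounts to capping the complexity at the measurement noise floor. A minimal sketch, where the mapping from "first complexity level with MSE ≤ σε²" to a component count is an assumption of this illustration:

```python
import numpy as np

def complexity_with_uncertainty_floor(mse_by_complexity, comp_oc,
                                      sigma_eps2):
    """Eq 15 sketch: cap the complexity suggested by any optimization
    criterion (comp_oc) at the first level where the MSE reaches the
    measurement-noise floor sigma_eps^2."""
    mse = np.asarray(mse_by_complexity, dtype=float)
    below = np.nonzero(mse <= sigma_eps2)[0]
    # Complexity levels are 1-based: entry k of mse corresponds to a
    # model with k+1 components.
    comp_floor = int(below[0]) + 1 if below.size else len(mse)
    return min(comp_oc, comp_floor)
```

For instance, with per-complexity MSEs [5.0, 3.0, 1.2, 0.8, 0.7] and σε² = 1.0, the floor is hit at four components, so a cross-validation suggestion of five components would be capped at four.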
This is a simple safety test that could be easily implemented in any model building framework based on the knowledge of Y-uncertainties, i.e., {xi, yi, u(yi)}i=1:N. However, this simple and effective test can only be straightforwardly integrated in model development in homoscedastic situations, when u(yi) is constant over the domain of interest. The case of heteroscedastic measurement noise requires a more complex treatment, because there is no longer a constant measurement uncertainty. One easy fix consists of using the mean value of the Y-uncertainty as an estimate of the cutoff. In most cases, where the variation in the response is not large or the dependency of the uncertainty on the level is relatively mild, this approach can provide a good working solution. Alternatively, the following procedure can be adopted. Let us consider the case where the assumed model structure is consistent with a linear regression model. If the model is
4. RESULTS In this section we present the main results obtained in the analysis of the specification and integration challenges addressed in this article, regarding the use of measurement uncertainty information in industrial scenarios. The two
specification challenges will be addressed in sections 4.1 and 4.2, while the results for the integration challenge are presented and discussed in section 4.3. 4.1. Specification Challenge 1: Handling Measurement Variation near Physical Limits. In this section, we compare and analyze the methods available for specifying measurement uncertainty intervals near physical limits: the unchanged GUM method; data censoring; data shifting; truncation of the interval; the Bayesian approach. The comparison is made using a Monte Carlo procedure, in terms of the methods’ bias and probability coverage. Bias regards the difference between the true value of the parameter (in this case the noiseless value of the measurand, represented by the population mean, μ) and its estimate, the sample mean, μ̂; coverage concerns the fraction of times the computed expanded intervals actually contain the true value they aim to confine (which should desirably be consistent with the coverage set by the confidence level selected for constructing the intervals). Contrary to previous studies, a truncated t-distribution will now be considered for the Bayesian approach, in order to take into account the most recent guidelines on uncertainty modeling.21,25 The main steps taken to generate data and to compare the methods are described in Table 2. Two concentration levels and their respective measurement uncertainties will be considered: 1 and 1000 ppm, with standard uncertainties of 0.65 and 18 ppm, respectively. The former represents a situation near a physical limit, while the latter regards a conventional case where unfeasible estimates are not foreseen to happen. The number of computer generated samples was nsamples = 50 000, with each sample consisting of 10 random observations (replicates). The results consider expanded intervals designed for 95% coverage and are summarized in Table 3.

Figure 5. Evolution of the confidence interval for the mean of the WSSR estimated through bootstrap, with the number of components (variables in this case) added to the model.

Table 3. Results of the Comparison between Different Methods for Handling Measurement Uncertainty near Physical Limits

concn (ppm) | method | coverage (%) | mean bias (ppm)
1 | GUM approach | 94.7 | −0.00066
1 | data censoring | 93.56 | −0.0834
1 | data shifting | 94.53 | −0.0166
1 | truncation of the interval | 94.7 | −0.00066
1 | Bayesian approach | 94.7 | −0.00066
1000 | GUM approach | 94.82 | 0.0072
1000 | data censoring | 94.82 | 0.0072
1000 | data shifting | 94.82 | 0.0072
1000 | truncation of the interval | 94.82 | 0.0072
1000 | Bayesian approach | 95.08 | 0.0072
As can be observed from Table 3, at high concentrations all methods perform equally well, as expected, since they are far from any physical boundary. Without the effect of a physical boundary, the GUM approach, data censoring, data shifting, and the truncation approach must indeed provide exactly the same output, as all the calculations turn out to be the same when no unfeasible concentrations are obtained. On the other hand, at low concentrations one is close to a physical limit, and only the expanded uncertainty intervals obtained with the Bayesian approach using a truncated t-distribution or with the truncation of the interval method present the same bias and coverage as those derived from the GUM approach. Therefore, even though they avoid the overlap with the unfeasible region present in the intervals derived by the GUM approach, they still retain most of the desirable properties of accurate coverage and unbiasedness of the GUM methodology. Data censoring should be avoided, given the high bias and poor coverage performance obtained. By retaining only positive concentrations, this method underestimates the measurement uncertainty and overestimates the true mean (negative bias). Shifting data to the natural limit presents better probability coverage, but still leads to a significant bias. The method of truncating the expanded intervals presents, in this case, the same performance as the Bayesian approach, with good coverage and low bias. As it is very easy to implement, it is a good practical solution to adopt in industrial settings.
Table 2. Data Generation and Computation of Performance Metrics for the Comparison Study Regarding the Computation of Expanded Uncertainty Intervals near Physical Limits
1. Consider measurements following a Gaussian distribution with some “true” mean, μ, and standard deviation, u(x) (x = μ + ε, where ε ∼ iid N(0, u(x)²)). These parameters are unknown in practice, and will also be considered unknown when implementing the methods under comparison.
2. Define the number of samples (nsamples) and the number of observations per sample (nobs). Set the sample counter to zero: n = 0.
3. While n < nsamples, perform the following steps:
   a. Draw a random sample from the distribution specified in step 1 and calculate the sampling estimates of the population parameters of interest to derive the confidence intervals: μ̂ and σ̂.
   b. Calculate the expanded measurement uncertainty interval. Check whether it contains μ.
   c. Increment the sample counter: n → n + 1.
4. Otherwise, stop the generation of samples and compute the fraction of times the expanded interval contained the “true” value (coverage) and the mean difference between the “true” mean and its estimates over all samples considered (bias). (Note that the “true” value of the mean is known, since this is a Monte Carlo simulated study.)
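The procedure of Table 2 can be sketched in code for two of the methods compared (the GUM approach and truncation of the interval). The hard-coded t quantile for 10 replicates and the restriction to these two methods are simplifications of this sketch:

```python
import numpy as np

def coverage_and_bias(mu, u, n_obs=10, n_samples=20000, seed=0):
    """Monte Carlo estimate of expanded-interval coverage and bias
    (Table 2), for the plain GUM approach and for truncation of the
    interval at the physical limit x >= 0."""
    rng = np.random.default_rng(seed)
    t975 = 2.262  # t quantile, 97.5%, 9 degrees of freedom (n_obs = 10)
    hits_gum = hits_trunc = 0
    bias = 0.0
    for _ in range(n_samples):
        x = rng.normal(mu, u, n_obs)
        m, s = x.mean(), x.std(ddof=1)
        half = t975 * s / np.sqrt(n_obs)
        lo, hi = m - half, m + half
        hits_gum += lo <= mu <= hi
        # Truncation: clip the interval (not the data) at zero.
        hits_trunc += max(lo, 0.0) <= mu <= hi
        bias += mu - m
    return hits_gum / n_samples, hits_trunc / n_samples, bias / n_samples
```

Because truncation only removes the unfeasible (negative) part of the interval, it can never exclude a nonnegative true mean that the GUM interval contains, which is why the two methods show identical coverage and bias in Table 3.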
To sum up, both the Bayesian and the truncation of intervals approaches are recommended for specifying measurement uncertainty intervals near physical limits with physical meaning, when such is required (as happens when results are to be reported to a general audience, possibly containing members less informed about the specificities of measurement variation). 4.2. Specification Challenge 2: Modeling Heteroscedasticity. In this section the modeling strategies for describing the heteroscedastic behavior of measurement uncertainty presented in section 2.2 are compared: UM-1 (OLS modeling of uncertainty vs concentration level), UM-2 (WLS modeling of uncertainty vs concentration level), and UM-3 (the log−log domain modeling approach; for more details please refer to section 2.2). An industrial data set is used to test and compare these approaches. The available data set consists of measurements taken at 11 different concentration levels of an analyte (palladium), ranging from low (1 ppm) to high (1000 ppm) concentrations, more specifically at 1, 2, 5, 10, 15, 30, 50, 100, 250, 500, and 1000 ppm. Replicates are available at each concentration level, from which the associated type A uncertainty can be computed. This provides the essential information for deriving measurement uncertainty models according to the UM-1, UM-2, and UM-3 approaches. Of the three modeling approaches employed, UM-2 and UM-3 are able to account for the heteroscedastic behavior of the relationship one aims to describe: explicitly in the case of UM-2, and tacitly in the case of UM-3. Method UM-1, on the other hand, is based on OLS and therefore assumes homoscedastic behavior throughout the entire domain. This is a limitation that might hinder its ability to provide estimates of balanced quality throughout the domain of concentrations under analysis. Therefore, for this particular modeling approach, we have also considered the estimation of local models at low (1, 2, 5, and 10 ppm) and high (100, 250, 500, and 1000 ppm) concentrations, besides the overall model (LOW, HIGH, and GLOBAL, respectively), which are compared against the remaining approaches. Following the recommendations in the Eurachem/CITAC guide (Appendix E.5),19 the uncertainty models developed with more than 10 concentration levels (UM-1 GLOBAL, UM-2) make use of model structure (3); those developed with fewer than 10 concentration levels (UM-1 LOW and UM-1 HIGH) are based on model structure (4); UM-3 consists of a linear model in the log−log domain. We begin the analysis of the different methods by studying their interpolation ability, i.e., the ability to correctly estimate uncertainties at different levels of the concentration domain covered during the development of the measurement uncertainty models. The results are presented in Table 4, where it is possible to observe the uncertainty computed from measurement replicates and the estimates obtained by the different modeling frameworks. As can be observed in Table 4, UM-2 and UM-3 are adequate to cover the entire range of concentrations. Model UM-3, based on the log−log transformation, reduces the discrepancy between the magnitudes of the residuals at high and low concentrations. Model UM-2 achieves this goal by weighting each observation according to the variance of the uncertainty at each concentration. Model UM-1, based on OLS, is rather inaccurate at low concentrations, a behavior that is expected since this approach minimizes the sum of squares of
Table 4. Results of Estimating Measurement Uncertainty at Three Different Concentration Levels Using All Uncertainty Modelsa measurement uncertainty (ppm) concn (ppm)
calcd from measd data
UM-1 (GLOBAL)
UM-1 (LOW)
UM-1 (HIGH)
UM-2
UM-3
5 500 1000
0.10 8.97 18.39
0.60 9.20 18.36
0.14 13.66 27.32
0.46 9.29 18.21
0.11 9.72 19.44
0.15 9.86 18.47
a
Highlighted with italics are the cases showing higher disagreement with the value of uncertainty obtained from measured data.
residuals, giving exactly the same importance to a given deviation in the low concentration domain, where it can be important, and in the high concentration region, where it is negligible. In the end, the least-squares compromise therefore affects the quality of the low concentration estimates more than that of the high concentration estimates. To mitigate this effect, the local models UM-1 LOW and UM-1 HIGH were developed and, indeed, the predictions improved in the regions where these models were developed, at the expense of larger errors in the domains not covered, where the models are not supposed to be used. We have also tested the extrapolation ability of the different models at extreme concentrations. Table 5 presents the results

Table 5. Estimation of Measurement Uncertainty for the 1 and 1000 ppm Levels through Extrapolation^a
                                            measurement uncertainty (ppm)
concn (ppm)   calcd from measd data   UM-1 (GLOBAL)   UM-1 (LOW)   UM-1 (HIGH)   UM-2    UM-3
1             0.05                    0.63            0.02         0.39          0.07    0.14
1000          18.39                   17.88           27.32        16.67         19.63   18.52

^a Points were not considered during model development. Highlighted with italics are the cases showing higher disagreement with the value of uncertainty obtained from measured data.
for estimating the uncertainties at the lowest (1 ppm) and highest (1000 ppm) concentration levels by the different models, after each point is excluded from the construction of the models used to predict it. Model UM-2 is the best solution for extrapolation, closely followed by UM-3. One disadvantage of model UM-3 concerns the large size of the prediction intervals obtained in the original domain, after transforming back from the log–log domain (Table 6). In this context, UM-2 presents smaller prediction bands. In summary, this study clearly shows that the modeling approaches proposed in this article for developing the uncertainty model, UM-2 and UM-3, clearly outperform the current methodology, UM-1. The recommended methodology is UM-2, given the higher quality of the prediction results obtained with this technique. Figure 6 illustrates the fit obtained with the UM-2 approach. 4.3. Integrating Y-Uncertainty in Model Building: An Extensive Monte Carlo Simulation Study. In this section we present a summary of the results of the extensive comparison study conducted to (i) identify recommended methodologies for integrating the measurement uncertainty information on the responses in the development of data-driven models focused on prediction ability and (ii) assess the added value of obtaining uncertainty information
DOI: 10.1021/ie504577d Ind. Eng. Chem. Res. 2015, 54, 9159−9177
Table 6. Prediction Intervals for the Measurement Uncertainty at 1 and 1000 ppm Concentration Levels^a

                                            measurement uncertainty (ppm)
concn (ppm)   calcd from measd data   UM-1 (GLOBAL)   UM-1 (LOW)        UM-1 (HIGH)     UM-2             UM-3
1             0.05                    [0; 15.9]       [−14.01; 14.05]   [−13.6; 14.4]   [0; 15.9]        [7 × 10−4; 26.6]
1000          18.39                   [8.91; 23.66]   [13.46; 41.19]    [8.03; 30.54]   [12.03; 25.01]   [0.11; 3054]

^a Confidence level of 95%.
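The very wide UM-3 interval at 1000 ppm in Table 6 is a direct consequence of back-transforming a symmetric interval from the log–log domain. A small numeric illustration follows; the half-width used below is an assumed value, chosen only to make the effect visible.

```python
import math

log_u_hat = math.log(18.4)   # point estimate of the uncertainty at 1000 ppm, logged
half_width = 2.0             # assumed half-width of the interval in log units

# A symmetric interval in log units becomes multiplicative after exponentiation:
lo = math.exp(log_u_hat - half_width)
hi = math.exp(log_u_hat + half_width)
ratio = hi / lo              # equals exp(2 * half_width), regardless of the estimate
```

The back-transformed bounds differ by a constant multiplicative factor (here exp(4), roughly 55) independent of the point estimate, which is why UM-2, fitted in the original domain, yields much tighter bands.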
with different levels of quality, encompassing distinct levels of workload. As mentioned before, the methods considered include natural pairs whose elements differ only in the ability to incorporate uncertainty information, such as OLS/WLS and PCR1/PCR2. The only exception is PLS, which was also considered given its pervasive use in practice, but for which no uncertainty-based alternatives have been found to be consistently superior.11,12 These uncertainty-based PLS methods were therefore not incorporated in the study, to avoid an unnecessarily high number of alternatives with no expected added value, given previously published results. Furthermore, the consideration of other data-driven modeling approaches is not critical, or even expected to be necessary, in this work, as the nature of the simulation scenarios employed can be handled quite well by the selected methods, if not better than with any available alternative. The simulation scenarios are described in section 4.3.1, after which the comparison strategy followed in this work is presented.

4.3.1. Simulation Models and Testing Conditions. Two base models were used in this study, both reflecting classes of structures recurrently found in industry: the latent variable model and the correlation model. The latent variable model used in the simulation study has the following structure:

X = TP^T + E
y = Tc + f    (16)

where X is the (n × m) matrix of measured inputs, y is the (n × 1) column vector of outputs, T (n × p) is the matrix of latent variables, P is the (m × p) matrix of coefficients (loadings), c is the (p × 1) column vector of regression coefficients, and finally E and f are noise arrays; m is the number of input variables, n is the number of observations, and p is the number of latent variables. The model was simulated assuming that T(i,:) ∼ N(0, Λ), with Λ = diag(σ), where σ is a column vector containing the standard deviations of the latent variables and diag(·) is the operator that converts a vector into a diagonal matrix with the vector elements on the main diagonal. The loadings are randomly generated in each new trial, subject to P^T P = I_p. The response noise is such that f(i) ∼ N(0, σ_i), with σ_i = func(T(i,:)c), where func(·) is the true (and unknown) measurement uncertainty function (a linear model was used). This model was run with the following parametrizations:
• number of inputs (m): 6 (small size), 20 (medium size), and 100 (large size)
• number of latent variables (p): 2 (for m = 6) and 4 (for m = 20, 100)
• noise levels in the response: low, medium, and high (see Figure 7 for typical noise patterns affecting the response at each noise level)

Figure 6. Fitting obtained with the uncertainty model UM-2.

Figure 7. Typical noise patterns in the response for the three levels of noise considered in the simulations, for both the latent variable and correlation models. Noise level: (a) low; (b) medium; (c) high. Ỹ stands for the noiseless measurement of the response, and Y is the corresponding noisy measurement, with uncertainty provided by the uncertainty model. For the latent variable models, the proportionality constants used to establish the level-dependent uncertainty in the low:medium:high noise regimes followed the proportion 1:2:4. For the correlation model, the respective ratios were 1:3:6.

Figure 8. (a) Boxplots for the differences between the RMSEP of the method in line i and that of the method in column j (the horizontal green line indicates the zero level). Also indicated in each cell is the p-value for the pairwise t-test involving the pair of methods in line i and column j; if the test is significant (p-value < 0.05), the cell is highlighted in yellow. (b) Example of a particular trial (one out of the 200 performed for each condition) showing the predictions for the test set obtained with the methods PLS and PCR2. In this case, the RMSEP values were 0.1503 and 0.0743 for PLS and PCR2, respectively.

The latent variable model reflects in a natural way the data generating mechanisms found in industry and was therefore used more extensively in this work. However, it is not the only mechanism that can be postulated for this purpose, and we have also considered data generated according to the following correlation model:
y = Xb + f    (17)
where X, y, and f retain the same meanings as in the latent variable model. b is the (m × 1) column vector of regression coefficients, which consists of two blocks: one block containing half of the variables, contributing to the response with randomly generated weights B1(i) ∼ U(0.2, 4), and another block with the same number of variables that do not contribute to the response, even though they are measured, B2(i) = 0:
B = [B1; B2],   B1 = [B1(1)  B1(2)  ···  B1(m/2)]^T,   B2 = [0  0  ···  0]^T    (18)

The input variables can present several levels of mutual correlation, but only the case of low correlation is reported here (target Pearson correlation ρ = 0.1). For the sake of space, only the results for m = 20 (medium size system) are presented for this class of models, as this is an approximate upper limit for the size of systems of this type that can still be adequately handled by the respective modeling frameworks.

4.3.2. Assessment Strategy and Performance Criteria. As the number of methods and possible simulation scenarios is quite high, an assessment strategy was carefully designed to allow all methods to be compared over a large number of scenarios in a simple and systematic way. The fundamental performance criterion is the prediction ability of each method under each testing condition, evaluated on a new, independently generated test data set with the same number of observations as the training data set. Prediction ability is quantified through the root-mean-square error of prediction (RMSEP). This metric can be computed using either the observed values of the response (y, affected by noise, as would be the case in practical applications) or the noiseless values of the outputs (ỹ, only possible in simulated scenarios, which is one of the main advantages of using simulations in method comparison studies). The results do not differ significantly between the two alternatives but, since the aim of the methods is to provide estimates close to the true underlying values of the output measurand, we adopted the noiseless values of the outputs in the computation of the RMSEP for each method:

RMSEP = [ (1/n_test) Σ_{i=1}^{n_test} (ỹ_i − ŷ_i)² ]^{1/2}    (19)

In each testing condition, a training data set with n_train observations was generated to identify the model structure and estimate its parameters, and a test data set with n_test observations (we considered n_train = n_test) to compute the performance metric, the RMSEP. This process was repeated 200 times (trials) in order to incorporate the estimation variability into the analysis of the results. This number is high enough to obtain stable comparison results, but not so high as to flag differences of no practical interest through excessive power of the statistical procedure described next. In each trial, all the methods share the same training and test data sets. Therefore, for each test condition, a two-way table of trials × methods is obtained, where each line contains the RMSEP values of all the methods, estimated and tested with the same data sets. One can then implement a pairwise statistical hypothesis test to evaluate whether the performance of two methods is equal (the null hypothesis) or different (the alternative hypothesis) and, in the latter case, which method is superior. For instance, Figure 8 presents the results for the test condition [model, latent variable (p = 2); number of variables, 6; noise level, medium]. Each cell in Figure 8 contains the boxplot for the pairwise differences between the RMSEP of the method in line i and the RMSEP of the method in column j. Also indicated in each cell is the p-value for the pairwise t-test involving the pair of methods under analysis; if the test is significant (p-value < 0.05), the cell is highlighted in yellow. If not, the difference is not statistically significant, meaning that the performance of the two methods is essentially the same under this testing scenario. We call this result a "tie". When the result is statistically significant (p-value < 0.05), one of the methods is declared to have a "win" (the one with the lower RMSEP), while the other has a "loss". Figure 9 summarizes the number of wins/ties/losses for the analysis presented in Figure 8.

Figure 9. Number of wins/ties/losses obtained by each method in the same testing scenario as in Figure 8.

The main interest for practitioners is to know which methods tend to perform better in each scenario. The most compact way to address this concern is to count the number of wins/ties/losses each method obtained in each testing scenario in the pairwise comparisons against all the other methods. These results already incorporate the variability of the RMSEP in each testing condition, through the statistical tests, and convey the relative ordering of performance of the various methods. As it would be infeasible to present all the figures for each testing condition, for the sake of space and without significant loss of information we present instead, in sections 4.3.4 and 4.3.5, summary tables with the number of wins/ties/losses obtained by all methods in each test condition. This is the fundamental information practitioners are mostly interested in, even though our analysis contemplated other aspects as well.

4.3.3. Realistic (RC) and Optimistic (OC) Comparison Scenarios. Finally, one more aspect was considered in the comparison strategy, aimed at conducting valid and fair comparisons among all the methodologies involved. Methods can only be consistently compared if they are developed under comparable levels of knowledge (or ignorance) regarding the simulated scenarios. In order to control the different levels of knowledge that can be assumed for the methods under comparison, these were compared in three distinct scenarios (for each testing condition). Two scenarios involve the development and estimation of methods under realistic conditions (RC), i.e., conditions likely to be found in practice. The third scenario assumes perfect knowledge of some features of the simulated scenarios (such as the number of latent dimensions and the measurement uncertainties) and is used to assess the added value of perfect information and how much
the methods are expected to improve if certain identification and estimation tasks can be enhanced. This is the optimistic scenario (OC). More precisely, the three comparison scenarios considered are the following:
1. RC1. In this realistic scenario, the number of latent variables for PLS is estimated using the adjusted Wold's R criterion (AWRC); the number of latent variables in PCR1 is obtained by forward stepwise regression; PCR2 uses the same set of PCs selected in PCR1. Furthermore, the uncertainties in the responses needed to estimate PCR2 are obtained from a simulated parallel study in which 10 replicates of the response are collected at four different levels of the response. This new data set emulates the implementation of a single preliminary repeatability study for obtaining Y-uncertainty information (this is the RR option defined in section 2.3; the response estimated with a simpler model was also used as input to the uncertainty model for estimating the associated uncertainties, as also described in that section). The structure of OLS is identified through forward stepwise regression. WLS adopts the same structure as OLS, and its uncertainty model is the same as for PCR2.
2. RC2. This is another realistic scenario, i.e., fully compatible with industrial practice, but it differs from RC1 in that no additional preliminary study is considered for obtaining replicated data to estimate the measurement uncertainty model needed to implement PCR2 and WLS. This model is obtained using the same training data also employed to identify the regression model structure and to estimate its parameters, as explained in section 2.3 (the option designated TR). All the other methods and procedures remain as described in RC1.
3. OC. The optimistic scenario assumes access to information usually not available in practice.
In this scenario, PLS uses the exact number of latent variables when data is generated using a latent variable model. The PCR methods, however, never make use of such information, and their number of latent variables is always obtained by forward stepwise regression. The same happens for OLS and WLS regarding the selection of regressors to incorporate in the model. In this scenario, one also assumes knowledge of the exact uncertainties affecting each measurement of the response in the training data set. This is another optimistic assumption, which can be approximated in practice through the implementation of a complete R&R study. These three comparison scenarios are considered for each test condition in section 4.3.4, where the results obtained in the Monte Carlo comparison study are presented. 4.3.4. Results: Methods' Comparison Study under Similar Realistic and Optimistic Scenarios. We begin by presenting the results obtained for multivariate data generated with the latent variable model structure presented in section 4.3.1, eq 16. Table 7 presents the results obtained for the low dimensional system (m = 6, p = 2), where it can be observed that, under realistic and comparable conditions of use of the several methods under study, PCR2 leads in general to the best performance, never performing worse than the others and most often presenting a statistically significantly better performance in the pairwise comparison tests (p-value < 0.05). This is the case for condition RC1, where uncertainty models are derived with recourse to an additional data set with replicate measurements, and for condition RC2 with low and medium levels of noise, in which case the uncertainty models are estimated using the same training data set used to estimate the
Table 7. Summary of Results Obtained for Data Simulated for a Latent Variable Model with m = 6, p = 2, and n = 100^a

comparison scenario   noise level   PLS     PCR1    PCR2    OLS     WLS
RC1                   low           0/0/4   2/1/1   4/0/0   1/0/3   2/1/1
                      medium        0/0/4   1/1/2   3/1/0   1/1/2   3/1/0
                      high          0/0/4   1/2/1   3/1/0   1/1/2   2/2/0
RC2                   low           0/0/4   1/2/1   4/0/0   1/1/2   2/1/1
                      medium        0/0/4   1/2/1   4/0/0   1/1/2   2/1/1
                      high          0/0/4   1/3/0   1/3/0   1/3/0   1/3/0
OC                    low           0/1/3   1/2/1   4/0/0   0/2/2   2/1/1
                      medium        2/2/0   0/1/3   2/2/0   0/1/3   2/2/0
                      high          3/1/0   0/1/3   2/2/0   0/1/3   2/1/1

^a m, number of variables; p, number of latent variables; n, number of observations. Indicated in each cell is the number of wins/ties/losses obtained by each method in the multiple pairwise comparisons made against all the others in each testing condition. The uncertainty models use the estimated response as input.
model structure and to fit its parameters. The only exception occurs for RC2 with high levels of heteroscedastic noise, in which case PCR1, PCR2, OLS, and WLS presented similar performances. This result is quite relevant and can be justified by the fact that, at such a noise level, uncertainty models cannot be adequately estimated using the simplest approach followed in RC2, based on data collected in an observational way. Such poorly estimated measurement uncertainty models then degrade the performance of uncertainty-based methods such as PCR2 and WLS to the same level as, or even below, that of the classical methods, as unreliable information is fed to their estimation algorithms. The consequence can thus be models that are even worse than those obtained by simpler methodologies making fewer assumptions about the noise structure of the data. Another interesting finding is the surprisingly low performance of PLS in this study. We recall, however, that it is not PLS that is being individually assessed, but this method coupled with a given methodology for defining its structure, in this case the adjusted Wold's R criterion (AWRC). The same applies to all the other methods and the way their structures were estimated (latent variables used or observed variables selected). Therefore, what can be inferred is that PLS implemented with AWRC tends to perform worse than PCR methods with stepwise regression. When the optimal number of dimensions is assumed to be known for PLS (under optimistic conditions, OC), its performance improves for medium and high noise magnitudes to the same level as PCR2 (where such information is, however, not exploited). The performance of PLS thus seems to be critically dependent on the number of latent variables selected when implementing it, degrading considerably if that number is not the most appropriate.
Apparently, we may conclude that, even though AWRC is considered one of the best methods for this task, it often fails to provide the correct latent structure underlying the observed data. Figure 10 shows the histogram of the number of latent variables estimated by this method for one such case, even when the noise magnitude is low (the correct number is p = 2). On the other hand, even though PCR2 does not assume perfect knowledge of the latent variable structure (contrary to the measurement uncertainties, which are assumed to be known), its performance is never worse than that of PLS in this optimistic comparison scenario.
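The component-selection step on which PLS performance hinges can be illustrated with a simplified sketch of the PRESS-ratio logic behind Wold's R criterion; the adjusted variant (AWRC) refines this rule, and the 0.9 threshold and stopping rule below are assumptions made only for illustration.

```python
def select_components(press, threshold=0.9):
    """Pick the number of latent variables from cross-validated PRESS values.

    press[a] is the PRESS obtained with a+1 components. Components are added
    while the ratio PRESS(a+1)/PRESS(a) stays below the threshold, i.e., while
    each extra component still buys a clear reduction in prediction error.
    Simplified sketch of Wold's R criterion; AWRC adjusts the ratio/threshold.
    """
    n_lv = 1
    for a in range(len(press) - 1):
        if press[a + 1] / press[a] < threshold:
            n_lv = a + 2          # accept component a+2 and keep going
        else:
            break                 # plateau reached: stop adding components
    return n_lv
```

With a PRESS curve that flattens after the true dimensionality, this rule recovers it; with noisy PRESS estimates, it can stop too early or too late, which is consistent with the dispersion seen in Figure 10.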
Table 9. Summary of Results Obtained for Data Simulated for a Latent Variable Model with m = 100, p = 4, and n = 1000^a

comparison scenario   noise level   PLS     PCR1    PCR2    OLS     WLS
RC1                   low           0/0/4   3/0/1   4/0/0   1/0/3   2/0/2
                      medium        0/3/1   0/3/1   4/0/0   0/2/2   1/2/1
                      high          4/0/0   0/0/4   1/1/2   3/0/1   1/1/2
RC2                   low           0/0/4   3/0/1   4/0/0   1/0/3   2/0/2
                      medium        0/3/1   1/2/1   4/0/0   0/1/3   1/2/1
                      high          4/0/0   1/0/3   0/0/4   3/0/1   2/0/2
OC                    low           0/3/1   0/3/1   4/0/0   0/2/2   1/2/1
                      medium        0/2/2   0/2/2   4/0/0   0/2/2   3/0/1
                      high          0/0/4   1/0/3   2/1/1   2/1/1   4/0/0

^a m, number of variables; p, number of latent variables; n, number of observations. Indicated in each cell is the number of wins/ties/losses obtained by each method in the multiple pairwise comparisons made against all the others in each testing condition. The uncertainty models use the estimated response as input.
Figure 10. Number of latent variables estimated for PLS using the adjusted Wold's R criterion (AWRC), for the case of data generated with a latent variable model (m = 6, p = 2, and n = 100) with low magnitude noise.
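The data generating mechanism of eq 16 and the performance metric of eq 19 used throughout these comparisons can be sketched as follows; the specific constants (input noise level, coefficient range, and the linear uncertainty function) are illustrative assumptions, not the exact parametrization of the study.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_lv_data(n=100, m=6, p=2, noise=0.1):
    """One data set from X = T P^T + E, y = T c + f (eq 16), with response
    noise whose standard deviation depends linearly on the noiseless response
    (heteroscedastic). Constants here are illustrative assumptions."""
    T = rng.normal(size=(n, p))                    # latent variables
    P, _ = np.linalg.qr(rng.normal(size=(m, p)))   # loadings with P^T P = I
    c = rng.uniform(1.0, 2.0, size=p)              # regression coefficients
    y_clean = T @ c                                # noiseless response
    X = T @ P.T + 0.05 * rng.normal(size=(n, m))   # inputs with noise E
    sigma = noise * (0.5 + np.abs(y_clean))        # level-dependent uncertainty
    y = y_clean + sigma * rng.normal(size=n)       # noisy response
    return X, y, y_clean

def rmsep(y_ref, y_hat):
    """Root-mean-square error of prediction (eq 19), here computed against
    the noiseless response, as done in the study."""
    return float(np.sqrt(np.mean((np.asarray(y_ref) - np.asarray(y_hat)) ** 2)))
```

In the study, 200 such train/test pairs per condition feed the pairwise t-tests behind the wins/ties/losses counts reported in the tables.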
The results for the medium size latent variable system (m = 20, p = 4) are presented in Table 8. The main trends in the
In summary, the results obtained for all the conditions tested with the latent variable model show that PCR2 tends to be the best solution for predictive purposes under realistic and comparable testing conditions (RC1 and RC2) when the measurements are corrupted with noise levels likely to be found in practice (low and moderate). For the extreme cases of high heteroscedastic noise level, the measurement uncertainty models necessary to implement uncertainty-based methodologies, such as PCR2, become harder to develop (or require more experimental effort) and can degrade the performance of this type of methods. We have also tested all the methodologies with data generated with the correlation model, eq 17. Even though a variety of combinations of dimensionality, noise levels, and correlation structures were considered for this case, for the sake of space we only present here the results for a typical scenario likely to be found in practice when this kind of model structure is a feasible representation of reality. We consider the case of 20 regressors (m = 20), some of them being significant while the others are not (as described in section 4.3.1), all showing low correlation levels between themselves. The results are presented in Table 10. It is now possible to see that the
Table 8. Summary of Results Obtained for Data Simulated for a Latent Variable Model with m = 20, p = 4, and n = 500^a

comparison scenario   noise level   PLS     PCR1    PCR2    OLS     WLS
RC1                   low           0/0/4   3/0/1   4/0/0   1/1/2   1/1/2
                      medium        0/0/4   3/0/1   4/0/0   1/0/3   2/0/2
                      high          0/0/4   1/0/3   2/2/0   2/2/0   2/2/0
RC2                   low           0/0/4   3/0/1   4/0/0   1/0/3   2/0/2
                      medium        0/0/4   3/0/1   4/0/0   1/0/3   2/0/2
                      high          0/0/4   2/0/2   1/0/3   4/0/0   3/0/1
OC                    low           3/1/0   2/0/2   3/1/0   0/0/4   1/0/3
                      medium        3/1/0   2/0/2   3/1/0   0/0/4   1/0/3
                      high          2/0/2   0/1/3   3/1/0   0/1/3   2/1/1

^a m, number of variables; p, number of latent variables; n, number of observations. Indicated in each cell is the number of wins/ties/losses obtained by each method in the multiple pairwise comparisons made against all the others in each testing condition. The uncertainty models use the estimated response as input.
results are essentially those already highlighted for the small size system. Apparently, there is no significant interaction effect associated with increasing the system size from small to medium in latent variable systems. PCR2 continues to be the method with the best overall performance, except for high magnitude heteroscedastic noise, where the uncertainty models cannot be well estimated from a single data set collected in a passive way (condition RC2). The performance of PLS also improves significantly under optimistic conditions (OC), where the correct number of latent variables is assumed to be known, reaching the same level of accuracy as PCR2. As to the results for the large size latent variable system (m = 100, p = 4), the effect of high magnitude noise is now evident in all realistic testing conditions (RC1, RC2), where PLS achieves the best performance (Table 9). For low and moderately noisy scenarios, PCR2 leads the prediction performance. The optimistic comparison scenario is also dominated by PCR2 for low and moderate noise magnitudes, whereas for high noise magnitude WLS with stepwise variable selection was found to perform better.
Table 10. Summary of the Results Obtained for Data Simulated for a Linear Regression Model with m = 20, Low Correlation among the Regressors, and n = 500^a

comparison scenario   noise level   PLS     PCR1    PCR2    OLS     WLS
RC1                   low           1/0/3   0/0/4   2/0/2   3/0/1   4/0/0
                      medium        1/1/2   0/0/4   1/1/2   3/0/1   4/0/0
                      high          2/0/2   0/0/4   1/0/3   3/0/1   4/0/0
RC2                   low           1/0/3   0/0/4   2/0/2   3/0/1   4/0/0
                      medium        2/0/2   0/0/4   1/0/3   3/0/1   4/0/0
                      high          2/0/2   1/0/3   0/0/4   4/0/0   3/0/1
OC                    low           1/0/3   0/0/4   2/0/2   3/0/1   4/0/0
                      medium        1/1/2   0/0/4   1/1/2   3/0/1   4/0/0
                      high          2/0/2   0/0/4   1/0/3   3/0/1   4/0/0

^a m, number of variables; n, number of observations. Indicated in each cell is the number of wins/ties/losses obtained by each method in the multiple pairwise comparisons made against all the others in each testing condition. The uncertainty models use the estimated response as input.
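For the correlation model of eq 17, a WLS estimator combined with forward stepwise variable selection can be sketched as follows. This is an illustrative sketch only: the greedy SSE-based entry rule, the 1% stopping tolerance, and the inverse-variance weights are assumptions, not the exact procedure used in the study.

```python
import numpy as np

def wls_fit(X, y, w):
    """Weighted least squares: minimize sum(w * (y - X b)^2)."""
    sw = np.sqrt(w)
    b, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
    return b

def forward_stepwise_wls(X, y, w, rel_tol=0.01):
    """Greedy forward selection: at each step, add the regressor that most
    reduces the weighted SSE; stop when the relative improvement falls below
    rel_tol. Returns (selected column indices, coefficients for them)."""
    n, m = X.shape
    selected = []
    best_sse = float(np.sum(w * y**2))      # no-regressor baseline
    while len(selected) < m:
        trial = [(float(np.sum(w * (y - X[:, selected + [j]]
                                    @ wls_fit(X[:, selected + [j]], y, w)) ** 2)), j)
                 for j in range(m) if j not in selected]
        sse, j = min(trial)
        if sse > (1 - rel_tol) * best_sse:  # improvement too small: stop
            break
        selected.append(j)
        best_sse = sse
    return selected, wls_fit(X[:, selected], y, w)
```

With heteroscedastic response noise, the inverse-variance weights let informative low-noise observations dominate the fit, which is the advantage WLS holds over OLS in Table 10.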
methodology showing systematically the best overall performance is WLS with stepwise variable selection. The only exception occurs for the realistic condition RC2, where the uncertainty model developed from the training data set has poor quality and ultimately degrades the estimation performance of the uncertainty-based method. The better performance of the variable selection/WLS methodology can clearly be explained by a closer match between the data generating structure and the modeling formalism adopted. This is an issue that is often overlooked in practice, and to which practitioners should devote more attention, instead of blindly adopting their favorite regression approach. 4.3.5. Results: Assessing the Effect of Using Different Levels of Quality of Measurement Uncertainty Information on the Quality of the Models. The Monte Carlo comparison study conducted in section 4.3.4 enables the identification of promising methodologies for handling heteroscedastic noise structures in a variety of realistic scenarios. We now address another fundamental question of special relevance for practitioners. Being certain, from the results obtained above, that in the vast majority of realistic conditions an uncertainty-based methodology leads the performance, what is the added value of investing more effort in obtaining a better uncertainty description of the data? To answer this research question, all methods were considered, as well as their implementation under different uncertainty modeling scenarios leading to uncertainty models of differing quality, namely RC1, RC2, and OC (of course, this only has an effect for the uncertainty-based methods, as the others do not explicitly incorporate any information about the noise structure).
More specifically, we considered the family of PCR models, in which a representative with the best overall performance for latent variable systems was found, and compared the predictive performances of the different versions of PCR when they have access to measurement uncertainty information of increasing quality. This sequence ranges from plain PCR1 (which does not explicitly incorporate any information about the uncertainty structure), to PCR2 with the uncertainty model estimated using just the training data set (condition TR, used in condition RC2), to PCR2 implemented with an uncertainty model developed from a separate data set containing replicates (RR, used in condition RC1), and finally to PCR2 implemented with perfect information about the data uncertainties (as assumed in condition OC), which virtually corresponds to conducting a preliminary accurate R&R study. The results obtained for the latent variable model, with different dimensionalities and heteroscedastic noise levels, are presented in Table 11 (note that, even though only the results for the PCR family are presented, all methods were considered and tested against each other). They show that, for low and medium noise levels, the performance of the PCR method improves with the quality of the measurement uncertainty information one is able to integrate in the estimation stage. In low noise scenarios, PCR2 with TR and PCR2 with RR tend to perform similarly, but for moderate noise levels the advantage of using replicates (RR) becomes evident. In certain cases (medium noise levels), it comes close to the case where perfect information regarding the uncertainties is assumed to be known and available. Interestingly, for high noise level scenarios, the results show that it is better not to use uncertainty information at all (PCR1) than to use it through a poorly estimated uncertainty model (PCR2/TR). However, if one is able to estimate a better measurement
Table 11. Summary of Results Obtained for Data Simulated with a Latent Variable Model of Different Dimensionalities and with Different Heteroscedastic Noise Levels^a

size      noise level   PCR1    PCR2 (TR)   PCR2 (RR)   PCR2 (TRUE)
m = 6     low           5/0/3   6/1/1       6/1/1       8/0/0
          medium        1/2/5   3/3/2       5/3/0       6/2/0
          high          1/5/2   1/5/2       5/2/1       7/1/0
m = 20    low           5/0/3   6/1/1       6/1/1       8/0/0
          medium        5/0/3   6/0/2       7/0/1       8/0/0
          high          2/0/6   1/0/7       4/3/1       7/1/0
m = 100   low           5/0/3   6/0/2       7/0/1       8/0/0
          medium        1/3/4   5/1/2       7/0/1       8/0/0
          high          1/0/7   0/0/7       2/2/4       5/1/2

^a Indicated in each cell is the number of wins/ties/losses obtained by the PCR methods in the multiple pairwise comparisons made against all the others in each testing condition (all methods were included in the comparison study, even though only the results for the PCR methods are shown: OLS, PLS, WLS-TR, WLS-RR, WLS-TRUE, PCR1, PCR2-TR, PCR2-RR, and PCR2-TRUE).
uncertainty model (RR, OC), part of the performance of the uncertainty-based methods is recovered. From these results one can conclude that, for medium and high noise level scenarios, developing a better measurement uncertainty model through preliminary repeatability studies can be a good option, as significantly better models will be obtained as a result of the added experimental effort. For low noise levels, a simpler approach (TR) may be enough to achieve good prediction accuracy.
5. DISCUSSION AND CONCLUSIONS In this article, different aspects regarding the specification and integration of measurement uncertainty information were addressed, and several solutions were considered, tested, and proposed. They emanate from industrial practice and constitute pervasive concerns in the chemical process industries. We have structured them as specification challenges and integration challenges. Regarding the uncertainty specification problem in the neighborhood of physical or natural limits, we underline that the procedure to pursue is rather goal dependent. Two main classes of objectives are usually at stake: (i) reporting a final result; (ii) assigning the measurement uncertainty to an intermediate result, which will be further processed. In general, we advocate the use of the principles stipulated in the GUM and related standard documents to specify the uncertainty that affects measurements and to establish the uncertainty intervals (standard or expanded). This holds, in particular, for measurements made when the measurand assumes values near natural or physical limits, in which case part of the uncertainty intervals, or even the measurements themselves, may fall in the unfeasible region. These intervals may be considered legitimate expressions of the measurement device outcome, which need not reflect the theoretical constraints that the measurand must comply with. As shown in section 4.1, this is also the best procedure from the standpoint of bias and coverage accuracy. Therefore, for goals belonging to class (ii), we strongly recommend its use. We are aware, however, that the uncertainty intervals thus computed contradict their own interpretation made by the GUM ("...values that could reasonably be attributed to the measurand"). In other words,
DOI: 10.1021/ie504577d Ind. Eng. Chem. Res. 2015, 54, 9159−9177
the procedures established by the GUM to estimate uncertainty intervals may not be consistent with its own definition when measurements are made close to physical limits. Even though some authors argue that this does not necessarily represent a contradiction, since the GUM principles should not be applied in situations where measurements and their uncertainties are of the same order of magnitude, the inconsistency can be mitigated by changing the way uncertainty intervals are computed, so as to make them consistent both with the physical constraints and with the GUM interpretation of uncertainty. Several possible approaches were covered in this regard in section 4.1, and from the results obtained one could appreciate that it is possible to retain most of the desirable coverage and unbiasedness properties of the GUM approach while enforcing feasibility of the expanded uncertainty intervals, by adopting a Bayesian approach or a truncation interval approach. These are therefore recommended when the reported result (goal of class (i)) must fully comply with the feasible region for the measurand, or when the measurement outcomes are to be used in model-based platforms that require physically meaningful inputs (e.g., process simulators and optimization frameworks).
Regarding the second specification problem, involving the development of measurement uncertainty models able to properly describe the behavior of measurement uncertainties spanning several orders of magnitude of the measurand, our study clearly shows that the proposed methodology UM-2 outperforms the current methodology, UM-1, and presents better properties than UM-3, namely narrower prediction intervals. Concerning the second challenge (the integration challenge), the results obtained are quite interesting from different perspectives.
First, they show that the methodology PCR2 tends to present, in a rather consistent way, the best performance across the different scenarios covered (different dimensionalities and noise magnitudes) when data are generated with a latent variable model structure. Second, under comparable circumstances, PLS, a chemometric methodology known to cope well with noisy features, presented a performance that is critically dependent on the ability to correctly estimate its pseudorank. Our study also shows that, for medium and high noise level scenarios, it is quite justifiable to conduct preliminary repeatability studies in order to derive significantly better models. In fact, another important observation is that using poorly estimated uncertainty models during model development can lead to predictions that are even worse than those obtained without using measurement uncertainty information at all. Thus, in order to take full advantage of uncertainty information, measurement uncertainty models of sufficiently good quality should be derived first. As for data generated with the correlation model structure, the best approach is WLS coupled with stepwise variable selection. This observation underlines the importance of selecting the modeling framework that best matches the inner structure of the system under analysis. The discovery and exploitation of such structures should therefore be a theme for future research.
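The weighted least squares approach referred to above can be sketched as follows: each output is weighted by the inverse of its known measurement variance, so that samples with smaller uncertainty count more in the fit. This is a minimal illustration with synthetic heteroscedastic data; the per-sample uncertainties are assumed known (e.g., from an uncertainty model), the stepwise variable selection step used in the article is omitted, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic calibration data with heteroscedastic output noise: the
# noise sd of each y_i is assumed known from an uncertainty model
n = 200
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])  # intercept + one input
beta_true = np.array([1.0, 2.0])
sigma = 0.1 + 0.5 * rng.uniform(0, 1, n)   # known per-sample uncertainties
y = X @ beta_true + rng.normal(0, sigma)

# OLS baseline: tacitly assumes homoscedastic output noise
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# WLS: solve (X' W X) beta = X' W y with W = diag(1 / sigma^2)
w = 1.0 / sigma**2
Xw = X * w[:, None]                        # rows of X scaled by their weights
beta_wls = np.linalg.solve(X.T @ Xw, Xw.T @ y)
```

Weighting by inverse variance is what distinguishes this estimator from the tacitly homoscedastic OLS/PLS/PCR baselines discussed in the article; both estimators are unbiased here, but WLS has lower variance when the uncertainties differ across samples.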
■ AUTHOR INFORMATION
Corresponding Author
*E-mail: marco@eq.uc.pt. Tel.: +351 239 798 700. Fax: +351 239 798 703.
Notes
The authors declare no competing financial interest.
■ REFERENCES
(1) Barrentine, L. B. Concepts for R&R Studies; ASQC Quality Press: Milwaukee, WI, 1991.
(2) ASQ/AIAG. Measurement Systems Analysis, 3rd ed.; DaimlerChrysler Corporation, Ford Motor Company, General Motors Corporation; ASQ: Milwaukee, WI, 2002.
(3) Costa, R.; Angélico, D.; Reis, M. S.; Ataíde, J.; Saraiva, P. M. Paper Superficial Waviness: Conception and Implementation of an Industrial Statistical Measurement System. Anal. Chim. Acta 2005, 544, 135−142.
(4) JCGM. Evaluation of Measurement Data - Guide to the Expression of Uncertainty in Measurement; JCGM 100:2008 (GUM 1995 with minor corrections); JCGM: Paris, 2008; p 134.
(5) Kacker, R.; Jones, A. On use of Bayesian statistics to make the Guide to the Expression of Uncertainty in Measurement consistent. Metrologia 2003, 40, 235−248.
(6) Kessel, W. Measurement Uncertainty According to ISO/BIPM-GUM. Thermochim. Acta 2002, 382, 1−16.
(7) Kimothi, S. K. The Uncertainty of Measurements; ASQ: Milwaukee, WI, 2002.
(8) Lira, I. Evaluating the Measurement Uncertainty; Institute of Physics Publishing: Bristol, U.K., 2002.
(9) Wentzell, P. D.; Lohnes, M. T. Maximum Likelihood Principal Component Analysis with Correlated Measurement Errors: Theoretical and Practical Considerations. Chemom. Intell. Lab. Syst. 1999, 45, 65−85.
(10) Leger, M. N.; Vega-Montoto, L.; Wentzell, P. D. Methods for systematic investigation of measurement error covariance matrices. Chemom. Intell. Lab. Syst. 2005, 77, 181−205.
(11) Reis, M. S.; Saraiva, P. M. A Comparative Study of Linear Regression Methods in Noisy Environments. J. Chemom. 2004, 18 (12), 526−536.
(12) Reis, M. S.; Saraiva, P. M. Integration of Data Uncertainty in Linear Regression and Process Optimization. AIChE J. 2005, 51 (11), 3007−3019.
(13) Wentzell, P. D.; Andrews, D. T.; Hamilton, D. C.; Faber, K.; Kowalski, B. R. Maximum Likelihood Principal Component Analysis. J. Chemom. 1997, 11, 339−366.
(14) Wentzell, P. D.; Andrews, D. T.; Kowalski, B. R. Maximum Likelihood Multivariate Calibration. Anal. Chem. 1997, 69, 2299−2311.
(15) Reis, M. S.; Saraiva, P. M. Heteroscedastic Latent Variable Modelling with Applications to Multivariate Statistical Process Control. Chemom. Intell. Lab. Syst. 2006, 80, 57−66.
(16) Rooney, W. C.; Biegler, L. T. Design for Model Parameter Uncertainty Using Nonlinear Confidence Regions. AIChE J. 2001, 47 (8), 1794−1804.
(17) Martínez, À.; Riu, J.; Rius, F. X. Application of the Multivariate Least Squares Regression Method to PCR and Maximum Likelihood PCR Techniques. J. Chemom. 2002, 16, 189−197.
(18) Bro, R.; Sidiropoulos, N. D.; Smilde, A. K. Maximum Likelihood Fitting Using Ordinary Least Squares Algorithms. J. Chemom. 2002, 16, 387−400.
(19) Eurachem/CITAC Guide: Quantifying Uncertainty in Analytical Measurement, 3rd ed.; Ellison, S. L. R., Williams, A., Eds.; 2012.
(20) Cowen, S.; Ellison, S. L. R. Reporting measurement uncertainty and coverage intervals near natural limits. Analyst 2006, 131, 710−717.
(21) Analytical Methods Committee, The Royal Society of Chemistry. Measurement Uncertainty and Confidence Intervals near Natural Limits; Royal Society of Chemistry: London, 2008; p 2.
(22) Analytical Methods Committee, The Royal Society of Chemistry. Measurement uncertainty evaluation for a non-negative measurand: an alternative to limit of detection. Accredit. Qual. Assur. 2008, 13, 29−32.
(23) Rocke, D. M.; Lorenzato, S. A two-component model for measurement error in analytical chemistry. Technometrics 1995, 37 (2), 176−184.
(24) Rocke, D. M.; Durbin, B.; Wilson, M.; Kahn, H. D. Modeling uncertainty in the measurement of low-level analytes in environmental analysis. Ecotoxicol. Environ. Saf. 2003, 56, 78−92.
(25) Eurachem/CITAC Guide CG4 - Quantifying Uncertainty in Analytical Measurement, 3rd ed.; Ellison, S. L. R., Williams, A., Eds.; 2012.
(26) Chatterjee, S.; Price, B. Regression Analysis by Example, 2nd ed.; Wiley: New York, 1998.
(27) Wentzell, P. D.; Tarasuk, A. C. Characterization of heteroscedastic measurement noise in the absence of replicates. Anal. Chim. Acta 2014, 847, 16−28.
(28) de Brauwere, A.; Pintelon, R.; De Ridder, F.; Schoukens, J.; Baeyens, W. Estimation of heteroscedastic measurement noise variances. Chemom. Intell. Lab. Syst. 2007, 86 (1), 130−138.
(29) Wang, J.-H.; Hopke, P. K. Estimation of the heteroscedastic noise in large data arrays. Anal. Chim. Acta 2000, 412 (1−2), 177−184.
(30) Reis, M. S.; Saraiva, P. M. Generalized Multiresolution Decomposition Frameworks for the Analysis of Industrial Data with Uncertainty and Missing Values. Ind. Eng. Chem. Res. 2006, 45, 6330−6338.
(31) Andrews, D. T.; Wentzell, P. D. Applications of Maximum Likelihood Principal Component Analysis: Incomplete Data Sets and Calibration Transfer. Anal. Chim. Acta 1997, 350, 341−352.
(32) Kiers, H. A. L. Weighted least squares fitting using ordinary least squares algorithms. Psychometrika 1997, 62 (2), 251−266.
(33) Gabriel, K. R.; Zamir, S. Lower rank approximation of matrices by least squares with any choice of weights. Technometrics 1979, 21 (4), 489−498.
(34) Martínez, À.; Riu, J.; Rius, F. X. Lack of Fit in Linear Regression Considering Errors in both Axes. Chemom. Intell. Lab. Syst. 2000, 54, 61−73.
(35) Riu, J.; Rius, F. X. Assessing the Accuracy of Analytical Methods Using Linear Regression with Errors in both Axes. Anal. Chem. 1996, 68, 1851−1857.
(36) Tellinghuisen, J. Least squares in calibration: dealing with uncertainty in x. Analyst 2010, 135, 1961−1969.
(37) Cheng, C.-L.; Riu, J. On estimating linear relationships when both variables are subject to heteroscedastic measurement errors. Technometrics 2006, 48 (4), 511−519.
(38) Draper, N. R.; Smith, H. Applied Regression Analysis, 3rd ed.; Wiley: New York, 1998.
(39) Montgomery, D. C.; Peck, E. A.; Vining, G. G. Introduction to Linear Regression Analysis, 4th ed.; Wiley: Hoboken, NJ, 2006.
(40) Jackson, J. E. Quality Control Methods for Several Related Variables. Technometrics 1959, 1 (4), 359−377.
(41) Kourti, T.; MacGregor, J. F. Process Analysis, Monitoring and Diagnosis, Using Multivariate Projection Methods. Chemom. Intell. Lab. Syst. 1995, 28, 3−21.
(42) Kresta, J. V.; MacGregor, J. F.; Marlin, T. E. Multivariate Statistical Monitoring of Process Operating Performance. Can. J. Chem. Eng. 1991, 69, 35−47.
(43) Martens, H.; Naes, T. Multivariate Calibration; Wiley: Chichester, U.K., 1989.
(44) Jackson, J. E. A User's Guide to Principal Components; Wiley: New York, 1991.
(45) Geladi, P.; Kowalski, B. R. Partial Least-Squares Regression: a Tutorial. Anal. Chim. Acta 1986, 185, 1−17.
(46) Haaland, D. M.; Thomas, E. V. Partial Least-Squares Methods for Spectral Analysis. 1. Relation to Other Quantitative Calibration Methods and the Extraction of Qualitative Information. Anal. Chem. 1988, 60, 1193−1202.
(47) Helland, I. S. Some Theoretical Aspects of Partial Least Squares Regression. Chemom. Intell. Lab. Syst. 2001, 58, 97−107.
(48) Helland, I. S. On the Structure of Partial Least Squares Regression. Commun. Stat.-Simul. 1988, 17 (2), 581.
(49) Höskuldsson, A. Prediction Methods in Science and Technology; Thor Publishing: Copenhagen, 1996.
(50) Wold, S.; Sjöström, M.; Eriksson, L. PLS-Regression: A Basic Tool of Chemometrics. Chemom. Intell. Lab. Syst. 2001, 58, 109−130.
(51) Kresta, J. V.; Marlin, T. E.; MacGregor, J. F. Development of Inferential Process Models Using PLS. Comput. Chem. Eng. 1994, 18, 597−611.
(52) Nelson, P. R. C.; Taylor, P. A.; MacGregor, J. F. Missing Data Methods in PCA and PLS: Score Calculations with Incomplete Observations. Chemom. Intell. Lab. Syst. 1996, 35, 45−65.
(53) Walczak, B.; Massart, D. L. Dealing with Missing Data: Part I. Chemom. Intell. Lab. Syst. 2001, 58, 15−27. Walczak, B.; Massart, D. L. Dealing with Missing Data: Part II. Chemom. Intell. Lab. Syst. 2001, 58, 29−42.
(54) Höskuldsson, A. PLS Regression Methods. J. Chemom. 1988, 2, 211−228.
(55) Li, B.; Morris, J.; Martin, E. B. Model selection for partial least squares regression. Chemom. Intell. Lab. Syst. 2002, 64, 79−89.
(56) Wold, S. Cross-Validatory Estimation of the Number of Components in Factor and Principal Components Models. Technometrics 1978, 20 (4), 397−405.