Energy & Fuels 2008, 22, 523–534


Statistical Significance Testing as a Guide to Partial Least-Squares (PLS) Modeling of Nonideal Data Sets for Fuel Property Predictions

Kirsten E. Kramer, Robert E. Morris,* Susan L. Rose-Pehrsson, Jeffrey Cramer, and Kevin J. Johnson

Naval Research Laboratory, Code 6181, Washington, D.C. 20375-5320

Received July 13, 2007. Revised Manuscript Received September 7, 2007

Partial least-squares (PLS) regression was used to formulate property models for a set of 43 jet fuel samples using near-infrared (NIR), Raman, and gas chromatography (GC) data. A total of 28 different properties were evaluated for each technique. Because the data set was small and several of the property distributions were nonideal, significance testing was used for model formulation and evaluation. A statistical F test was applied for selecting latent variables, for determining the significance level of each model compared to random chance, and for evaluating the increase in prediction error when the expected sources of error were fully incorporated. A chart was used to categorize the confidence in the modeling ability for each property as high, low, or indeterminate with the given data set.

Introduction

The protocol for safety inspection of aircraft fuels used by the Navy is very cumbersome and labor-intensive. Aviation fuels are tested for properties such as density, flash point, particulate matter, fuel system icing inhibitor (FSII), and free water. Visual examination is repeated frequently throughout the day, and testing for particulates, FSII, and free water is repeated during refueling operations. Each property is tested individually using test methods defined by the American Society for Testing and Materials (ASTM). Aboard naval vessels, a substantial amount of time, resources, and laboratory space is devoted to carrying out these tests. Therefore, there is considerable interest in developing a sensor-based method that could characterize multiple properties with one measurement. Ideally, the instrumentation could be used for individual samples (a benchtop analyzer) as well as for continuous real-time monitoring (mounted on the storage unit). Strong candidates for this approach are near-infrared (NIR) spectroscopy1,2 and Raman spectroscopy.3,4 At the Naval Research Laboratory (NRL), property modeling of fuels has been evaluated using NIR, Raman, and gas chromatography (GC).5 For all three techniques, a correlation of the raw data with the property values of interest was performed using partial least-squares6–8 (PLS). This algorithm is very powerful for correlating property information with subtle or weak analytical signals.

However, PLS models are easily overmodeled9–12 if an appropriate experimental design13–15 has not been implemented. To formulate a robust calibration, several conditions should be met: (1) the property values being modeled should span a relevant range; (2) all expected sources of variance in the instrumental data (i.e., short-term noise, day-to-day drift, and long-term drift) should be expressed in the training data; and (3) a randomized sampling protocol should be followed. Ideally, the variances would be uniformly distributed, and the range in property values would be large compared to the errors in the reference measurements used to obtain those values. Furthermore, the number of samples should be high enough to be statistically meaningful when a multivariate model is used.16,17 When the calibration data do not meet these requirements, overmodeling becomes more likely and more difficult to assess. Overmodeling, also referred to as overfitting, occurs when the spectral information is correlated to property information even though no inherent relationship exists. Overmodeling may be due to chance correlations, selection of a model size that is too high, or a combination of the two. The result is a calibration model that is not robust for future data or data collected external to the model.

Fuel property data, by nature, tend to be limited in scope because the majority of specification fuels are produced with a fairly narrow range of properties. While the overall goal of a property assessment methodology is to detect outliers, i.e., off-specification fuels, this process still depends upon a quantitative assessment of the critical fuel properties.

* To whom correspondence should be addressed. E-mail: robert.morris@nrl.navy.mil.
(1) Macho, S.; Larrechi, M. S. Trends Anal. Chem. 2002, 21 (12), 799–806.
(2) Blanco, M.; Villarroya, I. Trends Anal. Chem. 2002, 21 (4), 240–250.
(3) Williams, K. P. J.; Aries, R. E.; Cutler, D. J.; Lidiard, D. P. Anal. Chem. 1990, 62, 2553–2556.
(4) Cooper, J. B.; Wise, K. L.; Groves, J.; Welch, W. T. Anal. Chem. 1995, 67, 4096–4100.
(5) Johnson, K. J.; Morris, R. E.; Rose-Pehrsson, S. L. Energy Fuels 2006, 20, 727–733.
(6) Geladi, P.; Kowalski, B. R. Anal. Chim. Acta 1986, 185, 1–17.
(7) Brereton, R. G. Analyst 2000, 125, 2125–2154.
(8) Martens, H. Chemom. Intell. Lab. Syst. 2001, 58, 85–95.

(9) Small, G. W. Trends Anal. Chem. 2006, 25 (11), 1057–1066.
(10) Liu, R.; Chen, W.; Gu, X.; Wang, R. K.; Xu, K. J. Phys. D: Appl. Phys. 2005, 38, 2675–2681.
(11) Arnold, M. A.; Burmeister, J. J.; Small, G. W. Anal. Chem. 1998, 70, 1773–1781.
(12) Faber, N. M.; Rajkó, R. Spectroscopy (Amsterdam) 2006, 18 (5), 24–28.
(13) Wold, S.; Sjöström, M.; Carlson, R.; Lundstedt, T.; Hellberg, S.; Skagerberg, B.; Wilkström, C.; Öhman, J. Anal. Chim. Acta 1986, 191, 17–32.
(14) Wold, S.; Josefson, M.; Gottries, J.; Linusson, A. J. Chemom. 2004, 18, 156–165.
(15) Liang, Y.; Fang, K.; Xu, Q. Chemom. Intell. Lab. Syst. 2001, 58, 43–57.
(16) Jackson, D. L. Struct. Equation Model. 2007, 14 (1), 48–76.
(17) Faber, N. M. Chemom. Intell. Lab. Syst. 1999, 49, 79–89.


Thus, as a minimum, the PLS models must be capable of quantitatively predicting the property values over the respective specification ranges with an acceptable level of uncertainty. It is common practice, when designing a chemometric experiment, to formulate the training set such that the values to be modeled are uniformly distributed over the range of prediction. However, for the reasons described above, this is neither possible nor practical for fuel property data. Thus, we are left with the task of developing calibration models from nonideal training set data, and this is a major reason why efforts to employ chemometric modeling of fuel properties from compositional measurements have often not been entirely successful, although there are exceptions. For example, the correlation of the octane number18 to NIR spectra is the most widespread, with numerous analyzers based on this method on the market today.

For a training set of data that is limited in scope, particular care must be taken when interpreting the results of PLS calibration. Preliminary studies may indicate that spectral data can be successfully correlated to a property when the result is really due to overfitting. This paper describes methods of guarding against overly optimistic interpretations of PLS calibrations for small or nonideal data sets. The significance tests described in this study are particularly useful for small data sets or property distributions that do not follow an ideal experimental design. The procedures can also be used to determine whether more samples are needed to prevent systematic overmodeling and whether the models may be producing an overly optimistic prediction error. Another advantage of expressing results in terms of statistical probabilities is that side-by-side comparisons can be made across different properties and across different analytical methodologies.

(18) Kelly, J. J.; Barlow, C. H.; Jinguji, T. M.; Callis, J. B. Anal. Chem. 1989, 61, 313–320.

Experimental Section

Hydrocarbon Fuels. The data set consisted of 43 jet fuel samples: 12 Jet A, 20 Jet A-1, 9 JP-8, and 2 JP-5. The different types of fuel vary little in terms of composition and did not show clustering in principal component analysis score plots of the spectral or chromatographic data. The physical and chemical properties measured for each fuel did not appear to be influenced by the jet fuel type (for example, Jet A did not have systematically lower densities than the others, and Jet A-1 did not have systematically higher viscosities); therefore, combining the various types into one data set was justified. Furthermore, if there were subtle differences in the fuel types that influenced the spectral and chromatographic data, then providing the models with the various types would encourage the algorithms to extract features based on property information rather than on fuel-type information.

For each sample, 28 property values were measured using ASTM test methods. Table 1 lists each property, the ASTM reference test number, the minimum (min), maximum (max), and range of the measured values, and the reproducibility (reprod) reported in the ASTM method, together with the resulting range error ratio (RER; see eq 1). For viscosity, thermal stability, and final boiling point distillation, no reproducibility values were found; therefore, 10% of the mean property value was used as the reproducibility value. For thermal stability models, one sample was removed because it was a clear outlier, containing a "0" value for the temperature.

Table 1. Summary of Jet Fuel Property Data

property (measurement units) | ASTM | min | max | range | reprod | RER
density at 60 °F (g/mL) | D4052 | 0.7902 | 0.8235 | 0.0333 | 0.0005 | 66.6
flash point (PM) (°F) | D93 | 105.0 | 154.0 | 49.0 | 8.5 | 5.8
flash point (mini) (°F) | D3828 | 103.0 | 144.0 | 41.0 | 6.2 | 6.6
freeze point (°C) | D5972 | -67.0 | -43.6 | 23.4 | 1.30 | 18.0
pour point (°C) | D5949 | -80.0 | -60.0 | 20.0 | 6.8 | 2.9
total sulfur (ppm) | D2622 | 13 | 2453 | 2440 | 30.0 | 81.3
naphthalenes (vol %) | D1840 | 0.1 | 3.8 | 3.7 | 0.069 | 53.6
aromatics (vol %) | D1319 | 11.8 | 22.0 | 10.2 | 2.7 | 3.8
aromatics (vol %) | D6379 | 13.0 | 24.4 | 11.4 | 1.897 | 6.0
saturates (vol %) | D1319 | 75.8 | 87.0 | 11.2 | 4.4 | 2.5
olefins (vol %) | D1159 | 0.06 | 1.53 | 1.47 | 0.4 | 3.7
olefins (vol %) | D1319 | 0.7 | 2.3 | 1.6 | 2.1 | 0.8
hydrogen (wt %) | D3701 | 13.71 | 14.47 | 0.76 | 0.11 | 6.9
net heat content (Btu/lb) | D4809 | 18 331 | 18 569 | 238 | 77.4 | 3.1
viscosity at 20 °C (mm²/s) | D445 | 1.4 | 3.0 | 1.6 | 0.176 | 9.1
viscosity at -20 °C (mm²/s) | D445 | 3.0 | 6.2 | 3.2 | 0.41 | 7.7
viscosity at -40 °C (mm²/s) | D445 | 6.1 | 14.6 | 8.5 | 0.85 | 10.0
refractive index | D1218 | 1.44 | 1.456 | 0.016 | 0.0005 | 32.0
conductivity (pS/m) | D2624 | 0 | 329 | 329 | 17.0 | 19.4
acid number [(mg of KOH)/g] | D3242 | 0 | 0.019 | 0.019 | 0.003 | 6.3
thermal stability breakpoint (°F) | D3241 | 265 | 370 | 105 | 28.2 | 3.7
lubricity WSD (mm) (a) | D5001 | 0.54 | 0.71 | 0.17 | 0.07 | 2.4
initial boiling point (IBP) (°F) | D86 | 299.8 | 362.8 | 63.0 | 15.3 | 4.1
10% distillation (°F) | D86 | 329.1 | 388.6 | 59.5 | 10.8 | 5.5
20% distillation (°F) | D86 | 340.2 | 400.0 | 59.8 | 13.9 | 4.3
50% distillation (°F) | D86 | 366.5 | 440.7 | 74.2 | 24.2 | 3.1
90% distillation (°F) | D86 | 420.6 | 492.0 | 71.4 | 11.6 | 6.2
final boiling point (FBP) (°F) | D86 | 454.0 | 521.7 | 67.7 | 24.2 | 2.8

(a) BOCLE wear scar diameter (WSD).

Software. All calculations were performed in MATLAB version 7.0 (Mathworks, Inc., Natick, MA), and the PLS algorithm was implemented from the PLS_Toolbox version 3.5 (Eigenvector Research, Inc.).

NIR Spectroscopy. NIR spectra were obtained with a Cary model 5E transmission scanning spectrophotometer for neat fuel samples referenced to air backgrounds, in random order over 1 day. A Suprasil cell with a 10 mm path length was used. Spectra were obtained over a wavelength range of 1000–2200 nm at 1 nm/s with a resolution of 1 nm, producing 1201-point spectra. The NIR data from the 43 samples are shown in Figure 1A. NIR spectra were imported into MATLAB as comma-separated variable (CSV) files and mean-centered prior to analysis.

Raman Spectroscopy. Raman spectra were acquired with a Bruker Vertex 70 Fourier transform (FT) spectrometer equipped with an FT-Raman module, using a 1064 nm excitation laser and optical filters to avoid fluorescence. Spectra were collected from 505 to 3500 cm⁻¹ in 7 mm glass sample cells using a 24-position autosampler, with a resolution of 4 cm⁻¹, producing 750-point spectra. FT-Raman spectra for the 43 samples are illustrated in Figure 1B. Raman data were imported into MATLAB from the raw Bruker instrument datafiles using an in-house written MATLAB program. Raman spectra were mean-centered prior to analysis.

GC. GC data were collected using an Agilent model 6890 capillary gas chromatograph equipped with a flame ionization detector. An autosampler was used to inject 1.0 µL aliquots of the sample into the split/splitless injector held at 250 °C, with a split flow ratio of 60:1. A 50 m × 0.2 mm Ultra 2 5% phenylmethylsiloxane capillary column was used, with a helium carrier gas flow rate of 1 mL/min. The oven temperature program began at 60 °C with no hold and was ramped at 10 °C/min to 250 °C, which was held for 3 min. GC chromatograms for all 43 samples are plotted in Figure 1C. Chromatographic data were imported into MATLAB from the Agilent Chemstation datafiles using an in-house written MS Windows program. The chromatograms were aligned using a MATLAB implementation of the correlation optimized warping (COW) algorithm19,20 and mean-centered prior to analysis.

Figure 1. Analytical data for the 43 jet fuel samples by NIR (A), Raman (B), and GC (C). For GC, the total run time was 32 min. The y axis represents arbitrary units in all three plots.

(19) Vest Nielsen, N. P.; Carstunsen, J. M.; Smedsgaard, J. J. Chromatogr., A 1998, 805, 17–35.
(20) Tomasi, G.; van der Berg, F.; Andersson, C. J. Chemom. 2004, 18, 231–241.
(21) Davies, A. M. C. Spectroscopy (Amsterdam) 2006, 18 (4), 23–24.

Chemometric Modeling. In this discussion, a bold capital X will be used to symbolize the spectral data matrix, and a lowercase bold y will denote the values of a given property across all of the samples, because the properties were modeled individually and were in the form of a column vector. The letters X and y will also be used in the general sense when referring to the regressor and regressand portions of the PLS model.

Range Error Ratio (RER). The RER21 of each property vector was computed as

RER = \frac{y_{max} - y_{min}}{y_{reprod}}    (1)

where y_max and y_min denote the maximum and minimum values of the property vector y, and y_reprod denotes the error in the reference method used to obtain y. The reproducibility values shown in Table 1 are those published in the respective ASTM test methods. Although visual inspection of the calibration plots was most useful for detecting a poorly distributed y, the RER values helped to confirm these findings and provided a numerical assessment of the range relative to its inherent error. A low RER indicates that the range of measured values of a given property in a set of fuel samples is not significant compared to the inherent uncertainty of the value produced by the ASTM test method itself. Thus, for a given property, a data set with a low RER implies that the ASTM test method would be the major source of uncertainty in the PLS-predicted value of that property. A property measured by two different testing methods may have nearly the same range but very different RER values. The RER values for each property are reported in Table 1.
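As a worked illustration of eq 1, the following is a minimal MATLAB sketch; the numbers are the density and pour point rows of Table 1, and the variable names are ours, not from the original work.

```matlab
% Range error ratio (eq 1) for two Table 1 properties.
y_density   = [0.7902  0.8235  0.0005];   % [ymin  ymax  yreprod], ASTM D4052
y_pourpoint = [-80     -60     6.8];      % [ymin  ymax  yreprod], ASTM D5949

rer = @(v) (v(2) - v(1)) / v(3);          % RER = (ymax - ymin) / yreprod

fprintf('RER, density    = %.1f\n', rer(y_density));    % ~66.6: range well above the test error
fprintf('RER, pour point = %.1f\n', rer(y_pourpoint));  % ~2.9: range barely exceeds the test error
```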


Figure 3. Correlation plots of PLS models of density (LV = 4) and pour point (LV = 3) using NIR data. Model sizes are based on the F test (p = 70%) of the cross-validated PRESS values. Unity lines are plotted for clarity. Horizontal error bars in B and D represent the error in the ASTM method used to obtain the reference values.

Cross-Validation Procedure. Cross-validations were performed using a leave-one-out protocol22 (LOO-CV), and the mean squared error of prediction (MSEP) of the cross-validation was computed as

MSEP = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}    (2)

where n is the number of samples in the LOO-CV, i represents the sample left out, and y_i and ŷ_i are the measured and predicted property values, respectively. The root-mean-square error of cross-validation (RMSECV) was computed as

RMSECV = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}    (3)

which is the square root of the MSEP of eq 2.

Statistical F Test. The Haaland–Thomas23 F-test statistic may be defined as

F = \frac{A / df_A}{B / df_B}    (4)

where A and B are independent random variables with chi-squared distributions associated with df_A and df_B degrees of freedom, respectively. A high F ratio can be used to infer that the two populations, A and B, are significantly different from each other. If the probability levels are computed as (1 − α) × 100%, then the F ratio corresponding to a specific α value of 0.05, for a p = 95% probability level, may be used to test whether the variances of two populations are significantly different.

(22) Martens, H. A.; Dardenne, P. Chemom. Intell. Lab. Syst. 1998, 44, 99–121.
(23) Haaland, D. M.; Thomas, E. V. Anal. Chem. 1988, 60, 1193–1202.

Model Size Selection. For latent variable analysis, the ratio

R_{LV} = \frac{MSEP}{MSEP_{min}}    (5)

was compared to the critical value of the F statistic corresponding to n = 43 degrees of freedom for both the MSEP, which is a function of the number of latent variables, and its minimum value, MSEP_min. Because the F statistic determines whether one variance is significantly different from another, the MSEP was used for this analysis rather than the RMSECV, which is its square root. The most common way of using R_LV to determine the model size is to find the minimum number of LVs for which R_LV does not exceed the F critical value19 at α = 0.05 or 0.10. In this study, two threshold levels were tested for LV determination, α = 0.05 and 0.30. The latter threshold allowed higher model sizes to be selected, which were sometimes necessary to produce an acceptable level of error for some predicted property values.

Figure 2 plots the MSEP values as a function of the number of latent variables for a PLS density model using NIR data. The trend exhibits a steep decrease followed by a plateau, whereby a clear minimum is not apparent. The three LV selection methods are denoted on the plot. If MSEP_min is used as the criterion for model selection, five LVs would be chosen. Applying the F test with a threshold of α = 0.30 (p = 70%) indicates that four LVs may be used before the MSEP becomes significantly different from MSEP_min. For the second threshold tested, α = 0.05 (p = 95%), three LVs would be chosen as the model size. These two methods of selecting the model size will be referred to as p = 70 and p = 95.

Figure 2. Illustration of PLS density model size selection methods using MSEP values from LOO-CVs of density using NIR data. If the model size is chosen on the basis of MSEP_min, 5 latent variables would be used.
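To make the procedure concrete, the LOO-CV error curve (eqs 2 and 3) and the F-test model-size selection (eq 5) can be sketched in MATLAB as follows. This is an illustrative outline under stated assumptions, not the authors' PLS_Toolbox implementation: it assumes the Statistics Toolbox functions plsregress and finv are available and that X (43 × p instrument responses) and y (43 × 1 property values) are already loaded.

```matlab
% LOO-CV mean squared error of prediction (eq 2) as a function of model size,
% followed by the F-test selection of eq 5.
n     = size(X, 1);
maxLV = 10;
msep  = zeros(maxLV, 1);
for lv = 1:maxLV
    press = 0;
    for i = 1:n                                  % leave sample i out
        train = setdiff(1:n, i);
        [Xl, Yl, Xs, Ys, beta] = plsregress(X(train, :), y(train), lv);
        yhat  = [1, X(i, :)] * beta;             % plsregress prepends an intercept term
        press = press + (y(i) - yhat)^2;
    end
    msep(lv) = press / n;                        % eq 2
end
rmsecv = sqrt(msep);                             % eq 3

% R_LV = MSEP / MSEP_min (eq 5): choose the smallest number of LVs whose ratio
% does not exceed the F critical value at alpha = 0.30 (p = 70) or 0.05 (p = 95).
[msepMin, lvAtMin] = min(msep);
Rlv  = msep / msepMin;
lv70 = find(Rlv <= finv(1 - 0.30, n, n), 1, 'first');
lv95 = find(Rlv <= finv(1 - 0.05, n, n), 1, 'first');
```

For the density/NIR example of Figure 2, the reported selections were five LVs at the MSEP minimum, four at p = 70, and three at p = 95.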

Figure 4. Latent variables selected by three different methods, PRESS_min (striped), PRESS p = 70 (black), and PRESS p = 95 (gray), for NIR models (A) and Raman models (B) of 28 jet fuel properties.

Model Significance Computation. To establish whether or not a model was statistically meaningful (i.e., different from a random guess), the ratio

R_S = \frac{\sum_{i=1}^{n} (y_{i,r} - \hat{y}_{i,r})^2 / df_r}{\sum_{i=1}^{n} (y_{i,o} - \hat{y}_{i,o})^2 / df_o}    (6)

was computed, whereby y_{i,r} and ŷ_{i,r} correspond to the reference and predicted y values of each sample i for a PLS model in which the y vector was shuffled randomly.24–27 The y_{i,o} and ŷ_{i,o} values correspond to the reference and predicted values of each sample i in the original model. In this equation, df_r was the number of samples in the model multiplied by the number of randomizations performed (43 × 30), and df_o was the number of samples in the original model (43). Under the assumption that R_S should have the same distribution as F, significance levels of R_S were computed from the F distribution.
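The randomization test of eq 6 can be sketched in the same way. This is again an illustration rather than the published code: looPRESS below is a hypothetical helper that returns the LOO-CV sum of squared prediction errors (the inner loop of the previous listing) for a given X, y, and model size nLV, and fcdf is the Statistics Toolbox F cumulative distribution function.

```matlab
% Randomization (y-shuffling) significance test, eq 6.
n     = numel(y);
nRand = 30;                                   % 30 randomized models, as in the text
pressOrig = looPRESS(X, y, nLV);              % original, correctly matched X and y
pressRand = 0;
for r = 1:nRand
    yShuffled = y(randperm(n));               % destroy any real X-y relationship
    pressRand = pressRand + looPRESS(X, yShuffled, nLV);
end

dfR = n * nRand;                              % 43 x 30 randomized predictions
dfO = n;                                      % 43 original predictions
RS  = (pressRand / dfR) / (pressOrig / dfO);  % eq 6

% Interpret RS against an F(dfR, dfO) distribution; probability level = (1 - alpha) x 100%.
probLevelRS = 100 * fcdf(RS, dfR, dfO);
```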

Monte Carlo Error Estimation. A final significance test was used to judge whether or not the expected sources of error in both X and y were adequately incorporated into the model. For this analysis, the ratio

R_E = \frac{MSEP_E}{MSEP}    (7)
was computed, whereby MSEP_E refers to the cumulative MSEP from several iterations of LOO-CV in which normally distributed random error was added to X and y. Errors in X reflected the instrumental short-term noise and were calculated from sample replication. Errors in y reflected the reported ASTM reproducibility values. The error calculations are described in the following section.

Instrumental Noise Calculation. For the NIR data, five replicates of one sample were collected on 6 different analysis days and mean-centered according to each analysis day. The mean-subtracted spectra were compiled into one 30 × 1201 matrix representing the short-term noise of the spectrometer. For the NIR scanning instrument, the deviations appeared to be wavelength-dependent; therefore, a noise vector was computed by taking the standard deviation of the noise matrix at each wavelength across all replicates. The value of the noise vector at a given wavelength was used as the σ of the random, normally distributed noise added to the corresponding wavelength of the NIR spectral data. For the FT-Raman data, 25 samples were measured in duplicate, and the duplicates were subtracted from each other to form a 25 × 778 matrix. This noise was constant over the wavelength range; therefore, the standard deviation of the entire matrix was computed as a scalar noise value, which was used as the σ of the random noise added to each wavelength of the Raman data. For GC analysis, the expected error in X includes nonlinear peak shifting and peak-shape changes that are difficult to model. Estimating these errors was beyond the scope of this study; therefore, the Monte Carlo procedure was not investigated for the GC data.

Errors in y were assumed to be normally distributed with σ = y_reprod/1.96. Because the reproducibility values in the ASTM test methods represent 1.96 standard deviations from the mean measurement (95% confidence levels), this factor was used to standardize the error term. The significance level of R_E was interpreted as an indication that the reference errors were under-represented in the calibration data. For the F test, the degrees of freedom were 4300 (43 × 100 iterations) for the numerator and 43 for the denominator of R_E. The MSEP_E itself was assumed to be a more robust estimate of the MSEP. The Monte Carlo approach28 allowed for a more realistic cross-validation error to be determined when only a modest number of samples were available.

(24) van der Voet, H. Chemom. Intell. Lab. Syst. 1994, 25, 313–323.
(25) van der Voet, H. Chemom. Intell. Lab. Syst. 1995, 28, 315.
(26) Nestor, P. G.; O’Donnell, B. F.; McCarley, R. W.; Niznikiewicz, M.; Barnard, J.; Shen, Z. J.; Bookstein, F. L.; Shenton, M. E. Schizophr. Res. 2002, 53, 57–66.
(27) Beger, R. D.; Schnackenbery, L. K.; Holland, R. D.; Li, D.; Dragan, Y. Metabolomics 2006, 2 (3), 125–134.
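The Monte Carlo replication of eq 7 follows the same pattern. The listing below is a sketch under the assumptions already stated: noiseVec is the 1 × p wavelength-dependent noise vector described above, yReprod is the ASTM reproducibility of the property, and looPRESS is the same hypothetical helper used earlier.

```matlab
% Monte Carlo estimate of a more conservative prediction error, eq 7.
[n, p] = size(X);
nIter  = 100;                                     % 100 noise-added LOO-CV iterations
sigmaY = yReprod / 1.96;                          % reproducibility = 1.96 sigma
pressE = 0;
for k = 1:nIter
    Xe = X + randn(n, p) .* repmat(noiseVec, n, 1);   % wavelength-dependent noise in X
    ye = y + sigmaY * randn(n, 1);                    % reference-method error in y
    pressE = pressE + looPRESS(Xe, ye, nLV);
end
msepE = pressE / (n * nIter);                     % cumulative MSEP over all iterations
msep  = looPRESS(X, y, nLV) / n;                  % original LOO-CV MSEP

RE = msepE / msep;                                % eq 7
probLevelRE = 100 * fcdf(RE, n * nIter, n);       % dof of 4300 and 43, as in the text
conservativeRMSECV = sqrt(msepE);                 % square root of MSEP_E
```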


Figure 7. Correlation plots of density (A), randomized density (B), pour point (C), randomized pour point (D), distillation at 50% (E), and randomized distillation at 50% (F) for NIR data. Unity lines are plotted for clarity.

Goodness of fit was characterized by the fraction of y described by the PLS model. This value is commonly referred to as the R² value and is computed as

R^2 = 1 - \frac{SS_{MOD}}{SS_{TOT}}    (8)

where SS_MOD refers to the sum of squares of the (y_i − ŷ_i) terms and SS_TOT refers to the sum of squares of the (y_i − y_m) terms, where y_m denotes the mean of y. The cumulative percentage of y modeled is simply R² multiplied by 100. R² may also be described as the square of the correlation coefficient between y and ŷ.
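Given the reference values y and the cross-validated predictions (yhat below), eq 8 reduces to a few lines; this is a sketch with our own variable names.

```matlab
% Goodness of fit, eq 8: fraction of y described by the model.
ssMod = sum((y - yhat).^2);          % sum of squares of the residuals
ssTot = sum((y - mean(y)).^2);       % sum of squares about the mean of y
R2    = 1 - ssMod / ssTot;           % eq 8
pctY  = 100 * R2;                    % cumulative percentage of y modeled
```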

Results and Discussion

Inspection of Property Distributions. As was mentioned previously, the values for some of the fuel properties followed a narrow range and a relatively nonuniform distribution. Values of the selected fuel properties that are outside or near the extremes of the fuel specifications are rare, and most of the properties cannot be physically manipulated without introducing artificial compositional artifacts that would change other aspects of the sample matrix. The concepts of y distribution and RER are illustrated in Figure 3 using PLS correlation plots of NIR data. One property that exhibits a desirable distribution for PLS modeling is density. As shown in Figure 3A, the reference density values corresponding to the x axis follow a uniform distribution. In Figure 3B, the same plot is provided, but x-axis error bars are plotted in gray, representing the error in the reference measurement. For density, the error bars are negligible, indicating that the reference method used to obtain the density values (a digital density meter) was highly precise. One of the worst offenders in terms of a nonideal distribution is the pour point, plotted in parts C and D of Figure 3. The poor distribution and low RER indicate that more samples are needed for assessing PLS modeling ability. Low confidence was placed in PLS models for the pour point and other properties that exhibited a narrow range and/or a low RER.

Statistical Analysis for Model Size Selection. Perhaps the most important concern in using PLS with a nonideal data set is proper selection of the model size, i.e., the number of latent variables. This is critical to the proper performance of the PLS calibration models. Larger model sizes (more LVs) will increase the precision with which the model describes the training set data and will produce better correlations. However, the more information that is incorporated into the model, the more specific to that particular training set the model tends to become. Thus, as the model size increases, the model tends to become less robust and less capable of modeling new fuel samples that are only slightly different from the training set fuels. It is therefore critical to appropriately balance model size with model robustness, and the normally accepted methods to accomplish this have proven not to be suitable for treating nonideal training sets.

(28) Fernández Pierna, J. A. F.; Jin, L.; Wahl, F.; Faber, N. M.; Massart, D. L. Chemom. Intell. Lab. Syst. 2003, 65, 281–291.


Figure 8. Correlation plots of density (A), randomized density (B), pour point (C), randomized pour point (D), distillation at 50% (E), and randomized distillation at 50% (F) for Raman data.

Cross-validation is commonly used to select the model size. However, basing the number of latent variables on the minimum MSEP may result in overmodeling. In this study, the statistical F test was found to be useful for selecting a more conservative model size than that corresponding to the minimum MSEP of cross-validation. Figure 4 gives the results of the LV selection methods for NIR data (Figure 4A) and Raman data (Figure 4B). Model sizes reflecting the minimum MSEP are plotted as striped bars, and those chosen by significance tests are plotted in black (p = 70) and gray (p = 95). The p = 95 criterion was a much stricter threshold, selecting much lower model sizes (LV = 1 in some instances).

The LV selection study allowed an initial assessment of which properties might be modeled successfully and which would most likely not provide an accurate PLS model, even if a larger data set were provided. One indication of poor modeling ability was when only one LV was selected at both the p = 70 and p = 95 significance levels. Selection of only one LV implied that the analysis did not benefit significantly from using a multivariate model. Clearly, the individual sample data from all three analytical methods are highly overlapped (see Figure 1); therefore, it can be inferred that a multivariate model is necessary to extract meaningful information from the data. When a higher model size is selected but a low percentage of y is modeled, this is another indication that the modeling ability is poor. For the data used in this study, the models may be unsuccessful simply because the property information contained in y is not described by the information recorded in X. Another factor may be a poor distribution of the property values in y.

Figure 5 shows the cumulative percentage of y modeled for the three types of data, NIR (gray), Raman (black), and GC (striped), for model sizes chosen with the more relaxed p = 70 significance threshold. Using p = 70 allowed the error to be on the side of optimism in terms of modeling prospects. The figure shows that, even using the more generous model sizes, the percentage of y modeled was quite low in a number of instances. Conductivity was not expected to correlate with the spectral data; therefore, the poor models were not surprising. For conductivity, sulfur content, and thermal stability, only one LV was selected at both the p = 70 and p = 95 significance levels for all three techniques. Conductivity and sulfur content have relatively high RERs (19.4 and 81.3, respectively) and fairly uniform distributions; therefore, larger data sets would likely not improve these property models. Thermal stability does not produce a meaningful model because of its narrow range and the nonquantitative nature of that measurement.

Figure 5. Cumulative percentage of y modeled for NIR (gray), Raman (black), and GC (striped) data for model sizes selected with the p = 70 threshold.

Various other models did not appear to be successful, but this could be attributed to a poor distribution of y. When y contains only a small number of distinct values, or when most of the data are clustered at one specific value, not enough information exists for the PLS algorithm to model the data effectively. In these instances, the modeling ability may seem poor, but more data are required to make a definitive conclusion.


Figure 9. Correlation plots of density (A), randomized density (B), pour point (C), randomized pour point (D), distillation at 50% (E), and randomized distillation at 50% (F) for Raman data for model sizes at the p = 95 significance threshold. Model sizes are two, one, and one for density, pour point, and distillation, respectively.

It is expected to take a considerable amount of effort to build up a database of fuels that spans the desired calibration ranges for all properties. The objective is to span at least the acceptable range of safety levels so that outliers may be detected or flagged for further analysis.

Model Significance Estimation. To formulate a nonbiased statistical assessment of the PLS calibrations, significance testing was used to determine the improvement of the modeling errors compared to a random guess. It is important to exercise caution when interpreting the results of PLS modeling of fuel properties from nonideal calibration (training) set data; oftentimes, the PLS algorithm has the ability to find serendipitous correlations between X and y when, in reality, no relationship exists. If the experimental design and sampling are appropriately carried out, a false statistical relationship is most likely due to overfitting. For this study, a model size was selected, followed by computation of R_S as defined by eq 6 for 30 randomized models. The R_S ratios were interpreted as critical values of an F distribution, and values of α were determined accordingly. Probability levels were then computed as (1 − α) × 100%.

Figure 6 shows the probabilities for model sizes chosen using p = 70 (Figure 6A) and p = 95 (Figure 6B) for NIR (gray), Raman (black), and GC (striped) data. A modeling probability of 50% or less implies that the PLS prediction is no better than a random guess. For simplicity, only properties showing a measurable modeling ability are plotted; these were chosen on the basis of at least one technique producing above 80% cumulative y modeled for the results plotted in Figure 5. In parts A and B of Figure 6, the NIR and GC data show much greater model significance than the Raman data. The low significance may be due to either unsuccessful modeling or overmodeling of the data, examples of which are illustrated in subsequent figures.

Figure 6. Probability levels for model randomization studies for LV selection methods p = 70 (A) and p = 95 (B) for NIR (gray), Raman (black), and GC (striped) data.

Figure 7 shows NIR correlation plots for model sizes based on p = 70 for three properties: density (parts A and B of Figure 7), pour point (parts C and D of Figure 7), and temperature at 50% fraction distilled (parts E and F of Figure 7). Units for the axes of each property are given in Table 1. For each plot, the x axis represents the reference values and the y axis represents the PLS-predicted values. Model sizes were four, three, and three for density, pour point, and distillation, respectively. Plots A, C, and E of Figure 7 are for the original models (where X and y were matched appropriately), and plots B, D, and F of Figure 7 give three examples of randomized models, where the y values (properties) were shuffled. For reference, a unity line is plotted. Data in plots A, C, and E of Figure 7 should fall along the unity line if the correlation is successful; for these plots, R² is 0.96, 0.48, and 0.74, respectively. For plots B, D, and F of Figure 7, the points should be horizontally distributed about the mean of the data, the expected trend for a random-guess model.


Figure 10. Percent cumulative y modeled for NIR original models (solid gray), NIR randomized models (speckled gray), Raman original models (black), and Raman randomized models (speckled black) for model sizes chosen with the p = 70 (A) and p = 95 (B) thresholds. The speckled bars are high in many instances, particularly for the Raman data in A, where higher model sizes were used.

The significance values, plotted in Figure 6A, represent whether or not the modeling errors in the original plots are significantly different from the errors computed for the randomized plots. The significance values are high for all three of these properties for the NIR data.

Figure 8 is identical to Figure 7 but shows Raman rather than NIR data. All parameters defined for Figure 7 correspond to Figure 8, except that the model sizes were four latent variables for each property (density, pour point, and distillation) and the R² values are 0.96, 0.90, and 0.90 for plots A, C, and E of Figure 8, respectively. This figure illustrates the powerful ability of PLS to overfit data and helps to explain the low Raman significance values in Figure 6A. For plots B, D, and F of Figure 8, a strong correlation is found even when the data are randomized. For these plots, the data should not fall along the unity line but should instead resemble the randomized plots of Figure 7. The fact that four LVs could model the data so effectively (even for density, which displays a good y distribution) was surprising. The randomized density model (Figure 8B) produces more scatter about the unity line than the original model (Figure 8A). This resulted in a high level of significance for the Raman density model (Figure 6A) because the F test was based on differences in modeling errors. However, a visual inspection of Figure 8B and the R² values (0.92, 0.95, and 0.85 for the three random models) clearly indicates that overmodeling has occurred.

Figure 9 shows the results for Raman data in which the p = 95 threshold was used to select the model size. This resulted in two, one, and one latent variables for density (parts A and B of Figure 9), pour point (parts C and D of Figure 9), and distillation (parts E and F of Figure 9), respectively. For these model sizes, the random plots followed the expected trends (although density still produces a slight correlation), but model sizes of only one LV did not effectively model pour point and distillation. Therefore, the significance values of these properties (Figure 6B) were still low, but for a different reason than in Figure 6A. The low values in Figure 6A were due to overmodeling, while those in Figure 6B were due to ineffective modeling. In both cases, low significance values reflect the fact that the PLS models were not very different from random chance.

Another indicator that the Raman data were overfit was the percentage of y modeled for the random-chance models. Figure 10 shows the cumulative percentage of y modeled for NIR original models (solid gray), NIR randomized models (speckled gray), Raman original models (black), and Raman randomized models (speckled black) for model sizes chosen with the p = 70 (Figure 10A) and p = 95 (Figure 10B) thresholds. For randomized models, the values represent the mean of 30 models. Instances where the speckled bars are just as high as their preceding solid bars suggest that the data are overmodeled and/or the y vector has a poor distribution. For the Raman data, particularly when higher model sizes were used (Figure 10A), there are many occurrences of a high percentage of y modeled for random models.

For the randomized models, significance testing of R_S, visual inspection of the correlation plots, and the cumulative percentage of y modeled were useful diagnostics for detecting overmodeling. Inspection of the random-model correlation plots was necessary because the significance test for R_S was not powerful enough to detect all instances of overfitting. For NIR and GC data, most models showed high significance levels, even for the more restrictive (p = 95) model size selection method. For Raman data, however, significance values were low for both the p = 70 and p = 95 model sizes.

When higher model sizes were selected (p = 70), the low significance values were likely due to overmodeling. When the more restrictive model sizes were chosen (p = 95), oftentimes the model size was only one LV (Figure 4B), resulting in low significance because of ineffective modeling. Without the randomization studies, the visual interpretation and R² values of the original correlation plots would indicate that Raman was the more qualified technique for modeling many of the properties. However, the randomization tests indicated that the NIR models were more statistically meaningful.

Error Analysis Using Monte Carlo Replication. An issue of interest for any newly proposed technique is the extent to which the analytical error may be compromised if the new methodology were to replace the old. For this application, it was desirable that the overall analytical error not be greater than that of the original (ASTM) method, and thus a conservative estimate of the prediction error was needed. Given a high number of samples available for calibration, it may be assumed that the reference errors in y are statistically averaged out. For the small set of data used in this experiment, the errors in y were not assumed to be fully represented by the data and, for properties such as pour point, which is not a continuous measurement, the chance of overmodeling was high. Therefore, it was assumed that the RMSECV could be a low estimate of the prediction error for this data set. To test whether the expected errors in X and y were adequately incorporated into the model and to produce a more conservative estimate of the RMSECV, a Monte Carlo approach was taken. For this study, a model size was selected, followed by 100 iterations of LOO-CV in which the expected error was added to both X and y. The ratio R_E defined by eq 7 was treated as the critical value of an F distribution and converted to the corresponding α value and probability level. This study was performed for the NIR and Raman data, where the error in X (instrumental noise) could be estimated using replicates. Results are shown for the NIR data, because more confidence was placed in these models according to the randomization studies discussed previously.

Figure 11. Values of R_E (A) and the corresponding probabilities (B) for NIR data using model sizes at the p = 95 significance level. Plot A represents the factor increase in the MSEP when the expected error in X and y was added to the data before multiple cross-validations. The significance levels of A are plotted in B as probabilities that the cross-validation errors of the replicated data are different from the original MSEP values.

Figure 11 plots the results for NIR data for models chosen with the p = 95 threshold, showing the R_E ratios (Figure 11A) and the corresponding probability levels (Figure 11B). The R_E values may be interpreted as the factor increase in the MSEP values when the expected error in X and y was added. Values near or below 1.0 indicate that the error was not significantly increased, producing values near 50% in plot B. For density, naphthalenes, viscosity at -20 °C, and refractive index, the original RMSECV appears to be a good estimate of the prediction error. For all other properties, a more conservative measure of the error would be the square root of MSEP_E (the RMSECV of the Monte Carlo replicates). The errors in the predicted aromatic and saturate contents were particularly sensitive to errors in X and y. Trends were similar for the model sizes chosen at the p = 70 threshold.

It should be noted that cross-validation errors are expected to be lowest when a sufficiently large (statistically meaningful) data set is available for calibration. For such a data set, where a normally distributed error in y is manifested throughout the range of property levels, predictions should be closer to the mean (true) expected value for each sample.

Initial Assessments of Modeling Abilities for Various Properties. To provide some initial statements about the PLS modeling prospects of each property, a simple chart was generated for each technique. The charts included information reflecting the PLS modeling ability as well as information indicating a poorly distributed y, which reduced the level of confidence in the results. An example of such a chart is given in Table 2 for NIR data with model sizes selected at the p = 95 threshold. This chart is merely a qualitative tool, and the thresholds were arbitrarily set. However, because the criteria are based on statistical significance values, such a chart should be applicable across methodologies and instrumentation.

Table 2. Summary of Property Modeling Assessments for NIR Data. For each property, an X marks each criterion that is met: RER ≥ 10; cumulative y modeled > 90%; R_S significance > 99%; and R_E significance < 70%. Properties assessed include density at 60 °F, flash point (PM and mini), freezing point, total sulfur, naphthalenes, aromatics (FIA and HPLC), saturates, hydrogen, viscosity at 20, -20, and -40 °C, refractive index, conductivity, acid number, and the IBP, 10%, 20%, 50%, 90%, and FBP distillation points.

Using Table 2 as a guide, the various properties can be assessed according to the criteria that they fulfill. For density and refractive index, all four criteria were met, lending a high level of confidence in the capability of PLS to model these properties from NIR spectra. Fulfilling the R_E criterion (the final column of the chart) indicated that the expected error in X and y appeared to be successfully incorporated into the model, even with the small set of data. Therefore, expanding the data set would be useful but not critical for assessing the modeling prospects. For properties that failed both the percentage-of-y and the R_S criteria (the second and third columns of the chart), a low confidence in modeling ability was determined from this set of data. These properties included pour point, sulfur, olefins, conductivity, acid number, thermal stability, and lubricity. For sulfur and conductivity, where the RER criterion was met, a higher level of doubt was placed on the modeling ability because the distribution of y was better suited for PLS. For all other properties, it was difficult to make a conclusive statement about the modeling ability because of the limited scope of the data. For these properties, successful models could be the result of overmodeling, and more data would be necessary to prove robustness, reliability, and linearity over a wider range.
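The chart criteria can be assembled directly from the diagnostics computed in the previous listings; the sketch below assumes hypothetical per-property vectors rer, pctY, probRS, and probRE holding the RER, the cumulative percentage of y modeled, and the R_S and R_E probability levels.

```matlab
% Qualitative assessment flags mirroring the Table 2 criteria (sketch).
meetsRER = rer    >= 10;     % property range is meaningful relative to the ASTM error
meetsY   = pctY   >  90;     % most of the variance in y is modeled
meetsRS  = probRS >  99;     % model is significantly better than a random guess
meetsRE  = probRE <  70;     % expected X and y errors already represented in the model

highConfidence = meetsRER & meetsY & meetsRS & meetsRE;   % all four criteria met
lowConfidence  = ~meetsY  & ~meetsRS;                     % failed both modeling criteria
```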

Conclusions

The goal of this study was to develop statistical methodologies for assessing whether NIR, Raman, and GC compositional data can provide accurate PLS calibrations for specification jet fuel properties from a limited, nonideal set of samples. Because the available training set data were limited in scope, extra precautions were taken when formulating the PLS models and interpreting the results. Significance testing was successful in selecting model sizes and pinpointing instances of overmodeling. The first indication that PLS modeling may be unsuccessful was when model sizes of only one LV were selected at both the p = 95 and p = 70 levels of significance. The results of the randomization studies indicated that the NIR data produced much more statistically meaningful models than the Raman data. In most instances, the higher R² values of the Raman calibration plots were found to be the result of overfitting the data. The fact that the Raman models were systematically overfitting the data indicated that the PLS algorithm was not able to extract meaningful information from the data, and the best fit it could find was to correlate unrelated (random noise) aspects of the data with the property of interest. Inspection of the random models helped to reveal when this phenomenon was possible.

The Monte Carlo approach to error evaluation allowed a more realistic prediction error to be estimated. For many of the properties, incorporation of the expected error in X and y caused the MSEP values to increase. This procedure allowed a more realistic analytical error to be estimated when only a modest set of data was available.

Oftentimes, an overly optimistic assessment of PLS modeling ability is formulated when a data set is limited. The significance tests described here provided useful information about the effectiveness of PLS modeling and can be employed to indicate when more data are necessary to confirm the modeling ability. Reporting the results in terms of probabilities allowed side-by-side comparisons to be made across different properties as well as across different analytical techniques.

Acknowledgment. The authors thank the Office of Naval Research and the Naval Air Systems Command (NAVAIR 4.4.5), through the Navy Fuels and Lubes IPT, for supporting this work. They also thank the following individuals for their assistance in acquiring fuel samples and specification data: Stan Seto (GE Aircraft Engines c/o Belcan Corporation) and Jeffrey M. Johnson (Boeing Commercial Airplane Group).

EF700403S