Spectroscopic and Chemometric Analysis of Binary and Ternary

Mar 13, 2016 - ... Analysis of Binary and Ternary Edible Oil Mixtures: Qualitative and ... chemometric data analysis in mechanochemical reaction monit...
6 downloads 0 Views 4MB Size
Article pubs.acs.org/ac

Spectroscopic and Chemometric Analysis of Binary and Ternary Edible Oil Mixtures: Qualitative and Quantitative Study Ozren Jović,† Tomislav Smolić,‡ Ines Primožič,† and Tomica Hrenar*,† †

Department of Chemistry, University of Zagreb Faculty of Science, Horvatovac 102a, HR-10000 Zagreb, Croatia Agrokor d.d., Research and Development Department, Trg Dražena Petrovića 3, HR-10000 Zagreb, Croatia



S Supporting Information *

ABSTRACT: The aim of this study was to investigate the feasibility of FTIR-ATR spectroscopy coupled with the multivariate numerical methodology for qualitative and quantitative analysis of binary and ternary edible oil mixtures. Four pure oils (extra virgin olive oil, high oleic sunflower oil, rapeseed oil, and sunflower oil), as well as their 54 binary and 108 ternary mixtures, were analyzed using FTIR-ATR spectroscopy in combination with principal component and discriminant analysis, partial least-squares, and principal component regression. It was found that the composition of all 166 samples can be excellently represented using only the first three principal components describing 98.29% of total variance in the selected spectral range (3035−2989, 1170−1140, 1120−1100, 1093−1047, and 930−890 cm−1). Factor scores in 3D space spanned by these three principal components form a tetrahedral-like arrangement: pure oils being at the vertices, binary mixtures at the edges, and ternary mixtures on the faces of a tetrahedron. To confirm the validity of results, we applied several cross-validation methods. Quantitative analysis was performed by minimization of root-mean-square error of cross-validation values regarding the spectral range, derivative order, and choice of method (partial least-squares or principal component regression), which resulted in excellent predictions for test sets (R2 > 0.99 in all cases). Additionally, experimentally more demanding gas chromatography analysis of fatty acid content was carried out for all specimens, confirming the results obtained by FTIR-ATR coupled with principal component analysis. However, FTIR-ATR provided a considerably better model for prediction of mixture composition than gas chromatography, especially for high oleic sunflower oil.

F

the problem (e.g., similar oil composition, band overlapping, baseline shifts), but reducing the dimensionality by PCA simplifies the problem and produces clear information. PCs are linear combinations of the original spectral data that represent the highest possible variance under the orthogonality constraint to the preceding components. PCA applied to the spectral data can be used to classify different binary and ternary mixtures.4,10 Discriminant analysis is a multivariate method for testing differences between groups and is used for determination of the minimal number of dimensions needed to describe these differences. Particularly useful is discriminant analysis with Mahalanobis distance metrics where the correlation between different classes is taken into account. With the use of both Mahalanobis distance and only the first few components that account for the most explained variance, the discriminant analysis might yield promising results even on ternary mixtures of unknown oils.12 There are numerous studies on adulteration of oils that consider adulterated olive oil with, e.g., sunflower, rapeseed, or

ourier transform infrared spectroscopy (FTIR) is widely used in the analysis of different food products.1−3 Combining multivariate analytical techniques with the FTIR data enables qualitative and quantitative analysis in the tracing of adulteration and identifying adulterants in syrups,4 meat,5 honey,6 or oils,7 although other methods exist.8,9 However, the literature consists mostly of the data representing the adulteration based on binary mixtures, and only a few papers are devoted to the study of ternary mixtures using FTIR.10,11 In the work of A. Rohman,11 quantitative analysis of ternary mixtures of olive oils using principal component regression (PCR) and partial least-squares regression (PLS) was presented. Therefore, it is important to perform qualitative discrimination between different ternary mixtures as well as identification of the presence or absence of a particular oil in an unknown ternary mixture. Principal component analysis (PCA) classifies different samples into groups using factor scores that are projections of samples on the axis of covariance (or correlation) matrix eigenvectors. It enables extraction of chemically relevant information and determination of mixture composition using a few uncorrelated variables (principal components, PC). That extraction is usually a demanding task due to the complexity of © 2016 American Chemical Society

Received: February 4, 2016 Accepted: March 13, 2016 Published: March 13, 2016 4516

DOI: 10.1021/acs.analchem.6b00505 Anal. Chem. 2016, 88, 4516−4524

Article

Analytical Chemistry

Table 1. Fatty Acid Composition Determined by Gas Chromatography in Pure Oils: Extra Virgin Olive Oil (O), High Oleic Sunflower Oil (H), Rapeseed Oil (R), and Sunflower Oil (S) Fatty acid:a

14:0

16:0

16:1

17:0

17:1

18:0

18:1

18:2

18:3

20:0

20:1

20:2

22:0

22:1

22:2

22:6

24:0

24:1

O H R S

− 0.1 0.1 0.1

11.3 5.5 6.1 7.3

0.8 0.1 0.4 0.2

− 0.1 0.1 0.1

0.1 − 0.1 −

2.8 3.9 1.6 3.7

77.0 54.0 61.6 28.1

5.8 34.7 20.2 58.5

0.6 0.1 7.9 0.5

0.5 0.3 0.4 0.3

0.3 0.2 0.9 0.2

− − − −

0.2 0.7 0.2 0.6

− − 0.1 −

0.1 0.1 0.1 0.2

− − − 0.1

0.1 0.2 0.1 0.2

0.5 − − −

a

14:0 Myristic acid, 16:0 Palmitic acid, 16:1 Palmitoleic acid, 17:0 Margaric acid, 17:1 Heptadecenoic acid, 18:0 Stearic acid, 18:1 Oleic acid, 18:2 Linoleic acid, 18:3 Linolenic acid, 20:0 Arachidic acid, 20:1 Gadoleic acid, 20:2 Eicosadienoic acid, 22:0 Behenic acid, 22:1 Erucic acid, 22:2 Docosadienoic acid (Brassic acid), 22:6 Docosahexaenoic acid, 24:0 Lignoceric acid, 24:1 Nervonic acid.

with pure oils as the first and the last member of the set. Ternary oil mixtures containing olive oil were prepared using 4 different edible oils (O, H, R, and S), giving 3 diverse sets of mixtures containing olive oil (increments 10% by volume) labeled OHR, OHS, and ORS. The total number of samples was 4 × 1+6 × 9+3 × 36 = 166. Mixtures were prepared in small test tubes using a micropipette, with 1000 μL of oil in each sample stored in a dark and cold place before the spectral measurement. Fatty acid content (for each methylated sample) was determined by gas chromatography using a Perichrom 2100 gas chromatograph with a flame ionization detector. The temperature range was from 25 to 450 °C, polarization 250 V, dynamic linear range better than 108, and detection limit 10−12 g/s. Attenuated Total Reflectance Spectra Measurements. Attenuated total reflectance (ATR) spectra were recorded in the spectral range 4000−600 cm−1 with a Bruker Equinox 55 spectrometer applying 2 cm−1 of nominal resolution and 128 scans. The spectrometer was placed in an air-conditioned room, and all samples were allowed to equilibrate to room temperature before measurement. Measurements were conducted on a diamond ATR crystal. To obtain each absorbance spectrum, a background of the clean and dry ATR crystal was first collected. Immediately following the collection of each background scan, approximately 1 drop of a sample was placed on the ATR crystal by using a pipet, ensuring that the crystal surface was free of air bubbles. Between sampling, diamond crystal was cleaned with the cleaning agent (KemexA, Kemika d.d., Zagreb, Croatia), followed by water and ethanol, and dried. Dry crystal was also checked for possible remainings. All samples were measured several times to ensure the reproducibility of the recorded spectra, which is crucial for further chemometric analysis. Since we expected some very subtle changes in vibrational spectra (Figure 1), any experimental uncertainties that could lead to incorrect conclusions should have been excluded. Multivariate Data Analysis. Principal Component Analysis. Numerical analyses were performed using the PCA, second-order tensor decomposition tool. The data matrix with mean-centered columns X of rank r, arranged in I rows (spectra) and J variables (wavenumbers), is decomposed as a sum of r matrices tipτi of rank 1

corn oil. Additionally, adulteration of cod-liver oil,13 virgin coconut oil,14 and another oils15 is subject to intensive research. Due to similar oleic acid content, adulteration studies for olive oil with high oleic sunflower oil are lacking or rare, even for binary mixtures. Introducing sunflower oil as the third component with almost the same fatty acid profile as high oleic sunflower oil (excluding oleic and linoleic acids) also complicates this analytical problem and makes the analysis of this ternary mixture quite challenging (Table 1). With further extension of mixtures to rapeseed oil (containing considerable amount of oleic acid and similar degree of unsaturation as the high-oleic sunflower oil), one can presume that analyses, whether qualitative or quantitative, will be particularly difficult. Reduction of complete spectral data to the limited spectral range is inevitable in conducting PLS or PCR multivariate analysis with inclusion of derivative techniques. In the literature, different methodologies for appropriate selection of spectral variables can be found. For example, S. Sivakesava and J. Irudayaraj used spectral ranges with the high correlation coefficient between absorbances and concentrations.6 Some of the authors, like A. Rohman et al. in PLS analysis of the calibration set, used spectral ranges with the highest coefficient of determination R2 to assess the validation samples and concluded that high a R2 of calibration was followed by a high R2 of validation.11 A similar procedure was applied by M. M. Paradkar and J. Irudayaraj, who selected the spectral ranges 1500−1800 and 2800−3000 cm−1, for estimation of caffeine content in soft drinks.16 In other studies, conclusions were made on the root-mean-square error of calibration (RMSEC) and R2 of calibration and the corresponding root-mean-square error of prediction (RMSEP) and R2 values of the validation set for specific spectral ranges depending on their final values.4 In this work, qualitative and quantitative analysis using PLS and PCR as chemometric tools were performed. The optimal number of components using the minimal root-mean-square error of cross-validation (RMSECV) was calculated. Furthermore, PLS and PCR on different spectral ranges for each ternary mixture and spectral range that yields the lowest RMSECV were performed. The value of RMSECV was investigated in dependence on the spectral range, the derivative order, and the applied method (PLS or PCR).



EXPERIMENTAL SECTION

Sample Preparation and Gas Chromatography. Pure oil samples of extra virgin olive oil (O), high oleic sunflower oil (H), rapeseed oil (R), and sunflower oil (S) were obtained by courtesy of Zvijezda d.d.,17 the Croatian largest edible oil and margarine factory. Binary oil mixtures were prepared using 4 different edible oils (O, H, R, and S) giving 6 diverse sets of mixtures (increments 10% by volume) labeled OH, OR, OS, HR, HS, and RS. Each binary mixture was made of 11 samples

r

X=

∑ tipiτ i=1

where ti stands for the score and pτi for the loading vectors. PCA finds the best linear projections for a high-dimensional set of data in the least-squares sense. Scores represent projections of the original points in the principal component direction and 4517

DOI: 10.1021/acs.analchem.6b00505 Anal. Chem. 2016, 88, 4516−4524

Article

Analytical Chemistry

with exclusion of the i-th row was adjusted to be the same as the sign of PC loading vectors of PCA conducted on the complete X. The same was done with the signs of PC scores (U(−j)). For each cross-validation, the predicted residual error sum of squares for m principal components of the leave-one-row-out cross-validation, PRESS( f), was calculated as I

PRESS(f ) =

J

∑ ∑ (xi ,j − xî ,j(f ))2 i=1 j=1

where xî , j(f ) denotes the predicted element of the meancentered data matrix X, with f principal components. Wold’s Rindex was used for determination of the optimal f

Figure 1. ATR spectra of extra virgin olive oil (O), high oleic sunflower oil (H), rapeseed oil (R), and sunflower oil (S).

R=

can be used for classification, whereas loadings represent eigenvectors of the data covariance (or correlation) matrix and can be used for the identification of variability in the data. Ideas of PCA go back to Beltrami18 and Pearson19 while the name was introduced by Hotelling.20 More details can be found in the recent literature.21,22 Data obtained by ATR spectral measurements of different ternary oil mixtures were exported to the ASCII format and arranged in the matrix X (numbers written in a free format). Several different subsets were chosen for spectral analysis. Data in the range (3700−2600 and 1900−600 cm−1) were selected, providing 6 data matrices of binary mixtures (dimensions: 11 samples × 2492 wavenumbers), and one matrix with all binary mixtures (dimensions: 58 samples × 2492 wavenumbers). Data matrices were also built for three ternary mixtures (dimensions: 66 samples × 2492 wavenumbers), and one matrix for the set of all samples (4 × 1+6 × 9+3 × 36 = 166, dimensions: 166 samples × 2492 wavenumbers). Further analyses were repeated on reduced spectral ranges using ATR spectra, their first, and second derivatives. Data were mean-centered, and PCA on the covariance matrix was carried out using our multivariate analysis code23 based on the NIPALS algorithm.24 The Barlett χ2 on the null hypothesis H0 that the correlation matrix is an identity matrix was statistically significant, and H0 can be rejected. Therefore, sufficient correlation exists to safely proceed with PCA. The Kaiser−Meyer−Olkin measure of sampling adequacy was calculated for each individual variable and for the overall data set. Each value was greater than 0.99, thus confirming the sampling adequacy. To test the robustness and possible outliers, we additionally performed a robust PCA (ROBPCA)25 study (Figures S-1−S-3). The classification results obtained from both PCA methods were comparable. For proper selection of the principal components obtained from the mean-centered matrix X (I × J), several different criteria were considered. In addition to the Kaiser rule, four different cross-validations were used: row-wise cross-validation (also called single cross-validation), Wold’s cross-validation, cross-validation of Eastment and Krzanowski, and eigenvector cross-validation. Cross-validations were computed following the same procedures as in ref 26 using our scripts for program R (ref 27), which relies on SVD algorithms,28 excluding Wold’s procedure, which relies on NIPALS algorithm.29 Wold’s crossvalidation was performed with K = 6 and with the same pattern of missing values as in ref 26. When performing the Krzanowski procedure, the sign of PC loading vectors (V(−i)) obtained from PCA on the X matrix

PRESS(f ) RESS(f − 1)

(1)

RESS( f − 1) denotes the residual error sum of squares for f − 1 PC reproduction of the original data matrix X. Although there are no strict rules on how many principal components it is necessary to retain, if f > 1 and f − 1 < 1, then it is advised to keep f − 1 principal components and to ignore all principal components ≥ f (since they account for undesirable noise). In the cross-validation procedure of Eastment and Krzanowski, apart from the R index, the W index was calculated also: W (f ) =

PRESS(f − 1) − PRESS(f ) Dr (f ) PRESS(f ) Dfit (f )

Dr (f ) = J(I − 1) − f (I + J − f − 1) Dfit (f ) = I + J − 2f

(2)

For each binary mixture (OH, OR, OS, HR, HS, and RS), only row-wise cross-validation and eigenvector cross-validation were performed. A complete set of all binary mixtures was validated through four cross-validation procedures using the spectral range (3700−2600 and 1900−600 cm−1), as well as (3035−2989, 1170−1140, 1120−1100, 1093−1047, and 930− 890 cm−1). Ternary mixture sets (OHR, OHS, and ORS) and the set of all 166 samples was validated through four crossvalidation procedures using the spectral range (3035−2989, 1170−1140, 1120−1100, 1093−1047, and 930−890 cm−1). The eigenvectors of covariance matrices were also visually studied to prevent possible loss of information in omitted principal components. Discriminant Analysis with Mahalanobis Distance. Discriminant analysis with Mahalanobis distances on PC scores was performed for the set of all 166 samples in the spectral range (3035−2989, 1170−1140, 1120−1100, 1093−1047 and 930−890 cm−1) using two kinds of discriminant analysis. In the first kind, each particular oil (O, H, R, and S) was identified in a tested set where analyzed oil is either present or absent in the unknown mixture. Two classes are defined: one that represents the presence and the other that represents the absence of oil under the consideration. The Mahalanobis distance is defined as d(x , i) = (x − μi )τ Si−1(x − μi )τ

(3)

where x denotes unknown sample of the mixture under analysis, μi denotes class center, and Si denotes covariance matrix of class i. Both x and μi are f dimensional vectors (and Si 4518

DOI: 10.1021/acs.analchem.6b00505 Anal. Chem. 2016, 88, 4516−4524

Article

Analytical Chemistry is f × f dimensional matrix), since they consist of the first f principal components considered. In quadratic discriminant analysis (QDA), each class under consideration has its covariance matrix. In linear discriminant analysis (LDA), all classes have the same covariance matrix S S=

1 n−C

n

RMSECV =

n

(4)

RMSE (for optimal f ) n ∑i = 1 (φcal,actual(i) − φcal*,calculated(i))2 = n−f−1 n = RMSECV · n−f−1

C

∑ niSi i=1

C denotes the number of different classes and ni the number of samples in class i. The considered oil will belong to the class with the lowest Mahalanobis distance of all classes. Leave-one-out (LOO) procedure on the training set (containing 83 samples) was carried out with different numbers of first f PCs, ranging from 2 to 12. The percentage of correct classification (PCC) is defined as the number of correctly classified samples divided by the overall number of classified samples. Optimal f is determined on the highest PCC of LOO cross-validation on the training set (among all other first f PC-s considered). In the selection of the optimal number of the first f PC-s, the chance of overfitting increases with the number of PC-s, so with each new PC, PCC must advance at least 1% to be accepted. After the optimal f is determined, PCA is performed on 83 training samples and the first f factor scores are used to build class centers and covariance matrices. Then the first f factor scores are predicted for 83 test samples, and each test sample is classified according to eq 3 with respect to the minimal Mahalanobis distance of the corresponding class. To ensure that the former division of samples into the training and test sets was not biased in favor of the obtained results, LOO cross-validation was additionally performed. This crossvalidation was carried out on the whole set of 166 samples to compare the results with those already obtained for the training and test sample sets. This whole procedure was repeated for each particular oil (O, H, R, and S oil), and both linear and quadratic discriminant analysis were performed. The second kind of discriminant analysis is the classification of unknown binary and ternary mixtures, and in this analysis, there were nine classes instead of two, which were present in the identification of particular oils. QDA was performed only with LOO cross-validation on a set of all 166 samples with f at most 8. In LDA, nine classes were used instead of two. All computations were performed in program R using the “svd” function that relies on the SVD algorithm. PLS and PCR. Partial least-squares (PLS) analysis and principal component regression (PCR) analysis were applied to the FTIR spectra of all binary mixtures, each of the three investigated ternary sets (OHR, OHS, and ORS), and the set that consists of all 166 samples. For assessment of the volume fraction of individual oils, PLS was initially conducted on several different spectral ranges using the first derivatives of ATR spectra. Three ranges of minimum RMSECV were further selected in assessing original ATR spectra, and second derivatives for PLS analysis. PCR was performed on the spectral range with minimum RMSECV determined by PLS for ATR spectra and their first and second derivatives. Each PLS and PCR analysis also yielded the number of components in the model, RMSEC, R2 for calibration, RMSEP, and R2 for validation. RMSEC, RMSEP, and RMSECV are defined as

n

RMSEP =

∑i = 1 (φval,actual(i) − φval,calculated(i))2 n

(5)

where n denotes the number of samples for calibration or validation, f is the number of components used in model, “actual” refers to the true volume fraction of certain oil in binary mixtures, and “calculated” refers to the volume fractions predicted by the multivariate model. LOO cross-validation was performed on calibration sets. Calculated values (eqs 4 and 5) were denoted with *. To avoid the overfitting in selecting the number of components, f was determined with the minimal RMSE (for optimal f) value that equals RMSECV with correction by degrees of freedom used for RMSEC. In this way, we prevented the risk of selection of too many components (although RMSECV has to be defined and is defined with the n in the denominator). Mean centering of both spectral variables and concentrations was applied before performing PLS and PCR. All samples are summarized in Supporting Information Tables S-1−S-5 (in these tables the sample labeling is also defined). For calibration set samples with odd numbers (in each table), all pure oils were chosen, whereas in the validation set the samples with even numbers were selected. All computations were performed in program R using our script with the “svd” function of the library “base” regarding PCR, and regarding the PLS that relies on the SIMPLS algorithm.30



RESULTS AND DISCUSSION Spectra Analysis of Pure Oils. The fatty acid content of 4 pure vegetable oils (O, H, R, and S) is shown in Table 1, and their ATR spectra are presented in Figure 1. These spectra are dominated by triglycerides absorption bands and look very similar due to the similar chemical composition of oils. Most of the bands are already assigned in the literature,31−34 and here we present a summary (Table 2). Differences in these spectra manifest mostly in intensities of bands and can be perceived by careful investigation of Figure 1. In the C−H stretching region variations in intensities of bands at 3006, 2958, 2922, 2875, and 2853 cm−1 are quite significant. Furthermore, some minor changes can be observed at 1744 cm−1 (CO stretching), in the range from 1200 to 1000 cm−1, and at 915 cm−1 (C−C stretching). It was reasonable to predict that the eigenvectors describing variations related to the composition of oil mixtures should comprise these features, having implications on the further chemometric analysis that will rely on these modest variations. Principal Component Analysis of Oil Mixtures. Ternary Oil Mixtures. Principal component analysis was initially performed on spectral ranges (3700−2600 and 1900−600 cm−1) separately for each set of ternary mixtures. The reproduction of the corresponding ternary diagram (Scheme

n

RMSEC =

∑i = 1 (φcal,actual(i) − φcal*,calculated(i))2

∑i = 1 (φcal,actual(i) − φcal,calculated(i))2 n−f−1 4519

DOI: 10.1021/acs.analchem.6b00505 Anal. Chem. 2016, 88, 4516−4524

Article

Analytical Chemistry Table 2. Assignment of Bands in the Vibrational Spectra of Edible Oils Wavenumber/ cm−1

Group

3480 wa 3006 m 2958 sh 2922 vs 2875 sh 2853 s 1744 s 1654 w 1465 m 1447 sh 1438 sh 1418 w 1397 w 1378 m 1238 m 1161 s 1118 m 1097 m 968 w 914 w

alcohol cis RHCCHR −CH3 −CH2− −CH3 −CH2− RCOOR cis RHCCHR −CH2− −CH3 −CH2− cis CH cis C−H −CH3 −CH2− ester ester ester trans RHCCHR cis H−CC−H

722 s

−CH2−

Assignment O−H str. C−H str. (sym) C−H str. (anti) C−H str. (anti) C−H str. (sym) C−H str. (sym) CO str. CC str. C−H sciss. asym def. C−H sciss. C−H rock. C−H in plane bend. C−H def. (sym) C−H out-of-plane bend. C−O str. C−O str. C−O str. CC bend. H−CC−H out-of-plane bend. C−H rock.

a

Abbreviations: vs, very strong; s, strong; m, medium; w, weak; vw, very weak; str., stretching; bend., bending; sciss., scissoring; twist., twisting; def., deformation; rock., rocking; sym, symmetrical; anti, antisymmetrical; asym, asymmetrical.

Scheme 1. Ternary Diagram for ORS Mixture

Figure 2. PCA performed on the first derivatives of spectral range (3035−2989, 1170−1140, 1120−1100, 1093−1047, and 930−890 cm−1): (a) OHR ternary set, (b) OHS ternary set, and (c) ORS ternary set.

1) using the first two principal components (regardless of the ATR spectra derivative order) was unsuccessful. Out of all components, only PC1 loadings were relevant due to the high correlation with the degree of unsaturation. However, classification based only on PC1 was inadequate, e.g. binary mixture OS (50:50) had the same PC1 score value (degree of unsaturation) as pure R or H (Figure 2c). Therefore, for the ORS ternary mixture, it is necessary to have at least one additional principal component (PC2) that accounts for variations caused by the content of R. In the investigated spectral ranges there was a poor correlation between PC2 scores and fraction of R oil in the ORS ternary mixture. Similar results were obtained for the other ternary mixtures where the correlation was negligible as

well. It was evident that in the analysis there are too many variables contained that account for an additional noise in eigenvectors. From this point forward, we improved our analysis in two separate directions. One was the use of the first numerical derivatives of the ATR spectra to reduce possible baseline noise in the spectra, and the second was the use of only selected parts of the spectra where the changes were most prominent. The use of second derivatives did not improve the obtained classification results, but resulted in increased numerical noise in PC loadings. These spectral ranges were selected by careful investigation of previously obtained eigenvectors. For the ORS ternary mixture, it was found that use of the first derivatives in the spectral range (3035−2989, 1080−1047, and 930−890 cm−1) 4520

DOI: 10.1021/acs.analchem.6b00505 Anal. Chem. 2016, 88, 4516−4524

Article

Analytical Chemistry

Figure 3. PCA performed on the first derivatives of the spectral range (3035−2989, 1170−1140, 1120−1100, 1093−1047, and 930−890 cm−1) on all 166 samples (4 pure oils, 9 × 6 binary mixtures, and 3 × 36 ternary mixtures).

Table 3. Total Variance Represented by Principal Components and the Corresponding Kaiser Test for the First Derivatives of ATR Spectra in Spectral Range (3035−2989, 1170−1140, 1120−1100, 1093−1047, and 930−890 cm−1) OHR mixtures

OHS mixtures

ORS mixtures

Set of all binary mixtures

Set of all mixtures

PCn

Individual variance/%

Kaiser test

Individual variance/%

Kaiser test

Individual variance/%

Kaiser test

Individual variance/%

Kaiser test

Individual variance/%

Kaiser test

PC1 PC2 PC3 PC4 PC5 PC6

94.44 2.01 0.26 0.22 0.20 0.17

61.39 1.31 0.17 0.14 0.13 0.12

98.35 0.33 0.11 0.08 0.07 0.07

63.93 0.21 0.05 0.05 0.04 0.04

97.66 0.99 0.19 0.09 0.08 0.06

63.48 0.64 0.12 0.06 0.05 0.04

97.63 1.00 0.32 0.12 0.07 0.06

55.65 0.57 0.18 0.07 0.04 0.03

96.88 1.08 0.33 0.16 0.08 0.08

159.9 1.78 0.55 0.27 0.13 0.12

yields R2 of 0.986 between PC2 scores and the volume fraction of the R oil. Thereby PC2 describes approximately 1.1% of the overall variance in the data set whereas PC1 is describing the degree of unsaturation with 97.6% of the overall variance. However, the other sets of ternary mixtures were not well reproduced. Additional fine-tuning of the results was made by applying first derivatives of ATR spectra on the following spectral range: (3035−2989, 1170−1140, 1120−1100, 1093−1047, and 930− 890 cm−1). PCA conducted on this spectral range excellently reproduces not only the separate ternary mixtures with 2 principal components (Figure 2) but also a complete set of all mixtures with only 3 principal components (Figure 3). In Table 3 are given the total variances for ternary mixtures and Kaiser rule. Figure 2 displays three triangles, each representing a corresponding ternary mixture. From Tables S-6−S-8, it is evident that, by Wold’s R and Eastment’s W index criterion, ternary OHR mixtures can be represented with only two

components (and the rest is noise), except Wold’s crossvalidation, where the third component (R = 0.988) might also be taken into consideration. In general, cross-validation procedures support the classification of samples with only two principal components (Figure 2), which presents good reproductions of the corresponding ternary diagrams (Scheme 1). With the ternary diagram, both qualitative and quantitative analysis of ternary mixtures are possible (sample labels are defined in Tables S-1−S-5). Furthermore, PCA was applied to the complete set of all 166 samples using the same spectral range, and only the first three principal components can classify each sample (Figure 3). As PC1 describes the degree of unsaturation with extreme values of O and S (that represents two vertices of the tetrahedron), PC2 describes the content of R oil (the third vertex) and PC3 the content of H oil (the fourth vertex of the tetrahedron) (Figure 3a). In between the vertices, the edges of the tetrahedron represent the corresponding binary mixtures 4521

DOI: 10.1021/acs.analchem.6b00505 Anal. Chem. 2016, 88, 4516−4524

Article

Analytical Chemistry

Table 4. Determination of Optimal Number of Principal Components for the Complete Set of All Samples (166 samples in the data matrix) with Four Different Cross-Validations, R and W Indices (see eqs 1 and 2), and First Derivatives of ATR Spectra: 3035−2989, 1170−1140, 1120−1100, 1093−1047, and 930−890 cm−1 LOO for rows m comp

PRESS

PRESS

0 1 2 3 4 5 6 7 8 9 10

0.00109 3.38 × 10−5 2.21 × 10−5 1.85 × 10−5 1.68 × 10−5 1.59 × 10−5 1.51 × 10−5 1.43 × 10−5 1.36 × 10−5 1.30 × 10−5 1.24 × 10−5

0.110 3.46 × 10−5 2.3 × 10−5 1.95 × 10−5 1.79 × 10−5 1.74 × 10−5 1.68 × 10−5 1.62 × 10−5 1.57 × 10−5 1.51 × 10−5 1.49 × 10−5

eigenvector CV R

0.0319 0.679 0.882 0.965 1.04 1.06 1.08 1.10 1.11 1.15

PRESS 0.110 3.52 × 2.37 × 2.04 × 1.89 × 1.87 × 1.84 × 1.81 × 1.78 × 1.74 × 1.75 ×

10−5 10−5 10−5 10−5 10−5 10−5 10−5 10−5 10−5 10−5

Wold procedure R

PRESS

0.0325 0.700 0.921 1.02 1.11 1.16 1.20 1.25 1.28 1.35

0.110 3.82 × 2.37 × 2.19 × 1.88 × 1.78 × 1.64 × 1.56 × 1.51 × 1.46 × 1.42 ×

10−5 10−5 10−5 10−5 10−5 10−5 10−5 10−5 10−5 10−5

Krzanowski procedure R

0.035 0.701 0.991 1.02 1.06 1.03 1.03 1.06 1.07 1.09

PRESS 0.110 3.52 × 2.38 × 2.06 × 1.91 × 1.95 × 1.96 × 1.96 × 1.95 × 1.91 × 1.99 ×

10−5 10−5 10−5 10−5 10−5 10−5 10−5 10−5 10−5 10−5

R

W

0.0325 0.705 0.932 1.03 1.16 1.23 1.30 1.36 1.40 1.54

2.67 × 103 42.0 13.6 6.81 −1.44 −0.462 −0.188 0.543 1.82 −3.46

amount of ester CO and C−O functional groups, that cannot be determined by GC analysis.32 In PLS performed on GC data (11 independent variables, Table S-9) for the volume fraction of H oil, the R2 was only 0.838 for test samples when three latent variables were used, implying the presence of other information in spectra independent of the fatty acid composition. Discriminant Analysis. The final results in the identification of each particular oil (presence or absence of the complete set of all samples) are displayed in Tables 5 and S-10. QDA yielded significantly better results than LDA, probably because each class had a different covariance matrix. It is clear that application of QDA on the spectral range (3035−2989, 1170−1140, 1120−1100, 1093−1047, and 930−890 cm−1) can lead to quite plausible determination of the presence or absence of oil in an unknown sample. QDA LOO−CV on all samples yielded 89% classification with f = 3. Although the majority of binary mixtures were not classified as well as ternary mixtures, 89% of the classification accuracy based on only the three principal components necessary for classification represents satisfactory results. Quantitative Analysis. PLS and PCR results for quantitative determination of O in the ORS mixture that include several considered spectral ranges and derivative orders are displayed in detail in Table S-11. Those results are an excellent example of RMSECV minimization to achieve the best possible model. The correlation coefficient (R2) has been shown as the highest for the lowest RMSECV, and these high values of R2 for validation sets are not very common in the literature.11 Minimization of RMSECV should not be used only for the selection of the optimal number of variables in the PLS or PCR model but also for spectral range and/or derivative order selection. The detection limit calculated for the validation set according to IUPAC recommendations35 for both type I and type II errors equals 0.05, and that for O in the ORS mixture is 0.93%. In this analysis, the optimal number of components f was selected by applying a correction for the degrees of freedom (eqs 4 and 5). RMSECV, which depends on the number of components, usually has a global minimum with one or more local minimums.12 In order to make the procedure robust, this method selects a global minimum, but with respect to RMSECV “corrected for degrees of freedom’’that might prevent selection of too large a number of components, thus preventing the overfitting. Nonetheless, optimal spectral range,

whereas each face of the tetrahedron corresponds to one ternary mixture. The total variance and Kaiser rule, as well as results of four different cross-validations, for the complete set of all samples are given in Tables 3 and 4, respectively. Most of the cross-validation tests suggest that three PCs provide the satisfactory description of the original data while the rest represent the noise. Note that PC2 accounts for only 1.08% and PC3 for only 0.33% of the total variance, yet they describe two important dimensions for representation of all 166 mixtures. Analysis of PC Loading Vectors. The loadings of the first three principal components obtained by conducting PCA on the first derivatives of the spectral range (3035−2989, 1170− 1140, 1120−1100, 1093−1047, and 930−890 cm−1) are displayed in Figure 4. Since the PCA was calculated using the first derivatives, the spectral band positions in the loadings will be at the values of zero, and the actual changes of the bands should be around these positions. The most accentuated changes in PC1 loadings were at 3010 cm−1 due to the CH stretching vibration of the cis double bond and to the olefinic bending vibration at 915 cm−1. Other bands from 1170−1047 cm−1 are not as prominent. It is evident that PC1 describes changes in the concentration of cis double bonds, i.e. changes in the degree of unsaturation. The PC2 loading vector shows multiple overlapped bands in the range 3035−2989 cm−1. Since different fatty acids absorb at different wavenumbers, but the increased amount of double bonds results in higher wavenumber, PC2 loadings are probably related to the differences in the contents of palmitic, oleic, linoleic, and linolenic acid. Strong and wide positive first derivative shoulders in the range from 1069 to 1054 cm−1 and a pronounced negative shoulder at 1073 cm−1 imply changes in the content of linolenic acid because the spectrum of methyl linolenate has relatively strong bands between 1075 and 1060 cm−1 (Figure S-4). For ORS mixtures, R2 was set between PC2 scores and the mass fraction of linolenic acid (18:3) was 0.97, which means that PC2, although accounting for only 1.08% of the overall variance, mostly describes the fraction of linolenic (18:3) acid. Regarding PC3, its factor scores are well correlated with the content of H oil. Strong bands at 1151 cm−1 and between 1120 and 1080 cm−1 account for the C−O stretching vibration, i.e. differences of ester triacyl bonds. These changes do not necessarily represent the fraction of certain fatty acids. Thus, some authors suggested that IR can be used in prediction of the 4522

DOI: 10.1021/acs.analchem.6b00505 Anal. Chem. 2016, 88, 4516−4524

Article

Analytical Chemistry

Table 5. Discriminant Analysis with Mahalanobis Distance between Factor Scores Performed on the First Derivatives of the Spectral Range (3035−2989, 1170−1140, 1120−1100, 1093−1047, and 930−890 cm−1) Quadratic discriminant analysis (QDA) PCC Oil O H R S

f 4 3 3 3

LOO-CV (Train)

Test

LOO-CV (All)

97.6 92.8 96.4 94.0

100 97.6 98.8 92.8

99.4 95.2 97.0 95.1

Quantification of linolenic acid is possible in mixtures containing R, since only that oil contains linolenic acid in significant amounts (Table 1). The analysis of regression coefficients showed that for quantification of linolenic acid, the highest absolute value for the coefficient at 1068 cm−1 is needed (Figures S-4 and 4). The same observation may be noticed for regression pronounced coefficients at 2955 and 2942 cm−1 (which are of the opposite sign, respectively). Only the spectrum of methyl linolenate shows a distinct band at 2954 cm−1 among the other methyl esters present in significant amounts in analyzed vegetable oils. For other oils, the most prominent coefficients can be found at 1168, 1148, 1120, 1110, 1102, and 912 cm−1. Minimization of RMSECV for (1) spectral range, (2) derivative order, and (3) choice of method (PLS or PCR) was either not or very seldom used for adulteration purposes to the best of our knowledge. The results we have obtained regarding both cross-validation (RMSECV) and validation (RMSEP) for the OHS set using a previously described minimization procedure are statistically significantly better (ttest on squared deviations of residuals between two models, p < 0.05, and F-test on residuals) for each of the three constituents of the mixture, compared to the PLS model applied to the whole spectral range of original ATR spectra. Some studies efficiently select wavelengths using sinergy interval PLS and genetic algorithm siPLS with root-meansquare error of lowest cross-validation RMSECV.36 They use RMSECV for both selections of the number of components and spectral variables.36,37 That procedure is rarely used, since the computational costs are substantial, and still, it is only an estimation. Nevertheless, in this study we also performed siPLS for H oil in OHS and ORS mixtures; the results can be found in Supporting Information Tables S-13 and S-14. Very briefly, siPLS obtained for H in ORS RMSECV of 1.13% and RMSEP of 1.81%, which are better results than the results obtained in Table S-12. For H oil in OHS mixture, the obtained RMSECV of 1.84% and RMSEP of 2.09% were both lower. The results can be additionally slightly enhanced with the use of siPLS. Thus, we have achieved that R2 of the validation set for each oil in each considered sample set is greater than 0.99 (Tables S11−S-14). By these results, we conclude that FTIR coupled with chemometric methods, e.g. PLS and PCR, are very powerful tools for quantitative analysis of ternary oil mixtures.

Figure 4. PC loading vectors of first derivatives of ATR spectra, spectral range (3035−2989, 1170−1140, 1120−1100, 1093−1047, and 930−890 cm−1).

derivative order, and method (PLS or PCR) were selected with RMSECV with only the n in the denominator (eq 4). Results for all other oils are summarized in Table S-12. Here are displayed only models with minimum RMSECV (in the case of O in the ORS mixture set; bolded row in Table S-11). In some cases, minimal RMSECV did not result with maximal but nearly maximal R2 for validation. However, there is little space for improvement in real systems, since RMSECV is a good estimate, but still only an estimate. The limit of detection of H in the OHS mixture was 2.19%.



CONCLUSION The results of this study show that application of multivariate numerical methodology on FTIR-ATR spectra can reproduce the composition of binary (Figure S-5) and ternary oil mixtures using only the first few principal components. In 3D space, 4523

DOI: 10.1021/acs.analchem.6b00505 Anal. Chem. 2016, 88, 4516−4524

Article

Analytical Chemistry

(11) Rohman, A.; Che Man, Y. B. Food Anal. Methods 2011, 4, 155− 162. (12) Wehrens, R. Chemometrics with R, Multivariate Data Analysis in the Natural Sciences and Life Sciences; Springer-Verlag, Berlin, Heidelberg, 2011. (13) Rohman, A.; Che Man, Y. B. J. Am. Oil Chem. Soc. 2009, 86, 1149−1153. (14) Rohman, A.; Man, Y. B. C.; Ismail, A.; Hashim, P. J. Am. Oil Chem. Soc. 2010, 87, 601−606. (15) Wang, L.; Lee, F. S. C.; Wang, X.; He, Y. Food Chem. 2006, 95, 529−536. (16) Paradkar, M. M.; Irudayaraj, J. J. Food Sci. 2002, 67, 2507−2511. (17) Zvijezda; http://www.zvijezda.hr. (18) Beltrami, E. Giornale di Matematiche ad Uso degli Studenti Delle Universita 1873, 11, 98−106. (19) Pearson, K. Philos. Mag. 1901, 2, 559−572. (20) Hotelling, H. J. Educ. Psychol. 1933, 24, 417−441. (21) Jolliffe, I. T. Principal Component Analysis; Springer, Berlin, 1986. (22) Smilde, A.; Bro, R.; Geladi, P. Multi-way Analysis with Applications in the Chemical Sciences; John Wiley & Sons Ltd, Chichester, 2004. (23) Hrenar, T. moonee, Program for Manipulation and Analysis of Multi- and Univariate Data, rev. 0.6827. (24) Geladi, P.; Kowalski, B. Anal. Chim. Acta 1986, 185, 1−17. (25) Filzmoser, P.; Todorov, V. Anal. Chim. Acta 2011, 705, 2−14. (26) Bro, R.; Kjeldahl, K.; Smilde, A. K.; Kiers, H. A. L. Anal. Bioanal. Chem. 2008, 390, 1241−1251. (27) R Core Team (2015). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/. (28) Singular Value Decomposition of a Matrix; http://stat.ethz.ch/ R-manual/R-patched/library/base/html/svd.html. (29) Bioconductor: Open Source Software for Bioinformatics; http://www.bioconductor.org/packages//2.10/bioc/html/ pcaMethods.html. (30) de Jong, S. Chemom. Intell. Lab. Syst. 1993, 18, 251−263. (31) Yang, H.; Irudayaraj, J. J. Am. Oil Chem. Soc. 2001, 78, 889−895. (32) Safar, M.; Bertrand, D.; Roberta, P.; Devaux, M. F.; Genot, C. J. Am. Oil Chem. Soc. 1994, 71, 371−377. (33) Baeten, V.; Meurens, M. J. Agric. Food Chem. 1996, 44, 2225− 2230. (34) Jović, O.; Smolić, T.; Jurišić, Z.; Meić, Z.; Hrenar, T. Croat. Chem. Acta 2013, 86, 335−344. (35) Currie, L. A. Pure Appl. Chem. 1995, 67 (10), 1699−1723. (36) Qiang, L.; Mingjie, T.; Jianrong, C.; Huazhu, L.; Chaitep, S. Maejo Int. J. Sci. Technol. 2010, 4, 113−124. (37) Sarrafi, A. H. M.; Konoz, E.; Ghiyasvand, M. E-J. Chem. 2011, 8, 1670−1679.

scores of these components form a tetrahedron, with pure oils being at the vertices, binary mixtures at the edges, and ternary mixtures on the faces of the tetrahedron. Several crossvalidation procedures as well as PC scores discriminant analysis using Mahalanobis distances support these findings, suggesting that all other principal components represent statistical noise. For qualitative prediction of unknown samples using Mahalanobis distances, QDA was preferred over LDA. Despite the fatty acid composition similarity between high-oleic sunflower oil and the OS binary mixture, and taking into account almost the same iodine value as the rapeseed oil, higholeic sunflower oil can still be very well discriminated from other oils, resulting in PCC = 97.6% and R2 > 0.99 (for test samples). PLS and PCR using the minimization of RMSECV with respect to spectral range and derivative order in all cases resulted in high R2 > 0.99 (for test samples), and hence could be used to quantify the content of extra virgin olive oil, sunflower oil, rapeseed oil, and high-oleic sunflower oil in an unknown binary or ternary mixture.



ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.analchem.6b00505. Grid PCA, ROBPCA, spherical PCA, attenuated total reflectance spectra, labels for samples, optimal number of principal components, results of GC analysis, performance of different multivariate models, best multivariate results, and results of PLS and SiPLS (PDF)



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]; Phone +385 1 4606 072. Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS This study was supported by the Ministry of Science of Croatia under the research project Spectroscopic analysis of unsaturated systems and metal compounds (MSES 119-11913422959).



REFERENCES

(1) Hernández-Martínez, M.; Gallardo-Velázquez, T.; Osorio-Revilla, G. Eur. Food Res. Technol. 2010, 231, 321−329. (2) Kemsley, E. K.; Holland, J. K.; Defernez, M.; Wilson, R. H. J. Agric. Food Chem. 1996, 44, 3864−3870. (3) Downey, G.; Briandet, R.; Wilson, R. H.; Kemsley, E. K. J. Agric. Food Chem. 1997, 45, 4357−4361. (4) Paradkar, M. M.; Sivakesava, S.; Irudayaraj, J. J. Sci. Food Agric. 2003, 83, 714−721. (5) Al-Jowder, O.; Defernez, M.; Kemsley, E. K.; Wilson, R. H. J. Agric. Food Chem. 1999, 47, 3210−3218. (6) Sivakesava, S.; Irudayaraj, J. J. Food Sci. 2001, 66, 787−791. (7) Lai, Y. W.; Kemsley, E. K.; Wilson, R. H. Food Chem. 1995, 53, 95−98. (8) Härmä, H.; Peltomaa, R.; Pihlasalo, S. Anal. Chem. 2015, 87, 6451−6454. (9) Samanta, S.; Kar, C.; Das, G. Anal. Chem. 2015, 87, 9002−9008. (10) Nicolaou, N.; Xu, Y.; Goodacre, R. J. Dairy Sci. 2010, 93, 5651− 5660. 4524

DOI: 10.1021/acs.analchem.6b00505 Anal. Chem. 2016, 88, 4516−4524