Experimental Design, Near-Infrared Spectroscopy, and Multivariate

Nov 1, 2012 - In the project described here, the students worked on experimental design, using near-infrared spectroscopy and multivariate calibration...
Laboratory Experiment pubs.acs.org/jchemeduc

Experimental Design, Near-Infrared Spectroscopy, and Multivariate Calibration: An Advanced Project in a Chemometrics Course

Rodrigo R. de Oliveira, Luiz S. das Neves, and Kássio M. G. de Lima*

Universidade Federal do Rio Grande do Norte (UFRN), Instituto de Química, Grupo de Pesquisa em Quimiometria Aplicada (GPQA), CEP 59072-970 Natal, RN, Brazil

ABSTRACT: A chemometrics course that includes an in-depth project is offered to students in their fifth semester of the chemistry undergraduate program. Students carry out the project over five weeks (three 8-h sessions per week) in parallel with other courses or practical work. The students conduct a literature search, carry out laboratory work, and write a technical report on a research subject of their own choice. In the project described here, the students worked on experimental design, using near-infrared spectroscopy and multivariate calibration, to develop methods to predict the properties of biodiesel and diesel blends. In addition to dealing with the chemometric tasks, the students synthesized the biodiesel sample and came to understand its importance as a renewable energy source.

KEYWORDS: Graduate Education/Research, Upper-Division Undergraduate, Analytical Chemistry, Laboratory Instruction, Computer-Based Learning, Hands-On Learning/Manipulatives, Applications of Chemistry, Chemometrics, IR Spectroscopy

Chemometrics was originally defined as “the art of extracting chemically relevant information from data produced in chemical experiments”,1 but it is often also defined by the methods employed. Multivariate calibration, pattern recognition, and design of experiments applied to chemical problems are some of the topics often connected to chemometrics. These are mainly statistical methods, but many instructors believe that they are best taught to chemistry students by chemists with experience in solving real-world problems. The scope of chemometrics is wide; applications are found in many fields, and the toolbox of useful methods is diverse. This variety makes introducing chemometrics to chemistry undergraduate students a major challenge.
A chemometrics course should ideally give an orientation and “taste” of different methods and applications as well as an understanding of how chemometrics may assist and provide answers to the students’ own research questions.2,3 Herein, we present a laboratory project for a chemometrics course in which the focus is on experimental design and on the chemometric technique of partial least-squares (PLS) analysis, with a factorial design used to select the pretreatment for PLS models of transmittance spectra in the near-infrared (NIR) region. The goal of the experiment is to determine simultaneously the density and the biodiesel content of biodiesel/diesel blends prepared from biodiesel synthesized in the laboratory. Near-infrared spectroscopy (NIRS), combined with chemometrics, is an appropriate technique to examine these properties.4,5 NIRS is a type of vibrational spectroscopy that uses electromagnetic radiation in the wavelength range from 750 to 2500 nm. NIR spectroscopy is often omitted from the chemistry curriculum because the spectra do not consist of “clean” fundamental transitions but represent complicated sums of combination and overtone bands.6,7 However, this is an excellent opportunity to expose students to the principles of computational algorithms that, if carefully and properly applied, allow the extraction of meaningful information from seemingly complex spectral traces.

© 2012 American Chemical Society and Division of Chemical Education, Inc.



OVERVIEW OF THE PROJECT

The students developed the project and collected the experimental data in the instrumental analysis laboratory. The project demonstrates the power of NIR spectroscopy combined with chemometrics. These methods have been proposed for the analysis of biodiesel and the determination of quality parameters of biodiesel/diesel blends. The students used a factorial design to select which chemometric preprocessing techniques could increase the predictive ability of the multivariate models, as measured by the RMSEP (root-mean-square error of prediction).



PROCEDURE AND TECHNIQUES

Sample Preparation

In the laboratory, students blended biodiesel and petroleum diesel, with biodiesel fractions ranging from 0% (v/v) to approximately 20.5% (v/v) and a final mixture volume of 50 mL. The students synthesized the biodiesel from soybean vegetable oil via a transesterification reaction. The soybean vegetable oil was dried in an oven at 100 °C for 2 h to avoid soap formation during the reaction. Methanol (Chemco) was mixed with a mass of potassium hydroxide (KOH; Synth) corresponding to 1% (w/w) of the soybean oil used, under constant mechanical agitation at 500 rpm. KOH was used as the catalyst in the transesterification reaction. After cooling, the soybean oil was mixed with the methanol/catalyst solution in a molar ratio of 6:1 (methanol/vegetable oil). The mixture was kept in a beaker and allowed to react for approximately 2 h under agitation at 500 rpm at room temperature (21−25 °C). After the reaction was completed, the mixture was transferred to a separating funnel (Figure 1), where the glycerol was separated

Published: November 1, 2012

dx.doi.org/10.1021/ed200765j | J. Chem. Educ. 2012, 89, 1566−1571

the standard of ASTM D4052. The digital densimeter DMA model 4500M (ANTON PAAR) was used with 3 min sample readings. The density values of the blends were found to vary between 836.6 and 845.8 kg/m³.

Methods of Data Treatment

Three different preprocessing techniques were studied in this work. The preprocessing methods, the multivariate calibration method, and the factorial design applied to the data are explained below.

The Savitzky−Golay (SG) smoothing and derivative methods are used to mathematically reduce random noise, with the goal of increasing the signal-to-noise ratio of the spectral data.8 The SG method uses a window that can be thought of as a region of influence. For SG smoothing, the points in the window are used to fit a polynomial function of a chosen order by least squares, and the value of the center point is replaced by the value of the fitted curve at the same position. This procedure is repeated for each window of points, dropping one point at the left side of the window and picking up one at the right each time.8 Therefore, the window width directly affects the resulting smoothing. For the SG derivative, the method is basically the same, but the fitted curve is differentiated before the value of the center point is replaced. The SG derivative is useful for removing baseline features. The noise level, the number of data points, and the sharpness of the features should all be considered when applying the SG derivative.

Multiplicative scatter correction (MSC) is a preprocessing tool developed to correct for the significant light-scattering problems in reflectance spectroscopy. When using MSC, one assumes that the dependence of the scattering or baseline signal on the variable number (wavelength) is different from that of the chemical information. One advantage of this approach over the derivative methods is that the preprocessed spectra resemble the original spectra.

Partial least-squares (PLS) regression is a linear multivariate data analysis method developed to handle highly correlated data, such as an NIR spectrum.
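To make the MSC step described above more concrete, the following Python sketch shows one common formulation: each spectrum is regressed against a reference spectrum by least squares, and the fitted offset and slope are then removed. Using the mean spectrum as the reference, the function name `msc`, and the toy numbers are assumptions made for illustration; the source does not state how the commercial software implements MSC.

```python
def msc(spectra):
    """Multiplicative scatter correction (illustrative sketch).

    Each spectrum s is modelled as s ~ a + b * ref, where ref is the mean
    spectrum (an assumed choice of reference). The corrected spectrum is
    (s - a) / b.
    """
    n = len(spectra)
    p = len(spectra[0])
    # Reference spectrum: mean of all spectra at each variable.
    ref = [sum(s[j] for s in spectra) / n for j in range(p)]
    ref_mean = sum(ref) / p
    ref_ss = sum((r - ref_mean) ** 2 for r in ref)

    corrected = []
    for s in spectra:
        s_mean = sum(s) / p
        # Ordinary least-squares slope (b) and intercept (a) of s on ref.
        b = sum((r - ref_mean) * (v - s_mean) for r, v in zip(ref, s)) / ref_ss
        a = s_mean - b * ref_mean
        corrected.append([(v - a) / b for v in s])
    return corrected, ref


# Hypothetical toy data: one "true" spectrum distorted by offsets and slopes.
base = [1.0, 2.0, 4.0, 3.0]
spectra = [base, [2.0 * v + 1.0 for v in base], [0.5 * v - 0.2 for v in base]]
corrected, ref = msc(spectra)
```

In this toy case the three spectra differ only by an additive offset and a multiplicative factor, so after correction they all coincide with the mean spectrum. Real NIR spectra also carry chemical differences, which MSC is intended to leave in place.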
In PLS, a suitable set of latent variables is obtained using both the independent (X matrix) and dependent (y vector) variable blocks by means of an iterative process that maximizes the covariance between these two blocks. The data matrix X is formed by the multivariate data (e.g., NIR spectra), and the vector y contains the reference values (e.g., concentration). Two models of the form given in eqs 1 and 2 are obtained, linked by the linear algebraic relation between their scores.9−13

Figure 1. Biodiesel separation system setup.

by gravity. After removing the glycerol phase, the remaining solution containing a mixture of methyl esters was washed with a diluted aqueous solution of HCl to neutralize any trace of the catalyst that was not removed along with the glycerol. A subsequent wash with distilled water was performed. The washed biodiesel was placed in an oven at 100 °C for 1 h to evaporate the moisture and alcohol that remained at that stage. After cooling, the sample of synthesized biodiesel was stored at 10 °C to avoid oxidation.

NIR Analysis

The students recorded the transmittance spectra of the biodiesel blends in the NIR region using a model MB 160 spectrophotometer (Bomem). Each spectrum was an average of 50 scans obtained with a resolution of 8 cm⁻¹ in the range of 750−2500 nm using a quartz cuvette with an optical path length of 1 mm. An average time of 40 s was used to obtain each spectrum.

$$X = TP^{T} + E = \sum_{i=1}^{A} t_i p_i^{T} + E \qquad (1)$$

$$y = Tq^{T} + f = \sum_{i=1}^{A} t_i q_i + f \qquad (2)$$

where E (n × p) and f (n × 1) are the error matrices containing the parts of X (n × p) and y (n × 1), respectively, that are not explained by the model; n and p are the numbers of samples (rows) and variables (columns), respectively; t_i are the column vectors that make up the score matrix T (n × A); and p_i and q_i are the loadings that make up the loading matrices P (p × A) and q (1 × A), respectively, with A equal to the number of latent variables suitable to explain the variance of the variables. These matrices are shown in Figure 2. The parameters of interest (ŷ),

Reference Methods

The biodiesel content in the 38 blends was determined according to European standard EN 14078 using a Fourier transform infrared (FTIR) spectrophotometer, model IRAffinity-1 (Shimadzu). A minimum value of 0% (v/v) and a maximum of 20.47% (v/v) of pure synthetic biodiesel were obtained. The densities at 20 °C of the 38 biodiesel/diesel blends were determined by the digital densimeter method following

$$\mathrm{RMSEP} = \sqrt{\frac{\sum_{i=1}^{I} (y_i - \hat{y}_i)^2}{I}} \qquad (4)$$

where y_i is the actual parameter value, ŷ_i is the value predicted by the PLS model, and I is the number of external validation samples (15 in this work).
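Equation 4 translates directly into a few lines of code. The Python sketch below is only an illustration: the function name `rmsep` and the density values are hypothetical and are not taken from the students' data.

```python
from math import sqrt


def rmsep(y_true, y_pred):
    """Root-mean-square error of prediction over I validation samples (eq 4)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(y_true))


# Hypothetical reference and predicted densities (kg/m^3) for three samples.
reference = [840.0, 842.0, 844.0]
predicted = [840.0, 842.0, 846.0]
error = rmsep(reference, predicted)  # sqrt(4 / 3), about 1.155 kg/m^3
```

The result carries the same units as the property being predicted, which is why the RMSEP values for density and for composition cannot be compared with each other directly.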



HAZARDS

Proper safety equipment (laboratory coat, gloves, and safety goggles) should be worn during the biodiesel synthesis, the blending with diesel, and the acquisition of experimental data. Hot vegetable oil, biodiesel, commercial diesel, and methanol are flammable. Potassium hydroxide (caustic), hydrochloric acid (corrosive), commercial diesel, biodiesel, and methanol can cause skin and eye irritation. Do not swallow or inhale any of the chemicals. Handle all substances with care. Disposal must follow proper waste disposal regulations.

Figure 2. Decomposition of matrix X and vector y on their scores and loadings.

for a set of samples, are obtained by multiplying the X matrix by a suitable regression vector b (p × 1), calculated as

$$\hat{y} = Tq^{T} = XW(P^{T}W)^{-1}q^{T} = Xb \qquad (3)$$



where W (p × A) is the weight matrix obtained by the PLS algorithm.

Experimental design and optimization of experiments are useful tools for solving problems in many situations. An example is deciding how to perform an experiment to obtain the maximum yield, in other words, deciding which variables are most important and how they affect the experiment. One of the simplest ways to carry out an experimental design, and to introduce the subject to undergraduate students, is to perform a factorial design. This is used to determine the influence, or effect, of a number of experimental variables, called factors, on a given response and to identify which factors are most important to an experiment. A brief explanation is presented here; for more details on the calculation of effects in a full factorial design, the students should read the references.13−15 A full factorial design with n factors at two levels is composed of 2^n runs. The factor levels are indicated with − (minus) for the low level and + (plus) for the high level. The main effect of a factor is the difference between the mean response at its high level and the mean response at its low level, averaged over the levels of the other factors; an interaction effect is the mean effect observed when two or more factors are changed simultaneously.
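The construction of a two-level full factorial design and the calculation of its effects can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure implemented in the statistical software used by the students. The factor labels follow the codes used in this project (F, C, S, D), while the function names and the response values are hypothetical and were generated from a made-up model so that the expected effects are known in advance.

```python
from itertools import product

FACTORS = ["F", "C", "S", "D"]  # factor codes used in this project


def full_factorial(n_factors):
    """All 2^n combinations of coded levels (-1 = low, +1 = high)."""
    return [list(levels) for levels in product((-1, 1), repeat=n_factors)]


def effect(design, responses, columns):
    """Mean response at the + level minus mean response at the - level.

    For a main effect, `columns` holds one factor index. For an interaction,
    the contrast sign of each run is the product of the coded levels of the
    listed factors.
    """
    signs = []
    for run in design:
        sign = 1
        for c in columns:
            sign *= run[c]
        signs.append(sign)
    high = [r for s, r in zip(signs, responses) if s == 1]
    low = [r for s, r in zip(signs, responses) if s == -1]
    return sum(high) / len(high) - sum(low) / len(low)


design = full_factorial(len(FACTORS))  # 2^4 = 16 runs

# Hypothetical responses (e.g., RMSEP) from an invented model:
# response = 1.0 + 0.2 * F - 0.3 * D, in coded units.
responses = [1.0 + 0.2 * run[0] - 0.3 * run[3] for run in design]

main_effects = {name: effect(design, responses, [i])
                for i, name in enumerate(FACTORS)}
fd_interaction = effect(design, responses, [0, 3])
```

With these invented responses, the main effect of F is +0.4 and that of D is −0.6 (twice the coded coefficients, because moving from − to + spans two coded units), while C, S, and the F×D interaction have zero effect.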

STUDENTS' RESULTS

The original NIR spectra (750−2500 nm) for the 38 biodiesel/diesel blended samples analyzed without prior mathematical

Figure 3. NIR spectra from 38 biodiesel/diesel blends: light gray is the selected variable range 1 (1000−2500 nm); dark gray is the selected variable range 2 (1600−2500 nm); (A) high noise level region; (B) 2nd overtone and combination region of the C−H stretch; (C) 1st overtone region for C−H bonds; and (D) combination of C−H and C=O bonds and 2nd overtone of the C=O bond.

Multivariate Calibration and Data Analysis

The data analysis and construction of the chemometric models were performed using Unscrambler 9.7 (CAMO) for the PLS regression models and STATISTICA 7 (StatSoft Inc.) for the full factorial design. Both programs are easy to use, and trial versions can be downloaded from the Internet. The students applied the preprocessing methods to the data. They smoothed the original spectra and calculated the first derivative using the algorithm proposed by Savitzky and Golay8 with a second-order polynomial function and window widths of 5 and 3 data points, respectively. Multiplicative scatter correction (MSC) was also applied to the data. The selection of spectral bands was also evaluated in the factorial design. PLS regression was used as the multivariate calibration model correlating the X matrix (preprocessed spectra) with the y vector (experimentally determined blend compositions or densities of the samples). Separate PLS models were built for each parameter. A group of 23 samples was used as the calibration set, and the remaining 15 samples were used for external validation. The root-mean-square error of prediction (RMSEP) was calculated as expressed in eq 4.
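As an illustration of the Savitzky−Golay settings reported above (5-point smoothing and 3-point first derivative, both with a second-order polynomial), the Python sketch below applies the corresponding fixed convolution weights. The weights are the standard tabulated values for these window sizes; the function name, the decision to return only the points where the full window fits, and the test signal are assumptions made for illustration and do not reproduce the commercial software's edge handling.

```python
# Tabulated Savitzky-Golay weights for the settings used in this project.
# 5-point smoothing, quadratic fit: (-3, 12, 17, 12, -3) / 35
SG_SMOOTH_5_QUADRATIC = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]
# 3-point first derivative, quadratic fit: (-1, 0, 1) / 2 per unit spacing.
# A quadratic through three points is an exact interpolation, so this
# reduces to the central difference.
SG_DERIV_3_QUADRATIC = [-0.5, 0.0, 0.5]


def apply_window(y, weights):
    """Slide the window over y and return the weighted sum at each center.

    Only centers where the whole window fits are returned, so the output is
    shorter than the input by len(weights) - 1 points.
    """
    half = len(weights) // 2
    return [
        sum(w * y[i + k - half] for k, w in enumerate(weights))
        for i in range(half, len(y) - half)
    ]


# Hypothetical noise-free test signal: y = x^2 sampled at x = 0..6.
signal = [float(x * x) for x in range(7)]
smoothed = apply_window(signal, SG_SMOOTH_5_QUADRATIC)    # centers x = 2..4
derivative = apply_window(signal, SG_DERIV_3_QUADRATIC)   # centers x = 1..5
```

Because the test signal is itself a quadratic, the smoothing leaves it unchanged and the derivative equals 2x at each center, which is a convenient check that the weights are applied correctly. For unevenly spaced wavelengths the derivative would also need to be divided by the point spacing.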

Table 1. Factors and Levels for the Study of Pretreatment of PLS Models

Factor                            Code   Level (−)          Level (+)
Spectral range selection          F      variable range 1   variable range 2
MSC                               C      No                 Yes
Savitzky−Golay smoothing          S      No                 Yes
Savitzky−Golay first derivative   D      No                 Yes

processing are shown in Figure 3. The range between 750 and 1000 nm was not used for analysis because of the high noise level present in the spectra. There are four prominent band regions: the second overtone at 1150−1250 nm and the combination region at 1300−1515 nm for the C−H stretch; the first overtone region at 1650−1900 nm for C−H bonds; and, finally, the combination region for the C−H bond and the combination bands for the C=O and C−H bonds


Figure 4. Variable range 1 (1000−2500 nm) of the NIR spectra of the 38 samples after preprocessing: 3-point SG first derivative and 5-point SG smoothing, both with a second-order polynomial, and MSC.

Figure 6. Response surface of the RMSEP for the prediction of density (kg/m³) as a function of factors C (MSC) and D (Savitzky−Golay first derivative). White dots are the experimental values used to fit the response surface.

covering the 2100−2500 nm band.16,17 For the PLS models used in this work, students chose two ranges to examine: variable range 1, covering 1000−2500 nm, and variable range 2, a smaller range (1600−2500 nm) covering mainly the combination bands. The biodiesel contribution is present in all the NIR spectra. To assess which factors and levels in the pretreatment of spectra resulted in lower RMSEP for the parameters investigated by the PLS method, the students performed a 2^4 full factorial design, totaling 16 experiments, resulting from the combination of the factors summarized in Table 1. The first factor shown in Table 1, F, corresponds to the selection of a smaller spectral range (variable range 2 is the + level) to reduce the number of variables relative to the larger range (variable range 1 is the − level). The bands corresponding to each selected variable range are shown in Figure 3. The second factor, C, is the MSC; the third factor, S, is the Savitzky−Golay smoothing using a window with five points and a second-order polynomial function; and the last factor, D, is the Savitzky−Golay first derivative using three

points and a second-order polynomial. Figure 4 shows the spectra obtained for one of the treatment combinations of the factorial design. The full factorial design applied in this activity has four factors (in coded levels, positive and negative) and one response, the RMSEP value. The calculated main effects and interactions between factors can be positive or negative. The Pareto charts enable students to visualize the effects of the factors on the RMSEP. Positive values mean the error increases when the factor is selected (has a positive level in Table 1), and negative values mean the error decreases. For the purposes of calibration, the more negative the effect on the RMSEP, the greater the predictive capacity of the PLS model. By looking at the effects in the Pareto charts for the density (Figure 5A) and composition (Figure 5B) and evaluating the RMSEP, the students can see that the application of the Savitzky−Golay first derivative for the density parameter

Figure 5. Pareto charts of the effects of pretreatment on the PLS models for biodiesel blend (A) density and (B) composition. F, C, S, and D are the main effects defined in Table 1.


Figure 7. Graph of prediction models, PLS, illustrating the correlation between measured and predicted parameter: (A) density and (B) blend composition.

significantly decreases the RMSEP values (−0.157 kg/m³). For the composition parameter, it was observed that the application of MSC decreases the RMSEP values (−0.1501%). It was also noted that selecting the smaller range of variables (variable range 2) gives higher RMSEP values; this increase is more significant in the preprocessing done for the determination of biodiesel content (Figure 5B). Therefore, spectra were preprocessed by applying the SG first derivative and MSC using the range of 1000−2500 nm (variable range 1) for the PLS models. Figure 6 shows the response surface relating the RMSEP to factors D and C for determining the density of the blends, showing the areas with lower values of RMSEP.

PLS calibration models were prepared by the students using the spectra pretreated according to the factorial design results. The spectra and the values of the property to be predicted were mean-centered prior to calibration and validation. The number of latent variables to be included in the calibration models was estimated by leave-one-out cross-validation. In this procedure, the predictive ability of a model is tested by removing one sample from the calibration set and predicting the property of this sample with the calibration model derived from the remaining samples. This is repeated for all samples in the calibration set and for different numbers of latent variables. By comparing the actual values with those predicted in cross-validation, an error parameter analogous to the RMSEP can be calculated; in the case of cross-validation, it is called the RMSECV. The best number of latent variables is chosen where the RMSECV reaches a minimum or stabilizes.

It is worthwhile to mention that the ideal situation is when the number of latent variables also represents the number of chemical constituents that vary in the analyzed samples (apart from physical variation, which is hopefully removed by the sample pretreatment methods). Two latent variables were chosen by the students using these criteria. This low number agrees with the changes in the NIR spectra corresponding to changes in the composition of the biodiesel/diesel mixtures. The correct choice of pretreatment and experimental design procedures gave the PLS models the best predictive ability (low error and better correlation) in the determination of possible composition changes in biodiesel blends.
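The leave-one-out procedure described above can be sketched with a small single-response PLS (PLS1) routine written in plain Python. This is an illustrative sketch only and is not the algorithm as implemented in the software used by the students: the function names, the crude stopping tolerance, and the synthetic data are assumptions. Prediction is done here by deflating the new spectrum one latent variable at a time, which for this algorithm gives the same result as applying the regression vector b of eq 3.

```python
from math import sqrt


def pls1_fit(X, y, n_components, tol=1e-12):
    """Fit a PLS1 model (NIPALS-style) on mean-centered X and y."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # X residuals
    f = [v - y_mean for v in y]                                # y residuals
    components = []
    for _ in range(n_components):
        # Weight vector: direction in X with maximal covariance with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = sqrt(sum(v * v for v in w))
        if norm < tol:  # nothing left to model (crude, assumed criterion)
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next latent variable.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, load, q))
    return x_mean, y_mean, components


def pls1_predict(model, x):
    """Predict y for one spectrum by sequential deflation."""
    x_mean, y_mean, components = model
    r = [v - m for v, m in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in components:
        t = sum(a * b for a, b in zip(r, w))
        y_hat += q * t
        r = [a - t * b for a, b in zip(r, load)]
    return y_hat


def rmsecv(X, y, n_components):
    """Leave-one-out cross-validation error for a given model size."""
    total = 0.0
    for i in range(len(X)):
        model = pls1_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], n_components)
        total += (y[i] - pls1_predict(model, X[i])) ** 2
    return sqrt(total / len(X))


# Synthetic, noise-free data with a single underlying source of variation.
scores = [1.0, 2.0, 3.0, 4.0, 5.0]
X = [[s * 1.0, s * 2.0, s * 3.0] for s in scores]
y = [5.0 * s + 1.0 for s in scores]
cv_errors = {a: rmsecv(X, y, a) for a in (1, 2)}
```

In this synthetic example a single latent variable already reproduces y essentially exactly, so the RMSECV does not improve when a second one is requested. With real spectra, the RMSECV would be plotted against the number of latent variables and the point where it reaches a minimum or levels off would be chosen.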

Figure 7 shows the curves constructed from the PLS models for the density and composition of the biodiesel blends, obtained after the pretreatment step for the validation set. The density parameter showed a correlation coefficient of 0.98 and an RMSEP of 0.225 kg/m³. The composition prediction results showed a correlation coefficient of 0.97 and an RMSEP of 0.666% (v/v). The values of R² for both parameters were close to 1, showing a good correlation between points resulting in a straight line, both in the calibration stage and in prediction. This also demonstrated that a correct number of factors was chosen by the students. There is no evidence of overfitting in these calibration models, as indicated by the similar error levels in the calibration and prediction sets. On the basis of these acceptable values, the students considered the models satisfactory and efficient for the current analysis.



CONCLUSION

As shown by this project, the students successfully completed the experimental design with multivariate calibration (PLS) models for the determination of the density and composition of biodiesel/diesel blends. The discussions during this project gave the undergraduates an understanding of special topics in chemometrics, especially of a widely used spectroscopic tool (NIR). The proposed experiment brought together key concepts related to near-infrared spectroscopy, multivariate regression methods, and experimental design through a practical analytical application, as well as the experience of producing a type of biofuel. The proposed procedure proved to be suitable as an experimental practice in the discipline of chemometrics and could be shared with organic chemistry or technological chemistry courses for chemistry students. This project can be run in five weeks, which covers the acquisition of spectra, the reference method, and the construction of multivariate models.



ASSOCIATED CONTENT

Supporting Information

Steps for developing a near-infrared analysis, experimental design, and multivariate calibration. This material is available via the Internet at http://pubs.acs.org.


AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

The authors thank Propesq-UFRN and the Institute of Chemistry, UFRN, for the scholarship received by the student Rodrigo Rocha de Oliveira.



REFERENCES

(1) Wold, S. Chemom. Intell. Lab. Syst. 1995, 30, 109−115.
(2) Cazar, R. A. J. Chem. Educ. 2003, 80, 1026−1029.
(3) Oberg, T. J. Chem. Educ. 2006, 83, 1178−1181.
(4) Wanke, R.; Stauffer, J. J. Chem. Educ. 2007, 84, 1171−1173.
(5) Pierce, K. M.; Schale, S. P.; Le, T. M.; Larson, J. C. J. Chem. Educ. 2011, 88, 806−810.
(6) Pasquini, C. J. Braz. Chem. Soc. 2003, 14, 198−219.
(7) Blanco, M.; Villarroya, I. TrAC, Trends Anal. Chem. 2002, 21, 240−250.
(8) Savitzky, A.; Golay, M. J. E. Anal. Chem. 1964, 36, 1627−1639.
(9) Vandeginten, B. M. G.; Massart, D. L.; Buydens, S.; De Jong, S.; Lewi, P. J.; Smeyers-Verveke, J. Handbook of Chemometrics and Qualimetrics: Part B; Elsevier: Amsterdam, 1998.
(10) Otto, M. Chemometrics: Statistics and Computer Application in Analytical Chemistry, 2nd ed.; Wiley-VCH: Weinheim, Germany, 2007.
(11) Geladi, P.; Kowalski, B. R. Anal. Chim. Acta 1986, 186, 1−17.
(12) Brereton, R. G. Analyst 2000, 125, 2125−2154.
(13) Brereton, R. G. Applied Chemometrics for Scientists; John Wiley & Sons Ltd: Chichester, U.K., 2007.
(14) Lundstedt, T.; Seifert, E.; Abramo, L.; Thelin, B.; Nyström, Å.; Pettersen, J.; Bergman, R. Chemom. Intell. Lab. Syst. 1998, 42, 3−40.
(15) Bruns, R. E.; Scarmino, I. S.; de Barros Neto, B. Data Handling in Science and Technology, Vol. 25: Statistical Design - Chemometrics, 1st ed.; Elsevier: Amsterdam, 2006.
(16) de Lira, L. F. B.; de Vasconcelos, F. V. C.; Pereira, C. F.; Paim, A. P. S.; Stragevitch, L.; Pimentel, M. F. Fuel 2010, 89, 405−409.
(17) Xiaobo, Z.; Jiewen, Z.; Povey, M. J. W.; Holmes, M.; Hanpin, M. Anal. Chim. Acta 2010, 667, 14−32.
