LABORATORY EXPERIMENT pubs.acs.org/jchemeduc
An Advanced Analytical Chemistry Experiment Using Gas Chromatography-Mass Spectrometry, MATLAB, and Chemometrics To Predict Biodiesel Blend Percent Composition
Karisa M. Pierce,* Stephen P. Schale, Trang M. Le, and Joel C. Larson
Department of Chemistry, Seattle Pacific University, Seattle, Washington 98119, United States
ABSTRACT: We present a laboratory experiment for an advanced analytical chemistry course where we first focus on the chemometric technique partial least-squares (PLS) analysis applied to one-dimensional (1D) total-ion-current gas chromatography-mass spectrometry (GC-TIC) separations of biodiesel blends. Then, we focus on n-way PLS (n-PLS) applied to two-dimensional (2D) gas chromatography-mass spectrometry (GC-MS) separations of biodiesel blends. The purpose of the experiment is to determine the percent composition, by volume, of biodiesel in an unknown blend of biodiesel and conventional diesel. A secondary goal is to compare the prediction results of the PLS model to the n-PLS model to see if there is an advantage to analyzing multiple dimensions. The instructor initially creates a PLS model and an n-PLS model using separations of standard biodiesel blends where the percent compositions are known and vary from 0% to 20%. Then, the student collects the GC-TIC and GC-MS chromatograms of an unknown biodiesel blend to regress onto the PLS and n-PLS models and discover the percent composition of the unknown sample.
KEYWORDS: Graduate Education/Research, Upper-Division Undergraduate, Analytical Chemistry, Laboratory Instruction, Computer-Based Learning, Hands-On Learning/Manipulatives, Chemometrics, Chromatography, Mass Spectrometry
Copyright © 2011 American Chemical Society and Division of Chemical Education, Inc.
Chemometrics is any multivariate mathematical technique applied to chemical data.1 Chemometric techniques are used in both industrial and academic research to analyze data from spectroscopy, imaging, microscopy, chromatography, or any multivariate detector for the purpose of converting the experimental data into information about complex samples. Herein, we present a laboratory experiment for an advanced analytical chemistry course where we first focus on the chemometric technique partial least-squares (PLS) analysis applied to one-dimensional (1D) total-ion-current gas chromatography-mass spectrometry (GC-TIC) separations of biodiesel blends. Then, we focus on n-way PLS (n-PLS) applied to two-dimensional (2D) gas chromatography-mass spectrometry (GC-MS) separations of biodiesel blends. The purpose of the experiment is to determine the percent composition by volume of biodiesel in an unknown blend of biodiesel and conventional diesel. A secondary goal is to compare the prediction results of the PLS model to the n-PLS model to see if there is an advantage to analyzing multiple dimensions. The instructor initially creates a PLS model and an n-PLS model using separations of standard biodiesel blends where the percent compositions are known and vary from 0% to 20%. Then, the student collects the GC-TIC and GC-MS chromatograms of an unknown biodiesel blend to regress onto the PLS and n-PLS models and discover the percent composition of the unknown sample.
Some chemometrics laboratory experiments have already been published in this Journal,2-12 and we introduce another experiment that is appropriate for an advanced analytical chemistry course. The students learn about four major themes in analytical chemistry: (i) high-throughput analysis from an industrial chemist’s point of view; (ii) the multidimensional advantage from a theoretical point of view; (iii) calibration and modeling from a pedagogical point of view; and (iv) chemical fingerprinting and pattern recognition from a forensics point of view. A detailed description of each theme follows.
The students learn that one of the major advantages of chemometrics is the ability to treat chromatograms as reproducible chemical fingerprints that are amenable to high-throughput process analysis because, in a nontraditional fashion, chemical extraction and complete resolution between every peak are not required to discover the desired information. Traditionally, researchers chemically extract a few analytes of interest prior to submitting a sample to chromatographic separation.13-17 However, chemical extraction of target analytes requires certain a priori knowledge about the chemical matrix that may not be available for truly unknown samples. It is also traditional for chromatographers to aim for complete resolution of every component of the sample of interest by optimizing a GC method that efficiently uses the entire peak capacity of the separation space. These optimally resolved chromatograms are used to calibrate chemical concentrations by integrating peak areas and comparing standardized retention times, and some articles in this Journal propose laboratories based on this traditional GC-MS approach.18-20 This
Published: March 15, 2011
dx.doi.org/10.1021/ed100917x | J. Chem. Educ. 2011, 88, 806–810
Journal of Chemical Education
array of opportunities to comprehensively glean information from multidimensional data.
From a pedagogical point of view, chemometric modeling can be thought of as an extension of traditional univariate calibration into multivariate calibration. In traditional univariate calibration, the researcher prepares standard samples, obtains instrument signals for each standard, builds a calibration curve of signal versus concentration, and then projects the unknown sample signal onto that calibration curve to discover the unknown sample concentration, as depicted in Figure 1A.26-31 In multivariate calibration, the researcher obtains entire sample profiles of each sample and builds a calibration curve (chemometric model) of the entire sample profile versus the given information about a chemical property of interest. The given information could be concentrations of analytes of interest, percent composition, protein titer, chemical class, or any other quantitative characteristic(s) in which the researcher is interested. Then, the researcher submits the entire unknown sample profile to that model, the entire sample profile is compared to the standard sample profiles, and the desired chemical property is predicted for that unknown sample, as depicted in Figure 1B. In terms of the experiment described herein, the multivariate calibration curve is the PLS model that is built using entire GC-TIC chromatograms and known percent compositions. Then, the entire GC-TIC chromatogram of an unknown sample is projected onto the PLS model, where it is mathematically compared for similarity to the given modeled chromatograms and the unknown’s percent composition is predicted. The PLS model was developed by mathematically loading independent variables in the given chromatograms that have variations in signal intensity that positively or negatively correlate with variations in the given percent compositions.
In other words, PLS highly loads retention times of signals that co-vary with percent composition.32 In Figure 1C, it can be observed that two of the chromatographic peaks have relative signals that positively correlate with the given percent compositions, whereas two other chromatographic peaks have relative signals that do not correlate with the given percent compositions. Thus, unknown chromatograms submitted to the PLS model are algebraically compared for similarity to the given modeled chromatograms using the heavily loaded peaks. Finally, a percent composition for the unknown chromatogram is predicted. To evaluate the robustness of the model using a traditional technique, a cross validation model is built whereby, one at a time, each chromatogram is pulled out of the data set, the model is rebuilt with the remaining chromatograms, and then that left-out chromatogram is submitted to the model to predict that sample’s percent composition. This is repeated for every chromatogram in the data set and yields a leave-one-out cross validation plot of predicted percent composition versus actual percent composition. The PLS model is evaluated based on the slope, R², and relative root-mean-squared error of cross validation (RMSECV) values of this plot. When the model is robust and useful, the covariations between percent composition and chromatographic signal are strong, the slope is close to 1.00, R² is close to 1.0, and the RMSECV approaches 0. A robust model will then yield reliable percent composition predictions for unknown samples. We chose to build chemometric models specifically for biodiesel blends because, from a forensics point of view, the ability to predict biodiesel percent compositions is important to regulatory agencies, fuel compliance officers, and distributors of transport fuel who monitor fuel adulteration, quality, contamination, accuracy of reported percent compositions, and sample authentication. In this
Figure 1. Illustration of PLS: (A) Traditional univariate calibration curve. (B) Multivariate calibration curve illustrating PLS. (C) Signals with variations that positively or negatively correlate with the given sample information are highly loaded in the PLS model.
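The multivariate calibration idea illustrated in Figure 1 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the PLS Toolbox workflow used in the experiment: it fits a single-latent-variable PLS1 model with plain lists, and the function names (`fit_pls1`, `predict`), the four-peak "chromatograms", and every number are hypothetical. Two of the made-up peaks grow with percent composition and two vary without tracking it, mirroring the situation described for Figure 1C.

```python
# Illustrative one-component PLS1 calibration on made-up "chromatograms".
# All numbers and names here are hypothetical; this is not PLS Toolbox code.

def column_means(rows):
    """Mean of each variable (column) across the samples (rows)."""
    n = len(rows)
    return [sum(row[j] for row in rows) / n for j in range(len(rows[0]))]

def fit_pls1(profiles, known):
    """Fit a single-latent-variable PLS1 model to mean-centered data."""
    x_mean = column_means(profiles)
    y_mean = sum(known) / len(known)
    xc = [[v - m for v, m in zip(row, x_mean)] for row in profiles]
    yc = [v - y_mean for v in known]
    # Weights: covariance of each variable with the known property.
    w = [sum(row[j] * y for row, y in zip(xc, yc)) for j in range(len(x_mean))]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Scores: projection of each centered profile onto the weights.
    t = [sum(a * b for a, b in zip(row, w)) for row in xc]
    # Inner regression of the known property on the scores.
    q = sum(a * b for a, b in zip(t, yc)) / sum(a * a for a in t)
    return {"x_mean": x_mean, "y_mean": y_mean, "weights": w, "q": q}

def predict(model, profile):
    """Project one full profile onto the model and return the prediction."""
    score = sum((v - m) * w for v, m, w in
                zip(profile, model["x_mean"], model["weights"]))
    return model["y_mean"] + model["q"] * score

# Five standards, four "peaks": peaks 0 and 2 grow with percent composition,
# peaks 1 and 3 vary without tracking it.
known_percent = [0.0, 5.0, 10.0, 15.0, 20.0]
wobble = [0.1, -0.2, 0.2, -0.2, 0.1]
standards = [[1.0 + 0.10 * p, 3.0 + n, 0.5 + 0.05 * p, 2.0 - n]
             for p, n in zip(known_percent, wobble)]

model = fit_pls1(standards, known_percent)
unknown = [1.8, 3.1, 0.9, 1.9]  # profile of a hypothetical 8% blend
prediction = predict(model, unknown)
print([round(w, 3) for w in model["weights"]], round(prediction, 2))
```

In this toy data set the weights for the two uncorrelated peaks come out essentially zero, so the prediction rests on the peaks that co-vary with percent composition. Real chromatograms would normally need more than one latent variable, which the commercial software handles.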
traditional approach requires resolution between the peaks, but improving resolution requires more separation time, and as sample complexity increases, instrument time also increases. Because computation time is generally less expensive than instrument time, the theme in this laboratory experiment is the same theme as in high-throughput industrial research: resolution of all components in a complex sample can be sacrificed for a speedier separation if the desired chemical information can still be obtained. In fact, there are articles in this Journal that use manual fingerprint analysis of GC-MS data where students manually compare TIC chromatograms of similar complex samples,21-25 but we describe automated fingerprinting analysis that uses commercially available PLS software.
The students learn about the theory of the multidimensional advantage. This theory describes how multidimensional instrumentation increases the information content per unit of instrument time, so it is expected that 2D GC-MS chromatograms should yield a better model than 1D GC-TIC chromatograms simply because the 2D data contain more chemical information than the 1D data. Traditionally, the GC-TIC or a small number of manually selected ion chromatograms are used, while a majority of ion chromatograms are ignored, thus throwing away possibly valuable information for the sake of easier manual data analysis. The students find that native instrument software often does not provide the tools for comprehensive multidimensional analysis, so exporting the data out of the native instrument software and importing it into a platform such as MATLAB, which is capable of multidimensional analysis, provides a wider
Figure 3. Leave-one-out cross validation results are shown for (A) the GC-TIC training set applied to PLS and (B) the GC-MS training set applied to n-PLS.
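The leave-one-out procedure behind plots such as Figure 3 can be sketched as follows. This is a hedged illustration rather than the authors' code: to stay short it uses an ordinary straight-line fit on a single summed signal as a stand-in for the PLS model, and the helper names (`fit_line`, `leave_one_out`, `rmsecv`) and all data values are hypothetical. The loop structure (remove one sample, rebuild the model, predict the removed sample) is the part that carries over to any model.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares straight line y = slope * x + intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def leave_one_out(signals, known):
    """Predict each sample from a model rebuilt without that sample."""
    predictions = []
    for i in range(len(signals)):
        train_x = signals[:i] + signals[i + 1:]
        train_y = known[:i] + known[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        predictions.append(slope * signals[i] + intercept)
    return predictions

def rmsecv(predicted, known):
    """Root-mean-squared error of the cross validation predictions."""
    squared = [(p - k) ** 2 for p, k in zip(predicted, known)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical summed signals for seven standards (not the authors' data).
known_percent = [0.0, 2.0, 5.0, 8.0, 10.0, 15.0, 20.0]
signals = [0.11, 0.48, 1.02, 1.65, 1.97, 3.05, 3.96]

cv_predictions = leave_one_out(signals, known_percent)
error = rmsecv(cv_predictions, known_percent)
# RMSECV expressed as a percentage of the mean known percent composition.
relative_error = 100.0 * error / (sum(known_percent) / len(known_percent))
print(round(error, 3), round(relative_error, 1))
```

Plotting `cv_predictions` against `known_percent` would give the kind of predicted-versus-actual plot whose slope, intercept, and R² are reported in the text.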
The chromatograms were exported from the instrument software and then imported into MATLAB and concatenated into a single 3-way variable with dimensions equal to the number of samples × number of chromatographic retention time data points × number of mass-to-charge ratios, all as described in the Notes for Instructor in the Supporting Information. Then, using the commands in Figure 1 in the Notes for Instructor, the chromatograms in this single 3-way variable were baseline corrected and normalized to yield the preprocessed 1D TIC chromatograms and 2D GC-MS chromatograms. A representative TIC chromatogram of a 5% biodiesel blend is shown in Figure 2A. The large peaks eluting near 60 min are fatty acid methyl esters that are characteristic of biodiesels. A representative GC-MS chromatogram of a 5% biodiesel blend is shown in Figure 2B. Parent ions can be seen at high molecular weights. One hypothesis for this experiment is that, due to the multidimensional advantage, an n-PLS model built using 2D GC-MS chromatograms should be more accurate than a PLS model built using 1D TIC chromatograms because the 2D chromatograms contain more valuable chemical information that is lost when the original 2D data are compressed into the TIC form. The TICs from the 13 biodiesel blends in the 1D training set were modeled by PLS using the PLS Toolbox graphical user interface as described in the Notes for Instructor. The PLS algorithm and n-way PLS (n-PLS) algorithm were from the commercial software PLS Toolbox for MATLAB. Both used the same SIMPLS algorithm.34,35 A leave-one-out cross validation was used to evaluate the robustness of the model. The PLS cross validation plot is shown in Figure 3A and the best-fit line had slope = 0.94, y-intercept = 0.81, and R² = 0.99. The RMSECV was 0.79, which is 7.4% of the mean of the known percent composition data, suggesting that a student using this PLS model could expect an average prediction error of at least 7.4%.
The slope, y-intercept, and RMSECV reveal that biodiesel samples from different sources are difficult to model, just as a realistic forensic sample would be difficult to model when its source is unknown. Next, the 13 chromatograms in the 2D training set were modeled by n-PLS according to the Notes for Instructor. The n-PLS cross validation plot is shown in Figure 3B and the best-fit line had slope = 0.96, y-intercept = 0.46, and R² = 0.99. The RMSECV was 0.61, which was 5.8% of the mean of the known percent composition data. Assuming that a 0.18 decrease in RMSECV and a 1.7% decrease in percent error are statistically significant, these values indicate that the 2D chromatograms did contain more valuable chemical information than the 1D TICs, causing the n-PLS model to perform slightly better than the PLS model during leave-one-out cross validation. Once the PLS and n-PLS models are built, the instructor can give the students unknown biodiesel blends to run through the GC-MS,
Figure 2. Representative chromatograms of a 5% biodiesel blend are shown as (A) a 1D GC-TIC and (B) a 2D GC-MS surface plot.
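The relationship between the 2D GC-MS data and the 1D TIC in Figure 2 can be illustrated with a short Python sketch. The actual MATLAB preprocessing commands are in the Notes for Instructor and are not reproduced here, so everything below is an assumption made for illustration: the TIC is taken as the sum over all mass-to-charge channels at each retention time, baseline correction is a simple minimum subtraction, and normalization scales each signal to unit sum. The function names and the tiny data array are hypothetical.

```python
def total_ion_current(scans):
    """Collapse one 2D chromatogram (retention time x m/z) into a 1D TIC."""
    return [sum(intensities) for intensities in scans]

def baseline_correct(signal):
    """Subtract the minimum so the lowest point sits at zero (assumed method)."""
    floor = min(signal)
    return [v - floor for v in signal]

def normalize(signal):
    """Scale the signal so that it sums to 1 (assumed method)."""
    total = sum(signal)
    return [v / total for v in signal]

# Hypothetical 3-way data: 2 samples x 4 retention times x 3 m/z channels.
data = [
    [[1.0, 2.0, 1.0], [2.0, 5.0, 3.0], [3.0, 4.0, 1.0], [1.0, 2.0, 1.0]],
    [[1.0, 1.0, 2.0], [4.0, 6.0, 4.0], [2.0, 3.0, 1.0], [2.0, 1.0, 1.0]],
]

tics = [total_ion_current(sample) for sample in data]
preprocessed = [normalize(baseline_correct(tic)) for tic in tics]
print(tics)
print(preprocessed)
```

Summing over the mass-to-charge dimension discards which ions produced the signal at each retention time, which is the information loss that the multidimensional-advantage hypothesis refers to.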
laboratory experiment, we use biodiesels and conventional diesels from a variety of sources, which is a problematic source of variation and error in forensic analysis of truly unknown samples. The students see that the GC-MS provides adequate peak capacity and chemical selectivity that can be harnessed using chemometrics to build a robust model to overcome source variability in forensic applications. The specialized equipment required for this experiment includes a quadrupole GC-MS, MATLAB by MathWorks (Natick, MA), and the PLS Toolbox for MATLAB by Eigenvector Research, Inc. (Manson, WA). The n-PLS algorithm for MATLAB was from the N-way Toolbox by Rasmus Bro (University of Copenhagen), who kindly makes this freely available online.33
EXPERIMENTAL PROCEDURE
The biodiesel blends and conventional diesels were purchased at the pump from different retailers. These were mixed in various combinations to yield a range of percent compositions between 0% and 20%. Percent composition is defined as the percent, by volume, of biodiesel in the biodiesel/conventional diesel blend. The source and composition of each sample are listed in Table 1 in the Notes for Instructor in the Supporting Information. This discussion is divided into two parts: (i) the instructor builds and evaluates the PLS and n-PLS models and then (ii) the students use the models to evaluate unknown samples. Before class, the instructor can build the PLS and n-PLS models for the students using a training set of known biodiesel samples. The instrumental method for obtaining the chromatograms is described in the Notes for Instructor in the Supporting Information. A polar Innowax column was used to obtain the chromatograms that are discussed in the remainder of this document. However, the same samples were also separated using a nonpolar HP-5 column that yielded chromatograms containing the traditional alkane “backbone” pattern that is typically expected for separations of diesels. A figure in the Notes for Instructor shows the chromatograms of a biodiesel blend separated by both the Innowax column and the HP-5 column. The Notes for Instructor describe how both the polar and nonpolar columns provide data sets that work for this experiment.
to chromatography. The main lesson conveyed by this experiment is that, even though the chromatograms had what traditional chromatographers consider “poor resolution”, we could still predict a quantitative property of an unknown sample using chemometric modeling. The students were able to apply the high-throughput fingerprint analysis techniques, see the multidimensional advantage, understand multivariate calibration as an extension of traditional univariate calibration, and learn about a forensic application.
Figure 4. Prediction results are shown for (A) nine “unknown” GC-TIC chromatograms submitted to the PLS model and (B) nine “unknown” GC-MS chromatograms submitted to the n-PLS model.
import into MATLAB, and submit to the models. Thus, the next part of this discussion describes how the student predicts an unknown biodiesel percent composition and compares the results of the 1D and 2D models. The students should obtain chromatograms for unknown biodiesel blends, import the chromatograms into MATLAB, baseline correct and normalize them, and submit them to the PLS and n-PLS models for prediction by following the Instructions for Student in the Supporting Information. These unknown chromatograms can be considered an independent test set. Herein, we use an independent test set to demonstrate the results from students who analyzed 9 unknowns. The PLS predictions for the 9 test set GC-TICs were plotted versus the actual percent compositions and this plot had a best-fit line with slope = 0.86, y-intercept = 1.92, and R² = 0.96, and the average percent error was 9.8%, as shown in Figure 4A. The n-PLS model yielded even better prediction results with slope = 0.89, y-intercept = 1.26, and R² = 0.97, and the average percent error was reduced to 6.2%, as shown in Figure 4B. Therefore, the students can expect to get a reasonably accurate prediction of the percent composition for their unknown samples using the method described herein. The Notes for Instructor contain a discussion of an outlier that was detected in the set of unknown samples and there is also a discussion of the importance of analyzing the loadings in PLS and n-PLS models. In addition, it is worth noting that if the retail biodiesel source is not 100% biodiesel or if the percent composition provided by the retailers is inaccurate, then any blends the instructor makes will also be inaccurate, so the Notes for Instructor discuss how to handle this uncertainty.
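The figures of merit quoted for the test set (slope, y-intercept, R², and average percent error of predicted versus actual percent composition) can be computed as in the following Python sketch. It is an illustration under assumptions: the text does not state exactly how the average percent error was calculated, so the version here takes the mean of |predicted − actual| / actual × 100 and skips any sample whose actual value is 0% to avoid division by zero. The function names and the five data points are hypothetical and are not the students' results.

```python
def line_fit_stats(actual, predicted):
    """Slope, intercept, and R^2 of the best-fit line of predicted vs actual."""
    n = len(actual)
    a_bar, p_bar = sum(actual) / n, sum(predicted) / n
    saa = sum((a - a_bar) ** 2 for a in actual)
    spp = sum((p - p_bar) ** 2 for p in predicted)
    sap = sum((a - a_bar) * (p - p_bar) for a, p in zip(actual, predicted))
    slope = sap / saa
    intercept = p_bar - slope * a_bar
    r_squared = sap ** 2 / (saa * spp)
    return slope, intercept, r_squared

def average_percent_error(actual, predicted):
    """Mean absolute percent error, skipping samples with an actual value of 0."""
    errors = [abs(p - a) / a * 100.0
              for a, p in zip(actual, predicted) if a != 0]
    return sum(errors) / len(errors)

# Hypothetical test set (percent biodiesel by volume), not the paper's data.
actual_percent = [2.0, 5.0, 10.0, 15.0, 20.0]
predicted_percent = [2.2, 4.5, 10.5, 14.0, 19.0]

slope, intercept, r_squared = line_fit_stats(actual_percent, predicted_percent)
avg_error = average_percent_error(actual_percent, predicted_percent)
print(round(slope, 2), round(intercept, 2), round(r_squared, 3),
      round(avg_error, 1))
```

Running the same two functions on the PLS and n-PLS predictions for one set of unknowns gives a like-for-like comparison of the 1D and 2D models.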
ASSOCIATED CONTENT
Supporting Information
Notes for the instructor and notes for the students. This material is available via the Internet at http://pubs.acs.org.
AUTHOR INFORMATION
Corresponding Author
*E-mail: [email protected].
REFERENCES
(1) Beebe, K. R.; Pell, R. J.; Seasholtz, M. B. Chemometrics, A Practical Guide; John Wiley and Sons, Inc.: New York, 1998.
(2) Msimanga, H. Z.; Elkins, P.; Tata, S. K.; Smith, D. R. J. Chem. Educ. 2005, 82, 415–424.
(3) Cazar, R. A. J. Chem. Educ. 2003, 80, 1026.
(4) Gilbert, M. K.; Luttrell, R. D.; Stout, D.; Vogt, F. J. Chem. Educ. 2008, 85, 135.
(5) Wanke, R.; Stauffer, J. J. Chem. Educ. 2007, 84, 1171–1173.
(6) Rodríguez, C.; Amigo, J. M.; Coello, J.; Maspoch, S. J. Chem. Educ. 2007, 84, 1190.
(7) Grung, B.; Nodland, E.; Førland, G. M. J. Chem. Educ. 2007, 84, 1193.
(8) Ribone, M. E.; Pagani, A. P.; Olivieri, A. C.; Goicoechea, H. C. J. Chem. Educ. 2000, 77, 1330.
(9) Lang, P. L.; Miller, B. I.; Nowak, A. T. J. Chem. Educ. 2006, 83, 280–282.
(10) Chau, F. T.; Chung, W. H. J. Chem. Educ. 1995, 72, A84.
(11) Howery, D. G.; Hirsch, R. F. J. Chem. Educ. 1983, 60, 656.
(12) Delaney, M. F.; Warren, V. J. Chem. Educ. 1981, 58, 646.
(13) Mayotte, D.; Donahue, C. J.; Snyder, C. A. J. Chem. Educ. 2006, 83, 902.
(14) Heimbuck, C. A.; Bower, N. W. J. Chem. Educ. 2002, 79, 1254.
(15) Hardee, J. R.; Long, J.; Otts, J. J. Chem. Educ. 2002, 79, 633.
(16) Wilson, R. I.; Mathers, D. T.; Mabury, S. A.; Jorgensen, G. M. J. Chem. Educ. 2000, 77, 1619.
(17) Fleurat-Lessard, P.; Pointet, K.; Renou-Gonnord, M. J. Chem. Educ. 1999, 76, 962.
(18) Hodgson, S. C.; Casey, R. J.; Orbell, J. D.; Bigger, S. W. J. Chem. Educ. 2000, 77, 1631.
(19) Quach, D. T.; Ciszkowski, N. A.; Finlayson-Pitts, B. J. J. Chem. Educ. 1998, 75, 1595.
(20) Bishop, R. D. J. Chem. Educ. 1995, 72, 743.
(21) Mowery, K. A.; Blanchard, D. E.; Smith, S.; Betts, T. A. J. Chem. Educ. 2004, 81, 87.
(22) Sodeman, D. A.; Lillard, S. J. J. Chem. Educ. 2001, 78, 1228.
(23) Schultz, E.; Pugh, M. E. J. Chem. Educ. 2001, 78, 944.
(24) Galipo, R. C.; Canhoto, A. J.; Walla, M. D.; Morgan, S. L. J. Chem. Educ. 1999, 76, 245.
(25) Henck, C.; Nally, L. J. Chem. Educ. 2007, 84, 1813.
(26) Dwyer, T. J.; Fillo, J. D. J. Chem. Educ. 2006, 83, 273.
(27) Brush, R. C.; Rice, G. W. J. Chem. Educ. 1994, 71, A293.
(28) McGowin, A. E.; Hess, G. G. J. Chem. Educ. 1999, 76, 23.
(29) Witter, A. E. J. Chem. Educ. 2005, 82, 1538.
HAZARDS
Biodiesel and conventional diesel can cause irritation to the eyes and skin. If swallowed, they are harmful and possibly carcinogenic. Biodiesel and conventional diesel may contain H2S, a very flammable and toxic gas. Biodiesel and conventional diesel are flammable. The use of a hood and gloves is recommended when preparing samples. Dispose of waste in accordance with EPA policies.
CONCLUSION
The improvement in n-PLS predictions over PLS predictions means that it is a valuable skill to be able to export data out of native instrument software and import it into a platform such as MATLAB that allows the user to do multidimensional analysis. Detailed instructions for analyzing the 1D data and the 2D data at the MATLAB command line are included in the Supporting Information. These details could be a useful reference for students beginning to use MATLAB and chemometrics applied
(30) Asleson, G. L.; Doig, M. T.; Heldrich, F. J. J. Chem. Educ. 1993, 70, A290.
(31) Hill, D. W.; Mcsharry, B. T.; Trzupek, L. S. J. Chem. Educ. 1988, 65, 907.
(32) Brereton, R. G. Chemometrics: Data Analysis for the Laboratory and Chemical Plant; Wiley: New York, 2003.
(33) Quality and Technology. University of Copenhagen. http://www.models.life.ku.dk/ (accessed Mar 2011).
(34) de Jong, S. Chemometr. Intell. Lab. 1993, 18, 251.
(35) Wise, B. M.; Gallagher, N. B.; Bro, R.; Shaver, J. M.; Windig, W.; Koch, S. R. PLS_Toolbox 3.5 for use with MATLAB, 2004.