Introducing Chemometrics to Graduate Students - ACS Publications

In the Classroom

Introducing Chemometrics to Graduate Students

Tomas Öberg
Department of Biology and Environmental Science, University of Kalmar, SE-391 82 Kalmar, Sweden; [email protected]

Chemometrics was originally defined as “the art of extracting chemically relevant information from data produced in chemical experiments” (1), but it is often also defined by the methods employed. Multivariate calibration, pattern recognition, and design of experiments, applied to chemical problems, are among the topics most often connected to chemometrics. These are mainly statistical methods, but many instructors believe that they are best taught to chemistry students by chemists with experience in solving real problems. Several articles in this Journal have outlined suitable experiments and exercises to teach multivariate calibration (2–6), pattern recognition (7–9), design of experiments (10–12), and simplex optimization (13–18). Statistical design of chemical experiments was also discussed before the chemometrics era (19, 20).

Chemometric methods enjoy an ever-increasing popularity, especially in analytical and organic chemistry. A recent review, the fifteenth of a series in Analytical Chemistry, notes that more than 15,000 citations are found when the terms “pattern recognition” and “multivariate calibration” are used in a Chemical Abstracts search (21). The impact has been large in the chemical and pharmaceutical industries, and many areas of chemometrics have been assimilated by other disciplines (21, 22). Chemometrics now assists computer-aided molecular design and the development of structure–activity relationships (SARs) within the related field of chemoinformatics.

The scope of chemometrics is thus wide, applications are found in many fields, and the toolbox of useful methods is diverse. This poses a major challenge in introducing chemometrics to graduate students. An introductory course should ideally give an orientation and “taste” of different methods and applications as well as an understanding of how chemometrics may assist and provide answers to the students’ own research questions.
Chemometrics curricula have been discussed before, both in this Journal (23–25) and in other publications (26–31). Despite this interest in developing education in chemometrics, the lack of undergraduate and graduate courses has been seen as a real obstacle to the scientific development of this field (32). We have also been challenged by people working in industry to devote more effort to teaching design of experiments and optimization as part of ordinary laboratory coursework (33), and this challenge is not a new one (34, 35). The lack of suitable texts was a problem in the early days of chemometrics, but as Delaney and Warren, Jr., anticipated, many good textbooks are now available (23). However, a shortage of qualified faculty may still be an obstacle in developing and teaching chemometrics courses. Some relief to that problem is now also in sight. Two of the major developers of chemometrics software, CAMO and Umetrics, have released complete training packages, with easy-to-read introductory texts and exercises to accompany their products. Both these packages have received favorable reviews in the Journal of Chemometrics (36–38). Here, the aim is to describe an introductory course in chemometrics for graduate students where one of these packages (textbook and software) was used as the primary instruction material.

Journal of Chemical Education, Vol. 83 No. 8 August 2006, www.JCE.DivCHED.org

Course Objective and Content

The course objective was to provide an introduction to statistical design of experiments and multivariate data analysis. The students should develop an understanding of theory and strategies in adopting chemometric methods for experimentation and data analysis. The course should also provide practical hands-on experience in using research software within the field. Finally, the students should gain a critical understanding of the scope and limitations of empirical model building.

The two course segments covered the topics and subtopics outlined in Table 1. Participation in all course components was obligatory, and the satisfactory completion of prescribed exercises was required. Students also had to present the results from a project assignment, both as a brief written report and orally. A grade of “pass” or “fail” was awarded upon completion of the course.

Table 1. Topics and Subtopics in the Chemometrics Course

Topics                                       Subtopics
Introduction                                 Basic statistics; Visualization of data
Screening designs                            Factorials and fractional factorials; Plackett–Burman designs; D-optimal designs
Optimization and empirical model building    Steepest ascent; Response surface designs; Simplex optimization
Pattern recognition                          Principal components analysis (PCA); Classification, SIMCA-methodology
Multivariate calibration                     Multiple linear regression (MLR); Principal component regression (PCR); Partial least-squares regression (PLSR)

Materials and Methods

The main course reading was the 5th edition of “Multivariate Data Analysis—In Practice” by Esbensen and co-workers (39), which is part of the training package from CAMO. In 21 chapters and 598 pages this book covers principal component analysis, classification, multivariate calibration using principal component regression and partial least-squares regression, and design of experiments. Each method is introduced and fully explained with real-world examples and accompanied by a multitude of computer exercises that can be executed with the included training version of the Unscrambler software, version 7.51 (CAMO Inc.). A typical exercise first states the purpose, then introduces the data set, and then gives step-by-step guidance on how to proceed with the software; between the steps, a number of questions are presented to stimulate the student to thoroughly learn and understand the methodology.

Additional papers on analysis of variance (40), design of experiments (41), desirability functions (42), simplex optimization (43), and partial least-squares regression (44) were handed out to the course participants as required readings. These additional readings were needed for topics not covered fully in Esbensen’s book. Supplementary readings were the two classical textbooks: Statistics for Experimenters by Box, Hunter, and Hunter (45) and Multivariate Calibration by Martens and Næs (46).

The training version of the Unscrambler software is similar to the full-version software except that new data sets cannot be imported or saved. The methods covered by this specialized chemometrics software are scientific visualization, principal component analysis, classification, multivariate calibration, and design of experiments. There are 25 data sets included in the training package, and most of them are from real-world chemical applications. The training package (book and software) is available in classroom sets of ten and is reasonably priced for academic users. The full version of the software, also available at a discounted price for academic users, was used to carry out the project assignments.

The course consisted of a series of formal lectures, seminars, and practical computer exercises followed by a project assignment tailored to each participant’s own research.
Twenty-one hours of lectures and instructor-assisted computer exercises were spread over six half-days during the first two weeks of the course, and the participants were expected to work an additional forty hours on their own. Sixty hours were then allotted to the project assignments. The course earned three credits in the Swedish system, equal to approximately two in the U.S. system or four in the European Credit Transfer and Accumulation System (ECTS).

Results and Discussion

The five topics are briefly described, followed by a discussion of the project assignments.

Basic Statistics and Visualization

The purpose of the introduction was to bring all students to a common level and introduce basic statistical concepts, such as measures of central tendency and dispersion, frequency distributions, correlation, and regression. The importance of a variance-stabilizing transformation was illustrated with an example from GC–MS analysis of organic micro-pollutants (47). The second part covered methods and tools for visualization of univariate, bivariate, and multivariate data. The computer exercises focused on plotting raw data, studying correlation using 2D plots and regression, frequency histograms, and variable statistics. These exercises also served as a general introduction to the user interface of the Unscrambler software.
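The effect of a variance-stabilizing transformation can be sketched with a few lines of Python. The example below is only an illustration under stated assumptions: the replicate values are invented (they are not the GC–MS data of ref 47), the names `replicates` and `spread_by_level` are hypothetical, and a logarithm is chosen because the invented data have a constant relative spread.

```python
import math
import statistics

# Invented replicate measurements at three concentration levels.
# Every level has the same relative spread (-10 %, 0 %, +10 %),
# so the absolute spread grows with the level.
levels = [1.0, 10.0, 100.0]
relative_errors = [0.9, 1.0, 1.1]
replicates = {level: [level * r for r in relative_errors] for level in levels}


def spread_by_level(data, transform=lambda v: v):
    """Sample standard deviation of each level after applying a transform."""
    return {level: statistics.stdev([transform(v) for v in values])
            for level, values in data.items()}


raw_sd = spread_by_level(replicates)               # grows with the level
log_sd = spread_by_level(replicates, math.log10)   # roughly constant

for level in levels:
    print(f"level {level:6.1f}: sd(raw) = {raw_sd[level]:8.4f}, "
          f"sd(log10) = {log_sd[level]:.4f}")
```

On the raw scale the standard deviation increases about a hundredfold from the lowest to the highest level, whereas after the log transformation it is the same at all three levels, which is the condition ordinary least-squares regression assumes.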



Screening Designs

Screening designs were presented by first discussing the perils of the “one-factor-at-a-time” approach and the importance of factor interactions. The setup and analysis of a full-factorial experiment to study a chemical reaction was first described applying only simple arithmetic and then expanded with analysis of variance (ANOVA) and multiple linear regression (MLR). This gave an opportunity to discuss other basic concepts such as residuals, collinearity, lack-of-fit, and the need for replicates. Subsequently, fractional factorials were introduced together with Plackett–Burman and D-optimal designs. Blocking and randomization were also included in this part. The computer exercise was to study the Willgerodt–Kindler reaction using data from Carlson and co-workers (48).
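The “simple arithmetic” analysis of a two-level full factorial can be sketched as follows. This is a minimal illustration, not the course exercise: the factor names and the yields are invented, and the responses are generated from a known model so that the estimated effects can be checked. With factors coded as -1 and +1, each effect is the mean response at the high level minus the mean response at the low level, which equals twice the corresponding model coefficient.

```python
from itertools import product

# Coded levels (-1 = low, +1 = high) for three hypothetical factors.
factors = ["temperature", "time", "catalyst"]
design = list(product([-1, 1], repeat=len(factors)))  # 2^3 = 8 runs


def simulated_yield(run):
    """Invented response: 60 + 5*T + 2*t + 0*c + 3*T*t (coded units)."""
    T, t, c = run
    return 60 + 5 * T + 2 * t + 0 * c + 3 * T * t


responses = [simulated_yield(run) for run in design]


def effect(contrast):
    """Mean response where the contrast is +1 minus mean where it is -1."""
    high = [y for sign, y in zip(contrast, responses) if sign == 1]
    low = [y for sign, y in zip(contrast, responses) if sign == -1]
    return sum(high) / len(high) - sum(low) / len(low)


# Main effects use the factor columns directly.
main_effects = {name: effect([run[i] for run in design])
                for i, name in enumerate(factors)}

# An interaction column is the elementwise product of two factor columns.
interaction_T_t = effect([run[0] * run[1] for run in design])

print(main_effects)
print("temperature x time:", interaction_T_t)
```

Because the temperature-by-time interaction is nonzero here, varying one factor at a time from a fixed starting point would give a misleading picture of the temperature effect, which is the argument made against the one-factor-at-a-time approach.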

Optimization and Empirical Model Building

The evaluation of multiobjective optimization criteria by desirability functions was described as a prelude to the different optimization methods. The steepest-ascent method was given a brief introduction before proceeding with response surface methodology (RSM). The basic factorial designs were extended to three-level factorials, central composite, and Box–Behnken designs. The previous coverage of regression was generalized to empirical model building using polynomials and their visualization with response surfaces or contour plots. Finally, the sequential simplex method was illustrated with an example from the optimization of an HPLC separation of a pharmaceutical product. The computer exercise was to optimize an enamine synthesis, based on an example from Carlson’s textbook Design and Optimization in Organic Synthesis (49).
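How desirability functions combine several responses into one criterion can be sketched in Python. The sketch assumes the commonly used one-sided, piecewise form in which each response is mapped to a value between 0 and 1 and the overall desirability is the geometric mean; the course readings may have used a different variant. The function names, limits, and response values are all hypothetical.

```python
import math


def desirability_larger_is_better(y, low, target, weight=1.0):
    """0 at or below `low`, 1 at or above `target`, power curve in between."""
    if y <= low:
        return 0.0
    if y >= target:
        return 1.0
    return ((y - low) / (target - low)) ** weight


def desirability_smaller_is_better(y, target, high, weight=1.0):
    """1 at or below `target`, 0 at or above `high`, power curve in between."""
    if y <= target:
        return 1.0
    if y >= high:
        return 0.0
    return ((high - y) / (high - target)) ** weight


def overall_desirability(values):
    """Geometric mean; a single unacceptable response (0) gives 0 overall."""
    if any(d == 0.0 for d in values):
        return 0.0
    return math.exp(sum(math.log(d) for d in values) / len(values))


# Invented example: maximize yield (%), minimize an impurity (%).
d_yield = desirability_larger_is_better(80.0, low=50.0, target=90.0)
d_impurity = desirability_smaller_is_better(2.0, target=1.0, high=5.0)
D = overall_desirability([d_yield, d_impurity])
print(d_yield, d_impurity, D)
```

The geometric mean is used rather than the arithmetic mean so that a setting with one completely unacceptable response cannot be rescued by good values of the others.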

Pattern Recognition

A general discussion of projections was first introduced in this part. The focus was then set on principal component analysis (PCA) and its convenience for analyzing and visualizing data tables. The linear algebra was briefly described, but the interpretation of scores and loadings was mainly shown visually as the mapping of a multidimensional space down to two-dimensional plots. The lectures continued with coverage of scaling, weights, outliers, and different validation methods to select the optimal model complexity. Comparison of class patterns using PCA was introduced through Wold’s method for soft independent modeling of class analogy (SIMCA) (50). One of the PCA exercises was to evaluate a geological data set from the Troodos area of Cyprus (51). Another exercise utilized the classical Iris data set of Fisher, both for PCA modeling and SIMCA classification (52).
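The relation between a data table, its scores, and its loadings can be sketched with a small pure-Python PCA. The sketch uses the NIPALS iteration, a standard way of extracting principal components one at a time; whether the Unscrambler software computes PCA this way is not stated here, so treat the algorithm choice as an assumption. The data table and the helper names `autoscale` and `nipals_pca` are invented for illustration.

```python
import math


def autoscale(rows):
    """Center each column to mean 0 and scale it to unit sample std. dev."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def nipals_pca(X, n_components, tol=1e-18, max_iter=1000):
    """Return (scores, loadings), one list per component, via NIPALS."""
    E = [row[:] for row in X]  # residual matrix, deflated per component
    n_vars = len(E[0])
    scores, loadings = [], []
    for _ in range(n_components):
        # Start from the residual column with the largest sum of squares.
        j = max(range(n_vars), key=lambda k: sum(r[k] ** 2 for r in E))
        t = [r[j] for r in E]
        for _ in range(max_iter):
            tt = sum(v * v for v in t)
            p = [sum(r[k] * ti for r, ti in zip(E, t)) / tt
                 for k in range(n_vars)]
            norm = math.sqrt(sum(v * v for v in p))
            p = [v / norm for v in p]                     # unit-length loading
            t_new = [sum(a * b for a, b in zip(r, p)) for r in E]
            change = sum((a - b) ** 2 for a, b in zip(t_new, t))
            t = t_new
            if change < tol * sum(v * v for v in t):
                break
        # Remove the fitted component before extracting the next one.
        E = [[e - ti * pk for e, pk in zip(r, p)] for r, ti in zip(E, t)]
        scores.append(t)
        loadings.append(p)
    return scores, loadings


# Invented table: 5 samples x 3 variables; the first two are correlated.
samples = [[2.0, 4.1, 1.0], [3.0, 6.2, 0.8], [4.0, 7.9, 1.3],
           [5.0, 10.1, 0.9], [6.0, 12.0, 1.1]]
X = autoscale(samples)
scores, loadings = nipals_pca(X, n_components=2)

total = sum(v * v for row in X for v in row)
explained = [sum(v * v for v in t) / total for t in scores]
print("explained variance fractions:", explained)
```

Plotting `scores[0]` against `scores[1]` gives the familiar score plot (one point per sample), and plotting the loadings in the same way shows which variables drive each component.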

Multivariate Calibration

MLR and least-squares fitting had already been discussed in the design of experiments part. PCR was then easy to explain as a simple extension, where a transformation to principal components is used to avoid collinearity problems with nondesigned data. Partial least-squares regression (PLSR) was taught as an intermediate technique to obtain the maximum covariance between the response variable(s) and linear combinations of the original x variables. The matrix equations were presented briefly, but the focus was primarily on gaining conceptual understanding through visual illustration of the various projections. Similarly, visual inspection of calibration results was selected as the primary mode of interpretation. The previous discussion of validation was extended further, as was the use of classification to define the domain of model applicability. The interpretability of latent variables and regression equations was discussed, and the similarity to empirical linear free-energy relationships was highlighted (53). Computer exercises performed with PCR and PLSR included data sets from food research, NIR spectroscopy, water quality, and the paper industry.
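The covariance criterion behind PLSR can be made concrete for a single response. In the sketch below, which is an illustration and not the software's implementation, the first weight vector w is taken proportional to X'y; among all unit-length vectors this choice maximizes the covariance between the score t = Xw and y. The data are invented, with two perfectly collinear x variables, a situation in which ordinary MLR has no unique solution. The function names are hypothetical.

```python
import math


def mean_center(values):
    """Subtract the mean from a list of numbers."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def pls1_first_component(X, y):
    """First PLS1 component for column-centered X (rows = samples) and y."""
    n_vars = len(X[0])
    # Weight vector proportional to X'y, scaled to unit length.
    w = [sum(row[k] * yi for row, yi in zip(X, y)) for k in range(n_vars)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores: projection of each sample onto w.
    t = [sum(a * b for a, b in zip(row, w)) for row in X]
    # y-loading: least-squares regression of y on the scores.
    q = sum(ti * yi for ti, yi in zip(t, y)) / sum(ti * ti for ti in t)
    return w, t, q


# Invented data: x2 is exactly 2 * x1, and y depends linearly on x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0 * v for v in x1]
y_raw = [3.0 * v + 7.0 for v in x1]

columns = [mean_center(x1), mean_center(x2)]
X = [list(row) for row in zip(*columns)]
y = mean_center(y_raw)

w, t, q = pls1_first_component(X, y)
fitted = [q * ti for ti in t]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
print("weights:", w)
print("largest absolute residual:", max(abs(r) for r in residuals))
```

One latent variable is enough to reproduce y here because both x variables carry the same information; PCR would reach a similar result by regressing y on the first principal component of X, which is chosen without reference to y.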

Project Assignments

Learning is always best motivated when applying new methods to one’s own research problems. Each participant was therefore requested to bring data sets or experimental design problems from their own research for the project assignments, but for some newly enrolled graduate students the data were supplied either by their supervisor or by the course organizer. The project assignments showed a substantial diversity, with contributions from organic, analytical, and environmental chemistry and from geochemistry. Four examples of these independent projects are:

• Prediction of bupivacaine binding to a molecularly imprinted polymer (design of experiments and multivariate calibration)
• A structure–activity relationship for the thermal decomposition of chlorinated and non-chlorinated organic compounds (multivariate calibration)
• Source apportionment of copper and zinc in samples from a road traffic environment (pattern recognition and multivariate calibration)
• Correlation pattern of metals and pH in leachates from alum shale (pattern recognition)

The projects were carried out independently, but with some assistance by email from the course organizer (a few hours in total). This provided additional opportunity to tackle problems such as the handling of nondetects, simplification of polynomial models, variable selection, and choice of transformations. The project results were presented in a short paper and orally at a seminar, with the other students acting as opponents.

Conclusions

This introductory course was based upon a chemometrics training package (literature and software) that has a proven track record. The training package has a well-thought-out strategy for introducing chemometrics, and it was therefore possible to launch this graduate course with less preparation time than would otherwise be required. Similar training packages on multivariate data analysis and design of experiments are also available from other vendors, for example, Umetrics (54, 55). This could enable the entry of chemometrics into the graduate curriculum where the number of prospective students is limited.




The teaching was focused on a few statistical methods, selected for their usefulness in solving chemical research problems. The mathematics was kept to a minimum, practical aspects and conceptual understanding were highlighted, and the exercises confronted the students with a diverse set of applications. This approach provided an opportunity to reach further in skill and understanding than is usually the case for such a short introductory course. This course is, however, only a beginning.

The opportunity to work with data from their own research is crucial for student motivation. It is therefore strongly advised that a full version of suitable chemometrics software be made available during the introductory course period. Some of the students who were newly enrolled when the course started now use multivariate methods to analyze their own data and are actively searching for solutions applying pattern recognition and multivariate calibration. This experience suggests that chemometrics should be introduced as early as possible to graduate students, and preferably already at the undergraduate level.

The previous experience of the author is that many students attending chemometrics courses do not continue as active users of the methodology. The most frequent explanation for these “dropouts” is the lack of interest, support, and understanding from colleagues and fellow students. It is therefore encouraging to note a growing interest in chemometrics among the senior research staff, and a training package such as the one used here is of course ideal for anyone interested in self-study. The challenge that now lies ahead at our university is to create an environment that can stimulate and support the future use of these rational methods for chemical research. For this to succeed, it is probably important to have a critical mass of users as early as possible.
It is then beneficial for us to have all of the participants from this introductory course working together in just a few research groups. A next step for us could be an advanced chemometrics course to deepen and broaden the understanding of the subject, but promoting continuous learning should perhaps be given an even higher priority. George Box’s famous Monday night beer and statistics seminars at the University of Wisconsin–Madison are an outstanding example of how informal meetings to discuss data-analysis problems can benefit and educate all the participants (56).

Acknowledgments

I gratefully acknowledge the feedback and encouragement received from students and colleagues in carrying out this course. Three anonymous reviewers are thanked for many helpful comments and suggestions to improve the manuscript.

Literature Cited

1. Wold, S. Chemom. Intell. Lab. Syst. 1995, 30, 109–115.
2. Msimanga, H. Z.; Charles, M. J.; Martin, N. W. J. Chem. Educ. 1997, 74, 1114–1117.
3. Harvey, D. T.; Bowman, A. J. Chem. Educ. 1990, 67, 470–472.
4. Cartwright, H. J. Chem. Educ. 1986, 63, 984–986.
5. Houghton, T. P.; Kalivas, J. H. J. Chem. Educ. 2000, 77, 1314–1318.
6. Ribone, M. É.; Pagani, A. P.; Olivieri, A. C.; Goicoechea, H. C. J. Chem. Educ. 2000, 77, 1330–1333.
7. auf der Heyde, T. P. E. J. Chem. Educ. 1990, 67, 461–469.
8. Cazar, R. A. J. Chem. Educ. 2003, 80, 1026–1029.
9. Rusak, D. A.; Brown, L. M.; Martin, S. D. J. Chem. Educ. 2003, 80, 541–543.
10. Harvey, D. T.; Byerly, S.; Bowman, A.; Tomlin, J. J. Chem. Educ. 1991, 68, 161–168.
11. Oles, P. J. J. Chem. Educ. 1998, 75, 357–359.
12. Palasota, J. A.; Deming, S. N. J. Chem. Educ. 1992, 69, 560–563.
13. Amenta, D. S.; Lamb, C. E.; Leary, J. J. J. Chem. Educ. 1979, 56, 557–558.
14. Leggett, D. J. J. Chem. Educ. 1983, 60, 707–710.
15. Sangsila, S.; Labinaz, G.; Poland, J. S.; van Loon, G. W. J. Chem. Educ. 1989, 66, 351–353.
16. Shavers, C. L.; Parsons, M. L.; Deming, S. N. J. Chem. Educ. 1979, 56, 307–309.
17. Steig, S. J. Chem. Educ. 1986, 63, 547–548.
18. Stolzberg, R. J. J. Chem. Educ. 1999, 76, 834–838.
19. Norcross, B. E.; Clement, G.; Weinstein, M. J. Chem. Educ. 1969, 46, 694–695.
20. Smith, R. B.; Billingham, E. J., Jr. J. Chem. Educ. 1968, 45, 113.
21. Lavine, B.; Workman, J. J., Jr. Anal. Chem. 2004, 76, 3365–3372.
22. Wold, S.; Sjöström, M. Chemom. Intell. Lab. Syst. 1998, 44, 3–14.
23. Delaney, M. F.; Warren, F. V., Jr. J. Chem. Educ. 1981, 58, 646–651.
24. Howery, D. G.; Hirsch, R. F. J. Chem. Educ. 1983, 60, 656–659.
25. Msimanga, H. Z.; Elkins, P.; Tata, S. K.; Smith, D. R. J. Chem. Educ. 2005, 82, 415–424.
26. Ortiz, M. C.; Herrero, A.; Rueda, M. E.; Sanllorente, S.; Reguera, C. Q. Analítica 1999, 18, 151–156.
27. van Staden, J. F. Fresenius. J. Anal. Chem. 1997, 357, 221–223.
28. Nijenhuis, B. T. Mikrochim. Acta 1991, 2, 550–554.
29. Kleywegt, G. J.; Bent, H.; Klaessens, J. W. A.; van den Heuvel, E. J. Chemom. Intell. Lab. Syst. 1988, 3, 3–5.
30. O’Haver, T. C. Chemom. Intell. Lab. Syst. 1989, 6, 95–103.
31. Vandeginste, B. G. M. Anal. Chim. Acta 1983, 150, 199–206.
32. Brown, S. D. Chemom. Intell. Lab. Syst. 1995, 30, 49–58.
33. Luberoff, B. J. J. Chem. Educ. 2000, 77, 1557–1557.
34. Wilks, S. S. Anal. Chem. 1947, 19, 953–960.
35. Smallwood, H. M. Ind. Eng. Chem. 1951, 43, 2071–2073.
36. Carlson, R. J. Chemom. 2001, 15, 495–496.
37. Shaffer, R. E. J. Chemom. 2002, 16, 261–262.
38. Tauler, R. J. Chemom. 2002, 16, 117–118.
39. Esbensen, K.; Guyot, D.; Westad, F.; Houmøller, L. P. Multivariate Data Analysis—In Practice, 5th ed.; CAMO Inc.: Woodbridge, NJ, 2002. http://www.camo.com (accessed May 2006).
40. Grafen, A.; Hails, R. Modern Statistics for the Life Sciences; Oxford University Press: New York, 2002; Chapter 1.
41. Lundstedt, T.; Seifert, E.; Abramo, L.; Thelin, B.; Nyström, A.; Pettersen, J.; Bergman, R. Chemom. Intell. Lab. Syst. 1998, 42, 3–40.
42. Walters, F. H.; Parker, L. R., Jr.; Morgan, S. L.; Deming, S. N. Sequential Simplex Optimization: A Technique for Improving Quality and Productivity in Research, Development, and Manufacturing; CRC Press: Boca Raton, FL, 1991; Chapter 8.
43. Öberg, T.; Deming, S. N. Chem. Eng. Prog. 2000, 96, 53–59.
44. Wold, S.; Sjöström, M.; Eriksson, L. Chemom. Intell. Lab. Syst. 2001, 58, 109–130.
45. Box, G. E. P.; Hunter, W. G.; Hunter, J. S. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building; Wiley: New York, 1978.
46. Martens, H.; Næs, T. Multivariate Calibration; Wiley: New York, 1989.
47. Öberg, T. Chemom. Intell. Lab. Syst. 2004, 73, 29–35.
48. Carlson, R.; Lundstedt, T.; Shabana, R. Acta Chem. Scand. Ser. B 1986, 40, 534–544.
49. Carlson, R. Design and Optimization in Organic Synthesis; Elsevier: Amsterdam, 1992.
50. Wold, S. Pattern Recog. 1976, 8, 127–139.
51. Thy, P.; Esbensen, K. J. Geophys. Res. 1993, 98, 11799–11805.
52. Fisher, R. A. Ann. Eugen. 1936, 7, 179–188.
53. Wold, S. Chem. Scripta 1974, 5, 97–106.
54. Eriksson, L.; Johansson, E.; Kettaneh-Wold, N.; Wold, S. Multi- and Megavariate Data Analysis: Principles and Applications; Umetrics Inc.: Kinnelon, NJ, 2001. http://www.umetrics.com (accessed May 2006).
55. Eriksson, L.; Johansson, E.; Kettaneh-Wold, N.; Wikström, C.; Wold, S. Design of Experiments: Principles and Applications; Umetrics Inc.: Kinnelon, NJ, 2001.
56. Peña, D. Chemom. Intell. Lab. Syst. 2002, 63, 5.
