Chemometrics in the chemistry curriculum - Journal of Chemical

Aug 1, 1983 - Journal of Chemical Education
Computer Series, 43

edited by JOHN W. MOORE

Chemometrics in the Chemistry Curriculum

Darryl G. Howery
City University of New York, Brooklyn College, Brooklyn, NY 11210

Roland F. Hirsch
Seton Hall University, South Orange, NJ 07079

The impact of computers on chemistry is probably the major phenomenon of the past decade in chemistry. In such diverse areas as information handling, large-scale computations, instrument design, process control, and on-line data acquisition, the computer's importance is well established. Further, as the chemist's ability to acquire data increases, he is faced also with the necessity of handling and interpreting ever increasing quantities of information.

Fortunately, another area which has received tremendous impetus from the computer age is the broad field of data analysis. The task of analyzing chemical data has been made much more tractable by combining the computational capabilities of the computer with the interpretive powers of a host of mathematical methods. This marriage has created a still evolving field called "chemometrics," which, in simple terms, deals with the analysis of data from chemical measurements. More generally, chemometrics involves the application of mathematics, statistics, and computer science to handle, classify, interpret, and predict data of chemical interest. Through chemometrics, data become more meaningful as information is extracted from measurements.

In this article, we seek to introduce the reader painlessly to chemometrics. We first trace the development of this important new area of chemistry, and then summarize some of the reasons for teaching chemometrics. A third objective (and the original justification for this article) is to summarize the symposium held at the Fall 1981 Meeting of the American Chemical Society in New York City. The symposium, organized by the authors for the Division of Chemical Education, was jointly sponsored with the Divisions of Analytical Chemistry and of Computers in Chemistry.
Our final objective is to give thumbnail descriptions of a few of the chemometric methods which are having the greatest impact on chemistry.

Development of Chemometrics

Chemometrics reached maturity in three fairly distinct stages. In stage one, which lasted up to the late 1960's, a host of mathematical approaches to data analysis were systematized. Researchers in such diverse fields as genetics, the behavioral sciences, engineering, and mathematical statistics were responsible for most of the early methodological advances. For instance, two of the current mainstays of chemometrics, factor analysis and pattern recognition, were first formulated and applied by psychologists and engineers, respectively. Thus, acceptance of data-analysis methods surprisingly occurred first in difficult-to-quantify fields. Inordinately complex data, such as measurements obtained in agricultural field experiments or in psychological testing, begged for broad methods of analysis. Such data could seldom be explained with a single variable or with even two or three variables. The data of the real world, including that of chemistry, usually can be explained well only with multivariate models. Towards the end of the first period, computers began to play a key role. As algorithms were written to take advantage of high-speed computers, new approaches were developed at a rapid rate during the 1960's.

During stage one, chemists generally paid only lip service to data analysis, limiting themselves to the calculation of descriptive statistics (e.g., mean and standard deviation) and confidence limits. This timidity occurred in spite of the publishing of books specifically written by and for chemists, such as the very readable volume by Youden (1) and the comprehensive books by Davies and Goldsmith of Britain's Imperial Chemical Industries (2, 3), which have been updated regularly. Still, there was one great early success for chemometrics (the name was proposed later): starting with the linear free energy relationship model and applying regression analysis, large quantities of data from organic chemistry were related to specific properties of molecules.

In stage two, covering the decade of the 1970's, chemists became active not only in using available data-analysis methods in research, but also in developing and modifying these methodologies to meet the specific needs of chemists. Chemometrics became a distinct subdiscipline of chemistry, though most chemists still hesitated, out of unfamiliarity, to use its methods. Chemometrics grew rapidly for two reasons. First, computers became ubiquitous and useful programs were made available. Second, powerful mathematical methods began to prove themselves in chemistry.
Using multivariable methods in consort with computers, complex, heretofore impossible-to-solve problems could be tackled. Factor analysis and pattern recognition became major tools in the chemometric arsenal. Analytical chemists, pressured by the enormous increase in capabilities to generate analytical data, have been leaders in the chemometric revolution. Advances in chemometrics are now covered in the fundamentals (even-numbered years) section of Analytical Chemistry Annual Reviews (4). The availability of several books covering applications to chemical problems and of numerous computer programs and packages has made chemometrics accessible to the research and teaching community.

In stage three, which is just beginning and will continue for some years, chemometrics is pushing its way into the classroom. Portions of courses and even entire courses are being devoted to chemometrics in the U.S. and Europe. While most of the courses are taught at the graduate level, some are being developed for undergraduates. The symposium on teaching chemometrics, which we discuss below, demonstrated that the encroachment of chemometrics into the curriculum has begun. [...] simplified programs which can be run on microcomputers should be key dividends of this stage.

We anticipate within a few years a fourth stage, in which chemometrics will become a truly routine tool in chemistry. This stage will evolve naturally as chemometrics proves itself by solving many practical problems and as the average chemist feels more at home with more sophisticated levels of data analysis. Then, chemometrics will become a standard component of the methodologies of chemistry and, as a consequence, chemometric methods will be built into chemical instruments.

Why Teach Chemometrics?

A growing number of chemical educators, including the authors (who have taught chemometrics to undergraduate and graduate students), feel that chemometrics is becoming a full-fledged discipline justifying classroom attention. The most fundamental reason for teaching chemometrics is that chemometrics furnishes new ways of doing chemistry. Efficient new ways of looking at and of extracting information from chemical measurements, of modeling data, and of designing experiments are now available. Problems which are otherwise difficult and perhaps impossible to solve can be routinely studied with chemometrics. With the use of programs utilizing multivariate methods, relationships among complex data are readily discerned. Computer graphic presentations are playing a significant role in making results understandable. In some situations, chemometrics even offers viable substitutes for laboratory experiments, especially where the measurements are hazardous, expensive, or time consuming.

A second, practical reason relates to the broad applicability of the methods. Since the techniques are general, all kinds of chemical data might benefit from data analysis of some sort. Further, thanks to the large memories and computational capabilities of computers, quite large data sets can be investigated. In the more complex areas of chemistry, where the fundamentals are poorly understood, chemometrics offers particularly unique insights. The broad acceptance of data-analysis methods in the behavioral sciences suggests the value of applying chemometric methods when fundamental approaches, such as quantum mechanical ones, cannot be applied.

The third reason for teaching chemometrics is tied to a broader aspect of chemical education. So pervasive is the computer in chemistry that many argue that a course covering the diverse aspects of computers in chemistry should be a standard course in the training of professional chemists. In such a course, chemometrics would be one of the main topics. Pedagogically, this mix of chemometrics with computer science is an attractive package.

Perhaps the major task of the chemical educator during this decade is to guide chemists into the computer age. Some chemists must have the ability to adapt proven methodologies and to develop new methods as the need arises. Training such persons, who will be the guides for bench chemists, is the responsibility of the chemical educator. Educators, beset on all sides by "important" new fields, will have to decide to what extent chemometrics should displace traditional topics within existing courses and, ultimately, whether chemometrics should be taught as a separate course.

The Symposium

The symposium presentations demonstrate the many ways in which chemometrics is being incorporated into the chemistry curriculum at the undergraduate and graduate levels and into continuing education programs.

Several universities now offer full semester courses in chemometrics as part of their graduate chemistry programs. Michael Delaney described the approaches he has used at Tufts and Boston Universities. A description of his course has appeared in THIS JOURNAL (5). He defines chemometrics as chemical information processing, and includes topics such as library searching and graph theory in addition to those described elsewhere in the symposium.

Alice Harper of the University of Georgia offers two graduate courses, one on statistics and the second covering special topics in chemometrics, including data structure, preprocessing of data, modeling and experiment design, pattern recognition, and factor analysis. The first course is based primarily on the recent text by Box, Hunter, and Hunter (6), which has many examples from the chemical industry. Considerable emphasis is placed on learning the use of standard statistical packages such as MINITAB, developed at Penn State and very widely available; TUSTAT II, originated at the University of Nevada-Reno; and the ARTHUR program of Kowalski and associates at the University of Washington (which was referred to by several speakers at the symposium).

James L. Fasching of Rhode Island University teaches chemometrics as half of the first semester advanced undergraduate-graduate analytical chemistry course. He emphasizes fundamental statistical concepts to promote an understanding of the [...] and assumptions behind the sophisticated methods, even though this limits the time available for the more advanced multivariate techniques.

K. J. Johnson of the University of Pittsburgh discussed a junior/senior level course for undergraduate chemistry majors which combines numerical methods and data analysis, and includes extensive chemical illustrations (7).
The topics range from NMR simulation to regression and curve smoothing. Such courses are an efficient and convenient means of introducing the chemistry major to modern chemometric techniques.

Several of the speakers noted the need for more teaching of chemometrics to undergraduates. Bernard Vandeginste of the Katholieke Universiteit, Nijmegen, Holland, and Bruce Kowalski of the University of Washington in Seattle stressed this point and spoke of the central role of chemometrics in filling the gap between analytical chemistry and applied mathematics. By their definition, "chemometrics is the chemical discipline which uses mathematical and statistical methods ... to obtain in the optimal way relevant information about material systems."

The development of short courses on aspects of chemometrics indicates that the practicing chemist feels a real need for updating his knowledge of a field in which he probably received no formal academic training. Indeed, the growth in short-course offerings may be the best measure of the vigor of chemometrics, for continuing education is the means through which most chemists will be trained in its methods for the next ten years or more. The authors suspect that chemometrics may already rank third, behind chromatography and spectroscopy, in the number of different short courses currently available.

Peter Lykos of the Illinois Institute of Technology described a chemometrics course developed at I.I.T. which is available on videotape through an academic media-sharing consortium, EDUNET, and a service for continuing education of engineers, AMCEE. He stressed that the availability of inexpensive computer devices allows the average chemist to take advantage of advanced techniques in chemometric areas such as spectral and signal analysis, graphics displays, and operations research.

Stanley Deming of the University of Houston described his short course in experiment design. In two days he covers the fundamentals of matrix algebra, the modeling of experimental systems, response surfaces, and the evolutionary operation (EVOP) and Simplex approaches to experiment optimization. Emphasis is placed on distinguishing between uncertainties in experimental results caused by choice of an improper model and those caused by imprecision in the measurement process.

Chemometric Methods

Chemometrics spans a vast array of methods, a few of which we have mentioned in passing. In this section, we give brief introductions, based on the symposium, to some of the major methods of chemometrics, including recommended readings to aid prospective teachers. An extensive bibliography for chemometrics covering many methods will be published in the future. The monograph of Massart et al. (8), though written primarily for researchers, is the most comprehensive general reference for chemometrics. Readable introductions to several chemometric methods are available in the book edited by Hirsch (9). Valuable ideas are also contained in the book by Mosteller and Tukey (10).

Three of the most widely used chemometric methods are multiple regression analysis, factor analysis, and pattern recognition. Other techniques, such as experimental design (11) and optimization (12), analysis of variance (13), and information theory (14), have also proven useful in chemistry. We shall focus our attention on the first three methods.

Though we are deliberately avoiding mathematics here, two definitions are needed for the ensuing discussion. Most of the chemometric methods deal with vectors and matrices of data. A vector is a one-dimensional array, i.e., a list of numbers. A matrix is a two-dimensional array of numbers, i.e., a grid of numbers. Much of the mathematics of chemometrics is matrix algebra involving manipulations of vectors and matrices. Such [...]


Multiple regression analysis (MRA) is used to express a set of data as a simple equation. MRA starts with a vector of measured data, called the dependent variables. Then a certain number of independent variables, or factors, are selected as a model for the data. In the last step, the proportionality constants, called regression coefficients, are calculated to give the best possible fit to the data, usually by the least squares method. A valid regression model predicts the measured data accurately, and can therefore be used [...]. Each datum is assumed to obey the equation

$d = b_1 f_1 + b_2 f_2 + \cdots + b_n f_n$    (1)

where d is the dependent-variable datum, the b's are the multiple regression coefficients calculated by the least-squares method, and the f's are the n independent variables (factors) in the model. The objective of MRA is to calculate the best set of b's (there are only n of them) for the set of dependent variables. The entire set of data forms a vector d, so that

$\mathbf{d} = \mathbf{f}\,\mathbf{b}$    (2)

where b is the vector of multiple-regression coefficients which best fits the data, and f is a matrix, each row of which represents the factors for a particular datum. Various combinations of factors can be run to find out which set of independent variables best models the data, i.e., predicts the data with the smallest error.

Evaluation of the coefficients in extended Hammett-type relations, called linear free-energy relationships (15), is a striking application of MRA in chemistry. Variations in kinetic and equilibrium constants can be modeled in terms of fundamental or empirical properties of organic molecules. An alternative is to express the behavior of a group of molecules as the sums of contributions from the individual structural units present in the group. This MRA approach is called Topological Analysis by Jacques-Emile Dubois, Jacques Chretien and other coworkers, who developed it in Paris (16). (In this case, the b's calculated from eqn. (2) are structural-fragment constants.) The symposium contained two presentations on regression analysis, one by Dubois, Chretien, and Roland Hirsch dealing with the topological analysis technique, and the other by Daniel Huchital and Loretta Kiel of Seton Hall University describing Mangelsdorf's method for obtaining reaction rate constants.

MRA is easy to carry out. The excellent MINITAB package (17) contains one of the best of the many available MRA programs. Close attention should be paid to procedures for testing significance of models and for setting confidence limits on predictions. The texts by Draper and Smith (18) and Daniel and Wood (19) are recommended for guidance.

Factor analysis (FA) is perhaps the most versatile of the chemometric methods. The primary purpose of FA is to interpret the underlying factors responsible for data. Starting with a matrix of data, FA furnishes a purely mathematical model consisting of abstract factors. These factors are then transformed mathematically in an effort to better understand the data. Two equations describe the basis of FA. Each datum in the matrix is modeled by the equation

$d = r_1 c_1 + r_2 c_2 + \cdots + r_n c_n$    (3)

where d is a datum associated with a particular row and column of the data matrix, and the r's and c's are the n row and column factors which model the data. For example, if the rows and columns represent molecules, then the factors are separated into row-molecule properties and column-molecule properties. The matrix equivalent of eqn. (3) is

$\mathbf{D} = \mathbf{R}\,\mathbf{C}$    (4)

where D is the matrix for all of the data, and R and C are the matrices involving the n factors for each row species and for each column species. The two factor matrices can be calculated by standard methods, after which various transforms of this abstract model can be carried out. Using FA, the number of factors can be found, the data can be correlated, and often physical meaning can be given to the factors.

The monograph of Malinowski and Howery (20) covers the methods and applications of FA in chemistry. A useful brief account of FA is also available (8, p. 185). Factor analysis can be particularly valuable when the researcher initially has little or no insight into the data. In chemistry, FA has found application extensively in component analysis. Here, the number of components present in multicomponent mixtures (n in eqn. (3)) can be determined, and the identity of the components can be tested. Edmund R. Malinowski of Stevens Institute of Technology showed in his symposium presentation how to identify components in an unresolved chromatographic peak using FA, while Patricia C. Tway and Hugh B. Woodruff of Merck, Sharp & Dohme Research Labs and L. J. Cline Love of Seton Hall University presented a second example, describing their work on the automated FA of mass spectrometric data generated in chromatographic studies. Another presentation described an approach for teaching FA to undergraduate research students.

Pattern recognition (PR) is perhaps the most utilitarian of the chemometric methods. The purpose of PR is to classify a species based on a series of measurements (a pattern) on that species. Typically, PR classifies in the following manner. A matrix is formed from the patterns for a number of species. Then a decision vector which divides the patterns into an assigned binary (two-type) classification is calculated. After the validity of the decision vector is checked using other patterns of known classification, the decision vector is employed to classify unknown patterns.
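The three methods described in this section all reduce to operations on vectors and matrices. The short Python sketch below is an illustration only, not code from the symposium or from any of the packages named in this article. It assumes NumPy and invented, noise-free numbers, and the function names (`fit_mra`, `count_factors`, `fit_decision_vector`, `classify`) are hypothetical. Least squares is used for the regression coefficients, as described for MRA; the singular value decomposition is assumed here as one standard way to count abstract factors; and a least-squares fit to classes coded +1 and -1, with each pattern scored by the weighted sum of its measurements, stands in for whatever training procedure a particular PR program uses.

```python
import numpy as np


def fit_mra(f, d):
    """Least-squares regression coefficients b for the model d = f b."""
    b, *_ = np.linalg.lstsq(f, d, rcond=None)
    return b


def count_factors(D, rel_tol=1e-8):
    """Count abstract factors as the singular values of D that exceed
    rel_tol times the largest one (an assumed, noise-free criterion)."""
    singular_values = np.linalg.svd(D, compute_uv=False)
    return int(np.sum(singular_values > rel_tol * singular_values[0]))


def fit_decision_vector(patterns, classes):
    """Least-squares decision vector w for classes coded +1 and -1."""
    w, *_ = np.linalg.lstsq(patterns, classes, rcond=None)
    return w


def classify(w, pattern):
    """Return +1 or -1 from the sign of s = w1*d1 + ... + wn*dn."""
    return 1 if float(pattern @ w) > 0 else -1


# MRA: four data, two factors per datum (invented, noise-free numbers).
f = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
d = f @ np.array([0.5, 2.0])          # data generated from known b's
b_hat = fit_mra(f, d)                 # recovers the b's used above

# FA: a 4 x 3 data matrix built from two row and two column factors.
R = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
C = np.array([[1.0, 2.0, 3.0], [0.0, 1.0, 1.0]])
n_factors = count_factors(R @ C)

# PR: six training patterns of two measurements each, in two classes.
patterns = np.array([[2.0, 1.0], [3.0, 1.5], [2.5, 0.5],
                     [1.0, 2.0], [0.5, 3.0], [1.0, 2.5]])
classes = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
w = fit_decision_vector(patterns, classes)
unknown_class = classify(w, np.array([3.0, 1.0]))

print(b_hat, n_factors, unknown_class)
```

With real measurements the small singular values are not exactly zero, so the tolerance in `count_factors` would have to be chosen from an estimate of the experimental error rather than fixed in advance.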

Each pattern is classified according to the value of a binary discriminant function, s:

$s = w_1 d_1 + w_2 d_2 + \cdots + w_n d_n$    (5)

where the w's are the n components of the decision vector, and the d's are the n pieces of data in the pattern. For a set of patterns of assigned s's, the best vector w is calculated. Then, using the set of calculated w's and the set of d's for a new pattern, the value of s for the new pattern is calculated to determine its classification.

The monograph of Varmuza (21) provides an overview of pattern recognition methods and of applications in chemistry. A shorter review could also be consulted (8, p. 215). The main application of PR in chemistry has been to classify molecules according to their spectra, especially mass spectra and [...] infrared spectra. Classification of important substances according to their biological activity is a recent promising application. Two symposium presentations were devoted specifically to PR. One, by Peter C. Jurs of Penn State, was a general introduction to the topic, and the other, by John C. MacDonald of Fairfield University, showed how scientists at a university are making use of this technique.

One final technique described at the symposium was catastrophe theory, which was presented by Sherwood Washburn of Seton Hall University. Catastrophe theory was developed by René Thom as a means of treating mathematically discontinuous processes, such as phase transitions. With increasing interest in the mathematical modeling of chemical processes, scientists are more and more frequently turning to catastrophe theory for explanations of experimental results (22).

Summary

[...] chemometrics is finding application, and the significant effort to teach [...] every worthy chemometric method even be mentioned for lack of space. However, we hope that we have provided a perspective on the state of development of this field that will encourage the reader to study and adopt the techniques appropriate to his or her laboratory and teaching needs.

Literature Cited

(1) Youden, W. J., "Statistical Methods for Chemists," Wiley, New York, 1951.
(2) "Statistical Methods in Research and Production," 4th ed., corrected, Davies, O. L., and Goldsmith, P. L. (Editors), Longman Group, London, 1976.
(3) "The Design and Analysis of Industrial Experiments," 2nd ed., corrected, Davies, O. L. (Editor), Longman Group, London, 1978.
(4) Frank, I. E., and Kowalski, B. R., Anal. Chem., 54, 232R (1982).
(5) Delaney, M. F., and Warren, Jr., F. V., J. CHEM. EDUC., [...] (1981).
(6) Box, G. E. P., Hunter, W. G., and Hunter, J. S., "Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building," Wiley, New York, 1978.
(7) Johnson, K. J., "Numerical Methods in Chemistry," Marcel Dekker, New York, 1980.
(8) Massart, D. L., Dijkstra, A., and Kaufman, L., "Evaluation and Optimization of Laboratory Methods and Analytical Procedures," Elsevier, Amsterdam, 1978.
(9) "Statistics," Hirsch, R. F. (Editor), Franklin Institute Press, Philadelphia, 1978.
(10) Mosteller, F., and Tukey, J. W., "Data Analysis and Regression," Addison-Wesley, Reading, MA, 1977.
(11) Deming, S. N., and Morgan, S. L., Clin. Chem., 25, 840 (1979).
(12) Deming, S. N., and Parker, Jr., L. R., CRC Crit. Rev. Anal. Chem., 7, 187 (1978).
(13) Hirsch, R. F., Anal. Chem., 49, 691A (1977).
(14) Eckschlager, K., and Stepanek, V., Anal. Chem., 54, 1115A (1982).
(15) "Correlation Analysis in Chemistry: Recent Advances," Chapman, N. B., and Shorter, J. (Editors), Plenum Press, New York, 1978.
(16) Chretien, J. R., and Dubois, J. E., J. Chromatogr., 158, 43 (1978).
(17) Ryan, Jr., T. A., Joiner, B. L., and Ryan, B. F., "Minitab Student Handbook," Duxbury Press, North Scituate, MA, 1976.
(18) Draper, N. R., and Smith, H., "Applied Regression Analysis," 2nd ed., Wiley, New York, 1981.
(19) Daniel, C., and Wood, F. S., "Fitting Equations to Data: Computer Analysis of Multifactor Data," 2nd ed., Wiley, New York, 1980.
(20) Malinowski, E. R., and Howery, D. G., "Factor Analysis in Chemistry," Wiley-Interscience, New York, 1980.
(21) Varmuza, K., "Pattern Recognition in Chemistry," Springer-Verlag, New York, 1980.
(22) Gilmore, R., "Catastrophe Theory for Scientists and Engineers," Wiley-Interscience, New York, 1981.

Volume 60 Number 8 August 1983
