Anal. Chem. 2004, 76, 3365-3372
Chemometrics Barry Lavine*,† and Jerome J. Workman, Jr.‡
Department of Chemistry, Clarkson University, Potsdam, New York 13676, and Argose Inc., Waltham, Massachusetts 02451
Review Contents: Image Analysis, Sensors, Microarrays, Literature Cited
This review, the fifteenth of the series and the thirteenth with the title of Chemometrics, covers the most significant developments in the field from January 2002 to January 2004. As in the previous review (A1), breakthroughs and advances in the field have been highlighted, trends evaluated, and challenges that must be successfully met to ensure continued progress in the field enunciated. The current review is limited to approximately 100 references, which continues to pose a challenge since the number of citations on chemometrics continues to show steady growth. Over 15 000 citations, for example, appear when the terms pattern recognition and multivariate calibration are used as keywords in a Chemical Abstracts search. This comes as no surprise since many areas of chemometrics have been assimilated by other disciplines. The extraction of information from chemical data drives research in chemometrics, and growth in this field will continue as long as practitioners of chemometrics continue to solve problems that need to be solved, as opposed to problems that are solved simply because they can be. There were more papers presented on bioanalytical, biomedical, and nanotechnology subjects than on classical instrumental techniques at the recent Pittsburgh Conference. However, the opposite is true of the chemometric literature, which raises the question: has chemometrics failed to branch out into other areas of chemistry beyond pattern recognition and calibration to become a computational and multivariate branch of chemistry? Has it lived up to the expectations expressed in Svante Wold’s 1995 paper entitled “Chemometrics: What Do We Mean With It and What Do We Want From It” (A2)?
Wold wrote, “The art of extracting chemically relevant information from data produced in chemical experiments is given the name of chemometrics in analogy with biometrics, econometrics, etc.” Wold also wrote that in all applied branches of science the difficult and interesting problems are defined by the applications. “Therefore, chemometrics must not be separated from chemistry or even be allowed to become a separate branch of chemistry; it must remain an integral part of all areas of chemistry.”

† Clarkson University. ‡ Argose Inc.
10.1021/ac040053p CCC: $27.50 Published on Web 04/22/2004
© 2004 American Chemical Society

A problem that has received considerable attention in the chemometric literature is calibration transfer. Brown (A3) has compiled an extensive review of the literature on the various preprocessing and standardization methods used for transfer of multivariate calibration models between instruments. Lima (A4), Greensill (A5), and Small (A6) have evaluated and compared the performance of various standardization methods used for transfer of calibration models between instruments. Using the wavelet transform to preprocess the data and remove noise, Brown (A7) and Galvao (A8) have shown improvements in the robustness of calibration models, thereby circumventing the entire problem. It should be noted that although these solutions exist in the literature, their use is not widespread for applied problem solving. At present, the general approach used to tackle this problem involves formulating a fixed mathematical solution to a chaotic system. This has a fundamental drawback, which is the failure to obtain a noise “snapshot” of the system in real time followed by subtraction (or some other dynamic adjustment) to yield an appropriate correction. Instead, a calibration is done at a particular time and is expected to hold for an indefinite period. Furthermore, previous solutions to the problem of calibration transfer have focused on variability between the first-order instrument responses for more than one instrument when attempting to develop a suitable regression model to quantitate spectral response across instruments. However, there are other higher order variations between instruments over time that need to be addressed to ensure a successful calibration transfer. These are best tackled by having a set of reference standards that can be used to calibrate all instruments at any moment in time. It then becomes a simple matter of periodically checking the instrument for drift or unmodeled variation. Chemometrics could then be used to conform the instrument signal to the standard samples, to report how far the current instrument response to the standard samples lies outside the modeled variance, and to develop methods for updating the models using information about the unmodeled variance.
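The periodic check against reference standards described above can be illustrated with a small sketch. It is only one possible reading of the idea, not a method taken from the cited papers: a principal component model is fitted to spectra of the standards recorded at calibration time, and a later measurement of a standard is judged by its squared residual (often called the Q residual) outside that model. The simulated spectra, the function names `fit_pca` and `q_residual`, and the factor of 3 used for the limit are all assumptions made for illustration.

```python
import numpy as np


def fit_pca(X, n_components):
    """Return the mean spectrum and the loadings of a PCA model (rows of X = spectra)."""
    mean = X.mean(axis=0)
    # SVD of the mean-centered data; the rows of Vt are the loading vectors.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]


def q_residual(x, mean, loadings):
    """Squared length of the part of spectrum x that the PCA model cannot describe."""
    centered = x - mean
    scores = loadings @ centered
    residual = centered - loadings.T @ scores
    return float(residual @ residual)


# Simulated (hypothetical) reference standards: 30 spectra of 50 channels,
# mixtures of two Gaussian bands plus a small amount of noise.
rng = np.random.default_rng(0)
channels = np.linspace(0.0, 1.0, 50)
components = np.vstack([np.exp(-((channels - 0.3) / 0.1) ** 2),
                        np.exp(-((channels - 0.7) / 0.1) ** 2)])
reference = rng.uniform(0.5, 1.5, size=(30, 2)) @ components
reference += rng.normal(scale=0.001, size=reference.shape)

mean, loadings = fit_pca(reference, n_components=2)

# Empirical limit (an assumption): three times the largest calibration residual.
q_limit = 3.0 * max(q_residual(x, mean, loadings) for x in reference)

# The same standard measured later, once unchanged and once with a sloping baseline drift.
standard = np.array([1.0, 1.0]) @ components
drifted = standard + 0.05 * channels

q_ok = q_residual(standard, mean, loadings)
q_drift = q_residual(drifted, mean, loadings)
print("unchanged standard within limit:", q_ok <= q_limit)
print("drifted standard flagged:", q_drift > q_limit)
```

In practice the limit would be set from a statistical distribution of the residuals rather than an ad hoc multiple, and a flagged standard would trigger a model update rather than a simple warning.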
Current methods of calibration transfer rarely address these issues, which are crucial when applying multivariate techniques to spectroscopic data for accurate quantitative analysis over time.

At the other extreme, computer-aided molecular design, including quantitative structure-activity relationship (QSAR)-type techniques, needs to become more of a focus area in chemometrics. Practitioners of chemometrics bring mathematical, statistical, and chemical expertise to bear on complex problems. For QSAR, issues that need to be investigated include molecular descriptor generation, improved algorithms for multivariate analysis, and libraries of molecular properties. These different components need to be systematized and centralized in a single core facility. They need to be extended from spectroscopic and reactivity properties of molecules to toxicity, mutagenicity, and other structure-activity relationships (SARs). The field of chemoinformatics, which encompasses the analysis, visualization, and use of chemical information as a surrogate variable for other data or information, has developed through assimilation of ideas and methods from chemometrics. The importance of chemometrics in 2-D and 3-D QSAR has been the subject of several review articles (A9-A12). Using multivariate analysis techniques such as principal component analysis or partial least squares, nonionic organic pesticides can be partitioned into different environmental compartments based on their physicochemical properties (A13), compounds for drug development can be optimized (A14), and catalysts can be designed using quantum mechanical chemical parameters as molecular descriptors for the formulation of a QSAR (A15). During this reporting period, a few papers were published on the development of new molecular descriptors for 2-D and 3-D QSAR (A16, A17) and new algorithms for multivariate analysis (A18, A19).

During the same time period when chemoinformatics evolved, chemometrics primarily focused on process analytical chemistry. The expectation was that engineers would depart from their traditional approach and accept the use of sensors and multivariate analysis for process monitoring and control. Although chemometrics has been able to make some progress in this field, the vision of an engineering discipline rooted in chemometrics has not been realized. Furthermore, analytical scientists and chemometricians have been relegated to service departments in most pharmaceutical, chemical, and biotechnology organizations, rather than serving as key elements in experimental discovery, new product development, and process optimization. Meanwhile, new problems and challenges appeared but were largely ignored by most practitioners of chemometrics because of their ties to the more mundane problems found in process analytical chemistry, medicinal chemistry, or biotechnology.

What are the important problems associated with chemometrics and therefore multivariate thinking? From the broadest perspective, these problems fall into several general categories or topics.
Such a list of important problems might include calibration and calibration transfer, signal processing and digital filtering, second and higher order data processing, machine learning, propagation of uncertainty in machine learning, image enhancement, hyperspectral image analysis, computer-aided molecular design such as QSAR and other in silico techniques, and data fusion. It is important to note that chemometric research groups today often collaborate with psychometricians, bioinformaticians, statisticians, chemical engineers, and electrical and computer engineering groups. The next review article will address the details of new algorithmic and mathematical approaches in the field. For this review, only applications with dramatic significance and at least a modicum of published research papers have been selected for inclusion.

During this reporting period, we observed that a number of Web-based resources exist to support chemometrics. The North American Chapter of the International Chemometrics Society, hosted at Ohio University, has a web page located at http://iris4.chem.ohiou.edu/ that contains information about a variety of topics in chemometrics. A useful website for general chemometric reference use is located at http://www.disat.unimib.it/chm/Links%20Chemometrics.htm. Another, hosted by Umeå University, is located at http://www.anachem.umu.se/cgi-bin/jumpstation.exe?Chemometrics. However, the Umeå website needs some updating of its 44 web links, which cover the following subjects: software and shared algorithms, work groups, tutorials, teaching materials, bibliographies, research, user contacts, and conferences. A searchable bibliographic site maintained by the Chemical Institute of Canada is located at http://home.nas.net/~dbc/cic_hamilton/chemo.html. The site, which has 28 links, contains the following subheadings: Abstract, Hypertexts, Conferences, Journals, Resources, Research Groups, Societies, and Software. Other valuable websites for those interested in chemometrics include Chemometrics World by Wiley, found at http://www.wiley.co.uk/wileychi/chemometrics/journal.html, and the Chemometrics Web News by the Milano Chemometrics and QSAR Research Group, located at http://www.disat.unimib.it/chm/Chemometrics.htm. There are also web sites that focus on specific topic areas of importance to chemometric users, including numerical computing, http://www.mathworks.com/moler/chapters.html (MATLAB site); software, http://www.chemometrics.com/software/software.html; signal processing, http://chemdiv-www.nrl.navy.mil/6110/6112/chemometrics/sigproc.html; computer-aided molecular design, http://panizzi.shef.ac.uk/cisrg/links/ea_bib.html; and neural network applications, http://www.faqs.org/faqs/ai-faq/neural-nets/part7/section-2.html, http://prettyview.com/ann/, and http://www2.chemie.uni-erlangen.de/publications/ANNbook/publications/ACIEE93_503/5_6_Applications-01.html. A website to locate new and existing chemometrics books can be found at http://search.barnesandnoble.com/booksearch/results.asp?WRD=Chemometrics&userid=54RWBA5W62.

Books published since 2003 include Nature-inspired Methods in Chemometrics, edited by Ricardo Leardi. This book surveys the application of genetic algorithms and neural networks. Both theoretical and applied aspects of these chemometric tools are described. Another book targeted for the chemometrics audience is Chemometrics: Data Analysis for the Laboratory and Chemical Plant, by Richard G. Brereton.
In this book, the author has focused on the main concepts of chemometrics, which he defines as experimental design, signal processing, pattern recognition, calibration, and evolutionary data. He explores the basic principles and applications of these concepts through problem solving. The text has worked examples to demonstrate chemometric concepts. There are 54 problems included along with relevant appendixes in subjects such as matrix algebra, statistics, and commonly used algorithms. A third book targeted for the chemometrics audience is A User-Friendly Guide to Multivariate Calibration and Classification, by Tormod Naes, Tomas Isaksson, Tom Fearn, and Tony Davies. The aim of the text is to provide the reader with a guide to the field of multivariate calibration and classification through a careful survey of the literature. Topics that are covered and treated at great length include data preprocessing techniques (e.g., Fourier filtering, wavelets, multiplicative scatter correction, orthogonal signal correction, derivatives, and the standard normal variate method), nonlinearity in calibration, partial least squares, principal component regression, and multiple linear regression analysis.

Chemometrics software continues to proliferate, particularly packages containing algorithms intended for broad applications. The main software packages that enjoy widespread use include the following: Camo Unscrambler at http://www.camo.com/rt/Products/unsc/unscnewrelease; Infometrix Pirouette at http://www.infometrix.com/; Chemometrics Toolbox for MATLAB at http://www.chemometrics.com/software/chemometrics.html; Factor Analysis Toolbox for MATLAB at http://www.chemometrics.com/software/fatb.html; FOSS-NIRSystems ISIscan at http://www.foss.dk/c/p/solutions/products/showprodfamily.asp?prodfamilypkid=124&languageId=1&stepselect=3; Galactic Grams Multi-Quant for Windows and other platforms at http://www.chemometrics.com/software/multiquant.html; and Multi-Qual for Windows at http://www.chemometrics.com/software/multiqual.html. Many of these packages are rooted in traditional chemometric history and have enjoyed a decade or more of critical use.

There are also several journals dedicated to the mechanics and logic of chemometrics. These journals include Chemometrics and Intelligent Laboratory Systems, Journal of Chemometrics, and the Journal of Chemical Information and Computer Sciences. Other journals that cover chemometrics in their general editorial scope but are more focused on applications include Environmetrics, Analytical Chemistry, Analytical Letters, and Analytica Chimica Acta.

Much of the effort in chemometrics has been directed toward exploiting other literatures in an effort to find an existing method that might solve a chemical problem of interest. In the past, the invention of new methods has not played a large role in the field. Mining of other fields for chemometric “gold ore” worked reasonably well for most of the 20th century since chemistry during this time period was a data-poor discipline relying on well thought out hypotheses and carefully designed experiments to develop solutions to scientific problems, which was also true of the fields being mined. Recently, both chemistry and biology have begun to evolve into data-rich fields, thereby opening up the possibility of data-driven research.
This, in turn, has led to a new approach for solving scientific problems, which consists of four interrelated steps: (1) measure a phenomenon or process using instrumentation that generates data effortlessly and inexpensively, (2) analyze the multivariate data, (3) iterate if necessary, and (4) create and test a model that will provide fundamental multivariate understanding of the process being investigated. This new approach to scientific problem solving constitutes a true paradigm shift since multiple experimentation and chemometrics are used as a vehicle to investigate the world from a multivariate perspective. Mathematics is not used for modeling per se but more for discovery and is thus a data microscope to sort, probe, and look for hidden relationships in data. In the previous review, we discussed the implications of this new paradigm for discovery and cited examples of its use in fields other than chemistry. During this reporting period, we have observed that chemists are beginning to take advantage of this new approach to problem solving because of high-throughput experimentation, which is a result of the proliferation of microreactors. Microreactors allow complex chemistries to be conducted at a miniature scale at relatively low temperatures. The favorable kinetics and increased yield and efficiency of microreactor systems promise a bold change in the traditional chemical and pharmaceutical manufacturing processes. Potyrailo (A20) has shown that conditions for polymerization reactions can be optimized. A 96-microreactor array for combinatorial screening of new catalysts was used, and the properties of the polymer measured in situ were correlated to the polymer formulation and reaction conditions using the appropriate multivariate optimization function. Strategies for developing new high-throughput screening tools and multivariate methods for prediction of material properties and determination of the contributing factors to combinatorial-scale chemical reactions have been discussed by Potyrailo (A21) and Tuchbreiter (A22).

Chemometrics is an application-driven field. Any review of this field cannot and should not be formulated without focusing on so-called novel or exciting applications. Therefore, this review has been divided into three sections, with each section corresponding to an application area that has been judged to be exciting or hot. The criteria used to select these application areas are based in part on the number of literature citations uncovered during the search and in part on the perceived impact that developments in these areas will have on chemometrics and analytical chemistry. The three application areas highlighted in this review are image analysis, sensors, and microarrays. Two of the three areas were highlighted in the previous review.

Image analysis attempts to exploit the power gained by interfacing human perception with cameras and imaging systems. It is the interface between data and the human operator. Insight into chemical and physical phenomena can be garnered in this way, and the current superiority of humans over computers at pattern recognition provides a strong argument for developing chemometric tools for imaging. These include tools for interpretation, creation, or extraction of virtual images from real data, data compression and display, image enhancement, and three-dimensional views into structures and mixtures.

Chemometrics has an even greater potential to improve sensor performance than miniaturization of hardware. Fast computations combined with multivariate sensor data can provide the user with continuous feedback control information for both the sensor and process diagnostics.
The sensor can literally become a self-diagnosing entity, flagging unusual data that arise from a variety of sources including sensor malfunction, process disruption, unusual events, or sampling issues.

Microarrays have allowed the expression level of thousands of genes or proteins to be measured simultaneously. Data sets generated by these arrays consist of a small number of observations (e.g., 20-100 samples) on a very large number of variables (e.g., 10 000 genes or proteins). The observations in these data sets often have other attributes associated with them, such as a class label denoting the pathology of the subject. Finding genes or proteins that are correlated to these attributes is often a difficult task since most of the variables do not contain information about the pathology and as such can mask the identity of the relevant features. The development of better algorithms to analyze and to visualize expression data and to integrate it with other information is crucial to making expression data more amenable to interpretation. We would like to be able to analyze the large arrays of data from a microarray experiment at an intermediate level using pattern recognition techniques for interpretation. At the very least, such an analysis could identify those genes worthy of further study among the thousands of genes already known.

Other potential focal topics that will not be treated in great detail in this review but are worthy of mention include estimation of kinetic rate constants, protein folding, DNA hybridization, and metabonomics. Brereton (A23) discusses the relative merits of a variety of methods to estimate rate constants from spectral data. Smilde (A24) investigated constrained least squares as one approach to improve the accuracy of the estimation and concluded that using constraints does not necessarily result in an improvement in the accuracy of the rate constant estimate. Rutan (A25) was able to successfully resolve the reactant, product, and intermediate spectra and determine the rate constant for the degradation of an herbicide using NMR and alternating least squares. Olivieri (A26) used both alternating least squares and parallel factor analysis to determine second-order rate constants for two pesticides: carbaryl and chlorpyrifos. Using iterative target testing factor analysis, Zhu (A27) was able to resolve two-way kinetic spectral data. Tauler (A28-A31) applied multivariate curve resolution with alternating least squares to study intermediate species in protein-folding processes, monitor temperature-dependent protein structural transitions, and study nucleic acid melting and salt-induced transitions. Rutan (A32) and Kvalheim (A33) studied the self-association of alcohols (methanol, propanol, butanol, pentanol, hexanol, heptanol) by infrared and Raman spectroscopy, using alternating least squares, evolving factor analysis, iterative target testing factor analysis, and orthogonal projection to resolve the spectra and determine concentration profiles as a function of composition.

Metabonomics, which is a rapidly emerging field of research combining sophisticated analytical instrumentation such as NMR with multivariate statistical analysis to generate complex metabolic profiles of biofluids and tissues, received considerable attention during this reporting period, as evidenced by the large number of publications on this subject.
There were several reviews (A34-A36) published on the chemometric contributions to the evolution of the field, with emphasis on characterizing and interpreting complex biological NMR data using pattern recognition techniques. Defernez (A37) used principal component analysis to investigate whether there are factors that may affect the NMR spectra in a way that subsequently decreases the robustness of the metabolic fingerprint. Nicholson (A38) showed that discriminant PLS with orthogonal signal correction was effective at removing confounding variation obscuring subtle changes in NMR profile data. Holmes (A39) also discussed multivariate techniques that may be useful for minimizing confounding biological and analytical noise present in the metabolic data. The analytical reproducibility of proton NMR for metabolic fingerprinting was investigated by Nicholson (A40), who used principal component analysis to evaluate the effect that different spectrometers at different operating frequencies had on the observed profiles.

During this reporting period, there were several unique and innovative applications of chemometrics that do not fit in a particular category but should be reported to the community. They include the use of principal components to reduce the combinatorial explosion of possibilities in conformational analysis of organic molecules (A41), the monitoring of the conservation state of wooden boards from the 16th century based on their periodically collected Raman spectra (A42), the assessment of the structural similarity of G-protein coupled receptors using principal property descriptors to characterize their amino acid sequences (A43, A44), and the use of wavelets and principal component analysis to eliminate instrumental variation in peptide maps obtained by liquid chromatography (A45).
Analytical Chemistry, Vol. 76, No. 12, June 15, 2004
IMAGE ANALYSIS
Chemical imaging is a combination of molecular spectroscopy and digital imaging. Data sets generated by chemical imaging are large, are multivariate, and require significant processing. Review articles on near-infrared and Raman spectroscopy for chemical imaging have appeared in the literature (B1, B2). Segmentation and classification tasks can be impeded by the high dimensionality of the data. Willse (B3) proposes multivariate methods based on Poisson and multinomial mixture models to segment SIMS images into chemically homogeneous regions. Fulghum (B4) demonstrates that additional information can be obtained from XPS imaging data when multivariate methods are applied. Ruckebusch (B5) discussed the use of time-resolved step-scan FT-IR and chemometrics to study the photocycle of bacteriorhodopsin. Three-dimensional data recorded over time were suitably unfolded and studied using principal component analysis, evolving factor analysis, and multivariate curve resolution. Transient intermediates formed in the time domain were identified. Alternating least squares was used by Sum (B6) to extract concentration profiles and individual spectra from FT-IR images of in situ plant tissue. Hancewicz (B7) discusses the use of confocal Raman spectroscopy and self-modeling curve resolution to measure the concentrations of phase-separated biopolymers in foods. The use of chemometrics to analyze descriptive image information in pharmaceutical powder technology and pharmaceutical process control has been investigated by Laitinen (B8) and Tauler (B9). Multivariate curve resolution has played an important role in analyzing image data. Tauler (B10) has reviewed the contribution of this methodology to unraveling multicomponent processes and mixtures from images.
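As a rough illustration of how multivariate curve resolution by alternating least squares works, the sketch below factors a simulated data matrix D (rows are measurements, columns are spectral channels) into nonnegative concentration profiles C and spectra S with D ≈ CS. It is a minimal sketch under stated assumptions and not the algorithm of any cited paper: nonnegativity is imposed by simple clipping rather than a true nonnegative least-squares step, the initial spectra are taken from the first and last rows of D, and the function name `mcr_als` and the simulated first-order reaction are hypothetical.

```python
import numpy as np


def mcr_als(D, S_init, n_iter=50):
    """Alternate least-squares updates of C and S so that D is approximated by C @ S.

    Nonnegativity is imposed by clipping negative values to zero, a simple
    stand-in for a proper nonnegative least-squares solver.
    """
    S = S_init.copy()
    for _ in range(n_iter):
        # With the spectra fixed, solve D = C S for the concentrations C.
        C = np.clip(D @ np.linalg.pinv(S), 0.0, None)
        # With the concentrations fixed, solve D = C S for the spectra S.
        S = np.clip(np.linalg.pinv(C) @ D, 0.0, None)
    return C, S


# Simulated first-order reaction A -> B followed at 40 time points (noise-free).
time = np.linspace(0.0, 5.0, 40)
C_true = np.column_stack([np.exp(-time), 1.0 - np.exp(-time)])
channels = np.linspace(0.0, 1.0, 60)
S_true = np.vstack([np.exp(-((channels - 0.35) / 0.12) ** 2),
                    np.exp(-((channels - 0.60) / 0.12) ** 2)])
D = C_true @ S_true

# Initial estimate of the spectra: the first and last measured spectra (an assumption).
S_init = D[[0, -1]]
C_est, S_est = mcr_als(D, S_init)

rel_error = np.linalg.norm(D - C_est @ S_est) / np.linalg.norm(D)
print(f"relative lack of fit: {rel_error:.2e}")
```

A small lack of fit does not guarantee that the recovered profiles equal the true ones: solutions of this kind are subject to rotational ambiguity, which is one reason constraints, selectivity, and convergence criteria receive so much attention in the literature.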
The influence of selectivity and sensitivity on detection limits in multivariate curve resolution, with iterative target testing factor analysis as the specific method studied, has been treated by Rodriguez-Cuesta (B11). Duponchel (B12) has investigated the influence of instrumental perturbations on the performance of widely used multivariate curve resolution methods. Van Benthem (B13) has reviewed the effect of equality constraints on the performance of alternating least squares. Hopke (B14) describes the development of a new convergence criterion for multivariate curve resolution, and Lavine (B15) describes a new method to perform multivariate curve resolution based on a Varimax extended rotation. Visser (B16) presents an information theoretical framework that can be used to extract pure component spectra from images without prior knowledge of the system under investigation, and Sin (B17) discusses a new spectral reconstruction algorithm based on maximum entropy. Larsen (B18) discusses the use of maximum autocorrelation factors to extract information from images where there is an ordering of objects.

Multiway methods also play an important role in the analysis of image data. Esbensen (B19) provides an overview of multiway methods. Object-oriented data modeling (B20), which can provide a framework for multiway methods based on the PLS paradigm, is treated by Esbensen in a separate publication. Smilde (B21) also offers a framework for sequential multiblock component methods to study complex data sets. Rutan (B22) describes an improvement in the three-way alternating least squares multivariate curve resolution algorithm that makes use of the recently introduced multidimensional arrays of MATLAB. Gurden (B23) discusses principal component analysis and parallel factor analysis for the analysis of both single images and movies, with similarities and differences between the two methods highlighted. A problem in multiway analysis is the estimation of chemical rank. Xie describes two approaches for tackling this problem: two-mode subspace comparison (B24) and principal norm vector orthogonal projection (B25). Jack-knife techniques for the detection of outliers, which can be deleterious to the performance of parallel factor analysis and related methods, are described by Bro (B26).

SENSORS
During this reporting period, there have been a large number of papers published on the applications of chemometrics to sensors. A brief survey of the more interesting applications is provided in this section. Many of the applications have focused on detection of biological organisms. Fry (C1) has developed a microporous polyethylene disposable optical film that is mostly transparent to IR light to characterize bacterial strains by FT-IR for subsequent classification by principal component analysis and hierarchical clustering. Bacterial cultures are harvested and placed onto the film, where they are allowed to dry. The principal components obtained from mean-centered second-derivative spectra explain 98% of the total cumulative variance and provide sufficient information for classification. Goodacre (C2) showed that surface-enhanced Raman spectroscopy (SERS) on colloidal silver could be used to fingerprint whole bacteria and fungi. Discriminant analysis and hierarchical clustering identified patterns in the Raman spectra characteristic of the strain level of the particular organism. Raman spectra and pattern recognition techniques were also used to differentiate basal cell carcinoma from its surrounding noncancerous tissue (C3) and identify epithelial cancer cells (C4). Microorganisms on food surfaces could be differentiated using Fourier transform IR (C5).
A Mahalanobis distance metric was used to evaluate and quantify the statistical differences in the spectra of six different microorganisms.

Sensor applications involving the detection of specific compounds focused on sugars. Ben-Amotz (C6) demonstrated the feasibility of using Raman and PLS for classification and quantitation of oligosaccharides. Potentiometric assays were also developed to detect saccharides. PLS and multiple linear regression analysis were used to quantitate the responses of a potentiometric sensor array on a laboratory on a chip, with a correlation coefficient of 0.7 being obtained (C7). A glucose biosensor based on SERS was developed that relies on an alkanethiolate monolayer that acts as a partition layer preconcentrating the glucose. Chemometric analysis of the captured SERS spectra reveals that glucose can be reliably quantitated at physiological levels (C8). Noninvasive glucose monitoring with NIR diffuse reflectance spectroscopy remains an active, yet controversial, research area with a large and growing literature. Li (C9) has recently reviewed the instrumental precision required to achieve this goal as well as the biological complexity of this problem. An alternative to noninvasive glucose monitoring for diabetics is monitoring the changes in tear proteins from diabetic patients. Changes in protein patterns measured by electrophoretic methods may contain information about glucose levels, based on the Wilks lambda test (C10). Despite the large number of failed attempts to solve the noninvasive glucose problem for insulin dosing, this commercially and medically important application continues to receive funding due to its market attractiveness for investors.

Many of the citations on the application of chemometrics to sensors have focused on improving sensor performance. Brown (C11) was able to use wavelet analysis to remove a nonconstant, varying spectroscopic background from near-IR data, leading to a simpler and more parsimonious multivariate linear model. Signal denoising and baseline correction using discrete wavelets were also demonstrated in a study on microchip electrophoresis. Liu (C12) was able to show that baseline drift, which is a frequently occurring problem with chip devices, can be circumvented. The fast wavelet transform through the WILMA algorithm has also been coupled with multiple linear regression analysis and partial least squares for the selection of optimal regression models. Using this approach, Cocchi (C13) was able to improve the predictive ability of regression models. The wavelet coefficients that primarily contained noise were discarded, and the remaining coefficients were used for spectral reconstruction.

There were other approaches taken during this reporting period to improve the signal-to-noise ratios of the data. Martens (C14) developed a method to prewhiten spectra, which makes the instrument blind to certain interferences while retaining its analytical sensitivity. The method consists of shrinking the multidimensional data space of the spectra in the off-axis directions corresponding to the spectra of the interferences. A nuisance covariance matrix is developed, and each spectrum is multiplied by the square root of the matrix. Vogt (C15) proposed the idea of secured principal components for detecting and correcting calibration models that fail because of uncalibrated spectral features. The proposed algorithm searches for these features and corrects them in the disturbed sample. Esbensen (C16) took a different approach to robustifying a multivariate calibration model.
Esbensen's approach uses latent variable modeling and Kalman filter theory as a means of optimizing the PLS and PCR predictors used in calibrations. Piovoso (C17) was concerned about the deleterious effects that multivariate outliers have on a calibration model and focused on outlier replacement in the score space generated by principal component analysis of the data. Wavelength selection can also improve the performance of a PLS calibration model. Although genetic algorithms have been used to identify the most informative features, overfitting remains a problem. Olivieri (C18) presented a new procedure that involves iterative reinitialization of the genetic algorithm based on a statistical analysis of the data. Monte Carlo simulations using a theoretical three-component system illustrate how partial least squares regression greatly benefits from variable selection when the analyte of interest is a minor component.
MICROARRAYS
The application of microarrays to toxicology, medicine, and biology promises to revolutionize these fields. Salter has stated in a recent review article that the choice and validation of the statistical methods used to analyze the data are crucial to the success of this field (D1). Microarray experiments can generate enormous amounts of data. These large data sets are complex, and the relevant information they contain may be difficult to access. Analysis of the data to find the genes that are under- and overexpressed may involve hypothesis testing or pattern recognition to correlate genes with specific class labels. Morrison (D2)
in his review of this subject stated that preprocessing the data is crucial to ensure extraction of relevant information. Correlation among covariates is a serious problem that can confound the analysis of microarray data, which is why some workers advocate the use of the Mahalanobis distance to compare vectors of gene expression (D3). Missing expression values are another problem that can confound an analysis of gene expression data. Oba (D4) has developed a method to estimate missing values using a Bayesian network to implement principal component analysis. Cross-platform comparisons of microarray data are desirable and important for the rapid development of this technology. However, these comparisons usually require first obtaining a list of expression data common to all arrays and then comparing the data in this subset. Culhane (D5) has developed a procedure called co-inertia analysis that identifies trends or co-relationships in multiple data sets that contain the same samples.
Many of the methods used to analyze microarray data during this reporting period involved principal component analysis. Wall (D6) showed that singular value decomposition is able to detect patterns in noisy data sets. Musumarra (D7) demonstrated that SIMCA and discriminant PLS are able to provide bioinformatic clues about tumor histotypes. This study was performed using the National Cancer Institute gene expression database. Landgrebe (D8) demonstrated that principal component analysis and permutation-validated principal component analysis can be used to compare gene expression profiles with respect to different conditions and to select genes that may prove interesting to investigate. Berglund (D9) showed that PLS has advantages in analyzing microarray data since it can model data sets with large numbers of variables and with few observations.
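The remark above that PLS can model data sets with many variables and few observations can be illustrated with a minimal sketch. It uses the standard NIPALS PLS1 algorithm on synthetic data and is not the procedure of (D9). The function names `pls1_fit` and `pls1_predict`, the data dimensions, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model with the NIPALS PLS1 algorithm."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xk = X - x_mean          # deflated copy of the centered predictors
    yk = y - y_mean          # deflated copy of the centered response
    weights, loadings, coefs = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                    # direction of maximum covariance
        w /= np.linalg.norm(w)
        t = Xk @ w                       # scores
        tt = t @ t
        p = Xk.T @ t / tt                # X loadings
        q = yk @ t / tt                  # y loading
        Xk = Xk - np.outer(t, p)         # deflate X
        yk = yk - q * t                  # deflate y
        weights.append(w)
        loadings.append(p)
        coefs.append(q)
    W = np.array(weights).T
    P = np.array(loadings).T
    q = np.array(coefs)
    # Regression vector in the original variable space: W (P^T W)^-1 q.
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, x_mean, y_mean


def pls1_predict(model, X):
    """Predict the response for new rows of X."""
    beta, x_mean, y_mean = model
    return (X - x_mean) @ beta + y_mean


# Synthetic example: 12 observations, 200 variables, one latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=12)
profile = rng.normal(size=200)
X = np.outer(latent, profile) + 0.01 * rng.normal(size=(12, 200))
y = 3.0 * latent + 1.0
model = pls1_fit(X, y, n_components=1)
print("largest training residual:", np.abs(pls1_predict(model, X) - y).max())
```

Ordinary least squares has no unique solution here because there are far more variables than observations. PLS avoids the problem by regressing on a small number of score vectors, and in practice the number of components would be chosen by cross-validation.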
In the PLS study by Berglund (D9), a response model describing the expression profile over time expected for periodically transcribed genes was derived and used to identify budding yeast transcripts with similar profiles. Shu (D10) showed that kernel density methods could be used in unsupervised learning (clustering) of gene expression profile data. The kernel density method demonstrated excellent performance in recovering clusters and in grouping large data sets into compact and well-isolated clusters. The method was more robust than K-means.
In conclusion, the field of chemometrics is well positioned to offer solutions to a variety of important multivariate problem-solving issues facing science and industry in the 21st century. The ever-expanding endeavors of imaging, sensor development, chemoinformatics, combinatorial chemistry, and bioinformatics will all prove to be challenging opportunities for new scientific insights and improved processes.
Barry K. Lavine is an Associate Professor of Chemistry at Clarkson University in Potsdam, NY. He has published more than 90 papers in chemometrics and is on the editorial board of several journals. He is also Assistant Editor of Chemometrics for Analytical Letters. His research interests encompass many aspects of the application of computers to chemical analysis, including multivariate curve resolution, pattern recognition, and multivariate calibration using genetic algorithms and other evolutionary techniques.
Jerome (Jerry) J. Workman, Jr. is Chief Technical Officer and Vice President of Research & Engineering at Argose Inc., Waltham, MA. In his career, Workman has focused on molecular and electronic spectroscopy and chemometrics and has received many key awards for his work. Over the past twenty-five years he has published widely, including numerous tutorials, scientific papers and book chapters, individual text volumes, software programs, and inventions.
LITERATURE CITED
(A1) Lavine, B. K.; Workman, J. Anal. Chem. 2002, 74 (12), 2763-2769.
(A2) Wold, S. Chemom. Intell. Lab. Syst. 1995, 30 (1), 109-115.
(A3) Feudale, R. N.; Woody, N. A.; Tan, H.; Myles, A. J.; Brown, S. D.; Ferre, J. Chemom. Intell. Lab. Syst. 2002, 64 (2), 181-192.
(A4) Lima, F. S. G.; Borges, L. E. P. J. Near Infrared Spectrosc. 2002, 10 (4), 269-278.
(A5) Greensill, C. V.; Walsh, K. B. J. Near Infrared Spectrosc. 2002, 10 (1), 27-35.
(A6) Zhang, L.; Small, G. W.; Arnold, M. A. Anal. Chem. 2003, 75 (21), 5905-5915.
(A7) Tan, H.; Brown, S. D. Anal. Chim. Acta 2003, 490 (1-2), 291-301.
(A8) Galvao, R. K. H.; Jose, G. E.; Dantas Filho, H. A.; Araujo, M. C. U.; Paiva, H. M.; Saldanha, T. C. B.; Nunes de Souza, E. S. O. Chemom. Intell. Lab. Syst. 2004, 70 (1), 1-10.
(A9) Tong, W.; Welsh, W. J.; Shi, L.; Fang, H.; Perkins, R. Environ. Toxicol. Chem. 2003, 22 (8), 1680-1695.
(A10) Eriksson, L.; Andersson, P. L.; Johansson, E.; Tysklind, M. J. Chemom. 2002, 16 (8-10), 497-509.
(A11) Livingstone, D. J.; Manallack, D. T. QSAR Comb. Sci. 2003, 22 (5), 510-518.
(A12) Norinder, U.; Haeberlein, M. Methods Principles Med. Chem. 2003, 18, 358-405 (Drug Bioavailability).
(A13) Gramatica, P.; Papa, E.; Francesca, B. Anal. Chem. 2004, 84 (1-3), 65-74.
(A14) Hajduk, P. J.; Mendoza, R.; Petros, A. M.; Huth, J. R.; Bures, M.; Fesik, S. W.; Martin, Y. C. J. Comput. Aided Mol. Des. 2003, 17 (2-4), 93-102.
(A15) Lu, Q.; Yu, R.; Shen, G. J. Mol. Catal. A: Chem. 2003, 198 (1-2), 9-22.
(A16) Stiefl, N.; Baumann, K. J. Med. Chem. 2003, 46 (8), 1390-1407.
(A17) Bergstroem, C. A.; Strafford, M.; Lazorova, L.; Avdeef, A.; Luthman, K.; Artursson, P. J. Med. Chem. 2003, 46 (4), 558-570.
(A18) Patankar, S. J.; Jurs, P. C. J. Chem. Inf. Comput. Sci. 2002, 42 (5), 1053-1068.
(A19) Lavine, B. K.; Davidson, C. E.; Breneman, C.; Katt, W. J. Chem. Inf. Comput. Sci. 2003, 43 (6), 1890-1905.
(A20) Potyrailo, R. A.; Wroczynski, R. J.; Lemmon, J. P.; Flanagan, W. P.; Siclovan, O. P. J. Comb. Chem. 2003, 5 (1), 8-17.
(A21) Potyrailo, R. A. Proc. SPIE 2002, 4578, 366-377 (Fiber Optic Sensor Technology and Applications 2001).
(A22) Tuchbreiter, A.; Marquardt, J.; Kappler, B.; Honerkamp, J.; Kristen, M. O.; Mulhaupt, R. Macromol. Rapid Commun. 2003, 24 (91), 47-62.
(A23) Thurston, T. J.; Brereton, R. G.; Foord, D. J.; Escott, R. E. A. J. Chemom. 2003, 17 (6), 313-322.
(A24) Bijlsma, S.; Boelens, H. F. M.; Hoefsloot, H. C. J.; Smilde, A. K. J. Chemom. 2002, 16 (1), 28-40.
(A25) Bezemer, E.; Rutan, S. Anal. Chim. Acta 2002, 459 (2), 277-289.
(A26) Espinosa-Mansilla, A.; de la Pena, A. M.; Goicoechea, H. C.; Olivieri, A. C. Appl. Spectrosc. 2004, 58 (1), 83-90.
(A27) Zhu, Z. L.; Cheng, W.-Z.; Zhao, Y. Chemom. Intell. Lab. Syst. 2002, 64 (2), 157-167.
(A28) Navea, S.; de Juan, A.; Tauler, R. Anal. Chem. 2002, 74 (23), 6031-6039.
(A29) Navea, S.; de Juan, A.; Tauler, R. Anal. Chem. 2003, 75 (20), 5592-5601.
(A30) Jaumot, J.; Escaja, N.; Gargallo, R.; Gonzalez, C.; Pedroso, E.; Tauler, R. Nucleic Acids Res. 2002, 30 (17), 9-18.
(A31) Jaumot, J.; Avino, A.; Eritja, R.; Tauler, R.; Gargallo, R. J. Biomol. Struct. Dyn. 2003, 21 (2), 267-278.
(A32) Holden, C. A.; Hunnicutt, S.; Sanchez-Ponce, R.; Craig, J. M.; Rutan, S. C. Appl. Spectrosc. 2003, 57 (5), 483-490.
(A33) Stordrange, L.; Christy, A. A.; Kvalheim, O. M.; Shen, H.; Liang, Y. J. Phys. Chem. A 2002, 106 (37), 7543-8553.
(A34) Mendes, P. Briefings Bioinf. 2002, 3 (2), 134-145.
(A35) Holmes, E.; Antti, H. Analyst 2002, 127 (12), 1549-1557.
(A36) Shockcor, J. P.; Holmes, E. Curr. Top. Med. Chem. 2002, 2 (1), 35-51.
(A37) Defernez, M.; Colquhoun, I. J. Phytochemistry 2003, 62 (6), 1009-1017.
(A38) Gavaghan, C. L.; Wilson, I. D.; Nicholson, J. K. FEBS Lett. 2002, 530 (1-3), 191-196.
(A39) Holmes, E. Abstr. Pap., 226th ACS National Meeting, New York, September 7-11, 2003; Paper ANYL-209.
(A40) Keun, H. C.; Ebbels, T. M. D.; Antti, H.; Bollard, M. E.; Beckonert, O.; Schlotterbeck, G.; Senn, H.; Niederhauser, U.; Holmes, E.; Lindon, J. C.; Nicholson, J. K. Chem. Res. Toxicol. 2002, 15 (11), 1380-1386.
(A41) Bruni, A. T.; Leite, V. B. P.; Ferreira, M. M. C. J. Comput. Chem. 2002, 23 (2), 222-236.
(A42) Marengo, E.; Robotti, E.; Liparota, M. C.; Gennaro, M. C. Anal. Chem. 2003, 75 (20), 5567-5574.
(A43) Gunnarsson, I.; Andersson, P.; Wikberg, J.; Lundstedt, T. J. Chemom. 2003, 17 (1), 82-92.
(A44) Lapinsh, M.; Gutcaits, A.; Prusis, P.; Post, C.; Lundstedt, T.; Wikberg, J. E. S. Protein Sci. 2002, 11 (4), 795-805.
(A45) Andersson, F. O.; Kaiser, R.; Jacobsson, S. P. J. Pharm. Biomed. Anal. 2004, 34 (3), 531-541.
IMAGE ANALYSIS
(B1) Koehler, F. W.; Lee, E.; Kidder, L. H.; Lewis, E. N. Spectrosc. Eur. 2002, 14 (3), 12-19.
(B2) Shafer-Peltier, K. E.; Haka, A. S.; Motz, J. T.; Fitzmaurice, M.; Dasari, R. R. J. Cell. Biochem. 2002, 39, 125-137.
(B3) Willse, A.; Tyler, B. Anal. Chem. 2002, 74 (24), 6314-6322.
(B4) Artyushkova, K.; Fulghum, J. E. Surf. Interface Anal. 2002, 33 (3), 185-195.
(B5) Ruckebusch, C.; Duponchel, L.; Somberet, B.; Huvenne, J. P.; Saurina, J. J. Chem. Inf. Comput. Sci. 2003, 43 (6), 1966-1973.
(B6) Budevska, B. O.; Sum, S. T.; Jones, T. J. Appl. Spectrosc. 2003, 57 (2), 124-131.
(B7) Pudney, P. D.; Hancewicz, T. M.; Cunningham, D. G.; Gray, C. Food Hydrocolloids 2003, 17 (3), 345-353.
(B8) Laitinen, N.; Antikainen, O.; Rantanen, J.; Yliruusi, J. J. Pharm. Sci. 2004, 93 (1), 165-176.
(B9) de Juan, A.; Tauler, R.; Dyson, R.; Marcolli, C.; Rault, M.; Maeder, M. TrAC, Trends Anal. Chem. 2004, 23 (1), 70-79.
(B10) de Juan, A.; Tauler, R. Anal. Chim. Acta 2003, 500 (1-2), 195-210.
(B11) Rodriguez-Cuesta, M. J.; Boque, R.; Xavier, R. Anal. Chim. Acta 2003, 476 (1), 111-122.
(B12) Duponchel, L.; Elmi-Rayaleh, W.; Ruckebusch, C.; Huvenne, J. P. J. Chem. Inf. Comput. Sci. 2003, 43 (6), 2057-2067.
(B13) Van Benthem, M. H.; Keenan, M. R.; Haaland, D. M. J. Chemom. 2002, 16 (12), 613-622.
(B14) Gan, F.; Hopke, P. K. Anal. Chim. Acta 2003, 495 (1-2), 195-203.
(B15) Lavine, B. K.; Ritter, J. P.; Voigtman, E. Microchem. J. 2002, 72 (2), 163-178.
(B16) Visser, E.; Lee, T.-W. Chemom. Intell. Lab. Syst. 2004, 70 (2), 147-155.
(B17) Sin, S. Y.; Widjaja, E.; Yu, L. E.; Garland, M. J. Raman Spectrosc. 2003, 34 (10), 795-805.
(B18) Larsen, R. J. Chemom. 2002, 16 (8-10), 427-435.
(B19) Huang, J.; Wium, H.; Qvist, K. B.; Esbensen, K. H. Chemom. Intell. Lab. Syst. 2003, 66 (2), 141-158.
(B20) Esbensen, K. H.; Hoskuldsson, A. J. Chemom. 2003, 17 (1), 34-44.
(B21) Smilde, A. K.; Westerhuis, J. A.; de Jong, S. J. Chemom. 2003, 17 (6), 323-337.
(B22) Bezemer, E.; Rutan, S. C. Chemom. Intell. Lab. Syst. 2002, 60 (1-2), 239-251.
(B23) Gurden, S. P.; Lage, E. M.; de Faria, C. G.; Joekes, I.; Ferreira, M. M. C. J. Chemom. 2003, 17 (7), 400-412.
(B24) Xie, H.-P.; Jiang, J.-H.; Long, N.; Shen, G.-L.; Wu, H.-L.; Yu, R.-Q. Chemom. Intell. Lab. Syst. 2003, 66 (2), 101-115.
(B25) Xie, H.-P.; Jiang, J.-H.; Shen, G.-L.; Yu, R.-Q. Comput. Chem. 2002, 26 (2), 183-190.
(B26) Riu, J.; Bro, R. Chemom. Intell. Lab. Syst. 2003, 65 (1), 35-49.
SENSORS
(C1) Mossoba, M. M.; Khambaty, F. M.; Fry, F. S. Appl. Spectrosc. 2002, 56 (6), 732-736.
(C2) Jarvis, R. M.; Goodacre, R. Anal. Chem. 2004, 76 (1), 40-47.
(C3) Nijssen, A.; Schut, T. C. B.; Heule, F.; Caspers, P. J.; Hayes, D. P.; Neumann, M. H. A.; Puppels, G. J. J. Invest. Dermatol. 2002, 119 (1), 64-69.
(C4) Stone, N.; Kendall, C.; Smith, J.; Crow, P.; Barr, H. Faraday Discuss. 2004, 126, 141-157 (Applications of Spectroscopy to Biomedical Problems, 2003).
(C5) Yang, H.; Irudayaraj, J. J. Mol. Struct. 2003, 646 (1-3), 35-43.
(C6) Mrozek, M. F.; Zhang, D.; Ben-Amotz, D. Carbohydr. Res. 2004, 339 (1), 141-145.
(C7) Aoki, K.; Uchida, H.; Katsube, T.; Ishimaru, Y.; Iida, T. Anal. Chim. Acta 2002, 471 (1), 3-12.
(C8) Yonzon, C. R.; Haynes, C. L.; Zhang, X.; Walsh, J. T., Jr.; Van Duyne, R. P. Anal. Chem. 2004, 76 (1), 78-85.
(C9) Li, Q.; Hu, X.; Xu, K. Proc. SPIE-Int. Soc. Opt. Eng. 2002, 4916, 465-472 (Optics in Health Care and Biomedical Optics).
(C10) Grus, F. H.; Sabuncuo, P.; Dick, H. B.; Augustin, A. J.; Pfeiffer, N. BMC Ophthalmol. 2002, 2.
(C11) Tan, H.-W.; Brown, S. D. J. Chemom. 2002, 16 (5), 228-240.
(C12) Liu, B.-F.; Sera, Y.; Matsubara, N.; Otsuka, K.; Terabe, S. Electrophoresis 2003, 24 (18), 3260-3265.
(C13) Cocchi, M.; Seeber, R.; Ulrici, A. J. Chemom. 2003, 17 (8-9), 512-527.
(C14) Martens, H.; Hoy, M.; Wise, B. M.; Bro, R.; Brockhoff, P. B. J. Chemom. 2003, 17 (3), 153-165.
(C15) Vogt, F.; Mizaikoff, B. J. Chemom. 2003, 17 (4), 225-236.
(C16) Ergon, R.; Esbensen, K. H. J. Chemom. 2002, 16 (8-10), 401-407.
(C17) Hoo, K. A.; Tvarlapati, K. J.; Piovoso, M. J.; Hajare, R. Comput. Chem. Eng. 2002, 26 (1), 17-39.
(C18) Goicoechea, H. C.; Olivieri, A. C. J. Chem. Inf. Comput. Sci. 2002, 42 (5), 1146-1153.
MICROARRAYS
(D1) Salter, A. H.; Nilsson, K. C. Curr. Opin. Drug Discovery Dev. 2003, 6 (1), 117-122.
(D2) Morrison, D. A.; Ellis, J. T. DNA Cell Biol. 2003, 22 (6), 357-394.
(D3) Chilingaryan, A.; Gevorgyan, N.; Vardanyan, A.; Jones, D.; Szabo, A. Math. Biosci. 2002, 176 (1), 59-69.
(D4) Oba, S.; Sato, M.-A.; Takemasa, I.; Monden, M.; Matsubara, K.-I.; Ishii, S. Bioinformatics 2003, 19 (16), 2088-2096.
(D5) Culhane, A. C.; Perriere, G.; Higgins, D. G. BMC Bioinf. 2003, 4.
(D6) Wall, M. E.; Rechtsteiner, A.; Rocha, L. M. Practical Approach Microarray Data Anal. 2003, 91-109.
(D7) Musumarra, G.; Barresi, V.; Condorelli, D. F.; Scire, S. Biol. Chem. 2003, 384 (2), 321-327.
(D8) Landgrebe, J.; Wurst, W.; Welzl, G. Genome Biol. 2002, 3 (4).
(D9) Johansson, D.; Lindgren, P.; Berglund, A. Bioinformatics 2003, 19 (4), 467-473.
(D10) Shu, G.; Zeng, B.; Chen, Y. P.; Smith, O. H. Comp. Funct. Genomics 2003, 4 (3), 287-299.
AC040053P
Analytical Chemistry, Vol. 76, No. 12, June 15, 2004