Molecular Factor Analysis Applied To Collections of NMR Spectra

Anal. Chem. 2004, 76, 1982-1990

Charles D. Eads,* Carrie M. Furnish, Isao Noda, Kenton D. Juhlin, Dale A. Cooper, and Stephen W. Morrall

Miami Valley Laboratories, The Procter & Gamble Company, P.O. Box 538707, Cincinnati, Ohio 45253

It is often useful to identify and quantify mixture components by analyzing collections of NMR spectra. Such collections arise in metabonomics and many other applications. Many mixtures studied by NMR can contain hundreds of compounds, and it is challenging to analyze the resulting complex spectra. We have approached the problem of separating signals from different molecules in complex mixtures by using self-modeling curve resolution as implemented by the alternating least-squares algorithm. Alternating least squares uses nonnegativity criteria to generate spectra and concentrations from a collection of mixture spectra. Compared to previous applications of alternating least squares, NMR spectra of complex mixtures possess unique features, such as large numbers of components and sample-to-sample variability in peak positions. To deal with these features, we developed a set of data preprocessing methods, and we made modifications to the alternating least-squares algorithm. We use the term “molecular factor analysis” to refer to the preprocessing and modified alternating least-squares methods. Molecular factor analysis was tested using an artificial data set and spectra from a metabonomics study. The results show that the tools can extract valuable information on sample composition from sets of NMR spectra.

This report describes a set of techniques that help to identify and quantify molecules in mixtures by analyzing collections of NMR spectra. Metabonomics1,2 provides an important and fruitful application of these techniques. Metabonomics seeks to evaluate the physiological responses of organisms that have been dosed with drugs or toxicants or whose metabolism has been somehow perturbed. The basic experimental strategy in metabonomics is to collect samples of biofluids such as urine or blood plasma from animals subjected to these perturbations. The biofluid samples are usually analyzed by NMR spectroscopy.
A wide array of multivariate statistical pattern recognition methods is then used to interpret metabonomic data. Metabonomics is especially promising in drug and toxicology research. In these areas, the multivariate statistical pattern recognition approach can be used to screen compounds for safety and physiological impact. This work can be done rapidly and at an early stage in the discovery and development process, potentially leading to significant cost and time savings.

Similar methods can be used for NMR spectra of other kinds of mixtures, so the technology is broadly applicable. For example, NMR and chemometric strategies can be used to evaluate the compositions of mixtures generated by consumers using various products, by ecosystems, by industrial production facilities, etc.

A number of techniques are available for gaining information on the composition and behavior of complex mixtures based on their NMR spectra. For example, chromatographic analysis prior to NMR spectroscopy (LC NMR) is used to separate individual molecules so that their individual spectra can be identified. Diffusion-ordered spectroscopy3,4 can be used to separate the signals from different molecules in a mixture based on their differing diffusion coefficients. In the area of metabonomics, practitioners typically use statistical pattern recognition tools5 including principal components analysis (PCA), partial least-squares analysis (PLS), and many others. The goal of such analyses is often to classify samples according to toxin, dose, time elapsed since treatment, or mechanism of action. Once a successful statistical discrimination and classification of samples is achieved, it is frequently useful to identify the molecules responsible for the discrimination and classification. It is also frequently useful to estimate the concentrations of these molecules in the samples. This information can be used, for example, to identify biomarkers of diseases and to gain detailed mechanistic information on drugs and toxicants. Easy compound identification is among the most important challenges facing practitioners of metabonomics.

* To whom correspondence should be addressed. E-mail: [email protected]. (1) Nicholson, J. K.; Lindon, J. C.; Holmes, E. Xenobiotica 1999, 29, 1181-1189. (2) Nicholson, J. K.; Connelly, J.; Lindon, J. C.; Holmes, E. Nat. Rev. Drug Discov. 2002, 1, 153-161.

1982 Analytical Chemistry, Vol. 76, No. 7, April 1, 2004

Background and Motivation for Molecular Factor Analysis of NMR Data.
NMR data analysis using multivariate statistics often begins by arranging experimental spectra as the rows of a data matrix, D. D is then expressed as the product of a scores matrix, S, and a loadings matrix, L. The purpose of this bilinear decomposition is to gain useful and uniquely defined information from D. Many widely used statistical analysis tools are distinguished by their differing strategies and criteria for performing the data decomposition.

(3) Johnson, C. S., Jr. Prog. Nucl. Magn. Reson. Spectrosc. 1999, 34, 203-256. (4) Morris, K. F.; Johnson, C. S., Jr. J. Am. Chem. Soc. 1992, 114, 3139-3141. (5) Lindon, J. C.; Holmes, E.; Nicholson, J. K. Prog. Nucl. Magn. Reson. Spectrosc. 2001, 39, 1-40.

© 2004 American Chemical Society Published on Web 03/09/2004

Principal components analysis6 is a benchmark approach for achieving this decomposition. It is used widely in metabonomics studies. In PCA, the rows of L are chosen to be orthonormal basis vectors for the data matrix. All the spectra in the original data matrix can be reconstructed as linear combinations of the PCA loadings vectors. The scores matrix gives the amounts of each loadings vector in each spectrum. The principal components are sorted so that the first row accounts for the largest possible variance of D, the second row accounts for the second largest variance of D, etc. Robust numerical methods such as singular value decomposition (SVD) are available for finding these ordered, orthogonal loadings vectors and the corresponding scores.

PCA is useful because the factors that contribute to the variance in the data are often closely related to the independent variables characterizing the experimental design. Thus, the scores for different samples can frequently be used to classify samples. Also, it is often possible to use just a few principal components to account for the vast majority of the variance of a data matrix. This gives a significant reduction in the dimensionality of the problem.

However, the largest sources of variance are not always related to the independent variables. In metabonomics, for example, age, sex, diet, and time of day for sample collection7 can all influence the spectra of urine. These variables can dominate the variance, obscuring the relevant information in PCA analysis. It is especially difficult to control these variables in studies on humans. Also, PCA analysis does not necessarily produce loadings vectors that are easily related to the spectra of materials that are varying from sample to sample. The mathematical requirement that the loadings be orthogonal causes a mixing of signals from different molecules when there is peak overlap. Biofluids nearly always have signal overlap due to the large number of ingredients.
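To make the decomposition concrete, the SVD route to PCA described above can be sketched in Python with NumPy. The data matrix here is a synthetic stand-in for real spectra, and all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a data matrix D: 50 mixture "spectra" (rows), each a
# nonnegative combination of 3 underlying component "spectra".
true_L = rng.random((3, 200))
true_S = rng.random((50, 3))
D = true_S @ true_L

# PCA via singular value decomposition: D = U @ diag(s) @ Vt.  The rows
# of Vt are orthonormal loadings vectors; U scaled by s gives the scores.
U, s, Vt = np.linalg.svd(D, full_matrices=False)
scores = U * s
loadings = Vt

# Singular values come out sorted, so the leading components account for
# the largest share of the variation in D.
var_share = s**2 / np.sum(s**2)

# Every spectrum in D is reconstructed as a linear combination of the
# loadings vectors, exactly as described in the text.
assert np.allclose(scores @ loadings, D)
```

No mean-centering is applied in this sketch, which matches the MFA workflow discussed later in the paper; conventional PCA usually centers the columns of D first.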
This makes it more challenging to identify important compounds from the loadings vectors based on known chemical shifts. Since the loadings often represent mixtures of compounds, it is difficult to extract concentrations directly from PCA scores. For these cases, supervised techniques are available that circumvent some of these issues with PCA. Examples of supervised techniques used for NMR data analysis include partial least squares,8 orthogonal signal correction,9 and others.

One perspective on PCA and related techniques is that latent variables, usually called factors, underlie the data. A factor can be viewed as a fundamental physical cause of the variance in the data. The term factor analysis refers to the use of multivariate statistics to identify and quantify these factors.10 Factor analysis tries to construct a scores matrix that summarizes the amount of each factor contributing to each sample and a loadings matrix that describes the nature of the contributions each factor makes to the data set. For NMR spectroscopy, it is known that the sample-to-sample spectral variability is due to differing concentrations of molecules. From the perspective of factor analysis, an ideal method would associate the factors with the molecular components of the solutions. The scores would then correspond to the concentrations of those molecules, and the loadings would correspond to their spectra. Therefore, in molecular factor analysis (MFA), we attempt a decomposition of the data matrix such that each row of the loadings matrix is intended to correspond to the NMR spectrum of an individual mixture component, and each row of the scores matrix is intended to correspond to a list of concentrations of the individual components. The challenge is to make the loadings spectrum-like and the scores concentration-like. A variety of methods are available to achieve this.10 The remainder of this paper describes the use of one of these methods, alternating least squares, for evaluating NMR data.

(6) Otto, M. Chemometrics: Statistics and Computer Application in Analytical Chemistry; Wiley-VCH: Weinheim, Germany, 1999. (7) Bollard, M. E.; Holmes, E.; Lindon, J. C.; Mitchell, S. C.; Branstetter, D.; Zhang, W.; Nicholson, J. K. Anal. Biochem. 2001, 295, 194-202. (8) Geladi, P.; Kowalski, B. R. Anal. Chim. Acta 1986, 185, 1-17. (9) Gavaghan, C. L.; Wilson, I. D.; Nicholson, J. K. FEBS Lett. 2002, 530, 191-196. (10) Malinowski, E. R. Factor Analysis in Chemistry, 3rd ed.; John Wiley and Sons: New York, 2002.

METHODS

Alternating Least-Squares Algorithm for Molecular Factor Analysis. The aim is to decompose the m × n data matrix D as a product of an m × k scores (concentration) matrix, S, and a k × n loadings (component spectra) matrix, L.

D = SL

Here, m is the number of experimental spectra, n is the number of frequencies in each spectrum, and k is the number of factors used to describe the data. Since neither ingredient spectra nor concentrations are known at the outset, assumptions about the general properties of individual spectra and concentrations are imposed as constraints on the scores and loadings. One hopes that by properly choosing and implementing these constraints, the resulting scores and loadings will correspond well with actual concentrations and component spectra. In practice, the two matrices are derived from the data matrix by applying nonnegativity criteria to the concentrations and signal intensities and by attempting to find the simplest (fewest peaks) representations of the spectra consistent with nonnegativity. The nonnegativity constraints are imposed because NMR spectra are nonnegative under normal circumstances, and solute concentrations are always nonnegative. We find empirically that this approach does a good job of recovering concentrations and NMR spectra from real data sets for samples of known composition.

The algorithm we use, nonnegative alternating least squares (ALS), is well known in the chemometrics literature.11-13 It is one of several techniques available for self-modeling curve resolution.14,15 In the NMR literature, ALS has already been applied to resolve signals in LC NMR studies,16 and it has been used for resolving mixture components in diffusion-ordered spectroscopy.17 To our knowledge, it has not been used in analysis of metabonomics data or in cases where spectra in a data set arise from disparate samples.

(11) Karjalainen, E. J. Chemom. Intell. Lab. Syst. 1989, 7, 31-38. (12) Karjalainen, E. J.; Karjalainen, U. P. Anal. Chim. Acta 1991, 250, 169-179. (13) Gemperline, P. J. Anal. Chem. 1999, 71, 5398-5404. (14) Leger, M. N.; Wentzell, P. D. Chemom. Intell. Lab. Syst. 2002, 62, 171-188. (15) Jiang, J.-H.; Liang, Y.-Z.; Ozaki, Y. Chemom. Intell. Lab. Syst. 2003, 65, 51-65. (16) Shen, H.; Airiau, C. Y.; Brereton, R. G. J. Chemom. 2002, 16, 469-481. (17) Van Gorkom, L. C. M.; Hancewicz, T. M. J. Magn. Reson. 1998, 130, 125-130.
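The building block of such algorithms, a nonnegativity-constrained solve of D = SL for S with L held fixed, can be sketched as follows. This is an illustrative NumPy fragment with made-up shapes; the simple solve-then-clip step mirrors the conditioning used in this paper rather than a formal NNLS solver:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 20, 100, 4            # spectra, frequencies, factors (illustrative)

L = rng.random((k, n))          # candidate component spectra (nonnegative)
S_true = rng.random((m, k))     # true "concentrations"
D = S_true @ L                  # noise-free mixture spectra

# Solve D = SL for S by ordinary least squares: transpose to the
# standard a @ x = b form, L.T @ S.T = D.T.
S = np.linalg.lstsq(L.T, D.T, rcond=None)[0].T

# Impose the nonnegativity criterion on the concentrations by zeroing
# any negative entries ("condition S").
S = np.clip(S, 0.0, None)

# With noise-free data and a full-rank L, the true values are recovered.
assert np.allclose(S, S_true)
```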


The ALS algorithm is given in Listing 1. The initial guess is made by generating a random nonnegative L matrix, followed by solving D = SL for S. To execute the “condition S” step of the algorithm, all negative values in S are replaced with zero. In most applications of this algorithm, the conditioning step for L is to replace negative values with zero and to set the integral of each vector to a fixed value. However, for MFA, we find that this approach gives spectra and concentrations that represent mixtures of compounds that have significant covariance within the data set. To simplify the spectra, we use a threshold criterion rather than a nonnegativity criterion for conditioning the L matrix.

Listing 1. Alternating least-squares algorithm:
  make initial guesses for S and L
  while not converged
    condition S
    solve D = SL for L
    condition L
    solve D = SL for S
  end while loop
  solve D = SL for L
  solve D = SL for S

We condition the L matrix using the following procedure. For each row of L, the element having the largest absolute value is identified. A sign change is applied to the entire vector if that element is negative. This makes the tallest peak in the spectrum positive. Then, all elements that are less than some specified fraction (typically 0.25) of this largest value are set to zero. During execution of the algorithm, we monitor the rank of L. If it becomes less than the number of factors sought, k, random noise is added to the entire L matrix to restore its rank.

Practical Data Conditioning Considerations for Molecular Factor Analysis in NMR Spectroscopy. In chemometrics, it is very common to preprocess data to improve the performance of multivariate analysis methods. In this section, we describe the preprocessing tools that we have found helpful in performing MFA. Not all of these methods are used in the example applications presented below.
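A compact sketch of Listing 1 with the threshold-based conditioning of L is given below. This is an illustrative NumPy reimplementation from the description in the text, not the authors' Matlab code; in particular, the noise amplitude used to restore rank is an assumed value:

```python
import numpy as np

def solve_right(D, S):
    """Least-squares solve of D = SL for L with S fixed."""
    return np.linalg.lstsq(S, D, rcond=None)[0]

def solve_left(D, L):
    """Least-squares solve of D = SL for S with L fixed."""
    return np.linalg.lstsq(L.T, D.T, rcond=None)[0].T

def condition_L(L, threshold):
    """Flip each row so its tallest peak is positive, then zero all
    elements below the given fraction of that peak."""
    L = L.copy()
    for row in L:
        j = np.argmax(np.abs(row))
        if row[j] < 0:
            row *= -1.0
        row[row < threshold * row[j]] = 0.0
    return L

def mfa_als(D, k, n_iter=100, threshold=0.25, seed=0):
    rng = np.random.default_rng(seed)
    L = rng.random((k, D.shape[1]))       # random nonnegative initial guess
    S = solve_left(D, L)
    for _ in range(n_iter):
        S = np.clip(S, 0.0, None)         # condition S: zero the negatives
        L = solve_right(D, S)
        L = condition_L(L, threshold)
        if np.linalg.matrix_rank(L) < k:  # add noise to restore rank
            L = L + 0.01 * rng.random(L.shape)
        S = solve_left(D, L)
    L = solve_right(D, S)
    S = solve_left(D, L)
    return S, L

# Demonstration on a tiny mixture problem with two well-separated peaks.
rng = np.random.default_rng(3)
L_true = np.zeros((2, 60))
L_true[0, 10:15] = [1.0, 2.0, 3.0, 2.0, 1.0]
L_true[1, 40:45] = [1.0, 3.0, 5.0, 3.0, 1.0]
S_true = rng.random((25, 2)) + 0.1
D = S_true @ L_true

S_fit, L_fit = mfa_als(D, k=2, n_iter=200)
rel_err = np.linalg.norm(D - S_fit @ L_fit) / np.linalg.norm(D)
```

Because the final two unconstrained solves project D onto the span of the recovered factors, the reconstruction error tends to be small even when an individual factor remains mixed.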
However, we find that having these methods in our data analysis tool set expedites our ability to process collections of NMR spectra.

Binning. In metabonomics it is common to reduce the resolution of NMR spectra prior to statistical analysis. Typically, one divides a spectrum into a number of equally spaced regions, usually 0.04 ppm in width, leading to 250 segments (bins) over a 10 ppm chemical shift range. The total signal intensity within each region is integrated and placed into a data vector. The binned data are then subjected to statistical analysis.

Resolution reduction is valuable for two main reasons. It reduces the amount of data quite substantially, which is useful because high-resolution NMR spectra are often acquired using 16 384 or more points. It also helps to compensate for the spectrum-to-spectrum variability in chemical shifts from specific compounds. This variability arises from small differences in pH that can influence compound ionization and possibly from concentration-dependent intermolecular interactions. The price for the advantages of binning is reduced precision of the chemical shifts. Also, small signals can be overwhelmed by large signals that happen to reside in the same segment. Nonetheless, overall the procedure proves to be indispensable.
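Standard binning as just described can be sketched in a few lines of NumPy. The spectrum here is a synthetic single-peak stand-in, not real data:

```python
import numpy as np

# Synthetic stand-in for one processed spectrum on a 0-10 ppm axis.
n_points = 25000
ppm = np.linspace(0.0, 10.0, n_points, endpoint=False)
spectrum = np.exp(-(((ppm - 3.7) / 0.01) ** 2))  # a single toy peak

# Standard binning: integrate the intensity inside equal 0.04 ppm
# segments, giving 250 bins over the 10 ppm range.
bin_width = 0.04
n_bins = int(round(10.0 / bin_width))   # 250 bins
pts = n_points // n_bins                # points per bin (here exactly 100)
binned = spectrum.reshape(n_bins, pts).sum(axis=1)
```

The reshape trick requires a whole number of points per bin, which is why the experimental section below interpolates the spectra to a point count evenly divisible by the number of segments.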


To avoid some of these pitfalls, we sometimes use a binning algorithm that preserves chemical shift resolution while still reducing the number of points. We call the algorithm smart binning to distinguish it from the standard approach just described. In smart binning, we first use a simple heuristic algorithm to determine high-resolution peak positions and ranges corresponding to each peak. This algorithm defines a peak as a location having higher intensity than both of its neighbors. The range of a peak is determined by finding the points at which the signal intensity drops below a threshold or at which the signal begins to increase, indicating a nearby peak. The peak locations and ranges are used to construct a mask. Mask construction requires a reference spectrum that includes signals from all the spectra. We find it valuable to use either an average spectrum for this or an envelope spectrum constructed from the highest signal from among all the spectra at each chemical shift.

On occasion it is necessary to alter the mask manually. For example, if the pH changes extensively among the samples, signals from some molecules such as acetic acid will move significantly. We can manually intervene in the mask construction process to define the entire chemical shift range spanned by such compounds as constituting a single peak.

Once the mask is established, the total intensity within each peak region defined in the mask is then calculated for each spectrum. The number of peaks tends to be comparable to the number of bins produced using standard binning. Small peaks are preserved if they are resolved from large peaks. Better spectroscopic resolution is preserved during the statistical analysis because the positions and ranges corresponding to each peak are recorded. To our knowledge, smart binning is a new procedure. It attempts to address a significant aspect of metabonomics data preprocessing.
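The peak-picking heuristic underlying smart binning can be sketched as follows. A hand-made reference trace stands in for a real envelope spectrum, and the exact walk rules are our reading of the description above rather than the authors' implementation:

```python
import numpy as np

def find_peaks_and_ranges(ref, threshold):
    """A peak is a point higher than both neighbors; its range extends
    outward until the signal drops below the threshold or starts to
    rise again (indicating a neighboring peak)."""
    found = []
    for i in range(1, len(ref) - 1):
        if ref[i] > ref[i - 1] and ref[i] > ref[i + 1]:
            lo = i
            while lo > 0 and threshold <= ref[lo - 1] <= ref[lo]:
                lo -= 1
            hi = i
            while hi < len(ref) - 1 and threshold <= ref[hi + 1] <= ref[hi]:
                hi += 1
            found.append((i, lo, hi))
    return found

# A toy reference trace containing two resolved peaks.
ref = np.array([0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 2.0, 5.0, 2.0, 0.0])
mask = find_peaks_and_ranges(ref, threshold=0.5)

# The mask is then applied to every spectrum in the collection: the
# total intensity inside each peak range becomes one data point.
areas = [ref[lo:hi + 1].sum() for (_, lo, hi) in mask]
```

For a real data set, `ref` would be the average or envelope spectrum described in the text, and the integration step would be repeated for each spectrum in the collection.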
We recognize that it should be subjected to a comprehensive evaluation of its performance, advantages, and disadvantages. We have not yet completed this analysis, though we hope to present such an analysis in a future report.

Smoothing. In the standard binning procedure, signals originating from the same type of molecule may appear in different bins in different spectra due to spectrum-to-spectrum variability. This can make data analysis more difficult, especially if narrow bins are used. On some occasions, we find it advantageous to convolve the spectra with Lorentzian or Gaussian peaks having a width larger than the spectrum-to-spectrum variability. This prevents abrupt switches in the binned position of any peak that happens to drift. Smoothing also improves the performance of the thresholding procedure used in our ALS algorithm. Signals that are small, but still above the threshold at their highest point, will have most of their area below the threshold if they are broadened and smoothed to occupy several bins. Such peaks will therefore be suppressed more than larger signals. During the ALS cycle, this disproportionate reduction of the smaller peaks tends to lead to factors that have fewer peaks. Loadings are therefore more likely to correspond to spectra of single molecules or to collections of closely covariant molecules.

Least Common Spectrum Removal. Standard chemometric analyses frequently employ mean-centering of the data matrix columns. Mean-centering cannot be used with MFA because it makes the nonnegativity criteria impossible to enforce. It remains useful, however, to identify and remove the nonvarying component of the data set by constructing a least common spectrum. This is achieved by inspecting each column, identifying the smallest intensity, and placing that intensity in the least common spectrum. Subtraction of this least common spectrum from each spectrum in the data set enhances the performance of the algorithm while maintaining nonnegativity. The least common spectrum is itself often of interest for identifying mixture components that are not changing within a study.

Reconstruction Using k Singular Values. The molecular factors determined from independent MFA analyses of a single data set are not always identical. Although the larger factors are usually the same, the smaller factors are biased by the random numbers used to generate the initial guesses. To guarantee that the same factors are generated in multiple independent runs, the data matrix can be subjected to singular value decomposition followed by reconstruction using the k largest singular values. If the intent is to extract k molecular factors, then the data matrix is reconstructed by retaining the first k singular values, setting all others to zero. This creates a reconstructed data matrix with a rank of k. When this operation is performed, the MFA algorithm reliably finds k molecular factors that reproduce the reconstructed data matrix within machine precision. A very important implication of this result is that the molecular factors span the same space as the principal components. Molecular factors generated from reconstructed data matrices can therefore be viewed as resulting from an oblique rotation of the principal components. Molecular factors contain the same information as principal components, redistributed in a more useful form.

Presentation of MFA Results.
Since PCA and MFA produce analogous scores and loadings matrices, it is convenient to present scores values from MFA using a plot analogous to the scores plot often used with PCA. In the case of MFA, the scores are intended to represent concentrations of specific compounds (or groups of compounds that covary within the data set). However, to keep the terminology simple, in this report we refer to a plot correlating pairs of concentrations derived from MFA as a scores plot. Similarly, we present loadings vectors from PCA and MFA as normal NMR spectra, recognizing that PCA loadings are not intended to correspond to actual spectra and that the data have been subjected to extensive preconditioning. Preconditioning makes the loadings vectors slightly less recognizable as true NMR spectra.

Metabonomics Data. The rat urine for the metabonomics study was acquired from three groups of 12 animals each. The control group was fed ad libitum. A second group was fed 80% of what the control group ate. A third group was fed 60% of what the control group ate. Urine was collected at the study outset and at 1, 2, and 4 weeks. Prior to the start of the study, all rats were acclimated on an ad libitum diet for 7 days. Sample preparation involved mixing 0.4 mL of urine with 0.3 mL of a standard buffer. Spectra were acquired using solvent presaturation on a Varian Inova 600-MHz NMR spectrometer. Data were originally acquired using 32K complex points. Initial data processing was performed using the program NMR Processor (ACD Labs, Inc.). Spectra were subjected to 1 Hz line broadening, Fourier transformation, manual phasing, and manual baseline correction. Processed spectra were then transferred to Matlab (The Mathworks, Inc.). All additional spectral and statistical processing was performed using programs (Matlab M files) developed in our laboratory.

Data were aligned so that the internal chemical shift reference, TSP, occurred at an identical location in all spectra. The region of the spectra containing residual water was set to zero intensity. Spectra were truncated to the chemical shift range between 0.2 and 10 ppm. Prior to binning, a piecewise cubic Hermitian polynomial interpolation algorithm (supplied by Matlab) was used to increase the number of points in the spectra to 24 500, giving a whole number of points per segment. Binning was performed to a resolution of 0.02 ppm per bin. Binned spectra were scaled so that they all had the same total integrated intensity. A Gaussian smoothing function having a width of 2 bins was applied to enhance the performance of the MFA algorithm. Least common spectrum removal was performed. PCA and MFA were then performed.

Figure 1. Two of 50 artificial mixture spectra illustrating the nature of the randomly constructed test data set.

Figure 2. Principal component spectra 2 and 5 from a set of 10-component mixtures. These show that the principal components do not look like NMR spectra of pure components: they have negative and positive peaks, contain too many signals, and do not correspond to any actual components.

The MFA procedure requires specification of three control parameters: the number of factors, the number of iterations, and the fractional intensity threshold. The number of


Figure 3. Ingredient spectra extracted using the MFA algorithm. By imposing the criteria of positive peaks, positive concentrations, and minimal complexity via threshold-based conditioning, the MFA algorithm correctly extracts the ingredient spectra from a collection of mixture spectra. The top spectra show pure ingredient spectra. The bottom spectra show the corresponding loadings generated by MFA.

Figure 4. Example mixed factor generated by analyzing the test data set using a zero threshold. This loading shows that mixed spectra occur if threshold-based conditioning is not used.

factors used is explored below. The number of iterations was chosen to be at least 100 per factor. The threshold was typically 0.25.

RESULTS

Synthetic Data Test. To explore the performance of the molecular factor analysis procedure, an artificial NMR-like data set consisting of known component spectra was generated. Ten noise-free ingredient spectra were constructed. For each ingredient spectrum, a number of peaks per spectrum between 1 and 6 was chosen using a random number generator. For each of these peaks, random choices were made for intensities between 1 and 3, Lorentzian line widths between 0.5 and 2, and positions between 1 and 500 on an arbitrary scale. These ingredient spectra were used to generate 50 mixture spectra consisting of randomly weighted combinations of the ingredient spectra. Normally distributed noise was added to produce the final simulated mixture spectra similar in appearance to typical NMR spectra of biofluids. Two of the 50 random mixture spectra are shown in Figure 1.

SVD was used to generate a set of PCA scores and loadings. Figure 2 shows two of the loadings vectors. The appearance of the loadings vectors is not spectrum-like: they contain many signals per spectrum and negative intensity, and they do not directly contain information on the individual components that are varying from sample to sample.

Figure 3 shows that MFA gives a more desirable result. To generate these spectra, the alternating least-squares algorithm was


run to generate 10 molecular factors. Parameters for the run included 1000 iterations and a truncation threshold of 0.25. Comparison to the known “true” spectra shows that the loadings correspond to individual ingredient spectra. All 10 extracted spectra were of comparable quality. This demonstrates that the algorithm can extract meaningful spectroscopic information from large NMR data sets.

When the threshold parameter is set to zero, that is, when the unmodified ALS algorithm is used, one often obtains mixed factors. Figure 4 shows an example mixed factor that was generated by analyzing the test data set using a zero threshold. This illustrates that satisfying nonnegativity criteria does not guarantee the generation of pure component spectra. The threshold-based conditioning procedure was introduced as a simple way to generate pure spectra with little computational cost.

Even when a threshold is used, MFA does not always successfully separate factors corresponding to the true ingredient spectra. This occurs under the following circumstances resulting from the random initial guess matrices: first, the largest peak in the mixed factor corresponds to a signal that is already accounted for in a separate factor; second, the smaller peaks in the mixed factor correspond to signals that are not accounted for in any separate factor. Under these circumstances, the algorithm retains the mixed factor in order to maintain a sufficiently high rank. It may be possible to introduce heuristics into the computer code to deal with this situation. Mixed factors arise more frequently when the mixtures are very complex, such as in metabonomics. To identify these mixed factors, we perform at least three independent MFA runs, making sure that corresponding factors can be identified in each run. Mixed factors can then be easily identified and eliminated.

Metabonomics Data.
Figure 5. Comparison of molecular factors to principal components. The first two principal components (top row) contain the same information as the corresponding two molecular factors (bottom row). The molecular factors are the linear combination of the principal components that best yields nonnegative spectra while maintaining nonnegative scores. The first principal component has been multiplied by -1.

The artificial data set just described provides a good initial characterization of the MFA technique and proved useful in developing the algorithm. However, real data sets have features that are not included in this trial set. In metabonomics, for example, biofluid spectra are usually much more complex, having very many components whose concentrations vary from sample to sample. In this case, there are more independently varying factors than there are sample spectra, and one cannot hope to disentangle all of the components. Furthermore, such spectra exhibit sample-to-sample variability in chemical shifts for certain molecules. To explore the consequences of these types of issues, and to determine the benefits of MFA for evaluating this type of data set, a preliminary analysis of a rat urine metabonomics data set is presented. The data were acquired as part of a larger study aimed at investigating the physiological consequences of caloric restriction. We intend to publish a detailed account of that study elsewhere. That account will address the physiological relevance of the results and present a more detailed experimental protocol. Here we summarize our experience so far in applying MFA to the analysis of metabonomics data.

Analysis of Metabonomics Data. Our current practice is to first choose k, the number of molecular factors that we want. We extract the largest k principal components from the data matrix. We then reconstruct the data matrix using these k principal components as described in the Methods section. Finally, we perform MFA on the reconstructed data matrix using the modified alternating least-squares algorithm.

It is informative to explore the dependence of the results on k. For small k, the procedure produces molecular factor spectra that combine signals from compounds whose concentrations increase or decrease in concert. As k is increased, the algorithm begins to separate the groups into individual compounds whose concentrations are changing. When k is set too high, the algorithm fails to produce meaningful factors. There are several reasons for this failure. For example, the original data matrix may not contain sufficient information to justify an attempt to extract a large number of factors. Also, noise can cause different signals originating from within a single type of molecule to vary independently. In this case, the algorithm may break the spectra of individual molecules into separate factors.
As a consequence, it is necessary to use trial and error to determine the best value of k. It is also sometimes useful to retain several analyses with different values of k since the algorithm may perform best for different compounds using different values of this parameter.
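The paper leaves the choice of k to trial and error; one cheap complementary diagnostic, not used by the authors, is to look for a gap in the singular value spectrum of D before committing to a value. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: 40 mixture spectra built from 5 components plus noise.
true_L = rng.random((5, 300))
true_S = rng.random((40, 5))
D = true_S @ true_L + 0.005 * rng.standard_normal((40, 300))

# The singular values fall off sharply after the number of strongly
# varying components, which can guide the initial choice of k.
s = np.linalg.svd(D, compute_uv=False)
```

Here the first five singular values stand well above the noise floor, suggesting k near 5 as a starting point; real biofluid data will show a far more gradual falloff, which is why trial and error over several k values remains necessary.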

Some of these features are illustrated for the caloric restriction data in the following figures. Figure 5 compares principal component loadings with molecular factor spectra for k = 2. Since mean centering is not performed and the sample spectra are everywhere positive, the first principal component is very similar to an average spectrum. The second principal component contains positive and negative peaks and is therefore not spectrum-like. In contrast, the two molecular factors are spectrum-like in appearance. They are the linear combination of the two principal components that best gives nonnegative spectra and concentrations.

PCA scores plots and the analogous MFA concentration correlation plots provide additional insight. These are shown in Figure 6 for the k = 2 analysis of the 40% caloric restriction group. PCA and MFA both separate the t = 0 samples from the others equally well. The advantage of MFA for this low-k analysis is that the molecular factors appear to correspond to groups of compounds whose relative concentrations increase together, or decrease together, in relation to the onset of caloric restriction. A similar mechanistic grouping is not possible with the principal components.

Figure 7 shows the third PC loadings vector and a scores plot of the second versus the third principal component. The third PC does not improve discrimination of samples from different time points. In fact, no principal components beyond the second improve the discrimination among samples from different time points or restriction groups when viewed as two-dimensional scores plots. Nonetheless, additional PC loadings contain information that provides an increasingly accurate description of the data matrix. MFA can be viewed as a tool for interpreting this additional information in terms of the chemical composition of the samples.

Spectra of individual compounds that vary from sample to sample begin to emerge using higher values of k.
For example, Figure 8 shows MFA loadings determined using k = 12. In this case, several of the loadings correspond to spectra of specific compounds that can be readily identified. The highest quality spectrum generated is that of lactic acid. One urine sample from one of the animals had an anomalously high concentration of this compound. This apparently made it easy for the algorithm to generate the spectrum. The spectra of creatine and hippuric acid are more typical. Several of the factor spectra generated had a similar quality. We have not yet been able to identify all of the compounds or compound groups represented by these spectra. This may be due to unfamiliarity with the compounds they represent. Alternatively, these MFA loadings could be combinations of spectra, or they could be fragments of spectra. The worst case factor is also shown in Figure 8. This factor clearly consists of contributions from many compounds. Mixed factors like this should be expected given that in this run only 12 factors were extracted from a data set reflecting the concentrations of perhaps hundreds of compounds.

Figure 6. Comparison of a principal component scores plot (top) with the molecular factor concentration plot (bottom). The comparison shows that both exhibit good separation of t = 0 samples from those collected 1 week or longer into the caloric restriction treatment. Symbols distinguish samples collected at t = 0, 1, 2, and 4 weeks.

Figure 7. Third PC loadings vector and a scores plot of the second versus the third principal component. The third PCA loadings vector (top) contains spectroscopic information, but it does not help discriminate among samples when viewed in a scores plot (bottom). Symbols distinguish samples collected at t = 0, 1, 2, and 4 weeks.

By inspecting the scores plots, it can be seen that these specific compounds can be related to the extent and duration of caloric restriction. Figure 9 plots the concentrations of factor 3 versus factor 2 (urea vs allantoin) determined from this k = 12 analysis. The relative concentrations of these compounds appear to be reciprocally related to each other. There are several pairs of factors that provide similar discrimination among the time points. Figure 9 also shows the concentrations of factor 11 versus factor 3. In this plot, it is clear that factor 11 (not yet identified) exhibits a spike during the time course in some animals, but not all, at the 2-week time point. We wish to stress that this was not apparent from PCA. Rather, it required the ability to follow changes in concentration of specific compounds, or groups of covarying compounds, using MFA. It is possible that all or most of the animals exhibit this spike at different times but that we missed its detection because of the urine collection schedule.

DISCUSSION

NMR spectroscopy can provide valuable information on the chemical composition of complex mixtures. This can in turn lead to important insight into processes as complex as metabolism. To extract this information, it is frequently necessary to exploit multivariate statistical tools. These tools are required because the information is spread among all the spectra in a collection, because there may be extensive complexity leading to peak overlap, and because there is nonideal behavior in the spectra in the sense that peak positions can change, noise is present, etc.

This report describes several data processing and multivariate analysis tools that, when used together, can facilitate the direct chemical interpretation of collections of spectra of complex mixtures. The main idea is to exploit the alternating least-squares algorithm for self-modeling curve resolution. This algorithm attempts to generate scores and loadings matrices that correspond directly to concentrations and ingredient spectra, respectively. The ALS algorithm required minor modification, through the use of a threshold rather than straight nonnegativity, to perform well with properly preconditioned NMR data. The task of finding pure ingredient spectra and concentrations from mixture spectra was reduced to the task of finding nonnegative scores and loadings in a bilinear matrix decomposition. A number of algorithms are available for achieving this, though the simplest appears to be alternating least squares. The purpose of introducing the threshold criterion in the ALS procedure was to suppress small signals in mixed factors in order to achieve pure component spectra as loadings.

Figure 8. Several molecular factors from metabonomics data. These factors are easily identified as lactic acid (top left), creatine (top right), and hippuric acid (bottom left). A mixed factor is also shown (bottom right). These factors were taken from an MFA run in which a total of 12 factors were generated.

Figure 9. Individually assignable compounds that discriminate samples acquired before and after caloric restriction (top). One molecular factor spiked for the 2-week time point in about half of the animals (bottom). These factors were extracted from a k = 12 analysis. Symbols distinguish samples collected at t = 0, 1, 2, and 4 weeks.

Other options for obtaining the simplest possible loadings vectors include maximizing the power or minimizing the entropy. Though we have not explored these methods, the threshold approach appears empirically to work well, and it was very easy to incorporate into the ALS algorithm.
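A minimal illustration of thresholded alternating least squares, with the random restarts mentioned later in this discussion, might look like the following. The parameter names, the relative-threshold rule, and the restart count are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def als(X, k, tau=0.02, n_iter=200, seed=0):
    """Decompose X (samples x points) as C @ S with nonnegative factors.
    Entries of S smaller than tau times its largest value are zeroed,
    suppressing small signals left over in mixed factors (the threshold
    criterion, rather than plain nonnegativity)."""
    rng = np.random.default_rng(seed)
    C = rng.random((X.shape[0], k))
    for _ in range(n_iter):
        S, *_ = np.linalg.lstsq(C, X, rcond=None)       # update spectra
        S = np.clip(S, 0.0, None)
        S[S < tau * S.max()] = 0.0                      # threshold small signals
        Ct, *_ = np.linalg.lstsq(S.T, X.T, rcond=None)  # update concentrations
        C = np.clip(Ct.T, 0.0, None)
    return C, S

# Synthetic two-component mixtures: nonnegative spectra, positive scores.
rng = np.random.default_rng(2)
S_true = np.abs(rng.normal(size=(2, 120)))
C_true = 0.1 + rng.random((15, 2))
X = C_true @ S_true

# Multiple random starts; keep the best fit.
C, S = min((als(X, 2, seed=s) for s in range(5)),
           key=lambda cs: np.linalg.norm(X - cs[0] @ cs[1]))

assert S.min() >= 0.0 and C.min() >= 0.0
assert np.linalg.norm(X - C @ S) <= 0.1 * np.linalg.norm(X)
```

Maximizing power or minimizing entropy would replace the thresholding line with an extra penalty in the factor updates; the threshold has the advantage of requiring no change to the least-squares steps themselves.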

We found several data preconditioning methods very useful for improving the performance of the ALS algorithm. Smart binning, based on construction of a mask, preserves the high-resolution chemical shift information while still providing a reduction in the number of points. A second benefit follows from the ability to adjust the reference spectrum used to generate the mask. Smoothing and broadening reduce the effect of sample-to-sample variability when used with standard binning if the chemical shift variability is not too large. The smoothing procedure also improves the performance of the peak identification heuristics used in the smart binning procedure. Finally, peak broadening improves the performance of the thresholding algorithm for simplifying spectra.

Least common spectrum removal has several uses. Primarily, it guarantees that every signal will vary from zero to some maximum value. This improves the performance of the MFA algorithm. It is also very useful to know the spectrum of the components whose concentrations do not change among the samples.

In those cases where the number of factors sought, k, is less than the number of compounds that are actually varying, it is useful to perform data set reconstruction using k singular values. This makes the extracted molecular factors reproducible from run to run. Furthermore, the algorithm finds combinations of the first k principal components that best correspond to nonnegative concentrations and spectra, in effect performing an oblique rotation of the principal components. Many of the resulting nonnegative scores and loadings correspond to individual molecular species that dominate the variance within the data set. Others represent combinations of materials that covary within the data set.

The method was tested using an artificial data set, on which it performed reliably. However, it is clear that real NMR data are more difficult to analyze.
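Two of these preconditioning steps, least common spectrum removal and rank-k reconstruction, are simple enough to sketch directly. The code below uses synthetic data, and the details are illustrative rather than the authors' code.

```python
import numpy as np

# Synthetic stand-in: 25 positive mixture spectra of 300 points each.
rng = np.random.default_rng(3)
X = rng.random((25, 300)) + 1.0

# Least common spectrum removal: subtract, point by point, the smallest
# intensity seen across all samples.  The subtracted profile is the
# spectrum of the non-varying components; every remaining signal now
# varies from zero up to some maximum.
common = X.min(axis=0)
X_var = X - common
assert np.isclose(X_var.min(), 0.0)

# Rank-k reconstruction: rebuild the matrix from its first k singular
# values so that repeated MFA runs work within the same k-dimensional
# subspace, making the extracted factors reproducible from run to run.
k = 5
U, s, Vt = np.linalg.svd(X_var, full_matrices=False)
X_k = (U[:, :k] * s[:k]) @ Vt[:k]
assert np.linalg.matrix_rank(X_k) <= k
```

ALS applied to `X_k` then amounts to an oblique rotation of the first k principal components toward nonnegative concentrations and spectra.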
The most difficult problem appears to be that chemical shifts are highly dependent on pH and possibly on composition-dependent intermolecular interactions. When used with appropriate data preconditioning methods, we found that the MFA algorithm could be used successfully to determine individual component spectra in moderately complex mixtures. In very complex mixtures, it is possible to obtain spectra and concentrations of individual compounds or sets of compounds that dominate the variance in the data.

There are two main types of ambiguities that arise in self-modeling curve resolution: rotation and intensity. The rotation ambiguity results from the possibility of multiple nonnegative solutions. Because of the high resolution offered by NMR, this is less of a problem here than in other applications of self-modeling curve resolution. Experience with both model and real data shows that though the issue still exists in NMR, it can be dealt with. In particular, we use a threshold criterion rather than straight nonnegativity, and we use multiple runs with random starts. These strategies appear to deal effectively with this potential problem. The intensity ambiguity stems from the fact that the spectrum for any single component can be scaled with an arbitrary factor.
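One conventional way to resolve the intensity ambiguity is to fix the integrated intensity of each factor spectrum and absorb the scale into the scores, which leaves the reconstruction unchanged. The sketch below illustrates the idea with synthetic factors; it is not necessarily the authors' exact scaling.

```python
import numpy as np

# Synthetic factor spectra (rows of S) and raw concentration scores (C).
rng = np.random.default_rng(4)
S = np.abs(rng.normal(size=(3, 80)))
C = rng.random((10, 3))

# Normalize each factor spectrum to unit integrated intensity and
# absorb the scale factors into the scores: C @ S is unchanged.
areas = S.sum(axis=1)
S_norm = S / areas[:, None]
C_scaled = C * areas

assert np.allclose(S_norm.sum(axis=1), 1.0)
assert np.allclose(C @ S, C_scaled @ S_norm)
```

For an assigned compound, dividing the unit-area spectrum by its number of contributing hydrogens instead would make the scores proportional to relative molar concentration.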



In the present application, we scale all component spectra to fixed integrated intensity. For NMR spectroscopy, relative peak intensities are proportional to the number of hydrogen atoms giving rise to each signal. It is possible, therefore, to scale assigned component spectra so that the corresponding scores are proportional to the relative component concentrations. Even without quantitative scaling, the procedure described here identifies spectra of individual components and gives information on concentration changes.

ACKNOWLEDGMENT

The authors thank Pete Rodriguez for encouraging this research program.

Received for review November 4, 2003. Accepted January 28, 2004.

AC035301G