Anal. Chem. 2006, 78, 5559-5569
Analysis of Four-Way Two-Dimensional Liquid Chromatography-Diode Array Data: Application to Metabolomics Sarah E. G. Porter,† Dwight R. Stoll,‡ Sarah C. Rutan,*,† Peter W. Carr,‡ and Jerry D. Cohen§
Department of Chemistry, Virginia Commonwealth University, 1001 West Main Street, Richmond, Virginia 23284-2006, Department of Chemistry, University of Minnesota, 207 Pleasant Street SE, Minneapolis, Minnesota 55455, and Department of Horticultural Science, University of Minnesota, 1970 Folwell Hall, St. Paul, Minnesota 55108
Two-dimensional liquid chromatography (2D-LC) is rapidly gaining popularity for the analysis of very complex mixtures, including proteomic and metabolomic samples. It provides an effective strategy for separating such samples, because the resolving power of 2D-LC is far superior to that of traditional single-dimension separations. The present work focuses on the development of data analysis methods for the extremely large data sets, on the order of 10 million data points, generated by 2DLC with diode-array detection (DAD). Specifically, we have applied and adapted chemometric methods to the analysis of maize seedling digests, focusing on compounds related to the biosynthetic pathways of indole-3-acetic acid, the primary growth regulator in plants. The chemometric techniques of window target testing factor analysis (WTTFA), along with parallel factor analysis - alternating least squares (PARAFAC-ALS) were used to analyze 2D-LCDAD chromatograms of a sample composed of 26 indolic standards, 2 extracts of mutant orange pericarp maize seedlings, 2 extracts of wild-type maize seedlings, and a blank sample. The indolic compounds studied belonged to six spectrally unique groups, and WTTFA was able to specifically identify the presence or absence of any of the 26 indolic standards in the mutant and wild-type samples. A PARAFAC-ALS algorithm and an ALS algorithm with flexible constraints were successfully applied to resolve the spectrally rank deficient data and to demonstrate the quantitative potential of multivariate curve resolution methods. Using this procedure, 95 total peaks were resolved in the data set analyzed. Of those 95 peaks, 45 were found in both the mutant and wild-type maize samples, 16 peaks were unique to the mutant maize samples, 13 peaks were unique to the wild-type maize samples, and 15 peaks were unique to the standard chromatograms. Of the 26 standards included in the data set, several indole acetic acid conjugates were identified and quantified in the maize samples at levels of approximately 0.3-2 µg/g plant material. Two-dimensional separation methods have gained popularity in recent years, and increasingly complex samples are being * To whom correspondence should be addressed. E-mail:
[email protected]. † Virginia Commonwealth University. ‡ Department of Chemistry, University of Minnesota. § Department of Horticultural Science, University of Minnesota. 10.1021/ac0606195 CCC: $33.50 Published on Web 07/07/2006
© 2006 American Chemical Society
analyzed in proteomics research using these approaches.1 Twodimensional liquid chromatography (2D-LC) techniques have been applied in order to take advantage of the superior resolving power afforded by using orthogonal separation mechanisms.2 Multidimensional separations in proteomics research were recently reviewed by Issaq et al.1 Although relatively few applications to metabolomics have been published, it is clear from the apparent complexity of the plant metabolome that such a powerful separation tool will soon be invaluable to this field.3,4 Stoll and Carr5 have described a novel approach for fast 2DLC analysis using gradient ion exchange in the first dimension and ultrafast, high-temperature reversed-phase gradient elution in the second dimension along with diode array detection (DAD). The improved speed of the separation is achieved by the use of high temperatures (>100 °C), short columns, and high linear velocities in the second-dimension separation. The system allows for the separation of the compounds present in a tryptic digest of bovine serum albumin in less than 30 min. Previously published 2D-LC methods had analysis times of hours or days rather than minutes,6 although more recent work has significantly improved upon this time.7 More recently, Stoll et al.8 have demonstrated the use of high temperatures to greatly speed up two-dimensional HPLC with reversed-phase gradient elution in both chromatographic dimensions. They were able to generate a peak capacity of 870 in only 25 min. This fast 2D-LC-DAD separation used a single second-dimension column, and based on the fact that each first-dimensional peak appeared in at least two consecutive seconddimension chromatograms, also moved closer to meeting the sampling rate requirement for 2D-LC suggested by Murphy et al.9 The technique was applied to extracts of wild-type and mutant maize seedling tissues and a set of indolic metabolite standards (1) Issaq, H. J.; Chan, K. C.; Janini, G. M.; Conrads, T. P.; Veenstra, T. D. J. Chromatogr., B 2005, 817, 35-47. (2) Giddings, J. C. Anal. Chem. 1984, 56, 1258A. (3) Dunn, W. B.; Ellis, D. I. TRAC, Trends Anal. Chem. 2005, 24, 285-94. (4) Oksman-Caldentey, K. M.; Inze, D.; Oresic, M. Proc. Natl. Acad. Sci. U.S.A. 2004, 101, 9949-50. (5) Stoll, D. R.; Carr, P. W. J. Am. Chem. Soc. 2005, 127, 5034-35. (6) Opiteck, G. J.; Ramirez, S. M.; Jorgenson, J. W.; Moseley, I. I. I. Anal. Biochem. 1998, 258, 349-61. (7) Ikegami, T.; Hara, T.; Kimura, H.; Kobayashi, H.; Hosoya, K.; Cabrera, K.; Tanaka, N. J. Chromatogr., A 2006, 1106, 112-17. (8) Stoll, D. R.; Cohen, J. D.; Carr, P. W. J. Chromatogr., A, in press. (9) Murphy, R. E.; Schure, M. R.; Foley, J. P. Anal. Chem. 1998, 70, 1585-94.
Analytical Chemistry, Vol. 78, No. 15, August 1, 2006 5559
as a first step in demonstrating that 2D-LC is a practical highspeed analytical methodology for small molecules. We are currently working on developing and applying chemometric methods to analyze the huge data sets (on the order of 10 million data points for a set of six samples) obtained using the 2D-LC-DAD system described by Stoll et al.8 The focus of this work is the detection and analysis of compounds related to the biosynthetic pathways of indole-3-acetic acid (IAA). The use of multiple-wavelength channels should provide a good deal more information than what can be obtained from traditional analysis (i.e., single-wavelength peak integration) of such a chromatogram. Although our results are based on the analysis of 2D-LC-DAD data, the principles should be applicable to mass spectrometric detection (or any other multichannel detector) as well. Chemometric methods have long been used by chemists to analyze complex data.10-14 Using multichannel detectors such as DAD with one-dimensional chromatographic separations allows for the identification of analytes by both their spectral characteristics and their chromatographic retention.13,15 Adding a second chromatographic dimension and analyzing multiple samples results in higher order data that can be analyzed very effectively using multivariate curve resolution techniques.16 Many different techniques for analyzing two-dimensional chromatography (both 2D-LC and 2D-GC) data have appeared in recent years, including curve resolution methods such as parallel factor analysisalternating least squares (PARAFAC-ALS) and trilinear decomposition,17,18 as well as the generalized rank annihilation method (GRAM).19 Hollingsworth et al.20 have very recently compared several different methods for visually comparing 2D-GC data sets as images, using only the retention time data; however, it would be advantageous to also consider the information gained using multichannel detectors such as DAD and MS. Quantitative studies have been published based on two-dimensional gas chromatography/mass spectrometry data (2D-GC/MS),21 and 2D-GC with flame ionization detection.22 Fraga and Corley19 have used GRAM to quantitatively analyze 2D-LC with single-wavelength detection; however, we are not aware of any publications dealing with the quantitative analysis of quadrilinear 2D-LC-DAD data. Here, we present two different analysis methods for dealing with 2D-LC-DAD data that will address the need for identifying (10) Booksh, K. S.; Lin, Z. H.; Wang, Z. Y.; Kowalski, B. R. Anal. Chem. 1994, 66, 2561-69. (11) Henshaw, J. M.; Burgess, L. W.; Booksh, K. S.; Kowalski, B. R. Anal. Chem. 1994, 66, 3328-36. (12) Sanchez, E.; Kowalski, B. R. Anal. Chem. 1986, 58, 496-99. (13) Sanchez, E.; Ramos, L. S.; Kowalski, B. R. J. Chromatogr. 1987, 385, 151-64. (14) Cecil, T. L.; Rutan, S. C. Anal. Chem. 1990, 62, 1998-2004. (15) Chen, J. C.; Rutan, S. C. Anal. Chim. Acta 1996, 335, 1-10. (16) Synovec, R. E.; Prazen, B. J.; Johnson, K.; Fraga, C. G.; Bruckner, C. A. In Advances in Chromatography; Brown, P. R., Grushka, E., Eds.; Marcel Dekker: New York, 2003; Vol. 42, pp 1-43. (17) Sinha, A. E.; Hope, J. L.; Prazen, B. J.; Fraga, C. G.; Nilsson, E. J.; Synovec, R. E. J. Chromatogr., A 2004, 1056, 145-54. (18) Gross, G. M.; Prazen, B. J.; Synovec, R. E. Anal. Chim. Acta 2003, 490, 197-210. (19) Fraga, C. G.; Corley, C. A. J. Chromatogr., A 2005, 1096, 40-49. (20) Hollingsworth, B. V.; Reichenback, S. E.; Tao, Q.; Visvanathan, A. J. Chromatogr., A 2006, 1105, 51-58. (21) Sinha, A. E.; Fraga, C. G.; Prazen, B. J.; Synovec, R. E. J. Chromatogr., A 2004, 1027, 269-77. (22) van Mispelaar, V. G.; Tas, A. C.; Smilde, A. K.; Schoenmakers, P. J.; van Asten, A. C. J. Chromatogr., A 2003, 1019, 15-29.
5560
Analytical Chemistry, Vol. 78, No. 15, August 1, 2006
and quantifying targeted metabolites in metabolic profiling studies, as well as more comprehensive comparative studies in metabolomics.23 First, an algorithm that can identify target metabolites in an unknown sample based on both spectral characteristics and two-dimensional retention times will be very valuable for metabolic profiling studies. To this end, we have adapted Lohnes’24 window target testing factor analysis (WTTFA) algorithm and have applied it to 2D-LC-DAD data. Second, a method must be devised whereby the components present in each sample and the differences in concentration (or relative concentration) of those components can be determined. Many metabolomic studies require quantitative comparisons between samples where the identity of the components may be unknown. Curve resolution methods such as PARAFAC-ALS are well suited to such analyses;16 however, the issue of rank determination and the application of chemically relevant constraints become very important. We propose that the use of a PARAFAC-ALS algorithm25 on rank-deficient systems can adequately provide such information, if the maximum rank of the data can be determined. Further refinement of the resolved profiles can be achieved by applying an alternating least-squares algorithm with flexible constraints (fALS).26 Qualitative Metabolite Profiling with Window Target Testing Factor Analysis. We have adapted Lohnes’s method of WTTFA24 for the purpose of qualitative screening of 2D-LC-DAD chromatograms. This method provides a mechanism to search a chromatogram for regions where a spectrum is found that closely resembles a known standard spectrum (the library or target spectrum) and to determine the retention time of the putative matches. A detailed description of the WTTFA algorithm can be found elsewhere.24 In brief, the algorithm requires the following steps: (1) singular value decomposition (SVD) is performed on a small subsection (retention time window) of a one-dimensional chromatogram; (2) the matrix of abstract spectra calculated in step 1 is truncated to the first R factors, (where R is the rank, taken as the maximum number of components expected in the window); (3) the target spectrum is projected onto the subspace described by the abstract spectra; and (4) the correlation between the target spectrum and the projected target spectrum calculated in step 3 is calculated. The method is based on the target factor analysis algorithm described by Malinowski.27 These calculations are initially performed for the first W data points (where W is the window size); after the analysis is completed in the first window, the window is incremented by one time unit, and the analysis repeated. The window is then moved one data point at a time along the chromatographic time axis until the end of the chromatogram. The degree of correlation between the target spectrum and the projected target spectrum of interest is measured by an angle, θ, between the two spectra described as vectors in n-dimensional space: (23) Fiehn, O. Plant Mol. Biol. 2002, 48, 155-71. (24) Lohnes, M. T.; Guy, R. D.; Wentzell, P. D. Anal. Chim. Acta 1999, 389, 95-113. (25) Andersson, C. A.; Bro, R. Chemom. Intell. Lab. Syst. 2000, 52, 1-4. (26) Bezemer, E.; Rutan, S. C. Chemom. Intell. Lab. Syst. 2002, 60, 239-51. (27) Malinowski, E. R. Factor Analysis in Chemistry, 2nd ed.; John Wiley & Sons: New York, 1991.
θ ) cos-1 F
(1)
where F is the correlation coefficient between the target spectrum and the projected target spectrum.28 For two identical spectra (F ) 1.000), this angle is exactly 0°. However, in the presence of noise, it is more likely that a library spectrum and the projected spectrum in an independently obtained chromatogram will have a small but nonzero value of θ. Therefore, a threshold value for θ must be determined; the test is considered positive for that spectral factor in any region of the chromatogram where θ is both below this threshold and stays below the threshold for some minimum time. A plot of the angle θ against time shows that, in regions where either no peak is eluting (or any peak with a spectrum not matching the library spectrum), the correlation is poor (the angle is high). In a region where a peak elutes with a spectrum consistent with a library spectrum, the angle θ drops below some critical threshold. The time range over which θ remains below the threshold is the elution region for that library component. Multiway Modeling with PARAFAC. The PARAFAC model has been used extensively for modeling multivariate chromatographic data.29-31 The goal of any curve resolution technique is to extract the pure component (species) profiles, either chromatographic peaks or spectra, present in a set of data and to determine the relative contribution of each species to the data set. An excellent and detailed description of the general PARAFAC model can be found in the book by Smilde et al.32 Specifically, we employ PARAFAC for the analysis of four-way data, where the elements of data set X (xijkl), can be represented as R
xijkl )
∑a b c
ir jr krdlr
+ eijkl
(2)
r)1
where R is the number (data rank) of components used in the PARAFAC model, eijkl is an error term that contains the residual variance, and air, bjr, ckr, and dlr describe the elements of matrices A, B, C, and D, respectively. In the case of our 2D-LC-DAD data, matrix A consists of the different sample concentration profiles (or relative concentrations), matrix B consists of the firstdimension retention profiles (i.e., chromatographic peak shapes), matrix C consists of the second-dimension retention profiles, and matrix D consists of the spectral profiles. In each case, the number of rows in the matrix is equal to the number of data points in that dimension, represented by I, J, K, and L, respectively (e.g., L is the number of wavelengths in the spectral direction), and the number of columns in A, B, C, and D correspond to the number of individual chemical species (R) contributing to the model. A component is generally a chemical compound contributing to a (28) Huber, L.; George, S. A. Diode Array Detection in HPLC; M. Dekker: New York, 1993. (29) Bylund, D.; Danielsson, R.; Malmquist, G.; Markides, K. E. J. Chromatogr., A 2002, 961, 237-44. (30) Garcia, I.; Sarabia, L.; Cruz Ortiz, M.; Manuel Aldama, J. Anal. Chim. Acta 2004, 515, 55-63. (31) Comas, E.; Gimeno, R. A.; Ferre, J.; Marce, R. M.; Borrull, F.; Rius, F. X. J. Chromatogr., A 2004, 1035, 195-202. (32) Smilde, A.; Bro, R.; Geladi, P. Multi-way Analysis with Applications in the Chemical Sciences; John Wiley & Sons Ltd.: Hoboken, NJ, 2004.
chromatogram (i.e., a pure chromatographic peak), but a component can also arise from the gradient background and noise. Equation 2 represents a quadrilinear model. Lin et al.33 defined bilinearity for second-order data, and these principles are also true for trilinear (third order) or quadrilinear data (fourth order). For a data set to be quadrilinear, the instrument response due to a pure component in all four domains should be unique, consistent, and independent of the presence of other species. In other words, the first- and second-dimension chromatograms and the spectra for each pure component must be identical in every sample, differing only in magnitude. This constraint obviously requires a high degree of retention time reproducibility between runs in order for the retention profiles to be consistent in the various samples. Synovec17,21,34 has extensively discussed the use of the trilinearity constraint in the analysis of 2D-GC data. Rank Determination in Four-Way Data. One challenge in modeling four-way data is that of rank deficiency. As a simplified (two-way) example, a spectrochromatogram with two separated peaks should have a rank of two in the chromatographic mode. However, when those two peaks have the same or very similar DAD spectra, the rank in the spectral dimension will only be one. Analysis of such data by a multiway model such as PARAFAC, and use of a fit diagnostic such as the core consistency diagnostic (CORCONDIA), will indicate that a one-component model suffices to fit the data.35 However, if the maximum rank of the data is known, a PARAFAC model can still be calculated, and another measure of fit quality (such as sum of squares) can be used instead of the CORCONDIA. In the case of the data analyzed here, it is clear simply from analysis of the spectra of the 26 indolic standards that these data will be rank deficient in the spectral dimension; many indolic metabolites have very similar spectral patterns. Similarly, highly overlapped chromatographic peaks can cause rank deficiency in either of the chromatographic modes. To individually determine the rank of the data in the four dimensions (first dimension chromatography, second-dimension chromatography, spectra, and concentration), the four-way array can be unfolded into two-way matrices and analyzed by SVD. For a four-way data array, X, with dimensions of I × J × K × L, the unfolded matrices can be represented as follows: X1 (I × JKL), X2 (J × IKL), X3 (K × IJL), and X4 (L × IJK).32,36 The subscripts on X indicate which dimension is preserved, and the parenthetical notation represents the dimensions of the unfolded matrix. When SVD analysis is carried out on each of these unfolded matrices, the diagonal of the singular value matrix contains the rank information for the mode that is preserved. One popular method of determining rank by SVD analysis is to examine a plot of the singular values versus the number of components (called a scree plot); a break in the continuity in the scree plot is an indication that the maximum number of components has been reached.37 When each dimension has a different rank, due to the rank deficiency issue described above, the maximum rank of the four dimensions can be used to calculate the PARAFAC model. (33) Lin, Z. H.; Booksh, K. S.; Burgess, L. W.; Kowalski, B. R. Anal. Chem. 1994, 66, 2552-60. (34) Pierce, K. M.; Wood, L. F.; Wright, B. W.; Synovec, R. E. Anal. Chem. 2005, 77, 7735-43. (35) Bro, R.; Kiers, H. A. L. J. Chemom. 2003, 17, 274-86. (36) Harshman, R. A. J. Chemom. 2001, 15, 689-714. (37) Otto, M. Chemometrics; Wiley-VCH: New York, 1999.
Analytical Chemistry, Vol. 78, No. 15, August 1, 2006
5561
Figure 1. Structures and identification numbers of the 26 indoles in this study.
A more quantitative method of calculating rank, also based on the singular values of the data matrix, is to calculate the percent variance explained by adding components to the model.37 If all of the principal components are used, the explained variance in the data would be 100%. In our case, we calculated the 90% variance rank, that is, the rank at which 90% of the variance is explained by the singular values. There is clearly a need to examine more closely the results of any modeling that is done based on rank analysis, since there is always a level of ambiguity in this determination.37 EXPERIMENTAL SECTION 2D-LC-DAD Chromatograms. 2D-LC-DAD chromatograms resulting from injection of solute-free mobile phase, a mixture containing the 26 indolic standards, 2 orp mutant seedling samples, and 2 wild-type maize seedling samples were obtained. The details of the chromatographic system and the sample preparation for the maize seedlings are reported in detail elsewhere.8 The 26 standard compounds studied are shown in Figure 1 and include IAA (13) as well as other related indolic compounds. The compounds are numbered in order of their first-dimension retention time. The two-dimensional chromatogram of the indole standards, labeled with compound numbers, is shown in Figure 2 as a contour plot at 220 nm. Data Analysis. All data analysis was carried out within the Matlab programming environment (Mathworks, Natick, MA) on 5562 Analytical Chemistry, Vol. 78, No. 15, August 1, 2006
Figure 2. Contour plot of the 2D-LC of the 26 indole standards (220 nm). Numbers refer to compounds shown in Figure 1.
a Dell Optiplex GX280 computer with a 3-GHz processor and 2 GB of RAM. Chromatograms collected in HP Chemstation (rev. A.10.01, Agilent Technologies, Palo Alto, CA) were converted into text files with a macro provided by Agilent. The WTTFA algorithm was written in-house in the Matlab programming environment based on the approach outlined by Lohnes et al.24 The fALS algorithm was written in-house as described previously.26 The PARAFAC-ALS algorithm used was part of the N-way toolbox for
Matlab developed by Andersson and Bro, which is available for free download on the Internet.25 All other functions used were built-in Matlab functions. RESULTS AND DISCUSSION Limitations of DAD. It should be pointed out that the data analysis techniques discussed here should be applicable to any multichannel detector. The limitations of using a DAD are clear: the sensitivity and selectivity of the detector are fairly low (especially compared to MS detectors). In addition, the 1090 DAD used for the present work had a sampling rate of 42 ms/data point (24 Hz) for single-wavelength data, whereas the full spectrum is only sampled at a rate of 102 ms/spectrum (10 Hz). The baseline (4σ) peak width of a typical peak in our chromatograms was ∼800 ms, so only eight spectra are collected across each seconddimension peak. Because the sampling rate of the DAD is ∼2.5 times lower than the sampling rate of the single channel, we observed some loss of resolution and increased noise in the full scan DAD data as compared to the single-wavelength data. In addition, background correction was applied to the single-channel data but not to the DAD data, which created a higher background signal in the DAD data. A faster DAD would solve most of these issues; however, chemometric data analysis still offers a substantial improvement over traditional single-channel analysis of chromatograms. A further limitation of the DAD is the detector saturation or overflow that occurs as the absorbance values approach 3 AU. In this case, the response of the detector is no longer linear with concentration, and the overflow of data coming into the computer can result in negative absorbance values being recorded. In regions where this occurs, the data cannot be analyzed. However, despite its shortcomings, the information content of the DAD data is still far greater than any single-channel detector. The application of the chemometric methods discussed allows a great deal of information to be gleaned even from a detector with substantially lower selectivity than a MS detector. Olivieri38 recently discussed methods for computing the selectivity of multiway techniques. Although not directly related to peak capacity, the methods discussed allow evaluation of the relative precision of measurement of specific compounds in a mixture relative to that of the pure (single component) sample. By definition, a selectivity of 1 (as calculated by the method of Messick et al.38,39) indicates that a compound in a mixture can be quantified with the same precision as if it were in a pure sample. A selectivity of 0 indicates that the analyte is completely overlapped and cannot be quantified at all. Using the retention times of the peaks and the spectra resolved by the PARAFAC-ALS algorithm in this work (a total of 95 peaks, vide infra), a theoretical chromatogram of highly overlapped peaks was constructed with peak widths of 0.7 min and 800 ms on the first- and seconddimension columns, respectively. Table 1 shows the effect of adding dimensionality to the data by comparing the average selectivity of the two independent chromatographic dimensions with and without DAD and the 2D-LC with and without DAD. As shown in Table 1, the improvement in the selectivity appears to scale with the improvement in peak capacity and allows the (38) Olivieri, A. C. Anal. Chem. 2005, 77, 4936-46. (39) Messick, N. J.; Kalivas, J. H.; Lang, P. M. Anal. Chem. 1996, 68, 1572-79.
Table 1. Summary of Selectivity Calculated with Various Data Dimensions data dimensions
average selectivity per componenta
peak capacityb
column 1, single wavelength column 1 + DAD column 2, single wavelength column 2 + DAD 2D-LC, single wavelength 2D-LC-DAD
0.15 0.36 0.05 0.25 0.78 0.84
50 nac 17.4 na 870 na
a Selectivity calculated according to ref 38 for a total of 95 components. b Peak capacity as determined in ref 8. c na, not available.
increase in information provided by DAD detection to be quantified. Ultimately, combining two dimensions of chromatography with DAD allows an average selectivity of 0.84 for each component, an increase of 8% relative to single-wavelength detection. It should be noted that the figures reported in Table 1 are average selectivities over 95 components. If a particular component is well resolved chromatographically (in one or both dimensions), there will be little improvement in selectivity upon the addition of DAD. Conversely, a component that is highly overlapped chromatographically will show a dramatic improvement in selectivity when the spectral dimension is added. For example, the individual selectivity of the resolved component corresponding to standard 12 (indole-3-acetyl-L-alanine) shows a selectivity of only 0.16 when using two dimensions of chromatography due to the fact that it is highly overlapped with several peaks in the standard, mutant, and wild-type maize samples. Upon the addition of the DAD spectrum, the selectivity is improved by over 200%. This analysis indicates that an improvement in the precision of quantification can be achieved by using multichannel detectors in conjunction with multivariate curve resolution, without the corresponding increase in analysis time usually required to improve chromatographic peak capacity. The magnitude of the increase in selectivity will be directly related to the information content of the detector; that is, a more selective detector such as MS will have a correspondingly greater increase in selectivity than a DAD. In addition, the gain in selectivity upon the addition of multichannel detection will be more dramatic as the number of components and the sample complexity increase. Window Target Testing Factor Analysis. In the analysis of the 2D-LC chromatograms in this work, 26 indolic standards (Figure 1) were used as target compounds for the qualitative analysis of orp mutant and wild-type maize samples. The spectra of these 26 standards were placed into a single matrix, and a correlation matrix was generated using the built-in correlation coefficient function in Matlab. Spectra that were correlated by 98.5% (θ < 10°) or more were considered to be identical. Using this criterion, six unique spectra, representative of all the indolic metabolite standards, were used to construct a library; these six spectra are shown in Figure 3. The compound numbers (from Figure 1) represented by each spectral factor are shown in the caption to Figure 3. The six spectra corresponded to six structural classes, as shown in Figure 3: IAA and its conjugates (class A), 5-hydroxyindoles (class B), indole-3-propionic and indole-3-butyric acid (class C), and three spectra that were unique to single Analytical Chemistry, Vol. 78, No. 15, August 1, 2006
5563
Figure 3. Six unique spectra of the indolic metabolites in Figure 1. The structure in each figure shows the chemical class represented by each spectrum. (A) 2-4, 6, 7, 9-15, 17-19, 22, and 24-26; (B) 1 and 8; (C) 20 and 23; (D) 5; (E) 16; (F) 21.
compounds, anthranilic acid, indole-3-acetamide, and indole-3ethanol (classes D-F, respectively). The parameters in the WTTFA algorithm that required optimization were the window size (W), the rank, i.e., the number of components expected within a window (R), and the threshold value for θ. First, the window size was determined based on the peak width of a typical peak in the indole standard chromatogram. Although changing the window size did not have a significant effect on the qualitative results obtained, there was an advantage in using a smaller window, as the computing time was significantly reduced. However, if the window size was too small (i.e., much smaller than the width of a peak), many “noise” peaks were detected. As mentioned, the baseline width (4σ) of a typical standard peak was ∼800 ms, so a range of window sizes from 0.4 to 1.6 s was tested (results not shown). It was determined that the ideal window size should be slightly bigger than the 4σ width of the peak; therefore, a window size (W) of 1 s was ultimately chosen. This choice resulted in the most accurate identification of the two-dimensional retention times for the known standards and the fewest number of extraneous peaks being identified. These results were not in agreement with the recommendations of Lohnes et al. for the window size (they suggested a window width of 2σ); however, they did note that the size of the window had little effect on the qualitative results that they obtained.24 The maximum number of components (R) expected within a window was set at six, which was the largest number of components that could reasonably be expected to be resolved within the 1-s window. This was a reasonable choice considering the complexity of the samples and the relationship between the window size and the typical peak width. It is quite possible given the sample complexity that overlapped peaks could be present within a 1-s window; however, it was not likely that we would observe more than six overlapping compounds within a 1-s window. Although a rank of six components was somewhat of an overestimation, this parameter had little effect on the results. Finally, the threshold value for θ was set at 5° for these data, although the appropriate threshold may depend on the S/N of the chromatograms of interest and should be considered as a variable related to the data collected. The threshold for θ and the 5564
Analytical Chemistry, Vol. 78, No. 15, August 1, 2006
window size are closely related parameters: we determined that θ should remain below the threshold value for a time at least equal to the window size in order to be considered a true positive match. In addition to the approaches described, a modification to this approach using a Gaussian window40 rather than a boxcar type window was evaluated, but no improvement in the qualitative results was achieved. The algorithm with the appropriate values for θ and W was applied to each chromatogram in the data set to identify the spectral matches, and then the retention times in the first and second chromatographic dimension were determined. The retention times were determined by finding the position on the time axis of the peak apex within that region of the first-dimension chromatogram wherein θ was below the threshold and persisted for at least a period of time equal to W. These peak regions were determined by creating a “boxcar chromatogram” wherein zeros were placed in the data matrix where θ was above the threshold and ones were placed where θ was below the threshold. Subsequently, the boxcar chromatogram was reshaped into a twodimensional chromatogram. A contour plot of the boxcar chromatogram shows the locations of the peaks in two-dimensional space that are highly correlated with one of the six spectra in the library shown in Figure 3. In this manner, spectral matches can be correlated with the first- and second-dimension retention times of the standards. Results of WTTFA Analysis on Mutant and Wild-Type Maize. The results of the WTTFA analysis are shown in Figure 4 for one of each of the wild-type (A) and mutant maize (B) samples. The dots in panels A and B of Figure 4 represent the retention times of the indolic metabolite standards in twodimensional separation space (based on the chromatogram shown in Figure 2). This figure gives a qualitative picture of the spectral signatures of the compounds present in the mutant and wild-type maize samples. The colored contours surround regions where θ is below the threshold of 5° for that spectral class (where the colors correspond to the spectral classes shown in Figure 3), and thus, a peak containing a spectrum matching the corresponding spectral factor in Figure 3 exists. In those cases where a contour surrounds the location of one of the 26 standards, there is an indication that the peak observed in the unknown chromatogram is indeed a positive match for the standard known to elute in that particular region of the 2D separation space. By matching spectral characteristics as well as two retention times, we can identify which of the standards were present in the mutant and wild-type maize samples. The results indicate that there is only one detectable peak in the wild-type maize samples that is consistent with any of the standards analyzed. The red contour in Figure 4A indicates that a component consistent with tryptophan is present in this sample. Based on a similar analysis, the mutant sample contains peaks with spectra that are consistent with tryptophan as well as several other components, including anthranilic acid (5), 5-hydroxytryptamine (8), indole-3-acetyl-L-lysine (9), indole-3-acetyl-myoinositol (10), and indole-3-acetyl-β-D-glucose (15). The fact that there are several contours in the mutant chromatogram that have spectroscopic patterns that resemble one of the six indole classes but that do not match the retention times of any of the 26 (40) Brown, C. D.; Wentzell, P. D. Chemom. Intell. Lab. Syst. 2000, 51, 3-7.
Figure 4. Comparison of WTTFA results for wild-type and mutant corn seedling extracts. Black dots represent the retention times of the 26 indolic standards. (A) Wild-type maize; only spectral component A is detected. (B) Homozygous orp maize WTTFA results for spectral components A (red), B (not detected), C (yellow), D (green), E (not detected), and F (purple). The boxes denoted by dashed lines in (A) and (B) represent the region chosen to illustrate the PARAFAC results in subsequent figures.
standards is also of significant interest. These areas indicate asyet unknown indoles that may play a significant role in the tryptophan-independent biosynthesis of IAA. For example, there are several instances in the mutant chromatogram of spectral factor D (green contour in Figure 4B), which is unique to anthranilic acid compared to the other 25 metabolites studied in this work. This result indicates the presence of other compounds that are structurally related to anthranilic acid that may be of significant biological interest. Biological Relevance of Qualitative Results. IAA was not detected in either the wild-type or the mutant maize samples. This result is due to the fact that IAA elutes close to a retention time corresponding to a very large peak that saturated the detector (as discussed above) and resulted in erroneous absorbance values at the apex of the peak. The second-dimension chromatogram in this region had to be omitted from the analysis so the peaks could not be resolved. However, several of the other compounds that were identified in the mutant and wild-type samples are of biological relevance. It is known that anthranilate is a precursor to IAA in the biosynthetic pathway of tryptophan in maize and
other plants;41 it is therefore not surprising that anthranilic acid and related conjugates might be detected in the maize seedlings. In addition, previous studies have shown that orp seedlings contain a significant amount of indole-3-acetic acid as an amide conjugate;42 however, its identity was unknown. Tentative identification by 2DLC is a major step forward in sorting out the changing indolic composition in these plants. Indole-3-acetyl-L-lysine was previously isolated from bacteria, where it appears to be involved in plant gall formation,43 but it has not previously been found in plants. However, it is a compound of unique interest in that plants also contain a class of proteins with an apparent posttranslational indoleacyl modification at specific lysine residues on the proteins.44 Hydroxytryptamines are relatively unusual in plants, although several routes for indole degradation or indole alkaloid production involve ring oxidations.45 It is important to point out that these results could not be obtained without the extra information provided by the DAD spectra of the metabolites. Identification based on retention times alone would not be feasible given the large number of peaks detected, many of which are low-abundance peaks overlapped with larger peaks. Modeling with PARAFAC. While the WTTFA algorithm provides important qualitative information about the samples, quantitative information is often desired as well. LC-DAD is particularly well suited for quantitative analysis due to the high level of peak area and retention time reproducibility that can be achieved on a high-quality system. The application of multivariate curve resolution methods (PARAFAC and fALS) are typically for the purpose of extracting pure component profiles and resolving overlapped peaks. However, in this work, we have shown that these methods also provide an excellent opportunity to resolve background components from chemical components and thus allow the identification of low-abundance metabolites in the mutant and wild-type maize seedlings. We have combined the implementation of the PARAFAC-ALS algorithm from the N-way toolbox25 and a fALS algorithm written in-house26 to compare the different samples within the data set and obtain information about the relative concentrations of the components in each sample. We also show that the PARAFAC model can be successfully used to model rank-deficient systems. In principle, our four-way data set could be analyzed by a single application of the quadrilinear PARAFAC-ALS algorithm; however, in practice, it is not feasible to analyze the entire data in a single analysis. Rank determination via a scree plot of the entire data set indicates the presence of at least 80-100 components (which will include background as well as chemical components); such an analysis is not computationally realistic using a desktop computer; further, the results would be very difficult to interpret. Instead, we decided to partition the data into more easily handled sections and to analyze each section individually. The sections were selected based on visual estimates of the relative complexity of the chromatogram, and thus each section has different dimensions. The data sections are shown in Figure 5, which shows an (41) Tozawa, Y.; Hasegawa, H.; Terakawa, T.; Wakasa, K. Plant Physiol. 2001, 126, 1493-506. (42) Wright, A. D.; Sampson, M. B.; Neuffer, M. G.; Michalczuk, L.; Slovin, J. P.; Cohen, J. D. Science 1991, 254, 998-1000. (43) Glass, N. L.; Kosuge, T. J. Bacteriol. 1986, 166, 598-603. (44) Walz, A.; Park, S.; Slovin, J. P.; Ludwig-Muller, J.; Momonoki, Y. S.; Cohen, J. D. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 1718-23. (45) Gilbert, K. G.; Cooke, D. T. Plant Growth Regul. 2001, 34, 57-69.
Analytical Chemistry, Vol. 78, No. 15, August 1, 2006
5565
Figure 5. Overlay of contour plots from one mutant sample (blue), one wild-type sample (red), and the indole standards (green) at 220 nm. Black boxes represent the sections of data that were analyzed, and the section indicated with an arrow is discussed in detail in the text.
overlay of the contour plots from one mutant sample, one wild type sample, and the standard sample at 220 nm. These singlewavelength chromatograms illustrate the presence of many overlapped peaks within a sample, as well as peaks that are common among the samples. The spectral information obtained from the DAD is absolutely necessary to resolve and interpret the relevance of the peaks in these regions. The regions of the chromatograms that are devoid of peaks were not analyzed, and additionally, the dead volume peaks in the second-dimension chromatograms were excluded. The rank in each mode was determined both by the scree plot method and by the 90% variance method. In six out of the nine sections analyzed, the rank determined by visual examination of the scree plots was in close agreement with the 90% variance rank. The scree plot rank was used as a starting point for the analysis only in those cases where there was a large difference between the two rank determination methods. The following discussion pertains to the section of the data indicated by the arrow in Figure 5, but the same data analysis procedure was followed for each section. The 90% variance rank of this section of the data was 10 components; this rank was used as a starting point for the modeling. A 10-component model was fit using the PARAFAC-ALS function from the N-way toolbox.25 The initial results revealed that the two standard components present in this section, 4-hydroxytryptamine (8) and indole-3acetyl-L-lysine (9), were not resolved. Instead, a single component with a bimodal chromatographic profile was resolved with a spectrum that appeared to be a linear combination of the individual spectra of the standards. Application of a unimodality constraint in the PARAFAC algorithm was not successful at resolving the two standards in the 10-component model, and a much higher number of components were required before the standards were 5566
Analytical Chemistry, Vol. 78, No. 15, August 1, 2006
resolved. The problem with applying a model with too many components is that the model begins to fit noise, and it becomes difficult to obtain accurate quantitative information. Application of the fALS Algorithm. The fALS algorithm, developed by Bezemer and Rutan,26 is so-called because it allows flexibility in the implementation of the constraints. fALS proved to be very useful for solving the problem with the PARAFACALS algorithm. With this method, any constraint can be selectively applied to specific components, specific dimensions, or both. Its flexibility allows the user to apply the unimodality constraint to only those components that represent chromatographic peaks but allow the background to have multimodal profiles. The results from the 10-component PARAFAC model (without imposing unimodality) were used to initiate the fALS algorithm (first applied without unimodality). The components giving rise to peaklike responses were identified, and the algorithm was run again, applying the unimodality constraint only to those components that appeared to be giving rise to a chromatographic peak, using the results from the previous fit to initiate the iterations. When the results from the second pass indicated that one of the unconstrained components tended to exhibit a peaklike (as opposed to a gradient-like background) response, that component was constrained to unimodality in the subsequent fit. By iterating on this process, we found that two background components generally sufficed to describe the background gradients. In cases where there appeared to be more than two background components contributing to the model, the rank of the model was decreased, and the data were refit using the procedure described above. In cases where there were peaks evident in the raw data that were not being resolved, the rank of the model was increased. After the formal mathematical analysis was conducted, the resolved factors (i.e., chromatograms in both
Figure 6. Resolved PARAFAC-ALS profiles of selected section (see Figure 5) for a nine-component model. (A) First-dimension retention profiles; (B) second-dimension retention profiles; (C) spectral profiles; (D) concentration profiles.
Figure 7. Comparison of chromatograms at 220 nm of the selected section of the data. (A) Raw data; (B) results of fitting with PARAFACALS; and (C) PARAFAC-ALS fitting results with background components omitted.
dimensions) were visually inspected to ensure that no artifacts had been introduced. The fALS function used26 does not allow true quadrilinearity to be applied; therefore, the data were
augmented in the spectral dimension to form a three-way data array. This treatment of the data means that the spectra were not forced to be consistent between samples. Therefore, to complete Analytical Chemistry, Vol. 78, No. 15, August 1, 2006
5567
Table 2. Summary of the Peaks Resolved in Each Sample at or above a S/N of 10a
no. of peaks resolved identity of stds (compd no.)b
mutant only
WT only
std only
mutant and WT
mutant and std
WT and std
mutant,WT, and std
16
13
15 5,6; 8, 9, 11; 13/14; 16-21; 23-24; 26c
45
3 1, 2/3,d 22
2 7/10,d 12
1 4
a Retention times are shown in Figure 8. b For compound numbers, see Figure 1. c Standards 15 and 25 were not detected. d Standards 2 and 3, 7 and 10, and 13 and 14 were unresolved pairs.
Figure 8. All peaks resolved by the PARAFAC-ALS algorithm. Key: blue, mutant only; green, wild-type only; red, standard only; cyan, mutant and wild type; magenta, mutant and standard; yellow, wild type and standard; black, mutant, wild type, and standard. The only black dot represents tryptophan.
Table 3. Quantitative Results of PARAFAC Analysis for Selected Standardsa
5-hydroxyl-L-tryptophan tryptophan indole-3-acetyl-L-alanine tryptamine b
mutant 1
mutant 2
WT1
WT2
ndb
0.6 1.1 nd nd
nd 1.8 0.8 nd
nd 1.6 0.3 nd
1.4 nd 1.9
a Concentrations are reported in µg of standard/g of plant material. nd, not detected.
the quadrilinear data analysis, the results from the fALS algorithm were used to initiate a final application of the four-way PARAFACALS algorithm, using only the nonnegativity constraint. This successive application of the two different curve resolution algorithms capitalized on the application of flexible constraints afforded by fALS and the application of quadrilinearity afforded by PARAFAC-ALS. PARAFAC and fALS Results. The data section highlighted in Figure 5 was ultimately fit using a nine-component model, and the resolved profiles are shown in Figure 6. The two background components have been removed for clarity. The blue and red traces in Figure 6 represent the first two components resolved in this section of the data. These compounds elute at 1tR ) 11.6 min; 2t ) 9.6 s and 1t ) 11.9 min; and 2t ) 9.8 s, respectively; that R R R is, they very nearly coelute in both chromatographic dimensions, 5568
Analytical Chemistry, Vol. 78, No. 15, August 1, 2006
and they have nearly identical spectra. They are present at relatively high concentration in both mutant samples but are not present in the standard or wild-type samples. The relative amount of the two components differs significantly between the two mutant samples (Figure 6D), but this is probably due to the high degree of overlap of the components in both chromatographic directions, as well the similarity in spectral features; note that the sum of these two components to the two mutant samples is reasonably consistent. The two standard components in Figure 6 are shown by the orange and pink traces. These compounds are present in the standard chromatogram and correspond to 5-hydroxytryptamine (8; orange trace, 1tR ) 13.0 min; 2tR ) 9.2 s) and indole-3-acetylL-lysine (9; pink trace, 1tR ) 13.7 min; 2tR ) 10.5 s), respectively. Indole-3-acetyl-L-lysine and 5-hydroxytryptamine are at very low levels in the mutant-type samples but were not detected in the wild-type samples. The component represented by the magenta trace, eluting at 1tR ) 13.7 min; 2tR ) 9.4 s, is well-separated chromatographically from the indole-3-acetyl-L-lysine peak in the second dimension only and exhibits a very similar spectral response. Based on its spectrum, this component is likely an indole. These results are consistent with the WTTFA analysis. It can be seen in Figure 4B (indicated by the dashed box) that a red contour surrounding the two standards was found in the mutant but not in the wild-type maize. This is a strong indication that there is a peak in the mutant maize sample consistent with the spectra and retention time of 5-hydroxytryptamine, as confirmed by the PARAFAC-ALS analysis. Figure 7 gives a further demonstration of the power of the application of curve resolution methods, specifically their ability to minimize noise and remove gradient background signals. Panel A shows a surface plot of the 2D chromatogram at 220 nm of one of the wild-type samples. There is a large amount of background present in this section of the chromatogram, and while there appears to be a peak present, it is has a very low S/N and is barely detectable. Panel B shows the ALS model for these data after the resolved profiles were reconstructed according to eq 2; clearly, a great deal of noise has been removed by the fitting procedure. Finally, the two background components that were resolved from the data were deleted before carrying out the reconstruction, as shown in panel C. The background signal is virtually eliminated, and two peaks are now clearly evident in the surface plot. These peaks correspond to the blue and black components in Figure 6. Analysis of Entire Data Set. Each section of the data outlined in Figure 5 was analyzed by the same procedure described above. The results of the analysis of the entire data set are shown in
Figure 8. The dots in Figure 8 show the locations in twodimensional chromatographic space of every peak that was resolved by the PARAFAC-ALS algorithm. The colors correspond to the sample type (mutant, wild-type, or standard) in which the peaks were found with a S/N of at least 10:1. Table 2 summarizes these results. A total of 95 components were resolved, not including the background components. The only standard that was identified in both the mutant and wild-type maize samples was tryptophan (indicated by the black dot in Figure 8)sthis result is consistent with the qualitative WTTFA results. In addition, 5-hydroxy-L-tryptophan and tryptamine were identified in the mutant samples. The other standards that were identified in the wild-type and mutant maize (magenta and yellow dots) were conjugates of IAA. The quantitative results for 5-hydroxy-Ltryptophan, tryptophan, indole-3-acetyl-L-alanine, and tryptamine are summarized in Table 3. The levels for tryptophan in the maize seedlings are consistent with previously reported levels of tryptophan in Lemna (duckweed), which has been used as a model system similar to maize.46 Quantitative results for the IAA conjugates of glutamine and aspartic acid (2 and 3) and indole3-acetyl-L-glycine and -myoinositol (7 and 10) could not be obtained because these species were unresolved by the PARAFAC algorithm (cf. Table 2). CONCLUSIONS Three chemometric methods (WTTFA, PARAFAC, fALS) have been applied to quadrilinear data generated by analyzing several related samples using 2D-LC-DAD. WTTFA provides a fast qualitative analysis of a sample by comparing a set of known metabolite spectra to the spectra resolved in the unknown samples (e.g., wild-type and mutant maize samples). In contrast, the use of the PARAFAC model allows the quantitative analysis of all components present as long as there is some difference in three of the four dimensions for each component (i.e., the firstdimension chromatographic profile, the spectroscopic profile, and the concentration variation between the samples). The unique combination of a flexibly constrained algorithm (fALS) and quadrilinear data analysis (PARAFAC-ALS) are what distinguish this analysis from previously published work. The ability to selectively implement the unimodality constraint within the fALS algorithm was found to be particularly useful in these studies. In the present work, coupling these two algorithms allowed for the resolution of overlapped peaks and background components from complex mixtures of plant metabolites. These techniques greatly enhance the S/N of the instrument response for low-abundance peaks. For example, the S/N of the low-abundance peaks shown in Figure 7 is enhanced by at least 1 order of magnitude. In addition, the chromatographic background signals related to the (46) Tam, Y. Y., Slovin, J. P., Cohen, J. D. Plant Physiol. 1995, 107, 77-85.
gradient elution process are mathematically resolved from the peaks of interest. Of the 26 indolic metabolite standards included in this study, several of the standards were identified at significant levels (S/N > 10) in both the mutant and wild-type maize samples, including tryptophan, which was identified in both. The results of the WTTFA analysis indicated that a few additional peaks observed only in the mutant chromatograms were “indole-like”, based on their spectroscopic characteristics, and the PARAFAC results confirmed this observation. These peaks represent important potential targets for future biological studies. Given the huge number of potential compounds in the typical plant metabolome (estimated at up to 200 000 23), it is not surprising that we were only able to identify a small fraction of the unknown components using a limited number of standard compounds in our spectral library. The biological relevance of the results that we have obtained is under investigation and will be the subject of future publications. Figure 6D illustrates the quantitative capabilities of modeling with PARAFAC and fALS. The relative concentration information that can be obtained using multivariate methods provides valuable information about the differences in the levels of metabolites between different sample types. For example, based upon our preliminary work with the orp mutant, there are at least 19 compounds that differentially accumulate in the orp mutant relative to wild-type maize. As demonstrated by the results shown in Table 3, if the concentrations of the standard sample(s) are known, absolute quantification can most certainly be achieved. The results of this study provide important direction for future work in this area. It is evident in these data (and well known in the metabolomics literature23) that more replicate samples should be analyzed to obtain valid statistical conclusions about the relative levels of the compounds present in samples from different biological origins (i.e., mutant vs wild type). We anticipate the future application of the techniques discussed here (both chromatographic8 and chemometric) to more selective detectors for 2D-LC such as MS. We believe these methods will have wide applicability in both metabolomic and metabolic profiling research. ACKNOWLEDGMENT The authors acknowledge funding received from the Research Corporation (grant RA-0344, S.C.R.) and the National Institutes of Health (grant 5R01GM054585-09, P.W.C.). The authors also thank Marc Cantwell for assistance in the determination of the selectivity diagnostic for the multivariate methods.
Received for review April 4, 2006. Accepted May 26, 2006. AC0606195
Analytical Chemistry, Vol. 78, No. 15, August 1, 2006
5569