Predictive Metabolite Profiling Applying Hierarchical Multivariate Curve Resolution to GC-MS Data: A Potential Tool for Multi-parametric Diagnosis

Pär Jonsson,† Elin Sjövik Johansson,† Anna Wuolikainen,† Johan Lindberg,‡ Ina Schuppe-Koistinen,‡ Miyako Kusano,§,| Michael Sjöström,† Johan Trygg,† Thomas Moritz,| and Henrik Antti*,†

Research Group for Chemometrics, Organic Chemistry, Department of Chemistry, Umeå University, SE-901 87 Umeå, Sweden; Molecular Toxicology, Safety Assessment, AstraZeneca R&D, SE-141 85 Södertälje, Sweden; Metabolomics Research Group, Metabolome Analysis Research Team, RIKEN Plant Science Center (PSC), RIKEN 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama City, Kanagawa, 230-0045, Japan; and Umeå Plant Science Center, Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, SE-901 87 Umeå, Sweden

Received January 10, 2006

A method for predictive metabolite profiling based on resolution of GC-MS data followed by multivariate data analysis is presented and applied to three different biofluid data sets (rat urine, aspen leaf extracts, and human blood plasma). Hierarchical multivariate curve resolution (H-MCR) was used to simultaneously resolve the GC-MS data into pure profiles, describing the relative metabolite concentrations between samples, for multivariate analysis. Here, we present an extension of the H-MCR method that allows independent samples to be treated according to processing parameters estimated from a set of training samples. New samples can then be predicted by, or included in, an existing model on the basis of their metabolite profiles, which is a requirement for a working application within, e.g., clinical diagnosis. Apart from allowing treatment and prediction of independent samples, the proposed method also reduces the time for the curve resolution process, since only a subset of representative samples has to be processed while the remaining samples can be treated according to the obtained processing parameters. The time required for resolving the 30 training samples in the rat urine example was approximately 13 h, while the treatment of the 30 test samples according to the training parameters required only approximately 30 s per sample (about 15 min in total). In addition, the presented results show that the suggested approach works for describing metabolic changes in different biofluids, indicating that this is a general approach for high-throughput predictive metabolite profiling, which could have important applications in areas such as plant functional genomics, drug toxicity, treatment efficacy, and early disease diagnosis.

Keywords: metabolomics • metabonomics • metabolic profiling • chemometrics • GC-MS • curve resolution • clinical diagnosis • high-throughput • O-PLS • H-MCR

* To whom correspondence should be addressed. E-mail: henrik.antti@chem.umu.se. † Umeå University. ‡ AstraZeneca R&D. § RIKEN Plant Science Center (PSC). | Swedish University of Agricultural Sciences.

Journal of Proteome Research 2006, 5, 1407-1414. DOI: 10.1021/pr0600071. © 2006 American Chemical Society. Published on Web 04/20/2006.

Introduction

Multivariate metabolite profiling (metabolomics/metabonomics) has, for some time now, proved to be a potential means for studying global changes in metabolite concentrations related to varying physiological or pathological status. A large number of studies have shown that metabolic pattern changes in biofluids and tissues can be indicative of, and provide biomarkers for, e.g., target-specific drug toxicity [1], genetic modification in plants [2] and mammals, as well as specific disease states [3]. However, so far, in our opinion, the metabolic profiling technology has only been used to prove its versatility and potential in isolated studies that are usually sparsely validated in terms of predictions on independent sample sets. This is obviously a problem, since the ultimate use of a technology is highly dependent on its ability to handle, and to make accurate predictions from, independent samples measured with the same analytical technique, e.g., at different points in time or at different labs. In situations where one or a few identifiable biomarkers (metabolites) are responsible for the variation of interest, the results of isolated studies can always be further validated using target analysis or specific assays for these compounds. Nevertheless, in most cases pattern differences in complex biological systems are described by combinations of metabolites, not always known or easy to identify, varying in concentration in a characteristic manner [4]. In these situations, the value lies in using the whole metabolite profile, generated by some analytical platform [5,6], e.g., NMR, GC-MS, or LC-MS, as the key to obtaining a model for correct sample classification or characterization. To be able to use this model for predictions of independent samples, high demands will be put on factors such as experimental design, reproducibility of analytical procedures and platforms, as well as data handling and processing.

Figure 1. Scheme of the predictive metabolite profiling approach based on the H-MCR method. All or a representative selection of samples from the original set of samples are subjected to H-MCR. The obtained H-MCR parameters can then be used to resolve new samples or samples not selected in step 1. The data obtained after H-MCR processing or treatment according to the H-MCR parameters are collected in two separate data tables, where each row corresponds to one sample and each column to one "metabolite" (resolved profile). The value in each cell of the tables is the area under the resolved profile for a specific sample. Sample comparisons based on multivariate analysis can then be carried out either by modeling the processed data and using the model to make predictions of the treated samples, or by merging the processed and treated data and calculating a model based on all samples.

In the metabolic profiling area, NMR is considered to be the most robust and reproducible analytical platform. Together with the relatively fast and straightforward data handling, results have shown that NMR is suitable for providing accurate predictions of independent samples measured at different points in time or on different instruments in different labs [7]. The drawbacks of NMR are its relatively low sensitivity and the difficulty of identifying metabolites in complex matrices. GC-MS, on the other hand, is considered a much more sensitive technique than NMR and has the advantage that the detected metabolites are relatively easy to identify, especially in combination with an efficient curve resolution technique and spectral libraries. However, GC-MS is not as robust and reproducible as NMR and, so far, this has limited the ability to import and predict independent samples based on all processing and modeling steps. Here, we present a strategy for importing new independent samples into an existing processing and model framework based on curve resolution using the Hierarchical Multivariate Curve Resolution (H-MCR) method [8] followed by multivariate projection modeling. The presented results will have implications in areas where it is important to obtain reliable predictions for independent samples based on an existing model, e.g., in disease diagnosis.
In addition, it will also help to considerably reduce the time for data processing, e.g., curve resolution, since only a representative subset of samples describing the diversity of the data has to be processed and used to build a model, while the remaining samples can be efficiently treated according to the obtained processing parameters and thereafter incorporated or predicted into the model (Figure 1). Another application of this approach is in classification modeling when one class contains many more samples than another, which is a common problem that biases classification models. With the proposed method, one can choose to process and model a similar number of samples from the two classes, and thereafter import and predict the remaining samples, which will then not affect the classification model in any way.

Key issues for the presented strategy to work optimally are as follows: (i) use a well thought through experimental design introducing all known sources of variation into the model; (ii) make a representative selection of samples for processing and modeling by means of some type of fast screening of the entire sample population; and (iii) optimize analytical procedures to obtain robust and reproducible data over time and between labs. Results are presented for three different data sets: (a) aspen leaf extracts from a plant development study, (b) rat urine samples from a toxicity study, and (c) male and female human blood plasma samples. This adds further strength to the proposed method in terms of its applicability for metabolite profiling in a multitude of important fields such as plant functional genomics, drug toxicity, treatment efficacy, and disease diagnosis.

Processing of GC-MS Data. Each GC-MS measurement results in a 2-D data table $X_{\text{GC-MS}}$ in which the rows represent the time points of the individual scans and the columns represent the different mass channels. If one of the rows in this data table is plotted, it will show the total mass spectrum for the compounds eluting at that time point. If one column is plotted, an ion chromatogram for that ion will be revealed. The GC part of the GC-MS separates the compounds injected into the instrument according to their physicochemical properties (mainly boiling point and polarity). The MS part detects the compounds when they have passed through the GC and gives an individual fingerprint for each compound according to its chemical structure.
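The row and column views of the GC-MS data table described above can be illustrated with a small sketch. The array `X_gcms` and its values are hypothetical toy data, and the function names are illustrative assumptions rather than part of the original MATLAB implementation; the total ion current (TIC) is simply the row sum of the table.

```python
import numpy as np

# Hypothetical toy data: 6 scans (rows = time points) x 4 mass channels (columns).
X_gcms = np.array([
    [0, 1, 0, 0],
    [2, 5, 0, 1],
    [6, 9, 1, 2],
    [3, 4, 4, 1],
    [1, 1, 8, 0],
    [0, 0, 3, 0],
], dtype=float)


def mass_spectrum(X, scan):
    """One row of X: intensities over all mass channels at a single scan."""
    return X[scan, :]


def ion_chromatogram(X, channel):
    """One column of X: intensity of a single mass channel over all scans."""
    return X[:, channel]


def total_ion_current(X):
    """TIC chromatogram: sum over all mass channels for each scan."""
    return X.sum(axis=1)


tic = total_ion_current(X_gcms)
```

Plotting `mass_spectrum(X_gcms, 2)` would give the spectrum at the third scan, while `ion_chromatogram(X_gcms, 2)` would give the elution profile of the third mass channel.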
Not all compounds in a complex sample can be separated on the GC column, so signals for different compounds will be recorded at the same time, resulting in a mixture of them in the detector (mass spectrometer). To obtain the "pure" signals from each compound, a mathematical procedure named deconvolution or curve resolution has to be applied. This procedure can be described by the equation $X_{\text{GC-MS}} = C S^{T} + E$, in which C is a matrix with as many rows as $X_{\text{GC-MS}}$ and a number of columns equal to the number of compounds, since each column represents the chromatographic profile for one "pure" compound. S is a matrix with as many rows as $X_{\text{GC-MS}}$ has columns and a number of columns equal to the number of compounds, since each column represents the spectral profile for one "pure" compound. Column i, in both S and C, represents the same compound. E is the residual containing the instrumental noise. This is how it should work in theory. In practice it is not always as easy, especially when dealing with complex samples. Instead, C and S can only be seen as estimates of the pure compounds, and E will contain not only instrumental noise or artifacts, but also unresolved components and background. These issues usually occur due to nonlinear responses, background, the difficulty of distinguishing minor compounds from noise, and the difficulty of determining the chemical rank (the number of compounds to resolve).

When identifying a compound, the resolved spectral profile is used to search for the same or a similar profile in a spectral database (mass spectral library), and the chromatographic profile is used to determine its retention time (retention index). The retention index is used to confirm the "hit" in the mass spectral library and can also be used to select the right compound if more than one mass spectrum is found to be a good "hit" [9]. When comparing samples, the resolved compounds from one sample have to be found in the other samples (if they exist). This can be done by matching the resolved components from the different samples according to similarity in spectral profiles and retention times.

Different methods for solving the deconvolution equation have been published in the literature, such as alternating regression (AR) [10], iterative target transformation factor analysis (ITTFA) [11], heuristic evolving latent projections (HELP) [12], and gentle [13]. Recently, a new approach for resolving complex GC-MS data was presented. This method is named hierarchical multivariate curve resolution (H-MCR) [8] and allows multiple samples to be processed simultaneously, in contrast to the other methods, which are performed on an individual sample basis. The advantages of this approach are that no matching of resolved compounds is needed and that totally overlapping compounds (compounds with identical chromatographic profiles) can be resolved as long as the ratio between them differs between samples. Another advantage of the H-MCR procedure is that the same compound is quantified in the same way for all samples, using the common spectral profile. The H-MCR procedure includes the following steps (Figure 2): (i) Smoothing of each mass channel (column of $X_{\text{GC-MS}}$) by a moving average. (ii) Alignment by finding the maximum covariance between the samples' total ion current (TIC) chromatograms. (iii) Division of the data into time windows (the edges are set at common global low-intensity points in the GC chromatograms). (iv) Multivariate curve resolution [14] of each time window separately. This results in chromatographic profiles for each compound in each sample with a corresponding common spectral profile. The common spectral profile is the reason no matching is needed.
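Steps (i) and (ii) above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB code: the function names, the `max_shift` search range, and the use of a rigid integer shift with wrap-around (`np.roll`, which presumes negligible intensity at the chromatogram edges) are all hypothetical choices made for the example.

```python
import numpy as np


def smooth_channels(X, window=3):
    """Step (i): moving-average smoothing of each mass channel (column) of X."""
    kernel = np.ones(window) / window
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, X
    )


def best_shift(tic, target_tic, max_shift=5):
    """Step (ii): integer shift (in scans) of `tic` that maximizes its
    covariance with the target TIC. max_shift is an assumed search range."""
    def covariance_at(shift):
        return np.cov(np.roll(tic, shift), target_tic)[0, 1]
    return max(range(-max_shift, max_shift + 1), key=covariance_at)


def align(X, target_tic, max_shift=5):
    """Shift all mass channels of X by the best TIC shift (rigid alignment)."""
    shift = best_shift(X.sum(axis=1), target_tic, max_shift)
    return np.roll(X, shift, axis=0), shift
```

For a new sample, the same `target_tic` that was used for the training samples would be passed to `align`, which mirrors the requirement that new samples are aligned against the same target.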
When this is done, a matrix X can be formed in which each row represents one sample and each column represents one compound. The value in each cell of the matrix is the integrated area under the resolved chromatographic profile. The data matrix X can then be used for any kind of multivariate analysis in order to compare samples.

Figure 2. A. Raw chromatographic profiles for all model samples in a selected part of the chromatogram. B. Smoothed and aligned chromatographic profiles divided into time windows by setting edges in global (for all samples) low-intensity regions. C. MCR resolved profiles for one time window. D. Corresponding mass spectrum for one of the resolved profiles (top). Mass spectrum found as best "hit" in library search (bottom).

The H-MCR method can be seen as a transformation, or a model, of $X_{\text{GC-MS}}$ that forms a vector x (the quantification of all resolved compounds) representing a row in the matrix X. The obtained model can also be applied to new samples that have not been part of the curve resolution procedure. This is done by first smoothing the $X_{\text{GC-MS(NEW)}}$ matrix in the same way as the samples included in the model building part, followed by alignment of the new samples using the same target and division into time windows using the same edges. The new data are then resolved using the spectral profiles found in the MCR part. The processing parameters used for independent samples were as follows: (1) the target TIC (sample) used for alignment; (2) the edges used for time window setting; (3) the spectral profiles S, used to calculate the chromatographic profiles $C_{\text{NEW}}$ according to the equation $C_{\text{NEW}} = X_{\text{GC-MS(NEW)}} S (S^{T} S)^{-1}$; and (4) the applied constraints (unimodality and nonnegativity). This was done for each time window individually.

It is important to point out that there is a risk in applying a model to new samples, since new compounds not included in the model building will not be resolved perfectly, given that a spectrum for that specific compound is not present. The opposite situation, with compounds present in the model samples but not in the new samples, does not pose a problem, since these compound concentrations will be zero in the new samples. Nevertheless, it is important that the selection of samples included in the model building part is well thought through, so that the included samples are representative of the variation of the studied system. One way of selecting a representative set of samples is to first characterize all samples with another (faster) processing method and then perform the selection based on that description using a space-filling design [15] or other design approaches, such as D-optimal designs [16] or D-optimal onion designs [17], in multivariate space (e.g., PCA score space). This is something that will be investigated soon.
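The treatment of a new sample within one time window, using $C_{\text{NEW}} = X_{\text{GC-MS(NEW)}} S (S^{T} S)^{-1}$, can be sketched as below. This is only an illustration under assumptions: the function names are hypothetical, nonnegativity is approximated by clipping negative values to zero (a stand-in for a properly constrained solver), the unimodality constraint is not implemented, and the area is approximated by a plain sum over scans.

```python
import numpy as np


def resolve_new_sample(X_new, S):
    """Least-squares estimate C_new = X_new S (S^T S)^-1 for one time window.

    X_new: scans x mass channels (already smoothed, aligned, and windowed).
    S:     mass channels x compounds (spectral profiles from the training set).
    """
    # lstsq solves S @ C_new.T ~= X_new.T, which gives the same estimate
    # as the closed-form expression without forming an explicit inverse.
    C_T, *_ = np.linalg.lstsq(S, X_new.T, rcond=None)
    # Crude nonnegativity (assumption): negative intensities are set to zero.
    return np.clip(C_T.T, 0.0, None)


def peak_areas(C):
    """Area under each resolved chromatographic profile (sum over scans)."""
    return C.sum(axis=0)
```

The vector returned by `peak_areas` corresponds to the part of a row of the data matrix X that belongs to this time window; concatenating the windows would give the full sample descriptor.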

Material and Methods

Multivariate Analysis. Orthogonal partial least squares (O-PLS) [18] was used as the regression method in the rat urine toxicity example to correlate the metabolic information against the day of sample collection. O-PLS-DA (Discriminant Analysis) was used for classification in both the aspen leaf extracts example and the human blood plasma example. In all examples, the number of O-PLS components was decided by a 7-fold full cross validation [19]. Principal Component Analysis (PCA) [20] and Hierarchical Cluster Analysis (HCA) [21] were used to provide an overview of the data in the urine toxicity example. Selection of representative samples for processing and model building in the human blood plasma example was performed using a space-filling design [15].

Data Processing and Analysis. Nonprocessed GC-MS data files were exported to MATLAB software 6.5 (Mathworks, Natick, MA), where all data processing procedures, such as smoothing, alignment, time window setting, hierarchical multivariate curve resolution (H-MCR), and fast processing according to Jonsson et al. [22], were carried out. Multivariate analysis (O-PLS and PCA) was performed in the SIMCA-P+ 11 software (Umetrics AB, Umeå, Sweden). Sample selection was performed using in-house MATLAB scripts for space-filling design.

Rat Urine Samples. Animals and Treatments. Five male rats (Han Wistar, substrain BrlHan:WIST@Mo, approximately 2 months of age at study onset, weight range 220-300 g) were used in the study. Animals were housed in groups (2 or 3 per cage) during the acclimatization period in the animal house (10 days). During the course of the study, the 5 animals were housed in individual metabolism cages to allow for the continuous collection of urine. Animals were dosed by oral gavage with a proprietary AstraZeneca compound (200 mg/kg/day) for 7 days beginning on day 1 and necropsied on day 8, when tissues were taken for pathology. Urine samples were collected on ice into 1% (1 mL, w/v) sodium azide at day 0 (pre-dose) and at days 1, 2, 3, 5, and 7 (0-7 h post dose). Urine pH was recorded and the samples were retained at -80 °C until analysis with GC-quadrupole-MS. The sample preparation, derivatization, and GC-MS protocols can be found in the Supporting Information.

Aspen Leaf Extracts. Sample Description. Twenty-eight hybrid aspen (Populus tremula × P. tremuloides) plants were grown in a growth chamber, as described by Jonsson et al. [22]. After 15 weeks in the growth chamber, the plants were sampled. From each plant, leaves 2, 9, 10, and 20, counting from the top of the plant, were taken. All samples were instantly frozen in liquid nitrogen and stored at -80 °C until analysis. Extraction, derivatization, and GC-TOF-MS were performed according to Gullberg et al. [23] (see Supporting Information).

Human Plasma Samples. Sample Description. Human blood plasma samples were obtained from blood donors of fertile age (20-40 years) from the Blood Central of Norrlands Universitetssjukhus, Umeå, Sweden. Approximately 10 mL of blood was collected from each person. The blood samples were centrifuged for 10 min at 3000 × g. The plasma was collected, aliquoted, and stored at -80 °C until use. Plasma samples from 21 men and 22 women were thawed at 37 °C for 15 min followed by vortexing. Ten of the female samples were prepared in duplicate. The extraction, derivatization, and GC-TOF-MS were performed according to A et al. [24] (see Supporting Information).

Results

Rat Urine Data. GC-MS data from 60 rat urine samples, split into 30 training samples and 30 test samples (representing analytical replicates), were subjected to the H-MCR procedure. Alignment and smoothing using a moving average (window size 3) were done prior to dividing the 30 training chromatograms into 35 time windows. In total, 364 compound profiles were resolved from the 35 time windows and used as sample descriptors in the multivariate analysis. The 30 test samples were then treated according to the training H-MCR parameters (Figure 3). The time required for resolving the 30 training samples was approximately 13 h, while the treatment of the 30 test samples according to the training parameters required approximately 30 s per sample. This highlights the large time savings offered by processing only a subset of representative samples and treating the remaining samples as test samples.

Figure 3. A. H-MCR resolved profiles in one time window (window 17) for 30 rat urine samples. B. Resolved profiles in one time window (window 17) for 30 analytical replicates treated according to the H-MCR parameters in A.

Each column (variable) in the data matrix X was Log10 transformed, mean centered, and scaled to unit variance prior to O-PLS analysis. An O-PLS model was calculated correlating the resolved urinary profiles X against the urine collection time point y. Three significant O-PLS components (1 predictive and 2 orthogonal to the response) were extracted, describing 61.1% of the variation in X (R²X = 0.611), with 17.1% of the variation in X correlated to y (R²Xycorr = 0.171), describing 96.2% of the variation in y (R²Y = 0.962) and predicting 83.3% of the variation in y (Q²Y = 0.833), according to cross validation. The 30 test samples treated according to the training H-MCR parameters were then predicted into the O-PLS model, resulting in a predictive accuracy for the test samples (RMSEP = 0.68) that was in good agreement with the model estimations based on the training samples (RMSEE = 0.50) (Figure 4).

Figure 4. Observed versus predicted/estimated scatter plot for the O-PLS model correlating 364 resolved urinary profiles X with the day of urine sampling y. The plot reveals the correlation between the day of sampling (Days) and the O-PLS estimated values for the 30 model samples, as well as the correlation between the day of sampling (Days) and the O-PLS predicted values for the 30 analytical replicates used as test samples, treated according to the H-MCR parameters for the model samples and predicted into the model.

The curve resolution for the test samples was evaluated by correlating the areas of the resolved compounds for the test samples with the corresponding areas for the training samples (analytical replicates) for all 364 resolved compound profiles, which revealed a correlation >0.95 for 76.1% of the compounds. This should be compared to a correlation >0.95 for 75.7% of the compounds when all 60 samples were processed together. Merging the training and test samples and calculating a PCA model gave an indication of whether the processed test samples can be used to complement the model based on the training samples. For the processed urine data, a good agreement between the analytical replicates (training and test samples) could be seen in the PCA score plot (Figure 5A). This should be compared to a score plot from a PCA based on the data where all 60 samples were processed simultaneously, which shows the same pattern as the score plot in Figure 5A (data not shown). To further verify the agreement between the analytical replicates, the merged data were subjected to Hierarchical Cluster Analysis (HCA). Each variable was standardized to unit variance prior to complete linkage HCA. In Figure 5B, it is clear that all analytical replicates cluster together in pairs. This is also the case for the dendrogram based on the data obtained when all 60 samples were processed together (not shown). Both the PCA and HCA results show that it is possible to complement an existing model with new samples processed based on an existing sample set, as long as the acquired data are of high and equal quality in both sample sets.
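The column preprocessing and the validation figures quoted for the rat urine example can be sketched as follows. The O-PLS model itself is not reproduced here. The function names are hypothetical, the areas are assumed to be strictly positive so that the logarithm is defined, and the scaling parameters are assumed to be estimated on the training samples and reused for the test samples. The plain root-mean-square error below carries no degrees-of-freedom correction, so values reported by modeling software for RMSEE may be computed slightly differently.

```python
import numpy as np


def fit_scaler(X_train):
    """Column means and standard deviations of the log10-transformed areas."""
    logged = np.log10(X_train)
    return logged.mean(axis=0), logged.std(axis=0, ddof=1)


def apply_scaler(X, mean, std):
    """Log10 transform, mean center, and scale to unit variance
    using parameters estimated from the training samples."""
    return (np.log10(X) - mean) / std


def rmse(y_observed, y_predicted):
    """Root-mean-square error between observed and predicted responses."""
    diff = np.asarray(y_observed, dtype=float) - np.asarray(y_predicted, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def fraction_well_correlated(areas_train, areas_test, threshold=0.95):
    """Fraction of compounds whose replicate areas correlate above threshold.
    Row k of both matrices must be the same sample (analytical replicates)."""
    r = np.array([
        np.corrcoef(areas_train[:, j], areas_test[:, j])[0, 1]
        for j in range(areas_train.shape[1])
    ])
    return float(np.mean(r > threshold))
```

Applied to the replicate area tables, `fraction_well_correlated` would yield the kind of summary reported above (the share of compounds with a replicate correlation above 0.95).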
In cases when differences exist in data quality between training and test data, which is likely to occur in many cases, prediction of the test samples into the existing training model will function as a filter of the test sample data and will help to focus on the variation of interest in the processed test samples.

Figure 5. A. PCA score plot of the 60 urine samples (30 H-MCR processed + 30 treated according to obtained H-MCR parameters). In the plot, the analytical replicates are connected with a line, showing that samples treated according to the model H-MCR parameters end up close to the analytical replicates processed by the H-MCR method. B. Dendrogram from Hierarchical Cluster Analysis (HCA) of the 60 urine samples (30 H-MCR processed + 30 treated according to obtained H-MCR parameters). In the dendrogram, it is evident that all analytical replicates cluster together in pairs. In the sample names, e.g., DA11D0, D denotes dosed animal, A11 denotes analytical replicate A for sample 11, and D0 denotes sampling day 0. Sample replicates A were used for the H-MCR processing, whereas sample replicates B were treated according to the obtained processing parameters.

Aspen Leaf Extract Data. GC-TOF-MS data from 105 plant extracts were split into 65 training samples (leaves 2, 10, and 20 from 28 hybrid aspen plants, analyzed on March 10-15, 2003), which were subjected to the H-MCR procedure, and 40 test samples (four each from leaves 2, 10, and 20, analyzed on March 10-15, 2003, and 28 from leaf 9, from the same 28 plants but analyzed on February 2, 2004), which were treated according to the obtained training H-MCR parameters. Alignment and smoothing using a moving average (window size 7) were done prior to dividing the 65 training chromatograms into 56 time windows. In total, 440 compound profiles were resolved from the 56 time windows and used as sample descriptors in the multivariate analysis. Each row (sample) of the data matrix X was normalized to sum = 1. In addition, each column (variable) was Log10 transformed, mean centered, and scaled to unit variance prior to classification modeling according to leaf number. An O-PLS-DA model was calculated correlating the resolved leaf extract profiles X against a "dummy matrix" Y describing the class identity, i.e., leaf number. Two significant O-PLS-DA components (both correlated to Y) were extracted, describing 58.7% of the variation in X (R²X = 0.587) and 97.1% of the variation in Y (R²Y = 0.971), and predicting 96.7% of the variation in Y (Q²Y = 0.967), according to cross validation. The 40 test samples, including the 28 samples from leaf 9 analyzed approximately 11 months later than the training samples, were treated according to the training H-MCR parameters and then predicted into the model. In the O-PLS-DA scores, the 28 test samples from leaf 9 were positioned close to the leaf 10 class, between the leaf 2 and leaf 10 classes, and the remaining 12 test samples were correctly classified to their respective leaf classes (Figure 6).

Figure 6. O-PLS-DA score plot for the analysis of aspen leaf extract samples, revealing classification of the training samples building up the model (leaf 2, leaf 10, and leaf 20) and the test set samples from leaves 2, 10, and 20 (analyzed on March 10-15, 2003) and leaf 9 (analyzed on February 2, 2004). All test samples are circled in gray in the plot.

This is not surprising, since leaves 2, 9, 10, and 20 describe different developmental stages. Hence, the metabolic variation between leaves 2 and 20 is large in comparison to the difference between leaves 9 and 10. These results therefore emphasize how the suggested approach can be used to treat and predict new samples correctly even when samples with the same characteristics are not included in the model. This works because the training samples span an experimental domain (leaf 2 to 20) within which the test samples (leaf 9) fall, meaning that a representative training set selected according to some type of design is essential for the procedure to perform well.
In addition, this example also points out the possibility to treat and predict samples analyzed at different points in time (training and test samples analyzed approximately 11 months apart). This is of great importance, as in many biological studies the analysis of new samples for verifying old studies occurs much later than the first set of analyses. In this case, the multivariate model will act as a filter, moving small differences due to, e.g., analytical drifts into the model residuals, while focusing on the important predictive information in the test samples.
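Two of the data preparation steps in the aspen example, the row normalization to unit sum and the "dummy matrix" encoding leaf number, can be illustrated with a short sketch. The function names are hypothetical, and the one-column-per-class 0/1 encoding is an assumption about how the dummy matrix was laid out, since the exact coding is not specified in the text.

```python
import numpy as np


def normalize_rows(X):
    """Scale each sample (row) so that its resolved areas sum to 1."""
    return X / X.sum(axis=1, keepdims=True)


def dummy_matrix(labels):
    """0/1 class-membership matrix with one column per class (assumed coding).
    Returns the matrix and the class order used for its columns."""
    classes = sorted(set(labels))
    Y = np.array([[1.0 if lab == c else 0.0 for c in classes] for lab in labels])
    return Y, classes
```

For instance, `dummy_matrix([2, 10, 20, 10])` yields a three-column matrix whose columns correspond to leaves 2, 10, and 20.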


Human Blood Plasma Data. Human plasma GC-TOF-MS data from 21 male and 32 female subjects were processed according to the method described by Jonsson et al. [22], followed by a class-wise PCA. A selection of representative training samples was then made using a space-filling design in the scores from each of the two individual PCA models. The 53 samples were divided into 34 training samples (17 male + 17 female) and 19 test samples (4 male + 15 female). The training set was then subjected to the H-MCR procedure. Alignment and smoothing using a moving average (window size 7) were done prior to dividing the training chromatograms into 62 time windows. In total, 332 compound profiles were resolved from the 62 time windows and used as sample descriptors in the multivariate analysis. The 19 test samples were then treated according to the obtained H-MCR parameters. Prior to multivariate analysis, all variables (integrated areas for the resolved metabolites) were normalized using a weighted sum of all internal standards, mean centered, and scaled to unit variance. An O-PLS-DA model was calculated correlating the samples' metabolic composition X against a response y encoding class membership (male or female). According to cross validation, four significant components were extracted (one predictive and three orthogonal components), describing 30.3% of the variation in X (R²X = 0.303), with 5.11% of the variation in X correlated to y (R²Xycorr = 0.0511), describing 98.8% of the variation in y (R²Y = 0.988) and predicting 42.3% of the variation in y (Q²Y = 0.423). The O-PLS-DA model revealed a clear separation of the male and female training samples (Figure 7A). The 19 test samples, treated according to the H-MCR parameters obtained from the model samples, were then predicted into the O-PLS-DA model. All test samples were accurately predicted with respect to gender (Figure 7B).
This implies that the proposed strategy can be used for building robust predictive systems for classification or characterization of human blood plasma samples based on their metabolite composition.
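The selection of training samples by a space-filling design in PCA score space can be sketched as below. The text does not state which space-filling algorithm was used, so the greedy maximin rule here is an assumption chosen as one common heuristic, and the function names are hypothetical. In the plasma example, such a selection would be run separately on the scores of each class.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """PCA scores of mean-centered data, computed via the SVD."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]


def maximin_selection(T, n_select):
    """Greedy space-filling pick in score space T (samples x components).

    Starts from the sample farthest from the centroid, then repeatedly adds
    the sample whose distance to its nearest already-chosen sample is largest.
    """
    centroid = T.mean(axis=0)
    chosen = [int(np.argmax(np.linalg.norm(T - centroid, axis=1)))]
    while len(chosen) < n_select:
        # Distance from every sample to its nearest chosen sample.
        dists = np.linalg.norm(T[:, None, :] - T[chosen][None, :, :], axis=2)
        nearest = dists.min(axis=1)
        nearest[chosen] = -1.0  # never pick the same sample twice
        chosen.append(int(np.argmax(nearest)))
    return chosen
```

The returned indices would identify the samples sent to the full H-MCR procedure; the remaining samples would be treated according to the obtained parameters.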

Figure 7. O-PLS-DA scatter plots for the analysis of human plasma samples, showing separation between male (≤) and female (O) subjects. A. Estimated y values (class membership) for plasma samples used to fit the O-PLS-DA model. B. Predicted y values (class membership) for plasma samples in the test set.

Discussion Predictive models for biological systems based on multiparametric gene, protein, or metabolite fingerprints are a requirement to meet future demands and goals regarding early disease diagnosis, clinical monitoring, or even personalized healthcare. For this to succeed, isolated studies need to be complemented with procedures and models that can be used to treat and make predictions for samples analyzed with the same analytical platform but, e.g., at different points in time or at different labs. With the development of analytical techniques for, e.g., metabolic fingerprinting, it is reasonable to believe that using the whole generated profiles, as opposed to individual markers, will be the way to proceed in order to maximize the information output as well as to build predictive systems. This, however, puts high demands on factors such as analytical reproducibility, data handling and processing, statistical modeling, and validation, to create robust systems providing reliable and biologically relevant results. For instance, to achieve a working clinical diagnosis application it will be of great importance to have a robust procedure all the way from sample handling via instrumental analysis to modeling and prediction. If this is the case, then it will be possible to take a sample from a new subject, process it according to existing procedures, and get a prediction result based on a deciding model. A prerequisite for obtaining a reliable prediction result is obviously that the diagnostic information in the deciding model is based on biologically relevant variation, as opposed to analytical artifacts, and that this information is reproducible over multiple studies. In addition, selecting diverse samples for model building, so that the models include variation representative of future prediction samples, will be vital and should preferably be performed using some type of experimental design methodology. In the present study GC-MS and hierarchical multivariate curve resolution (H-MCR) are used as a concept for sensitive and reproducible generation of representative data reflecting the concentration of pure metabolites, and for obtaining pure spectral profiles for the resolved metabolites, which can be used for identification purposes by means of library searches. Multivariate data analysis is then applied to the data for building robust and predictive models based on the whole metabolite profile and as a means for detecting combinations of important metabolites (markers) that can be subjected to identification based on the resolved mass spectra. The novelty of the strategy is that the H-MCR procedure has been extended so that new samples can now be treated according to the curve resolution parameters obtained for the model samples, meaning that the end product is a fast and efficient means for processing new samples, which can then be predicted into the existing model. As the results imply, this has a number of implications that are of benefit for further development of the

metabolite profiling area. First and foremost, there is now a working metabolite profiling framework for treatment and prediction of new samples, analyzed independently from the model samples, which is a prerequisite for the development of diagnostic models working in a clinical setting. In addition, sample processing is faster, since only a subset of representative samples has to be resolved; the remaining samples can be imported and treated according to the model samples and then predicted or included into the existing model. In the rat urine example, the time required for resolving the 30 training samples was approximately 13 h, while the treatment of the 30 test samples according to the training parameters required approximately 30 s per sample (∼15 min in total), which greatly decreases the processing time and meets high-throughput requirements for large sample numbers. The sequential combination of the hierarchical curve resolution and the multivariate statistical analysis helps strengthen the method in terms of robust and interpretable modeling, but also works as a “filter” for new samples entering the existing models. For example, in the analysis of aspen leaf extracts (Figure 6), the samples predicted into the existing model were analyzed 11 months later, showing that the variations caused by, e.g., analytical drifts (including GC column changes and TOF-detector voltage increases in this example) or sample handling (older samples) are suppressed and focus is put on the relevant variation described by the deciding model. This, however, does not imply that the work on analytical reproducibility or detailed sample handling protocols can be neglected. Such work is particularly crucial as the analytical techniques become more and more sensitive and the risk of obtaining spurious results due to analytical artifacts increases drastically.
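The timing figures quoted above for the rat urine example can be checked with a few lines of Python. This is only an arithmetic sketch based on the approximate values given in the text. The helper name and the 1000-sample extrapolation are hypothetical, and the extrapolation assumes that the per-sample treatment time stays constant.

```python
def total_minutes(n_samples, seconds_per_sample):
    """Total treatment time in minutes for samples processed one by one."""
    return n_samples * seconds_per_sample / 60.0

train_minutes = 13 * 60               # ~13 h to resolve the 30 training samples
test_minutes = total_minutes(30, 30)  # 30 test samples at ~30 s each
ratio = train_minutes / test_minutes  # how much longer full resolution took

# Hypothetical extrapolation to a larger batch of new samples.
large_batch_minutes = total_minutes(1000, 30)

print(f"test set: {test_minutes:.0f} min, ratio: {ratio:.0f}x")
print(f"1000 new samples: {large_batch_minutes / 60:.1f} h")
```

With these approximate inputs, treating the 30 test samples takes about 15 min, roughly 50 times less than resolving the 30 training samples.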
As shown in the presented results, the suggested approach works for describing metabolic changes in a number of different biologically relevant samples (rat urine, aspen leaf extracts, and human blood plasma), which indicates that this methodology is a general approach for high-throughput predictive metabolite profiling that can have important applications in areas such as plant functional genomics, drug toxicity, treatment efficacy, and early disease diagnosis.
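The selection of diverse, representative training samples discussed above can be sketched as a greedy maximin (space-filling) selection in a score space. This is a stand-in for illustration and not the space-filling algorithm used in the human blood plasma example, whose details are not given here. The function name, the centroid-based starting point, and the synthetic scores are assumptions.

```python
import numpy as np

def maximin_select(scores, n_select):
    """Greedily pick rows of `scores` that spread out over the score space.

    Starts from the sample closest to the centroid, then repeatedly adds the
    sample whose distance to its nearest already selected sample is largest.
    """
    scores = np.asarray(scores, dtype=float)
    if n_select > len(scores):
        raise ValueError("n_select exceeds the number of samples")
    centroid = scores.mean(axis=0)
    first = int(np.argmin(np.linalg.norm(scores - centroid, axis=1)))
    selected = [first]
    # Distance from every sample to its nearest selected sample.
    min_dist = np.linalg.norm(scores - scores[first], axis=1)
    while len(selected) < n_select:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(scores - scores[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected

# Synthetic PCA scores for one class (21 samples, 2 components).
rng = np.random.default_rng(1)
class_scores = rng.normal(size=(21, 2))
train_idx = maximin_select(class_scores, 17)
test_idx = [i for i in range(len(class_scores)) if i not in train_idx]
```

Running the selection separately within each class, as in the class-wise PCA models described for the plasma data, lets the training set cover the score space of each class while the remaining samples form the test set.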

Acknowledgment. This work was supported by grants from the EU-strategic funding, Strategic Research Foundation (SSF), the Swedish Association of Persons with Neurological Disabilities (NHR), the Swedish Research Council, Wallenberg Consortium North (WCN), and the Kempe Foundation. Elwin R. Verheij, Leo van Stee and Bas Muiljwijk, TNO Pharma are gratefully acknowledged for provision of the GC-MS data in the rat urine toxicity example. Ing-Marie Olsson, Umeå University, is gratefully acknowledged for provision of the space filling algorithm used in the human blood plasma example.

Supporting Information Available: Extraction, derivatization, and GC-TOF-MS information (pdf). This material is available free of charge via the Internet at http://pubs.acs.org.

References

(1) Lindon, J. C.; Nicholson, J. K.; Holmes, E.; Antti, H.; Bollard, M. E.; Keun, H.; Beckonert, O.; Ebbels, T. M.; Reilly, M. D.; Robertson, D.; Stevens, G. J.; Luke, P.; Breau, A. P.; Cantor, G. H.; Bible, R. H.; Niederhauser, U.; Senn, H.; Schlotterbeck, G.; Sidelmann, U. G.; Laursen, S. M.; Tymiak, A.; Car, B. D.; Lehman-McKeeman, L.; Colet, J. M.; Loukaci, A.; Thomas, C. Contemporary issues in toxicology - The role of metabonomics in toxicology and its evaluation by the COMET project. Toxicol. Appl. Pharmacol. 2003, 187 (3), 137-146.

Journal of Proteome Research • Vol. 5, No. 6, 2006 1413

(2) Fiehn, O. Metabolomics - the link between genotypes and phenotypes. Plant Mol. Biol. 2002, 48 (1-2), 155-171.
(3) Brindle, J. T.; Antti, H.; Holmes, E.; Tranter, G.; Nicholson, J. K.; Bethell, H. W. L.; Clarke, S.; Schofield, P. M.; McKilligin, E.; Mosedale, D. E.; Grainger, D. J. Rapid and noninvasive diagnosis of the presence and severity of coronary heart disease using H-1 NMR-based metabonomics. Nat. Med. 2002, 8 (12), 1439-1444.
(4) van der Greef, J.; Stroobant, P.; van der Heijden, R. The role of analytical sciences in medical systems biology. Curr. Opin. Chem. Biol. 2004, 8 (5), 559-565.
(5) Dunn, W. B.; Ellis, D. I. Metabolomics: Current analytical platforms and methodologies. TrAC-Trends Anal. Chem. 2005, 24 (4), 285-294.
(6) Sumner, L. W.; Mendes, P.; Dixon, R. A. Plant metabolomics: large-scale phytochemistry in the functional genomics era. Phytochemistry 2003, 62 (6), 817-836.
(7) Keun, H. C.; Ebbels, T. M. D.; Antti, H.; Bollard, M. E.; Beckonert, O.; Schlotterbeck, G.; Senn, H.; Niederhauser, U.; Holmes, E.; Lindon, J. C.; Nicholson, J. K. Analytical reproducibility in H-1 NMR-based metabonomic urinalysis. Chem. Res. Toxicol. 2002, 15 (11), 1380-1386.
(8) Jonsson, P.; Johansson, A. I.; Gullberg, J.; Trygg, J.; A, J.; Grung, B.; Marklund, S.; Sjöström, M.; Antti, H.; Moritz, T. High-Throughput Data Analysis for Detecting and Identifying Differences between Samples in GC/MS-Based Metabolomic Analyses. Anal. Chem. 2005, 77 (17), 5635-5642.
(9) Schauer, N.; Steinhauser, D.; Strelkov, S.; Schomburg, D.; Allison, G.; Moritz, T.; Lundgren, K.; Roessner-Tunali, U.; Forbes, M. G.; Willmitzer, L.; Fernie, A. R.; Kopka, J. GC-MS libraries for the rapid identification of metabolites in complex biological samples. FEBS Lett. 2005, 579 (6), 1332-1337.
(10) Karjalainen, E. J. The Spectrum Reconstruction Problem - Use of Alternating Regression for Unexpected Spectral Components in 2-Dimensional Spectroscopies. Chemom. Intel. Lab. Syst. 1989, 7 (1-2), 31-38.
(11) Gemperline, P. J. A priori estimates of the elution profiles of the pure components in overlapped liquid chromatography peaks using target factor analysis. J. Chem. Inf. Comput. Sci. 1984, 24 (4), 206-212.
(12) Kvalheim, O. M.; Liang, Y. Z. Heuristic Evolving Latent Projections - Resolving 2-Way Multicomponent Data. 1. Selectivity, Latent-Projective Graph, Datascope, Local Rank, and Unique Resolution. Anal. Chem. 1992, 64 (8), 936-946.
(13) Manne, R.; Grande, B. V. Resolution of two-way data from hyphenated chromatography by means of elementary matrix transformations. Chemom. Intel. Lab. Syst. 2000, 50 (1), 35-46.
(14) Tauler, R. Multivariate curve resolution applied to second-order data. Chemom. Intel. Lab. Syst. 1995, 30 (1), 133-146.
(15) Marengo, E.; Todeschini, R. A New Algorithm for Optimal, Distance-Based Experimental-Design. Chemom. Intel. Lab. Syst. 1992, 16 (1), 37-44.
(16) de Aguiar, P. F.; Bourguignon, B.; Khots, M. S.; Massart, D. L.; Phan-Than-Luu, R. D-optimal designs. Chemom. Intel. Lab. Syst. 1995, 30 (2), 199-210.
(17) Olsson, I. M.; Gottfries, J.; Wold, S. D-optimal onion designs in statistical molecular design. Chemom. Intel. Lab. Syst. 2004, 73 (1), 37-46.
(18) Trygg, J.; Wold, S. Orthogonal projections to latent structures (O-PLS). J. Chemom. 2002, 16 (3), 119-128.
(19) Wold, S. Cross-Validatory Estimation of the Number of Components in Factor and Principal Components Models. Technometrics 1978, 20 (4), 397-405.
(20) Wold, S.; Esbensen, K.; Geladi, P. Principal Component Analysis. Chemom. Intel. Lab. Syst. 1987, 2 (1-3), 37-52.
(21) Johnson, D. E. Applied Multivariate Methods for Data Analysts; Brooks/Cole Publishing Company: Pacific Grove, CA, 1998.
(22) Jonsson, P.; Gullberg, J.; Nordström, A.; Kusano, M.; Kowalczyk, M.; Sjöström, M.; Moritz, T. A strategy for identifying differences in large series of metabolomic samples analyzed by GC/MS. Anal. Chem. 2004, 76 (6), 1738-1745.
(23) Gullberg, J.; Jonsson, P.; Nordström, A.; Sjöström, M.; Moritz, T. Design of experiments: an efficient strategy to identify factors influencing extraction and derivatization of Arabidopsis thaliana samples in metabolomic studies with gas chromatography/mass spectrometry. Anal. Biochem. 2004, 331 (2), 283-295.
(24) A, J.; Trygg, J.; Gullberg, J.; Johansson, A. I.; Jonsson, P.; Antti, H.; Marklund, S. L.; Moritz, T. Extraction and GC/MS analysis of the human blood plasma metabolome. Anal. Chem. 2005, 77 (24), 8086-8094.
