Anal. Chem. 2008, 80, 3783–3790
NMR-Based Characterization of Metabolic Alterations in Hypertension Using an Adaptive, Intelligent Binning Algorithm

Tim De Meyer,*,† Davy Sinnaeve,‡ Bjorn Van Gasse,‡ Elena Tsiporkova,§ Ernst R. Rietzschel,| Marc L. De Buyzere,| Thierry C. Gillebert,| Sofie Bekaert,† José C. Martins,‡ and Wim Van Criekinge†

Department of Molecular Biotechnology, Faculty of Bioscience Engineering, Ghent University, Coupure Links 653, B-9000 Ghent, Belgium, Department of Organic Chemistry, NMR and Structure Analysis, Ghent University, Krijgslaan 281 S4, B-9000 Ghent, Belgium, VRT MediaLab, VRT R&D at the Institute for BroadBand Technologies, Gaston Crommenlaan 10, B-9050 Ghent-Ledeberg, Belgium, and Department of Cardiovascular Diseases, Ghent University, De Pintelaan 185, B-9000 Ghent, Belgium
As with every -omics technology, metabolomics requires new methodologies for data processing. Due to the large spectral size, a standard approach in NMR-based metabolomics implies the division of spectra into equally sized bins, thereby simplifying subsequent data analysis. Yet, disadvantages are the loss of information and the occurrence of artifacts caused by peak shifts. Here, a new binning algorithm, Adaptive Intelligent Binning (AI-Binning), which largely circumvents these problems, is presented. AI-Binning recursively identifies bin edges in existing bins, requires only minimal user input, and avoids the use of arbitrary parameters or reference spectra. The performance of AI-Binning is demonstrated using serum spectra from 40 hypertensive and 40 matched normotensive subjects from the Asklepios study. Hypertension is a major cardiovascular risk factor characterized by a complex biochemistry and, in most cases, an unknown origin. The binning algorithm resulted in an improved classification of hypertensive status compared with that of standard binning and facilitated the identification of relevant metabolites. Moreover, since the occurrence of noise variables is largely avoided, AI-Binned spectra can be unit-variance scaled. This enables the detection of relevant, low-intensity metabolites. These results demonstrate the power of AI-Binning and suggest the involvement of α-1 acid glycoproteins and choline biochemistry in hypertension.

* Corresponding author. Fax: +32 9 264 62 19.
† Department of Molecular Biotechnology, Ghent University.
‡ Department of Organic Chemistry, NMR and Structure Analysis, Ghent University.
§ Institute for BroadBand Technologies.
| Department of Cardiovascular Diseases, Ghent University.
(1) Lindon, J. C.; Holmes, E.; Nicholson, J. K. Prog. Nucl. Magn. Reson. Spectrosc. 2004, 45, 109–43.
(2) Griffin, J. L.; Bollard, M. E. Curr. Drug Metab. 2004, 5, 389–98.
(3) Dieterle, F.; Ross, A.; Schlotterbeck, G.; Senn, H. Anal. Chem. 2006, 78, 3551–61.
(4) Forshed, J.; Torgrip, R. J.; Aberg, K. M.; Karlberg, B.; Lindberg, J.; Jacobsson, S. P. J. Pharm. Biomed. Anal. 2005, 38, 824–32.

10.1021/ac7025964 CCC: $40.75 © 2008 American Chemical Society. Published on Web 04/18/2008
Analytical Chemistry, Vol. 80, No. 10, May 15, 2008

Current high-resolution 1H NMR equipment readily permits fast and relatively sensitive measurement of a whole range of biofluid metabolites, providing an excellent tool for high-throughput analysis. This has led to several very successful applications in, for example, toxicology.1,2 As a single NMR spectrum represents a vast amount of data, several preprocessing steps are typically performed, usually followed by traditional multivariate data analysis. Several algorithms specific to metabolomics have already been described in the scientific literature, e.g., metabolite projection analysis.3

Crucial in preprocessing is data reduction. Omitting this step causes two major problems: (1) the almost unmanageable dimensionality of the data and (2) the interindividual differences in peak locations. The latter is caused by slight variations in the sample environment, such as pH. Further analytical problems arise due to the ultrahigh dimensionality of the data showing high dependencies between the features. The standard approach to solve these problems is the division of each spectrum into equally sized (typically 0.04 ppm) bins, integration of the intensity values in each bin, and assignment of this value to the bin. Alternatively, maximum intensities are used instead of integrated intensities.4 However, due to its crudeness, this equidistant binning method results in the loss of a considerable amount of information enclosed in the original spectra. For instance, when several peaks are assigned to the same bin, the smaller peaks will be obscured, while other bins might exclusively contain noise. Importantly, peaks on bin edges will be scattered over several bins and the distribution of this scattering might even change between spectra due to slight variations in peak locations.

For further data analysis, unit-variance scaling (or auto scaling) is to be preferred since this results in all bins having equal variances and thus equal weights in further analysis. Nevertheless, due to the equidistant binning procedure, many bins are located in noise regions, resulting in the introduction of noise variables in the statistical models and concomitant deterioration of these models. Therefore, pareto scaling, i.e., dividing the (mean-centered) variables (bins) by the square root of their standard deviation, is generally applied for equidistant binned spectra. While this scaling method reduces the larger variances more than the smaller ones, it does not result in equal variances. Thus, while the impact of noise is reduced, metabolites with low peak intensity may be ignored.

Several algorithms have been developed as an alternative to standard binning. While many focus on peak alignment,5–7 only a few methods allow combined peak alignment and data reduction, for example, PARS,4,8 the curve-fitting algorithm,9 the peak alignment tools in HiRes,10 and targeted profiling.11 These approaches identify peaks or specific peak patterns in each spectrum based on their appearance being conserved between spectra or on theoretically or experimentally determined peak pattern shapes. In the ideal situation, each pattern in a spectrum is identified, even when slightly shifted. Further data analysis is then performed on the listed patterns and their characteristics, such as their amplitude or area. The performance of this type of algorithm depends heavily on the accuracy of the peak alignment, since erroneous alignment automatically introduces artifacts in the data. Algorithms which depend on theoretically or experimentally determined peak patterns therefore require highly accurate and complete pattern databases.

An alternative approach is to improve standard, equidistant binning by allowing variable bin sizes. This approach should enhance equidistant binning in a very robust way, since a priori knowledge is not introduced and data-modifying peak alignments are avoided. Several methods have been proposed, such as non-equidistant binning3 and adaptive binning.12 Both create a reference spectrum, by averaging or by taking maximal intensities over all spectra, respectively, and then take the smoothed minima of this spectrum as the bin edges.
In non-equidistant binning the five-point minima are used, while in adaptive binning the relative minima of the undecimated wavelet transformed and, therefore, smoothed spectra are considered as bin edges. Although the applied smoothing procedures counteract peak shift differences, they depend on rather arbitrarily chosen parameters and on the creation of a reference spectrum, which implies a loss of information compared with a procedure using all the spectra. Despite their drawbacks, these algorithms outperform standard binning.3,12,13 A similar algorithm, referred to as intelligent bucketing, is also available in the commercial package ACD/Laboratories (www.acdlabs.com). Intelligent bucketing allows smaller or larger bins within a predefined range, and the bin edges are also based on local minima.

(5) Vogels, J. T. W. E.; Tas, A. C.; Venekamp, J.; VanderGreef, J. J. Chemom. 1996, 10, 425–38.
(6) Stoyanova, R.; Nicholls, A. W.; Nicholson, J. K.; Lindon, J. C.; Brown, T. R. J. Magn. Reson. 2004, 170, 329–35.
(7) Witjes, H.; Melssen, W. J.; in ’t Zandt, H. J. A.; van der Graaf, M.; Heerschap, A.; Buydens, L. M. C. J. Magn. Reson. 2000, 144, 35–44.
(8) Forshed, J.; Schuppe-Koistinen, I.; Jacobsson, S. P. Anal. Chim. Acta 2003, 487, 189–99.
(9) Crockford, D. J.; Keun, H. C.; Smith, L. M.; Holmes, E.; Nicholson, J. K. Anal. Chem. 2005, 77, 4556–62.
(10) Zhao, Q.; Stoyanova, R.; Du, S.; Sajda, P.; Brown, T. R. Bioinformatics 2006, 22, 2562–64.
(11) Weljie, A. M.; Newton, J.; Mercier, P.; Carlson, E.; Slupsky, C. M. Anal. Chem. 2006, 78, 4430–42.
(12) Davis, R. A.; Charlton, A. J.; Godward, J.; Jones, S. A.; Harrison, M.; Wilson, J. C. Chemom. Intell. Lab. Syst. 2007, 85, 144–54.
(13) Slupsky, C. M.; Rankin, K. N.; Wagner, J.; Fu, H.; Chang, D.; Weljie, A. M.; Saude, E. J.; Lix, B.; Adamko, D. J.; Shah, S.; Greiner, R.; Sykes, B. D.; Marrie, T. J. Anal. Chem. 2007, 79, 6995–7004.
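The standard preprocessing discussed in the introduction, equidistant binning followed by pareto or unit-variance scaling, can be illustrated with a short Python sketch. It is not the authors' code: the function names `equidistant_bins` and `scale_columns` are hypothetical, integration is approximated by summing uniformly spaced points, and the sample standard deviation is assumed.

```python
import math
from statistics import mean, stdev


def equidistant_bins(ppm, intensities, width=0.04):
    """Sum intensities into fixed-width bins (0.04 ppm by default)."""
    start = min(ppm)
    n_bins = int(math.ceil((max(ppm) - start) / width)) or 1
    binned = [0.0] * n_bins
    for p, value in zip(ppm, intensities):
        # Points on the upper end of the axis fall into the last bin.
        index = min(int((p - start) / width), n_bins - 1)
        binned[index] += value
    return binned


def scale_columns(table, mode):
    """Mean-center each column (bin), then scale it.

    mode="pareto" divides by the square root of the standard deviation,
    mode="uv" (unit variance) divides by the standard deviation itself.
    ASSUMPTION: the sample standard deviation is used.
    """
    scaled_columns = []
    for column in zip(*table):
        m, sd = mean(column), stdev(column)
        divisor = {"pareto": math.sqrt(sd), "uv": sd}[mode]
        scaled_columns.append(
            [(v - m) / divisor if divisor else 0.0 for v in column]
        )
    return [list(row) for row in zip(*scaled_columns)]


# Eight points spaced 0.01 ppm apart collapse into two 0.04 ppm bins.
ppm_axis = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07]
binned = equidistant_bins(ppm_axis, [1.0] * 8)

# Three spectra (rows) with a small and a large bin (columns).
table = [[1.0, 100.0], [3.0, 300.0], [5.0, 500.0]]
uv = scale_columns(table, "uv")          # both columns end up with equal spread
pareto = scale_columns(table, "pareto")  # the large bin keeps a larger spread
```

After unit-variance scaling both columns carry the same weight, whereas pareto scaling only narrows the gap between them, which is the trade-off the text describes.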
In this work we try to overcome the mentioned shortcomings by proposing an Adaptive Intelligent Binning algorithm, AI-Binning. It is adaptive since it uses variable bin sizes, and intelligent in the sense that it determines automatically when to stop further binning and does not require user-defined arbitrary parameters or reference spectra. We demonstrate the improved performance of AI-Binning over standard binning in a hypertension study performed on a subset of the subjects in the Asklepios study, a longitudinal population study of which the first round was completed in 2004. It focuses in particular on aging, cardiovascular hemodynamics, and inflammation and their interplay in cardiovascular disease.14

Although hypertension is a frequently occurring, major cardiovascular risk factor, it is in most cases labeled as “essential hypertension”, indicating that the cause is unknown.15,16 In previous metabolic research, an altered lipoprotein particle composition was proposed to be associated with hypertension.17 However, an important shortcoming of the analysis was the lack of an appropriate study design. Here, the use of the improved binning algorithm was combined with clear phenotypes in a larger population and a matched subject study design in an attempt to validate and particularly refine these early findings. While our primary aim was the validation of the algorithm, the results may therefore also provide a deeper understanding of the biochemistry underlying or associated with hypertension.

EXPERIMENTAL SECTION

Study Subjects. The metabolic data set comprises a selected subset of 80 subjects from the Asklepios study population. Blood pressure was recorded using bilateral triplicate measurements on rested subjects using the Omron HEM-907 device.
A complete overview of the Asklepios study methods and subjects has previously been described.14 Of the 80 subjects, 40 exhibited hypertension (defined as diastolic blood pressure ≥90 mmHg and systolic blood pressure ≥140 mmHg), while the rest were normotensive (defined as diastolic blood pressure [...]

[...] 1), Vb is simply defined as the average of the individual bin values of all S individual spectra.

Figure 1. Fragment of a single spectrum j during the iterative binning process: The bin will be further divided if the current bin value (Vb), calculated out of maximal intensity (maxj) and bin edge intensities (Ij,1 and Ij,end) (A) is surpassed by the maximal sum of new bin values (Vb1,max + Vb2,max) (B) and if each of the new bin values surpasses the maximal noise bin value.

Bin Evaluation Criterion (BEC) (Figure 1B). According to the AI-Binning algorithm, within each bin b with current bin value Vb, all possible new bin edge candidates are evaluated. During this evaluation, each candidate virtually divides the bin into two new bins. The respective bin values of these two new virtual bins, Vb1 and Vb2, and the corresponding summed bin value, Vb,sum = Vb1 + Vb2, are calculated for each candidate. The maximal summed bin value over all candidates, termed Vb,max, with corresponding new bin values Vb1,max and Vb2,max, is then selected. The original bin b with bin value Vb will be split into the corresponding new bins if and only if (eq 2)

$$V_{b,\max} > V_b \;\wedge\; V_{b1,\max} > V_{\mathrm{noise}} \;\wedge\; V_{b2,\max} > V_{\mathrm{noise}} \qquad (2)$$
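The split test in eq 2, applied recursively with the left bin first, can be sketched as follows. This is a minimal illustration under stated assumptions, not the published implementation. The text defining the bin value (eq 1) is not available in this section, so the form used in `bin_value` is an assumption based on the Figure 1 caption (maximal intensity and the two bin edge intensities) with R as an exponent. Sharing the candidate edge point between both new bins is also an assumption, and all function names are hypothetical.

```python
import math


def bin_value(spectra, lo, hi, R):
    """Bin value for the bin spanning indices lo..hi (inclusive).

    ASSUMPTION (eq 1 is not shown here): per spectrum,
    ((max - I_first) * (max - I_last)) ** R, averaged over all spectra.
    """
    total = 0.0
    for s in spectra:
        peak = max(s[lo:hi + 1])
        total += ((peak - s[lo]) * (peak - s[hi])) ** R
    return total / len(spectra)


def best_split(spectra, lo, hi, R):
    """Return (edge, V_b1, V_b2) with maximal V_b1 + V_b2, or None."""
    best = None
    for k in range(lo + 1, hi):
        v1 = bin_value(spectra, lo, k, R)
        v2 = bin_value(spectra, k, hi, R)  # ASSUMPTION: point k is shared
        if best is None or v1 + v2 > best[1] + best[2]:
            best = (k, v1, v2)
    return best


def ai_bin(spectra, lo, hi, R, v_noise):
    """Split lo..hi recursively while eq 2 holds; return the bin edges."""
    best = best_split(spectra, lo, hi, R)
    if best is None:
        return [lo, hi]
    k, v1, v2 = best
    # The three conditions of eq 2.
    if v1 + v2 > bin_value(spectra, lo, hi, R) and v1 > v_noise and v2 > v_noise:
        left = ai_bin(spectra, lo, k, R, v_noise)   # left bin is refined first
        right = ai_bin(spectra, k, hi, R, v_noise)
        return left[:-1] + right                    # drop the duplicated edge k
    return [lo, hi]


def two_peaks(c1, c2, n=100):
    """Synthetic spectrum with two Gaussian peaks (illustration only)."""
    return [math.exp(-(i - c1) ** 2 / 18.0) + math.exp(-(i - c2) ** 2 / 18.0)
            for i in range(n)]


# Two spectra whose peaks are shifted by one data point relative to each other.
spectra = [two_peaks(30, 70), two_peaks(31, 71)]
edges = ai_bin(spectra, 0, 99, R=0.5, v_noise=1e-3)
print(edges)
```

In this toy case the only accepted edge lies in the valley between the two peaks. Any candidate edge on a peak flank leaves one of the new bins with its maximum on the edge itself, so that bin value is zero and the noise conditions of eq 2 reject the split.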
The set of conditions in eq 2 is referred to hereafter as the bin evaluation criterion (BEC). Vnoise is the maximal bin value Vb retrieved in the noise region and is discussed in the next subsection. If the BEC is fulfilled, the bin is split into those two new bins b1 and b2 for which the summed bin value Vb,sum was maximal. For the first newly constructed bin (the left one with respect to the frequency axis), the search for an optimal new bin edge is repeated, Vb,max is determined, and the new bin is again divided if the BEC holds. This procedure is recursively performed for each first new bin and automatically stops for that bin when no further new edges for which the BEC holds can be introduced. Then, the right neighbor of this bin is considered in an attempt to create new bins, within this bin, under the conditions imposed by the BEC. In this fashion, moving along each bin, all bins are considered until no more bin edges can be introduced in the last bin of the spectra, which terminates the binning algorithm.

Noise Characterization. Random noise fluctuations generate small, conserved minima in all parts of the spectra. Without the latter two conditions of the BEC, this would result in the unnecessary separation of otherwise well defined bins. To avoid this, a minimal bin value Vnoise is introduced and used to define the last two inequalities in the BEC (eq 2). A rational choice for this minimal bin value is the maximal noise bin value. This can be determined by setting Vnoise equal to zero in the BEC (eq 2) and performing the AI-Binning algorithm on user-identified noise regions within the spectra. AI-Binning is subsequently performed on the whole spectra with Vnoise defined as the maximal bin value encountered during the different binning steps of the noise regions. The user-based identification of one or more relatively large noise parts in the spectra, e.g., at the edges of the spectra, is therefore particularly important for a correct noise characterization.

Resolution Parameter R. During the evaluation of new candidate bin edges, the extent to which Vb,max is able to exceed Vb is largely affected by the parameter R (eq 1). This parameter is defined as strictly positive and determines the extent to which two peaks should be separated before the algorithm allocates them into two separate bins. R is therefore called the resolution parameter. Larger R values require a more explicit separation (resolution) between two peaks to allow binning, restricting the final number of bins primarily via the first term of the BEC. On the other hand, for smaller values of R (close to 0), Vnoise increases in proportion to the bin values of non-noise regions, thus limiting the final number of bins by the last two inequalities of the BEC. This implies that for a certain value of R, the resulting number of bins will be maximal, corresponding to the maximal amount of non-noise information. Therefore, an objective criterion to select the optimal R value is to choose the R value that yields the maximal number of bins. The choice of the optimal value for the resolution parameter can be considered as a trade-off between the resolution and the signal-to-noise ratio of the spectra. This allows the algorithm to be optimized with respect to the properties of the data. R should therefore not be considered as an arbitrary parameter, but as a definable, system-dependent parameter.

Noise Bin Identification. The AI-Binning algorithm is scale-independent and allows peaks to shift to a certain extent. Although such a behavior is optimal for information-rich regions, it complicates the analysis of peaks flanked by large noise regions. There, the combination of accidentally coinciding minima with scattered noise maxima might result in bin values surpassing Vnoise while actually not containing peaks.
Conserved peaks result in a high mean intensity when averaged over all spectra, while this is not the case in noise regions. Therefore, the maximal averaged intensity is determined for each bin and for each noise region. Bins for which this intensity does not exceed the maximal averaged noise region intensity are also considered noise bins and discarded from further analysis. However, considering that the number of noise bins will be small due to the limited amount of large noise regions, the effect of this procedure on the subsequent data analysis will be trivial in the majority of the cases.

Table 1. Influence of the Value of the Resolution Parameter R on the Total Number of Non-Noise Bins

R      minimal bin value^a   total number of bins   number of non-noise bins   computation time (min)
0.1    0.0572                259                    257                        32
0.15   0.0137                269                    267                        35
0.2    0.0033                277                    275                        31
0.25   7.88 × 10^-4          277                    275                        36
0.3    1.89 × 10^-4          277                    275                        38
0.35   4.55 × 10^-5          272                    270                        34
0.4    1.09 × 10^-5          267                    265                        31
0.45   2.63 × 10^-6          260                    258                        28
0.5    6.33 × 10^-7          249                    247                        28

^a Minimal bin value of individual peaks to be separated; corresponds with maximal bin value in noise region.

RESULTS AND DISCUSSION

Binning Characteristics. Raw spectra were imported in Matlab, the intensities in the frequency range dominated by the water peak (5.12-4.48 ppm) were set to zero, and the spectral fragments from 12 to -3 ppm, corresponding to 65514 data points, were considered for further analysis. Integral normalization was applied prior to standard or AI-Binning. The regions from 12 to 10 ppm and -1 to -3 ppm were defined as signal-free regions for noise characterization in the AI-Binning algorithm. For standard binning, intensities were integrated over equidistant bins of 0.04 ppm within the range of 10 to -1 ppm, resulting in 259 bins, after exclusion of the water peak bins. For the full resolution spectra, all intensities from 10 to -1 ppm were taken into account, except for the water peak, corresponding with 45248 data points per spectrum. For AI-Binning, values of 0.1-0.5 in 0.05 intervals were evaluated for the resolution parameter R (Table 1). For each value, two bins were identified as noise bins and excluded from further analysis. These bins were the leftmost and rightmost bins also containing the user-defined noise regions. An increase of the resolution parameter causes a steady increase of the number of resulting bins up to a maximum level, followed by a decrease. Although more fine-tuning of the resolution parameter might result in a marginally higher number of bins, the impact of the supplemental bins should be considered insignificant since these are borderline between noise and relevant information. R values of 0.2, 0.25, and 0.3 resulted in the same, maximal number of 275 non-noise bins. The degree of data reduction is therefore very similar to standard binning with 0.04 ppm bins (total of 259 bins).
The results for these three values were very comparable: bin edge locations were identical or shifted by at most one data point (which corresponds to 2.3 × 10^-4 ppm) for more than 95% of the cases. For only eight bins, corresponding with borderline situations, the results for the three R values (0.2, 0.25, 0.3) were different with respect to the decision whether the bins should be split or not. The results for the highest resolution (R equal to 0.3) were used for further analysis. Details of the binned spectra are depicted in Figure 2. The minimal and median bin widths were 0.0032 and 0.012 ppm, respectively. The maximal width, noise bins not included, corresponded to 1.367 ppm.
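The selection rule used here (take the R value with the maximal number of non-noise bins) can be written down directly from the Table 1 figures. The sketch below is illustrative only. The function name `select_resolution` is hypothetical, and resolving ties toward the largest R is an assumption that mirrors the choice of R = 0.3 reported above.

```python
# Number of non-noise bins per resolution parameter R, copied from Table 1.
NON_NOISE_BINS = {
    0.10: 257, 0.15: 267, 0.20: 275, 0.25: 275, 0.30: 275,
    0.35: 270, 0.40: 265, 0.45: 258, 0.50: 247,
}


def select_resolution(non_noise_bins):
    """Return (tied R values, chosen R) for the maximal number of bins.

    ASSUMPTION: ties are resolved toward the largest R, as done in the
    text for R = 0.2, 0.25, and 0.3.
    """
    best = max(non_noise_bins.values())
    tied = sorted(r for r, n in non_noise_bins.items() if n == best)
    return tied, tied[-1]


tied_values, chosen_r = select_resolution(NON_NOISE_BINS)
print(tied_values, chosen_r)  # [0.2, 0.25, 0.3] 0.3
```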
Efficiency of AI-Binning Algorithm. For a fixed resolution parameter R and for 80 spectra with 65514 real data points (12 to -3 ppm) each, the AI-Binning algorithm requires approximately half an hour of computation time to be completed on a PC with an Intel Pentium Dual Core 1.66 GHz processor and 1 GB of RAM (Table 1). The iterative character of the algorithm implies that the computation time will particularly be determined by the length of the spectra. Since the number of real data points rarely exceeds 65514 in standard 1H NMR metabolomics experiments, the algorithm is computationally feasible for (nearly) all applications. Furthermore, once the optimal resolution parameter value R has been determined, this value can be used for other sets of experiments conducted under identical conditions.

AI-Binning Increases Classification Accuracy. Full resolution spectra and binned spectra were mean-centered. In the first instance, the data were pareto scaled by dividing each variable by the square root of its standard deviation. After the removal of structured noise with O-PLS, PLS-DA was used to investigate whether AI-Binning resulted in a better classification compared with standard binning or the use of full resolution spectra (i.e., no binning at all). In each of the 80 rounds of the cross-validation procedure, a training model was built on 79 spectra and used to predict the status of the left-out spectrum. The test set prediction accuracy was determined as the fraction of rounds in which the status of the left-out spectrum was correctly predicted. This was repeated for 1-15 PLS components. Figure 3 shows that for each set of spectra, the averaged prediction accuracy reaches a maximum followed by a decrease due to overfitting.
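The leave-one-out procedure described above can be sketched generically. To keep the example self-contained, a nearest-centroid classifier is swapped in for PLS-DA. This is a stand-in chosen for brevity, not the method used in the study, and the function names and toy data are hypothetical.

```python
def fit_centroids(rows, labels):
    """Stand-in classifier (NOT PLS-DA): one mean vector per class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}


def predict_centroid(model, row):
    """Assign the class whose centroid is closest in squared distance."""
    return min(
        model,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(row, model[label])),
    )


def leave_one_out_accuracy(rows, labels, fit, predict):
    """Train on all samples but one, predict the left-out one, repeat."""
    correct = 0
    for i in range(len(rows)):
        model = fit(rows[:i] + rows[i + 1:], labels[:i] + labels[i + 1:])
        if predict(model, rows[i]) == labels[i]:
            correct += 1
    return correct / len(rows)


# Toy data: two well separated groups of three "binned spectra" each.
rows = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
        [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels = ["normotensive"] * 3 + ["hypertensive"] * 3
accuracy = leave_one_out_accuracy(rows, labels, fit_centroids, predict_centroid)
```

Passing `fit` and `predict` as arguments keeps the cross-validation loop independent of the model, so the same loop could wrap a PLS-DA implementation for each number of components.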
It is clear that AI-Binning performs better than both standard binning and the use of the full resolution spectra, expressed in both the maximal percentage of correct classifications and the smaller number of PLS components required to obtain this maximum. AI-Binning could correctly discriminate hypertension from normotension in more than 90% of the subjects, in contrast to less than 85% for the other methods. The results for standard binning and the use of full resolution spectra are similar. For each type of data preprocessing, the training set prediction accuracy reached 100% as an increasing number of PLS components were included. The classification accuracy was also determined for equidistant binned spectra, with bin widths set to either minimal, median, or maximal bin width of the AI-Binned spectra. The results were worse than those for full resolution or 0.04 ppm equidistant binned spectra, suggesting that 0.04 ppm bins are a good compromise if equidistant binning is to be preferred in combination with pareto scaling.

The most plausible reason for the poor prediction results seen for the full resolution spectra is the incorporation of noise variables in the models. While all information is present in these full resolution spectra, many variables are meaningless, as they only represent noise. Due to random processes, several of these noise variables will provide good predictive quality for the training set and will be incorporated in the PLS models, resulting in bad predictive quality for independent spectra.

Figure 2. Detail of the set of 80 1H NMR spectra from 2.5 to 1.8 ppm, shown with bin edges (dashed lines) obtained via AI-Binning procedure.

Figure 3. Prediction accuracy (PLS-DA) for an increasing number of PLS components, applied on full resolution (+), equidistant binned (x), and AI-Binned spectra (O) after pareto scaling and O-PLS denoising.

Figure 4. Prediction accuracy (PLS-DA) for an increasing number of PLS components, applied on full resolution (+), equidistant binned (x), and AI-Binned spectra (O) after unit-variance scaling and O-PLS denoising.

Unit-Variance Scaling with AI-Binned Spectra. In most NMR metabolomics experiments, the data is transformed using pareto scaling. This attenuates the weights of peaks with larger variances (generally larger peaks) without raising the influence of noise variables to the same level, whereas unit-variance scaling gives each variable equal weight. This is particularly important for equidistant binned spectra, since these contain several noise bins, complicating the construction of correct prediction models. Although pareto scaling provides an efficient tool for preprocessing equidistant binned spectra, relatively small but important peaks may remain undetected. In AI-Binned spectra, each bin generally coincides with a peak, eliminating to a large extent noise bins and their negative influence on model building. Therefore, it was investigated whether AI-Binning could yield an additional advantage of being able to deal with unit-variance scaling. Figure 4 depicts the PLS-DA prediction accuracy of unit-variance scaled spectra. Again, only one O-PLS component
was removed for each set of spectra. It is clear that the unit-variance scaling does not deteriorate AI-Binning based results, while the prediction accuracy for equidistant binned spectra and the full resolution spectra decreased, most probably due to the larger influence of the noise variables (Figures 3 and 4). This was further illustrated by the prediction accuracies for equidistant binned spectra with bin widths corresponding with minimal, median, and maximal AI-Binned bin sizes. While the prediction accuracies decreased for median and minimal bin size, it increased for the maximal bin size to approximately 80% (data not shown). The very large bin size (1.367 ppm) results in only few variables (8-9), which might be considered noise-free, since the bins all contain at least some peaks. This also indicates that the global spectral shape, reflected in these variables, contains information about the hypertensive status.

Table 2. Most Predictive Variables PLS-Regression Model Hypertension

no.  PLS^a     bin (ppm)        presumed assignment                            assignment method
1    -0.1229   2.065–2.029      α-1 acid glycoprotein (N-acetyl group)         literature
2    -0.1186   2.827–2.815      unknown (most probably same entity as 4)
3    0.118     3.694–3.677      choline^b                                      literature, 2D NMR
4    -0.1166   2.839–2.827      unknown (most probably same entity as 2)
5    -0.1154   5.856–5.611      urea                                           HMDB, 2D NMR
6    0.1053    3.223–3.170      choline^a                                      literature, 2D NMR
7    0.0975    2.029–1.923      α-1 acid glycoprotein (N-acetyl group)         literature
8    -0.0947   -0.111 to 0.551  unknown
9    0.0927    1.207–1.198      unknown, perhaps fucose                        (literature)
10   -0.0906   5.212–5.192      CH-CH2 fragment of unknown origin              2D NMR
11   0.0886    3.677–3.670      choline^b                                      literature, 2D NMR
12   0.0870    7.574–7.417      phenylalanine (uncertain)                      literature
13   -0.0847   4.273–4.253      CH-CH2 fragment of unknown origin              2D NMR
14   0.0847    7.342–7.327      phenylalanine (uncertain)                      literature
15   -0.0808   0.260–0.190      unknown, perhaps albumin background            (literature)
16   0.0800    4.336–4.273      choline^c                                      literature, 2D NMR

^a PLS coefficient: positive coefficients indicate that higher intensities of the corresponding spectral regions contribute to hypertension and lower intensities contribute to normotension, vice versa for negative coefficients. Literature refers to reported chemical shifts and/or spectra;22 2D NMR refers to experimental verification (cf. Experimental Section). In choline^x, x indicates the resonance of protons H^x in choline: (CH^a_3)_3-N+-CH^b_2-CH^c_2-OH.

(22) Nicholson, J. K.; Foxall, P. J.; Spraul, M.; Farrant, R. D.; Lindon, J. C. Anal. Chem. 1995, 67, 793–811.

Identification of Hypertension Biomarkers. Except for the purpose of comparing different preprocessing methods, the NMR-based discrimination between hypertensive and normotensive subjects is trivial as such. However, our classification results in this study showed that unknown biological information might be retrieved from the spectra, at least partially explaining hypertension. Therefore, our next goal was to identify relevant metabolites, based on the bins generated by the better performing AI-Binning preprocessing method. The identification of the corresponding metabolites was facilitated by the fact that these bins generally coincide with single peaks. The design of the study, the matched subjects, and the clear difference between both phenotypes under study ensure that any retrieved metabolite might potentially be a new, relevant marker further elucidating the pathology of hypertension. Although the matched study design resulted in complex PLS models, the important variables can be identified as those having large regression coefficients. The model for 10 PLS components derived from the O-PLS denoised, unit-variance scaled, spectral data set was chosen for further analysis. Table 2 summarizes the most important variables (largest coefficients, absolute value, cutoff = 0.8). The Human Metabolome Database (HMDB), reported chemical shifts in literature, and 2D NMR spectra (data not shown) were used for the identification of the corresponding biological entities.22,23 The regression coefficients for all variables are provided as Supporting Information (Table S-2). Although several peaks could not be annotated, the results clearly support the added value of AI-Binning. Most identified
peaks in Table 2 could be traced back to the same molecular structures. The most interesting results consist of α-1 acid glycoproteins (N-acetyl group) (AAG peaks, Figure 5) and choline or choline-containing molecules. Also, urea and a metabolite contributing a CH2-CH group were found to be involved in hypertension (Table 2). The validity of the choline, urea, and CH2-CH annotations was verified by 2D NMR spectroscopy (data not shown). The CH2-CH group could not be linked through 2D NMR with any other resonances in the spectra, nor did the 1H and 13C chemical shifts result in any feasible hit in the HMDB. The different α-1 acid glycoprotein peaks clearly corresponded with previously reported chemical shifts and spectra.22 The possibility of artifacts generated by the background is considered unlikely, since for several metabolites, e.g., choline, the identified peaks were not adjacent. Furthermore, for the adjacent peaks in the case of α-1 acid glycoprotein, regression coefficients with opposite signs were identified. Such behavior could be explained by the fact that the composition of α-1 acid glycoproteins may be more important than absolute concentrations. The α-1 acid glycoprotein contains several acetylated amino sugar moieties, linked via N-acetyl groups.22 Our results therefore indicate that differential α-1 acid acetylation might be involved in hypertension. This glycoprotein is an acute phase protein, closely associated with inflammation.24 Since inflammation is also important in hypertension,25 a possible involvement is not unlikely. Considering that subjects were matched, choline biochemistry appears to be associated with hypertension even beyond cholesterol and associated standard lipoprotein levels. These results therefore partially support earlier NMR-derived findings on hypertension.17 In these, the influence of lipoprotein particle composition difference was demonstrated, whereas, post hoc, no significant differences in classically determined triglycerides, HDL and LDL cholesterol levels were found. One of the choline peaks (around 3.2 ppm) was also reported as important.17

Bins important in PLS hypertension classification were also retrieved for standard binned data. Although an absolute comparison of the metabolites identified by both binning methods would require an external validation step by traditional analytical methods, which is beyond the scope of this manuscript, these results might illustrate how AI-Binning facilitates the identification of important metabolites. For this purpose, the PLS coefficients of the best discriminative model for equidistant binned data, nine PLS components with pareto scaled data (Figure 3), were analyzed. Due to the pareto scaling, each PLS regression coefficient partially depends on the remaining standard deviation within the corresponding bin. Each coefficient was therefore rescaled through multiplication by the corresponding, remaining standard deviation of the pareto scaled data. The identified regions deemed important included, among others, parts of all but one choline peak (data not shown) and a part of one of the α-1 acid glycoprotein peaks. Figure 5 illustrates how AI-Binning facilitates the detection of the α-1 acid glycoprotein peaks compared with standard binning, where an accurate analysis is compromised by the peak mixtures in several bins. Only a single α-1 acid glycoprotein peak containing bin was not contaminated by other peaks, and this was the only depicted bin identified as relevant for standard binned spectra (top 10 most important bins, other bins not in top 30). Although a complete characterization of the relevant peaks in the spectra will be required to fully elucidate the importance of the findings reported here, the advantages of AI-Binning can already be clearly illustrated.

Figure 5. AI-Binning facilitates the identification of relevant metabolites compared to standard, equidistant binning. AI-Binning clearly isolates the two large α-1 acid glycoprotein peaks (AAG1 and AAG2) in separate bins (solid lines), in large contrast to the bins obtained after standard, equidistant binning (dashed lines), where only equidistant bin 3 is not a mixture of different peaks. Only this bin was identified as relevant in hypertension for equidistant binned spectra, while both α-1 acid glycoprotein peaks were retrieved for AI-Binned spectra. For clarity, the averaged spectrum is depicted.

(23) Wishart, D. S.; Tzur, D.; Knox, C.; Eisner, R.; Guo, A. C.; Young, N.; Cheng, D.; Jewell, K.; Arndt, D.; Sawhney, S.; Fung, C.; Nikolai, L.; Lewis, M.; Coutouly, M. A.; Forsythe, I.; Tang, P.; Shrivastava, S.; Jeroncic, K.; Stothard, P.; Amegbey, G.; Block, D.; Hau, D. D.; Wagner, J.; Miniaci, J.; Clements, M.; Gebremedhin, M.; Guo, N.; Zhang, Y.; Duggan, G. E.; Macinnis, G. D.; Weljie, A. M.; Dowlatabadi, R.; Bamforth, F.; Clive, D.; Greiner, R.; Li, L.; Marrie, T.; Sykes, B. D.; Vogel, H. J.; Querengesser, L. Nucleic Acids Res. 2007, 35, D521–D526.
(24) Fournier, T.; Medjoubi, N.; Porquet, D. Biochim. Biophys. Acta 2000, 1482, 157–71.
(25) Boos, C. J.; Lip, G. Y. H. Curr. Pharm. Des. 2006, 12, 1623–35.
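The rescaling of PLS coefficients obtained on pareto-scaled data, as described in the text, can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that pareto scaling means mean-centering each bin and dividing by the square root of its standard deviation, that the sample standard deviation (ddof=1) is used, and that no bin is constant. The function names, the toy intensities, and the coefficient values are hypothetical.

```python
import numpy as np


def pareto_scale(X):
    """Mean-center each bin (column) and divide by the square root of its SD."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Assumes every bin has non-zero variance; a constant bin would divide by zero.
    return centered / np.sqrt(X.std(axis=0, ddof=1))


def rescale_coefficients(coefs, X_scaled):
    """Multiply each coefficient by the SD remaining in its scaled bin."""
    remaining_sd = np.asarray(X_scaled, dtype=float).std(axis=0, ddof=1)
    return np.asarray(coefs, dtype=float) * remaining_sd


# Toy data (hypothetical): 4 spectra x 2 bins, the first bin varying more.
X = np.array([[0.0, 0.0],
              [4.0, 1.0],
              [8.0, 2.0],
              [12.0, 3.0]])
X_pareto = pareto_scale(X)

# Hypothetical PLS regression coefficients fitted on the pareto-scaled bins.
coefs = np.array([0.5, 0.5])
rescaled = rescale_coefficients(coefs, X_pareto)
```

Under these assumptions, the standard deviation remaining in a pareto-scaled bin equals the square root of the original standard deviation. Multiplying a coefficient by it therefore gives the coefficient that would apply to the same bin on a unit-variance scale, which makes coefficients comparable across bins of different intensity.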
Analytical Chemistry, Vol. 80, No. 10, May 15, 2008
CONCLUSION
We have demonstrated in this study that AI-Binning realizes an adequate data reduction comparable with that of standard binning, while outperforming both standard binning and the use of full resolution spectra in the subsequent data analysis. Only for the separation of very low-intensity peaks might the chosen bin edges be somewhat inconclusive. However, even in this case, AI-Binning performs better than standard, equidistant binning, where very small peaks are allocated in proportionally very large bins. No arbitrary parameters, reference spectra, a priori knowledge, or data modifications are required. Furthermore, using an explicit resolution parameter makes the algorithm flexible and easily adaptable to the data in an objective fashion. The results show that AI-Binning can be combined with pareto and particularly unit-variance scaling, which facilitates the discovery of relevant metabolites with relatively low concentrations/peak intensities. Moreover, the consistency of the identified peaks, particularly choline and α-1 acid glycoprotein proton resonances, underscores the reliability of the generated output and might provide further understanding of hypertension. On the whole, AI-Binning proves to be a very powerful tool for high-throughput preprocessing of metabolomics NMR spectra. Possible future extensions of the AI-Binning concept might be toward 2D NMR spectra or toward spectra generated by other spectroscopic techniques.

ACKNOWLEDGMENT
The authors are very grateful to the Asklepios Study investigator group,14 on whose behalf they present the results. The Asklepios Study is supported by a grant from the Research Foundation Flanders (FWO G.0427.03). T.D.M. is funded by a Ph.D. grant from the Special Research Fund of Ghent University (Grant 011D10004). The FWO is gratefully acknowledged for a Ph.D. fellowship to D.S. The 700 MHz equipment of the Interuniversitary NMR Facility was jointly financed by Ghent University, the Free University of Brussels, and the University of Antwerp via the FFEU-ZWAP incentive of the Flemish Government. The FWO is also acknowledged for support via a research grant to J.C.M. (FWO G.0064.07). S.B. is currently funded by the Industrial Research Fund from Ghent University as a postdoctoral Technology Developer.

SUPPORTING INFORMATION AVAILABLE
Additional information as noted in text. This material is available free of charge via the Internet at http://pubs.acs.org.
Received for review December 21, 2007. Accepted March 5, 2008. AC7025964