Anal. Chem. 2008, 80, 7354–7362
Optimization of Human Plasma 1H NMR Spectroscopic Data Processing for High-Throughput Metabolic Phenotyping Studies and Detection of Insulin Resistance Related to Type 2 Diabetes

Anthony D. Maher,*,† Derek Crockford,† Henrik Toft,‡ Daniel Malmodin,‡ Johan H. Faber,‡ Mark I. McCarthy,§,| Amy Barrett,§ Maxine Allen,§ Mark Walker,⊥ Elaine Holmes,† John C. Lindon,† and Jeremy K. Nicholson*,†

Department of Biomolecular Medicine, Division of Surgery, Oncology, Reproductive Biology and Anaesthetics (SORA), Faculty of Medicine, Imperial College London, South Kensington SW7 2AZ, United Kingdom, Novo Nordisk A/S, Novo Nordisk Park, DK-2760 Måløv, Denmark, Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Churchill Hospital, Old Road, Headington, Oxford, OX3 7LJ, United Kingdom, Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford, OX3 7BN, United Kingdom, and SCMS (Diabetes Research Group), University of Newcastle upon Tyne, NE2 4HH, United Kingdom

Optimizing NMR experimental parameters for high-throughput metabolic phenotyping requires careful examination of the total biochemical information obtainable from 1H NMR data, which includes concentration and molecular dynamics information. Here we have applied two different types of mathematical transformation (calculation of the first derivative of the NMR spectrum and Gaussian shaping of the free induction decay) to attenuate broad spectral features from macromolecules and enhance the signals of small molecules. By application of chemometric methods such as principal component analysis (PCA), orthogonal projections to latent structures discriminant analysis (O-PLS-DA) and statistical spectroscopic tools such as statistical total correlation spectroscopy (STOCSY), we show that these methods successfully identify the same potential biomarkers as spin-echo 1H NMR spectra in which broad lines are suppressed via T2 relaxation editing.
Finally, we applied these methods for identification of the metabolic phenotype of patients with type 2 diabetes. This “virtual” relaxation-edited spectroscopy (RESY) approach can be particularly useful for high-throughput screening of complex mixtures such as human plasma and may be useful for extraction of latent biochemical information from legacy or archived NMR data sets for which only standard 1D data sets exist.

* To whom correspondence should be addressed. E-mail: j.nicholson@imperial.ac.uk (J.K.N.); [email protected] (A.D.M.).
† Imperial College London. ‡ Novo Nordisk A/S. § University of Oxford. | Wellcome Trust Centre for Human Genetics. ⊥ University of Newcastle upon Tyne.
(1) Nicholson, J. K.; Wilson, I. D. Prog. Nucl. Magn. Reson. Spec. 1989, 21, 449–501.
(2) Bollard, M. E.; Stanley, E. G.; Lindon, J. C.; Nicholson, J. K.; Holmes, E. NMR Biomed. 2005, 18, 143–162.

NMR spectroscopy has been widely and successfully applied to characterize disordered metabolic states in animals and man.1,2 Metabonomics involves the measurement and statistical mapping
Analytical Chemistry, Vol. 80, No. 19, October 1, 2008
of the response of an organism to a pathological or physiological stimulus in terms of the changes induced in metabolite concentrations in biofluids such as urine or plasma.3,4 This and related approaches are now being increasingly applied in fields as diverse as drug metabolism and toxicology,5,6 epidemiology,7 nutrition8,9 and metabolic disorders such as type 1 diabetes10 and insulin resistance,11 and most recently in “metabolome-wide association studies”.12 Powerful new chemometric and statistical approaches have increased the size, complexity, and instrumental origin (e.g., NMR, mass spectrometry, etc.) of analytical data that may be used and improved biomarker detection by increasing the efficacy of information recovery from heavily overlapped biological NMR spectra.13-16 (3) Lindon, J. C.; Holmes, E.; Nicholson, J. K. Expert Rev. Mol. Diagn. 2004, 4, 189–199. (4) Nicholson, J. K. Mol. Syst. Biol. 2006, 2, 52. (5) Holmes, E.; Cloarec, O.; Nicholson, J. K. J. Proteome Res. 2006, 5, 1313– 1320. (6) Clayton, T. A.; Lindon, J. C.; Cloarec, O.; Antti, H.; Charuel, C.; Hanton, G.; Provost, J. P.; Le Net, J. L.; Baker, D.; Walley, R. J.; Everett, J. R.; Nicholson, J. K. Nature 2006, 440, 1073–1077. (7) Holmes, E.; Leng Loo, R.; Cloarec, O.; Coen, M.; Tang, H.; Maibaum, E.; Bruce, S. J.; Chan, Q.; Elliott, P.; Stamler, J.; Wilson, I. D.; Lindon, J. C.; Nicholson, J. K. Anal. Chem. 2007, 79, 2629–2640. (8) Stella, C.; Beckwith-Hall, B.; Cloarec, O.; Holmes, E.; Lindon, J. C.; Powell, J.; Van der Ouderaa, F.; Bingham, S.; Cross, A. J.; Nicholson, J. K. J. Proteome Res. 2006, 5, 2780–2788. (9) Rezzi, S.; Ramadan, Z.; Martin, F. P.; Fay, L. B.; van Bladeren, P.; Lindon, J. C.; Nicholson, J. K.; Kochhar, S. J. Proteome Res. 2007, 6, 4469–4477. (10) Makinen, V. P.; Soininen, P.; Forsblom, C.; Parkkonen, M.; Ingman, P.; Kaski, K.; Groop, P. H.; Ala-Korpela, M. Mol. Syst. Biol. 2008, 4, 167. (11) Toye, A. A.; Dumas, M. E.; Blancher, C.; Rothwell, A. R.; Fearnside, J. 
F.; Wilder, S. P.; Bihoreau, M. T.; Cloarec, O.; Azzouzi, I.; Young, S.; Barton, R. H.; Holmes, E.; McCarthy, M. I.; Tatoud, R.; Nicholson, J. K.; Scott, J.; Gauguier, D. Diabetologia 2007, 50, 1867–1879. (12) Holmes, E.; Loo, R. L.; Stamler, J.; Bictash, M.; Yap, I. K.; Chan, Q.; Ebbels, T.; De Iorio, M.; Brown, I. J.; Veselkov, K. A.; Daviglus, M. L.; Kesteloot, H.; Ueshima, H.; Zhao, L.; Nicholson, J. K.; Elliott, P. Nature 2008, (13) Cloarec, O.; Dumas, M.-E.; Craig, A.; Barton, R. H.; Trygg, J.; Hudson, J.; Blancher, C.; Gauguier, D.; Lindon, J. C.; Holmes, E.; Nicholson, J. K. Anal. Chem. 2005, 77, 1282–1289. (14) Cloarec, O.; Dumas, M. E.; Trygg, J.; Craig, A.; Barton, R. H.; Lindon, J. C.; Nicholson, J. K.; Holmes, E. Anal. Chem. 2005, 77, 517–526. 10.1021/ac801053g CCC: $40.75 2008 American Chemical Society Published on Web 08/30/2008
1H NMR is a well-established technique for metabonomic and metabolomic studies. It has exceptional reproducibility17,18 and is quantitative to the extent that a given peak area is directly proportional to the concentration of the corresponding metabolite, provided the experiment is conducted under standardized conditions. However, with increasingly complex data sets, the extraction of latent biochemical information directly relevant to the biological question to be addressed presents a continuing challenge. There are two main reasons for this. First, the spectroscopic structure of biofluid 1H NMR data is inherently complex, with molecules giving rise to several peaks in each spectrum, each peak being subject to varying multiplicities and intersample variations in chemical shift, along with extensive overlap of peaks from other molecules. This multiplicity increases both structural information content and spectral complexity.19 Second, there are external sources of variation in these data not related to their principal classification (e.g., diseased vs normal). These fall broadly into environmental sources, including diet and lifestyle, and experimental variations, including choice of NMR experiment and instrument, and sample handling and preparation.20 Following data acquisition, a third influence on the extent to which relevant biological information may be extracted lies in the choice of data handling and statistical (chemometric) analysis. The extent to which each of these sources of variation contributes to the final result depends on the problem to be addressed. For example, toxicology studies involving laboratory animals are generally well controlled with respect to diet and genetic makeup of the animals.
However, studies on human populations cannot be controlled directly with respect to all biological parameters.4 This study concerns the optimization of data processing protocols for investigation of metabolic perturbations in human populations with a high prevalence of insulin resistance and type 2 diabetes. It has been estimated that there are currently 20.6 million type 2 diabetes patients in the USA with 70 000 direct disease-related deaths per year.21 The exact cause of type 2 diabetes is still unknown, but a wide range of environmental and genetic factors22 may contribute to the disease. Type 2 diabetes was one of the first diseases to be studied using 1H NMR spectroscopy of biofluids.23 There have been a number of successful applications of NMR-based metabonomic and related (15) Crockford, D. J.; Holmes, E.; Lindon, J. C.; Plumb, R. S.; Zirah, S.; Bruce, S. J.; Rainville, P.; Stumpf, C. L.; Nicholson, J. K. Anal. Chem. 2006, 78, 363–371. (16) Rantalainen, M.; Cloarec, O.; Beckonert, O.; Wilson, I. D.; Jackson, D.; Tonge, R.; Rowlinson, R.; Rayner, S.; Nickson, J.; Wilkinson, R. W.; Mills, J. D.; Trygg, J.; Nicholson, J. K.; Holmes, E. J. Proteome Res. 2006, 5, 2642–2655. (17) Dumas, M. E.; Maibaum, E. C.; Teague, C.; Ueshima, H.; Zhou, B.; Lindon, J. C.; Nicholson, J. K.; Stamler, J.; Elliott, P.; Chan, Q.; Holmes, E. Anal. Chem. 2006, 78, 2199–2208. (18) Keun, H. C.; Ebbels, T. M.; Antti, H.; Bollard, M. E.; Beckonert, O.; Schlotterbeck, G.; Senn, H.; Niederhauser, U.; Holmes, E.; Lindon, J. C.; Nicholson, J. K. Chem. Res. Toxicol. 2002, 15, 1380–1386. (19) Nicholson, J. K.; Foxall, P. J.; Spraul, M.; Farrant, R. D.; Lindon, J. C. Anal. Chem. 1995, 67, 793–811. (20) Maher, A. D.; Zirah, S. F.; Holmes, E.; Nicholson, J. K. Anal. Chem. 2007, 79, 5204–5211. (21) National diabetes fact sheet: general information and national estimates on diabetes in the United States; U.S. Department of Health and Human Services. 
Centers for Disease Control and Prevention: Atlanta, GA, 2005. (22) Sale, M. M.; Woods, J.; Freedman, B. I. Curr. Hypertens. Rep. 2006, 8, 16–22. (23) Nicholson, J. K.; O’Flynn, M. P.; Sadler, P. J.; Macleod, A. F.; Juul, S. M.; Sonksen, P. H. Biochem. J. 1984, 217, 365–375.
approaches to study insulin resistance and type 2 diabetes in experimental animals24,25 and metabolic abnormalities in man relating to type 2 diabetes23 and cardiovascular disease.26 The current studies are linked to the Molecular Phenotyping to Accelerate Genomic Epidemiology (MolPAGE) consortium, a large-scale EU-funded study that is adopting a systems biology approach4,27 to identify molecular phenotypic markers of disease that can then be used to facilitate the identification of disease and the assessment of disease risk, progression, and response to therapy. For metabonomics this will require development of data acquisition and analysis methods that permit maximum extraction of biological information while minimizing sample throughput times and volumes. The observed 1H NMR spectrum of untreated human plasma is a complex superposition of broad and sharp peaks from glucose and other sugars, amino acids, lipids, lipoproteins, and proteins.19 For metabonomic studies, three types of 1D NMR experiments may be acquired from each sample: (i) a 1D pulse-and-acquire experiment, in which large and small molecules contribute to the spectrum with intensity proportional to their concentration; (ii) a transverse relaxation-edited experiment (also called a Carr-Purcell-Meiboom-Gill, CPMG,28,29 experiment) in which signals from protons with short T2 relaxation times are suppressed via spin-spin relaxation during the echo time, giving a spectrum which is generally dominated by the smaller molecules that are not motionally constrained by protein binding; and (iii) a diffusion-edited experiment, in which only molecules with a larger Stokes’ radius contribute to the spectrum.30 For experimental parameters used in metabolic profiling studies, the total acquisition time to obtain useful information for each of these experiments is approximately 9 min on a standard 600 MHz NMR spectrometer using 5 mm NMR tubes.
Thus for studies on large population cohorts, there is potential for substantial time saving if it can be successfully shown that the information content of one of these experiments could be extracted from another. Here we present the results from a series of standard 1D pulse-and-acquire and experimental and “virtual” relaxation-edited experiments on a set of human blood plasma samples, in which the aim was to optimize high-throughput settings for the extraction of biological information about specific phenotypes related to type 2 diabetes and insulin resistance.

MATERIALS AND METHODS

Materials. D2O (99.9%) was from Goss Scientific Instruments Ltd. (Essex, U.K.). All other chemicals were purchased from Sigma (St. Louis, MO).

(24) Dumas, M. E.; Wilder, S. P.; Bihoreau, M. T.; Barton, R. H.; Fearnside, J. F.; Argoud, K.; D’Amato, L.; Wallis, R. H.; Blancher, C.; Keun, H. C.; Baunsgaard, D.; Scott, J.; Sidelmann, U. G.; Nicholson, J. K.; Gauguier, D. Nat. Genet. 2007, 39, 666–672.
(25) Dumas, M. E.; Barton, R. H.; Toye, A.; Cloarec, O.; Blancher, C.; Rothwell, A.; Fearnside, J.; Tatoud, R.; Blanc, V.; Lindon, J. C.; Mitchell, S. C.; Holmes, E.; McCarthy, M. I.; Scott, J.; Gauguier, D.; Nicholson, J. K. Proc. Natl. Acad. Sci. U.S.A. 2006, 103, 12511–12516.
(26) Brindle, J. T.; Antti, H.; Holmes, E.; Tranter, G.; Nicholson, J. K.; Bethell, H. W.; Clarke, S.; Schofield, P. M.; McKilligin, E.; Mosedale, D. E.; Grainger, D. J. Nat. Med. 2002, 8, 1439–1444.
(27) Nicholson, J. K.; Wilson, I. D. Nat. Rev. Drug Discov. 2003, 2, 668–676.
(28) Carr, H. Y.; Purcell, E. M. Phys. Rev. 1954, 94, 630–638.
(29) Meiboom, S.; Gill, D. Rev. Sci. Instrum. 1958, 29, 688–691.
(30) Beckonert, O.; Keun, H. C.; Ebbels, T. M.; Bundy, J.; Holmes, E.; Lindon, J. C.; Nicholson, J. K. Nat. Protoc. 2007, 2, 2692–2703.
Sample Collection. The samples in this study (n = 100) were collected from the Warren 2 family cohort, as part of the MolPAGE consortium. Venous blood was collected from the cubital fossa into heparinized tubes. Following centrifugation the plasma was collected and stored at -80 °C. Samples were shipped on dry ice and stored at -40 °C until analysis. Parameters such as height and waist-to-hip ratio were also measured for each participant.

Sample Preparation for NMR Analysis. Prior to analysis, thawed samples (n = 100) were collectively centrifuged (16 000g for 5 min). Ten samples were randomly split into technical replicates. Plasma was diluted 1 to 4 in physiological saline in 80:20 H2O/D2O supplemented with 0.1% (w/v) sodium azide (for antibacterial and enzyme-inhibitory purposes) and 1.5 mM sodium formate as a chemical shift reference (δ8.452). Samples were prepared in 1 mL 96-well plates (Elkay Laboratory Products Ltd., U.K.).

Automation. A Gilson 215 robotic sample handler (Gilson, Middleton, WI) was used to transfer samples from the well plate to the NMR spectrometer. Special attention was paid to optimizing wash sequences to prevent buildup of lipids and proteins (which occur naturally in human blood plasma samples) on the insides of the transfer lines. We found negligible carry over if the transfer capillaries were washed with 10% (v/v) ATC (a type of bleach), 10% (v/v) hydrogen peroxide, and 0.1% (w/v) sodium azide between each sample.

NMR Spectroscopy. All experiments were acquired on a Bruker Avance spectrometer operating at 600.29 MHz (for 1H) using a 5 mm TXI flow-injection probe equipped with a z-gradient coil. The following NMR experiments were acquired on each sample at 300 K with a spectral width of 12 019 Hz; for each, 96 transients were collected after 8 dummy scans using 64k time domain data points.
The first was a standard 1D spectrum [RD-90°-3 µs-90°-tm-90°-acquire] with selective irradiation of the water resonance during the recycle delay (RD, 2 s) and during the mixing time (tm, 0.1 s). Two types of CPMG experiments28,29 were also acquired [RD-90°-(τ/2-180°-τ/2)n-acquire] with total echo times of 608 ms (n = 304, τ = 2000 µs) and 102.4 ms (n = 128, τ = 800 µs), respectively. Prior to Fourier transformation, data were zero-filled by a factor of 1 and multiplied by an exponential function corresponding to a line broadening of 1 Hz in the frequency domain using TopSpin 2.0 (Bruker Biospin, Karlsruhe, Germany). Spectra were phased using in-house software (NMRProc, Doctors Tim Ebbels and Hector Keun, Imperial College London). Data were then imported into Matlab (Mathworks, Natick, MA) for further analysis. Theoretical calculations were done in Mathematica 6.0 (Wolfram Research, Champaign, IL).

Principal Components Analysis (PCA) and Orthogonal Projections to Latent Structures Discriminant Analysis (O-PLS-DA). Unless otherwise stated, prior to PCA and O-PLS-DA31 the spectra were referenced to formate at δ8.452. Spectral regions corresponding to water were removed, and spectra were normalized such that the total intensity of each spectrum was a constant. PCA and O-PLS-DA models were constructed on full resolution spectra using software written in-house (Dr. Olivier Cloarec, Imperial College London). O-PLS-DA models were visualized by plotting the O-PLS coefficients color-coded according to the
(31) Trygg, J. J. Chemom. 2002, 16, 283–293.
correlation of each variable to the defined class.14 PCA models were mean centered, and O-PLS-DA models were unit-variance scaled.

THEORY

The CPMG spin-echo experiment (RD-90°-[τ/2-180°-τ/2]n-acquire) is conventionally used to measure T2 relaxation times by varying the total spin-echo delay nτ and fitting the attenuated signals to an exponentially decaying function. However, for biofluids, spin-echo experiments, including the Hahn spin-echo, have been widely used for spectral editing using a fixed-echo delay.32 The 180° pulse train effectively attenuates resonances from larger molecules in a complex mixture. The attenuation of a resonance can be described by the equation

$$I = I(0)\,e^{-\tau/T_2} \qquad (1)$$
where τ is the total echo time and I(0) is the intensity at zero echo time. Because larger molecules typically have a longer rotational correlation time τc, they have shorter T2 relaxation times, and their resonance intensities will be attenuated to an extent that increases with the relaxation delay applied. Signals from small molecules that bind to proteins will also be attenuated.33 For a standard pulse-and-acquire 1D NMR experiment, T2 is the time constant for the decay of the FID (the decay rate is 1/T2) and is inversely proportional to the line width. Here we propose two processing procedures that approximate a CPMG experiment using data from a 1D experiment. The first calculates the first derivative of the Fourier transformed 1D NMR spectrum and is here referred to as derivative-based relaxation-edited spectroscopy (D-RESY). This is explained by considering the mathematical function describing the shape of an NMR resonance (neglecting inhomogeneous broadening contributions), the complex Lorentzian, S(Ω), as the sum of absorption (A(Ω)) and dispersion (D(Ω)) parts:
$$S(\Omega) = A(\Omega) + iD(\Omega) = \frac{\lambda}{\lambda^2 + (\Omega - \Omega_0)^2} + i\,\frac{-(\Omega - \Omega_0)}{\lambda^2 + (\Omega - \Omega_0)^2} \qquad (2)$$

where λ = 1/T2 and Ω0 is the center frequency of the peak. This has the derivative (with respect to Ω),
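$$S'(\Omega) = \frac{-2\lambda(\Omega - \Omega_0)}{(\lambda^2 + (\Omega - \Omega_0)^2)^2} - i\,\frac{\lambda^2 - (\Omega - \Omega_0)^2}{(\lambda^2 + (\Omega - \Omega_0)^2)^2} \qquad (3)$$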
The new spectrum, Snew(Ω), is computed by taking the square root of the sum of the squares of the real and imaginary parts of eq 3, which simplifies to

$$S_{\mathrm{new}}(\Omega) = \frac{1}{\lambda^2 + (\Omega - \Omega_0)^2} \qquad (4)$$
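The simplification leading to eq 4 can be checked in one line. The shorthand x = Ω − Ω0 is introduced here only for compactness and is not part of the original notation; the numerator follows from squaring and adding the real and imaginary parts of eq 3.

```latex
|S'(\Omega)|^2
  = \frac{4\lambda^2 x^2 + (\lambda^2 - x^2)^2}{(\lambda^2 + x^2)^4}
  = \frac{(\lambda^2 + x^2)^2}{(\lambda^2 + x^2)^4}
  = \frac{1}{(\lambda^2 + x^2)^2},
\qquad x = \Omega - \Omega_0
```

Taking the positive square root gives eq 4.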
That is, Snew(Ω) is equal to A(Ω)/λ, or T2·A(Ω): each absorption line is scaled by its own T2, so broad lines (short T2) are attenuated relative to sharp ones. Another approach for suppressing broad lines is by Gaussian shaping of the FID prior to Fourier transformation,34 as has been
(32) Nicholson, J. K.; Buckingham, M. J.; Sadler, P. J. Biochem. J. 1983, 211, 605–615.
Figure 1. High-field region of 1H NMR spectra from a typical human blood plasma sample. The upper trace (black) shows the 1D spectrum, with the major metabolites identifiable in this region labeled. The blue and red traces are two different types of CPMG experiment acquired on this sample, with total echo times of 102.4 and 608 ms, respectively. The green and magenta traces are the D-RESY and G-RESY spectra, respectively (see text for details).
widely used for resolution enhancement, here called G-RESY. Here the FID is multiplied by a function of the form:

$$G(t) = e^{-at - bt^2} \qquad (5)$$
where a = LB and b = -a/[2(GB)(AQ)], AQ is the acquisition time, and LB and GB are the Lorentzian and Gaussian parameters to be adjusted. This function narrows the lines but induces negative lobes for the real part of the spectrum, which are then removed by subsequent magnitude or absolute value mode calculation. We found sufficient suppression of broad features, at the least cost in terms of signal-to-noise for human plasma samples, by setting the values of LB and GB to -10 and 0.12, respectively. In practice, both methods are readily implemented in standard commercially available software.

RESULTS AND DISCUSSION

The most widely used NMR experiment in metabonomics is a 1D 1H NMR experiment based on the first increment of a 2D NOESY pulse sequence, employing selective irradiation at the frequency of the water resonance during the recycle delay and the mixing time (“1D”). This spectrum gives a global snapshot of the molecular composition in a given sample. A major feature of the 1D spectrum from human blood plasma is the superimposition of sharp signals from small molecules with a broad protein and lipoprotein envelope of resonances, resulting in extensive peak overlap. These spectra can be simplified by suppressing broad features either experimentally (by application of the CPMG pulse
(33) Nicholson, J. K.; Gartland, K. P. NMR Biomed. 1989, 2, 77–82.
(34) Lindon, J. C.; Ferrige, A. G. J. Magn. Reson. 1981, 44, 566–571.
sequence) or by using D-RESY or G-RESY. Figure 1 plots the low frequency region for a typical plasma sample from this study, annotated to highlight the metabolites immediately identifiable in this region. The 1D spectrum is in black, and the dominant signals from lipoproteins are immediately apparent, compared to those from small molecules such as amino acids. Two CPMG spectra acquired from the same sample with total echo times of 102.4 (“CPMG1”, blue) and 608 ms (“CPMG2”, red) are also shown in Figure 1. Evident from these spectra is the strong suppression of the lipoprotein signals compared to those from the low-molecular weight compounds such as amino acids, but this is at the cost of inferior signal-to-noise, which falls to 0.30 and 0.13 of the 1D value for CPMG1 and CPMG2, respectively (measured from the maximum intensity of the alanine doublet at δ1.48, compared to the noise in the region δ2.0-δ1.0). The D-RESY spectrum is also shown (green); clearly the broad features from the 1D have been suppressed relative to the sharper peaks, and the signal-to-noise is 0.26 relative to that of the 1D spectrum. Finally, the G-RESY spectrum (selecting LB = -10 and GB = 0.12) results in even better suppression of broad signals (magenta trace) but with signal-to-noise calculated to be 0.22 relative to 1D. The key feature to note is that, using the 1D data alone, we have created two types of “virtual” relaxation-edited NMR spectroscopic data (D-RESY and G-RESY), in which the resonance intensities are proportional to their respective metabolite concentrations (in terms of comparing one sample to another). These data can be subject to pattern recognition in the same way as other types of NMR data, and the results directly compared to those obtained from CPMG spin-echo data, in which there is still some degree of differential T2 relaxation attenuation of low-molecular weight metabolite signals.1
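The two "virtual" relaxation-editing transformations can be illustrated on synthetic data. The following is a minimal sketch, not the authors' processing code: the FID contains one sharp and one broad Lorentzian resonance with arbitrary, illustrative frequencies, T2 values and amplitudes, and all function names are hypothetical. The Gaussian shaping uses a = LB exactly as written in the text; processing packages may scale LB differently (for example by a factor of π), which the text does not specify.

```python
import numpy as np

# Synthetic FID: one sharp and one broad resonance (all values are illustrative).
n_points, sweep_hz = 16384, 4000.0
t = np.arange(n_points) / sweep_hz                      # time axis in seconds
freq = np.fft.fftshift(np.fft.fftfreq(n_points, d=1.0 / sweep_hz))

def resonance(freq_hz, t2, amplitude):
    """Complex FID of a single Lorentzian line with relaxation time t2."""
    return amplitude * np.exp(2j * np.pi * freq_hz * t - t / t2)

fid = resonance(200.0, 0.3, 1.0) + resonance(-200.0, 0.005, 60.0)

def to_spectrum(signal):
    """Fourier transform with the first point halved to avoid a baseline offset."""
    scaled = signal.copy()
    scaled[0] *= 0.5
    return np.fft.fftshift(np.fft.fft(scaled))

def d_resy(signal):
    """Magnitude of the first derivative of the complex spectrum (eqs 3 and 4)."""
    spec = to_spectrum(signal)
    step = 2.0 * np.pi * (freq[1] - freq[0])            # angular frequency step
    return np.hypot(np.gradient(spec.real, step), np.gradient(spec.imag, step))

def g_resy(signal, lb=-10.0, gb=0.12):
    """Gaussian shaping of the FID (eq 5) followed by a magnitude calculation."""
    aq = t[-1]                                          # acquisition time
    a = lb                                              # a = LB, as in the text
    b = -a / (2.0 * gb * aq)
    return np.abs(to_spectrum(signal * np.exp(-a * t - b * t**2)))

def sharp_to_broad(trace):
    """Height of the sharp peak divided by the height of the broad peak."""
    sharp = trace[np.abs(freq - 200.0) < 20.0].max()
    broad = trace[np.abs(freq + 200.0) < 100.0].max()
    return sharp / broad

ratio_1d = sharp_to_broad(to_spectrum(fid).real)
ratio_d = sharp_to_broad(d_resy(fid))
ratio_g = sharp_to_broad(g_resy(fid))
print(f"sharp/broad peak height: 1D {ratio_1d:.2f}, D-RESY {ratio_d:.1f}, G-RESY {ratio_g:.1f}")
```

In this toy case both transformations raise the height of the sharp line relative to the broad one. The relative performance of the two methods here depends on the arbitrary synthetic parameters and should not be read as a statement about plasma spectra.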
Figure 2. PCA models for five types of 1H NMR data. (A) 1D data, (B) CPMG data, total echo time = 102.4 ms, (C) CPMG data, total echo time = 608 ms, (D) D-RESY, and (E) G-RESY. On the left-hand side are PCA scores plots, colored according to replicate status (blue = no replicate, red = replicate 1, and black = replicate 2; the numbers 1-10 label the positions of replicate 1); the right-hand side of each subfigure plots the PCA loadings. Numbers in parentheses are percentage variance explained by each PC.
Application of PCA. PCA is used frequently in metabonomics to extract and display systematic variation in a given (multidimensional) data set, allowing for an overview of the data to identify trends and outliers among the samples, and the variables (i.e., NMR metabolite signals) that influence these trends. Figure 2 shows the scores (left-hand side panels) and loadings (right-hand side panels) from PCA models constructed from the 1D and the four types of broad feature-suppressed data described above.
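As a rough illustration of this kind of PCA overview, the sketch below normalizes each spectrum to constant total intensity, mean-centers the variables, and obtains scores and loadings from a singular value decomposition. It is an assumption-laden example rather than the in-house software used in the study: the function name `pca_overview` and the synthetic data are hypothetical.

```python
import numpy as np

def pca_overview(spectra, n_components=2):
    """PCA of spectra (rows = samples) after total-intensity normalization and mean centering."""
    normalized = spectra / spectra.sum(axis=1, keepdims=True)
    centred = normalized - normalized.mean(axis=0)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]      # sample coordinates
    loadings = vt[:n_components]                          # variable contributions
    explained = (s**2 / np.sum(s**2))[:n_components]     # fraction of variance per PC
    return scores, loadings, explained

# Synthetic example: 20 "spectra" of 50 points with one peak of varying height.
rng = np.random.default_rng(0)
baseline = 1.0 + rng.random(50)
peak = np.zeros(50)
peak[10:15] = 5.0
amounts = rng.random(20)
spectra = baseline + np.outer(amounts, peak) + 0.01 * rng.standard_normal((20, 50))

scores, loadings, explained = pca_overview(spectra)
print("variance explained by PC1 and PC2 (%):", np.round(100 * explained, 1))
```

Plotting the first two columns of `scores` against each other gives a scores plot, and plotting each row of `loadings` against chemical shift gives the corresponding loadings plot.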
In each scores plot, the data points have been colored according to replication status for each sample (blue = no replicate, red and black = replicate 1 and 2, respectively) and annotated to show the positions of the 10 replicates in the first 2 PCs for comparison between data sets. The loadings have been plotted as a function of chemical shift. The PCA scores plot for 1D data (Figure 2A) shows high analytical reproducibility, as indicated by the proximity of scores from replicate samples. The PC1 loadings (right-hand side) were dominated by lipoprotein signals and glucose, while lactate and glucose were also seen to contribute to PC2 variation. In Figure 2B, the CPMG1 data from the same samples showed a similar distribution, but the major source of variation (as shown in the PC1 loadings) was from both lipoprotein signals and glucose. In Figure 2C, the CPMG2 scores plot shows three outlying samples in PC1. From the loadings (and later confirmed by visual inspection of the spectra), this was due to an unidentified singlet at δ3.15. The D-RESY spectrum (Figure 2D) had a similar distribution pattern for PC1 and PC2 scores as the CPMG1 spectrum but with almost no contribution from lipid peaks in PC2. Finally, the G-RESY spectrum (Figure 2E) showed lactate and glucose as the dominant sources of variation in PC1, with the singlet at δ3.15 the major variant resonance in PC2. In the context of PCA used for obtaining an “overview” of the data, it is notable that the data sets generated from the 1D (D-RESY and G-RESY) revealed the same trends, i.e., outliers and analytical reproducibility within technical replicates, and sources of variation as were revealed by the CPMG experimental data. Thus it is expected that similar biological information may be extracted from the “virtual” RESY data as from the CPMG experiments. For this we turn to a supervised method of analysis, namely, O-PLS-DA.

Application of O-PLS-DA.
O-PLS-DA31 can be considered an extension of PCA, in which a Y matrix (in this context, phenotype metadata) is regressed onto the X matrix (the NMR spectral data) and is useful for identification of biomarkers. A range of phenotype data existed on these samples, and we classed these as either “continuous variables” (e.g., waist-to-hip ratio and resting blood pressure) or “discrete variables” (e.g., presence of hypertension and smoking status). Then for analysis by O-PLS-DA and biomarker identification, each data set was sequentially grouped into above- and below-average groups or affected and nonaffected for each available phenotype, and the Q2Ŷ value was calculated for each model after computing one orthogonal component. The Q2Ŷ value gives an indication of the extent to which the phenotype can be predicted by the spectral data, with Q2Ŷ < 0 indicating the phenotype is unpredictable. Figure 3A shows the results from these calculations as a bar graph, with Q2Ŷ values for each data set (e.g., 1D, CPMG, or RESY) grouped for each phenotype. A notable feature of this plot is that there was no phenotype that could be better predicted by either CPMG experiment, when compared to D-RESY, G-RESY, or 1D data. O-PLS-DA results from predictable phenotypes (those with Q2Ŷ > 0.1, arbitrarily set) were visualized as color plots to identify the metabolites responsible (potential biomarkers). For example, results from models constructed for the most predictable phenotype, fat-free mass (FFM), are shown in Figure 3B-F, with O-PLS coefficients plotted as a function of chemical shift, in the region between δ0.8 and δ1.5. The result from the standard 1D spectral
Figure 3. Results from O-PLS-DA models constructed from the 1D, and four types of broad feature-suppressed NMR data after dividing into classes as described in the text. (A) Bar graph showing Q2Ŷ values for O-PLS-DA models constructed from available phenotypes, with the color scheme given in the inset. The phenotypes, from left to right, are glycosylated hemoglobin; 0 h and 2 h oral glucose tolerance test; body-mass index; waist-to-hip ratio; diastolic and systolic blood pressure; body fat percentage; fat-free mass; fat mass; impedance; myocardial infarction; bypass; angina; asthma; hyperlipidemia; hypertension; stroke; peripheral vascular disease; smoker. (B) O-PLS coefficient plot for 1D data of above- vs below-average fat-free mass (FFM) participants, (C) CPMG1 data, (D) CPMG2 data, (E) D-RESY, and (F) G-RESY.
data (Figure 3B) shows that participants with below-average FFM (positive O-PLS coefficients) had higher LDL levels (as seen by the orange-red correlation-colored broad signals at δ0.84 and δ1.25). Figure 3C shows that this phenotype was slightly better predicted (compared to 1D data) using the CPMG1 data set (Q2Ŷ = 0.34, compared to 0.22), with the major variables (in this region of the spectra) revealed again to be from LDL signals. In Figure 3D, the CPMG2 spectral data, with almost complete suppression of signals from lipoproteins, revealed that leucine (doublets at δ0.95 and δ0.97) and valine (doublets at δ0.97, δ1.02) were correlated with above-average FFM (negative O-PLS coefficients in this plot). This was not observed in the 1D or CPMG1 because of overlap with the broad protein envelope and the neighboring resonances from CH3 groups in lipoproteins. In parts E and F of Figure 3, the results of the O-PLS-DA models constructed from D-RESY and G-RESY data, respectively, are displayed. The LDL signals are seen to discriminate the below-average FFM group (consistent with observations from the 1D and CPMG1 data), while the
leucine and valine signals are seen to correlate with the above-average group in data from G-RESY, consistent with that revealed from CPMG2. Once again, there were no potential biomarkers identified by application of O-PLS-DA on CPMG data that could not also be observed in either of the two types of transformations of the 1D data for all the phenotypes we have investigated here.

Application of STOCSY to Broad-Feature Suppressed NMR Data. Statistical total correlation spectroscopy (STOCSY)13 provides a useful tool for establishing spectral connectivity information in a series of biofluid NMR spectra, and this approach is widely applicable to the extraction of latent biochemical information from complex mixtures.35-38 These studies are also relevant in the broader context of covariance NMR for faster
(35) Cloarec, O.; Campbell, A.; Tseng, L. H.; Braumann, U.; Spraul, M.; Scarfe, G. B.; Weaver, R.; Nicholson, J. K. Anal. Chem. 2007, 79, 3304–3311.
(36) Smith, L. M.; Maher, A. D.; Cloarec, O.; Rantalainen, M.; Tang, H.; Elliott, P.; Stamler, J.; Lindon, J. C.; Holmes, E.; Nicholson, J. K. Anal. Chem. 2007, 79, 5682–5689.
Figure 4. Results from a STOCSY model “driven” from the CH lactate resonance at δ4.11, expanded to show correlations to the lactate CH3 doublet at δ1.33 for 1D and broad feature-suppressed NMR data. (A) 1D data, (B) CPMG data, total echo time = 102.4 ms, (C) CPMG data, total echo time = 608 ms, (D) D-RESY, and (E) G-RESY.
acquisition times and resolution enhancement in NMR and other spectroscopic techniques.39-41 The principal aim of STOCSY experiments is to identify resonances that come from the same molecule as a peak of interest and is based on the principle that NMR signals from a given molecule will have a fixed proportionality through a series of spectra, provided the acquisition and processing methods are identical for each. However, this approach is potentially limited in crowded spectral regions, because the calculated correlation coefficient for those variables will depend on the intensity variation of peaks from more than one molecule. (37) Coen, M.; Hong, Y. S.; Cloarec, O.; Rhode, C. M.; Reily, M. D.; Robertson, D. G.; Holmes, E.; Lindon, J. C.; Nicholson, J. K. Anal. Chem. 2007, 79, 8956–8966. (38) Wang, Y.; Cloarec, O.; Tang, H.; Lindon, J. C.; Holmes, E.; Kochhar, S.; Nicholson, J. K. Anal. Chem. 2008, 80, 1058–1066. (39) Chen, Y.; Zhang, F.; Bermel, W.; Bruschweiler, R. J. Am. Chem. Soc. 2006, 128, 15564–15565. (40) Chen, Y.; Zhang, F.; Bruschweiler, R. Magn. Reson. Chem. 2007, 45, 925– 928. (41) Noda, I. Anal. Sci. 2007, 23, 139–146.
Thus, any approach that simplifies spectra should improve the result of a STOCSY model for a given class of molecules and hence aid metabolite identification. To test this hypothesis, we constructed STOCSY models “driven” from the known lactate (CH) quartet at δ4.11 and plotted the results in Figure 4, expanding around the lactate methyl CH3 doublet at δ1.33. In the 1D spectra this doublet is usually heavily overlapped with the nearby lipoprotein resonances, which can dramatically reduce the calculated correlations to this resonance (Figure 4A) and so hinder the identification of peaks belonging to the same molecule. Parts B and C of Figure 4 show that STOCSY models may be improved by acquiring CPMG spin-echo experiments on the same sample set (102.4 and 608 ms total echo times, respectively), because suppression of the lipoprotein signals results in less overlap. Parts D and E of Figure 4, however, show that the same result can be achieved by appropriate transformation of the 1D data (by D-RESY and G-RESY, respectively), owing to the decreased contribution from broad features in these spectra.
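The effect of overlap on a STOCSY correlation can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `stocsy`, the three-variable synthetic data set, and all intensities are assumptions chosen only for illustration. A "driver" variable is correlated with a clean peak from the same molecule and with a peak from the same molecule that is overlapped by an independently varying broad signal.

```python
import numpy as np

def stocsy(X, driver_idx):
    """Correlate every spectral variable with one driver variable.

    X has shape (n_spectra, n_variables). Returns the Pearson correlation
    and the covariance of each column with column `driver_idx`.
    """
    Xc = X - X.mean(axis=0)                    # mean-centre each variable
    driver = Xc[:, driver_idx]
    cov = Xc.T @ driver / (X.shape[0] - 1)     # covariance with the driver
    std = Xc.std(axis=0, ddof=1)
    corr = cov / (std * std[driver_idx])       # Pearson correlation
    return corr, cov

# Hypothetical data: 50 spectra, 3 variables (all values are assumptions).
rng = np.random.default_rng(0)
n = 50
conc = rng.normal(10.0, 2.0, n)      # concentration of one metabolite
lipid = rng.normal(30.0, 10.0, n)    # independent broad (lipoprotein-like) signal

X = np.column_stack([
    conc + rng.normal(0, 0.01, n),                 # 0: driver peak
    0.5 * conc + rng.normal(0, 0.01, n),           # 1: same molecule, no overlap
    0.5 * conc + lipid + rng.normal(0, 0.01, n),   # 2: same molecule, overlapped
])

corr, cov = stocsy(X, driver_idx=0)
print("clean peak r =", corr[1], " overlapped peak r =", corr[2])
```

In this toy case the overlapped variable shows a much weaker correlation to the driver even though it contains a signal from the same molecule, which is the situation that broad-feature suppression is meant to relieve.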
Figure 5. Result from an O-PLS-DA model constructed from 1D NMR data, showing O-PLS-coefficients for a group diagnosed with diabetes (positive O-PLS coefficients) against others. (A) 1D data and (B) G-RESY. Key discriminatory metabolites have been labeled.
Diabetes Status of Individuals. A principal aim of the MolPAGE project is to optimize analytical technologies for high-throughput analysis of human samples to profile the molecular phenotype of diabetes and cardiovascular disease. For this study three classes of participants were defined: without diabetes; intermediate; or overt diabetes. O-PLS-DA models were constructed for all types of NMR data, and the results for models constructed from 1D data and 1D data after Gaussian shaping of the FID are plotted in parts A and B of Figure 5, respectively.
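As a rough illustration of why the two transformations of 1D data attenuate broad features, the following Python sketch applies a first derivative and a Gaussian weighting to a synthetic two-line signal. It is a sketch under stated assumptions, not the processing used in this work: the line positions, amplitudes, T2 values, and the Gaussian window parameters (a maximum shifted to 0.15 s with a width of 0.03 s) are all hypothetical.

```python
import numpy as np

def simulate_fid(n=4000, dt=1e-3):
    """Synthetic FID: a narrow line (long T2) at +100 Hz and a broad line (short T2) at -100 Hz."""
    t = np.arange(n) * dt
    narrow = 1.0 * np.exp(2j * np.pi * 100.0 * t) * np.exp(-t / 0.5)
    broad = 20.0 * np.exp(-2j * np.pi * 100.0 * t) * np.exp(-t / 0.01)
    return t, narrow + broad

def spectrum(fid, dt):
    """Fourier transform an FID and return (frequencies, complex spectrum) in ascending frequency order."""
    fid = fid.copy()
    fid[0] *= 0.5  # conventional first-point scaling to avoid a baseline offset
    freqs = np.fft.fftshift(np.fft.fftfreq(fid.size, d=dt))
    return freqs, np.fft.fftshift(np.fft.fft(fid))

def peak_ratio(freqs, y, half_width=50.0):
    """Largest |y| near the narrow line divided by largest |y| near the broad line."""
    near_narrow = np.abs(freqs - 100.0) < half_width
    near_broad = np.abs(freqs + 100.0) < half_width
    return np.abs(y[near_narrow]).max() / np.abs(y[near_broad]).max()

dt = 1e-3
t, fid = simulate_fid(dt=dt)
freqs, spec = spectrum(fid, dt)

# Derivative-style processing: differentiate the absorption-mode spectrum.
deriv = np.gradient(spec.real, freqs)

# Gaussian-style shaping: weight the FID with a Gaussian whose maximum is
# shifted away from t = 0, where the fast-decaying broad signal is concentrated.
window = np.exp(-((t - 0.15) ** 2) / (2 * 0.03 ** 2))
_, spec_gauss = spectrum(fid * window, dt)

ratio_1d = peak_ratio(freqs, spec.real)
ratio_deriv = peak_ratio(freqs, deriv)
ratio_gauss = peak_ratio(freqs, np.abs(spec_gauss))  # magnitude mode sidesteps phase distortion
print(ratio_1d, ratio_deriv, ratio_gauss)
```

In this toy example both transformations raise the narrow-to-broad intensity ratio relative to the unprocessed spectrum, because the slope of a line scales inversely with its width and because the broad component has largely decayed before the weighting function reaches its maximum.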
The main features observed from 1D data (Q²Ŷ = 0.15) from the participants with overt diabetes were a significantly higher level of glucose, higher VLDL/LDL ratio, and other lipid resonances (Figure 5A). There also appeared to be a slightly higher VLDL/LDL ratio in intermediate participants compared to those without diabetes (data not shown). The model constructed from G-RESY data (Q²Ŷ = 0.17, Figure 5B) also revealed glucose as a significant discriminator between those with diabetes and others, while also indicating higher levels of pyruvate and decreased levels of
creatinine and acetate in the blood plasma of these participants. Although the correlations (and Q²Ŷ values) are low, these same potential biomarkers were also revealed by construction of models from the CPMG and the D-RESY data. Crucially, there were no trends observed in the CPMG data that were not also observed in the 1D data after standard, D-RESY, or G-RESY processing.

CONCLUSIONS

We have shown that the “virtual” RESY data derived from a set of standard 1D NMR spectra can give information specifically relating to small molecules (such as amino acids and sugars), similar to that extractable from CPMG spin-echo experimental data. In conditions optimized for high throughput (i.e., screening), the main objective is to minimize acquisition times while maximizing information recovery. Thus a large time saving may be afforded if this approach is taken to probe small molecules. It should be noted, however, that these approaches are not intended to be a substitute for spin-echo experiments, which can offer valuable information. Indeed, a spectrum of very closely spaced narrow signals (with long T2 times) would be observed in the CPMG but not in the D-RESY spectrum, especially at lower fields. For example, the bile acid “hump”, often observed between δ0.5
and δ2.0 (ref 42) is observed in CPMG experiments but would be “suppressed” in the D-RESY spectrum. This approach would also be useful in legacy data sets, in situations where the samples may no longer exist, and only standard “pulse and acquire” type 1D NMR spectroscopic data were acquired. The methods could also be applied to the analysis of urinary NMR data, as an alternative way of removing the broad urea signal (δ5.5-δ6) and thus to uncover other metabolites in this spectral region.

ACKNOWLEDGMENT

Financial assistance from the EU FP6 MolPAGE Project (Project LSHG-512066) is acknowledged. Andrew Hattersley (Peninsula Medical School), Graham Hitman (St. Barts and the London Medical School), and Michael Sampson (Norfolk and Norwich Hospital) are acknowledged for sample collection. Dorrit Baunsgaard is acknowledged for helpful suggestions.

Received for review May 23, 2008. Accepted August 1, 2008.

AC801053G

(42) Waterhous, D. V.; Barnes, S.; Muccio, D. D. J. Lipid Res. 1985, 26, 1068–1078.