Recursive Segment-Wise Peak Alignment of Biological 1H NMR

Dec 2, 2008 - Recursive Segment-Wise Peak Alignment of. Biological. 1. H NMR Spectra for Improved Metabolic. Biomarker Recovery. Kirill A. Veselkov,â€...
0 downloads 0 Views 3MB Size
Anal. Chem. 2009, 81, 56–66

Recursive Segment-Wise Peak Alignment of Biological 1H NMR Spectra for Improved Metabolic Biomarker Recovery Kirill A. Veselkov,† John C. Lindon,† Timothy M. D. Ebbels,† Derek Crockford,† Vladimir V. Volynkin,‡ Elaine Holmes,† David B. Davies,† and Jeremy K. Nicholson*,† Department of Biomolecular Medicine, Division of Surgery, Oncology, Reproductive Biology and Anesthetics (SORA), Faculty of Medicine, Imperial College London, Sir Alexander Fleming Building, SW7 2AZ, London, U.K., and Department of Physics, Sevastopol National Technical University, Streletskaya Bay, Crimea, Ukraine Chemical shift variation in small-molecule 1H NMR signals of biofluids complicates biomarker information recovery in metabonomic studies when using multivariate statistical and pattern recognition tools. Current peak realignment methods are generally time-consuming or align major peaks at the expense of minor peak shift accuracy. We present a novel recursive segment-wise peak alignment (RSPA) method to reduce variability in peak positions across the multiple 1H NMR spectra used in metabonomic studies. The method refines a segmentation of reference and test spectra in a top-down fashion, sequentially subdividing the initial larger segments, as required, to improve the local spectral alignment. We also describe a general procedure that allows robust comparison of realignment quality of various available methods for a range of peak intensities. The RSPA method is illustrated with respect to 140 1H NMR rat urine spectra from a caloric restriction study and is compared with several other widely used peak alignment methods. We demonstrate the superior performance of the RSPA alignment over a wide range of peaks and its capacity to enhance interpretability and robustness of multivariate statistical tools. The approach is widely applicable for NMR-based metabolic studies and is potentially suitable for many other types of data sets such as chromatographic profiles and MS data. Metabonomic and metabolomic studies involve the collection of multiple data sets, usually by nuclear magnetic resonance (NMR) spectroscopy and mass spectrometry.1-3 Although NMRbased metabonomics is generally highly reproducible and robust, the recovery of biological information is complicated by instrumental imperfections and noise and line shape variations, together with phase and baseline distortions, all of which contribute to * To whom correspondence should be addressed. E-mail: j.nicholson@ imperial.ac.uk. † Imperial College London. ‡ Sevastopol National Technical University. (1) Beckonert, O.; Keun, H. C.; Ebbels, T. M.; Bundy, J.; Holmes, E.; Lindon, J. C.; Nicholson, J. K. Nat. Protoc. 2007, 2, 2692–2703. (2) Nicholson, J. K.; Lindon, J. C.; Holmes, E. Xenobiotica 1999, 29, 1181– 1189. (3) Nicholson, J. K.; Wilson, I. D. Prog. Nucl. Magn. Reson. Spectrosc. 1989, 21, 449–501.

56

Analytical Chemistry, Vol. 81, No. 1, January 1, 2009

quantitative errors.4-7 Thus, NMR data sets often require extensive preprocessing prior to statistical modeling. For example, spectroscopic noise is usually reduced by applying free induction decay (FID) weighting functions or by using more sophisticated wavelet-based approaches.8,9 Robust preprocessing methods have also been applied routinely for manual or automatic baseline/ phase corrections.10-12 Major challenges in analysis of biological data sets can arise due to the natural variable dilution of samples and to uncontrolled variation in peak position across spectra, sometimes called “positional noise”.13-17 These effects induce unwanted systematic variation, which can compromise results and interpretation of statistical techniques, e.g., statistical total correlation spectroscopy (STOCSY),18-20 principal component analysis (PCA),21 and partial (4) Brown, T. R.; Stoyanova, R. J. Magn. Reson. 1996, 112, 32–43. (5) Carlos Cobas, J.; Bernstein, M. A.; Martin-Pastor, M.; Tahoces, P. G. J. Magn. Reson. 2006, 183, 145–151. (6) Stoyanova, R.; Nicholls, A. W.; Nicholson, J. K.; Lindon, J. C.; Brown, T. R. J. Magn. Reson. 2004, 170, 329–335. (7) Tang, C. G. J. Magn. Reson. 1994, 109, 232–240. (8) Leung, A. K.-m.; Chau, F.-t.; Gao, J.-b. Chemom. Intell. Lab. Syst. 1998, 43, 165–184. (9) Trbovic, N.; Dancea, F.; Langer, T.; Gunther, U. J. Magn. Reson. 2005, 173, 280–287. (10) Hoch, J. C.; Stern, A. NMR Data Processing; Wiley-Liss: New York, 1996. (11) Witjes, H.; Melssen, W. J.; Zandt, H. J. A.; van der Graaf, M.; Heerschap, A.; Buydens, L. M. C. J. Magn. Reson. 2000, 144, 35–44. (12) Witjes, H.; van den Brink, M.; Melssen, W. J.; Buydens, L. M. C. Chemom. Intell. Lab. Syst. 2000, 52, 105–116. (13) Craig, A.; Cloarec, O.; Holmes, E.; Nicholson, J. K.; Lindon, J. C. Anal. Chem. 2006, 78, 2262–2267. (14) Dieterle, F.; Ross, A.; Schlotterbeck, G.; Senn, H. Anal. Chem. 2006, 78, 4281–4290. (15) Forshed, J.; Schuppe-Koistinen, I.; Jacobsson, S. P. Anal. Chim. Acta 2003, 487, 189–199. (16) Holmes, E.; Foxall, P. J. D.; Nicholson, J. K.; Neild, G. H.; Brown, S. M.; Beddell, C. R.; Sweatman, B. C.; Rahr, E.; Lindon, J. C.; Spraul, M.; Neidig, P. Anal. Biochem. 1994, 220, 284–296. (17) Torgrip, R. J. O.; Åberg, M.; Karlberg, B.; Jacobsson, S. P. J. Chemom. 2003, 17, 573–582. (18) Cloarec, O.; Dumas, M.-E.; Craig, A.; Barton, R. H.; Trygg, J.; Hudson, J.; Blancher, C.; Gauguier, D.; Lindon, J. C.; Holmes, E.; Nicholson, J. Anal. Chem. 2005, 77, 1282–1289. (19) Holmes, E.; Cloarec, O.; Nicholson, J. K. J. Proteome Res. 2006, 5, 1313– 1320. (20) Smith, L. M.; Maher, A. D.; Cloarec, O.; Rantalainen, M.; Tang, H.; Elliott, P.; Stamler, J.; Lindon, J. C.; Holmes, E.; Nicholson, J. K. Anal. Chem. 2007, 79, 5682–5689. (21) Wold, S.; Esbensen, K.; Geladi, P. Chemom. Intell. Lab. Syst. 1987, 2, 37– 52. 10.1021/ac8011544 CCC: $40.75  2009 American Chemical Society Published on Web 12/02/2008

least-squares (PLS).22 For example, treatment groups can sometimes be discriminated on the basis of irrelevant variation in peak positions23 rather than from changes in sample composition characterizing the metabolic functions of an organism. Such difficulties are often observed in 1H NMR spectroscopic studies of urine, which varies in pH, ionic strength, metal ion concentrations, and osmolality, and also has a large dynamic range of multiply overlapped species. Correcting variation in peak position is therefore necessary for improved information recovery.13,14,24 A number of analytical strategies have been employed in attempts to address the problem of variable peak position across multiple spectra in biological data sets based on high-throughput technologies such as NMR,15,16,25,26 chromatography,27-29 and mass spectrometry.30,31 Several methods aim to identify and extract areas of analyte peaks, reducing spectral data to a set of quantified metabolites for data analysis.26,32 In this “targeted” profiling method,26 the NMR signatures of metabolites are usually identified by searching against a database of the pure components, but applications have been limited to quantifying the changes in defined and predictable biomarkers.32,33 In other methods, all spectral information should be used in data analysis and sample classification. One well-established but coarse-grained approach to correct for positional noise is “binning”,16,34 in which small segments or bins of a fixed (e.g., δ ) 0.04 ppm) or variable size are defined and integrated. This approach can give good classification information but at the expense of the loss of spectral resolution for biomarker identification, e.g., a reduction from 64k to 256 variables, which leads to difficulties in the analysis of small peaks and interpretation of statistical models. In order to overcome these limitations, spectral profiles need to be aligned at high resolution, e.g., spectra with typically 32k data points.15,27 Alignment methods aim at transforming spectroscopic signals from identical chemical groups so that they assume the same positions across multiple spectra. Peak positions of sample profiles are generally corrected with respect to peak positions of reference profile(s), with the exception of certain algorithms based on the use of a model peak.6,24 One major class of alignment methods relies on a set of detected peaks for the alignment. Examples include peak align(22) Wold, S.; Sjostrom, M.; Eriksson, L. Chemom. Intell. Lab. Syst. 2001, 58, 109–130. (23) Cloarec, O.; Dumas, M. E.; Trygg, J.; Craig, A.; Barton, R. H.; Lindon, J. C.; Nicholson, J. K.; Holmes, E. Anal. Chem. 2005, 77, 517–526. (24) Csenki, L.; Alm, E.; Torgrip, R.; Aberg, K.; Nord, L.; Schuppe-Koistinen, I.; Lindberg, J. Anal. Bioanal. Chem. 2007, 389, 875–885. (25) Forshed, J.; Torgrip, R. J. O.; Aberg, K. M.; Karlberg, B.; Lindberg, J.; Jacobsson, S. P. J. Pharm. Biomed. Anal. 2005, 38, 824–832. (26) Weljie, A. M.; Newton, J.; Mercier, P.; Carlson, E.; Slupsky, C. M. Anal. Chem. 2006, 78, 4430–4442. (27) Nielsen, N.-P. V.; Carstensen, J. M.; Smedsgaard, J. J. Chromatogr., A 1998, 805, 17–35. (28) Tomasi, G.; van den Berg, F.; ersson, C. J. Chemom. 2004, 18, 231–241. (29) van Nederkassel, A. M.; Daszykowski, M.; Eilers, P. H. C.; Heyden, Y. V. J. Chromatogr., A 2006, 1118, 199–210. (30) Jeffries, N. Bioinformatics 2005, 21, 3066–3073. (31) Wong, J. W. H.; Cagney, G.; Cartwright, H. M. Bioinformatics 2005, 21, 2088–2090. (32) Crockford, D. J.; Keun, H. C.; Smith, L. M.; Holmes, E.; Nicholson, J. K. Anal. Chem. 2005, 77, 4556–4562. (33) Shulaev, V. Brief. Bioinform. 2006, 7, 128–139. (34) Holmes, E.; Nicholson, J. K.; Nicholls, A. W.; Lindon, J. C.; Connor, S. C.; Polley, S.; Connelly, J. Chemom. Intell. Lab. Syst. 1998, 44, 245–255.

ment using reduced set mapping17 and fuzzy warping.35,36 The detected peaks are aligned once peak correspondence between sample profiles is established, though this can lead to nonalignment of small peaks, which are difficult to detect, and there is consequent loss of information. Noise can also introduce spurious peaks and compromise the alignment process.24 The correspondence of peaks in overlapped and dense spectral regions is difficult and can also result in misalignment. These methods can also fail to perform well in cases of spectral line shape distortions, because full line shape information is required for alignment.15 Another class of methods uses all spectral information for the alignment. Several algorithms are based on dynamic programming, e.g., dynamic time warping37 and correlation optimized warping (COW).27 Of the methods based on dynamic programming, COW has been extensively applied, because of its desirable property of preserving the area and shape of peaks.37,38 The COW algorithm aligns a spectral profile toward a target spectrum by piecewise linear stretching or compression of segments, allowing relatively small changes in a segment length. The sum of correlation coefficients between the reference and sample segments is usually maximized to determine an optimal path for the alignment.27 The major drawback of these methods is the computational complexity, with the processing time proportional to N2, where N is the number of data points,39,40 which leads to hours being required to align a typical one-dimensional (1D) metabonomic data set using a current desktop computer. A few methods, including parametric time warping41 and semiparametric warping,29 have attempted to accelerate the alignment process by explicit modeling of the warping function. However, sample profiles are aligned to reference profile(s) using a least-squares approach, which is biased toward the matching of major peaks and can result in misalignment of small, but potentially important, peaks. A computationally fast algorithm based on recursive alignment has also been proposed.39 It starts with the alignment of a full spectrum and progresses to smaller segments, when further alignment is required, making use of the fast Fourier transform (FFT) cross-correlation function to estimate a linear shift between segments. However, a linear shift correction of a full 1H NMR spectrum is not generally effective, as positions of nonshifted (stable) peaks, as well as those peaks exhibiting positional variation in a direction opposite to the shift correction, will be compromised. Application of these methods is advantageous for those high-throughput studies that have both local and global contributions into peak shifts, e.g., chromatographic data, in which the majority of peaks are generally shifted to some extent in the same direction.27,40 In 1H NMR-based metabolic studies, some peaks maintain stable positions, whereas others, particularly those arising from (35) Walczak, B.; Wu, W. Chemom. Intell. Lab. Syst. 2005, 77, 173–180. (36) Wu, W.; Daszykowski, M.; Walczak, B.; Sweatman, B. C.; Connor, S. C.; Haselden, J. N.; Crowther, D. J.; Gill, R. W.; Lutz, M. W. J. Chem. Inf. Model. 2006, 46, 863–875. (37) Bylund, D.; Danielsson, R.; Malmquist, G.; Markides, K. E. J. Chromatogr., A 2002, 961, 237–244. (38) Pravdova, V.; Walczak, B.; Massart, D. L. Anal. Chim. Acta 2002, 456, 77–92. (39) Wong, J. W. H.; Durante, C.; Cartwright, H. M. Anal. Chem. 2005, 77, 5655–5661. (40) Skov, T.; van den Berg, F.; Tomasi, G.; Bro, R. J. Chemom. 2006, 20, 484– 497. (41) Eilers, P. H. C. Anal. Chem. 2004, 76, 404–411.

Analytical Chemistry, Vol. 81, No. 1, January 1, 2009

57

metabolites with ionizable groups normal in the biological pH range and from those that form complexes, can change chemical shifts in either direction across spectra.3 As a result, methods developed for NMR-based studies have relied on local alignment, e.g., partial linear fit (PLF),42 peak alignment by means of genetic algorithm (PAGA),15 and peak alignment by beam searching (PABS).43 The basis of all these methods is to divide sample and reference spectra into a number of segments, each of which includes at least one peak. The sample segments are aligned independently by linear shifting at segment boundaries using interpolation toward corresponding reference segments.15,42,43 These methods are computationally fast, requiring only seconds or minutes to align an entire data set. The PLF method combines closely located peaks together into a single segment in an attempt to align each multiplet as a whole.42 The application of this method to biological studies has been limited, because the assumption that closely located singlets are part of the same multiplet is not normally fulfilled. The PAGA method sequentially divides a spectrum into segments in an unsupervised fashion according to a specified segment-size parameter.15 The correlation coefficient is calculated to determine an optimal shift position for each segment with respect to its reference by means of a genetic algorithm. However, an excessively large segment size can result in nonalignment of small peaks, whereas distortions in spectral profiles can be introduced by a small segment size due to a peak being cut at segment boundaries.39 In addition, genetic algorithms are usually employed in nonlinear optimization problems and are not efficient in the case of linear translation. As a result, the algorithm has been accelerated using a beam search approach (PABS)43 and later by FFT cross correlation (PAFFT).39 Applications of all these algorithms to NMR-based studies have generally been evaluated with respect to alignment of large peaks. The aim of this work is to consider the simultaneous alignment of both major and minor spectral peaks in an efficient and accurate way for 1H NMR-based metabonomic studies. We have developed the method of recursive segment-wise peak alignment (RSPA) to account for variations in peak position across samples in complex NMR spectra. The method refines the segmentation of reference and test spectra in a top-down fashion, subdividing the initial larger segments into smaller ones, as required, to improve the local peak alignment. RSPA makes use of the local peak position variation to simplify the alignment of a complex spectrum into the independent alignment of segments. To improve the accuracy of the alignment, NMR spin-coupled multiplets are aligned as a whole, not as separate singlets, by means of local recursion. The RSPA method is evaluated for the alignment of both major and minor NMR peaks in 1H NMR spectra of urine from a caloric restriction study. The performance of RSPA is compared with other widely used alignment methods, particularly with PAGA (or PAFFT) and COW. We use the mean of correlation coefficients between variance-scaled spectra, in which the difference in metabolite concentrations has been removed, as a quantitave measure of the alignment quality. The success of alignment is also considered using PCA by the increase in the explained variance of sample composition after alignment. (42) Vogels, J. T. W. E.; Tas, A. C.; Venekamp, J.; van der Greef, J. J. Chemom. 1996, 10, 425–438. (43) Lee, G.-C.; Woodruff, D. L. Anal. Chim. Acta 2004, 513, 413–416.

58

Analytical Chemistry, Vol. 81, No. 1, January 1, 2009

Finally, we evaluate the RSPA alignment for improved information recovery in STOCSY. MATERIALS AND METHODS Caloric Restriction Study Details. Experimental studies were conducted as a part of the Consortium for Metabonomic Toxicology (COMET) project, as reported previously,44-46 in male Sprague-Dawley rats using a standardized protocol. All studies were carried out in accordance with relevant national legislation and were subject to appropriate local ethical review approval. Male Sprague-Dawley rats (n ) 27), aged 6-8 weeks, were allowed to acclimatize for 48 h prior to caloric restriction, in metabolism cages at 21 °C and 55% relative humidity, with fluorescent lighting between 06.00 and 18.00. Throughout the acclimatization period, all animals had access to food (Purina chow) and water ad libitum. All animals were allocated randomly into three equal treatment groups (n ) 9), control and mildly and severely food restricted. The animals of the control group had access to food and water ad libitum, whereas mildly (treatment 1) and severely food restricted (treatment 2) animals had 75 and 50% food intake of control animals, respectively. Urine samples of animals were collected over solid CO2 in tubes containing sodium azide (1 mL of 1% solution) over the following time periods: -8 to 0, 8-24, 24-48, 48-72, 72-96, 96-120, 120-144, and 144-168 h. All 140 collected samples were immediately frozen and stored at -40 °C prior to NMR spectroscopic analysis. Half the animals from each group in this study were euthanized by CO2 inhalation at 48 h post-treatment and the remainder at 168 h post-treatment for removal of target organs for histological examination. 1 H NMR Spectroscopy of Urine. NMR spectral data were acquired on a Bruker DRX600 spectrometer operating at 600.13 MHz 1H observation frequency (Bruker Biospin, Rheinstetten, Germany) using a BEST (Bruker Efficient Sample Transfer) 5-mm flow injection probe for sample delivery and analysis. 1D NMR spectra of urine samples were acquired using a standard presaturation pulse sequence to suppress the residual water resonance. Free induction decays were collected into 64k data points, at 300 K, using a spectral width of 12 019 Hz, with an acquisition time 2.04 s, giving a total pulse recycle delay of 3.04 s. The data were zero-filled by a factor of 2 and the FIDs were multiplied by an exponential weighting function equivalent to a line broadening of 0.3 Hz prior to Fourier transformation. The resulting spectra were manually phased, baseline corrected, and calibrated to trimethylsilyl-2,2,3,3-tetradeuteropropionic acid at δ ) 0.0 using XWIN NMR (Bruker). Notation. The following notation has been used throughout this work. Lower case italic letters are used for scalars (e.g., sp) and lower case/boldface letters for row vectors (e.g., sp). Matrices are denoted with bold capital letters (e.g., X). Normalization: Accounting for Variable Dilution across 1 H NMR Spectra. Dilution causes uniform changes in concentrations of all metabolites in a sample, scaling all spectral intensities by the same factor. From a computational point of view, metabolite (44) Nicholson, J.; Keun, H.; Ebbels, T. J. Proteome Res. 2007, 6, 4098–4099. (45) Lindon, J. C.; Keun, H. C.; Ebbels, T. M.; Pearce, J. M.; Holmes, E.; Nicholson, J. K. Pharmacogenomics 2005, 6, 691–699. (46) Ebbels, T. M. D.; Keun, H. C.; Beckonert, O. P.; Bollard, M. E.; Lindon, J. C.; Holmes, E.; Nicholson, J. K. J. Proteome Res. 2007, 6, 4407–4422.

Figure 1. RSPA alignment scheme.

peaks affected purely by a dilution effect will exhibit the same n-fold changes between two spectra. A robust estimate of this dilution factor has been shown to be the median value of n-fold changes between sample and target profiles.14 This method is referred to as the probabilistic quotient method.14 The variation caused by dilution is removed separately for each sample with respect to a target profile for a complete data set. The choice of target profile is not crucial for the method’s performance14 and can be any sample profile of a given data set or, as typically selected, the median spectrum. Recursive Segment-Wise Peak Alignment of Chemical Shifts. A detailed description of all steps of the algorithm for RSPA alignment is given as Supporting Information, whereas a summary is provided here. RSPA aims to realign peak positions of a 1H NMR spectrum (discrete data signal) with respect to peak positions of a reference spectrum. The reference sample is selected on the basis of the greatest “closeness” to all spectra of interest in a given data set in terms of both major and minor peaks calculated by the “closeness index” (step 1, Figure 1), defined by eq A.1 in the Supporting Information. Corrections for complex peak shifts in a sample (“test”) spectrum in RSPA are simplified

into subcorrections for peak shifts in its segments relative to peak positions of corresponding reference segments. Segments are defined to contain multiple peaks such that any multiplet is most likely to be represented in a single segment. In turn, the correspondence between test and reference segments is established based on their proximity to each other (step 2, Figure 1). Next, peak shifts in each test segment are recursively corrected to match peak positions of its corresponding reference segment in a top down fashion, subdividing the initial larger segment into smaller ones, as required, to improve peak alignment (step 3, Figure 1). The recursion starts by shifting peaks in a test segment as a whole and then progresses to smaller subsegments until no further peak alignment is required. The maximum of FFT-derived cross correlation between a test segment and its reference segment is calculated to determine the optimal shifting for alignment. The recursion stops when either the smallest peak width is reached or a (sub-) segment is purely aligned, as validated by a measure based on scaled reference and test (sub-) segments. After shifting, the test segment is linearly interpolated at its boundaries. In the case of a segment containing a single multiplet, it will be aligned as a whole, requiring no further alignment and thus stopping recursion. When there are additional peaks in a segment, local recursive alignment will progress to smaller subsegments and align those peaks exhibiting positional variation. Finally, a test spectrum is reconstructed by joining aligned segments together (step 4, Figure 1). The method is computationally fast, requiring only minutes to align a data set of hundreds of NMR spectra of complex mixtures with 32k point resolution on a standard personal computer. Other Alignment Methods Used in This Work. Two commonly used alignment methods were selected for comparison purposes and are briefly described as follows. The PAGA method15 sequentially divides both sample (test) and reference spectra into segments in an unsupervised fashion according to a segment size parameter, predefined by a user (δ ) 0.025 ppm, see the discussion section). Next, peak positions in each test segment are shifted using interpolation at segment boundaries toward peak positions of its corresponding reference segment. The optimal shift for alignment is determined by the maximum of the Pearson correlation coefficient between a test segment and its corresponding reference by means of a genetic algorithm. The PAFFT version of the algorithm, described and implemented by Wang,39 was used in this work. A maximum shift-size parameter can also be specified to avoid excessive linear translation and a value (δ ) 0.025 ppm) corresponding to the largest positional variation in our data set was selected. The COW method27 initially divides both sample (test) and reference spectra into a number of segments of equal size according to a segment-size parameter. Based on average peak width, the value of this parameter was set to δ ) 0.01 ppm, as recommended elsewhere.40 In contrast to PAGA, boundaries of sample segments are allowed to change to a certain degree as specified by the so-called “slack” parameter. When the lengths of a test segment and its corresponding reference differ, the former is linearly stretched or compressed in order to obtain the segment of equal length. The sum of correlation coefficients between modified test segments and their references is maximized to determine the optimal alignment path by means of dynamic Analytical Chemistry, Vol. 81, No. 1, January 1, 2009

59

Table 1. Alignment Quality (aqbin) Measure for an Entire Data Set in Which the Areas Corresponding to Bin Sizes of Each Spectrum Were Scaled to Unit Variance method

aq0.02

aq0.04

aq

0.08

aq0.12

non-aligned PAFFT COW RSPA

0.746 0.771 0.799 0.818

0.799 0.829 0.842 0.869

0.826 0.861 0.867 0.888

0.849 0.889 0.896 0.913

whereas the bin size of δ ) 0.08 is sufficient to minimize the contribution of minor peaks and so is used to evaluate the alignment of major peaks. The ccbin reflects the matching of peaks between any two profiles. To assess the peak realignment quality across all spectra, we calculate the mean of all pairwise ccbin values (below the main diagonal of the correlation matrix). n

aqbin ) programming. The choice of value of the slack parameter is difficult, and thus, different values were tried until good alignment with no or minimal spectral distortion was achieved; the optimal value was δ ) 0.005 ppm. The COW algorithm implemented and described by Skov40 was used in this work. Due to insufficient memory for the application of COW to full resolution spectra, it was necessary to divide spectra into regions of minimal width (1 ppm, 2000 data points) and these were aligned by COW independently and then joined together. The spectra were divided in such a way to avoid cutting peaks in both test and reference spectra. The number of parameters in the RSPA alignment method (Table 1, Supporting Information) is comparable with the COW (2 parameters) and PAFFT (1 parameter) methods, but the selection of their values is transparent in contrast to PAGA and COW, where the optimal values of the slack parameter and the minimal segment size are difficult to determine. Evaluation of Spectral Alignment. The correlation coefficient (cc) between sample profiles in NMR-based studies has typically been used as an objective measure of alignment quality.15,25 However, the cc is a poor indicator of the similarity or peak matching between two given spectra as it is unduly influenced by the covariance of the highest peaks27 and by differences in sample composition between spectra. The cc can be expressed as

i-1

∑ ∑ cc

2 n(n - 1) i)1

bin(si, sj)

where si and sj are ith and jth spectra, respectively. The value of aqbin measure varies from zero to one, from a nonaligned to a completely aligned data set. PCA has been frequently applied to assess alignment quality.15,17,25 PCA attempts to explain as much variance as possible in a data set using a small number of orthogonal principal components (PCs).21 This often provides an efficient representation of relationships between data elements, ignoring features attributable to noise. The success of alignment has been reported with respect to an increase of explained variance by PCs and improved sample clustering in the PC space.15,17,25 However, PCA is usually applied to unscaled data sets, which are biased toward the variation of the highest peaks because of their larger contribution to total variance, similar to that for the cc analysis. In order to achieve equal impact of variables in a PC model, each variable should be scaled to unit variance. In either case, an examination of whether an increase in explained variance reflects the variation in chemical composition, but not the variation induced by variable peak positions, is required. Formally, PCA summarizes the data matrix in terms of scores and loadings: X ) TP' + E

cc(t, s) )

Cov(t, s)

√Var(t)Var(s)

(1)

where Cov(t,s) is the covariance between t and s profiles and Var is the variance of the sample profile. When the cc is calculated between wide spectral regions, minor peaks will be down weighted because of the large contribution of the highest peaks into the variance. The value of the cc will then be mainly influenced by a small number of large peaks. Moreover, the cc is a good measure only when differences in metabolite concentrations (levels of individual peaks) across spectra are removed. To overcome this problem, the local areas must be scaled appropriately prior to calculating the cc.27 We define ccbin as

ccbin(t, s) )

Cov(tbin, sbin)

√Var(tbin)Var(sbin)

(2)

where subscript bin indicates the operation in which spectral segments are mean centered and then scaled to unit variance. Bin sizes of δ ) 0.02 or 0.08 ppm are used in this work. The value of δ ) 0.02 ppm is selected to be a few times the full width half maximum of a typical peak to ensure the equal contribution of both minor and major peaks in the alignment quality measure, 60

Analytical Chemistry, Vol. 81, No. 1, January 1, 2009

(3)

j)1

(4)

where X is a data matrix of biological 1H NMR spectra, T is a matrix of scores and P is a matrix of weights (loadings) of spectral variables into the scores, E is a residual, and the prime (′) denotes a transpose operation. PC scores summarize inter-relationships between observations (1H NMR spectra of urine samples), e.g., groupings according to physiological traits or gender and dose or time-dependent changes. A subset of metabolites corelated with PC scores can be identified using the corresponding factor loadings. In the case of an unscaled data set, loadings can be directly used to interpret a model. If PC scores explain the variation in sample/chemical composition, i.e., amplitudes of metabolite peaks, the line shapes of PC loadings will resemble metabolite peaks as in an NMR spectrum. These line shapes will be distorted,6 e.g., showing the first-derivative shape of a metabolite peak, in cases where the variation caused by variable peak position (and also phase and peak line shape distortions) contributes into PC scores (Figure 2). The approach is difficult with unit-variance scaled models because the line shapes of loadings are not interpretable in the same way as in unscaled models and so cannot be used directly to identify peaks dominating the variance in the data (Figure 2). The mathematical properties of PC loadings need to be examined

than the biological ones.18 STOCSY analysis takes advantage of this multicolinearity to reveal structural correlations between various peaks in a set of biological spectra. Formally, STOCSY involves examining the correlation matrix of a set of 1H NMR spectra

R)

Figure 2. Influences of different sources of variation on the PC models. uv is a unit-variance scaling. The blue and red colors reflect the loadings of the first PC and second PC in the unscaled and uvscaled cases. The first and second transformed-uv PC loadings can be understood with respect to the line shapes of unscaled loadings. The transformed-uv loadings are color-coded by the actual uv-scaled loading values, re-expressed as correlation coefficients between spectral and PC variables.

to enable effective interpretation. The PC loading (p) can be expressed as follows47

pti,rk )

ccti,rk√σrk

√λt

(5)

i

where σrk and λti are variances of spectral (peak intensity) variable rk and PC vector of scores ti, respectively, and cc denotes the correlation coefficient between rk and ti. In the unscaled case, the line shapes of PC loadings (eq 5) are interpretable because a loading value is proportional to the standard deviation of a spectral variable σk, which is respectively lower and higher for variables that constitute minor and major intensities of a peak. This interpretation is destroyed when each variable is scaled to unit variance, i.e., σk = 1 in eq 5. In this case, a loading value can be multiplied by the standard deviation of an original spectral variable for interpretation of the dominant source of the variance contribution into PC scores (Figure 2). Similar to the unscaled case, the line shapes of transformed (or scaled) uv-loadings will be distorted,6 e.g., showing the first-derivative shape of a metabolite peak, in cases where the variation caused by variable peak position contributes to a model (Figure 2). The transformed uv-loadings can be plotted using a color code corresponding to the weight value obtained from a unit-variance PC (which is re-expressed here as cc between spectral variables and PC scores, Figure 2). A similar transformation has been used elsewhere to interpret unitvariance scaled O-PLS regression models.23 The success of alignment needs to be further evaluated with respect to the major advantage of NMR spectroscopysthe provision of detailed information on molecular structures in complex biological mixtures. Statistical correlations between intensity variables resulting from both intramolecular connectivity of proton nuclei and biological relationships are observed in 1H-NMR biofluid studies. The structural correlations are generally stronger (47) Johnson, R. A.; Wichern, D. W. In Applied multivariate statistical analysis; Prentice-Hall, Inc.: Englewood Cliffs, NJ, 1988.

1 XX' n - 1

(6)

where n is the number of sample observations X is a data matrix of biological 1H NMR spectra, in which columns are intensity variables scaled to unit variance and rows are spectral observations, and R is a matrix of pairwise cc between intensity variables, but not between spectra. Usually, correlations of one specific variable to all other intensity variables are studied, and this is referred to as 1D-STOCSY.18 The influence of variable peak positions across spectra on such an analysis can be interpreted by plotting covariance between intensity variables with color codes linked to their ccs. The chemical shift variability will distort the line shapes of covariance patterns, which should resemble peaks in an NMR spectrum. In turn, successful peak alignment has to improve identification and determination of both structural and biological correlations. RESULTS AND DISCUSSION Chemical Shift Variation in 1H NMR Spectra of Biofluids. The 1H NMR urine profiles of Sprague-Dawley rats subjected to caloric restrictions were used, as an example, for demonstrating the alignment quality of the methods. Dietary effects often result in subtle metabolic changes, where even small variations in peak positions will significantly limit the recovery of information. The influence of variable peak positions across all (140) 1H NMR spectra of the caloric restriction study is exemplified by the spectral region between δ 3.2 and 2.5 in Figure 3. The upper traces show an overlay of all spectra, and the lower traces are indicative of the intensity positions of peaks. For example, the two AB doublets of citrate (δ ) 2.54 and δ ) 2.68, respectively) clearly change in position, with the high-frequency pair more variable than the low-frequency pair. Each doublet exhibits positional variation to a different extent and as a unit, whereas the position of the small methylamine peak (δ ) 2.61, singlet) in between the doublets remains relatively stable. Hence, a single correction for peak positions across a whole spectrum is not generally effective and the problem should be approached on a segment-wise scale, which is often the case for NMR-based biological studies.15,25,42,43 Removal of Uncontrolled Chemical Shift Variation across Spectra. The variable dilution effects between sample profiles were removed prior to the alignment by applying the probabilistic quotient normalization.14 The alignment of the data set was performed as follows: the reference samples were selected for control animals and those animals with 25 and 50% caloric restrictions (including all time points) based on the greatest “closeness” with all other profiles within the group, and these were aligned with each other by the RSPA method (Figure 3). In order to compare the performance of each alignment method (RSPA, PAFFT, COW) in an unbiased way, the same reference spectra were used for each method. The results after RSPA alignment are exemplified for the small spectral region (δ 3.2-2.5) in Figure 3. Analytical Chemistry, Vol. 81, No. 1, January 1, 2009

61

Figure 3. Example of peak position variation. Nonaligned 1H NMR urine profiles (left) and profiles processed by RSPA (right). The lower plots represent peak intensity positions of all the 140 1H NMR spectra of the caloric restriction study.

Evaluation of Alignment Quality Using Correlation Maps. Alignment quality was initially evaluated using the ccs of the variance-scaled sample profiles. The areas corresponding to bin sizes (δ ) 0.02 or δ ) 0.08) of each spectrum were scaled to unit variance prior to calculating the cc, in order to equalize the difference in concentrations of metabolites across spectra and to make the contribution of minor and major peaks more even. The value of δ ) 0.02 ensures an equal contribution of both small and large peaks in the alignment quality measure (minor peaks, Figure 4a), whereas the segment size of δ ) 0.08 is large enough to minimize the influence of small peaks (major peaks, Figure 4b). A summary of the alignment quality for the whole data set, i.e., the ccs between all pairs of spectra, is visualized using correlation maps (Figure 4), where the gradient of blue to red color indicates a low to high similarity or peak matching between variance-scaled sample profiles. The correlation maps include information about the alignment of individual spectra to all others (respectively, rows or columns of the matrix) for the range of peak intensities, while the whole map indicates the peak realignment quality for the entire data set. In addition, the mean of all pairwise ccs is taken as a single quantitative measure of realignment quality for the whole data set (eq 3, Table 1). The results show that all three methods studied performed rather well using the large peaks (bin size g0.08) and noticeably improved the similarity between spectra (Figure 4a, Table 1). The improvement is comparable using the PAFFT and COW methods and somewhat better using the RSPA method. However, with respect to smaller peaks (bin size e0.04), the PAFFT method performs poorly as shown by the correlation map, which exhibits little improvement (Figure 4b), and the relatively small increase in aqbin (Table 1). Although alignment of smaller peaks can be improved using PAFFT by reducing the minimal segment size parameter, which was set to δ 62

Analytical Chemistry, Vol. 81, No. 1, January 1, 2009

) 0.025 ppm, spectral distortions were introduced for large peaks in this study due to peaks being cut at segment boundaries with segments of smaller size. The COW method clearly improves the matching of both small and large peaks, though it is at the expense of computational complexitysseveral hours were required to align this data set. Moreover, the RSPA method provides superior performance for the alignment of both small and large peaks (Figure 4a,b). The major advantage of the RSPA method over PAFFT (and similar PAGA and PABS) is that local recursion enables the alignment of smaller peaks in large segments. In comparison with COW, RSPA is computationally efficient, requiring just minutes to align this data set. Additionally, the resulting greater similarity between sample profiles is probably due to the linear shifting and interpolation employed in the RSPA method, which preserves the shape and integral of peaks better than the stretching or compression used in the COW method. NMR peak profiles are theoretically and practically very different to chromatographic data, in which the width of a peak can be altered,27,40 and hence, stretching or compression can be of potential advantage compared to linear translation. The high similarity between all sample profiles after the RSPA alignment indicates a good choice of reference spectra. It is accepted that the first 90 sample profiles are more similar to each other than later ones but that is also the case for the nonaligned data, and therefore, the relative improvement in peak matching is equivalent for all spectra. The reason for this is that the first 90 spectra represent mainly phenotypic changes of control animals and these usually exhibit similar behavior. Evaluation of the Variation in Sample Composition Using PCA. In order to evaluate further the effectiveness of the alignment methods studied here, we have applied PCA to recover the latent

Figure 4. Correlation maps indicating the similarity between sample profiles influenced by (a) large metabolite peaks and (b) by both small and large metabolite peaks. ccbin size indicates the correlation coefficient between two spectra, in which spectral segments down to specific bin size are scaled to unit variance.

components associated with metabolic changes over time during caloric restriction. The data sets were either unscaled or were subjected to unit-variance (uv) scaling prior to PCA. To provide a biological interpretation of changes in PC scores, we have related these changes statistically to the variation in spectral variables (metabolite concentrations) by calculating the portion of the variable variance explained by the PC(s) using PC loadings (eq 5). The PC scores of the uv-scaled nonaligned data exhibit changes due to caloric restriction, though the corresponding loading plot of the first PC comprises many first-derivative-like line shapes (Figure 5a, PC loadings). This indicates that variable peak positions contribute to the PC scores such that the PC explains the variation caused by both variable peak position and sample composition. The situation is made worse for the unscaled nonaligned data since the first PC loadings comprise entirely first derivative-like line shapes. There is little useful information about the variation in the sample composition in this PC.

After chemical shift alignment by RSPA (Figure 5b), the changes in PC scores induced by caloric restriction in both unitvariance and unscaled models are clearer and due to sample composition, because the loadings as shown for the first PC are not distorted, exhibiting peaklike line shapes in an NMR spectrum. In both unscaled and unit-variance cases, the response to caloric restriction, as determined by PC scores, is clearly related to variations of multiple metabolites after the alignment. This provides meaningful biological interpretation of models. For example, metabolites such as 2-oxoglutarate, citrate, creatine, creatinine, and many others (those colored red in PC loading plots, Figure 5) covary when animals respond to caloric restriction. This covariance of metabolites occurs at the global system level and dominates the latent variables modeled by PCA. The PC models applied to profiles aligned by the two other methods, COW and PAFFT, mainly reflect “true” variation in Analytical Chemistry, Vol. 81, No. 1, January 1, 2009

63

Figure 5. PCA scores (PC1 vs PC2) and PC loadings (PC1) of the caloric restriction study based on (a) nonaligned 1H NMR urine profiles and profiles aligned by (b) RSPA. The PC scores are color-coded as follows: blue, 1H NMR spectra of urine samples of control animals over the time course (7 days) of the experiment plus treatment groups 1 and 2 at the 0-h time point; black, 24-h time point for treatment groups 1 and 2; green and red, all time points from 2 days post-treatment for treatment group 1 animals and treatment group 2 animals (see Materials and Methods for the description of treatment groups). Upper (lower) sections of PC loading plots indicate variables positively (negatively) covarying with PC1 scores.

sample composition (Figure 1 in the Supporting Information), although the variable positions of several peaks after the application of PAFFT methods still give small contributions into PCs due to non- or misalignment of small peaks. Both COW and PAFFT explain less variation than RSPA in the sample composition in uv- and unscaled models and thus result in less successful alignment and thus less interpretable models (Table 2). 64

Analytical Chemistry, Vol. 81, No. 1, January 1, 2009

RSPA for Improved Information Recovery via STOCSY. Since the major advantage of NMR spectroscopy is to provide detailed information on molecular structure in complex biological mixtures, RSPA alignment is further evaluated for the enhancement of information recovery in STOCSY analyses. STOCSY was applied to identify peaks correlated to those at chemical shifts δ ) 2.68 (1/2-CH2 group of citric acid) and δ ) 9.13 (CH group of

Table 2. Percentage of Variance of the Caloric Restriction Data Explained by the First Two PCs in uv-Scaled or Unscaled Cases

method

PC1 (uv-scaled) (%)

PC2 (uv-scaled) (%)

PC1 (unscaled) (%)

PC2 (unscaled) (%)

non-aligned PAFFT COW RSPA

18.8 18.9 18.2 19.3

9.2 9.3 8.6 9.7

34.2 51.9 54.3 60.1

16.3 9.2 8.8 9.3

N-methylnicotinic acid, NMNA), respectively. The chemical shift variation of citric acid peaks is an extreme case in metabonomic studies, whereas NMNA exhibits smaller positional variability and is also present in the urine in a concentration range several times less than citric acid. The chemical shift of the citrate intensity variable at δ ) 2.68 ppm is so variable that even the structural correlation between citrate doublets cannot be identified. The derivative-like line shape of covariance patterns indicates, as in the case of PC loadings, that the citrate positional variance exceeds that due to concentration changes and thus has a major impact on the analysis (Figure 6a). After RSPA alignment, there are strong positive structural correlations between the two doublets of citric acid and smaller positive correlation with 2-oxoglutarate represented by the two structurally correlated triplets (Figure 6b). Substantially smaller and negative correlation can also be seen for the creatinine peak. The biological correlations between citrate, 2-oxoglutarate, and creatinine are likely to occur due to effects caused by caloric restriction on metabolic activity. The stronger correlation between

citrate and 2-oxoglutarate is due to the fact that their concentrations are linked within the tricarboxylic acid cycle.3 The situation is rather different with NMNA, where the variable chemical shift only reduces the structural correlations between NMNA peaks (Figure 6c), values of which are substantially increased after RSPA alignment allowing better molecular identification (Figure 6d). Furthermore, a positive biological correlation between NMNA and trans-aconitate peaks is only revealed after alignment, which arises because the identification of biological correlations is not restricted to intramolecular peaks but can potentially involve all peaks in a data set. The relationship between NMNA and trans-aconitate may result from extended genome activity of the gut microbiota.48

CONCLUSIONS The problems that variable peak positions have on information recovery from one-dimensional biological NMR spectra have been addressed. The RSPA method, based on recursive segment-wise alignment, has been developed to remove the variation in peak positions across sample profiles. It provides a fast and accurate spectral alignment using full spectral information. The computational efficiency of the method is achieved by simplifying the alignment of a complex spectrum into the independent alignment of segments using the local nature of chemical shift variation. To improve the accuracy of the alignment, the RSPA local recursion algorithm is designed to align a multiplet as a whole, not as separate peaks. In order to avoid the subjective choice of the reference profile, the closeness index for the selection of the

Figure 6. One-dimensional STOCSY analysis to identify intensity variables correlated to that at chemical shifts, δ 2.68 and δ 9.13. Upper (lower) sections of the plot indicate variables positively (negatively) covarying with a variable at defined chemical shift, δ. Analytical Chemistry, Vol. 81, No. 1, January 1, 2009

65

reference sample on the basis of closeness to all other profiles of interest in terms of both large and small features has been introduced. The limitations of the use of correlation coefficients between sample profiles as a quantitative measure of alignment quality have been overcome by the application of scaling to equal variance of spectral segments of variable size. As a result, representative measures have been obtained indicative of the alignment of minor and major peaks. Among multivariate methods, PC models have been examined to explain the variation in sample composition after alignment. The use of scaled-PC loadings to overcome the difficulties faced when interpreting unit-variance models has been investigated and shown to be helpful. Since the major advantage of NMR spectroscopy is to provide detailed information on molecular structure in complex biological mixtures, the effects caused by variable peak positions across samples on information recovery using STOCSY have been investigated. RSPA alignment has resulted in considerable enhancement of the identification and determination of both structural and biological correlations between 1H NMR intensity variables. The RSPA method for the alignment of biological NMR spectra has been compared to two other widely applied methods, PAFFT and COW. The major advantage of the RSPA method over PAFFT (and similar methods, PAGA and PABS) is that local recursion enables the allocation and alignment of smaller peaks, providing more accurate alignment. RSPA is computationally fast compared to COW and, therefore, can be applied to align large data sets including hundreds or thousands of sample profiles. In line with the known properties of 1H NMR spectra, the linear translation

66

Analytical Chemistry, Vol. 81, No. 1, January 1, 2009

used in RSPA preserves the shape of peaks better than the compression and stretching used in COW. Finally, the alignment of spectra using RSPA will undoubtedly be improved further by optimizing the parameters to prevent misalignment on a local scale. The method should be of wide application for the alignment of NMR-based studies in metabonomics, metabolomics, and natural product mixture and food analysis and may be suitable for application to many other types of data sets such as chromatographic profiles. ACKNOWLEDGMENT K.A.V. gratefully acknowledges the MetaGrad program and Overseas Research Students Awards Scheme (ORSAS) for financial support. The members of the COnsortium for MEtabonomic Toxicology (COMET) are acknowledged for providing the data set used in the manuscript. Thomas Skov and Jason W. H. Wang are acknowledged for providing scripts of the PAFFT and COW alignment algorithms. SUPPORTING INFORMATION AVAILABLE Additional information as noted in text. This material is available free of charge via the Internet at http://pubs.acs.org.

Received for review June 7, 2008. Accepted September 29, 2008. AC8011544 (48) Nicholson, J. K.; Connelly, J.; Lindon, J. C.; Holmes, E. Nat. Rev. Drug Discovery 2002, 1, 153–161.