ARTICLE pubs.acs.org/ac
Data-Driven Approach for Metabolite Relationship Recovery in Biological 1H NMR Data Sets Using Iterative Statistical Total Correlation Spectroscopy
Caroline J. Sands, Muireann Coen,* Timothy M. D. Ebbels, Elaine Holmes, John C. Lindon, and Jeremy K. Nicholson*
Biomolecular Medicine, Department of Surgery and Cancer, Faculty of Medicine, Imperial College London, Sir Alexander Fleming Building, South Kensington, SW7 2AZ, United Kingdom
ABSTRACT: Statistical total correlation spectroscopy (STOCSY) is a well-established and valuable method in the elucidation of both inter- and intrametabolite correlations in NMR metabonomic data sets. Here, the STOCSY approach is extended in a novel Iterative-STOCSY (I-STOCSY) tool in which correlations are calculated initially from a driver peak of interest and subsequently for all peaks identified as correlating with a correlation coefficient greater than a set threshold. Consequently, in a single automated run, the majority of the information contained in multiple STOCSY calculations from all peaks recursively correlated to the original user-defined driver peak of interest is recovered. In addition, highly correlating peaks are clustered into putative structurally related sets, and the results are presented in a fully interactive plot where each set is represented by a node; node-to-node connections are plotted alongside corresponding spectral data colored by the strength of connection, thus allowing the intuitive exploration of both inter- and intrametabolite connections. The I-STOCSY approach has been applied here to a 1H NMR data set of 24 h postdose aqueous liver extracts from rats treated with the model hepatotoxin galactosamine and has been shown both to recover the previously deduced major metabolic effects of treatment and to generate new hypotheses even on this well-studied model system.
I-STOCSY, thus, represents a significant advance in correlation based analysis and visualization, providing insight into inter- and intrametabolite relationships following metabolic perturbations.
Received: November 2, 2010. Accepted: January 21, 2011. Published: February 16, 2011.
dx.doi.org/10.1021/ac102870u | Anal. Chem. 2011, 83, 2075–2082
The complex biochemical data sets generated by nuclear magnetic resonance (NMR) spectroscopy or mass spectrometry of biomaterials such as urine and plasma require intensive data mining to recover useful biological information.1,2 One aim of metabonomics is to determine the important metabolic consequences in response to perturbation, and as such, a typical approach includes the identification of important or discriminatory variables by pattern recognition methods, followed by their assignment as specific metabolites using a combination of statistical and experimental methods.3,4 Consequently, data analysis can be time-consuming, the identification of biomarkers can be difficult, and the visualization and interpretation of results can be complicated by the multivariate nature of the data.
Many NMR data analysis methodologies utilize the inherent correlations in the data,5-8 and one commonly used technique in the analysis of NMR generated metabolic data sets is that of statistical total correlation spectroscopy (STOCSY).9 In STOCSY, through calculation of the correlation statistics between peak intensities, the degree to which resonances covary across samples is determined and visualized by back projection onto the covariance spectrum. While high correlations between peaks typically indicate structurally related resonances, i.e., peaks arising from the same molecule, lower correlations can allude to biological relationships, for example, between molecules in the same biochemical pathway or reaction intermediates. Thus, although originally developed as a tool for structural assignment, STOCSY has subsequently been used in the extraction of biomarker, pathway, and reaction information.10-12 Additional applications include the identification of drug metabolite signatures in molecular epidemiology studies,13 enhancing information recovery from LC-NMR and J-resolved and diffusion-edited NMR data sets14-16 and aiding biomarker identification and pathway elucidation through two-dimensional (2D) heteronuclear STOCSY.11,17,18
An increasingly popular approach in the analysis of metabolic data sets is the production of metabolic maps or networks based on interpeak or intermetabolite correlation coefficients.19-22 Aside from the computational time required, interpreting the pairwise correlations between all variables in a metabonomic data set (tens of thousands of points in 1H NMR) is impractical, and though cherry-picking a small number of metabolites of interest may ease analysis, important interactions may be overlooked. A common approach is, therefore, to reduce the full variable set into a smaller subset where each member is, in theory, representative of a single metabolite. This approach has been utilized in two recently published methods, cluster analysis statistical spectroscopy (CLASSY)21 and recoupled-STOCSY (R-STOCSY),22 both of which aim to provide a snapshot of metabolic relationships following a perturbation of interest. In CLASSY, following dimension reduction by the selection of variables corresponding to peak apexes, correlation analysis (2D STOCSY) is performed, and a correlation map is presented after hierarchical clustering of the resultant correlation matrix. Similarly, a 2D STOCSY approach is applied in R-STOCSY, on this occasion following variable dimension reduction by the clustering of consecutive variables based on their correlation/covariance profile (statistical recoupling of variables, SRV);23 here, results are presented superimposed onto a subset of the global KEGG metabolic network. However, in both these approaches, considerable time is required both in the assignment of clusters to specific metabolites and the subsequent relation of these to the correlation data produced.
Here, we present Iterative-STOCSY (I-STOCSY), a novel data-driven approach for the elucidation of intra- and intermetabolite relationships in 1H NMR metabolic data sets. In this method, STOCSY is recursively performed, initially from a selected driver peak of interest and subsequently for all peaks identified as correlating to the previous driver peak with a correlation coefficient greater than a preset threshold. Additionally, putatively structurally related peaks are grouped and represented as single nodes in a fully interactive plot, which not only shows node-to-node I-STOCSY connections but also relates the nodes directly to the spectral data they represent, thus allowing the easy exploration of both inter- and intrametabolite relationships.
In comparison to other correlation based methods, this allows a more directed and logical approach to the identification of metabolites that are biomarkers of the studied perturbation and the generation of hypotheses for intermetabolite associations. Additionally, the flexibility afforded by the user-defined choice of both initial driver peak and correlation threshold allows specific endogenous and/or treatment related associations of interest to be followed with respect to both the effect under study and strength of association.
The I-STOCSY approach was tested on a 1H NMR data set of 24 h postdose aqueous liver extracts from rats treated with the model hepatotoxin galactosamine (galN). Although galN has been widely used as a model for hepatotoxicity for many years, the mechanism of its toxicity is still unclear, particularly with respect to interanimal variability in toxic response. However, it is known that following the initial conversion of galN to galactosamine-1-phosphate (galN-1-P), UDP-galN is formed from reaction of galN-1-P with UDP-glucose (UDP-glc). The subsequent epimerization of UDP-galN to UDP-glcN and formation of glcN-1-P links subsequent metabolism into the glucosamine pathway and thus results in the formation of glcNAc-1-P, glcNAc-6-P, and the UDP-N-acetylhexosamines UDP-galNAc and UDP-glcNAc. The critical point in galN toxicity is the formation of the UDP-hexosamines (UDP-galN and UDP-glcN); these cannot serve as uridylate donors in the uridylyltransferase reaction, resulting in depletion of the uridine nucleotide pool, inhibition of RNA and protein synthesis and ultimately liver pathology.24-26
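As background for the method described in the following section, the single-driver STOCSY calculation on which I-STOCSY builds can be sketched as below. This is a minimal illustration, assuming Pearson correlation across samples; the function names `pearson` and `stocsy` and the toy intensities are hypothetical and are not taken from the authors' MATLAB implementation.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length intensity vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    # A constant variable carries no correlation information.
    return sxy / math.sqrt(sxx * syy) if sxx and syy else 0.0

def stocsy(spectra, driver):
    """Correlate the driver variable with every variable across samples.

    spectra: list of samples, each a list of intensities (equal lengths).
    driver:  column index of the driver peak.
    """
    columns = list(zip(*spectra))  # one intensity vector per variable
    return [pearson(columns[driver], col) for col in columns]

# Toy data (hypothetical): 4 samples x 3 variables.
spectra = [
    [1.0, 2.0, 9.0],
    [2.0, 4.0, 7.0],
    [3.0, 6.0, 8.0],
    [4.0, 8.0, 3.0],
]
r = stocsy(spectra, driver=0)
```

Here variable 1 covaries perfectly with the driver, as two resonances of one molecule might, while variable 2 is negatively correlated, as might be expected for a metabolite that is depleted when the driver increases.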
EXPERIMENTAL METHODS
Data Collection and Preprocessing. 1H NMR exemplar spectroscopic data sets representing aqueous liver tissue extracts of male Sprague-Dawley rats treated with the model hepatotoxin galactosamine (galN) were analyzed. This study formed part of the second COnsortium on MEtabonomic Toxicology (COMET-2) project in mechanistic toxicology, and conventional biomarker analysis has been published in detail previously.11,26,27 In brief, following a 6 day acclimatization period, male Sprague-Dawley rats (n = 48) were administered a single intraperitoneal injection of vehicle (0.9% saline) with 0 (n = 8) or 415 mg/kg (n = 40) galN and were euthanized 24 h post dose. Aqueous liver tissue extracts were prepared by homogenizing 80 mg of liver tissue sample in 1.5 mL of cold acetonitrile/water (50:50), then spinning the homogenized samples at 9447g for 10 min, removing and lyophilizing the supernatant, and finally reconstituting in 600 μL of D2O/H2O (90:10) containing 3-trimethylsilyl[2,2,3,3-2H4]propionate sodium salt (TSP) (1.3 mM) and sodium azide (1.4 mM). 1H NMR spectra were acquired using a standard solvent suppressed one-dimensional pulse sequence (relaxation delay, 90° pulse, delay (4 μs), 90° pulse, mixing time, 90° pulse, acquisition) with an acquisition time of 2.7 s, and presaturation of the water signal was applied during the relaxation delay of 4 s and the mixing time of 100 ms. For each sample, 128 scans were collected into 64 K data points with a spectral width of 12 000 Hz, and an exponential line broadening function corresponding to 0.3 Hz was applied to all spectra prior to Fourier transformation. Data were imported into the MATLAB computing environment (R2009b, The MathWorks, Inc., MA), and the regions corresponding to water/HDO (δ 4.7 to 4.9) and TSP (δ -0.2 to 0.2) were removed from all spectra. Spectra were subsequently probabilistic quotient normalized,28 and peaks were detected at zero crossings of a smoothed spectral derivative calculated using a Savitzky-Golay third order polynomial filter29 of the mean spectrum.
Figure 1. Interpeak connectivity map for the basic I-STOCSY method driven from a UDP-galNAc peak (δ1H = 5.52) in a 1H NMR spectral data set of 24 h post galN-dose liver aqueous extracts (n = 48). Peaks are represented by uniquely colored nodes with adjacent ppm values showing peak location and positioned in columns corresponding to the round in which they were identified. Peaks with correlation coefficient r > threshold (θ) are connected by lines colored by the strength of connection (correlation coefficient r, on scale shown).
I-STOCSY Method.
In summary, I-STOCSY takes a set of spectra, a set of indices corresponding to peak apexes, a user-defined peak of interest, and a correlation cutoff value θ and repeatedly performs STOCSY, initially driven from the defined peak but for subsequent rounds driven from all peak apexes identified as correlating to the driver with a correlation coefficient
Figure 2. (A) I-STOCSY method and (B) results as illustrated on a simple simulated data set.
(r) with absolute value greater than θ. The resulting network shows all interpeak connectivities (with |r| > θ) originally stemming from the driver peak of interest but expanding to encompass connections from the peaks picked up in previous rounds as the number of rounds increases (Figure 1). Although this mode of display gives some idea of the complexity of metabolic associations captured in the spectroscopic data
sets, the interpretation of such networks is nontrivial. Modification of the basic method was, therefore, critical to improve visualization and interpretability of results. This was achieved through two important developments: first, the addition of a step whereby putative structurally related peaks are grouped and thus can be represented by just one node in the resulting network, and second, the development of a fully interactive tool for the
Figure 3. Interactive plot views of the first three rounds of I-STOCSY on a 24 h post galN dose liver aqueous extract 1H NMR data set. (A) Round 1, Original driver peak, UDP-galNAc (δ1H = 5.52). (C) Round 2, UDP-galNAc UDP-glcNAc peak selected. (E) Round 3, GlcNAc-1-P peak selected. In each case, all nodes connecting above the selected threshold are colored and indicated by arrows from the driver peak, and Δ indicates that the node represents a subset of peaks for that given metabolite. (B, D, and F) Corresponding spectral data for each of (A, C, and E). Colored markers show the resonances corresponding to each node (colored as for A, C, or E), and peaks are colored by the strength of the connection (correlation coefficient r, on scale shown). Arrows indicate selected driver peak.
visualization of I-STOCSY node-to-node connections alongside the spectroscopic data each connection represents. The complete I-STOCSY method along with a simple worked example is shown in Figure 2A. Here, I-STOCSY was applied to a set of simulated spectra generated using MetAssimulo (n = 30 1H NMR urine spectra, signal-to-noise ratio =10 000, peak shifting off) with prespecified intermetabolite concentration correlations to best exemplify the method.30 In the first step of the first round of I-STOCSY, the STOCSY calculation is driven from the user defined driver peak of interest
and the returned set of correlation coefficients is used to determine those peaks (set D) that correlate to the driver with |r| greater than the preset correlation threshold θ (Step 2). Following this, at Step 3, the subset of D not picked up in previous rounds (D′) is selected and condensed into putative structurally related subsets (in Round 1, the set D′ is identical to the set D since this is the first occasion at which correlating peaks are detected). It has been established that a STOCSY correlation threshold of r > 0.95 is necessary to assign peaks to the same metabolite with high probability (mean threshold value for a positive predictive value of 0.9 across a variety of data sets).31,32 Therefore, at this stage, the interpeak correlation coefficients between all members of D′ are calculated and any peaks found to share a correlation greater than 0.95 are assigned to the same structural set. Subsequently, each newly defined putative structural set is added as a new row to a matrix in which all such sets are defined. In the next step (Step 4), this matrix of structural sets is searched for the row locations of all peaks found originally to correlate with the driver above the given threshold (all peaks in set D). Thus, the number of driver-to-peak connections is reduced, since connections to multiple peaks in the same structural set are represented by reference to one index value (the row in which the peaks are found) instead of one value per peak. At this stage, the corresponding correlation between driver and each identified set is also determined, and where a structural set contains multiple peaks, the correlation is calculated simply as the mean correlation between the driver peak and all set members. Finally, in Step 5, any newly identified peaks (i.e., the set D′) are added to the list of peaks to be carried forward and become driver peaks in the next round (Dround). Again, in Round 1, the set Dround is equivalent to the set D′ since there is only one driver peak (the initial user-defined peak) and, therefore, only one iteration of Steps 1 to 5. However, in subsequent rounds, if more than one new peak is picked up in the previous round then the steps must be repeated for every newly identified peak (each member of Dround); thus, the set D′ will be different for each driver peak, and as such (as long as new peaks are picked up), membership of Dround for the next round will increase with every iteration of the 5 steps. I-STOCSY continues until either no new peaks are identified or the maximum number of rounds (preset by the user) is exceeded.
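The five steps described above can be sketched in code as follows. This is an illustrative reading of the text rather than the authors' MATLAB implementation: the names `i_stocsy`, `group_structural`, `sets`, and `edges` are hypothetical, peaks are linked transitively when grouping (an assumption), the initial driver is stored as its own row (an assumption), and the toy spectra are invented so that the outcome is easy to follow.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length intensity vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx and syy else 0.0

def group_structural(peaks, cols, cutoff=0.95):
    """Step 3: merge peaks linked by r > cutoff (transitive linking is an assumption)."""
    groups = []
    for p in peaks:
        linked = [g for g in groups
                  if any(pearson(cols[p], cols[q]) > cutoff for q in g)]
        merged = [p]
        for g in linked:
            merged.extend(g)
            groups.remove(g)
        groups.append(sorted(merged))
    return groups

def i_stocsy(spectra, peaks, driver, theta, max_rounds=10, cutoff=0.95):
    cols = list(zip(*spectra))   # one intensity vector per variable
    sets = [[driver]]            # row index = node id (driver as row 0: assumption)
    row_of = {driver: 0}         # peak index -> row of its structural set
    edges = {}                   # (driver peak, connected row) -> mean r
    drivers, rounds = [driver], 0
    while drivers and rounds < max_rounds:
        rounds += 1
        next_drivers = []
        for d in drivers:
            # Steps 1-2: STOCSY from d, keep peaks with |r| > theta (set D).
            r = {p: pearson(cols[d], cols[p]) for p in peaks if p != d}
            D = [p for p in r if abs(r[p]) > theta]
            # Step 3: condense the newly seen peaks (set D') into structural sets.
            D_new = [p for p in D if p not in row_of]
            for g in group_structural(D_new, cols, cutoff):
                sets.append(g)
                for p in g:
                    row_of[p] = len(sets) - 1
            # Step 4: one connection per structural set, weighted by the mean r
            # between the driver and all members of that set.
            for row in {row_of[p] for p in D} - {row_of[d]}:
                members = sets[row]
                edges[(d, row)] = sum(r[q] for q in members) / len(members)
            # Step 5: newly identified peaks become drivers in the next round.
            next_drivers.extend(D_new)
        drivers = next_drivers
    return sets, edges

# Toy data (hypothetical): 6 samples x 6 variables. Variables 1 and 2 are
# scaled copies of the driver (variable 0), variable 3 correlates with the
# driver at about 0.90, variable 4 correlates with variable 3 at about 0.87
# but with the driver at only about 0.57, and variable 5 is unrelated.
spectra = [
    [1, 2, 4, 2, 6, 4],
    [2, 4, 7, 1, 1, 4],
    [3, 6, 10, 3, 5, 1],
    [4, 8, 13, 4, 6, 1],
    [5, 10, 16, 4, 4, 4],
    [6, 12, 19, 7, 11, 4],
]
sets, edges = i_stocsy(spectra, peaks=list(range(6)), driver=0, theta=0.85)
# sets -> [[0], [1, 2], [3], [4]]
```

In this toy run, variable 4 is reached only in Round 2, through variable 3, which illustrates how connections that a single STOCSY from the original driver would miss are recovered iteratively. Variable 5 never exceeds the threshold and is therefore absent from the result.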
The condensation of peaks into putative structural sets (Step 3) is key in reducing the complexity of interpretation, and although, as a result, multiple peaks are represented by one node, no information is lost. This is because, in the subsequent round, all of the newly identified peaks, not just one per structural set, are run as drivers, and thus, all the resulting correlations with |r| > θ are saved as connecting to that given structural set. If the chosen threshold causes some intramolecular correlations to be treated as intermolecular, the desired reduction in complexity is compromised only marginally, and the true connectivities are easily deduced in the subsequent interpretation stage.
I-STOCSY Interactive Plot. Representing the I-STOCSY results as a diagram of interconnected nodes (Figure 1) is useful in that it highlights both connections and the order in which peaks are picked up. However, the complexity of such a diagram and the necessity to link each node to its real spectral data make interpretation both time-consuming and tedious. Therefore, an interactive plot was developed in which the nodes are plotted alongside complementary spectral data (Figure 2B). Selecting a node or spectral peak of interest highlights all connecting nodes in a map of connectivities and concurrently their corresponding peaks in the spectral data. Node-peak relationships are shown by node color-coordinated markers underneath the spectral data, and the strength of connection (correlation) is given by the color of the peak above.
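The lookup behind such a selection might be sketched as follows. This is only a guess at a suitable data structure, not the authors' plotting code: it assumes that structural sets are stored as rows of peak indices and that connections are stored per driver peak, and the names `node_of_peak`, `connections`, `sets`, and `edges`, together with all values, are hypothetical.

```python
def node_of_peak(peak, sets):
    """Return the row (node id) of the structural set containing `peak`."""
    for row, members in enumerate(sets):
        if peak in members:
            return row
    return None

def connections(node, sets, edges):
    """Collect, per connected node, the r values from every driver peak in `node`."""
    found = {}
    for (driver_peak, target), r in edges.items():
        if driver_peak in sets[node]:
            found.setdefault(target, []).append(r)
    return found

# Illustrative values only (not taken from the galN data set).
sets = [[0], [1, 2], [3]]
edges = {(0, 1): 0.99, (0, 2): 0.90, (1, 2): 0.91, (2, 2): 0.89, (3, 1): -0.90}

by_node = connections(1, sets, edges)                      # select node 1 directly
by_peak = connections(node_of_peak(3, sets), sets, edges)  # select spectral peak 3
```

Selecting a node or one of its peaks resolves to the same row, so both entry points return the same set of connected nodes. The stored r values could then be mapped to a color scale, and their sign indicates the direction of correlation.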
RESULTS
The I-STOCSY method was applied to the spectral data set representing aqueous liver extracts 24 h post galactosamine administration, driven from the UDP-galNAc anomeric proton resonance (δ1H = 5.52) with a correlation threshold θ of 0.85. UDP-galNAc was selected as a major galN-related metabolite in postdose samples and the threshold of 0.85 as best to exemplify the method. As the value of the threshold decreases, more peaks are picked up as correlating to each driver peak; thus, this relatively high threshold keeps the number of connections to a minimum. Previous investigation of this data set has shown increased hepatic levels of the galN related metabolites UDP-galN/UDP-glcN, glcNAc-6-P, UDP-galNAc, and UDP-glcNAc in galN treated rats;26 all of which are detected here in the first three rounds of I-STOCSY (Figure 3), with Round 1 nodes corresponding to the remaining resonances of UDP-galNAc and UDP-glcNAc. Owing to the stringent threshold for defining structural sets (r > 0.95) and the inherent signal overlap and peak shifting present in NMR data, the majority of nodes represent only a subset of peaks for any given metabolite. Therefore, most metabolites are represented by a number of nodes, each containing peaks with similar overlap or shifting profiles. However, in Round 1, each of the three nodes corresponds to resonances of both UDP-galNAc and UDP-glcNAc (Figure 3A,B). Not only is this because the majority of resonances for these metabolites overlap, but also, owing to the presence of an epimerase (UDP-acetylglucosamine 4′-epimerase),33 these compounds are in fast metabolic equilibrium, and therefore, even the fully resolved resonances corresponding to the anomeric (UDP-galNAc δ1H ≈ 5.55, UDP-glcNAc δ1H ≈ 5.52) and N-acetyl (UDP-galNAc δ1H = 2.09(s), UDP-glcNAc δ1H = 2.085(s)) peaks correlate extremely highly. In Round 2, nodes corresponding to uridine and the previously unassigned glcNAc-1-P (assigned by comparison to a standard spectrum of authentic material) are detected (Figure 3C).
From analysis of the corresponding spectral data, however (Figure 3D), it can be seen that the uridine peaks picked are those that overlap with UDP-gal/glcNAc (δ1H = 4.365 and 4.24) and, therefore, are not meaningful connections at this point. In the final round (Round 3, Figure 3E), nodes connected to glcNAc-1-P are shown, representing the metabolites glcNAc-6-P, UDP-galN/UDP-glcN, glucose, and glycogen, all of which corroborate previous findings.11 As well as indicating which peaks are part of, and connected to, any given set, the spectral plot view of I-STOCSY also shows the strength and direction of correlation of each connected node to the node from which it was picked up (Figure 3B,D,F). With the exception of glucose and glycogen, the metabolites all correlate positively with their drivers and thus with the original driver peak UDP-glcNAc. These metabolites, therefore, all increase in abundance in 24 h postdose liver samples, fitting in with findings from established methods and the metabolism as discussed above. Glucose and glycogen, on the other hand, are negatively correlated to their driver (glcNAc-1-P). The depletion of hepatic glycogen and glucose has been previously shown19 and explained as a direct result of galN-induced UDP-glc depletion, one consequence of which is a concurrent inability to maintain hepatic glucose levels.26,34 Parts A and B, C and D, and E and F of Figure 3 represent three views of the I-STOCSY iterative plot, with a peak of interest selected from Round 1, 2, and 3, respectively. In each case, selecting a node in the node-to-node plot (parts A, C, and E) or a peak in the spectral view plot (parts B, D, and F) produces the same result, highlighting all connecting nodes in the node-to-node view and their corresponding peaks colored by the strength of connection in the spectral data. Since all putative structurally related peaks (represented by one node) are uniformly colored,
Figure 4. Full network of nodes for I-STOCSY on GalN driven from UDP-galNAc (over total eight rounds). Nodes of each known metabolite are highlighted in the same color, and the metabolite names are outlined based on their relative correlation to the original UDP-galNAc driver peak; red, positively correlated; blue, negatively correlated.
this allows the intuitive exploration of intra- and intermetabolite correlations and easy selection of peaks of interest from the otherwise complex spectral data. The full node connectivity map for I-STOCSY driven from UDP-galNAc is shown in Figure 4. In this figure, metabolites have been highlighted along with the direction of their change relative to that of the original driver UDP-galNAc. It is worth noting that, although initially picked up as a result of overlap with UDP-galNAc and UDP-glcNAc resonances, nonoverlapped uridine peaks also positively correlate to GlcNAc-6-P, and thus, uridine remains a valid I-STOCSY derived metabolite. All the additional metabolites (in Rounds 4 to 8) are negatively correlated to UDP-galNAc, i.e., their levels decrease following galN administration. Although uridine would be expected to decrease postdose, its increase has been previously attributed to a compensatory increase in synthesis in response to GalN-induced depletion.4,34 The decrease in the remaining metabolites can be largely explained as resulting from the galN induced depletion of hepatic glucose and consequently of ATP generation. With the exception of glutathione and inosine, the majority of metabolites and their relative change in abundance on galN treatment have been previously described following the metabonomic investigation of galN treatment.11,26 In terms of glutathione and inosine, inosine is closely metabolically related to the previously reported metabolite adenosine, and although not previously found from metabonomic investigation of galN treatment, a reduction in glutathione levels following galN treatment has been experimentally well established both in vivo and in vitro in hepatocytes.35-37 Thus, in a single automated run, the I-STOCSY approach can not only reproduce previously deduced information on the complex response following metabolic perturbation but also generate new hypotheses even on this well-studied model system.
In addition, the above demonstration of the I-STOCSY method could have been driven from any of the endogenous metabolite peaks in the 1H NMR spectral data set to further expose metabolic connectivities within the data.
DISCUSSION AND CONCLUSIONS
The I-STOCSY method represents a semiautomatic, data-driven approach for the investigation of inter- and intrametabolite
relationships. This method has been shown to recover both previously known and novel metabolic consequences of systemic perturbation quickly, effectively, and with minimum user input. The interactive plot allows the user to quickly visualize peak associations either by following a series of connections node to node or by selecting peaks of interest directly from the plotted spectral data. Since each node represents a highly correlated set of peaks, not only does this method present intermetabolite connectivities clearly but also it gives an idea of putatively structurally related intrametabolite peaks all linked directly to the spectral data. Thus, I-STOCSY represents a comprehensive tool for correlation based analysis of 1H NMR data sets, providing improved visualization and interpretation of the key metabolic changes and thus the underlying biology of metabolically challenged organisms or individuals. Additionally, the user specification of initial driver peak and correlation threshold allows the information presented to be tailored to metabolic associations of interest with depth of information given by the set correlation cutoff. STOCSY is a well-established method in identification of both structurally and nonstructurally correlated metabolites related to a peak of interest. Here, STOCSY analysis is extended and interpretation is enhanced since essentially selecting a given peak in the interactive I-STOCSY plot is equivalent to running a STOCSY analysis from that peak (and viewing all peak correlations with |r| > θ). Thus, a single run of I-STOCSY (depending on the threshold selected) can represent the majority of meaningful information contained in multiple STOCSY calculations from all peaks recursively correlated to the original user defined driver peak of interest. 
I-STOCSY requires just two user-defined parameters: first, the initial driver peak from which the first round of STOCSY should be run, and second, a threshold value corresponding to the minimum correlation coefficient required to define peak to peak connections (θ). The initial driver peak here was selected as a major drug metabolite, but in principle, any peak could be used. While different initial driver peaks will select different connecting peaks dependent on their interpeak correlations, it is worth mentioning that, once a peak is selected, for a given I-STOCSY threshold its set of peak connections will be identical regardless of the round in which it appears. Therefore, in terms of driver peak selection, for maximum information recovery, multiple runs of I-STOCSY should be driven from peaks undetected in previous applications of I-STOCSY.
The I-STOCSY threshold defines the absolute value of correlation coefficient that must be exceeded between two peaks before they are defined as connected, and as such, threshold choice has a significant impact on I-STOCSY results. At high thresholds, only a few node-to-node connections will be detected between highly correlated peaks; indeed, for thresholds approaching |r| > 0.95, in the majority of cases, only structural correlations will be detected. As the threshold value is reduced, however, more and more connections will be detected from each driver peak, alluding to inter- as well as intrametabolite relationships. In this example, a correlation value of |r| > 0.85 was selected, since the resulting plot could be shown in its entirety and explained comprehensively. At this threshold, not only were a significant number of metabolites previously identified as affected by galN administration detected but also novel connections were found to two metabolites previously undetected in metabonomic studies. These novel metabolites were found to be in the same metabolic pathway as previously reported metabolites (inosine) or well recognized as being affected by galN treatment via other experimental means (glutathione).35-37 Thus, this illustrates the capacity of this method, in a single run, to both reproduce previous results and generate new and sensible hypotheses. In a practical sense, repeating I-STOCSY with a reduced threshold would yield yet more connectivities and thus certainly be useful in further analysis.
Within I-STOCSY, the condensation of detected peaks into sets where each represents a highly correlated putative structurally related group is key in simplifying visualization and thus improving interpretability.
In principle, the ideal situation would be for all peaks corresponding to a given metabolite to correlate above a certain threshold and, thus, for each metabolite to be represented by a single node. In practice, however, the peak overlap and shifting inherent in NMR data mean no one threshold is applicable for grouping together structural peaks across all metabolites. The choice of r > 0.95 as a threshold for placing two peaks into the same “structural” set is, therefore, a highly conservative estimate derived from an extensive study into the behavior of STOCSY in assigning structural vs nonstructural correlations.31 Although truly structurally related peaks may well correlate with lower coefficients, in the definition of structural sets here, selection of such a high threshold is desirable. If the threshold were reduced, although more metabolites would be represented by fewer nodes, since connections to any peak within each structural set lead to the whole set being highlighted in the resulting plot, results could be misleading. On the other hand, the grouping of very highly correlated peaks provides a distinct advantage in terms of visualization and, thus, interpretation compared to if a node represented each individual peak. Even if grouped peaks are in fact nonstructural (i.e., from two different metabolites), their correlation profiles to other peaks are inherently extremely similar, and thus, no information is lost. In fact the grouping of multiple metabolites into a single structural set can be informative, for example, indicating a strong metabolic relationship, such as that between UDP-galNAc and UDP-glcNAc. The majority of metabolites though are represented by multiple structural sets and, therefore, multiple nodes, yet all the nodes corresponding to a given metabolite are picked up as connected over several I-STOCSY rounds. 
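The effect of the structural-set cutoff discussed above can be illustrated with a small sketch. It groups peaks from a given correlation matrix by single-linkage union-find, which is one plausible reading of "assigned to the same structural set" rather than a confirmed detail of the method; the function name `structural_sets` and the correlation values are hypothetical.

```python
def structural_sets(r, cutoff):
    """Group peaks i, j into one set if r[i][j] > cutoff, directly or via a chain."""
    n = len(r)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the set representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if r[i][j] > cutoff:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical interpeak correlations for four peaks.
r = [
    [1.00, 0.97, 0.90, 0.30],
    [0.97, 1.00, 0.88, 0.20],
    [0.90, 0.88, 1.00, 0.40],
    [0.30, 0.20, 0.40, 1.00],
]
strict = structural_sets(r, 0.95)  # conservative cutoff used for structural sets
loose = structural_sets(r, 0.85)   # a reduced cutoff, for comparison
```

With the conservative cutoff, peak 2 keeps its own node; with the reduced cutoff, it is absorbed into the set of peaks 0 and 1. Any connection to one member would then highlight all three, which is the potentially misleading behavior described in the text.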
It is worth mentioning that, although alignment was not required for this data set, if the data set under study contains a high degree of peak shifting, prior peak alignment by appropriate algorithms, for example, recursive segment-wise peak alignment (RSPA),38 may well be worthwhile.
Other methods also utilize a correlation based approach to group peak indices into highly correlated sets. For example, in CLASSY, this is an overlap matrix evaluated on the intervariable correlations thresholded at increasing correlation coefficients to identify unambiguous local clusters,21 and in R-STOCSY, clusters derived from a correlation/covariance landscape (termed SRV) are then grouped into superclusters by virtue of their intercluster correlation.22 In subsequent steps, however, only the most highly correlating member of each local cluster (CLASSY) or the mean signal intensity of variables in each supercluster (R-STOCSY) is chosen to represent the entire set in the correlation based cluster analysis, and thus, novel interpeak correlation information could be lost at this stage. In contrast, in I-STOCSY, even after the formation of structural sets, since each newly detected peak in each set becomes a driver peak in the next round, there is no loss of information even if different peaks are picked up as connecting above the I-STOCSY threshold for different members of the same structural set.
Finally, development of the interactive I-STOCSY plot represents a significant advantage in visualization and interpretation over other correlation based methods, since intra- and intermetabolite relationships can be explored simultaneously, both through the node-to-node network map and the spectral data itself. Although other correlation based approaches (e.g., CLASSY and R-STOCSY) provide 2D intercluster correlation maps and, thus, a global overview of metabolic associations, before the correlations can be interpreted, the assignment of clusters to specific metabolites is still required, which can be a lengthy process whether carried out by hand or through database matching algorithms.
This prior identification of putatively interesting metabolites (alongside extraction of the corresponding experimental data) is also a prerequisite in other metabolic pathway elucidation tools (for example, the Cytoscape plug-in MetScape39). In this respect, I-STOCSY is completely distinct, mitigating the need for prior identification of metabolites and providing a data-driven approach to the elucidation of metabolic associations through the linking of node-to-node connectivities with corresponding spectral data, both of which can be selected to explore correlation connectivities. This is valuable not only as an intuitive exploration tool in the recovery of metabolic relationships but also because it allows such relationships to be easily verified by the user. For example, in this data set, uridine is first identified as connecting (with a subset of its peaks) directly to UDP-galNAc and UDP-glcNAc; however, inspection of the spectral data reveals this to be a result of peak overlap and, therefore, not a meaningful connection at this point. Selecting a nonoverlapped uridine peak in the spectral view plot, however, reveals all connections at this threshold, including a connection to glcNAc-6-P at Round 3. Although here applied to an NMR spectroscopic data set containing pre- and postdose samples from a toxicological study, the I-STOCSY approach is of potential value in any metabonomic data set. When applied to full data sets following a metabolic challenge (as in this two-class model of pre- and postdose samples), the major effects of treatment will be elucidated; however, the approach is putatively equally valuable in revealing connections within and between metabolites in single treatment groups or in multiple class models. In summary, I-STOCSY represents a significant advance in terms of maximum information recovery from complex NMR spectral data sets, providing a comprehensive tool for exploring inter- and intrametabolite relationships and elucidating novel mechanistic insight into metabolic perturbations.
dx.doi.org/10.1021/ac102870u |Anal. Chem. 2011, 83, 2075–2082
■ AUTHOR INFORMATION
Corresponding Author
*E-mail: [email protected] (J.K.N.); m.coen@imperial.ac.uk (M.C.).
■ ACKNOWLEDGMENT
The authors thank the COMET-2 consortium sponsored by Bristol-Myers Squibb, Sanofi-Aventis, Servier, and Pfizer, for provision of the NMR data set. Astra-Zeneca is acknowledged for financial support to C.J.S. and the MRC Integrative Toxicology Training Partnership (ITTP) for financial support to M.C.
■ REFERENCES
(1) Holmes, E.; Antti, H. Analyst 2002, 127, 1549–1557.
(2) Rousseau, R.; Govaerts, B.; Verleysen, M.; Boulanger, B. Chemom. Intell. Lab. 2008, 91, 13.
(3) Nicholson, J. K.; Connelly, J.; Lindon, J. C.; Holmes, E. Nat. Rev. Drug Discovery 2002, 1, 153–161.
(4) Coen, M.; Holmes, E.; Lindon, J. C.; Nicholson, J. K. Chem. Res. Toxicol. 2008, 21, 9–27.
(5) Zhang, F.; Bruschweiler, R. Angew. Chem., Int. Ed. Engl. 2007, 46, 2639–2642.
(6) Snyder, D. A.; Zhang, F.; Bruschweiler, R. J. Biomol. NMR 2007, 39, 165–175.
(7) Bruschweiler, R.; Zhang, F. J. Chem. Phys. 2004, 120, 5253–5260.
(8) Eads, C. D.; Noda, I. J. Am. Chem. Soc. 2002, 124, 1111–1118.
(9) Cloarec, O.; Dumas, M. E.; Craig, A.; Barton, R. H.; Trygg, J.; Hudson, J.; Blancher, C.; Gauguier, D.; Lindon, J. C.; Holmes, E.; Nicholson, J. Anal. Chem. 2005, 77, 1282–1289.
(10) Holmes, E.; Cloarec, O.; Nicholson, J. K. J. Proteome Res. 2006, 5, 1313–1320.
(11) Coen, M.; Hong, Y. S.; Cloarec, O.; Rhode, C. M.; Reily, M. D.; Robertson, D. G.; Holmes, E.; Lindon, J. C.; Nicholson, J. K. Anal. Chem. 2007, 79, 8956–8966.
(12) Maher, A. D.; Cloarec, O.; Patki, P.; Craggs, M.; Holmes, E.; Lindon, J. C.; Nicholson, J. K. Anal. Chem. 2009, 81, 288–295.
(13) Holmes, E.; Loo, R. L.; Cloarec, O.; Coen, M.; Tang, H.; Maibaum, E.; Bruce, S.; Chan, Q.; Elliott, P.; Stamler, J.; Wilson, I. D.; Lindon, J. C.; Nicholson, J. K. Anal. Chem. 2007, 79, 2629–2640.
(14) Cloarec, O.; Campbell, A.; Tseng, L. H.; Braumann, U.; Spraul, M.; Scarfe, G.; Weaver, R.; Nicholson, J. K. Anal. Chem. 2007, 79, 3304–3311.
(15) Fonville, J. M.; Maher, A. D.; Coen, M.; Holmes, E.; Lindon, J. C.; Nicholson, J. K. Anal. Chem. 2010, 82, 1811–1821.
(16) Smith, L. M.; Maher, A. D.; Cloarec, O.; Rantalainen, M.; Tang, H.; Elliott, P.; Stamler, J.; Lindon, J. C.; Holmes, E.; Nicholson, J. K. Anal. Chem. 2007, 79, 5682–5689.
(17) Keun, H. C.; Athersuch, T. J.; Beckonert, O.; Wang, Y.; Saric, J.; Shockcor, J. P.; Lindon, J. C.; Wilson, I. D.; Holmes, E.; Nicholson, J. K. Anal. Chem. 2008, 80, 1073–1079.
(18) Wang, Y.; Cloarec, O.; Tang, H.; Lindon, J. C.; Holmes, E.; Kochhar, S.; Nicholson, J. K. Anal. Chem. 2008, 80, 1058–1066.
(19) Weckwerth, W.; Loureiro, M. E.; Wenzel, K.; Fiehn, O. Proc. Natl. Acad. Sci. U.S.A. 2004, 101, 7809–7814.
(20) Zhang, S.; Nagana Gowda, G. A.; Asiago, V.; Shanaiah, N.; Barbas, C.; Raftery, D. Anal. Biochem. 2008, 383, 76–84.
(21) Robinette, S. L.; Veselkov, K. A.; Bohus, E.; Coen, M.; Keun, H. C.; Ebbels, T. M.; Beckonert, O.; Holmes, E. C.; Lindon, J. C.; Nicholson, J. K. Anal. Chem. 2009, 81, 6581–6589.
(22) Blaise, B. J.; Navratil, V.; Domange, C.; Shintu, L.; Dumas, M.; Elena, B.; Emsley, L.; Toulhoat, P. J. Proteome Res. 2010, 9, 4513–4520.
(23) Blaise, B. J.; Shintu, L.; Elena, B.; Emsley, L.; Dumas, M. E.; Toulhoat, P. Anal. Chem. 2009, 81, 6242–6351.
(24) Decker, K.; Keppler, D. Prog. Liver Dis. 1972, 4, 183–199.
(25) Coen, M. Toxicology 2010, 278, 326–340.
(26) Coen, M.; Want, E. J.; Clayton, T. A.; Rhode, C. M.; Hong, Y. S.; Keun, H. C.; Cantor, G. H.; Metz, A. L.; Robertson, D. G.; Reily, M. D.; Holmes, E.; Lindon, J. C.; Nicholson, J. K. J. Proteome Res. 2009, 8, 5175–5187.
(27) Coen, M.; Hong, Y. S.; Clayton, T. A.; Rohde, C. M.; Pearce, J. T.; Reily, M. D.; Robertson, D. G.; Holmes, E.; Lindon, J. C.; Nicholson, J. K. J. Proteome Res. 2007, 6, 2711–2719.
(28) Dieterle, F.; Ross, A.; Schlotterbeck, G.; Senn, H. Anal. Chem. 2006, 78, 4281–4290.
(29) Savitzky, A.; Golay, M. J. E. Anal. Chem. 1964, 36, 13.
(30) Muncey, H. J.; Jones, R.; De Iorio, M.; Ebbels, T. M. D. BMC Bioinf. 2010, 11, 496.
(31) Alves, A. C.; Rantalainen, M.; Holmes, E.; Nicholson, J. K.; Ebbels, T. M. Anal. Chem. 2009, 81, 2075–2084.
(32) Sands, C. J.; Coen, M.; Maher, A. D.; Ebbels, T. M.; Holmes, E.; Lindon, J. C.; Nicholson, J. K. Anal. Chem. 2009, 81, 6458–6466.
(33) Jacobson, B.; Davidson, E. A. Biochim. Biophys. Acta 1963, 73, 145–151.
(34) Keppler, D. O.; Rudigier, J. F.; Bischoff, E.; Decker, K. F. Eur. J. Biochem. 1970, 17, 246–253.
(35) McMillan, D. C.; Jollow, D. J. Toxicol. Appl. 1996, 115, 7.
(36) McMillan, J. M.; McMillan, D. C. Toxicology 2006, 222, 175–184.
(37) Seckin, P.; Alptekin, N.; Kocak-Toker, N.; Uysal, M.; Aykac-Toker, G. Res. Commun. Mol. Pathol. Pharmacol. 1995, 87, 237–240.
(38) Veselkov, K. A.; Lindon, J. C.; Ebbels, T. M.; Crockford, D.; Volynkin, V. V.; Holmes, E.; Davies, D. B.; Nicholson, J. K. Anal. Chem. 2009, 81, 56–66.
(39) Gao, J.; Tarcea, V. G.; Karnovsky, A.; Mirel, B. R.; Weymouth, T. E.; Beecher, C. W.; Cavalcoli, J. D.; Athey, B. D.; Omenn, G. S.; Burant, C. F.; Jagadish, H. V. Bioinformatics 2010, 26, 971–973.