Two-Dimensional Statistical Recoupling for the ... - ACS Publications

Jun 30, 2010 - Université de Lyon, Centre de RMN a` Tre`s Hauts Champs, CNRS/ENS Lyon/UCB Lyon 1, 5 rue de la Doua,. 69100 Villeurbanne, France...
9 downloads 0 Views 2MB Size
Two-Dimensional Statistical Recoupling for the Identification of Perturbed Metabolic Networks from NMR Spectroscopy Benjamin J. Blaise, Vincent Navratil, Ce´line Domange, Laetitia Shintu,† Marc-Emmanuel Dumas,‡ Be´ne´dicte Elena-Herrmann, Lyndon Emsley, and Pierre Toulhoat* Universite´ de Lyon, Centre de RMN a` Tre`s Hauts Champs, CNRS/ENS Lyon/UCB Lyon 1, 5 rue de la Doua, 69100 Villeurbanne, France Received March 22, 2010

The development of Statistical Total Correlation Spectroscopy (STOCSY), a representation of the autocorrelation matrix of a spectral data set as a 2D pseudospectrum, has allowed more reliable assignment of one- and two-dimensional NMR spectra acquired from the complex mixtures that are usually used in metabolomics/metabonomics studies, thus, improving precise identification of candidate biomarkers contained in metabolic signatures computed by multivariate statistical analysis. However, the correlations obtained cannot always be interpreted in terms of connectivities between metabolites. In this study, we combine statistical recoupling of variables (SRV) and STOCSY to identify perturbed metabolite systems. The resulting Recoupled-STOCSY (R-STOCSY) method provides a 2D correlation landscape based on the SRV clusters representing physical, chemical, and biological entities. This enables the identification of correlations between distant clusters and extends the recoupling scheme of SRV, which was previously limited to the association of neighboring clusters. This allows the recovery of only meaningful correlations between metabolic signals and significantly enhances the interpretation of STOCSY. The method is validated through the measurement of the distances between the metabolites involved in these correlations, within the whole metabolic network, which shows that the average shortest path length is significantly shorter for the correlations detected in this new way compared to metabolite couples randomly selected from within the entire KEGG metabolic network. This enables the identification without any a priori knowledge of the perturbed metabolic network. The R-STOCSY completes the recoupling procedure between distant clusters, further reducing the high dimensionality of metabolomics/metabonomics data set and finally allows the identification of composite biomarkers, highlighting disruption of particular metabolic pathways within a global metabolic network. This allows the perturbed metabolic network to be extracted through NMR based metabolomics/metabonomics in an automated, and statistical manner. Keywords: NMR • metabolic profiling • network • STOCSY

Introduction The use of nuclear magnetic resonance (NMR) in metabolomics/metabonomics studies, where small variations between large numbers of spectra are detected using advanced statistical methods and attributed to changes in response to stimuli (such as disease), has grown enormously in the past few years, with landmark applications to biofluids (plasma, serum, urine),1,2 tissue samples (biopsies, extractions from tissues),3,4 or small model organisms such as Caenorhabditis elegans.5,6 Data sets routinely contain hundreds of spectra, each with thousands of data points. To understand the metabolic varia* To whom correspondence should be addressed. E-mail: pierre.toulhoat@ univ-lyon1.fr. † Current address, Equipe CES (UMR 6263 ISM2), Universite´ Paul Ce´zanne, Faculte´ de St Je´roˆme, Av. Escadrille Normandie Niemen, 13397 Marseille Cedex 20, France. ‡ Current address, Department of Biomolecular Medicine, Faculty of Medicine, Imperial College London, South Kensington, London, SW7 2AZ, U.K. 10.1021/pr1002615

 2010 American Chemical Society

tions induced by various patho-physiological conditions, accurate assignments of metabolites within pathways are of primary importance to identify candidate biomarkers. Detection of metabolic connectivities so as to recover candidate biomarkers represents the subsequent crucial step in order to establish links to specific physiological pathways and recover the perturbed metabolic network.7 These metabolites and pathways then become the focus of research aimed at a diagnosis, prognosis, or treatment. To aid in the task of metabolic assignment, following the concept of 2D sample-sample correlation spectroscopy introduced by Sasic et al.8 for infrared spectroscopy and Zhang et al. in NMR,9 Cloarec et al. developed the representation of the autocorrelation matrix of a NMR data set in a pseudo 2D spectrum coined statistical total correlation spectroscopy (STOCSY).10 STOCSY encapsulates the hallmark information contained in a classical TOCSY experiment (spin correlations) but also metabolic correlations (metabolic system) and thus Journal of Proteome Research 2010, 9, 4513–4520 4513 Published on Web 06/30/2010

research articles identifies patho-physiological connectivities in the metabolic network by highlighting the simultaneous variations of metabolite concentrations.11–20 However, the biological interpretation of a STOCSY analysis is complex and not always straightforward. Because of the amount of information contained in the correlation matrix, one notably has to select a certain threshold of correlation for the representation of the pseudo-2D spectrum and different approaches have been developed to solve this problem. These comprehensive approaches have notably been developed in a toxicological context.21,22 We have recently proposed a new approach to identify metabolic NMR signals in a data set based on statistical relationships of covariance and correlation between consecutive variables inherited from high-resolution bucketing.23 The automated variable-sized bucketing, which results from this procedure, was shown to be an efficient method for the recovery of meaningful spectral segments associated with candidate biomarkers without significant loss of information. In this study, we combine SRV and STOCSY (to form recoupled-STOCSY or R-STOCSY) to reduce the dimensionality inherited from the bucketing and identify correlations between distant clusters. The result is a STOCSY spectrum presenting correlations between NMR metabolic signals with an increased correlation level between clusters, and which recovers structural metabolic signals from the statistical electronic noise. To validate the biological interpretation of these statistically significant correlations, we measure the average shortest path lengths between the metabolites identified as involved in these correlations, and compare them to random metabolic couples on the whole KEGG metabolic network. We find that shorter distances are obtained for the R-STOCSY correlations as compared to random metabolic pairs. Compared to other similar approaches, for instance developed by Weckwerth et al.24 or Zhang et al.,25 R-STOCSY does not perform a differential analysis of network connectivities (intragroup variance), but considers the entire spectral information for the representation of the autocorrelation matrix and study the intergroup variance to extract the perturbed metabolic network. It thus becomes possible to identify the perturbed metabolic network from the whole metabolic network, without any a priori knowledge, by extracting the high-density spectral information encapsulated in the data set in an automated manner. This suggests that correlations within pathways might be higher than correlations between metabolites belonging to different pathways. The study uses data from samples of C. elegans discriminating wild-type nematodes from sod-1 oxidative stress mutants at two ages. It particularly focuses on the age effect in sod-1 mutants and highlights perturbations of the energetic and cellular respiration metabolisms.5

Methods Data Set. This study uses a data set previously published discriminating 138 samples of sod-1(tm776) mutants and N2 wild-type C. elegans nematodes.5 Sample preparation, data import, pattern recognition, and statistical analyses are described elsewhere.5,6 NMR acquisition was done on a Bruker Avance spectrometer operating at 700 MHz 1H frequency, using magic-angle spinning of the samples and a one-dimensional 1 H NOESY experiment with water presaturation, with an acquisition time of approximately 25 min per sample. Processing of the data was achieved using Topspin 2.1 and AMIX (Bruker GmBH, Rheinstetten, Germany). The data matrix 4514

Journal of Proteome Research • Vol. 9, No. 9, 2010

Blaise et al. contains 138 spectra represented in rows and 9000 variables among which 399 variables correspond to the excluded residual water signal after presaturation. Principal components analysis (PCA) and Orthogonal-Partial Least Square (O-PLS) regression were run using Matlab R2008a (Mathworks, Natick, MA) to, respectively, evaluate the highest directions of variance and discriminate our samples with the addition of a class information matrix Y. Statistical Recoupling of Variables. SRV was performed on the data set according to the algorithm recently published and available upon request.23 SRV is an automated variable size bucketing procedure aiming at the identification of metabolic NMR peaks in the NMR signal. It is achieved without a priori knowledge, by focusing on the statistical relationships between consecutive variables inherited from the high-resolution bucketing. In this way, it is actually possible to restore the spectral dependency, which is a fundamental property of the NMR signal, by following a landscape of covariance/correlation (L) between consecutive variables. L(i) )

covariance (i, i + 1) ) geometric mean(i, i + 1) correlation

The second step of SRV is to identify clusters whose borders are located at the minima of the covariance/correlation landscape. The identified clusters must contain a minimum number of variables, which is defined as a function of the peak width of a singlet at the baseline in the NMR spectrum and the bucketing resolution. minimal size )

singlet base peak width bucketing resolution

The third step of the procedure consists in the aggregation of neighboring clusters based on a sufficient threshold of correlation. This threshold is empirically determined by visualizing the identified clusters in a typical NMR spectrum of the data set. To set the threshold in practice, the procedure is considered optimal when both the strong aliphatic as well as the weak aromatic multiplets are correctly identified. The clusters identified in this way then correspond to physical, chemical, or biological entities (peaks, multiplets, spin systems, metabolite correlations, etc.). Note that in the case of two overlapping signals with different intensities, SRV usually identifies 3 clusters, with the central cluster corresponding to the weaker peak, and the two outer clusters corresponding to the stronger peak. This is notably observed for the broad lipid signals in HRMAS spectra to which the NMR signals arising from small molecules may be added. It is thus important to note that SRV does not discard the smaller peaks in overlapping regions. Fuller details have been given previously.23 We used the following parameters to efficiently recouple variables inherited from the high-resolution bucketing: 10 consecutive variables are required to describe a typical wellresolved singlet on a 700 MHz spectrum (peak base width at the noise threshold for a resolved weak singlet in the aromatic area ) 0.01 ppm, bucketing resolution ) 0.001 ppm). A correlation threshold is empirically determined to associate neighboring clusters along the chemical shift axis: 0.90 for the entire data set, 0.93 concerning the effect of age in sod-1 mutants, and, respectively, 0.72, 0.76, 0.76, and 0.80 for N2

Identification of Perturbed Metabolic Networks samples (L4 larvae and gravid adults) and sod-1 mutants (gravid adults and L4 larvae). In these cases, the observed correlations represent the topology of the global metabolic network (Supplementary Figure 1). Higher thresholds would prevent the association of clusters and thus result in the incapacity of the algorithm to identify multiplets. Lower thresholds would lead to the association of clusters grouping NMR signals belonging to different entities. The levels of correlation obtained in these last sub-data sets to focus on intragroup variance are clearly decreased showing the weak influence of the intragroup variance on the total variance. This procedure allows the reduction of the dimensionality from 8601 variables (buckets) to 138 clusters representing physical, chemical, or biological entities, without significant loss of information. SRV was performed using Matlab R2008a (Mathworks, Natick, MA). Statistical Total Correlation Spectroscopy. 10 The autocorrelation matrix of the data set was then computed with Matlab R2008a (Mathworks, Natick, MA) according to: C)

1 XT · X NS - 1

where NS is the number of spectra available in the data set and X represents the autoscaled experimental matrix of NS spectra × NV spectral variables (buckets) and C the autocorrelation matrix (NV × NV), computed with Matlab. Since the SRV data have been autoscaled, the correlations in the R-STOCSY plot do not involve only the most intense NMR signals. Rather, they involve the signals showing the strongest correlations, irrespective of their intensity, which can correspond to weak NMR signals that might often be missed in a traditional analysis of the NMR spectra. For all correlations, a p-value was computed based on a t test to check the statistical significance of the association. Each p-value represents the probability of there being a correlation as large as the observed value by random chance, when the true correlation is zero. The threshold of statistical significance, in this context of multiple hypothesis testing, was set according to the Bonferroni correction. The significance threshold was set to the usual value of 0.05 and was corrected according to the number of potential correlations.26 p