Anal. Chem. 2009, 81, 6242–6251
Statistical Recoupling Prior to Significance Testing in Nuclear Magnetic Resonance Based Metabonomics Benjamin J. Blaise, Laetitia Shintu, Be´ne´dicte Elena, Lyndon Emsley, Marc-Emmanuel Dumas, and Pierre Toulhoat* Universite´ de Lyon, Centre de RMN a` Tre`s Hauts Champs, CNRS/ENS Lyon/UCB Lyon 1, 5 Rue de la Doua, 69100 Villeurbanne, France Significance testing is a crucial step in metabolic biomarker recovery from the metabolome-wide latent variables computed by multivariate statistical analysis. In this study we propose an algorithm based on the landscape of the covariance/correlation ratio of consecutive variables along the chemical shift axis to restore, prior to significance testing, the spectral dependency and recouple variables in clusters which correspond to physical, chemical, and biological entities: statistical recoupling of variables (SRV). Variables are associated into a series of clusters, which are then considered as individual objects for the control of the false discovery rate. Compared to classical procedures, it is found that SRV allows efficient recovery of statistically significant metabolic variables. The proposed SRV method when associated with the Benjamini-Yekutieli correction retains a low level of significant variables in the noise areas of the nuclear magnetic resonance (NMR) spectrum, close to that observed using the conservative Bonferroni correction (false positive rate), while also allowing successful identification of statistically significant metabolic NMR signals in cases where the classical procedures of Benjamini-Yekutieli and Benjamini-Hochberg (false discovery rate) fail. This procedure improves the interpretability of latent variables for metabolic biomarker recovery. Following recent developments of the “-omics” sciences, metabonomics has seen in the last years a rapid increase of the number of variables used to describe the spectral data of a metabolic system. This trend is necessary since the resolution should match that inherited from high-field nuclear magnetic resonance (NMR) or mass spectrometry (MS) spectra to allow an efficient assignment of latent variables. For example, a width of 0.001 ppm is often used for rectangular buckets for NMR data, with 9 000 buckets in a typical 9 ppm-wide 1H spectrum.1 The driving force of multivariate statistical data analysis is then to reduce the high dimensionality space of all the variables so as to establish linear combinations of variables that encapsulate the * To whom correspondence should be addressed. P. Toulhoat, e-mail:
[email protected]. (1) Cloarec, O.; Dumas, M. E.; Trygg, J.; Craig, A.; Barton, R. H.; Lindon, J. C.; Nicholson, J. K.; Holmes, E. Anal. Chem. 2005, 77, 517–526.
6242
Analytical Chemistry, Vol. 81, No. 15, August 1, 2009
metabolic information of the total system.2 The results of multivariate analysis can be visualized in two dual spaces: on a score plot (space of individuals) showing discrimination between subpopulations and a loading plot (space of variables), which represents the latent variable illustrating the variation of a combination of concentrations of small molecular weight compounds that together explain the discrimination between the subpopulations under study. The quality of this discrimination can be checked by numerous methods. Among these the most useful are cross-validated score plots,3 cross-validations of the model,4 resampling under the null hypothesis,5 or estimations of the p-value by cross-validated analysis of variance.6 However, a goal of metabonomics is to determine the metabolic reaction of an organism to change.7 The following step is therefore to identify individual metabolites, extracted from the metabolomewide latent variable discriminating subpopulations, as reliable markers.8,9 These potential biomarkers are often found by taking into account prior knowledge of the physiological process under investigation, and complementary approaches, such as other “omics”,10 are used to establish an interpretation of the observed variations. A crucial evolution is thus to rationalize the interpretation of the latent variable (i.e., selection of candidate biomarkers) on the basis of complementary univariate analysis. Cloarec et al.3 enhanced the unbiased interpretability of the latent variable in supervised methods by adding the degree of correlation to the covariance profile. In this representation covariance is used for structural assignment purposes as intensity ratios within multiplets are conserved. However, the drawback (2) Nicholson, J. K.; Connelly, J.; Lindon, J. C.; Holmes, E. Nat. Rev. Drug Discovery 2002, 1, 153–161. (3) Cloarec, O.; Dumas, M. E.; Craig, A.; Barton, R. H.; Trygg, J.; Hudson, J.; Blancher, C.; Gauguier, D.; Lindon, J. C.; Holmes, E.; Nicholson, J. K. Anal. Chem. 2005, 77, 1282–1289. (4) Anderssen, E.; Dyrstad, K.; Westad, F.; Martens, H. Chemom. Intell. Lab. Syst. 2006, 84, 69–74. (5) Efron, B. Ann. Stat. 1979, 7, 1–26. (6) Eriksson, L.; Trygg, J.; Wold, S. J. Chemom. 2008, 22, 594–600. (7) Nicholson, J. K.; Lindon, J. C.; Holmes, E. Xenobiotica 1999, 29, 1181– 1189. (8) Rousseau, R.; Govaerts, B.; Verleysen, M.; Boulanger, B. Chemom. Intell. Lab. Syst. 2008, 91, 54–66. (9) Maher, A. D.; Cloarec, O.; Patki, P.; Craggs, M.; Holmes, E.; Lindon, J. C.; Nicholson, J. K. Anal. Chem. 2009, 81, 288–295. (10) Dumas, M. E.; Wilder, S. P.; Bihoreau, M. T.; Barton, R. H.; Fearnside, J. F.; Argoud, K.; D’Amato, L.; Wallis, R. H.; Blancher, C.; Keun, H. C.; Baunsgaard, D.; Scott, J.; Sidelmann, U. G.; Nicholson, J. K.; Gauguier, D. Nat. Genet. 2007, 39, 666–672. 10.1021/ac9007754 CCC: $40.75 2009 American Chemical Society Published on Web 07/02/2009
of covariance is an enhanced focus on high-intensity peaks. This is the reason why the correlation profile is superimposed in the form of a heatmap on top of the covariance profile. This approach often raises the question of a threshold of correlation below which the variables should not be considered. A method has recently been developed to define this threshold in the case of structural assignment.11 Holmes et al.12 recently proposed an a posteriori validation of the selected metabolites by one-way analysis of variance (ANOVA), estimating the p-value in the context of multiple hypothesis testing by the Bonferroni correction. This correction calculates the false positive rate (FPR), which estimates the number of variables associated with a true null hypothesis and that are proposed as significant by any given statistical test. It is widely used among epidemiologists but is known to be extremely conservative especially in the case of “-omics” sciences.13 In the case of NMR based metabonomics, the FPR can be useful for highthroughput NMR experiments but would certainly lead to noninterpretable results in the case of usual NMR based metabonomics with small data sets. Less conservative corrections evaluate the false discovery rate (FDR) that controls the rate at which variables identified as significant by a given test are in fact associated with a true null hypothesis and have been successfully developed and applied by Benjamini and Hochberg.14,15 They are based on the estimation of q-values.16 In the case of multiple testing under dependence assumptions between variables, the control of the FDR is modified, as reported by Benjamini and Yekutieli.17 We note that high-resolution bucketing of the NMR signal that matches the resolution of high-field NMR removes the spectral dependence among NMR variables and leads to an inappropriate use of significance testing corrections, since they are either based on independency (Bonferroni and Benjamini-Hochberg) or since the input variables have lost their spectral relevance (Benjamini-Yekutieli). This neglects the fact that we know that all the variables making up a single NMR resonance are correlated and all the resonances for a given metabolite in distinct parts of the spectrum are correlated. A recent approach looks for known metabolites and develops the quantification of metabolites by NMR, thereby allowing data reduction,18 since all the variables corresponding to the same molecule are grouped together to define its concentration. Locally we can use the prior knowledge of spectral structure, without invoking any metabolic hypothesis, to improve the significance of the analysis. A series of points in a cluster describe a NMR signal, and a spectrum is a group of clusters, representing NMR variables with a biological meaning. (11) Couto Alves, A.; Rantalainen, M.; Holmes, E.; Nicholson, J. K.; Ebbels, T. M. D. Anal. Chem. 2009, 81, 2075–2084. (12) Holmes, E.; Loo, R. L.; Stamler, J.; Bictash, M.; Yap, I. K. S.; Chan, Q.; Ebbels, T.; De Iorio, M.; Brown, I. J.; Veselkov, K. A.; Daviglus, M. L.; Kesteloot, H.; Ueshima, H.; Zhao, L. C.; Nicholson, J. K.; Elliott, P. Nature 2008, 453, 396–U350. (13) Storey, J. D.; Tibshirani, R. Proc. Natl. Acad. Sci. U.S.A. 2003, 100, 9440– 9445. (14) Benjamini, Y.; Hochberg, Y. J. R. Stat. Soc., Ser. B: Stat. Methodol. 1995, 57, 289–300. (15) Storey, J. D.; Xiao, W. Z.; Leek, J. T.; Tompkins, R. G.; Davis, R. W. Proc. Natl. Acad. Sci. U.S.A. 2005, 102, 12837–12842. (16) Storey, J. D. Ann. Stat. 2003, 31, 2013–2035. (17) Benjamini, Y.; Yekutieli, D. Ann. Stat. 2001, 29, 1165–1188. (18) Zhang, S. C.; Gowda, G. A. N.; Asiago, V.; Shanaiah, N.; Barbas, C.; Raftery, D. Anal. Biochem. 2008, 383, 76–84.
For the recovery of metabolic biomarkers from the latent variable, we note that multiple hypothesis testing appears as a suitable method to certify the importance of a given variable.19 We show here how we can recover metabolic NMR signals without prior knowledge of metabolites by first capitalizing on the relationships between variables due to the spectral structure of the data (covariance and correlation of consecutive variables) and then using significance testing procedures to improve the interpretation of the latent variables for metabolic biomarker identification. We illustrate the method with the discrimination between two C. elegans strains: N2 wild-type and sod-1 mutants, for which the NMR data has been published previously.20,21 On the basis of approaches to reduce the high-dimensionality data sets,22,23 we report here a new algorithm based on the landscape of covariance and correlation between consecutive variables to restore the spectral dependency and recouple variables in physical, chemical, and biological entities for a control of the FDR adapted to the dependency of the NMR signals. This procedure uses statistical recoupling of variables (SRV) to counter “the curse of dimensionality”.24 The combination of traditional multivariate (orthogonal projection on latent structures, O-PLS) analysis using high-resolution bucketing with univariate analysis focusing on entities inherited from the “clever bucketing” scheme achieved by statistical recoupling developed here provides a powerful way to secure the interpretation of the latent variable. METHODS Metabotype of C. elegans. This study uses a data set already published discriminating sod-1(tm776) mutants from N2 wild-type nematodes.20 Data Import and Pattern Recognition. 1H high-resolution magic angle spinning (HRMAS) whole organism NMR spectra are processed using the Topspin 2.1 interface (Bruker GmbH, Rheinstetten, Germany). They are reduced over the chemical shift range of 0-9 ppm with exclusion areas around residual water signal (4.6-5.0 ppm) using the AMIX software (Bruker GmbH). Four bucketing sizes are used: 0.1, 0.04, 0.01, and 0.001 ppm-wide regions, and the signal intensity in each region is integrated. In the case of high-resolution bucketing (0.001 ppm), the X data matrix contains 9000 variables, among which 399 variables correspond to the excluded water signal. Spectra are scaled to total intensity, and integration is performed with the sum of the intensity mode. The corresponding bucket tables are then exported to the softwares Simca-P 12 (Umetrics, Umeå, Sweden) and Matlab R2008a (Mathworks, Natick, MA) for statistical analysis. Multivariate Analysis. Unsupervised principal component analysis is run in both subpopulations to check sample homogene(19) Shaffer, J. P. Annu. Rev. Psychol. 1995, 46, 561–584. (20) Blaise, B. J.; Giacomotto, J.; Elena, B.; Dumas, M. E.; Toullhoat, P.; Segalat, L.; Emsley, L. Proc. Natl. Acad. Sci. U.S.A. 2007, 104, 19808–19812. (21) Blaise, B. J.; Giacomotto, J.; Triba, M. N.; Toulhoat, P.; Piotto, M.; Emsley, L.; Se´galat, L.; Dumas, M.-E.; Elena, B. J. Proteome Res. 2009, 8, 2542– 2550. (22) Leek, J. T.; Storey, J. D. Proc. Natl. Acad. Sci. U.S.A. 2008, 105, 18718– 18723. (23) Heller, R.; Stanley, D.; Yekutieli, D.; Rubin, N.; Benjamini, Y. Neuroimage 2006, 33, 599–608. (24) Bellman, R. Adaptive Control Processes: A Guided Tour; Princeton University Press: Princeton, NJ, 1961.
Analytical Chemistry, Vol. 81, No. 15, August 1, 2009
6243
ity and finally exclude outliers. O-PLS25 is carried out to investigate the information shared by the NMR spectral matrix X and the descriptive variable matrix Y. Spectra are visualized in crossvalidated score plots, and the metabolic variations sustaining the discrimination are represented in loading plots. We use the method developed by Cloarec et al.3 to improve the interpretability of the loading by adding the correlation coefficients on the covariance metabolic profile. Covariance and correlation coefficients are computed from the data matrix and the descriptive variables matrix with Matlab. Predictions are realized on two data sets, one extracted from the same data set before model calculation (cross-validation) and another one acquired at a different time (independent validation). Observed and predicted Y matrixes are computed using an O-PLS Matlab routine developed in the lab. Receiver operating characteristic (ROC) curves are plotted using Matlab. These represent the true positive rate as a function of the false positive rate and determine the best threshold for a given test to define the membership at a class. The diagonal stands for random prediction, whereas the more the curve moves to the top left corner, the better the prediction ability of the model is. Analysis of Variance. One-way ANOVA is a supervised univariate method, which calculates the variance of each variable without considering the dependency of the variables on each other.26 The total variance is given by the sum of squares of the distance between the value xi and its average. This can be written as TSS )
∑ (x
i
- ¯x)2
i
MSwithin )
F)
SSbetween k-1 SSwithin n-k MSbetween MSwithin
The calculated value is then compared to the theoretical value reported in the Fisher tables for a risk level defined by the p-level.26 A one-way ANOVA is thus used here to compute the F statistics, which represents the intergroup over intragroup vari(25) Trygg, J.; Wold, S. J. Chemom. 2002, 16, 119–128. (26) Brownlee, K. A. Statistical Theory and Methodology in Science and Engineering Wiley: New York, 1960.
6244
p