Technical Note pubs.acs.org/ac
Data-Driven Sample Size Determination for Metabolic Phenotyping Studies
Benjamin J. Blaise*
Hospices Civils de Lyon, Département de réanimation néonatale et néonatalogie, Hôpital Femme Mère Enfant, 59 bd Pinel, Bron Cedex, Bron 69677, France
Hospices Civils de Lyon, Département d’anesthésie et réanimation, Hôpital Edouard Herriot, 5 place d’Arsonval, Lyon Cedex 03, Lyon 69437, France
ABSTRACT: Sample size determination is a key question in the experimental design of medical studies. The number of patients to include in a clinical study is critical to evaluate costs and inclusion requirements, to achieve sufficient statistical power, and to identify significant variations among the factors under study. Metabolic phenotyping is an expanding field of translational research in medicine, focusing on the identification of metabolism rearrangements due to various pathophysiological conditions. This top-down hypothesis-free approach uses analytical chemistry methods, coupled to statistical analysis, to quantify subtle and coordinated metabolite concentration variations and eventually identify candidate biomarkers. Sample size determination in metabolic phenotyping studies is difficult because there is no a priori metabolic target. This Technical Note introduces a data-driven sample size determination for metabolic phenotyping studies. Starting from nuclear magnetic resonance (NMR) spectra belonging to a small cohort, metabolic NMR variables are identified by the statistical recoupling of variables (SRV) procedure. A larger data set is then generated on the basis of Kernel density estimation of SRV variable distributions. Statistically significant variations of metabolic NMR signals identified by SRV are assessed by the Benjamini-Yekutieli correction for simulated data sets of variable sizes. The robustness of the simulated models is evaluated by receiver operating characteristic analysis (sensitivity and specificity) on an independent cohort and by cross-validation. Sample size determination is obtained by identifying the optimal data set size, depending on the purpose of the study: at least one statistically significant variation (biomarker discovery) or a maximum of statistically significant variations (metabolic exploration).
Received: April 29, 2013 Accepted: August 23, 2013 Published: August 23, 2013
© 2013 American Chemical Society
dx.doi.org/10.1021/ac4022314 | Anal. Chem. 2013, 85, 8943−8950
Sample size determination is an essential aspect of experimental design in clinical studies. It gives an estimate of the number of patients that need to be included to identify a potential difference between the groups under study. It is closely related to the statistical power of the test and thus to the ability to identify a statistically significant difference if there truly is one, that is, to avoid a type 2 error. This parameter is used to determine the time required for patient inclusion, or the necessity of a multicenter study, and thus to evaluate study costs.1 Different approaches can be used for sample size determination. The sample size depends on factors such as the minimum expected difference, the estimated measurement variability, the desired statistical power, the significance threshold, and the type of test used in the analysis.2 However, in the case of “-omics” sciences, the traditional approaches are not easily transferable.
Metabolic phenotyping, also named metabonomics, is one of the youngest “-omics” sciences and focuses on the metabolic response of a living system to pathophysiological stimuli.3,4 It aims at the identification of subtle and coordinated concentration variations of low molecular weight compounds that are substrates and end products of enzymatic reactions. Metabolic phenotyping is a top-down hypothesis-free analytical approach, relying mostly on mass spectrometry (MS) or nuclear magnetic resonance (NMR), quantifying metabolites in liquids, semisolid samples, cells, or entire model organisms to evaluate the influence of various pathophysiological conditions. The development of statistical methods is a keystone for the improvement of data extraction from metabolic phenotyping studies given the complexity of data sets.5,6 The high dimensionality of the data sets is a potential source of difficulties, since the number of variables exceeds the number of samples. These potential sources of error are well described, and considerable effort is devoted to the development of statistical tools to counter “the curse of dimensionality”.7 Different studies focus on the risk of asserting that an observed difference is true when it is not (type 1 error), using multiple hypothesis corrections. Another risk is to state that there is no difference when there is actually one (type 2 error).8 Although this error might seem more acceptable, it is connected to the statistical power of the test and to the sample size and might have different consequences, including for instance the misidentification of diagnostic procedures or therapies and financial consequences. Determining an appropriate sample size thus appears as an essential step in the experimental design of metabolic phenotyping studies.9
Ferreira and co-workers used the concept of power analysis for high dimensionality data sets. They developed a theoretical approach for sample size determination in microarray experiments based on the relationship between the cumulative distribution function of p-values, the given false discovery rate measurement, and an estimation of the effect size.10 This approach has been adapted11 and used in MS-based metabolic studies.12 However, this approach still relies on an estimation of the effect size, which should not be included as a variable for sample size determination in a metabolic phenotyping study. Another efficient method for sample size determination would be to generate a large hypothetical data set and to investigate sub data sets of various sizes, with the usual statistical analysis, to extract the smallest number of samples allowing the identification of statistically significant differences between the groups under study. The generation of a large data set can be achieved, for instance, through ab initio simulations or the use of a preacquired spectrum data bank refined by input parameters such as metabolite concentrations.13 However, theoretical simulations do not include important factors such as the minimum expected difference or the estimated measurement variability, and data bank simulations require additional metabolic concentration information, altering the top-down hypothesis-free approach of metabolic phenotyping studies.
This Technical Note introduces a new approach for sample size determination. It uses an NMR data set discriminating wild-type (WT) and sod-1(tm776) mutant C. elegans.
Metabolic NMR variables are identified using the statistical recoupling of variables (SRV) algorithm,14 which acts as an automated variable size bucketing procedure, based on the statistical relationship between consecutive variables in the high resolution data set. An extended data set is then generated on the basis of these variables. Kernel density estimation is used to evaluate the probability density function of SRV variables. The inverse cumulative probability is then computed, and randomly selected numbers are chosen in the ]0,1[ interval, to rebuild entire spectra, variable after variable, and thus generate an extended NMR data set. Simulated data sets of different sizes are then computed. Orthogonal partial least-squares (OPLS) analysis is performed to discriminate simulated samples according to genetics. Prediction capacities (sensitivity and specificity) of each simulated model are evaluated by receiver operating characteristic (ROC) analysis and measurement of the area under the curve (AUC), based on the prediction of an independent testing cohort. The false discovery rate is eventually measured, using the Benjamini-Yekutieli correction for multiple hypotheses testing in a context of negatively dependent tests,15 to identify statistically significant variations among NMR signals. Sample size can then be determined depending on the purpose of the study: biomarker discovery (at least one statistically significant variation) or metabolic exploration (as many statistically significant variations as possible). Compared to other procedures, this data-driven sample size determination allows the inclusion of the minimum expected difference and the estimated measurement variability, without additional information other than the spectral data. It thus provides the number of samples to include in a cohort.
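The expansion step outlined above (Kernel density estimate, inverse cumulative probability, random numbers in the ]0,1[ interval) can be sketched in a few lines of Python. This is a minimal illustration and not the MATLAB implementation used in this Technical Note: the function names are hypothetical, the cumulative function of the Gaussian Kernel estimate is inverted by bisection, and the smoothing parameter defaults to a normal reference rule of thumb, which is an assumption since the text does not state how h was chosen.

```python
import math
import random
import statistics


def kde_cdf(x, data, h):
    """Cumulative distribution of a Gaussian Kernel density estimate at x."""
    # Integrating each Gaussian Kernel gives a normal CDF centred on x_i.
    total = sum(0.5 * (1.0 + math.erf((x - xi) / (h * math.sqrt(2.0))))
                for xi in data)
    return total / len(data)


def kde_inverse_cdf(p, data, h, n_iter=60):
    """Invert kde_cdf by bisection for a probability p in ]0,1[."""
    lo = min(data) - 10.0 * h
    hi = max(data) + 10.0 * h
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if kde_cdf(mid, data, h) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate_variable(data, n_new, rng, h=None):
    """Draw n_new simulated intensities for one SRV variable."""
    if h is None:
        # Assumption: normal reference rule of thumb for the smoothing parameter.
        h = 1.06 * statistics.stdev(data) * len(data) ** (-0.2)
    if h <= 0.0:
        return [data[0]] * n_new  # constant variable, nothing to smooth
    return [kde_inverse_cdf(rng.random(), data, h) for _ in range(n_new)]


def simulate_spectra(group, n_new, seed=0):
    """group: list of spectra from one class, each a list of SRV intensities."""
    rng = random.Random(seed)
    n_vars = len(group[0])
    columns = [simulate_variable([s[j] for s in group], n_new, rng)
               for j in range(n_vars)]
    # Transpose: one row per simulated spectrum, built variable after variable.
    return [list(row) for row in zip(*columns)]
```

In this sketch each class (WT or mutant) would be expanded separately, for instance `simulate_spectra(wt_training, 200)`, where `wt_training` is a hypothetical list of SRV-reduced training spectra. Each variable is sampled independently of the others, which mirrors the variable-after-variable reconstruction described in the text.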
■
METHODS
Data Set. This Technical Note uses a sub data set already published elsewhere.4,16 It contains 72 whole-organism NMR spectra of entire WT and sod-1(tm776) mutant C. elegans. High resolution magic angle spinning (HRMAS) NMR experiments were carried out on a Bruker Avance II spectrometer operating at 700 MHz (proton resonance frequency), using a standard double resonance (1H−13C) 4 mm HRMAS probe. The 72 HRMAS NMR spectra were split into a training cohort (22 spectra: 12 WT and 10 sod-1(tm776) samples) and a testing cohort (50 spectra: 27 WT and 23 sod-1(tm776) samples). These two cohorts are independent. NMR simulations were performed with the training cohort and model validations (ROC curve analysis) with the independent testing cohort.
A second data set is composed of 21 NMR spectra of human plasma samples, discriminating traumatized patients at their admission to the intensive care unit according to the later development of sepsis. All NMR experiments for this data set were carried out on a Bruker Avance III spectrometer operating at 800.14 MHz (proton resonance frequency) equipped with a 5 mm TXI probe and a high throughput sample changer that maintained the sample temperature at 4 °C until actual NMR acquisition.
Data Processing. All FIDs were multiplied by an exponential function corresponding to a 0.3 Hz line-broadening factor, prior to Fourier transformation. 1H HRMAS NMR spectra were phased and referenced to the β proton signal of alanine (δ = 1.48 ppm) using Topspin 2.1 (Bruker GmbH, Rheinstetten, Germany). Residual water signal (4.61−4.99 ppm) was excluded. Spectra were divided into 0.001 ppm-wide buckets over the chemical shift range [0; 10 ppm] using the AMIX software (Bruker GmbH). 1H NMR spectra were automatically phased and referenced to the α-glucose anomeric proton signal (δ = 5.23 ppm) using Topspin 2.1 (Bruker GmbH, Rheinstetten, Germany). Residual water signal (4.68−4.86 ppm) was excluded.
Spectra were divided into 0.001 ppm-wide buckets over the chemical shift range [0; 10 ppm] using the AMIX software (Bruker GmbH). All spectra were normalized to their total intensity and mean-centered prior to analysis.
Statistical Analysis. Data were exported to MATLAB (Mathworks, Natick, MA) for statistical analysis. Principal component analysis (PCA) was performed to derive the main sources of variance within the data set, check population homogeneity, and identify possible outliers. Data were visualized as loading plots, which represent the coordinated variations of NMR spectral regions. Supervised regression by orthogonal partial least-squares (O-PLS) was performed to establish a robust sample classification model. O-PLS is the most commonly used multivariate statistical analysis in metabolic phenotyping studies; however, other classification approaches could be used to discriminate samples. The O-PLS analysis was run to discriminate populations by regressing a supplementary data matrix Y, containing information about the membership of samples to each group. Model performance was assessed by the goodness-of-fit parameters R2 and Q2, related, respectively, to the explained and predicted variance. Model validation was performed by resampling the model 1000 times under the null hypothesis.
Statistical Recoupling of Variables. The SRV algorithm14,17 was used to identify metabolic NMR variables. The following parameters were used for the HRMAS C. elegans data set: singlet size = 0.01 ppm, bucketing resolution = 0.001 ppm, correlation threshold = 0.7, significance threshold = 0.05, number of factors = 3. The following parameters were used for the human data set: singlet size = 0.009 ppm, bucketing resolution = 0.001 ppm, correlation threshold = 0.6, significance threshold = 0.05, number of factors = 5.
Inverse Cumulative Probability and Sample Size Determination. Inverse cumulative probability for normal distributions was derived in MATLAB using a Kernel smoothing density estimate, for each metabolic NMR variable identified by SRV. The Kernel density estimator of a distribution f, computed from n observed values x_1, ..., x_n, is given by the following formula, where K is a Kernel (a non-negative real-valued integrable even function, with an integral equal to 1 over ]−∞, +∞[) and h is a smoothing parameter.18,19
$$\hat{f}_h(x) = \frac{1}{n h} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$
Gaussian Kernel estimators are frequently used to approximate arbitrary distributions. K is then defined by the following equation.
$$K(u) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2} u^2}$$
NMR spectra were simulated by selecting random numbers in the ]0,1[ interval and using them as arguments of the inverse cumulative probability function of each SRV variable. An additional noise-removing filter was added to the SRV procedure in this Technical Note, to take into account the effect of Kernel estimators on electronic noise. Kernel density estimators use fluctuations around the distribution values to fit the global distribution. In electronic noise areas, such approaches may lead to an artificial rise of signals, which should be canceled out. For each spectrum of the data set, the mean value and standard deviation of the noise were computed over the 8.7−9 ppm area. A variable is considered to belong to the NMR signal if its intensity exceeds the mean of the noise plus 1.96 times its standard deviation (95% confidence interval) for at least half of the spectra involved in the simulated data set. Receiver operating characteristics (ROC) were calculated to evaluate the sensitivity and specificity of each calculated model, based on the prediction of an independent testing cohort. Results are presented as ROC curves with area under the curve (AUC) measurements. The false discovery rate was measured according to the Benjamini-Yekutieli correction15 to determine the number of statistically significant NMR variables in each data set.
Figure 1. Metabolic phenotypes extracted from low or medium size cohorts. Orthogonal partial least-squares (O-PLS) analysis of 2 independent cohorts of (a) 50 samples (testing cohort) and (b) 22 samples (training cohort) discriminating gravid wild-type (WT) from sod-1(tm776) mutant C. elegans. O-PLS regression coefficient plots are colored according to the correlation between the metabolic variables identified by statistical recoupling of variables (SRV) and the class information matrix, with significance testing of the SRV variables following the Benjamini-Yekutieli correction. SRV variables that do not achieve the Benjamini-Yekutieli false discovery rate threshold of 0.05 are represented in gray. Score plots are inserted as miniatures on both loading plots.
■
RESULTS
Metabolic Phenotype Discriminating Gravid C. elegans Samples Based on Genetics. A supervised 1 + 3 component O-PLS analysis is carried out to discriminate samples in the training and testing cohorts according to genetics (WT versus sod-1(tm776) mutants). The O-PLS analysis shows a clear discrimination between the two groups on the score plot, assessed by elevated values of the goodness-of-fit parameters for the testing cohort (50 samples): R2 = 0.817 and Q2 = 0.714 when considering the 9000 initial variables inherited from high resolution bucketing, or R2 = 0.823 and Q2 = 0.647 when considering the SRV clusters that represent biological NMR data. Similar results are obtained for the training cohort (22 samples): R2 = 0.942 and Q2 = 0.553 when considering the 9000 initial variables inherited from high resolution bucketing, or R2 = 0.937 and Q2 = 0.627 when considering the SRV clusters that represent biological NMR data. These parameters refer to the explanation and prediction capacities of the established model. Score plots based on SRV cluster analysis are represented as miniatures within loading plots for the testing cohort (Figure 1a) and training cohort (Figure 1b). Coordinated metabolic variations sustaining these discriminations are represented by the O-PLS regression coefficient plot colored according to the correlation between metabolic NMR signals and the class information matrix. This loading plot corresponds to the first latent variable and is computed using projections of the raw data on the calculated weight vector W. Nonstatistically significant variables are represented in gray, according to the Benjamini-Yekutieli multiple hypothesis testing correction. The latent variable corresponding to the testing cohort shows 53 statistically significant NMR signals, which are candidate biomarkers for the discrimination between gravid WT and sod-1(tm776) mutant C. elegans (Figure 1a). The same analysis for the training cohort fails to identify any statistically significant NMR signal variation (Figure 1b). All variables are in gray, showing a lack of statistical power in this training cohort.
Figure 2. Flowchart. Schematic representation of the data-driven sample size determination procedure. Training procedure, with sample-size determination and validation, is represented in red; independent validation by the testing cohort is represented in green.
Figure 3. Inverse cumulative distribution function. (a) Discrete cumulative distribution function of a typical variable identified by statistical recoupling of variables (red). Kernel smoothing density estimator of the same variable (blue) with a sampling step of 0.01 over the ]0,1[ interval. (b) Estimated inverse cumulative distribution function of a typical variable over the ]0,1[ interval.
Figure 4. Sample size determination and model validation. (a) Number of variables identified by statistical recoupling of variables (SRV) and highlighted as statistically significant by the Benjamini-Yekutieli correction with respect to the number of spectra involved in the data set, for the discrimination between gravid wild-type (WT) and sod-1(tm776) mutant C. elegans. Error bars represent 95% confidence intervals over 1000 randomly generated, equally sized sub data sets drawn from the expanded data set. The sample size is determined by the number of expected statistically significant variables depending on the experimental design (biomarker discovery: at least one; metabolic exploration: as many as possible). (b) Validation by resampling under the null hypothesis of the model discriminating the groups under study for a simulated data set of 400 spectra (plain green triangles for R2 and blue squares for Q2) or for a simulated data set of 100 spectra (empty green triangles for R2 and blue squares for Q2). (c) Area under the curve (AUC) for the receiver operating characteristic (ROC) analysis as a function of the number of simulated spectra included in the model calculation. Sensitivity and specificity of the models are assessed by the prediction of the independent testing cohort with respect to the simulated spectra derived from the training cohort. (d) ROC curves corresponding to the prediction of the independent testing cohort with a simulated data set of 400 spectra (solid red) or 100 spectra (dashed blue) derived from the training cohort. AUC are measured at 0.85 and 0.83, respectively.
Data-Driven Sample Size Determination. The purpose of this Technical Note is to derive, from a small original cohort, the number of samples necessary to identify statistically significant NMR signal variations and thus identify candidate biomarkers. Contrary to other approaches, the driving idea is to remove the traditional estimation of the effect size and to replace it by a deep analysis of acquired NMR data to evaluate this effect size. This Technical Note describes the different steps used for the data-driven generation of NMR data, based on Kernel analysis of the data set, the determination of the optimal sample size, and the validation procedure. This approach is detailed as a flowchart in Figure 2. From the original C. elegans data set, two independent cohorts are derived for training and testing purposes. The training cohort is analyzed by SRV to identify biological NMR signals, allowing the generation of 400 simulated spectra (200 WT and 200 sod-1(tm776) simulated samples). False discovery rate measurement by the Benjamini-Yekutieli correction is then performed to identify statistically significant NMR signal variations and thus extract candidate biomarkers. Once the sample size is determined, standard analyses (O-PLS, resampling under the null hypothesis, and ROC analysis) are used to assess the performance and validity of the model (red boxes in Figure 2).
An additional independent validation is done, based on ROC analysis of the independent testing cohort (green boxes in Figure 2).
Inverse Cumulative Distribution Function. In order to derive the sample size determination from the training cohort, the initial 22 spectra are expanded to a total of 400 individuals (200 per group). Instead of using theoretical approaches, a data-driven method was developed. The guiding idea is to compute the cumulative distribution function for each of the NMR variables inherited from the SRV analysis (Figure 3a). However, the cumulative distribution function obtained in this way is a discrete function. An additional smoothing step is therefore necessary to transform it into a continuous function that can be inverted. This can be achieved using a Kernel smoothing density estimation on a highly resolved determination interval (0.01 wide steps) as shown in Figure 3a. The continuous inverse cumulative distribution function can then be computed (Figure 3b) for the generation of new individuals. The data set is expanded by randomly selecting numbers in the ]0,1[ interval and collecting the corresponding values for all the SRV variables.
Sample Size Determination and Model Validation. Exploring this expanded data set allows the sample size determination. Increasing numbers of spectra are randomly selected from this newly generated data set. For each sub data set, the false discovery rate is measured using the Benjamini-Yekutieli correction, which gives the number of statistically significant variables for that sub data set. 1000 repetitions are computed for each sub data set size. Figure 4a represents the mean and 95% confidence intervals corresponding to the distribution of statistically significant variables for each sub data set size. The sample size can then be determined according to one of two goals. The first one is metabolic exploration of pathophysiological stimuli, which requires extensive metabolic characterization. The sample size is then obtained by picking the smallest data set size allowing the identification of a number of statistically significant signals close to the maximum represented by the total number of SRV clusters (114 in this case, after the application of the noise-removing filter). In this case, the statistical power will be sufficient for each metabolic NMR variable to be identified as statistically significant if it is truly different between the two groups under study. The second goal is biomarker discovery, and in this case, given sample cost and availability, one may be interested in the smallest data set allowing the identification of at least one statistically significant variation. These biological hypotheses and candidate biomarkers must then be validated by traditional biological experiments. For this data set, in a context of metabolic exploration, 100 samples is a reasonable number to identify statistically significant NMR signal variations. On the other hand, for biomarker discovery, a smaller data set of 40 samples is enough.
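The subsampling loop described above can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the code used for Figure 4a: the function names are hypothetical, groups are assumed balanced, and the per-variable test is a difference-in-means test with a normal approximation, chosen only because the text does not specify which univariate test feeds the Benjamini-Yekutieli correction. The correction itself is the standard step-up procedure with the harmonic factor c(m) = 1 + 1/2 + ... + 1/m.

```python
import math
import random
import statistics


def mean_difference_p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    if se == 0.0:
        return 1.0 if statistics.mean(a) == statistics.mean(b) else 0.0
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_yekutieli(p_values, q=0.05):
    """Return a list of booleans, True where the test is declared significant."""
    m = len(p_values)
    c_m = sum(1.0 / i for i in range(1, m + 1))  # harmonic correction factor
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / (m * c_m):
            k_max = rank  # largest rank satisfying the step-up condition
    significant = [False] * m
    for idx in order[:k_max]:
        significant[idx] = True
    return significant


def count_significant(group_a, group_b, q=0.05):
    """Number of variables significant after the correction."""
    n_vars = len(group_a[0])
    p = [mean_difference_p_value([s[j] for s in group_a],
                                 [s[j] for s in group_b])
         for j in range(n_vars)]
    return sum(benjamini_yekutieli(p, q))


def significance_curve(sim_a, sim_b, sizes, n_rep=1000, seed=0):
    """Mean number of significant variables for each total sub data set size."""
    rng = random.Random(seed)
    curve = {}
    for n in sizes:
        half = n // 2  # assumption: balanced groups
        counts = [count_significant(rng.sample(sim_a, half),
                                    rng.sample(sim_b, half))
                  for _ in range(n_rep)]
        curve[n] = statistics.mean(counts)
    return curve
```

Here `sim_a` and `sim_b` stand for the two simulated groups of the expanded data set (one list of SRV intensities per spectrum). Storing the full list of counts instead of the mean would also give the 95% intervals plotted as error bars.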
Using the fully expanded data-driven data set, it is now possible to derive a new O-PLS model discriminating the samples. A 1 + 2 component O-PLS analysis is carried out to discriminate the 400 samples. The discrimination is associated with elevated goodness-of-fit parameters (R2 = 0.990 and Q2 = 0.989). The same analysis can be performed for a cohort of 100 samples (R2 = 0.985 and Q2 = 0.981). The resampling under the null hypothesis shows a clear decrease of the goodness-of-fit parameters with the correlation between the initial and the randomly permuted class information matrix, validating the model robustness (Figure 4b) for a cohort of 400 samples (plain green triangles for R2 and blue squares for Q2) or for a sub data set of 100 spectra (empty green triangles for R2 and blue squares for Q2). Sensitivity and specificity analyses are carried out to investigate the robustness of the simulated models, using ROC curves. The independent testing cohort is used to evaluate the area under the curve (AUC) for each simulated data set. 1000 resamplings are performed for each data set size. AUC are represented as means with error bars corresponding to the 95% confidence interval in Figure 4c. Elevated AUC are observed, showing that models built on simulated data sets efficiently predict the independent cohort. ROC curves corresponding to the simulated 400 spectra or 100 spectra cohort are shown in Figure 4d, with AUC of 0.85 and 0.83, respectively. The Benjamini-Yekutieli correction can now be performed to extract candidate biomarkers for the large cohort of 400 samples (Figure 5a) or a smaller cohort of 100 samples (Figure 5b). In both cases, many NMR signals are highlighted as statistically significant and are thus candidate biomarkers for further biological or metabolic network analysis,20,21 focusing on the genetic discrimination between WT and sod-1(tm776) mutant C. elegans. The highlighted signals include those already observed on larger cohorts.4,16 Further biological interpretations are out of the scope of this Technical Note.
Figure 5. Metabolic phenotypes extracted from large size cohorts. 400 and 100 simulated sample cohorts discriminating gravid wild-type (WT) and sod-1(tm776) mutant C. elegans. Orthogonal partial least-squares (O-PLS) regression coefficient plot, colored according to the correlation between the variables identified by statistical recoupling of variables (SRV) and the class information matrix, and significance testing of the SRV variables following the Benjamini-Yekutieli correction, for a cohort of (a) 400 or (b) 100 simulated spectra derived from the training cohort. SRV variables that do not achieve the Benjamini-Yekutieli false discovery rate threshold of 0.05 are represented in gray.
Application to a Human Cohort. The data-driven procedure was tested on a small human cohort of 21 samples discriminating patients admitted to the intensive care unit according to the later development of sepsis (1 + 4 component model, R2 = 0.855, Q2 = 0.384, AUC = 0.778 on a cross-validated receiver operating characteristic curve). The statistical power of this cohort is not sufficient to identify statistically significant variations of NMR signals. The data-driven sample size estimation evaluates the adequate size at 200 samples for this data set, with the purpose of biomarker discovery.
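The AUC values quoted for the C. elegans and human models summarize how well predicted scores separate the two classes. As a hedged illustration, the AUC can be computed directly from its rank interpretation (the probability that a sample of the positive class receives a higher score than a sample of the negative class). The function names, the 0/1 label coding, and the idea of feeding in O-PLS predicted scores are assumptions made for this sketch; the source does not describe its ROC implementation.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores: predicted values, higher meaning "more likely positive".
    labels: 1 for the positive class, 0 for the negative class.
    Ties between a positive and a negative sample count for one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute an AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def roc_curve(scores, labels):
    """(1 - specificity, sensitivity) pairs obtained by sweeping the threshold."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points
```

For the independent validation described in this note, `scores` would be the predictions of a model trained on simulated spectra and applied to the testing cohort, and `labels` the known genotypes. The pairwise loop is quadratic in cohort size, which is acceptable for cohorts of a few tens of samples.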
■
DISCUSSION
Sample size can thus be determined from a data-driven approach. However, the sample size determination might be slightly optimistic, because the data-driven expanded data set enhances the discrimination by reducing the within-group variance in each group (an effect of the generation process). This increases the discrimination by maximizing the ratio of between-group to within-group variance. In addition, the Kernel estimator and the smoothing parameter induce autocorrelation and consequently an oversimplification of the classification model. This was controlled by the efficient prediction of an independent testing cohort. The sample size might thus be slightly underestimated.
Once again, two different strategies can be followed in metabolic phenotyping studies, depending on the purpose of the experimental design: metabolic exploration or biomarker discovery. In the first case, one will favor the identification of the largest number of statistically significant variations to decipher biochemical mechanisms of pathophysiological stimuli. In the second case, especially when samples are rare and expensive, the selected size should correspond to the smallest data set allowing the identification of at least one statistically significant variation.
This data-driven approach is based on the assumption that a small random cohort is a good estimator of the general population. While this seems obvious in the case of whole-organism NMR, where each sample is in fact composed of a population of identical cells or model organisms, it is less trivial in human studies. However, this hypothesis is necessary to adapt sample size determination for metabolic phenotyping. It is also important to note that aberrant behavior in the simulated data set can be observed when expanding too small a cohort. That is why a minimum of twenty samples in the training cohort appears reasonable before expansion, for the use of the SRV algorithm and the data-driven approach.
Furthermore, this sample size determination approach differs from traditional approaches since it does not seem to take into account the variations of the tested factors, the estimation of the studied effect, or the minimum difference that one seeks to achieve. However, these elements are in fact at the core of the developed method, although they are not directly visible. The minimum difference is indeed connected to the identification of statistically significant variations, and the variations of the tested factors are investigated by the identification of SRV clusters and their intergroup comparisons.
Finally, it is also interesting to note that the determined sample size is of the same order of magnitude as the number of identified biological NMR variables (114 SRV clusters after the noise-removing filter) for the C. elegans data set. In the case of metabolic phenotyping approaches, the high dimensionality of data sets is often a potential threat to the validity of the results. This Technical Note shows that the use of SRV in conjunction with a data-driven approach allows the identification of a reasonable sample size comparable to the number of metabolic NMR variables, thus driving us away from the curse of dimensionality.7
In conclusion, a new approach for sample size estimation has been developed, on the basis of only the spectral data from an initial cohort, without additional input. The guiding idea is to generate larger data sets, on the basis of statistical analysis of a small training cohort, to evaluate the number of statistically significant variables and the classification performance of simulated models of different sizes. Sample size determination can then be obtained by analyzing the number of statistically significant variations observed for different data set sizes. For biomarker discovery, one would identify the smallest data set size allowing the identification of a single statistically significant variation. For metabolic exploration, one will try to identify as many statistically significant variations as possible to generate biochemical hypotheses. This data-driven procedure thus respects the top-down hypothesis-free character of metabolic phenotyping. This approach is based on the SRV algorithm to reduce the dimensionality of the initial data set and on the Kernel smoothing density estimator to compute the inverse cumulative probability and thus generate data-driven spectra. This data-driven sample size determination allows an efficient evaluation of the number of samples to include in a metabolic phenotyping study to identify statistically significant variations of metabolic signals throughout the data set and opens up perspectives for metabolic exploration and biomarker discovery at the systems biology and medicine levels.
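The two decision rules summarized above can be written as a small selection function applied to the curve of significant variables versus data set size. This is only a sketch: the function name, the goal strings, and in particular the `fraction` tolerance used to define "close to the maximum" for metabolic exploration are assumptions introduced here, and the numbers in the usage example are made up for illustration and are not results from this study.

```python
def choose_sample_size(curve, goal, n_clusters=None, fraction=0.95):
    """Pick the smallest data set size meeting the goal.

    curve: dict mapping data set size to the mean number of
           statistically significant variables at that size.
    goal: "biomarker discovery" or "metabolic exploration".
    n_clusters: total number of SRV clusters (needed for exploration).
    fraction: assumed tolerance defining "close to the maximum".
    Returns None when no tested size reaches the target.
    """
    if goal == "biomarker discovery":
        target = 1.0  # at least one significant variation
    elif goal == "metabolic exploration":
        if n_clusters is None:
            raise ValueError("n_clusters is required for metabolic exploration")
        target = fraction * n_clusters
    else:
        raise ValueError("unknown goal: " + repr(goal))
    for n in sorted(curve):
        if curve[n] >= target:
            return n
    return None


# Hypothetical curve, for illustration only.
example_curve = {10: 0.0, 20: 0.4, 40: 3.0, 80: 9.6, 160: 10.0}
discovery_size = choose_sample_size(example_curve, "biomarker discovery")
exploration_size = choose_sample_size(example_curve, "metabolic exploration",
                                      n_clusters=10)
```

With this made-up curve the rule returns the first size whose mean count reaches one variable for biomarker discovery, and the first size whose mean count reaches the chosen fraction of the SRV clusters for metabolic exploration. In practice the lower bound of the 95% interval could be used instead of the mean for a more conservative choice.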
■
AUTHOR INFORMATION
Corresponding Author
*E-mail: [email protected].
Notes
The authors declare no competing financial interest.
■
ACKNOWLEDGMENTS
I would like to thank Drs. Aurélie Gouel, Bernard Floccard, and Guillaume Monneret and Prof. Bernard Allaouchiche as well as Drs. Bénédicte Elena, Jean Giacomotto, Laurent Ségalat, Pierre Toulhoat, and Prof. Lyndon Emsley for providing the data sets. I also thank Prof. Pierre Gressens for financial support. I thank Dr. Jérémie Palacci for technical support. I finally thank Prof. Ecochard and Kassaï for fruitful discussions.
■
REFERENCES
(1) Eng, J. Radiology 2003, 227, 309−313.
(2) Dell, R. B.; Holleran, S.; Ramakrishnan, R. Inst. Lab. Anim. Res. J. 2002, 43, 207−213.
(3) Nicholson, J. K.; Lindon, J. C.; Holmes, E. Xenobiotica 1999, 29, 1181−1189.
(4) Blaise, B. J.; Giacomotto, J.; Elena, B.; Dumas, M.-E.; Toulhoat, P.; Segalat, L.; Emsley, L. Proc. Natl. Acad. Sci. 2007, 104, 19808−19812.
(5) Fonville, J. M.; Richards, S. E.; Barton, R. H.; Boulange, C. L.; Ebbels, T. M. D.; Nicholson, J. K.; Holmes, E.; Dumas, M.-E. J. Chemom. 2010, 24, 636−649.
(6) Lavine, B.; Workman, J. Anal. Chem. 2010, 82, 4699−4711.
(7) Bellman, R. E. Adaptive control processes: a guided tour; Princeton University Press: Princeton, NJ, 1961.
(8) Broadhurst, D. I.; Kell, D. B. Metabolomics 2006, 2, 171−196.
(9) Hendriks, M. M. W. B.; van Eeuwijk, F. A.; Jellema, R. H.; Westerhuis, J. A.; Reijmers, T. H.; Hoefsloot, H. C. J.; Smilde, A. K. TrAC Trends Anal. Chem. 2011, 30, 1685−1698.
(10) Ferreira, J. A.; Zwinderman, A. Int. J. Biostatistics 2006, 2 (1), 1−38.
(11) van Iterson, M.; Hoen, P.; Pedotti, P.; Hooiveld, G.; den Dunnen, J.; van Ommen, G.; Boer, J.; Menezes, R. BMC Genomics 2009, 10, 439.
(12) Vinaixa, M.; Samino, S.; Saez, I.; Duran, J.; Guinovart, J. J.; Yanes, O. Metabolites 2012, 2, 775−795.
(13) Muncey, H. J.; Jones, R.; De Iorio, M.; Ebbels, T. M. BMC Bioinformatics 2010, 11, 496.
(14) Blaise, B. J.; Shintu, L.; Elena, B.; Emsley, L.; Dumas, M.-E.; Toulhoat, P. Anal. Chem. 2009, 81, 6242−6251.
(15) Benjamini, Y.; Yekutieli, D. Ann. Stat. 2001, 29, 1165−1188.
(16) Blaise, B. J.; Giacomotto, J.; Triba, M. N.; Toulhoat, P.; Piotto, M.; Emsley, L.; Ségalat, L.; Dumas, M.-E.; Elena, B. J. Proteome Res. 2009, 8, 2542−2550.
(17) Navratil, V.; Pontoizeau, C.; Billoir, E.; Blaise, B. J. Bioinformatics 2013, 29, 1348−1349.
(18) Parzen, E. Ann. Math. Stat. 1962, 33, 1065−1076.
(19) Rosenblatt, M. Ann. Math. Stat. 1956, 27, 832−837.
(20) Blaise, B. J.; Navratil, V.; Domange, C.; Shintu, L.; Dumas, M.-E.; Elena-Herrmann, B.; Emsley, L.; Toulhoat, P. J. Proteome Res. 2010, 9, 4513−4520.
(21) Blaise, B. J.; Navratil, V.; Emsley, L.; Toulhoat, P. J. Proteome Res. 2011, 10, 4342−4348.