COMMENT pubs.acs.org/ac
Response to Comment on “Optimized Preprocessing of Ultra-Performance Liquid Chromatography/Mass Spectrometry Urinary Metabolic Profiles for Improved Information Recovery”
Kirill A. Veselkov,† Lisa K. Vingara,‡ Perrine Masson,§ Steven L. Robinette,† Elizabeth Want,† Jia V. Li,† Richard H. Barton,† Claire Boursier-Neyret,§ Bernard Walther,§ Timothy M. Ebbels,† Istvan Pelczer,|| Elaine Holmes,† John C. Lindon,† and Jeremy K. Nicholson*,†
† Biomolecular Medicine, Department of Surgery and Cancer, Faculty of Medicine, Imperial College London, Sir Alexander Fleming Building, South Kensington, London SW7 2AZ, United Kingdom
‡ Department of Medical Informatics and Clinical Epidemiology, Oregon Health and Science University, Portland 97239, United States
§ Technologie Servier, 27 Rue Eugene Vignat, Orleans 45000, France
|| Department of Chemistry, Princeton University, Princeton, New Jersey 08544-1014, United States
Here, we consider and reply to the useful comments raised by Mattarucchi and Guillou¹ on the performance of the preprocessing strategy for UPLC/MS urinary metabolic profiles introduced in our recent article (Anal. Chem. 2011, 83, 5864–5872).² We further strengthen our conclusions on the critical importance of variance stabilization and normalization of metabolic profiles for improved information recovery. On a general note, we all agree that optimizing data acquisition methods to improve data quality in metabolic studies can benefit the subsequent data processing.

Specifically, Mattarucchi and Guillou raised a concern that a previously reported finding of constancy in the coefficient of variation of the intense LC/MS peak intensities in sample profiles contradicts our results. It should be noted that a constant coefficient of variation means that the standard deviation is proportional to signal intensity, implying the presence of multiplicative noise. In this case, a homoscedastic noise pattern can be obtained by means of the log-transformation.³ This is in fact in agreement with our findings. If the additive background noise is more substantial, as observed in the case of Anderle et al.,⁴ the log-transformation would amplify the variation of peaks in the lower intensity range. This was not evident in the diagnostic plots of our study (see Figures 1B and 2A,B of our paper²). We accept that this may not be generally the case, and thus we proposed the application of the generalized log-transformation if the additive background noise is more substantial, as suggested by Anderle et al.⁴ (as described in the Methods section²).

The correspondents pointed out that multiple signal preprocessing influences, e.g., inaccuracies in peak detection, peak alignment, and peak integration, introduce errors in the UPLC/MS peak intensity data matrix.
To minimize these influences, we used a robust preprocessing workflow, including the CentWave peak picking algorithm,⁵ which is based on multilevel wavelet signal representations and was shown to reliably detect peaks (with high precision and recall) across different peak widths and dilution series. The aim of our study was to characterize the technical variance arising from UPLC/MS platform variability over the run, e.g., due to column degradation, sample introduction equipment, contaminant build-up, MS source contamination, and detector electronic noise, but not the variance introduced by peak integration. It was shown that the higher peak intensities of urine profiles exhibit larger variability when repeatedly measured. This is a common occurrence across various molecular profiling platforms, e.g., NMR,⁶⁻⁸ LC/MS,⁴ and microarray technology.⁹,¹⁰ While we accept that the influence of various peak detection/alignment strategies and their impact on metabolic information recovery warrant further study, this point does not undermine the success of our variance stabilizing normalization strategy, as shown by the tight clustering of the repeated injections of the pooled sample (representative of the UPLC/MS platform variability) while biological variation is left intact (Figure 4 of ref 2).

It was also suggested that the higher peak intensities of UPLC/MS metabolic profiles of urine samples exhibited larger variability (Figure 1A of ref 2) due to the fact that the most abundant metabolic species exceeded the instrument’s dynamic range as a result of variable sample dilution. We can confirm that this is not the case. The dependency of the variance on the mean was calculated for repeated injections of a single pooled sample exhibiting no differences in overall concentrations. In addition, similar patterns of increased variance as a function of increased peak intensity were observed within a series of undiluted and 2-, 4-, and 8-fold diluted technical replicates.

Mattarucchi and Guillou also asked for some clarifications regarding the nature of intensity values obtained in the case of missing peaks. Missing values were avoided in our study by integrating the raw baseline signal intensity in the case of a missing peak, which is the default workflow of XCMS peak integration. This is of particular advantage since a zero-inflated distribution of data causes many problems for subsequent statistical analysis. This was also the conclusion in their recent work.¹¹ The problem caused by zero intensity values in the log-transformation can be avoided by adding a small offset prior to applying the log-transformation. The offset should be selected with respect to the intensity of the smallest peak(s). It was set to 10 in our case.

Published: November 14, 2011
© 2011 American Chemical Society
dx.doi.org/10.1021/ac202516e | Anal. Chem. 2011, 83, 9721–9722
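As an illustration of the offset log-transformation described above, the following is a minimal sketch in Python, not the code used in the study. It assumes purely multiplicative (log-normal) noise with a log-scale standard deviation of 0.1, which corresponds to roughly a 10% coefficient of variation. The function names, simulated intensity levels, and replicate count are hypothetical; only the offset of 10 comes from the text.

```python
import math
import random
import statistics

OFFSET = 10.0  # offset quoted in the text; chosen relative to the smallest peak(s)


def offset_log(intensity, offset=OFFSET):
    """Log-transform an intensity after adding a small positive offset.

    The offset keeps zero intensities finite: offset_log(0) == log(offset).
    """
    return math.log(intensity + offset)


def simulate_replicates(true_intensity, n_reps, sigma, rng):
    """Hypothetical noise model: purely multiplicative (log-normal) noise."""
    return [true_intensity * math.exp(rng.gauss(0.0, sigma)) for _ in range(n_reps)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
levels = [1e3, 1e4, 1e5, 1e6]  # illustrative peak intensities, not real data

raw_sd = []  # standard deviation of replicates on the raw scale
log_sd = []  # standard deviation of replicates after the offset log-transform
for level in levels:
    reps = simulate_replicates(level, n_reps=200, sigma=0.1, rng=rng)
    raw_sd.append(statistics.stdev(reps))
    log_sd.append(statistics.stdev([offset_log(x) for x in reps]))

for level, r, l in zip(levels, raw_sd, log_sd):
    print(f"intensity {level:>9.0f}: raw SD {r:>10.1f}, log-scale SD {l:.3f}")
```

Under this assumed noise model, the standard deviation on the raw scale grows roughly in proportion to peak intensity, whereas on the log scale it stays close to the assumed value of 0.1 at every level. If additive background noise dominated the low-intensity peaks, a plain log-transformation would not behave this way, which is the case the generalized log-transformation is meant to address.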
On a more general note, Mattarucchi and Guillou referred to their recent study¹¹ in which it was proposed to reduce the differences in overall concentration by differential dilution of urine samples and to improve data quality by increasing the scan time. Although this is of advantage, a variety of other sources, such as errors in sample dilution, variation in LC separation efficiency (e.g., due to column aging), or drifts in ionization and detector efficiencies (e.g., due to source contamination), introduce systematic biases between metabolic profiles. Such variation can be assessed and accounted for by applying computational data normalization methods. We believe that a major advantage of our work is that we compared the performance of various linear and nonlinear normalization methods. For example, Mattarucchi and Guillou used the total useful concentration normalization method in their study, which we have shown in our work to be a suboptimal choice since its performance can be badly compromised by a single large metabolite peak that varies substantially from sample to sample. We found that the median fold change normalization is least compromised by biologically relevant changes in mixture components and is thus preferable.²

In conclusion, a variety of statistical tools routinely applied for information recovery, confirmatory analysis, and predictive modeling in metabolic studies assume that the data noise is consistent across the whole intensity range. We have shown that this assumption is violated for the UPLC/MS peak intensities of metabolic profiles: the higher peak intensities of urine profiles exhibit larger variability when repeatedly measured, mainly due to the presence of multiplicative noise.² This violation of constant variance across the measurement range poses a serious challenge when standard statistical techniques are applied, as demonstrated by principal component analysis. We have shown that the UPLC/MS peak intensities of urine samples can be brought in line with this assumption by applying log-based transformations, which successfully stabilize the technical variance across the intensity range. Therefore, our conclusion that variance stabilizing transformation and normalization are critical preprocessing steps that can greatly benefit metabolic information recovery via commonly used pattern recognition tools is fully justified.
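The contrast between the two normalization approaches discussed above can be illustrated with a minimal sketch in Python. It is not the implementation used in the study: the function names and numbers are hypothetical, simple total-intensity scaling is assumed as a stand-in for the total-signal type of normalization, and a single reference profile is assumed for the median fold change method.

```python
import statistics


def total_intensity_normalize(profile):
    """Scale a profile so that its peak intensities sum to 1 (assumed stand-in)."""
    total = sum(profile)
    return [x / total for x in profile]


def median_fold_change_normalize(profile, reference):
    """Divide a profile by the median of its peak-wise fold changes to a reference."""
    folds = [x / r for x, r in zip(profile, reference) if r > 0]
    dilution = statistics.median(folds)  # robust estimate of the dilution factor
    return [x / dilution for x in profile]


# Illustrative numbers only (not data from the paper).
reference = [100.0, 200.0, 300.0, 400.0, 500.0]
# The same mixture diluted 2-fold, except that the last peak is 20 times
# higher than expected, mimicking one large biologically varying metabolite.
sample = [50.0, 100.0, 150.0, 200.0, 5000.0]

mfc = median_fold_change_normalize(sample, reference)
tot_sample = total_intensity_normalize(sample)
tot_reference = total_intensity_normalize(reference)

# Apparent fold change of the first (biologically unchanged) peak under each method.
mfc_ratio = mfc[0] / reference[0]
tot_ratio = tot_sample[0] / tot_reference[0]

print(f"median fold change: unchanged peak ratio = {mfc_ratio:.2f}")
print(f"total intensity:    unchanged peak ratio = {tot_ratio:.2f}")
```

In this toy case the median fold change estimate ignores the single outlying peak and restores the unchanged peaks to their reference values, while total-intensity scaling lets the one large peak inflate the denominator and makes every unchanged peak appear several-fold lower.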
■ AUTHOR INFORMATION
Corresponding Author
*E-mail: [email protected].

■ REFERENCES
(1) Mattarucchi, E.; Guillou, C. Anal. Chem. 2011, DOI: 10.1021/ac202416r.
(2) Veselkov, K. A.; Vingara, L. K.; Masson, P.; Robinette, S. L.; Want, E.; Li, J. V.; Barton, R. H.; Boursier-Neyret, C.; Walther, B.; Ebbels, T. M.; Pelczer, I.; Holmes, E.; Lindon, J. C.; Nicholson, J. K. Anal. Chem. 2011, 83, 5864–5872.
(3) Kvalheim, O. M.; Brakstad, F.; Liang, Y. Anal. Chem. 1994, 66, 43–51.
(4) Anderle, M.; Roy, S.; Lin, H.; Becker, C.; Joho, K. Bioinformatics 2004, 20, 3575–3582.
(5) Tautenhahn, R.; Bottcher, C.; Neumann, S. BMC Bioinf. 2008, 9, 504.
(6) Zhang, S.; Zheng, C.; Lanza, I. R.; Nair, K. S.; Raftery, D.; Vitek, O. Anal. Chem. 2009, 81, 6080–6088.
(7) Parsons, H. M.; Ludwig, C.; Gunther, U. L.; Viant, M. R. BMC Bioinf. 2007, 8, 234.
(8) Purohit, P. V.; Rocke, D. M.; Viant, M. R.; Woodruff, D. L. OMICS 2004, 8, 118–130.
(9) Durbin, B. P.; Hardin, J. S.; Hawkins, D. M.; Rocke, D. M. Bioinformatics 2002, 18, 105–110.
(10) Rocke, D. M.; Durbin, B. Bioinformatics 2003, 19, 966–972.
(11) Mattarucchi, E.; Guillou, C. Biomed. Chromatogr. 2011, DOI: 10.1002/bmc.1697.