Evaluation of Peak Picking Quality in LC−MS Metabolomics Data

Oct 26, 2010 - In fact, the typical output of a peak-picking software has a number of ... In a second phase (higher resolution QC), following the sele...
0 downloads 0 Views 4MB Size
Anal. Chem. 2010, 82, 9177–9187

Evaluation of Peak Picking Quality in LC-MS Metabolomics Data Leonid Brodsky,*,†,‡ Arieh Moussaieff,† Nir Shahaf,† Asaph Aharoni,† and Ilana Rogachev† Department of Plant Sciences, Weizmann Institute of Science, P.O. Box 26, Rehovot 76100, Israel, and Institute of Evolution, University of Haifa, Mount Carmel, Haifa 31905, Israel The output of LC-MS metabolomics experiments consists of mass-peak intensities identified through a peak-picking/alignment procedure. Besides imperfections in biological samples and instrumentation, data accuracy is highly dependent on the applied algorithms and their parameters. Consequently, quality control (QC) is essential for further data analysis. Here, we present a QC approach that is based on discrepancies between replicate samples. First, the quantile normalization of per-sample log-signal distributions is applied to each group of biologically homogeneous samples. Next, the overall quality of each replicate group is characterized by the Z-transformed correlation coefficients between samples. This general QC allows a tuning of the procedure’s parameters which minimizes the inter-replicate discrepancies in the generated output. Subsequently, an in-depth QC measure detects local neighborhoods on a template of aligned chromatograms that are enriched by divergences between intensity profiles of replicate samples. These neighborhoods are determined through a segmentation algorithm. The retention time (RT)-m/z positions of the neighborhoods with local divergences are indicative of either: incorrect alignment of chromatographic features, technical problems in the chromatograms, or to a true biological discrepancy between replicates for particular metabolites. We expect this method to aid in the accurate analysis of metabolomics data and in the development of new peakpicking/alignment procedures. The quality of biological samples and of the peak-picking program output is imperative in the analysis of LC-MS-based metabolomics data.1 Peak-picking and alignment procedures cannot correct artifacts of a nonbiological nature but rather can add new ones due to the possible wrong integration of peaks and/ or erroneous intersample peak alignments. In fact, the typical output of a peak-picking software has a number of errors that include (i) imperfection due to the sample preparation, (ii) inaccuracy generated by the LC-MS technology, (iii) incorrect peak integration and peak alignment due to the lack of compatibility between software parameters or algorithms and the * Corresponding author. Phone: +972 8 9460731. Fax: +972 8 9344181. E-mail: [email protected]. † Weizmann Institute of Science. ‡ University of Haifa. (1) Fiehn, O.; Wohlgemuth, G.; Scholz, M.; Kind, T.; Lee do, Y.; Lu, Y.; Moon, S.; Nikolau, B. Plant J. 2008, 53 (4), 691–704. 10.1021/ac101216e  2010 American Chemical Society Published on Web 10/26/2010

particular data set, and (iv) inter-replicate discrepancy of biological origin. An integration of the intrareplicate-group data normalization and quality control (QC) of the peak picking/peak alignment software output will allow the detection of errors generated by the software as well as those present in the raw data. Intrareplicate Group Data Normalization. In general, data normalization procedures aim at equalizing variations in peak intensity between samples that originate from technical factors, keeping the true biological differences unaffected. In the case of biological replicates, one does not assume any significant true biological differences between them. Therefore, the replicates (both biological and technical) are expected to be very similar in their peak intensities after proper normalization. In recent years, two main approaches for normalization of LC-MS metabolomics data were reported.2 In the first approach, statistical models were used to derive optimal scaling factors for each sample based on the complete data set.3 These models included normalization by unit norm,4 median5 of intensities, employing the maximum likelihood method,6 or adopted from a technique developed for gene expression data.7 The second approach performs normalization using a single or multiple external (i.e., added to samples after extraction) or internal (i.e., added to samples prior to extraction) standard compounds. The internal/external standards-based normalization strategy exhibits high efficacy in increasing the homogeneity of replicate samples8 but adds complexity to sample preparation.8-10 This technical complexity limits the use of the technique, hence, supporting the need for statistical data normalization (scaling) methods. The list of data scaling methods11 consists of techniques such as centering, scaling (autoscaling, range scaling, Pareto scaling, (2) Katajamaa, M.; Oresic, M. J. Chromatogr., A 2007, 1158, 318–328. (3) Crawford, L. R.; Morrison, J. D. Anal. Chem. 1968, 40, 1464. (4) Scholz, M.; Gatzek, S.; Sterling, A.; Fiehn, O.; Selbig, J. Bioinformatics 2004, 20, 2447. (5) Wang, W.; Zhou, H.; Lin, H.; Roy, S.; Shaler, T. A.; Hill, L. R.; Norton, S.; Kumar, P.; Anderle, M.; Becker, C. H. Anal. Chem. 2003, 75, 4818. (6) Oresic, M.; Clish, C. B.; Davidov, E. J.; Verheij, E.; Vogels, J.; Havekes, L. M.; Neumann, E.; Adourian, A.; Naylor, S.; van der Greef, J.; Plasterer, T. Appl. Bioinf. 2004, 3, 205. (7) Hartemink, A. J.; Gifford, D. K.; Jaakkola, T. S.; Young, R. A. Pac. Symp. Biocomput. 2001, 4266, 132–140. (8) Sysi-Aho, M.; Katajamaa, M.; Yetukuri, L.; Oresic, M. BMC Bioinf. 2007, 8, 93. (9) Hermansson, M.; Uphoff, A.; Ka¨kela¨, R.; Somerharju, P. Anal. Chem. 2005, 77, 2166–2175. (10) Bijlsma, S.; Bobeldijk, I.; Verheij, E. R.; Ramaker, R.; Kochhar, S.; Macdonald, I. A.; van Ommen, B.; Smilde, A. K. Anal. Chem. 2006, 78, 567–574. (11) Van den Berg, R. A.; Hoefsloot, H. C.; Westerhuis, J. A.; Smilde, A. K.; van der Werf, M. J. BMC Genomics 2006, 7, 142.

Analytical Chemistry, Vol. 82, No. 22, November 15, 2010

9177

vast scaling, and level scaling), and data transformation (log transformation and power transformation). These methods do not rely on standards and could therefore be applied to any data set; however, they seem to be inferior to the standard based methods.8 Here we address the between-replicate normalization issue by applying the quantile normalization method that originated from microarray data preprocessing.12,13 Quantile normalization of microarray data is widely used and has proven to be very effective in the detection of differentially expressed genes. When applied to metabolomics data, the quantile normalization method could be classified as a statistical normalization method because it is not based on internal standards. Recently, Draisma and colleagues14 demonstrated the application of quantile normalization to the general normalization of LC-MS and NMR of human metabolomics data. Here we study the application of quantile normalization to the biological replicates in LC-MS data and show that its performance is comparable to the one of the currently applied standard-based methods. Quality Control. An inaccurate estimation of per-ion intensity differences between samples might be a result of issues related to the capacity of software packages used for peak picking and alignment. The currently most widely used programs such as XCMS,15 M/Zmine,16 OpenMS,17 and MetAlign,18 might generate imprecise peak intensity data. A major challenge when using these various programs is the selection of optimal parameter settings, which can be data set dependent. Recently, Lange et al.19 prepared the “ground truth” benchmark of aligned chromatograms. The aligned chromatogram features (peaks) were taken from the conventional Arabidopsis thaliana LC-MS runs of biological replicates. The quality control (QC) measure developed in this paper was based on this “ground truth” alignment of chromatogram features and was applied to evaluate the alignment performance of different software packages. In this study we used the “ground truth” feature alignment reported by Lange et al. as the test data set for our procedure, aimed to examine whether the suggested “ground truth” feature alignment contains any retention time (RT) and m/z local discrepancies between intensity profiles of replicates. The similarity of the log-intensity profiles for sample-replicates (both biological and technical) is considered to be an innate QC measure.8,19 Indeed, it is intuitively obvious that in a properly normalized groups of replicates, the log-intensity profiles across peaks inside this group have to be very similar in their general and local details. Thus, our QC estimates the total similarity between replicates after data normalization inside every replicate (12) Bolstad, B. M. Unpublished manuscript, 2001, http://bmbolstad.com/stuff/ qnorm.pdf. (13) Bolstad, B. M.; Irizarry, R. A.; Astrand, M.; Speed, T. P. Bioinformatics 2003, 19, 185–193. (14) Draisma, H. H.; Reijmers, T. H.; van der Kloet, F.; Bobeldijk-Pastorova, I.; Spies-Faber, E.; Vogels, J. T.; Meulman, J. J.; Boomsma, D. I.; van der Greef, J.; Hankemeier, T. Anal. Chem. 2010, 82, 1039–1046. (15) Smith, C. A.; Want, E. J.; O’Maille, G.; Abagyan, R.; Siuzdak, G. Anal. Chem. 2006, 78, 779–787. (16) Katajamaa, M.; Miettinen, J.; Oresic, M. Bioinformatics 2006, 22, 634– 636. (17) Kohlbacher, O.; Reinert, K.; Gro¨pl, C.; Lange, E.; Pfeifer, N.; Schulz-Trieglaff, O.; Sturm, M. Bioinformatics 2007, 23, 191–197. (18) Lommen, A. Anal. Chem. 2009, 81 (8), 3079–3086. (19) Lange, E.; Tautenhahn, R.; Neumann, S.; Gro¨pl, C. BMC Bioinformatics 2008, 9, 375.

9178

Analytical Chemistry, Vol. 82, No. 22, November 15, 2010

group and evaluates if there are local artificial discrepancies between sample replicates. Here, we applied replicate-based QC measure in a two-phase QC study. In the first phase (general QC), the global similarity between replicates (estimated as Fisher’s Z-transformed correlations20 coupled with the regular Pearson correlations) was used to evaluate the optimal parameter setting of the peak-picking/ peak-alignment software. In a second phase (higher resolution QC), following the selection of the best software parameters, the implemented segmentation procedure allowed the detection of the RT and mass-peak (m/z) local regions that are enriched with discrepancy between sample-replicates. The RT-m/z position of such local regions might point either to specific errors in the alignment of chromatograms, to unexpected variation between biological replicates for specific metabolites, or to technical problems, such as sample mislabeling, problems with the LC gradient, chromatographic column performance, and more. EXPERIMENTAL SECTION LC-MS Metabolomics Data. The raw data and a table with the aligned mass peaks for the Arabidopsis-Four-Tissues data set (original reported data set) were obtained from Matsuda et al.21 that analyzed extracts of Arabidopsis thaliana seedlings using LC-MS. The tomato introgression lines (ILs) originated from a cross between the wild species Lycopersicon pennellii and the cultivated tomato (cv. M82). For the generation of the ILs data set, fruit peel tissue was extracted and profiled by the use of an ultraperformance liquid chromatography-quadrupole time-offlight-mass spectrometry (UPLC-QTOF-MS) system as described previously22 with several modifications; separation was performed using a 5%-60% acetonitrile gradient containing 0.1% formic acid with a 9.5 min total run time. The XCMS software package15 was used for peak picking and alignment. For analysis of Arabidopsis ecotypes samples we used two data sets, for both of them whole roots of seedlings were collected from five ecotypes 6 days postgermination. In the first data set (i.e. whole root extracts), the sample powder was extracted in 75% MeOH containing 0.1% formic acid. In the second data set (i.e. comparison of biological samples to technical controls), ecotype samples were subjected to an extensive preparation procedure, which included significant samples dilution and several consecutive drying-resuspension steps in order to keep a suitable final concentration of lowabundance metabolites before LC-MS analysis. Technical controls were given a similar treatment and analyzed in an identical manner as biological samples with the exception that no root samples were introduced [i.e., all masses originated from solutions and the disposable material (e.g., plastics)]. Samples of both sets were then profiled by the abovementioned UPLC-QTOF-MS system with a modified gradient: 5%-55% acetonitrile containing 0.1% formic acid and a 34 min separation time were used. The “ground truth” data set was generated by Lange et al.19 from a typical Arabidopsis thaliana metabolomics experiment, with (20) Snedecor, G. W.; Cochran, W. G. Statistical Methods; Iowa University Press: Ames, IA, 1989. (21) Matsuda, F.; Yonekura-Sakakibara, K.; Niida, R.; Kuromori, T.; Shinozaki, K.; Saito, K. Plant J. 2009, 57, 555–577. (22) Mintz-Oron, S.; Mandel, T.; Rogachev, I.; Feldberg, L.; Lotan, O.; Yativ, M.; Wang, Z.; Jetter, R.; Venger, I.; Adato, A.; Aharoni., A. Plant Physiol. 2008, 147, 823–851.

different plant lines and treatments measured at multiple time points in triplicates. Briefly, a freshly ground Arabidopsis thaliana leaf tissue was extracted in methanol and measured on two different LC-MS setups: data set M1 was measured on capillary LC-electrospray (ESI)-QTOF-MS and data set M2 was measured on LC-ESI-QTOF-MS. The feature (mass peaks) finding was done with XCMS using the “centWave” algorithm. The mass peaks from the same substance were grouped together as annotated feature groups. The highly reproducible feature groups were aligned, and subsequently, the aligned feature groups were split up into their consensus features, which form the alignment ground truth data set.19 Segmentation Algorithm for Detection of Erroneous RT (m/z) Scan Fragments. The segmentation algorithm used was described previously by Brodsky et al.23 The enrichment score of a scan fragment is based on a normalized difference in the deviation between the sample-replicates’ log-intensities from the general mean of all intersample-replicate deviations across all scans. This difference is normalized by the SD of all intersamplereplicate deviations across all scans for the given pair of sample replicates. Thus, the normalized differential of a peak m is

am )

i j ) - log(Sm )) - mean (log(Sm SD

i where Sm is a signal in peak m of sample i (correspondingly Sjm); mean is the average of the log (Sim) - log (Sjm) differential across all peaks. Consider a set of N consequent peaks in a scan fragment (a1, a2..., aN). Denote the score of the interval [i,j] as

j

Aij )

∑a

m

m)i

√j - i + 1

The segmentation algorithm detects a series of high score scan intervals or fragments; the fragment with the highest score in the scan followed by the next highest score fragment that does not intersect with the previous one, followed by the next highest score fragment that does not intersect with the two previous ones, and so on.24 The algorithm’s procedure is based on two theorems describing numerical inequalities. The statement of the first theorem is as follows: calculate the scores of subsequent enclosed fragments, each starting at the beginning of the scan. For a certain scan position: (i) if a subsequent portion of the above enclosed fragments that end in positions before the selected position all have positive scores and (ii) if a subsequent portion of the above enclosed fragments that end in position after the selected position all have negative scores, then the highest score fragment is either to the left or to the right from the selected position; the position is a dividing one in the search for the highest score interval. Therefore, after finding all such positions by sequential screening through the expanding series of enclosed fragments, the entire (23) Brodsky, L.; Kogan, S.; BenJacob, E.; Nevo, E. Proc. Natl. Acad. Sci. U.S.A. 2010, DOI: 10.1073/pnas.1011134107. (24) Leontovich, A. M.; Brodsky, L. I.; Gorbalenya, A. E. Biosystems 1993, 30, 57–63.

scan will be sliced into regions for the subsequent search inside these shorter regions for the high score fragments (ones that are most enriched with errors). The second theorem provides an additional type of division inside a region: In the expanding series of enclosed fragments, all of which start at the beginning of the region and have positive scores, the fragment with the highest score can be detected (maximum of the score profile). The theorem states that this maximum point is the division position. Specifically, the fragment with the highest score inside the zone should be either to the left from this division position or to the right. The segmentation proceeds via such binary-search-like divisions of the scan until the scan is fully segmented. The scan segments with a high enough score of enrichment by errors are the fragments of interest. A full description of the applied segmentation procedure (together with its mathematical theorems and a pseudocode scheme) can be found at Brodsky et al.23 RESULTS AND DISCUSSION Intrareplicate Group Data Normalization. Coefficient of Variation (CV) Based Efficacy of Quantile Data Normalization. Coefficient of variation (CV ) SD/mean) of intensities of a specific mass-peak across a number of sample-replicates provides an indication for the quality of normalization.8 The authors (SysiAho et al.8) considered a normalization method as appropriate if it decreased the median CV (CV-median) as well as the standard deviation of CVs (CV-sd) calculated across all mass-peaks. By using a multiple linear regression of the mass-peak intensity correction against intensities of internal standards as independent variables, the authors obtained a 30% improvement of CV-median on a 1400 mass-peaks of their test data set of sample-replicates.8 Using the same test data set, these authors examined several alternative statistical normalization (scaling) methods that are not based on internal controls. Implementation of the first method (L2N) resulted in 5% of CV-median improvement, while the second procedure (3STD) gave a 40% deterioration (increase) of CVmedian. In this study, we applied quantile normalization to the same data set of replicates and obtained a 7.5% improvement of CV-median and 11% improvement of the CV standard deviation (CV-sd), suggesting this normalization method is superior to other tested statistical (i.e., not based on internal standards) normalization methods. However, the difference in CV improvement between quantile normalization and the standard based method is significant. Subsequently, the coefficient of variation QC measure was applied to determine the improvement in data quality following quantile normalization of several peak-intensity outputs obtained by the peak intensity acquisition software packages from the LC-MS runs of Arabidopsis-Four-Tissues data set21 (four different Arabidopsis tissues, eight replicates per tissue). The authors (Matsuda et al.21) generated an output containing approximately 15 000 mass peaks. We used the most dissimilar group of replicates (CL, cauline leaf tissue) from this output and by quantile normalization obtained a 31.5% improvement of the CV-median and a 22.4% improvement of the CV-sd. Next, we applied the XCMS peak picking/peak alignment software with QC optimized parameters (see details below) to the same data set and obtained ∼40 000 mass peaks. Once again, for the most heterogeneous CL Analytical Chemistry, Vol. 82, No. 22, November 15, 2010

9179

Table 1. Quantile vs Linear Normalizationa correlations

pair-ttest

total differentiations

Euclidean distances

pair-ttest

total differentiations

ttest_CL ttest_FL ttest_RL ttest_ST

1.94 × 10-16 2.48 × 10-9 7.13 × 10-16 1.5 × 10-19

0.0596 0.0215 0.0363 0.0311

ttest_CL ttest_FL ttest_RL ttest_ST

1.34 × 10-11 1.27 × 10-18 1.03 × 10-13 2.17 × 10-22

-38 600.00 -8 997.07 -13 734.76 -5 966.30

a Per replicate group pair-ttests for intra-group correlations and intra-group Euclidean distances together with per group total differentiations in correlations and distances for the Arabidopsis-Four-Tissues data set. Highly significant pair-ttest values and directions of differentiations demonstrate that the quantile normalization overperforms the linear one: the correlations between replicates are higher and the Euclidean distances are lower.

group of replicates, the quantile normalization resulted in 22.6% improvement of the CV-median and 22.1% improvement of CVsd. Comparison of Quantile and Linear Normalizations. After normalization of replicates inside each group, one should expect better corroboration between the replicates’ signals. Namely, the correlations between sample-replicates have to increase and the Euclidean distances have to decrease (if there was no rescaling of values during normalization). Almost all popular normalization and scaling methods in metabolomics (linear averaging, centering, autoscaling, range-scaling, pareto-scaling, vast-scaling, and levelscaling) are linear ones. Therefore, they do not change the correlations between samples. However, the normalization methods (without rescaling) could be checked if they improve the between-replicate Euclidean distances. On the example of the Arabidopsis-Four-Tissues21 data set, Table 1 demonstrates that quantile normalization statistically significantly overperforms linear normalization (centering) in improving between-replicate Euclidean distances. Also, the quantile normalization statistically significantly improves the initial between-replicate correlations (and, therefore, the result of normalization is significantly better than the result of any linear normalization or linear scaling according to this measure). Taking two normalizations (quantile and linear) and correlations/Euclidean distances of one sample from the Arabidopsis-Four-Tissues data set against its replicate group as an example, Figure S-1 in the Supporting Information demonstrates an improvement in inter-replicate correlations (panel A, correlations are higher) and in Euclidian distances (panel B, distances are lower) after quntile normalization in comparison with the results of linear averaging normalization. Effect of Quantile Normalization on Differentiation between Biological Samples and Technical Control Samples. The Zcorr based tuning of parameters in peak picking programs typically results in a high number of detected mass peaks (tens of thousands), whereas, in the majority of samples, many of them are at the noise level in a subset of samples. Thus, the quantile normalization of the sample specific intensity distributions is in fact an ordering of the biological level signals of each sample relative to the noise level signals, which are equalized across samples. By this lineup relative to a sample specific noise, the biological signals from different samples become mutually comparable. The effect of quantile normalization in two groups of “replicates” is very illustrative when comparing biological samples (one group of “replicates”) with technical controls (another replicate group). The technical control samples contain the extraction material such as buffers (also present in the biological sample) 9180

Analytical Chemistry, Vol. 82, No. 22, November 15, 2010

but without any biological substance. Figure S-2A in the Supporting Information shows the divergence in the PCA positions of a set of technical controls associated with a set of biological samples, when no normalization was performed. When performing quantile normalization of all sets of technical controls jointly, they are clustered in a more compact manner in the PCA space (Figure S-2B in the Supporting Information). In addition, once quantile normalization of all biological samples is carried out as one group of “replicates”, the relative deviation of the biological sets from the technical control cluster in the PCA space becomes apparent (Figure S-2B in the Supporting Information). As seen in Figure S-2C in the Supporting Information, once quantile normalization was separately applied to the two sets of samples (one biological and the second technical controls), the differentiation between the biological sets in mass-peak intensities could be accurately estimated and is most probably representing a true biological effect. Quality Control. General QC of Data and Parameter Tuning for a Peak-Picking/Peak-Alignment Program. The general QC of data and tuning of the peak picking procedure parameters were based on the Pearson correlations between intensity profiles of sample replicates. For each sample, the average correlation coefficient regarding the replicates was calculated. The average correlation coefficient is Z-transformed:20 Zcorr ) 0.5Ln((1 + corr)/(1 - corr))(n - 3)1/2, where n is the sample-profile length (number of peaks in the peak picking output). The advantage of Zcorr in comparison with the basic Pearson correlation is that Zcorr values are normally N(0,1) distributed across data sets with different numbers of detected mass peaks. Such normality of the Zcorr enables (i) comparison of Zcorr values across different peak picking/alignment outputs with different numbers of detected peaks and (ii) selection of the best output according to the p-value of Zcorr deviation from the N(0,1) nullhypothesis. For selecting the optimized parameter settings of a software (here XCMS), we developed a procedure in which the peakpicking/alignment program runs numerous times (between 50 and 200) with different parameter settings. Each parameter setting is generated by a random alteration of parameter values (in a reasonable range) and algorithms’ modes. Finally, out of multiple peak picking/alignment outputs, the output with the best average Zcorr across individual samples is selected for subsequent data analysis. While the Z-transformation generates N(0,1) distributed Zcorr values only for samples with normal (or at least symmetrical) distributions of log-signals,20 the LC-MS generated distribution of log-signals can be exceedingly nonsymmetrical, with the majority of low signals in samples just about the noise level. We

Figure 1. The Zcorr and AvCorr (scaled up 400 times) profiles across XCMS outputs from 52 different parameter settings for the Arabidopsis-Four-Tissues data set. Each output consists of four groups (tissues), each group contains 8 samples. For every sample-profile of a particular XCMS output, the Pearson correlation coefficients against the other 7 sample profiles of the replicate group are calculated. The averaged across group correlation (AvCorr) is Ztransformed: Zcorr ) 0.5Ln((1 + corr)/(1 - corr))(n - 3)1/2, where n is the sample-profile length (number of peaks in the XCMS output) and these Zcorr per-sample values across 52 XCMS outputs (52 × 8 × 4 ) 1664 dots of the graph) are depicted on the plot. According to Zcorr coupled with AvCorr: “A” marks the best parameter setting (Re29), Zcorr is maximum and AvCorr is one of the best; and “B” marks one of the poorest parameter setting (Re30), both Zcorr and AvCorr are low.

addressed this issue by an additional quantile normalization procedure in which the real distributions of log-signals in samples were transformed to the normal (Gaussian) distributions with the same mean/SD as in the original sample distribution. Such “Gaussianization” of the sample log-signals is performed only for the purpose of Zcorr calculation and following the selection of the optimized parameter settings for the peak-picking/peak alignment procedure. The Zcorr profile across samples and different parameter settings of XCMS performed with the Arabidopsis-Four-Tissues data set21 is depicted in Figure 1. We examined the Pearson correlation coefficients between all pairs of replicates for Zcorr best and Zcorr poor XCMS outputs (parts A and B of Figure 2, respectively) and for the original reported data set by Matsuda and colleagues (Figure 2C). The best Zcorr XCMS output contained 39 600 mass peaks with an average Pearson correlation between replicates equal to 0.965 (Figure 2D). One of the poorest Zcorr XCMS output for this data set contained 9 500 mass peaks, in which the average Pearson correlation between replicates was 0.93 (Figure 2D). The original data set of Matsuda et al. contained 15 000 mass peaks with an average Pearson correlation between replicates equal to 0.83 (Figure 2D). The comparison of average Zcorr values of the three correlation matrixes (Figure 2E) shows that the best XCMS output results in a superior average Zcorr value than the Zcorr of the poor XCMS output and the Zcorr of the output of Matsuda and colleagues. In order to examine whether the best XCMS parameter setting selected for the Arabidopsis-Four-Tissues data is also the most favorable for different LC-MS data sets, the same XCMS parameter settings were applied to the analysis of three different LC-MS outputs generated by analyzing (i) tomato introgression lines (IL-1, IL-2), originating from a cross between the wild species

Lycopersicon pennellii and the cultivated tomato (cv. M82); each line contains a single L. pennellii chromosome segment inserted in either chromosome 1 or 2; (ii) tomato ILs (IL-3, IL-4) in which each line contains a single L. pennellii chromosome segment inserted in chromosomes 3 and 4; and (iii) whole root extracts derived from five Arabidopsis ecotypes. The Zcorr profiles for the four tomato ILs (IL-1 to IL-4) were highly correlated: both had the best Zcorr with the R36 parameter setting (arrow B, Figure S-3 in the Supporting Information). However, the best Zcorr for the samples derived from the Arabidopsis ecotypes data set was obtained using the R14 parameter setting (arrow A, Figure S-3 in the Supporting Information). These results demonstrated that the optimal Zcorr parameter setting could be different between diverse LC-MS data sets. On the basis of the data sets analyzed in this study, we could suggest that an optimized XCMS parameter set will generate an output with a Zcorr of above 150 and this value can be used as an approximate threshold for the acceptance criterion of an XCMS output. Optimization of XCMS Parameters for IL Samples via Discriminant Analysis. Approximately 200 XCMS runs (matched filter algorithm) were performed with the data set of 117 IL samples (chromosomes 7, 8, 9, and 10). The randomized ∼200 parameter combinations were uniformly distributed inside the parameter’s hyper-cube in the ranges as follows: (1) Parameters of the “xcmsSet” method: fwhm ) (5:50), step ) (0.02:0.2), steps ) (1:5), mzdiff ) (-0.1:0.2), snthresh ) (2:20), profmethod ) (“bin”,“binlin”), max ) (500:5000). (2) Parameters of the “group” method: bw ) (1:50), mzwid ) (0.05:0.3), minfrac ) (0.1:0.5), minsamp ) (1:3), max ) (2:10). (3) Parameters of the “retcor” method: missing ) (1:29), extra ) (1:29), method ) (“loess”, “linear”), span ) (0.1:2.0), family ) (“symmetric”, “gaussian”). Three per sample outputs of each run of our MetaboQC program were taken into account for the optimization: average Z-correlation of a sample against samples of the same replicate group (“Zcorr”), average correlation of a sample against samples of the same replicate group (“AvCorr”), the total score of local discrepancies of a sample against samples of the same replicate group (“ScoreQC”). The quadratic discriminant analysis (QDA module of the R software,25 MASS package) was used in order to select those parameters and their pair interactions that provide sufficient classification of 117 samples in ∼200 runs (22 644 cases) into two classes: “good” and “not-good”. The classes were defined as follows: (1) based on Zcorr (the “good” class consists of 1033 cases with Zcorr more than 170); (2) based on Zcorr and ScoreQC (the “good” class consists of 633 cases with Zcorr more than 150 and ScoreQC less than 10 000); (3) based on AvCorr (the “good” class consists of 10 613 cases with AvCorr more than 0.88); (4) based on Zcorr and AvCorr (the “good” class consists of 651 cases with Zcorr more than 150 and AvCorr more than 0.88). On the first iteration of QDA analysis, the parameters with major impact of their linear effects on the output were selected. These parameters are step, mzdiff (from the “xcmsSet” method) and mzwid, minfrac, minsamp (from the “group” method). On the second iteration, pair interactions of these major parameters were added to the linear discriminant analysis model across all (25) R-package, CRAN (http://www.r-project.org/).

Analytical Chemistry, Vol. 82, No. 22, November 15, 2010

9181

Figure 2. The quality of few peak-picking software outputs for the Arabidopsis-Four-Tissues data set. The correlation matrices (A, B, and C) are highlighted by color; see part A for the color index. (A) The matrix of the Pearson correlations between samples inside each of the four replicate groups from the XCMS output with the best Zcorr (39 600 peaks). (B) The matrix of the Pearson correlations between samples in each of the four replicate group from the XCMS output with the one of the poorest Zcorr (9 500 peaks). (C) The matrix of the Pearson correlations between samples in each of the four replicate group from the original data set21 (15 000 peaks). It is worse than the Zcorr matrix of one of the poorest XCMS run. (D) Average per replicate group Pearson correlations for the three peak-picking/peak-alignment outputs (XCMS, best and poor Zcorrs, and the original data set by Matsuda et. al21). These correlations cannot be properly compared across different software outputs because the correlation coefficient is heavily dependent on the length (i.e., number of peaks) of correlated profiles, and different outputs may have a different numbers of peaks. (E) Average per group Zcorrs for the three peak-picking/alignment outputs examined. The Zcorr values can be statistically compared across all three outputs since they reflect the significance of deviations of the real Zcorr values from the N(0,1) null expectations.

Table 2. Discriminant Functions for Four Classificationsa parameter

Zcorr > 170

Zcorr > 150 and score < 10 000

AvCorr > 0.88

AvCorr > 0.88 and Zcorr > 150

fwhm step steps mzdiff snthresh profmeth Amx bw mzwid minfrac minsamp max missing extra method span family step_mzdiff step_minfrac step_mzwid step_minsamp mzdiff_minfrac mzdff_mzwid mzdiff_minsamp minfrac_mzwid minfrac_minsamp mzwid_minsamp false negatives false positives

-0.74 1100.19 0.77 147.91 -1.59 -0.05 -0.02 0.02 131.92 -220.59 69.31 3.65 -0.98 -0.82 1.16 -13.35 -1.15 -6462.01 3062.35 -4033.45 -677.66 934.13 1850.59 109.89 -680.41 -108.13 343.90 6.4% 6.1%

0.02 13.21 0.02 8.03 -0.04 0.01 0.00 0.02 92.92 66.03 5.75 -0.16 0.03 0.11 -0.02 1.01 0.04 367.89 -302.08 117.02 36.04 -121.70 -189.79 1.09 -31.24 -5.88 -58.99 20.1% 9.4%

-0.02 -5.92 -0.03 -17.08 -0.01 0.02 0.00 -0.01 -45.86 2.14 -6.69 0.02 -0.05 -0.04 -0.01 0.22 0.01 -20.26 -9.95 11.07 7.80 34.59 24.05 6.23 -7.41 -0.77 30.91 26.0% 45.6%

0.40 -1665.16 -3.73 -1135.60 1.35 0.34 0.02 -0.11 -390.72 45.63 -102.64 -3.11 1.28 0.57 -1.07 11.32 1.26 10302.79 -2770.02 5353.77 733.73 -222.36 -1228.09 12.47 730.58 149.09 -342.33 6.8% 6.3%

a The parameters in bold are ones with the most impact on the classifications. Percents of false and true positives of classifications are in the last two lines of the table.

parameters. The resulting discrimination functions for standardized parameters in the above four classifications and sensitivity/ 9182

Analytical Chemistry, Vol. 82, No. 22, November 15, 2010

specificity of classifications are depicted in Table 2. The table shows that the reasonable percents of false positives and false

Figure 4. The problematic fragment in the RT-m/z peak sorting. The figure represents one of the detected fragments from the RT-m/z peak sorting that is enriched by discrepancies between profiles of two sample replicates. Two sorted profiles with the best Zcorr XCMS output [CL4 (dark blue) and CL5 (pink)] derived from the cauline leaves (CL) tissue of the Arabidopsis-Four-Tissues data set21 were used. The score of the fragment is about 55 SDs, and the fragment is depicted by the dashed yellow bar that mark the fragment margins. The depicted height of the dashed yellow bar is a quarter of the true score of the problematic fragment. Therefore, the Y-axis is scaled down 4 times for better visual presentation of the figure.

Figure 3. Discriminant analysis of randomly selected XCMS parameter combinations. (A) The best combinations of values for five major standardized XCMS parameters. The parameter combinations prove (i) a high Zcorr with lower averaged correlations between samples (AvCorr, blue continues profiles) and (ii) a high Zcorr and a high AvCorr (yellow dashed profiles). Each type of profiles is sliced into two symmetric groups. The visible difference between the two types of profiles is in the minfrac parameter that is close to zero (mean) for the second type of profiles. (B) Two patterns for true values of the five major XCMS parameters. For ILs, data from both patterns provide a high Zcorr and a high AvCorr of the XCMS output.

negatives (sensitivity/specificity) are reached in Zcorr and AvCorr + Zcorr classifications. The optimal combinations of five major parameters for Zcorr and AvCorr + Zcorr classifications were found in the following way. First, the parameter combinations were simulated based on standard normal distributions of every parameter. Second, with the use of the discriminant functions limited by the effects and interactions of five major parameters, the best parameter combinations were selected from 60 000 simulated combinations of standardized parameters. Several profiles of standardized parameters that provide the QDA classification in the “good” class are depicted in Figure 3A. One could see that (1) profiles of good standardized parameter combinations according to Zcorr and Zcorr + AvCorr classifications are similar. (2) Basically there are two symmetrical combinations of standardized parameters that equally provide a classification in the “good” group {step(Up), mzdiff (Down), minfrac(Up), mzwid (Down), minsamp(Up)} and {step(Down), mzdiff (Up), minfrac(Down), mzwid (Up), minsamp(Down)}. (3) There is a thin but clear difference between parameter combinations providing Zcorr(high) + AvCorr(lower) and Zcorr(high) + AvCorr(higher). The difference is in the

minfrac parameter. Namely, if the standardized minfrac is about zero (mean), then both Zcorr and AvCorr are high. The real values of parameters (two symmetrical combinations with the same minfrac and minsamp values) that provide a classification of the XCMS output samples into the “Zcorr(high) + AvCorr (higher)” group are depicted in Figure 3B. The true parameter values were restored from the values of standardized parameters based on estimations of the mean and SD of true parameters according to their uniformly distributed values in each specific range. Higher Resolution Quality Control through the Detection of Significant Local Discrepancies between Sample Replicates. Following quantile normalization inside groups of sample replicates, one may anticipate a good correlation between their log-intensity profiles; nevertheless, the inter-replicate discrepancies of intensities for a number of individual peaks could be found. However, when analyzing biological replicates, we assume that the discrepancies are uniformly distributed over the RT-m/z plane, i.e., there are no local RT-m/z neighborhoods, which are significantly enriched by “local errors”. Thus, if such fragments are detected, they could indicate either (i) biological deviations in metabolite intensities between replicates for specific RT-m/z regions, (ii) incorrect integration of peaks in one of the replicates, (iii) a poor alignment of peaks in two or more replicates, or (iv) technical problems of the instrument. The detection of fragments that are significantly enriched with “local errors” is performed by the segmentation procedure23,24 that carries out the following: first, all peaks are sorted in a nested manner according to the RT for the basic sorting and inside this sorting according to the m/z (or in an opposite nesting, first the m/z and subsequently RT). In this sorting, the ttest-statistics-like measured deviation between peak intensities of two sample replicates A and B is calculated (tp ) (pA - pB)/(SD), where pA and pB are intensities of the same peak in samples A and B, respectively, and SD is the standard deviation of such differences across all peaks). Under the null Analytical Chemistry, Vol. 82, No. 22, November 15, 2010

9183

/

Figure 5. The distribution of problematic fragments over the RT-m/z plane; the template of peak positions for the best XCMS output of the Arabidopsis-Four-Tissues data set21 containing 39 600 peaks. A number of positions are highlighted by colors that indicate the average scores of the problematic fragments between all pairs of replicates that cover this particular peak. The RT zones 1-6 and 12-18 min are enriched by a series of problematic fragments. The inter-replicate discrepancies are concentrated in the RT interval 12.5-14 min. Moreover, intervals with a high score could be seen in several other RT windows. Since this data set has high overall fitting of replicates to each other, and the SD of inter-replicate deviates is small, the segmentation method has found many high-score intervals of inter-replicate deviations.

hypothesis on inter-replicate deviations generated by chance, the sum of tp across the fragment divided by the square root

n t of the fragment’s length (A ) ∑ p)1 p √n) is the random variable distributed normally N(0,1). The segmentation algorithm detects nonintersecting fragments in the RT-m/z sorting that are enriched with high inter-replicate deviations. Each detected fragment that is enriched by the between-replicate deviations receives a score A that provides the N(0,1) based significance score of the fragment. However, it is important to note that the score of the fragment depends on the SD of the inter-replicate deviations. Thus, if the general fitting between replicates is poor (i.e., the SD of the inter-replicate deviations is high), we expect lower scores across all problematic fragments, and therefore, lower numbers of significantly problematic fragments will be detected. The details of the segmentation algorithm are described in the Experimental Section. Three examples provided below demonstrate the application of the QC procedure for in-depth detection of “local errors” between intensity profiles of sample replicates. Identification of “Local Errors” in the Arabidopsis-Four-Tissues Data Set. The detection of local discrepancy between sample replicates in the Arabidopsis-Four-Tissues data21 was performed using either the original data set or with the best Zcorr output of XCMS. In the best Zcorr XCMS output, the algorithm detected numerous RT fragments enriched with discrepancy between replicates. Figure 4 depicts one of the detected fragments from peak sorting according to their RT-m/z coordinates. The higher m/z ions in the RT region 8.30-8.33 min are deviating in their intensities between the two replicates. The ttest-distribution based

Figure 6. The local inter-replicate discrepancies detected in the tomato ILs data set. (A) Total ion chromatograms (TIC) of tomato extracts (two biological replicates of the IL-7) acquired in the positive mode. The RT positions of two peaks with masses 957 and 1093 (inside the black frame) are different in the top and bottom profiles of the two replicate samples. (B) Detected peaks of superimposed and aligned chromatograms of tomato ILs from the same group of replicates are depicted as the dark blue spots on the RT-m/z plane. The regions of aligned peaks with deviating intensities are highlighted in color. The local neighborhoods enriched by between-replicate deviations are concentrated in RT interval 6.3-6.7 min (the area of high m/z values). The regions were detected by the segmentation algorithm. The discrepancy between the TIC chromatograms inside the frame matches well with the bands of bright blue spots on the RT-m/z plane. 9184

Analytical Chemistry, Vol. 82, No. 22, November 15, 2010

Figure 7. Flowchart for the MetaboQC program.

score for this fragment enrichment with discrepancies is about 55 SDs. Similar analysis using the original data set detected about the same RT interval of discrepancy between the two replicate profiles (Figure S-4 in the Supporting Information) suggesting that the detection of these local discrepancies was not dependent on the peak-picking/peak-alignment software used. The detection of local discrepancies between sample replicates (RT or m/z intervals significantly enriched by discrepancies between replicates) provides an overall picture regarding RT and m/z windows of the inter-replicate discrepancy. Figure 5 indicates that in the best XCMS output for the Arabidopsis-Four-Tissues data set, the major discrepancy between replicates was observed in about a 13 min RT interval. A wider RT window enriched with problematic fragments was between 12 and 16 min. Approximately the same RT wide window is enriched with problematic fragments in the original data set of Matsuda et al. (Figure S-5 in the Supporting Information). Their data appeared to contain less problematic intervals than the best XCMS output. However, this observation is most probably due to the lower average score of these data defective intervals that result from the higher SD of general discrepancy between replicates. For every sample, the per-individual-peak normalized total score of all inter-replicate discrepancies was calculated. The profiles across samples of these scores for the best Zcorr XCMS output and for the original data set are depicted in Figure S-6 in the Supporting Information. These profiles are obviously correlated indicating that the same groups of replicates and almost the same individual samples had more inter-replicate deviations than others. Since different peak-picking and peak-alignment

methods were used in generating these outputs, it might be concluded that the source of inter-replicate discrepancy in the Arabidopsis-Four-Tissue data is of experimental origin: either in the initial biological samples or in the quality divergence of the different LC-MS instrument runs. Local Discrepancy Study of the “Ground Truth” Feature Alignment. The peak picking/alignment algorithms could be a significant source of RT specific local discrepancies between samples. To examine this, we used the “ground truth” feature alignments defined in the M1 and M2 data sets reported by Lange et al.19 and proposed by these authors to be the true alignment of features detected in Arabidopsis thaliana leaf tissue LC-MS chromatograms (replicated biological samples). Five samples with either 1 007 features (peaks) from data set M1 or 2 015 features (peaks) from data set M2 were used in the analysis of local discrepancy. The analysis of the M1 data set revealed a series of defective intervals in the RT 3.2 min region (Figure S-7 in the Supporting Information). Samples 2 and 4 largely deviate from the other three samples in this RT region. This divergence is probably the result of an incorrect alignment of peaks in samples 3 and 4 of the M1 data set. Alternatively, local defective intervals with a significant score were not detected in the M2 data set. The detection of local discrepancies of replicates even in one of the published “ground truth” alignments19 demonstrates the sensitivity of the developed QC method to identify mistakes in the chromatogram alignment. Local Inter-Replicate Discrepancy Study of the Tomato ILs Data Set. The tomato ILs data set exemplified to a large extent the identification of local inter-replicate discrepancies that result from problems in the LC-MS runs. Figure 6A presents the chromatograms of two biological replicates, injected in two separate sequences. Most of the compounds show similar retention times on both chromatograms, except for two abundant peaks (957 and 1093 Da) displaying deviating retention times. This is most probably the effect of improper chromatography as several samples were injected onto a not fully equilibrated column. As a result, the XCMS program was unable to align peaks at the RT region 6.3-6.7 min (Figure 6). Figure 6B shows these local interreplicate discrepancies in the 6.3-6.7 min RT region and the 700-1200 Da m/z region. Implementation of the in-depth QC procedure for this data set enabled us to avoid the false differential peaks in the tomato ILs results. The problematic RT window can be subsequently either removed from the data analysis, or should be treated separately, or corrected (see Figure 7). CONCLUSIONS This study demonstrates that quality control of the peakpicking/peak-alignment software output is an important step before subsequent statistical analysis and metabolite assignments are performed. The developed computer program can detect peakintensity errors that may result from irreproducibility of replicated samples, errors in analytical instrumentation, improper peak integration, and wrong alignment of chromatograms. Quantile normalization, often used for microarray data analysis, is most valuable for normalization of LC-MS metabolomics data inside replicate groups. While it is not as good as normalization methods that are based on internal standards,8 it is superior to other statistical (at least, linear) standard-free normalization methods. Indeed, other statistical normalization methods are equalizing some moments of the log-signal distributions in Analytical Chemistry, Vol. 82, No. 22, November 15, 2010

9185

Figure 8. Conclusive example of differences in XCMS outputs depending on (i) parameter setting; (ii) correction of inter-replicate local discrepancies. (A) PCA plot for the XCMS output of IL7-8 data set generated by one of the worst parameter combinations: 9 700 mass peaks, average Zcorr is 95, AvCorr is 0.77. Samples of IL7-1, IL8-3, IL8-3-1, and M82 groups (IL lines) are sliced between two superclusters (left and right). (B) PCA plot for the XCMS output of IL7-8 data set generated by one of the best parameter combinations: 22 100 mass peaks, average Zcorr is 220, AvCorr is 0.91. The grouping of samples is almost the same as on panel A. However, this is according to a number of mass peaks that is 2.5 times higher than the number of mass peaks in the poor XCMS output of panel A. Samples of IL7-1, IL8-3, IL8-3-1, and M82 groups are again sliced between two superclusters. (C) PCA plot after the weighted average correction of inter-replicate local discrepancies in the best XCMS output of the IL7-8 data set. Each group of replicates (IL7-1, IL8-3, IL8-3-1, and M82), which previously was split between two superclusters, now is in one (left) supercluster as a whole. Three matrices of intragroup correlations are highlighted by color on panels D, E, and F: (D) the poor XCMS output (9 700 peaks); (E) the good XCMS output (22 100 peaks); (F) the good XCMS output (22 100 peaks) with corrected local discrepancies. The order of groups is the same as on panels A, B, and C. The color index is the same as in Figure 2A. Obviously, the correlation matrix E is better than matrix D (intragroup correlations are higher), and matrix F (corrected local discrepancies in the good XCMS output) is much better than matrix E.

replicates. The quantile normalization is equalizing the entire cumulative distributions of log-signals that should be the case in normalization of biological or technical replicates. After data normalization inside replicate groups, Z-correlation between replicates (coupled with sufficient Pearson correlation) allows the optimization of the peak picking/peak alignment software parameters. The Zcorr based parameter optimization is often specific for a particular data set. The study of a particular data set (ILs, chromosomes 7-10) by discriminant analyses reveal that the two best combinations of major parameters are symmetric (Figure 3). Typically, the appearance of such symmetrical solutions is evidence that the parameters are mutually dependent. Identification of RT scans that are enriched with local interreplicate discrepancies may facilitate uncovering the source of errors and perform corrections in the statistical analysis results as well as in metabolite assignment. Since local discrepancies were detected even in the M1 “ground truth” feature alignment,19 we suggest that the alignment of chromatograms could be a major source for local RT-scan specific discrepancies. Occasionally, poor intersample peak alignment may result from erroneous peak integration. Our computer program generates a “corrected” XCMS output, where the discrepancies are “corrected” through a weighted averaging of log-signals across the replicate group. For every peak, the weights are inversely proportional to the scores of local defects, to which the given peak belongs. The flowchart of the proposed postpeak-picking QC procedure is depicted in Figure 7. 9186

Analytical Chemistry, Vol. 82, No. 22, November 15, 2010

The conclusive illustrative example for the entire procedure: the parameter optimization, detection of local discrepancies, and the correction of these discrepancies, is depicted in Figure 8. With the relatively noisy raw chromatograms of the ILs chromosomes 7-8 data set started with, the example shows how optimization of XCMS parameters results in doubling the number of identified meaningful mass peaks that have less noisy intensities (better intrareplicate group correlations between samples) in comparison with a standard XCMS parameter setting. Also, the example shows that the cleaning (correction) of local interreplicate discrepancies ends up with the PCA picture that fits much better to the biological grouping of samples (Figure 8C) than the PCA pictures before data correction (Figure 8A,B), where several biological groups of samples were sliced between two major PCA clusters. Finally, further to the already highlighted applications of our QC method, we suggest that it can be valuable in the development and parameter tuning of new peak-picking/peak-alignment procedures by determining the inconsistencies between replicates in their outputs. The C++ computer program (MetaboQC) for QC of LC-MS metabolomics data can be used through the command-line mode. The program’s executable, the program instructions file, and the working examples for the singular and multi input modes can be found in the Supporting Information (Program S-1, Program S-2, Program S-3, and Program S-4). The program’s code is free and

open under GNU General Public License. The software could be found at http://www.weizmann.ac.il/plants/aharoni/software (Weizmann Institute of Science, Israel) and http://evolution. haifa.ac.il/index.php/people/item/146 (University of Haifa, Israel). ACKNOWLEDGMENT A.A. is an incumbent of the Adolfo and Evelyn Blum Career Development Chair. The research in the A.A. laboratory was supported by a grant from The Israel Ministry of Science (IMOS Project No. 3-2552), The EU project “META-PHOR”, Contract Number FOODCT-2006-036220, Mr. and Mrs. Mordechai Segal, and by the Benoziyo Institute. We are thankful to Sagit Meir for

the help with the data analysis, to Arieh Tishbee for operating the LC-MS instruments, to Elia Brodsky for the help with the figures, and to Max Itkin for testing the computer program. SUPPORTING INFORMATION AVAILABLE Additional information as noted in text. This material is available free of charge via the Internet at http://pubs.acs.org.

Received for review May 8, 2010. Accepted September 28, 2010. AC101216E

Analytical Chemistry, Vol. 82, No. 22, November 15, 2010

9187