Direct and Unbiased Information Recovery from Liquid

Sep 4, 2013 - A reworking of a data mining strategy, in which statistical treatment of raw data from liquid chromatography–mass spectrometry (LC-MS)...
0 downloads 0 Views 943KB Size
Article pubs.acs.org/ac

Direct and Unbiased Information Recovery from Liquid Chromatography−Mass Spectrometry Raw Data for PhenotypeDifferentiating Metabolites Based on Screening Window Coefficient of Ion Currents Tetsuo Kokubun*,† and Lilla D’Costa†,‡ †

Jodrell Laboratory, Royal Botanic Gardens, Kew, Richmond, Surrey TW9 3DS, United Kingdom School of Biological Sciences, Royal Holloway, University of London, Egham, Surrey TW20 0EX, United Kingdom



S Supporting Information *

ABSTRACT: A reworking of a data mining strategy, in which statistical treatment of raw data from liquid chromatography− mass spectrometry (LC-MS) precedes recognition of chromatographic peaks, is presented. In this algorithm the tR−m/z plane of LC-MS data is divided into equal-sized segments of twelve seconds by one m/z unit each, and the total ion currents in corresponding segments as specified by the tR−m/z pair from multiple LC-MS runs are evaluated to generate mean ion currents (μ) and standard deviations (σ). The μ’s and σ’s of the segments, derived from contrasting classes of LC-MS data set (e.g., resistant−susceptible, case−control, etc.), are used to calculate the Z-factor (screening window coefficient) which is in turn used to rank the segments. Chromatographic peaks are recognized only where the ion currents are shown to differentiate the classes. The result-reporting format enables detection of positive as well as negative correlations between ion intensities and biological traits under study and thus points to the presence of potentially phenotype-discriminating metabolites. Examples of data analyses are presented, in which ions that may distinguish resistant and susceptible species of Aesculus to the leaf-miner Cameraria ohridella were detected.

W

techniques have also been discussed in terms of sensitivity, dynamic ranges, comprehensiveness and selectivity, resolving power, and whether the techniques give further information regarding chemical structures and the identities of the metabolites.1−3,11,12 The aspects of quality control measures and validation have also been discussed.4,12 One of the key steps in metabolomic studies is the handling of the vast quantity of data, generated electronically and in ever increasing resolution and information density.13,14 A number of software solutions have been implemented to achieve data preprocessing and statistical analyses, both as commercial products and as publicly available ‘free’ software packages.12,15,16 For chromatography-coupled mass spectrometry, data are processed typically in stages divided into the following: 1) smoothing, gap-filling and ion peak detection, and deconvolution if the peaks are deemed to be composed of more than one compound;17−20 2) deisotoping followed by integration of each peak against the background noises and baseline shifts; 3) alignment of metabolite peaks in separate data files;16,19−21 and 4) normalization.22,23 These steps are intensely investigated individually or in combination depending

hile there are subtle differences in definitions and goals between metabolite fingerprinting, metabolite profiling, metabolomics, and metabonomics (hereafter collectively ‘metabolomics’) they often share the same laboratory techniques typified by nuclear magnetic resonance spectroscopy and mass spectrometry with or without prior chromatographic separation(s).1−4 The underlying premise is to detect as wide a range of metabolites as possible (‘nontargeted’) in a sample with (semi)quantitative information. The latter aspect is particularly important since it is often the quantitative differences between the samples that hold the key to understanding biological phenomena. These strategies have been used extensively to detect potential biomarkers in clinical, pharmaceutical, and toxicological studies for diagnoses and prognoses and in an area of nutrition and its effect on health. Metabolomic techniques are also increasingly used in ecological studies including the influence of physical environment to animals and plants and in biotic interactions such as plantpathogen and plant-animal relationships.5,6 The progress and gradually evolving concept of metabolomics have been summarized in a number of review articles, and general workflows have been proposed for robustness and higher efficiencies, while each of the steps are subject to further refinements according to the specific goal of the studies.1,4,7−10 The strengths and weaknesses of employed instrumental © 2013 American Chemical Society

Received: May 23, 2013 Accepted: August 18, 2013 Published: September 4, 2013 8684

dx.doi.org/10.1021/ac401545b | Anal. Chem. 2013, 85, 8684−8691

Analytical Chemistry

Article

icity among ions that were more abundant yet showing no correlation to the classes. An obvious caveat of this approach is that the number of variables becomes large especially when higher data resolution is maintained. On the other hand, in the field of high-throughput screening (HTS) of drug candidate libraries it is not uncommon to have over one million compounds subjected to an assay. In such endeavors the robustness and suitability of the assay protocols are routinely evaluated by the Z- (or Z′-) factor, first proposed by Zhang et al.:30

on the algorithms, since they ultimately determine the coherence of the following statistical analyses.7,24 The created peak lists (‘features’) are usually exported in a table format where peak identification constitutes the column and the integrated peak area in samples, the rows,2 or in more elaborate data structures.19 The obtained data matrices containing the quantitative information of the peaks are then analyzed by advanced and powerful statistical models to recover biologically relevant and interpretable information.2,9,14 The whole computational procedure has been termed data mining, information recovery, and chemometrics. Ion peaks that show significant correlation to the biological trait(s) under study are selected and identified, consulting with online or off-line chemical databases when appropriate.11,25 With this as a backdrop a comparative nontargeted LC-MS metabolite fingerprinting approach was taken in an attempt to probe into the chemical basis of the resistant-susceptible status of some of the species of Aesculus (Sapindaceae) against the leaf-miner insect Cameraria ohridella Deschka et Dimic (Lepidoptera: Gracillariidae). The leaf-miner has recently colonized the horse-chestnut trees Aesculus hippocastanum L. in Europe, causing unsightly damage. Females lay eggs on the upper epidermis by the veins of the leaves on different species of Aesculus; however, larvae can only survive on a limited range of the species.26 The analysis was, however, severely hampered by the complexity of the data containing countless peaks of greatly variable intensities, and the different plant species tended to accumulate their own unique metabolites at variable quantities among large amounts of common ones. The inherent biological variations within each of the resistant and susceptible classes were large. In addition to this, peak-picking from individual injections was clearly influenced by the parameters set in the software and thus might have excluded the signals that were truly relevant for class discrimination. For these reasons we sought ways to process and to explore the data, primarily aiming at detection of class-discriminating signals; the identification of metabolites would be meaningful only when there is reasonable rationale that they are indeed discriminating classes. A procedure that may circumvent the above problems simultaneously would be to compare the raw data directly, without detecting peaks first. The data from a liquid chromatography (LC)-coupled mass spectrometry (MS) can be represented in a three-dimensional space, having retention time (tR), mass-to-charge ratio (m/z), and ion current (or ion count; IC) as axes. Among these only the IC is the experimental readout, the value of which is a function of the identifier pair tR−m/z for a given sample in defined chromatographic and MS conditions. A possible simplest method is a subtraction of averaged IC values from contrasting phenotypes. Such an approach has been used before on data sets from direct infusion ESI-MS and total ion chromatograms from 2D GC × GC-TOF-MS, with or without prior normalization.27,28 Although simple in the concept, these procedures did not use, respectively, chromatographic separation (retention time) and mass-to-charge ratio (m/z) as identifiers of signals. Also, the absolute differences between smaller peaks were reported to be accordingly small, regardless of however large their fold changes were.28 In 2006 Crockford et al. reported an algorithm, in which binned ion intensity data were directly applied to a multivariate statistic procedure (orthogonal projection to latent structure−discriminant analysis, O-PLS-DA).29 The resultant weight (σ) then t-statistic enabled to select candidate biomarkers of hydrazine hepatotox-

Z=1−

3(σ1 + σ2) |μ1 − μ2 |

It is reflective of the dynamic range of the mean assay signals (μ1, μ2) and standard deviations (σ1, σ2) of controls. It can also be used for optimization of assay protocols including selection of positive controls that give maximum assay signals.31 When the formula was applied to a set of known active compounds in an assay (although perhaps it was not the originally intended use), the resultant Z-factor gave a value Z > 0.5,32 the category that an assay would be considered ‘excellent’.30 These observations pointed to an interesting possibility that the Zfactor might be used to screen compounds themselves,33 detect variables that clearly discriminate classes,34 and even for deconvoluting signals from complex mixtures such as plant extracts. Reinterpreted for the problem of the AesculusCameraria LC-MS data set, an extract from the leaves represents the chemical library in its entirety; each bin, a drug candidate in the library; a multiple injection of the same sample, the replication; and the samples of different biological origin, the variable concentration gradient with regard to the ion currents in the bins. The ‘assay results’, the field observation on the resistant-susceptible statuses, cannot be more robust as they are exactly the results of whole-organism assays in fully natural settings, using the full set of the compounds in the library. We present here a reworking of the algorithm which departs from peak-picking followed by statistics on the peak characteristics to statistical treatment of the raw data followed by peak recognition based on this interpretation.



ALGORITHMS AND PROCEDURES In this algorithm, the tR−m/z plane of an LC-MS analysis is divided into equal-sized segments of one m/z unit by twelve seconds. Each segment has a value that is the sum of all ICs, including those of noises, detected within the boundary. The values from corresponding segments in multiple files are then subjected to a simple statistic procedure to obtain averages (mean ion currents, μ) and standard deviations (σ). The μ and σ values for two classes are obtained separately, and the Zfactor is calculated for each and all of the segments as specified by the tR−m/z pair. The results are written to a file in the decreasing order of the Z-factor in a spreadsheet format that can be read by Microsoft Excel directly for further analyses. Two small stand-alone programs were written in C++ language to implement this algorithm. The first (MZshovel) preprocesses the raw data in ASCII format (converted from a proprietary format by using a utility program, DataBridge, distributed with the LC-MS data system MassLynx, Waters Corp.) and reorganizes to a format so that the second program would not be required to perform any nonstatistics-related operation (Figure S-1). The m/z values as recorded in the raw 8685

dx.doi.org/10.1021/ac401545b | Anal. Chem. 2013, 85, 8684−8691

Analytical Chemistry

Article

ASCII files are binned around median mass unit −0.5 ≤ x < +0.5, and the associated ICs are accumulated without any transformation into a two-dimensional matrix of 64-bit integer words, each representing the total ion current in a segment of one m/z unit by twelve seconds. Data from different polarities are accordingly compiled into separate data matrices, that is, if an LC-MS was run with alternate positive and negative mode scans it would generate two separate data files. UV absorption data, if contained in the ASCII file, are discarded. Thus, a feature in this algorithm is defined as the sum of ICs within a segment of fixed widths but not a chromatographic peak of a compound with the associated mass spectrum. In a typical setup of a 30-min LC program and scanning m/z 150−2000 every 2.4 s (in one polarity), as employed in our laboratory, a single run generates approximately 2.77 × 105 segments each containing data from five scans. After all the ICs are accumulated the data in the whole matrix are linearly normalized to range 0 to 65535 (= 216-1), and written to a file with a .mzs extension, together with recorded minimum and maximum m/z values, LC retention time range in which mass spectra were obtained, bin sizes (tR, m/z), and the conversion factor used for normalizing the ICs in segments. The second program (Ionwinze) analyzes the data and reports the outcome. A sample set is a set of the .mzs files that are supplied to this program at once, which may or may not include experimental controls; if required a multiple number of files containing controls (e.g., positive and negative controls as well as QCs) may be included in a single sample set. This offers a freedom in defining what constitutes a sample set. A minimum of three different data files are required for each of the two classes as input, and the maximum is currently set at 255. The program calculates the mean total IC in segments identified by the tR−m/z pair from all the supplied files, the corresponding standard deviation, and the Z-factor. One may envisage the finely divided tR−m/z plane as a ‘microplate’ with 277,000-plus wells, to which an extract was fractionated twodimensionally by retention time and m/z values; the total ICs of the segments (‘wells’) are then evaluated ‘vertically’ across the stacked microplates. If the source files contained different scan ranges with respect to retention times and/or m/z values, the program calculates only the data in segments that are common to all the files. The Z-factor values can be divided into several categories,30 and accordingly the program reports the statistical results classified: 1) Z ≥ 0.5, where the distribution of the signals from two classes are completely separated with large separation band; 2) 0.5 > Z ≥ 0, where the separation band is smaller or touching at 3 × SDs (Z = 0); and 3) 0 > Z ≥ −1, where the signal distributions overlap and lose discriminating power at Z = −1. This is repeated for mean ion currents {μ1 > μ2} and {μ2 > μ1}. [The source codes and compiled programs (MZshovel and Ionwinze, including options for coefficient of variation (CV) and Student's t-test and z-test; see discussion below), raw data, and intermediate files relevant to this article may be obtained from the corresponding author.]

Alliance separation module (Waters, MA, US) with a Luna C18 column (3 mm diameter × 150 mm, dp 5 μm; Phenomenex) at 30 °C. The mobile phase was a mixture of water, methanol, and 1% (v/v) formic acid in acetonitrile, with linear gradient from 90:0:10 (t = 0 min) to 0:90:10 (t = 20 min) then isocratic for 5 min, at a flow rate of 0.5 mL/min. The column was equilibrated in the initial condition solvent for 10 min before an injection. A ZQ single quadrupole mass spectrometer (Micromass, Manchester, UK) with an electrospray ion source was used for detection of metabolites. An alternate positive-negative scan mode (1.0 s per scan) was used for m/z range 150−2000, with an interscan delay of 0.2 s. Scanning was continued to tR = 30 min in order to cover the gradient delay from the LC solvent mixer to the column outlet. Capillary voltages were 3.5 kV and 4.0 kV for positive and negative ionizations, respectively. Source (120 °C) and desolvation (450 °C) temperatures, gas flows (N2) for desolvation (500 L/h) and at the cone (50 L/h), and analyzer settings were kept constant throughout experiments. All the instruments were controlled by the MassLynx software (v. 4.0, Micromass). Plant Material. The following six species were included in the study: Aesculus chinensis, A. f lava, A. hippocastanum, A. indica, A. pavia, and A. turbinata. Leaf material was collected in May 2011 from the saplings between 1 and 1.5 m in height and planted in identical pots in the experimental greenhouse of Royal Holloway, University of London, free from insect damage or microbial infection. The leaf materials were lyophilized and separately ground to fine powder using a pestle and a mortar (excluding leaf veins). The powdered leaf material was extracted with 80% methanol (5.0 mL per 100 mg dw) at room temperature for 24 h, with occasional shaking. An aliquot was centrifuged at 14,000 × g for 15 min, and the supernatant was used directly as an LC sample (injection volume 10 μL).



RESULTS AND DISCUSSION

The presented algorithm treats LC-MS data without reducing the dimension of the data structure. It faithfully processes all the IC data recorded in the source files albeit at reduced resolutions in all three dimensions, a necessity for assigning the local arithmetical average value to segments. Any signals arising from column bleed, solvent impurities, autosampler carryover, and, if used, internal standard and remaining derivatization agents are subjected to statistical treatment. Normalizing the whole data to range 0−65535 in integer form meant that if the accumulated value in a segment was below 1/65535 of the largest value the IC in that segment would be replaced by a null value. In practice, however, this was conveniently used to assign the noise threshold according to the dynamic range of the ICs reported in an individual file. In an initial coding the grand total IC in a single file was used as the base of normalization; however, this procedure eliminated a substantial proportion of low-abundance ions as well as noises. The contribution of upward-drifting baselines toward the end of reverse-phase chromatographic runs was found to be the primary cause of this, as found through analyses of blank injections. Since this algorithm evaluates only the segments the sizes of which are fixed and ‘vertically’ across LC runs, it does not correlate ions that are binned to different m/z values detected at the same tR. Therefore it evaluates a parent ion and any fragment and adduct ions independently, and their Z-factors are reported separately in the output. Similarly, if a chromatographic peak is divided into two or more adjacent segments,



EXPERIMENTAL SECTION General. HPLC gradient grade solvents (methanol, acetonitrile) were obtained from Fisher Scientific (Leicester, UK). Deionized water was prepared in-house using a Milli-Q Integral system (Millipore, MA, US) and membrane-filtered (0.22 μm) prior to use. Liquid Chromatography−Mass Spectrometry (LCMS). HPLC separation was performed using Waters 2695 8686

dx.doi.org/10.1021/ac401545b | Anal. Chem. 2013, 85, 8684−8691

Analytical Chemistry

Article

Figure 1. Typical base-peak intensity chromatograms in negative ESI mode. The top four species (Aesculus chinensis, A. f lava, A. indica, and A. pavia, in this order) are resistant, and the bottom two (A. hippocastanum and A. turbinata.) are susceptible to Cameraria ohridella. One sample each from the six species is presented as an example. Chromatograms are normalized to the largest peak in the sample.

Table 1. Segments with Z-Factor between 0.5 and 0, Found for the {μ2 > μ1}-Group in Negative ESI Mode

then the parts of peaks are separately analyzed and reported (more discussion on this topic is presented later). The preference-performance of C. ohridella had been established with oviposition and larval development experiments on the saplings employed in this work; therefore, the resistance-susceptible characters of these species were known: Aesculus chinensis, A. f lava, A. indica, and A. pavia were completely resistant to C. ohridella and none of these species supported larval development, while A. turbinata and A. hippocastanum were susceptible and allowed the insects to develop fully and to pupate. These results were broadly consistent with the preceding field observations as reported earlier.26 These six species exhibited base-peak intensity chromatograms characteristic to the species (Figure 1 for negative ESI mode; see Figure S-2 for positive mode). The maximum deviations observed in the batch of LC runs were 6 s at tR 6.6 min (for m/z 863.5 at the peak apex) and 4.8 s at tR 25.4 min (m/z 982.0), for commonly detected metabolites in all the samples. The raw data, originating from 18 individual plants (12 resistant, 6 susceptible), were then fed to the programs. No segment was found to separate the two groups with a Z-factor ≥0.5 in either polarity, and in negative mode 0.5 > Z ≥ 0 where average ICs in the resistant group were greater than the susceptible group (μR > μS; full result in S-5 and S-6). A small number of segments were listed for positive mode in the 0.5 > Z ≥ 0 range with average ion currents less than 10 (in normalized value; S-5); however, an inspection of the original LC-MS data (with MassLynx software) revealed that the detected signals were very close to the background noise levels and were difficult to ascertain the presence of a compound. In negative ionization mode up to ten segments were picked up in the 0.5 > Z ≥ 0 range for μS > μR (Table 1). The highestranking segment m/z ‘865’ at 12.8 min (denoted at the beginning of the segment 12.8−13.0 min) was, however, deduced to have been derived from a compound with nominal

segment a

resistant group b

Z-factor

tR (min)

m/z

0.348 0.308 0.280 0.143 0.129 0.096 0.066 0.028 0.022 0.000

12.8 21.4 20.6 12.4 21.2 21.4 1.4 11.2 13.0 5.0

865 1004 1020 1775 1004 1072 357 669 499 1907

mean I. C. 55 111 88 0 65 44 525 15 455 0

c

susceptible group c

SD

mean I. C.c

SDc

23 106 101 0 32 26 117 12 115 0

993 3954 4937 7 1140 525 2552 123 1381 3

181 780 1062 2 280 119 514 23 187 1

a

Denotes the beginning of the segment width (0.2-s intervals). Median value of the bin in m/z dimension (−0.5 ≤ x < +0.5). cBased on normalized value (0−65535) to the largest total ion current value in segments in a file.

b

mass 432, as the cluster ion [2M-H]− with one heavy natural isotope (2H or 13C) in the cluster (Table 2; Figure 2). The corresponding molecular ion [M-H]−, m/z 431, was evaluated to Z = −0.279 and reported in the third Z-factor category due to the 3σ-criterion (corresponding to P < 0.0013, one-tailed; Table 2, see also S-6). This compound, exhibiting a 4-fold average signal intensity in the susceptible plants, was isolated and identified as the flavonoid kaempferol 3-O-α-rhamnopyranoside by NMR spectroscopy (unpublished result). The ion m/z 1004 at 21.4 and 21.2 min represents a case where the chromatographic peak was split into two adjacent segments; however, the greater concentration in the susceptible plants cannot be disputed (Figure S-3). The identities of this compound and the rest in the table are currently unknown but are primary candidates for discriminating the two 8687

dx.doi.org/10.1021/ac401545b | Anal. Chem. 2013, 85, 8684−8691

Analytical Chemistry

Article

Table 2. Ions Derived from Kaempferol 3-O-Rhamnopyranoside (MW 432), Found in Bins 12.8−13.0 min in Negative ESI Mode resistant group Z-factor 0.348 −0.057 −0.258 −0.279 −0.817

m/z bina 865 863 432 431 433

identity −

[2M-H] [2M-H]− [M-H]− [M-H]− [M-H]−

isotopes (1H, 13C)

susceptible group

mean I. C.b

SDb

CVc

mean I. C.b

SDb

CVc

55 139 2087 10427 568

23 72 572 2846 205

0.418 0.518 0.274 0.273 0.361

993 3286 9588 44883 1737

181 1037 2573 11848 503

0.182 0.316 0.268 0.264 0.290

d

1 0 1 0 2

Median value of the bin in m/z dimension (−0.5 ≤ x < +0.5). bBased on normalized value to the largest total ion current in segments in a file. Coefficient of variation. dThis bin contains ions with 864.5 ≤ m/z < 865.5; therefore, the mass of the cluster ion as reported from the spectrometer (864.5) was rounded up. See also Figure 2 and Discussion on the centroiding-induced measurement errors.50 a c

(thereby the original plant material) in the decreasing order of the CVs in particular segments, followed by the generation of a ‘heatmap’ to correlate the field-observed resistance-susceptible characters, however, became unusable when the number of the files increased. In a second approach in which the files were assigned to two classes beforehand, and evaluating the segments with the P-values calculated based on the t-test, required significantly greater computing power (see also S-4 for the formulas). This procedure was, however, fundamentally flawed since the variances of the ion currents from the two sets of files were unknown and were not expected to be normally distributed. Moreover the t-test is insensitive to the overlapping of signal distributions in contrast to the Z-factor. A direct analysis on raw data offers unique advantages over the mainstream methods as it eliminates certain sources of bias, and the presented algorithm may be useful beyond marker discovery. The first and foremost is that the algorithm does not require an S/N ratio to be preset. The presence of a genuine signal from a compound would be statistically picked up against their local random noises, irrespective to whether the signals form a sharp chromatographic peak or a part of a broad hump. The concept of minimum peak height and width does not apply to the algorithm; therefore, those ion signals that would have been omitted through these restrictive criteria may be highlighted. Besides, it uses all the numerical data in the source files without selecting any value to remove (such as outliers) or filling ‘missing’ values at data points with an estimated one. Avoiding reconstruction of chromatograms for estimating total ICs (thus for peaks) would alleviate such algorithm-induced alteration of data. Among the features of the algorithm is that it is able to handle an absence of metabolites in a set of samples. This enables the programs to pick up, for example, newly expressed or depleted metabolites following biotic and/or abiotic challenges. A reporting by fold-changes from the ground states would not be possible in such situations.22 This feature is expected to be particularly useful, for example, for the detection of adulterants and dopants that are completely absent in the genuine materials. A note on the processing time may deserve here. Because of the simplicity of the integer-arithmetical calculations (as well as in part the use of the C++ language) the entire computation from data preprocessing through to statistic result requires very little computing resource and power and is very fast29 using a commonly available low-end desktop personal computer. The total processing time for 50 ASCII raw data files (scanning ranges 0−30 min and m/z 150−2000) by MZshovel was about 30 s, and the resultant data set (randomly divided to 25 each) took a mere 2 s for Ionwinze to calculate the Z-factors, sort the

Figure 2. Combined mass spectra in negative mode 12.8−13.0 min. Data taken from Aesculus hippocastanum (sample 1 of three). Inset: expansion of m/z 861−868 range.

phenotypes either alone or in combination. The confirmation of the true relevance of the highlighted metabolites, however, requires biological testing with appropriate models and/or in field conditions. As to why these compounds apparently occur in higher concentration in the susceptible species, we can only speculate that those plants that accumulate them ‘inadvertently’ become potentially susceptible, rather than the resistant plants solely producing anti-insect compounds. A large number of segments, for example 356 in positive ESI scans for {μ1 > μ2}-group, were shown to have the IC data distribution overlapped (0 > Z ≥ −1; Supporting Information S-5 and S-6). The Z-factor appears to be a computationally simple yet highly effective univariate parameter to evaluate the means and distributions of signals from two classes. The assay signals may be MS ion intensities of the analyte themselves33 or derivatives obtained from those of (enzymatic) reaction components.35,36 It is admittedly crude as well as simplistic in that many biological phenomena are not expressed by a direct relationship between a small number of causes and effects. However, it is interesting to note that, where putative variables (metabolites) are identified with multivariate statistical methods, a justification for the results, statistical at least, is often given by simpler (and more transparent) univariate methods such as ‘Student’s’ t-test,1,8,29,37−43 ANOVA,27,40 and by the foldchanges.37,42−44 In instances the t-test has been one of the primary methods for uncovering statistically significant signals among 2D GC and LC data.28,37 Our first approach for detection of class-discriminating variables was with the coefficient of variations (CV, = σ/μ), where large variations were assumed to be significant as also considered by Crockford et al.29 Sorting of the source files 8688

dx.doi.org/10.1021/ac401545b | Anal. Chem. 2013, 85, 8684−8691

Analytical Chemistry

Article

as plant extracts are analyzed. The range of Z-factor values obtained for the deprotonated ions from kaempferol rhamnoside (Table 2) may be partially explained by this nonlinearity. The present algorithm, like all other software, assumes that a matrix effect is negligible. Besides, (LC−)MS data are often acquired in the centroid mode, where the true distribution of measured ion abundances is discarded and a single value at the peak apex is used as a surrogate for the integrated mass spectral peak area. The effect of centroiding and the induced problems in quantitation has been discussed in detail.50 Even assuming acceptable linearity, deducing the relative quantity of a compound involves totaling of molecular, daughter, adduct, and cluster ions both within and outside the m/z-scan range. It is a feat that poses a significant, if not impossible, challenge without a prior knowledge of the behavior of the compounds especially when a reference standard is not available. The limited data obtained during the current work seem to be consistent with the expected increase in cluster formation when the analyte concentration is higher (Table 2). Altogether, mass spectrometry gives skewed quantitative data even at the most favorable conditions.51 On the data preprocessing step the greatest risk is at the linear normalization of the data to the largest total ion current in a segment, in a file. This can potentially affect especially lowintensity signals and distort the results (vide infra).22 In the examples presented above each of the data sets were scaled to large and dominating peaks that were common to all of the samples, so they were expected to have caused relatively small disruptions; a potential solution to this problem may be to spike the samples with an internal standard that would have supplied the highest segmental IC value thereby controlling the normalization. Further, there are circumstances where the Z-factor would fail to pick up a potential class discriminator. For example, where the insect resistance is based on a low concentration of a metabolite, and when the 3σ of that compound within the resistant group exceeds the dynamic range between classes, the Z-factor would appear close to or even below −1; sorting the segments by the signal intensities alone would have highlighted the segment(s). Although not unique to the presented algorithm univariate methods take essentially a reductionist approach by attempting to attribute biological traits to a small number of metabolites, rather than on a truly holistic standpoint despite that the entire data points are used for analysis. This algorithm obviously does not address the synergistic/additive interaction or antagonistic effects of components in a sample. Moreover, constant components across samples are ranked lower. Therefore characterization of active mixtures would require significant investment in the follow-up research time and resources.

results, and report, on an Intel Pentium 4 machine with 1024 MB memory, running at 3.2 GHz. These are significant differences from reported execution time (minutes to days) for separate steps of normalizing,19 peak-aligning,16,19,21 and feature selection45 on smaller data sets. However, the presented algorithm has weaknesses as well and requires further refining. First, while the effect of chromatographic baseline shift and truly random noises (instrumental and electronic) may be canceled out between identical LC elution program,46 when the absolute intensities of genuine signals are small the margin of error becomes greater and would inflate SDs, which would in turn give smaller Zfactors. Second, depending on the chromatographic resolution, a single segment may contain more than one chromatographic peak of the same and very close m/z values. The time interval of segments should be adjusted to optimal value when used with a system that generates narrower peaks e.g. ultrahigh performance LCs. Conversely a single broad peak may be split into several neighboring segments having the same m/z identifier (Figure S-3). Extensive tailings, including those caused by column overloading,47 would have detrimental influence on the analysis. A possible solution would be to obtain a better chromatographic separation, including using a longer elution profiles48 rather than focusing on LC performance measured by the throughput. Currently the maximum retention time for the programs is set to 60 min (hard-coded). Third, by design it does not align ion signals in time dimension but places mechanically into bins. Therefore the reproducibility and robustness of the chromatographic system are of paramount importance. A shift in retention time may have a large effect on calculating the total ICs in each of the segments, especially when the peak apex is close to the border of segments. In such a case the segments would be reported to have a greater variation than in ideal cases. Widening of segments in the time-dimension may be a possible solution to accommodate the drifts, although this incurs a loss of resolution. Peak alignment is often considered an essential and critical part of metabolomic analyses especially when multiple samples in a study cannot be analyzed in a single batch, for example large epidemiological cohort studies that may span many years. We consider, however, the idea of peak alignment deserves some serious consideration, since it may imply a reverse manipulation with the reporting from a mass spectrometer (the detector) of the analyte that had gone into the detector; the issues of stability and reproducibility should, ideally, be dealt with by the LC components as far as practically possible. Then, there is an inherent risk in treating IC data as quantitative information against the fundamental ideas of metabolomics and allied disciplines. Mass spectrometer responses are not governed solely by the amount of compounds that enter an ion source but a multitude of factors including ionizability of the analytes in given solvents and modifiers, polarity and ion sources,49 ionization energy (voltage in case of ESI), and matrix effects as well as the effect of polarity switching. The existence of quantitative differences in ionization and detection capacities due to the instrument design (MS2) has been reported.47 Thus, the linearity of the detector response cannot always be assumed even if the reported values may apparently fall within the accepted range of regression curves; it may be true for a single quadrupole MS (MS1) system, especially when complex sample matrices such



CONCLUSIONS We described here a hybrid approach combining LC-MS profiling and high-throughput screening, for detecting ion signals from raw LC-MS data sets with a simple yet powerful statistical parameter, the Z-factor. The highlighted signals are, when their Z-factors are above zero, already confirmed sufficient for discriminating the classes without further statistical testing. It is also possible to detect small in size but significant differences, previously undetected or masked by larger but insignificant ions for the purpose of class differentiation. The algorithm requires relatively little computing power and takes seconds to complete the calculation, ordering, 8689

dx.doi.org/10.1021/ac401545b | Anal. Chem. 2013, 85, 8684−8691

Analytical Chemistry

Article

(17) Jellema, R. H.; Krishnan, S.; Hendriks, M. M. W. B.; Muilwijk, B.; Vogels, J. T. W. E. Chemom. Intell. Lab. Syst. 2010, 104, 132−139. (18) Kuhl, C.; Tautenhahn, R.; Böttcher, C.; Larson, T. R.; Neumann, S. Anal. Chem. 2012, 84, 283−289. (19) Åberg, K. M.; Torgrip, R. J. O.; Kolmert, J.; Schuppe-Koistinen, I.; Lindberg, J. J. Chromatogr., A 2008, 1192, 139−146. (20) Wei, X.; Shi, X.; Kim, S.; Zhang, L.; Patrick, J. S.; Binkley, J.; McClain, C.; Zhang, X. Anal. Chem. 2012, 84, 7963−7971. (21) Koh, Y.; Pasikanti, K. K.; Yap, C. W.; Chan, E. C. Y. J. Chromatogr., A 2010, 1217, 8308−8316. (22) Sysi-Aho, M.; Katajamaa, M.; Yetukuri, L.; Orešič, M. BMC Bioinf. 2007, 8, 93. (23) Veselkov, K. A.; Vingara, L. K.; Masson, P.; Robinette, S. L.; Want, E.; Li, J. V.; Barton, R. H.; Boursier-Neyret, C.; Walther, B.; Ebbels, T. M.; Pelczer, I.; Holmes, E.; Lindon, J. C.; Nicholson, J. K. Anal. Chem. 2011, 83, 5864−5872. (24) Pluskal, T.; Castillo, S.; Villar-Briones, A.; Orešič, M. BMC Bioinf. 2010, 11, 395. (25) Sumner, L. W.; Amberg, A.; Barrett, D.; Beale, M. H.; Beger, R.; Daykin, C. A.; Fan, T. W.-M.; Fiehn, O.; Goodacre, R.; Griffin, J. L.; Hankemeier, T.; Hardy, N.; Harnly, J.; Higashi, R.; Kopka, J.; Lane, A. N.; Lindon, J. C.; Marriott, P.; Nicholls, A. W.; Reily, M. D.; Thaden, J. J.; Viant, M. R. Metabolomics 2007, 3, 211−221. (26) D’Costa, L.; Koricheva, J.; Straw, N.; Simmonds, M. S. J. Ecol. Entomol. 2013, DOI: 10.1111/een.12037. (27) Allwood, J. W.; Ellis, D. I.; Heald, J. K.; Goodacre, R.; Mur, L. A. J. Plant J. 2006, 46, 351−368. (28) Shellie, R. A.; Welthagen, W.; Zrostliková, J.; Spranger, J.; Ristow, M.; Fiehn, O.; Zimmermann, R. J. Chromatogr., A 2005, 1086, 83−90. (29) Crockford, D. J.; Lindon, J. C.; Cloarec, O.; Plumb, R. S.; Bruce, S. J.; Zirah, S.; Rainville, P.; Stumpf, C. L.; Johnson, K.; Holmes, E.; Nicholson, J. K. Anal. Chem. 2006, 78, 4398−4408. (30) Zhang, J.-H.; Chung, T. D. Y.; Oldenburg, K. R. J. Biomol. Screening 1999, 4, 67−73. (31) Moorwood, C.; Lozynska, O.; Suri, N.; Napper, A. D.; Diamond, S. L.; Khurana, T. S. PLoS One 2011, 6, e26169. (32) Tian, H.; Ip, L.; Luo, H.; Chang, D. C.; Luo, K. Q. Br. J. Pharmacol. 2007, 150, 321−334. (33) Bovet, C.; Plet, B.; Ruff, M.; Eiler, S.; Granger, F.; Panagiotidis, A.; Wenzel, R.; Nazabal, A.; Moras, D.; Zenobi, R. Toxicol. In Vitro 2009, 23, 704−709. (34) Broadhurst, D. I.; Kell, D. B. Metabolomics 2006, 2, 171−196. (35) Partserniak, I.; Werstuck, G.; Capretta, A.; Brennan, J. D. ChemBioChem 2008, 9, 1065−1073. (36) Forsberg, E. M.; Green, J. R. A.; Brennan, J. D. Anal. Chem. 2011, 83, 5230−5236. (37) Vorst, O.; de Vos, C. H. R.; Lommen, A.; Staps, R. V.; Visser, R. G. F.; Bino, R. J.; Hall, R. D. Metabolomics 2005, 1, 169−180. (38) Pandher, R.; Ducruix, C.; Eccles, S. A.; Raynaud, F. I. J. Chromatogr., B: Anal. Technol. Biomed. Life Sci. 2009, 877, 1352−1358. (39) Lin, X.; Zhang, Y.; Ye, G.; Li, X.; Yin, P.; Ruan, Q.; Xu, G. J. Sep. Sci. 2011, 34, 3029−3036. (40) Jänkänpäa,̈ H. J.; Mishra, Y.; Schröder, W. P.; Jansson, S. Plant Cell Environ. 2012, 35, 1824−1836. (41) Wang, X.; Yang, B.; Sun, H.; Zhang, A. Anal. Chem. 2012, 84, 428−439. (42) Wu, Y.; Zhang, Y.; Xie, G.; Zhao, A.; Pan, X.; Chen, T.; Hu, Y.; Liu, Y.; Cheng, Y.; Chi, Y.; Yao, L.; Jia, W. PLoS One 2012, 7, e44830. (43) Zhao, T.; Zhang, H.; Zhao, T.; Zhang, X.; Lu, J.; Yin, T.; Liang, Q.; Wang, Y.; Luo, G.; Lan, H.; Li, P. J. Pharm. Biomed. Anal. 2012, 60, 32−43. (44) Zhao, X.; Zhang, Y.; Meng, X.; Yin, P.; Deng, C.; Chen, J.; Wang, Z.; Xu, G. J. Chromatogr., B: Anal. Technol. Biomed. Life Sci. 2008, 873, 151−158. (45) Zou, W.; Tolstikov, V. Rapid Commun. Mass Spectrom. 2008, 22, 1312−1324.

and reporting. By providing a way to correlate biological and statistical criteria, the algorithm may be used for the election of potential markers for further examination with strong rationale. It can be envisaged that the method may be used in areas of MS-based metabolomic studies, whether it is a difference that is sought or for test of equivalence.



ASSOCIATED CONTENT

S Supporting Information *

Figures (workflow diagram, BPI chromatograms in positive ESI mode, and ion chromatograms for comparison), calculation of P-values based on Student’s t-distribution, and the output of data-mining programs. This material is available free of charge via the Internet at http://pubs.acs.org.



AUTHOR INFORMATION

Corresponding Author

*Phone: +44-208-332-5318. Fax: +44-208-332-5310. E-mail: t. [email protected]. Notes

The authors declare no competing financial interest.

■ ■

ACKNOWLEDGMENTS We thank Dr. Nigel C. Veitch (RBG, Kew) for running NMR experiments for identification of flavonoid compounds. REFERENCES

(1) Fiehn, O. Plant Mol. Biol. 2002, 48, 155−171. (2) Goodacre, R.; Vaidyanathan, S.; Dunn, W. B.; Harrigan, G. G.; Kell, D. B. Trends Biotechnol. 2004, 22, 245−252. (3) Lindon, J. C.; Nicholson, J. K. Annu. Rev. Anal. Chem. 2008, 1, 45−69. (4) Koek, M. M.; Jellema, R. H.; van der Greef, J.; Tas, A. C.; Hankemeier, T. Metabolomics 2011, 7, 307−328. (5) Bundy, J. G.; Davey, M. P.; Viant, M. R. Metabolomics 2009, 5, 3− 21. (6) Prince, E. K.; Pohnert, G. Anal. Bioanal. Chem. 2010, 396, 193− 197. (7) Kind, T.; Tolstikov, V.; Fiehn, O.; Weiss, R. H. Anal. Biochem. 2007, 363, 185−195. (8) Bruce, S. J.; Jonsson, P.; Antti, H.; Cloarec, O.; Trygg, J.; Marklund, S. L.; Moritz, T. Anal. Biochem. 2008, 372, 237−249. (9) Hendriks, M. M. W. B.; van Eeuwijk, F. A.; Jellema, R. H.; Westerhuis, J. A.; Reijmers, T. H.; Hoefsloot, H. C. J.; Smilde, A. K. Trends Anal. Chem. 2011, 30, 1685−1698. (10) Ressom, H. W.; Xiao, J. F.; Tuli, L.; Varghese, R. S.; Zhou, B.; Tsai, T.-H.; Ranjbar, M. R. N.; Zhao, Y.; Wang, J.; Poto, C. D.; Cheema, A. K.; Tadesse, M. G.; Goldman, R.; Shetty, K. Anal. Chim. Acta 2012, 743, 90−100. (11) Werner, E.; Heilier, J.-F.; Ducruix, C.; Ezan, E.; Junot, C.; Tabet, J.-C. J. Chromatogr., B: Anal. Technol. Biomed. Life Sci. 2008, 871, 143− 163. (12) Theodoridis, G.; Gika, H. G.; Wilson, I. D. Mass Spectrom. Rev. 2011, 30, 884−906. (13) Goodacre, R.; Broadhurst, D.; Smilde, A. K.; Kristal, B. S.; Baker, J. D.; Beger, R.; Bessant, C.; Connor, S.; Capuani, G.; Craig, A.; Ebbels, T.; Kell, D. B.; Manetti, C.; Newton, J.; Paternostro, G.; Somorjai, R.; Sjöström, M.; Trygg, J.; Wulfert, F. Metabolomics 2007, 3, 231−241. (14) Madsen, R.; Lundstedt, T.; Trygg, J. Anal. Chim. Acta 2010, 659, 23−33. (15) Katajamaa, M.; Orešič, M. J. Chromatogr., A 2007, 1158, 318− 328. (16) Lange, E.; Tautenhahn, R.; Neumann, S.; Gröpl, C. BMC Bioinf. 2008, 9, 375. 8690

dx.doi.org/10.1021/ac401545b | Anal. Chem. 2013, 85, 8684−8691

Analytical Chemistry

Article

(46) Falck, D.; de Vlieger, J. S. B.; Niessen, W. M. A.; Kool, J.; Honing, M.; Giera, M.; Irth, H. Anal. Bioanal. Chem. 2010, 398, 1771− 1780. (47) Nilsson, L. B.; Skansen, P. Rapid Commun. Mass Spectrom. 2012, 26, 1399−1406. (48) Guy, P. A.; Tavazzi, I.; Bruce, S. J.; Ramadan, Z.; Kochhar, S. J. Chromatogr., B: Anal. Technol. Biomed. Life Sci. 2008, 871, 253−260. (49) Sana, T. R.; Waddell, K.; Fischer, S. M. J. Chromatogr., B: Anal. Technol. Biomed. Life Sci. 2008, 871, 314−321. (50) Wang, Y.; Gu, M. Anal. Chem. 2010, 82, 7055−7062. (51) Schroeder, F. C.; Gibson, D. M.; Churchill, A. C. L.; Sojikul, P.; Wursthorn, E. J.; Krasnoff, S. B.; Clardy, J. Angew. Chem., Int. Ed. 2007, 46, 901−904.

8691

dx.doi.org/10.1021/ac401545b | Anal. Chem. 2013, 85, 8684−8691