Article pubs.acs.org/ac

Automatic Chemical Structure Annotation of an LC−MSn Based Metabolic Profile from Green Tea

Lars Ridder,*,†,‡ Justin J. J. van der Hooft,†,∥,⊥,# Stefan Verhoeven,‡ Ric C. H. de Vos,§,∥,⊥ Raoul J. Bino,⊗ and Jacques Vervoort†,∥

† Laboratory of Biochemistry, Wageningen University, Dreijenlaan 3, 6703 HA Wageningen, The Netherlands
‡ Netherlands eScience Center, Science Park 140, 1098 XG Amsterdam, The Netherlands
§ Plant Research International, Wageningen University and Research Centre, P.O. Box 16, 6700 AA Wageningen, The Netherlands
∥ Netherlands Metabolomics Centre, Einsteinweg 55, 2333 CC Leiden, The Netherlands
⊥ Centre for Biosystems Genomics, P.O. Box 98, 6700 AB Wageningen, The Netherlands
⊗ Laboratory of Plant Physiology, Wageningen University, POB 658, 6700 AR Wageningen, The Netherlands

*S Supporting Information

ABSTRACT: Liquid chromatography coupled with multistage accurate mass spectrometry (LC−MSn) can generate comprehensive spectral information of metabolites in crude extracts. To support structural characterization of the many metabolites present in such complex samples, we present a novel method (http://www.emetabolomics.org/magma) to automatically process and annotate the LC−MSn data sets on the basis of candidate molecules from chemical databases, such as PubChem or the Human Metabolite Database. Multistage MSn spectral data is automatically annotated with hierarchical trees of in silico generated substructures of candidate molecules to explain the observed fragment ions and alternative candidates are ranked on the basis of the calculated matching score. We tested this method on an untargeted LC−MSn (n ≤ 3) data set of a green tea extract, generated on an LC-LTQ/Orbitrap hybrid MS system. For the 623 spectral trees obtained in a single LC−MSn run, a total of 116 240 candidate molecules with monoisotopic masses matching within 5 ppm mass accuracy were retrieved from the PubChem database, ranging from 4 to 1327 candidates per molecular ion. The matching scores were used to rank the candidate molecules for each LC−MSn component. The median and third quartile fractional ranks for 85 previously identified tea compounds were 3.5 and 7.5, respectively. The substructure annotations and rankings provided detailed structural information of the detected components, beyond annotation with elemental formula only. Twenty-four additional components were putatively identified by expert interpretation of the automatically annotated data set, illustrating the potential to support systematic and untargeted metabolite identification.

Analytical Chemistry

Received: March 22, 2013 Accepted: May 10, 2013

dx.doi.org/10.1021/ac400861a | Anal. Chem. XXXX, XXX, XXX−XXX

Mass spectrometry allows a comprehensive detection of small metabolites in biological samples. It enables large-scale metabolomics experiments that complement data from genomics, transcriptomics and proteomics analyses toward a complete characterization of biological systems.1,2 However, identification of the numerous unknown metabolites that are detected in untargeted metabolomics studies is currently a significant bottleneck in the biological interpretation.3−5 Recently, liquid chromatography coupled with accurate mass multistage mass spectrometry (LC−MSn) has been described as a powerful platform for compound identification in untargeted metabolomics studies.6−9 Multistage fragmentation data is generated for large numbers of metabolites present in a crude extract, enabling (partial) chemical characterization, even in the absence of data from reference compounds or NMR. Extracting all information present in the LC−MSn data, however, requires significant input and time by human experts with specific knowledge of the detected compound classes and their fragmentation patterns. We have developed a computational method, called MAGMa, to support the data analysis by annotating all fragmented compounds in LC−MSn data sets with candidate molecules taken from large chemical databases such as PubChem. The method is based on a recently developed algorithm for candidate substructure annotation of multistage accurate mass spectral trees,10 and is available via the MAGMa web application (http://www.emetabolomics.org).

Computational algorithms are commonly used to assign elemental formulas to mass peaks based on accurate m/z values and/or isotopic peak ratios.11−13 Recently, refined methods have been described that improve the assignment of elemental formulas using the hierarchical information present in MSn spectral trees.14,15 However, one elemental formula may still
represent hundreds of known chemical structures,16 and an even larger number of possible unknown molecules.11 More specific annotation is obtained by methods that assign in silico generated substructures of possible candidate molecules to the m/z values in a mass spectrum. Methods based on this concept include EPIC,17 Mass-Frontier,18 FiD,19 MetFrag,20 and MassMetaSite.21 Substructure annotation of single fragmentation spectra has been used in various published identification workflows.22,23 However, substructure annotation of multistage MSn data with n > 2 needs to take full account of the hierarchical structure in the spectral data by annotating spectral trees, rather than single spectra. Some approaches rely on similarity with spectral trees of known compounds to derive structural features,6,22,24 though such reference information is not always present. We recently described a novel algorithm10 that can automatically annotate multistage spectral trees obtained from MSn experiments, where n ≥ 2, without relying on reference spectral data. This algorithm derives hierarchical trees of in silico generated substructures to explain multistage spectral data, where the substructure assignments at each MS level take the assignments of the precursor as well as subsequent fragmentations into account. The substructure based annotation results in a matching score that can be used to rank different candidate structures for an unknown metabolite to support its identification. The current online version of MAGMa enables automatic annotation based on candidate molecular structures retrieved from PubChem25 or the Human Metabolite Database.26 Furthermore, MAGMa can be used to annotate all fragmentation data from a complete LC−MSn datafile in a single run, allowing a systematic and detailed analysis of a complex metabolic profile. Here, we apply MAGMa to an untargeted LC−MSn profile of a crude green tea extract including hundreds of multistage spectral trees.
Tea is a major component of the human diet and its composition has been extensively studied in relation to claims of beneficial effects on human health.27−29 The major constituents in green tea are flavan-3-ols and flavonols, which occur in a variety of glycosylated, acylated and polymeric forms.30 A comprehensive identification of phenolic constituents in green and black tea has recently been performed on the basis of LC−MSn (n ≤ 3) analysis.31 In addition, one-dimensional proton NMR was performed for a large number of these components to guide and confirm the MSn based assignments. Here, we use the manual annotations to evaluate the results of our computational approach. Furthermore, we use the in silico annotated data to putatively identify additional compounds in the LC−MSn data set, demonstrating the utility of our approach in the identification of unknown compounds detected in untargeted LC−MSn-based metabolomics studies.

MATERIALS AND METHODS

Experimental Data Set. LC−MSn data was obtained by analyzing a single aqueous-methanol extract of powdered green tea, with liquid chromatography coupled to multistage accurate mass spectrometry using electrospray ionization as described previously.31 A hybrid LTQ MS/Orbitrap FTMS system (Thermo Scientific) was used in negative ionization mode for untargeted MSn (n ≤ 3) with a dynamic exclusion of 20 s, and with up to three MS3 scans of the three most intense MS2 fragment ions above 50 000 counts.8 For part of the lower abundant signals, only MS2 data could be obtained within the limited time the compound was eluting from the column. The raw datafile was converted into mzXML format by the ReadW software (Institute for Systems Biology, Seattle, WA).

Compound Database. A local database was created with PubChem molecules containing only (the most abundant isotopes of) C, H, N, O, P and S elements, up to a mass of 1200 Da. Related stereoisomers present in PubChem were condensed into a single record with unique atom connectivity, keeping an electronic link to all related stereoisomers in the online PubChem database. The local database for the present study was generated from PubChem on January 28, 2013 and contained 21 334 440 (nonstereoisomeric) chemical structures.

Figure 1. Schematic representation of the in silico metabolite annotation algorithm in MAGMa.

Substructure Based Annotation of LC−MSn Data with MAGMa. An overview of the MAGMa process is provided in Figure 1. A number of parameters in the process can be set by the user, depending on the data and the available computational resources. During data import of the mzXML datafile, all signals below 5000 counts were discarded. Furthermore, fragment ions at MSn≥2 were included in the analysis using a 5% threshold relative to the base peak. For each precursor ion in the LC−MSn datafile, candidate molecules were obtained by querying the local PubChem-derived database on monoisotopic mass, assuming MS1 precursor peaks to correspond to [M − H]− ions, and applying a mass tolerance of 5 ppm, with an absolute minimum of 0.001 Da. Each candidate molecule was used to annotate the corresponding spectral tree with in silico generated substructures according to the recently published algorithm.10 In short, substructures of a candidate molecule were systematically generated up to a maximum number of three dissociated bonds. Applying a fourth bond dissociation step, as described in the previous publication,10 is computationally expensive due to the combinatorial expansion of possible substructures. Instead, we augmented the substructures from three bond dissociation steps with the additional removal of the single heteroatom hydroxyl and amino groups only, one at a time, corresponding to the common neutral losses of water and ammonia. These settings resulted in generation of sufficiently extensive sets of relevant substructures, while limiting the required computation time. For each fragment peak in the spectral tree the most likely substructures were selected based on a penalty score depending on the type of bonds that are broken. Single, double, and triple or aromatic bonds involving a heteroatom were assigned penalties of 1, 2, and 3, respectively. Penalties for carbon−carbon bonds were doubled. Furthermore, substructures assigned to a given fragmentation spectrum were restricted to (be part of) the substructure assigned to the corresponding precursor peak, and the selection also takes into account consistent substructure assignments at subsequent MS levels, by means of a recursive matching algorithm.10 The result is a hierarchical tree of substructures and an overall Candidate Score that is calculated as a weighted average of the penalty scores of all assigned substructures, including a penalty of 10 for nonmatched peaks, on the basis of the square root of the intensity. The latter overall penalty score gives a relative indication of how well the candidate molecule matches with the multistage spectral data.

Ranking of Candidate Molecules. Related molecules, such as positional isomers, may obtain equal Candidate Scores when the spectral data does not contain discriminating fragment peaks. Two different rankings of a set of previously identified molecules are reported. Fractional ranking implies that multiple candidates with equal candidate scores receive the same rank number, calculated as the mean of their ordinal rank numbers. For example, if the second and third best scoring candidates have the same score, they both receive rank number 2.5. In practice, when equally valid candidates are obtained for an unknown metabolite, human experts would consider the most common and well-known candidate molecules first. PubChem does not contain the required information to assess the (biological) relevance of all compounds. However, one easily accessible parameter in PubChem that may provide some generic indication of the importance of a compound is the number of the compound related “substances” representing the presence of the compound in a large number of depositor databases. We tested the utility of this parameter in prioritizing the most relevant candidates by means of a “refined” ranking, in which candidate molecules are still primarily sorted on increasing Candidate Score and candidates with equal Candidate Scores are subsequently ordered on the basis of decreasing number of related PubChem Substances.

Data Annotation. The automatically annotated data set is available via the MAGMa web interface at http://www.emetabolomics.org/magma/results/ded5cb2f-c8a9-4f9f-ae11-680c25022a83.

Figure 2. Number of candidates per annotated spectral tree and the ranking of the correct candidate. Black parts of the bars represent the correct molecule and tied candidates with the same score. Panels a and b show the same data, once sorted on decreasing size of the candidate lists and once on increasing precursor m/z. Panels c and d represent results including only candidates with correct elemental formulas.


Figure 3. Substructures (in black) automatically assigned by MAGMa to m/z values in spectral trees of (a) theogallin and (b) quercetin. Neutral losses are indicated in cyan.
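The bond penalties and the Candidate Score described under Materials and Methods can be illustrated with a minimal Python sketch. It only encodes what the text states: heteroatom bond penalties of 1, 2, and 3, doubled penalties for carbon−carbon bonds, a penalty of 10 for nonmatched peaks, and weighting by the square root of the intensity. The function names and the input layout are hypothetical and are not taken from the MAGMa code; how bond penalties are combined into one substructure penalty is not specified here and is left out.

```python
from math import sqrt

# Penalties for dissociating a bond that involves a heteroatom (from the Methods).
HETERO_BOND_PENALTY = {"single": 1, "double": 2, "triple": 3, "aromatic": 3}
UNMATCHED_PEAK_PENALTY = 10


def bond_penalty(bond_type, carbon_carbon):
    """Penalty for breaking one bond; carbon-carbon penalties are doubled."""
    penalty = HETERO_BOND_PENALTY[bond_type]
    return 2 * penalty if carbon_carbon else penalty


def candidate_score(peaks):
    """Weighted average of per-peak penalties, weighted by sqrt(intensity).

    `peaks` is a list of (intensity, penalty) pairs (hypothetical layout);
    a penalty of None marks a peak without any assigned substructure.
    """
    weights = [sqrt(intensity) for intensity, _ in peaks]
    penalties = [UNMATCHED_PEAK_PENALTY if p is None else p for _, p in peaks]
    return sum(w * p for w, p in zip(weights, penalties)) / sum(weights)
```

A lower score indicates a better match, so a candidate that leaves an intense peak unexplained is pushed down the list more strongly than one that misses a minor peak.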



RESULTS AND DISCUSSION

The LC−MSn data set of green tea contained fragmentation data of 623 molecular ions, in-source fragments and adducts. This fragmentation data included 325 spectral trees, consisting of an MS2 scan and on average 2.3 MS3 scans. A total of 298 fragmentations consisted of MS2 only (due to lower signal intensities). On the basis of the observed MS1 precursor m/z values, ranging from 96.9603 to 1175.5500 Da, a total of 116 240 candidate structures were retrieved from the PubChem database. In silico fragmentation of all candidate structures resulted in 130 million (nonunique) substructures, of which 565 303 were selected for the annotation of 5969 peaks in the data set (each peak was annotated with fragments from multiple candidates) on the basis of monoisotopic mass. The entire processing required 9.2 CPU hours on a virtual Ubuntu 12.04 system running on a Windows 7 host with an Intel i7 2.4 GHz processor. Parallel processing of candidate molecules on 4 CPUs resulted in an actual runtime of 2.5 h. The automatic annotation of the LC−MSn data set was evaluated for 89 components that were previously assigned based on expert knowledge of MS fragmentation patterns and 1D-1H NMR spectra.31 For four of these previous assignments, PubChem did not contain compounds that agreed with the earlier characterizations. These included two novel structures that have recently been elucidated by NMR as quercetin-3-O-(2G-p-coumaroyl(trans)-3G-O-β-L-arabinosyl-3R-O-β-D-rhamnosylrutinoside and kaempferol-3-O-(2G-p-coumaroyl(trans)-3G-O-β-L-arabinosyl-3R-O-β-D-glucosylrutinoside,31,32 and which were not yet deposited in the PubChem database. For the remaining 85 assignments, correct chemical structures were present among the automatically generated lists of PubChem candidates, the size of which ranged from 4 to 1327 per precursor ion. Figure 2a,b presents the rankings of the correct molecules, relative to the total number of candidates.
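The import thresholds and the mass window behind these candidate lists (5000-count cutoff, 5% base-peak threshold, 5 ppm tolerance with a 0.001 Da minimum, [M − H]− assumption) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the MAGMa implementation: the function names are hypothetical, and applying the ppm tolerance to the precursor m/z (instead of the neutral mass) is an assumption.

```python
PROTON_MASS = 1.007276  # Da; added to an [M - H]- m/z to get the neutral mass


def neutral_mass_window(precursor_mz, ppm=5.0, min_abs_da=0.001):
    """Return (low, high) neutral monoisotopic mass bounds for an [M - H]- ion.

    Assumption: the ppm tolerance is evaluated on the precursor m/z.
    """
    neutral = precursor_mz + PROTON_MASS
    tolerance = max(precursor_mz * ppm * 1e-6, min_abs_da)
    return neutral - tolerance, neutral + tolerance


def filter_fragments(peaks, abs_threshold=5000, rel_threshold=0.05):
    """Keep (mz, intensity) peaks above the absolute and base-peak-relative cutoffs."""
    kept = [(mz, inten) for mz, inten in peaks if inten >= abs_threshold]
    if not kept:
        return []
    base_peak = max(inten for _, inten in kept)
    return [(mz, inten) for mz, inten in kept if inten >= rel_threshold * base_peak]
```

For the lowest precursor in the data set (m/z 96.9603), 5 ppm is below 0.001 Da, so the absolute minimum determines the window; for heavier ions the ppm term dominates.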
For those compounds that could only be partially characterized in the earlier study,31 the best scoring candidate that matches the previous characterization was considered as a “correct” candidate. For example, if the presence of a certain number of glycoside groups had been determined, but not their position on the aglycone backbone or how they are attached to each other, any matching structure with the given number of glycosides was accepted. For (isomers of) two components in the evaluation set, the MAGMa annotation suggested alternative structures that we
considered equally valid as our earlier putative annotations (see Supporting Information). In all but 6 cases, the correct structures are ranked high, as indicated by short dark bars in Figure 2, relative to the total number of candidates. In most cases, the correct candidate is tied with a variable number of equally scoring candidates (black parts of the bars in Figure 2). The fractional rankings of all 85 known components, which correspond to the centers of the black bars in Figure 2, show a median of 3.5 and a third quartile of 7.5. “Refining” the ranking of candidates with equal scores based on the number of compound associated PubChem substances resulted in a small improvement of the median and third quartile rankings to 2 and 6, respectively. This indicates that the PubChem-derived parameter may provide some added value in prioritizing the most relevant candidate molecules for further verification. However, it introduces a bias toward known compounds and is not based on the experimental data. Therefore, its added value will depend on the type of data and should be used with care. The actual list of molecules and their rankings is provided as Supporting Information. Figure 2b shows that the 6 molecules that were ranked less successfully, at positions beyond 50, are the molecules with the lowest monoisotopic mass among the previously identified components. They are, starting from the left in Figure 2b, caffeine, kaempferol, epicatechin, catechin, a quercetin isomer (possibly Tricetin) and quercetin. As the candidates for each spectral tree have been selected on the basis of monoisotopic mass, with a threshold of 5 ppm deviation, they may include false candidates with incorrect elemental formulas. It might be possible to exclude incorrect elemental formulas on the basis of additional information, for example isotope peak ratios.11 Figure 2c,d indicates that the lists of candidates with correct elemental formulas are significantly shorter. 
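The fractional and "refined" rankings used in this evaluation can be sketched in Python as shown below. The sketch follows the definitions given in the Methods (mean ordinal rank for tied scores; ties broken by decreasing number of PubChem substances). The function names and the dictionary-based inputs are hypothetical, and ties are detected by exact score equality, which is an assumption about how equal Candidate Scores are compared.

```python
def fractional_ranks(scores):
    """Map each candidate to the mean ordinal rank of its tie group.

    `scores` maps candidate name -> Candidate Score (lower is better).
    """
    ordered = sorted(scores, key=scores.get)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over all candidates sharing the same score as position i.
        while j < len(ordered) and scores[ordered[j]] == scores[ordered[i]]:
            j += 1
        mean_rank = (i + 1 + j) / 2  # mean of ordinal ranks i+1 .. j
        for name in ordered[i:j]:
            ranks[name] = mean_rank
        i = j
    return ranks


def refined_order(scores, substance_counts):
    """Sort on increasing score, then on decreasing PubChem substance count."""
    return sorted(scores, key=lambda name: (scores[name], -substance_counts[name]))
```

With scores `{"a": 1.0, "b": 2.0, "c": 2.0, "d": 3.5}`, candidates `b` and `c` both receive rank 2.5, matching the example given in the Methods.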
However, the absolute rankings of the correct candidates improve only slightly, with unchanged median and quartile rankings, indicating that the substructure-based candidate score already distinguished molecules with correct elemental formulas. This is confirmed by the fact that the top ranked candidates for the spectral trees in the evaluation set, even when they were not the correct molecule, always represented the correct elemental formula. A similar finding has been reported by Hill et al. for the ranking of PubChem candidate molecules, based on fragment predictions
with the Mass Frontier software, for positive ionization CID spectra of 102 individual test compounds.18 Comparison of Figure 2a,b with Figure 2c,d shows that the four worst ranked molecules are not only the smallest molecules in the assigned data set, but they are also accompanied by the largest sets of alternative candidates with identical elemental formulas. Caffeine is the smallest assigned molecule and its MS fragments were not specific enough to distinguish the correct structure among more than 600 alternative molecules with the same elemental composition. The next five smallest molecules are kaempferol, epicatechin, catechin, a quercetin isomer (possibly Tricetin) and quercetin. Upon fragmentation in negative ionization mode, these polyphenolic aglycones undergo chemical rearrangements for which the substructure based annotation is less successful. This is exemplified by the two substructure annotations for theogallin and quercetin (Figure 3), where the first is straightforward to interpret and the latter requires more careful analysis. The substructures assigned to the spectral tree of theogallin correspond to single dissociations at the ester bond, resulting in quinic acid and gallic acid. These substructures are generated with a low penalty score, resulting in a top-ranking of theogallin among other PubChem candidates. In the case of quercetin, fragmentation occurs in the C-ring at the center of the molecule, involving several chemical rearrangements.7 The m/z 257.0455 fragment of quercetin, for example, corresponds to an unusual loss of CO2 from the C-ring.33,34 The annotation algorithm of MAGMa does not include prediction of rearrangements and, consequently, the assigned substructures do not represent the actual fragment ions. 
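The CO2 loss mentioned for quercetin can be checked with a small Python sketch that compares the precursor-to-fragment mass difference against a few common neutral losses. The table of losses, the tolerance, and the function name are illustrative assumptions; the quercetin precursor value of m/z 301.0354 is the theoretical [M − H]− mass for C15H10O7 and is not quoted from the data set.

```python
# Monoisotopic masses (Da) of a few common neutral losses.
NEUTRAL_LOSSES = {
    "H2O": 18.010565,
    "NH3": 17.026549,
    "CO": 27.994915,
    "CO2": 43.989829,
}


def match_neutral_loss(precursor_mz, fragment_mz, tol_da=0.002):
    """Return the names of neutral losses that explain the observed mass difference."""
    delta = precursor_mz - fragment_mz
    return [name for name, mass in NEUTRAL_LOSSES.items() if abs(delta - mass) <= tol_da]


# Quercetin [M - H]- (theoretical, assumed) versus the m/z 257.0455 fragment.
print(match_neutral_loss(301.0354, 257.0455))
```

A mass difference alone identifies the lost elemental composition, not the mechanism, which is why a rearrangement such as this one can still be annotated with a substructure that does not reflect the real fragment ion.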
The substructure assigned to m/z 257.0455 is obtained by loss of separate carbonyl and hydroxyl groups, illustrating a more general observation that substructures assigned to rearrangement products often involve dissociation of multiple bonds in the molecule, required to match the experimental mass and resulting in relatively high penalty scores. For the spectral trees other than the six cases discussed above, the substructure based Candidate Score helped to rank the correct molecules among the candidates with identical elemental composition (Figure 2c,d). This is further illustrated by the five pairs of distinct molecules with identical elemental composition identified in the green tea sample (Table 1). The corresponding pairs of spectral trees are analyzed based on identical lists of candidates, including both assigned molecules. In all cases each molecule is ranked highest for its corresponding spectral tree, and also higher than the other tea component with identical elemental composition. For the second example in Table 1, the prodelphinidin and quercetin-3-O-p-coumaroylhexoside pair (C30H26O14, m/z 609.1254), details of the annotations are shown in Figure 4, to clarify the correct ranking. The spectrum of quercetin-3-O-p-coumaroylhexoside consists of two main peaks corresponding to the quercetin (m/z 301.0351) and quercetin-3-O-hexoside (m/z 463.0877) fragments. The latter peak cannot be explained by the substructures of the alternative compound for this elemental formula, prodelphinidin (i.e., the wrong molecule for this fragmentation data) and the first peak only by a high penalty substructure, resulting in a high overall penalty score and lower ranking, as expected. The main peaks in the spectrum belonging to prodelphinidin correspond to the (epi)gallocatechin monomer, and to three fragments (only the two most intense are shown) consisting of one (epi)gallocatechin monomer connected to a fragment of the second (epi)gallocatechin unit.
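The m/z values quoted for this pair can be related to elemental formulas with a short Python sketch that computes monoisotopic [M − H]− masses and the ppm deviation from an observed value. The formula C30H26O14 is given in the text; C15H10O7 (quercetin) and C21H20O12 (quercetin hexoside) are supplied here from general chemical knowledge, and the helper names are hypothetical.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotopes.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307401,
    "O": 15.99491462,
    "P": 30.97376200,
    "S": 31.97207117,
}
PROTON = 1.00727647  # Da


def monoisotopic_mass(formula):
    """Monoisotopic mass of a simple formula such as 'C30H26O14'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def deprotonated_mz(formula):
    """Theoretical m/z of the singly charged [M - H]- ion."""
    return monoisotopic_mass(formula) - PROTON


def ppm_error(observed, theoretical):
    """Signed deviation of the observed m/z in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


for formula, observed in [("C30H26O14", 609.1254), ("C15H10O7", 301.0351), ("C21H20O12", 463.0877)]:
    theoretical = deprotonated_mz(formula)
    print(formula, round(theoretical, 4), round(ppm_error(observed, theoretical), 2))
```

All three quoted values fall within the 5 ppm tolerance used for candidate retrieval, which is consistent with both prodelphinidin and the quercetin derivative appearing in the same candidate list.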

Table 1. Fractional (and Refined) Rankings of Five Pairs of Molecules a and b, with Identical Elemental Composition Used To Annotate the Corresponding Pairs of Spectral Trees, A and Ba

a Numbers T00XX refer to previously identified spectral trees provided in the Supporting Information of Van der Hooft et al.31 For each pair, rankings are given for the correct candidates (a for A, and b for B) and for the cross annotation (a for B and b for A).

Such fragmentation can be expected from a stable C−C linkage between the (epi)gallocatechin monomers. Annotation of these fragments based on the wrong candidate, quercetin-3-O-p-coumaroylhexoside, results in less plausible substructures with higher penalty scores.

Figure 4. MS2 spectra, and automatically assigned substructures (in black) with penalty scores (between parentheses), of quercetin 3-O-p-coumaroylhexoside and prodelphinidin. The bold lines highlight the assignments based on the correct molecules. The other substructures are obtained from cross-candidate annotation, resulting in higher substructure penalty scores. Fragment m/z 463.0877 could not be matched by any substructure of prodelphinidin.

These examples illustrate how substructure-based annotation and ranking provides information beyond annotations with elemental formula only. The scoring algorithm prioritizes candidate structures that match best with the observed fragmentation and the assigned substructures facilitate a chemical interpretation of the MS spectra based on a given candidate molecule. The larger dark-gray and black bars in Figure 2 exemplify that the ranking based on Candidate Score is not perfect. This may reflect limitations of the current scoring algorithm as well as of the fragmentation data itself, which do not always provide the information required to distinguish between possible candidates. Consequently, final structure assignments still need to be made by a human expert, e.g., by including other experimental information. However, the ranking often guides the expert effectively to the most relevant candidates from a much larger list of possibilities than would have been evaluated manually (Figure 2).

Further analysis of the automatically annotated data set resulted in the putative annotation of 24 additional compounds (Figure 5), based on the ranking as well as on an assessment of the substructures assigned to MS2 and MS3 fragment ions. For example, vitexin-2″-sulfate and isovitexin-2″-sulfate were tentatively annotated, for which assigned substructures indicated chemical losses and rearrangements in the glycoside moiety, while still attached to the apigenin aglycone (the substructure assignment is provided as Supporting Information). These are expected to be typical fragments for C-bonded glycosides, as O-bonded glycosides are normally lost completely in MS fragmentation. A literature search confirmed that (iso)vitexin-2″-sulfates have been found in tea leaves and that the observed m/z 311 and m/z 431 ions (Figure S1 of the Supporting Information) are known MS fragments of these molecules.35 Likewise, we could confirm our assignments of saponarin, theasinensin,36 procyanidin C1, caffeoylglucose,37 di-O-galloylglucose,38 tri-O-galloyl-glucose,38 vitexin 2″-rhamnoside39 and (epi)gallocatechin-3-O-p-coumarate40 on the basis of literature reports of these components in green tea and their MS fragmentations. Myricetin-3-O-galloylhexoside has been found in black tea.41 The highest m/z to which we assigned a molecular structure was 1173.5687 Da. All top 3 ranked candidates (out of 11) for this spectral tree were triterpene saponins, including theasaponin C1, that has been isolated from seeds of tea plant.42 A number of putative assignments are made of compounds that to our knowledge have not been identified in tea before: myricetin-3-O-coumaroylhexoside has been extracted from water lily43 and pine needles,44 epigallocatechin-3-O-p-hydroxybenzoate from Cistus salvifolius,45 epigallocatechin-3-O-ferulate46 and (epi)gallocatechin-methyl(epi)gallocatechin dimer47 from Parapiptadenia rigida bark, mollupentin 2″-O-rhamnoside or cerarvensin 2″-O-rhamnoside48 from Allophyllus edulis leaves, trichocarpin from poplar bark,49 and di-O-p-coumaroylquinic acids have been identified in coffee beans.50 Further characterization, e.g., by NMR, is required to verify these tentative identifications in green tea.

Figure 5. Structures tentatively assigned to spectral trees in the LC−MSn data set of green tea, using the MAGMa annotation.

The newly assigned molecules include a diverse set of chemical structures (Figure 5). Our earlier analysis was primarily focused on flavonols and flavan-3-ols, guided by typical fragmentations observed for the aglycone fragments.31 The systematic search of candidate molecules for all spectral trees in the MAGMa-based annotation resulted in the finding of additional compound classes in the tea LC−MSn data set, including C-glycosylated polyphenols (vitexin-2″-sulfate, vitexin-2″-rhamnoside, mollupentin-2″-O-rhamnoside and saponarin), procyanidins ((epi)gallocatechin-methyl(epi)gallocatechin and procyanidin C1) and more distinct chemical classes such as the multiple acylated glycosyls and quinic acids, saponins and phenol glycosides (trichocarpin). The automatic annotation of the MSn data with substructures facilitated the assessment of the candidate chemical structures given the observed fragmentations.

CONCLUSIONS

We have developed a method to automatically annotate high-resolution accurate mass LC−MSn data on the basis of substructures of candidate molecules retrieved from chemical databases. Application of the method to a green tea data set successfully annotated 79 out of 89 previously identified compounds, including the correct molecules at high rankings in the candidate lists retrieved from PubChem. Six compounds were present in the candidate list, but ranked outside the top 50. They included common molecules, i.e., caffeine, kaempferol, (epi)catechin and quercetin, for which reference data is usually available. A practical approach may be to combine the present substructure based automatic annotation with techniques that automatically match reference spectral trees stored in databases,6,24 which will facilitate the annotation of known compounds, especially for these smaller and more common molecules. Four of the earlier assignments were not reproduced by the automatic annotations due to the fact that the correct molecules were not (yet) present in the PubChem database. This restriction to known molecules may be overcome by methods that generate new chemical structures in silico, for example by de novo structure generation, or by applying reaction rules that can generate biochemically feasible metabolites. Integration of metabolic reaction rules to generate extended sets of candidates within MAGMa is currently in progress. In addition to the evaluation results, the automatically annotated data set resulted in new putative annotations, including compounds not reported in green tea before. This illustrates the potential of the automatic annotation to achieve a systematic analysis. Although final structure assignments still need to be made by human experts, their efforts can be focused on the most relevant candidates, as prioritized by the calculated Candidate Score. The substructures used in the automatic annotation need careful interpretation. But even when the substructures do not represent the actual fragments formed in the MS experiment, the annotation can help the expert to interpret the observed mass peaks by providing an initial mapping of the observed fragments to the structures of possible candidates. This is especially valuable for the identification of less familiar compound classes.

ASSOCIATED CONTENT

*S Supporting Information

Additional information as noted in the text. This material is available free of charge via the Internet at http://pubs.acs.org.

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

Present Address

# J.J.J.v.d.H.: University of Glasgow, Joseph Black building, University of Glasgow, Glasgow G12 8QQ, UK.

Notes

The authors declare no competing financial interest.

ACKNOWLEDGMENTS

We thank Peter Jacobs, Marijn Sanders and Fatma Yelda Ünlü for helpful discussions and support. This research was supported by The Netherlands eScience Center, grant number 660.011.302. J.J.J.v.d.H., R.C.H.d.V., and J.V. acknowledge The Netherlands Metabolomics Centre and J.J.J.v.d.H. and R.C.H.d.V. also the Centre for Biosystems Genomics, which are both part of The Netherlands Genomics Initiative/Netherlands Organization for Scientific Research, for funding.



REFERENCES

(1) Fiehn, O. Plant Mol. Biol. 2002, 48, 155−171.
(2) Moco, S.; Vervoort, J.; Bino, R. J.; de Vos, R. C. H.; Bino, R. Trends Anal. Chem. 2007, 26, 855−866.
(3) Wishart, D. S. Bioanalysis 2011, 3, 1769−1782.


dx.doi.org/10.1021/ac400861a | Anal. Chem. XXXX, XXX, XXX−XXX


(4) Dunn, W.; Erban, A.; Weber, R. M.; Creek, D.; Brown, M.; Breitling, R.; Hankemeier, T.; Goodacre, R.; Neumann, S.; Kopka, J.; Viant, M. Metabolomics 2013, 9, 44−66.
(5) van der Hooft, J. J. J.; de Vos, R. C. H.; Ridder, L.; Vervoort, J.; Bino, R. J. Metabolomics 2013, 1−10.
(6) Sheldon, M. T.; Mistrik, R.; Croley, T. R. J. Am. Soc. Mass Spectrom. 2009, 20, 370−376.
(7) van der Hooft, J. J. J.; Vervoort, J.; Bino, R. J.; Beekwilder, J.; de Vos, R. C. H. Anal. Chem. 2011, 83, 409−416.
(8) van der Hooft, J. J. J.; Vervoort, J.; Bino, R. J.; de Vos, R. C. H. Metabolomics 2012, 8, 691−703.
(9) Roux, A.; Xu, Y.; Heilier, J.-F.; Olivier, M.-F.; Ezan, E.; Tabet, J.-C.; Junot, C. Anal. Chem. 2012, 84, 6429−6437.
(10) Ridder, L.; van der Hooft, J. J. J.; Verhoeven, S.; de Vos, R. C. H.; van Schaik, R.; Vervoort, J. Rapid Commun. Mass Spectrom. 2012, 26, 2461−2471.
(11) Kind, T.; Fiehn, O. BMC Bioinf. 2006, 7, 234.
(12) Kind, T.; Fiehn, O. BMC Bioinf. 2007, 8, 105.
(13) Böcker, S.; Letzel, M. C.; Lipták, Z.; Pervukhin, A. Bioinformatics 2009, 25, 218−224.
(14) Rojas-Chertó, M.; Kasper, P. T.; Willighagen, E. L.; Vreeken, R. J.; Hankemeier, T.; Reijmers, T. H. Bioinformatics 2011, 27, 2376−2383.
(15) Scheubert, K.; Hufsky, F.; Rasche, F.; Böcker, S. J. Comput. Biol. 2011, 18, 1383−97.
(16) Neumann, S.; Böcker, S. Anal. Bioanal. Chem. 2010, 398, 2779−2788.
(17) Hill, A. W.; Mortishire-Smith, R. J. Rapid Commun. Mass Spectrom. 2005, 19, 3111−3118.
(18) Hill, D. W.; Kertesz, T. M.; Fontaine, D.; Friedman, R.; Grant, D. F. Anal. Chem. 2008, 80, 5574−5582.
(19) Heinonen, M.; Rantanen, A.; Mielikäinen, T.; Kokkonen, J.; Kiuru, J.; Ketola, R. A.; Rousu, J. Rapid Commun. Mass Spectrom. 2008, 22, 3043−3052.
(20) Wolf, S.; Schmidt, S.; Muller-Hannemann, M.; Neumann, S. BMC Bioinf. 2010, 11, 148.
(21) Bonn, B.; Leandersson, C.; Fontaine, F.; Zamora, I. Rapid Commun. Mass Spectrom. 2010, 24, 3127−3138.
(22) Peironcely, J. E.; Rojas-Chertó, M.; Tas, A.; Vreeken, R.; Reijmers, T.; Coulier, L.; Hankemeier, T. Anal. Chem. 2013, 85, 3576−3583.
(23) Li, L.; Li, R.; Zhou, J.; Zuniga, A.; Stanislaus, A. E.; Wu, Y.; Huan, T.; Zheng, J.; Shi, Y.; Wishart, D. S.; Lin, G. Anal. Chem. 2013, 85, 3401−3408.
(24) Rojas-Chertó, M.; Peironcely, J. E.; Kasper, P. T.; van der Hooft, J. J. J.; de Vos, R. C. H.; Vreeken, R.; Hankemeier, T.; Reijmers, T. Anal. Chem. 2012, 84, 5524−5534.
(25) PubChem. http://pubchem.ncbi.nlm.nih.gov/.
(26) Wishart, D. S.; Tzur, D.; Knox, C.; Eisner, R.; Guo, A. C.; Young, N.; Cheng, D.; Jewell, K.; Arndt, D.; Sawhney, S.; Fung, C.; Nikolai, L.; Lewis, M.; Coutouly, M.-A.; Forsythe, I.; Tang, P.; Shrivastava, S.; Jeroncic, K.; Stothard, P.; Amegbey, G.; Block, D.; Hau, D. D.; Wagner, J.; Miniaci, J.; Clements, M.; Gebremedhin, M.; Guo, N.; Zhang, Y.; Duggan, G. E.; MacInnis, G. D.; Weljie, A. M.; Dowlatabadi, R.; Bamforth, F.; Clive, D.; Greiner, R.; Li, L.; Marrie, T.; Sykes, B. D.; Vogel, H. J.; Querengesser, L. Nucleic Acids Res. 2007, 35, D521−D526.
(27) Cabrera, C.; Artacho, R.; Gimenez, R. J. Am. Coll. Nutr. 2006, 25, 21.
(28) Hodgson, J. M.; Croft, K. D. Mol. Aspects Med. 2010, 31, 495−502.
(29) Yang, C. S.; Wang, X.; Lu, G.; Picinich, S. C. Nat. Rev. Cancer 2009, 9, 429−439.
(30) Sang, S.; Lambert, J. D.; Ho, C.-T.; Yang, C. S. Pharmacol. Res. 2011, 64, 87−99.
(31) van der Hooft, J. J. J.; Akermi, M.; Unlu, F. Y.; Mihaleva, V.; Roldan, V. G.; Bino, R. J.; de Vos, R. C. H.; Vervoort, J. J. Agric. Food Chem. 2012, 60, 8841−8850.

(32) Mihara, R.; Mitsunaga, T.; Fukui, Y.; Nakai, M.; Yamaji, N.; Shibata, H. Tetrahedron Lett. 2004, 45, 5077−5080.
(33) Fabre, N.; Rustan, I.; de Hoffmann, E.; Quetin-Leclercq, J. J. Am. Soc. Mass Spectrom. 2001, 12, 707−715.
(34) Stöggl, W. M.; Huck, C. W.; Bonn, G. K. J. Sep. Sci. 2004, 27, 524−528.
(35) Ishida, H.; Wakimoto, T.; Kitao, Y.; Tanaka, S.; Miyase, T.; Nukaya, H. J. Agric. Food Chem. 2009, 57, 6779−6786.
(36) Sang, S.; Lee, M.-J.; Hou, Z.; Ho, C.-T.; Yang, C. S. J. Agric. Food Chem. 2005, 53, 9478−9484.
(37) Bastos, D. H.; Saldanha, L. A.; Catharino, R. R.; Sawaya, A.; Cunha, I. B.; Carvalho, P. O.; Eberlin, M. N. Molecules 2007, 12, 423−432.
(38) Gondoin, A.; Grussu, D.; Stewart, D.; McDougall, G. J. Food Res. Int. 2010, 43, 1537−1544.
(39) Lin, L.-Z.; Chen, P.; Harnly, J. M. J. Agric. Food Chem. 2008, 56, 8130−8140.
(40) Hashimoto, F.; Nonaka, G.-i.; Nishioka, I. Chem. Pharm. Bull. 1987, 35, 611−616.
(41) Atoui, A. K.; Mansouri, A.; Boskou, G.; Kefalas, P. Food Chem. 2005, 89, 27−36.
(42) Yoshikawa, M.; Morikawa, T.; Nakamura, S.; Li, N.; Li, X.; Matsuda, H. Chem. Pharm. Bull. 2007, 55, 57−63.
(43) Elegami, A. A.; Bates, C.; Gray, A. I.; Mackay, S. P.; Skellern, G. G.; Waigh, R. D. Phytochemistry 2003, 63, 727−731.
(44) Liu, D.-y.; Shi, X.-f.; Wang, D.-d.; He, F.-j.; Ma, Q.-h.; Fan, B. Chem. Nat. Compd. 2011, 47, 704−707.
(45) Danne, A.; Petereit, F.; Nahrstedt, A. Phytochemistry 1994, 37, 533−538.
(46) Schmidt, C. A.; Murillo, R.; Bruhn, T.; Bringmann, G.; Goettert, M.; Heinzmann, B.; Brecht, V.; Laufer, S. A.; Merfort, I. J. Nat. Prod. 2010, 73, 2035−2041.
(47) Schmidt, C. A.; Murillo, R.; Heinzmann, B.; Laufer, S.; Wray, V.; Merfort, I. J. Nat. Prod. 2011, 74, 1427−1436.
(48) Hoffmann-Bohm, K.; Lotter, H.; Seligmann, O.; Wagner, H. Planta Med. 1992, 58, 544−548.
(49) Loeschcke, V.; Francksen, H. Naturwissenschaften 1964, 51, 140−140.
(50) Jaiswal, R.; Kuhnert, N. Rapid Commun. Mass Spectrom. 2010, 24, 2283−2294.
