Ultrafast PubChem Searching Combined with Improved Filtering

Article pubs.acs.org/ac

Ultrafast PubChem Searching Combined with Improved Filtering Rules for Elemental Composition Analysis

Arjen Lommen*

RIKILT Wageningen UR, P.O. Box 230, 6700 AE Wageningen, The Netherlands

ABSTRACT: A new and improved software tool for elemental composition annotation of molecular ions detected in mass spectrometry, based on improved filtering rules followed by ultrafast querying of publicly available compound databases, is provided. PubChem is used as a general source of 1.3 million unique chemical formulas. A plant metabolomics database containing ca. 100 000 formulas is used as a source of naturally occurring compounds. Four modes with different sets of rules for heuristic filtering of candidate formulas coming from elemental composition analysis are incorporated and tested on both databases. The elemental composition analysis is then coupled to ultrafast PubChem searching based on a mass-indexed intermediate system. The performance of the filters is compared and discussed. When reactive compounds are assumed not to be present, 99.95% of the 1.3 million PubChem formulas are correctly found, while ca. 30% fewer formulas per mass are given compared to previously published rules. For the ca. 100 000 plant metabolomics based formulas, 100% fit the improved rules.

Higher resolution in liquid chromatography and mass spectrometry leads to larger data sets with more, and more precise, data. In full-scan high-resolution hyphenated-MS profiling, the targets for analysis are not always known in advance. A recent extensive review on identification, its pitfalls, and the state-of-the-art techniques has been given by Dunn et al.1 Standard analysis starts off with deducing the molecular mass of a signal of interest from its isotope pattern and adducts. Candidate chemical formulas are calculated from the molecular mass and the information obtained from the isotope pattern. The next direct and simple approach may be a search for a formula in public databases. Popular databases are, for instance, HMDB,2 DrugBank,3 Knapsack,4 PubChem (http://PubChem.ncbi.nlm.nih.gov/), and Chemspider (http://chemspider.com/). The success of such an approach depends on the uniqueness of the formula and the knowledge of what may be expected for the sample in question. For example, 1512 different compounds in PubChem have the elemental composition C19H28O2. However, if this is a human metabolite and C19H28O2 is searched in the human metabolite database (HMDB), then only 8 compounds are retrieved. Seven of these are hormones with the same basic backbone as testosterone. A single precise molecular mass or formula is not conclusive in itself but can help in initially narrowing down possible compound identities. It is very common that additional steps in identification are required, such as fragmentation of ions or orthogonal analytical techniques such as NMR. In many cases, a reference standard can be obtained to verify an identity. If the unidentified compound is in a database, but a standard is not available, then fragmentation information may alternatively be analyzed in silico together with the available chemical structure using programs like metFrag5 and MAGMa.6

© 2014 American Chemical Society

In principle, any option in this identification game can be useful as long as it narrows down the number of structures that fit the information. A primary step is minimizing the number of elemental compositions that are possible for a given mass and its accuracy. Increased mass precision and the analysis of the isotopic patterns are primary constraints. Interestingly, it is possible to obtain sub-ppm precision with postprocessing of UHPLC-Orbitrap-MS data at 50 000 resolution.7−9 Kind and Fiehn10 have published excellent and extensive work by postulating 7 golden rules to reduce the number of calculated elemental compositions (formulas). They also searched PubChem to affirm the plausibility of chemical formulas. A drawback of their approach, however, is that databases like PubChem and Chemspider are slow to search due to their tens of millions of entries. In this study, the aim is to improve the filtering of chemical formula solutions coming from elemental composition analysis and to query ultrafast whether, and how many times, a formula is present in PubChem. To do this, essential information from PubChem has been converted to a size below 1 Gb and indexed. Ca. 1.3 million unique formulas have been retrieved from PubChem and subjected to analysis. A software tool called HR3 (with an interface as well as a batch mode) is provided, in which both the improved and the previously published filtering are available; this tool also performs the ultrafast PubChem searching. By automatically providing PubChem IDs and hyperlinks for the chemical formulas obtained from the analysis, all functionality and information in PubChem and Chemspider is retained.

Received: February 21, 2014
Accepted: May 12, 2014
Published: May 12, 2014

dx.doi.org/10.1021/ac500667h | Anal. Chem. 2014, 86, 5463−5469

Analytical Chemistry

Article

Ca, Ni, Zn, Co} is present. (5) Leave out: stereo isomers by analyzing SMILES annotations. This is done as follows. Consider a set of entries describing possible stereo isomers. They have canonical and isomeric SMILES annotations. If these are the same, no stereo isomer information is present for that entry. Such entries are preferentially selected. If no entry exists in which the canonical and isomeric SMILES are the same, then all entries are selected.

Modification of Formulas. Since permanently charged compounds (for instance, quaternary nitrogen) will not pass the LEWIS rule11 (LEWIS as well as SENIOR rules are extensively explained and put into context by Kind and Fiehn10), the number of hydrogens is adjusted to make a neutral formula. When one cannot distinguish between a permanent charge and ionization, a search using an uncharged molecular formula will then also return the link to the permanently charged alternative.

Making Accessing the Intermediate Ultrafast. (A) Sorting on mass: The molecular mass of the first isotope has been calculated for all formulas selected. The entire collection of formulas is then sorted on the basis of mass (low to high). In parallel, the corresponding formulas and IDs are sorted accordingly. All formulas are then broken up into atom files containing the number of each atom in each formula. (B) Indexing the search: The mass is then correlated to the position in the PubChem intermediate with an index file. This way, the mass and mass error determine the positions in the PubChem intermediate between which the formula and ID can be found.

Linking the PubChem Intermediate to Internet Information. This is achieved by automatically creating hyperlinks to all IDs (Internet pages in the original PubChem) in the Excel-compatible output; in the same effort, these formulas are counted. Since Chemspider is a subset of PubChem, Chemspider hyperlinks are produced (using the formula) if a formula is found in PubChem.
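The mass-sorted, indexed lookup described under "Making Accessing the Intermediate Ultrafast" can be illustrated with a minimal Python sketch. It is not the HR3 implementation: a binary search over an in-memory sorted list stands in for the index file, the record layout is an assumption, and the IDs are placeholders rather than real PubChem identifiers.

```python
from bisect import bisect_left, bisect_right

# Hypothetical miniature intermediate: (monoisotopic mass, formula, IDs).
# The IDs are placeholders for illustration, not real PubChem identifiers.
RECORDS = [
    (180.06339, "C6H12O6", [101, 102]),
    (88.05243, "C4H8O2", [103]),
    (288.20893, "C19H28O2", [104, 105, 106]),
    (125.01466, "C2H7NO3S", [107]),
]

# Step A: sort everything once on mass (low to high).
RECORDS.sort(key=lambda rec: rec[0])
MASSES = [rec[0] for rec in RECORDS]


def search(mass, ppm):
    """Return all records whose mass lies within +/- ppm of `mass`."""
    tol = mass * ppm * 1e-6
    # Step B: binary search (an assumed stand-in for the index file) gives
    # the two positions between which all candidate formulas are stored.
    first = bisect_left(MASSES, mass - tol)
    last = bisect_right(MASSES, mass + tol)
    return RECORDS[first:last]


for mass, formula, ids in search(180.06339, ppm=1.0):
    print(formula, "found with", len(ids), "IDs")
```

Because the sort is done once when the intermediate is built, each query only costs two logarithmic-time position lookups plus the retrieval of the hits, which is the property that makes repeated batch queries cheap.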
The Chemspider hyperlinks give the opportunity to rank hits in Chemspider based on the number of times a compound is referenced in the literature.

Filtering of Formulas. The theoretical basis for filtering of formulas is given in Kind and Fiehn,10 and the original code for elemental composition analysis is given in HR2 in the Supporting Information of their publication. Each formula solution is tested for LEWIS and SENIOR rules10−12 (radical, charge, and valence tests) as well as filtered on the basis of H/C, N/C, O/C, P/C, and S/C ratios and certain improbable combinations of numbers of N, O, P, and S. The filtering ratios are given in Table 1. Kind and Fiehn optimized the H/C, N/C, O/C, P/C, and S/C ratios based on 45 000 formulas in the Wiley mass spectral database. The more restrictive values cover 99.7%, while the less restrictive values cover 99.9%. It is these ratios that have

1.3 million unique PubChem formulas (a source of natural and synthetic compounds) as well as ca. 100 000 formulas (a source of natural compounds) from a plant metabolomics library have been checked using 4 different filtering modes (including those of Kind and Fiehn). To simulate experimentally obtained data, different mass errors have been assumed for all existing formulas in the databases. Additionally, the numbers of C, S, Cl, and Br are assumed to be deduced from the isotope pattern within a given error. These input data are then processed by the software. Together, this gives an evaluation of the performance of the filtering rules.
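As an illustration of how a formula and an assumed mass error translate into a search window, the following Python sketch computes a monoisotopic mass and the corresponding ppm interval. The function names are hypothetical and the isotope masses are standard rounded values; the sketch does not reproduce how HR3 itself simulates the errors.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes, rounded values.
MONO = {
    "C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146,
    "P": 30.9737620, "S": 31.9720707, "F": 18.9984032,
    "Cl": 34.9688527, "Br": 78.9183376,
}


def parse_formula(formula):
    """Turn a simple formula string such as 'C6H12O6' into element counts."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def monoisotopic_mass(formula):
    """Mass of the first isotope peak of a neutral formula."""
    return sum(MONO[el] * n for el, n in parse_formula(formula).items())


def mass_window(mass, ppm):
    """Lower and upper mass limits for a relative error given in ppm."""
    delta = mass * ppm * 1e-6
    return mass - delta, mass + delta


mass = monoisotopic_mass("C6H12O6")
low, high = mass_window(mass, ppm=1.0)
print(f"{mass:.5f} u, 1 ppm window {low:.5f} to {high:.5f}")
```

At a mass of about 180 u, a 1 ppm error corresponds to a window of only about ±0.0002 u, which illustrates why the number of candidate formulas grows quickly when the assumed error is raised to 5 ppm.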



EXPERIMENTAL SECTION

Databases. PubChem (>40 000 000 entries on chemicals) has been obtained as a download from ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/. As a test for natural compound formulas, a list containing ca. 100 000 formulas has been used (metaboliteMass.txt; available from the RIKEN-CSRS Metabolomics Research Group on request).

Software. All 32 bit software has been written using Microsoft Visual C++ 2010 and run on either a standard quad-core PC (4Gb RAM) with a Windows 7 32 bit operating system or a 16-core PC (64Gb RAM) with a Windows 7 64 bit operating system. The latter has been used when running multiple batches simultaneously in compute-intensive calculations. The basis of the C code (here called “HR3.exe”) for elemental composition analysis and filtering has been derived from the original C code in HR2 in the Supporting Information of the publication of Kind and Fiehn.10 The HR3 software and a manual are provided in “ac500667h_si_001.zip” in the Supporting Information (future updates available at www.metalign.nl).

Creating the PubChem Intermediate Database. The complete set of SDF files of PubChem (21st of January 2014) has been downloaded and converted using make_PubChem_lib.exe. (See “ac500667h_si_004.zip” in the Supporting Information for the software and manual; future updates will be available at www.metalign.nl.)

Other Evaluation Software. Evaluating the PubChem intermediate database for elemental composition analysis of formulas has been done using “ac500667h_si_003.zip”. (See Supporting Information for software and instructions; future updates will be available at www.metalign.nl.) For evaluating metaboliteMass.txt, “ac500667h_si_002.zip” has been used. (See Supporting Information; future updates will be available at www.metalign.nl.)



THEORY

Database Searching. Creating a PubChem Intermediate for Fast Searching. PubChem SDF files are converted using make_PubChem_lib.exe. (See “ac500667h_si_004.zip” in the Supporting Information.) Make_PubChem_lib creates the PubChem intermediate. For keeping formulas, the following rules are used: (1) Leave in: all formulas consisting of any number of atoms in {C, H, N, O, P, S, F, Cl, Br} and any number of atoms of one kind in {Si, I, Mg, Fe, Mo, Mn, Cu, B, Ca, Ni, Zn, Co}. (For example, Fe and B may not occur simultaneously.) (2) Leave out: radicals. (3) Leave out: masses below 40 and above 2000. (4) Leave out: salts or formulas consisting of multiple organic compounds, except if one of the following atoms {Mg, Fe, Mo, Mn, Cu, B,

Table 1. Ratio Rules for Filtering of Formulas as Given in Kind and Fiehn10

              restrictive values      wider values
ratio         min       max           min       max
H/C           0.2       3             0         6
N/C           0         2             0         4
O/C           0         1.2           0         3
P/C           0         0.32          0         6
S/C           0         0.65          0         2
NOPS rules    yes                     no
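The ratio limits of Table 1 can be applied as in the following Python sketch. It is an illustration only: the function name is hypothetical, the limits are assumed to be inclusive (the text does not state how boundary values are treated), and the NOPS combination rules are not implemented.

```python
# (min, max) limits for element/C ratios, taken from Table 1.
RESTRICTIVE = {"H": (0.2, 3.0), "N": (0.0, 2.0), "O": (0.0, 1.2),
               "P": (0.0, 0.32), "S": (0.0, 0.65)}
WIDER = {"H": (0.0, 6.0), "N": (0.0, 4.0), "O": (0.0, 3.0),
         "P": (0.0, 6.0), "S": (0.0, 2.0)}


def failed_ratios(counts, limits):
    """Return the element/C ratios of a formula that fall outside `limits`.

    `counts` maps element symbols to atom counts. Limits are treated as
    inclusive, which is an assumption of this sketch.
    """
    carbons = counts.get("C", 0)
    if carbons == 0:
        raise ValueError("ratio rules need at least one carbon atom")
    failed = []
    for element, (low, high) in limits.items():
        ratio = counts.get(element, 0) / carbons
        if not low <= ratio <= high:
            failed.append(element + "/C")
    return failed


# Phosphoenol pyruvate, C3H5O6P
pep = {"C": 3, "H": 5, "O": 6, "P": 1}
print(failed_ratios(pep, RESTRICTIVE))  # ['O/C', 'P/C']
print(failed_ratios(pep, WIDER))        # []
```

Returning the list of failed ratios, rather than a single pass/fail flag, makes it easy to see which rule rejects a given formula when comparing the restrictive and wider settings.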


Figure 1. Fraction of unique PubChem formulas (total 1.3 million over mass range of 40−1000) passing filtering rules as a function of unit mass. The Y-axis represents the fraction (0 < fraction < 1); the X-axis represents the log2(mass). Modes as described in the text: Blue = mode 1; Red = mode 2; Green = mode 3; Purple = mode 4. Panel A is uncorrected data. Panel B is corrected data: for modes 2 and 3, all noncarbon containing formulas are discarded; for mode 1, all phosphonium/phosphine type formulas are discarded.

been reviewed here on the basis of ca. 1 300 000 unique PubChem formulas (from 40 million PubChem formulas of synthetic and natural origin) and the ca. 100 000 formulas contained in the metaboliteMass.txt file (natural compounds). Changes in ratios in the current improvement in filtering are related to the following issues: (A) Phosphate containing compounds may be missed. Phosphoenol pyruvate (C3H5O6P) fails the O/C and P/C ratios for the restrictive values in Table 1. Propyl phosphate (C3H9O4P) (but not butyl phosphate (C4H11O4P)) fails O/C and P/C; ethyl phosphate (C2H7O4P) also fails on H/C. Essentially, the smaller phosphate containing molecules have a higher chance of failing because phosphate adds more P, O, and H than foreseen for the restrictive values. In larger molecules, such as adenosine triphosphate (C10H16N5O13P3), it is the higher number of phosphates that causes failure on the O/C ratio. While these examples do not fail for the wider values, adding more phosphates as in inositol hexaphosphate (C6H18O24P6) and inositol pentaphosphate (C6H17O21P5) eventually causes failure there as well. (B) Sulfate containing compounds may be missed in a similar way. Using the restrictive values, taurine (C2H7NO3S) fails on O/C (and on H/C due to the extra hydrogen on nitrogen). Propyl sulfate (C3H8O4S) fails on O/C. As with the phosphates in smaller molecules, the extra oxygen in the sulfate tends to make these compounds fail for the restrictive values. Adding more sulfate moieties (e.g., inositol hexasulfate (C6H12O24S6)) eventually will cause failure even for the wider values. (C) Small amino containing compounds such as the insecticide methamidophos (C2H8NO2PS), propyl amine (C3H9N), and diethylenetriamine (C4H13N3) all fail on the H/C ratio using the restrictive values. The nitrogen adds extra hydrogen to the formula.
For nitrate explosives, for instance, pentaerythritol tetranitrate (C5H8N4O12) or nitroglycerin (C3H5N3O9), the O/C ratio fails using restrictive values due to the extra oxygen of the nitrate moieties. (D) High substitution of hydrogen by F, Cl, or Br, as for example in contaminants such as pentachlorophenol (C6HCl5O) or perfluorooctanoic acid (C8HF15O2), makes these compounds fail on H/C for the restrictive values. (E) The original code only works for organic compounds. Compounds without C are not analyzed.

In the present code for the improved rules, the following changes have been made to accommodate the above: (1) The O/C ratio has been made dependent on the number of P, S, and N to compensate for possible phosphate, sulfate, and nitrate. (2) The H/C ratio has been recoded as (H + Cl + Br + F)/C to solve point D. (3) The restrictive intervals have been widened for small molecules with C + N + O + P + S ≤ 10. (4) Compounds without C are still processed but without filters.

In filtering formulas, it is useful to consider the reactivity of compounds: (A) Short-lived compounds (such as the vast majority of radicals) are not likely to be found as molecular ions in high resolution LC-MS, although they may perhaps occur as fragments (in particular in GC/MS). Here, however, we focus on uncharged nonradical molecular masses (passing the LEWIS rule) and linking them to databases. (B) High valences (i.e., N, P, S, Cl, and Br with a valence of 5, 5, 6, 7, and 7, respectively) normally indicate that these compounds are reactive and can convert to lower valences. Therefore, in practice, high valences are not found very often. The SENIOR rule is an equation calculating the rings-plus-double-bonds equivalent based on the valences of the atoms in a formula. Formulas are thrown out if the rings-plus-double-bonds equivalent is below zero.10,12 Using too low a valence for an atom in a molecule will only cause a formula to be discarded by this rule if the molecule contains no double bond and no ring. Using low valences of N, P, and S at 3, 3, and 2, respectively, therefore will not cause stable compounds to be missed and will still allow a number of less stable high valence compounds with double bonds or rings. On the other hand, using, for instance, a valence of 5 for N will add to the number of less logical solutions for a given mass. An example of a less logical but allowed formula is then C3H11N.
(C) In isotopic patterns, no information on the number of P is present, because the first isotope of P has 100% abundance. Adding P as an additional element in solving elemental composition makes the number of solutions increase quite fast with increasing molecular mass. In nature, P occurs as phosphate. In synthetic compounds, like P-containing insecticides, P is bonded with C, O, and S. Other classes of compounds are phosphonium and phosphine; these classes have P bonded primarily to C. The latter classes are normally reaction intermediates in organic synthesis; therefore, these may be regarded as quite improbable and also quite reactive.

In the present code for the improved rules, the following changes have been made to accommodate the above: (1) The LEWIS and SENIOR rules have been kept. (The PubChem intermediate has been modified to accommodate permanently charged molecules.) (2) Compared to the original code, the valences of N, P, and S have been set to 3, 3, and 2, respectively. (3) An extra filter has been added in the present code: formulas with (N + S + O)/P < 3 are rejected. This allows for phosphates and P-containing insecticides but rejects high amounts of potential phosphonium and phosphine formulas.

Finally, in the original code, the elements were restricted to {C, K, H, D, N, 15N, O, F, Na, Si, P, S, Cl, Br}. Because salts and isotopes were eliminated in the PubChem intermediate, the elements {K, Na, D, 15N} are not used anymore. Instead, {I, B, Mn, Mg, Mo, Cu, Fe, Ca, Zn, Ni, Co} have been added, provided only one of these elements is present in a formula.

The present code now has 4 modes (options) for elemental composition analysis (all with LEWIS and SENIOR restrictions): (1) Modifications as mentioned above and provided in the modified code. (2) Original code (HR2) using the restrictive values in Table 1. (3) Original code (HR2) using the wider values in Table 1. (4) Original code (HR2) with all ratios allowed.

Software Downloads. The present program, called HR3, can be run through an interface as well as in a batch mode (see manual_HR3.ppt provided in “ac500667h_si_001.zip” in the Supporting Information). It falls under the GNU GPL 2 license (http://www.gnu.org/licenses/old-licenses/gpl-2.0.html, http://www.gnu.org/licenses/old-licenses/gpl-2.0-faq.html); as such, this source code is freely available on request under the same conditions.
All other software is part of a separate program suite. For this suite, source code can be made available under a Material Transfer Agreement on an individual basis.
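Two of the mode 1 rules from the Theory section, the rings-plus-double-bonds test with low valences for N, P, and S and the extra (N + S + O)/P filter, can be sketched in Python as follows. This is an illustrative reading of the text, not the HR3 source: the function names are hypothetical, only the rings-plus-double-bonds part of the SENIOR rule is shown, and triphenylphosphine (C18H15P) is an example chosen here rather than one taken from the article.

```python
# Valences used in mode 1 according to the text: N = 3, P = 3, S = 2.
LOW_VALENCES = {"C": 4, "H": 1, "N": 3, "O": 2, "P": 3, "S": 2,
                "F": 1, "Cl": 1, "Br": 1}


def rdbe(counts, valences):
    """Rings-plus-double-bonds equivalent: 1 + sum(n_i * (v_i - 2)) / 2."""
    return 1 + sum(n * (valences[el] - 2) for el, n in counts.items()) / 2


def passes_rdbe(counts, valences=LOW_VALENCES):
    """Reject formulas whose rings-plus-double-bonds equivalent is below zero."""
    return rdbe(counts, valences) >= 0


def passes_phosphorus_filter(counts):
    """Extra mode 1 filter: (N + S + O)/P < 3 is a rejection."""
    p = counts.get("P", 0)
    if p == 0:
        return True  # the filter only applies to P-containing formulas
    hetero = counts.get("N", 0) + counts.get("S", 0) + counts.get("O", 0)
    return hetero / p >= 3


c3h11n = {"C": 3, "H": 11, "N": 1}
high_n = dict(LOW_VALENCES, N=5)          # valence 5 for N, as in the text
print(passes_rdbe(c3h11n))                # False with N = 3
print(passes_rdbe(c3h11n, high_n))        # True with N = 5

propyl_phosphate = {"C": 3, "H": 9, "O": 4, "P": 1}
triphenylphosphine = {"C": 18, "H": 15, "P": 1}   # illustrative example
print(passes_phosphorus_filter(propyl_phosphate))    # True
print(passes_phosphorus_filter(triphenylphosphine))  # False
```

The C3H11N case shows the effect described in the text: with a valence of 5 for N the formula reaches an equivalent of exactly zero and is allowed, whereas with a valence of 3 it drops to −1 and is discarded.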

Table 2. Summary of Numbers of Failing Compounds for the 4 Different Modes of Filtering (a)

PubChem                            mode 1    mode 2    mode 3    mode 4
number of compounds that failed    23 950    18 762    2889      235
compounds with zero C              0         1700      1700      0
phosphonium/phosphine              21 815    0         0         0
LEWIS                              23        23        23        23
SENIOR                             212       212       212       212
PNS valence error                  1221      0         0         0
type error                         ND        ND        ND        ND
rest                               679       16 827    954       0
percentage (rest vs total)         0.054     1.33      0.076     0

metaboliteMass.txt                 mode 1    mode 2    mode 3    mode 4
number of compounds that failed    142       552       157       139
compounds with zero C              0         0         0         0
phosphonium/phosphine              2         0         0         0
LEWIS                              135       135       135       135
SENIOR                             4         4         4         4
PNS valence error                  0         0         0         0
type error                         1         1         0         0
rest                               0         412       18        0
percentage (rest vs total)         0         0.418     0.018     0

(a) 1 262 586 unique formulas from PubChem and 98 528 formulas from metaboliteMass.txt have been tested.
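The "rest" and "percentage" rows of Table 2 follow from the other rows by simple arithmetic, as the short Python check below illustrates for the PubChem part of the table. The dictionary keys are names chosen here for illustration.

```python
TOTAL_PUBCHEM = 1_262_586  # unique PubChem formulas tested (Table 2, footnote)

# PubChem rows of Table 2 for modes 1 to 3 ("type error" is ND and left out).
MODES = {
    "mode 1": {"failed": 23_950, "zero_C": 0, "phosphonium": 21_815,
               "lewis": 23, "senior": 212, "pns_valence": 1221},
    "mode 2": {"failed": 18_762, "zero_C": 1700, "phosphonium": 0,
               "lewis": 23, "senior": 212, "pns_valence": 0},
    "mode 3": {"failed": 2889, "zero_C": 1700, "phosphonium": 0,
               "lewis": 23, "senior": 212, "pns_valence": 0},
}


def rest(row):
    """Failures not explained by any of the listed categories."""
    explained = sum(v for key, v in row.items() if key != "failed")
    return row["failed"] - explained


for name, row in MODES.items():
    r = rest(row)
    print(name, r, f"{100 * r / TOTAL_PUBCHEM:.3f}%")
```

For mode 1 this gives 23 950 − 21 815 − 23 − 212 − 1221 = 679 formulas, or about 0.054% of the total, which corresponds to the 99.95% recovery quoted in the abstract.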

elemental composition analysis. To simulate experimentally obtained data, a mass error is assumed as well as an error in the estimation of the numbers of C, S, Cl, and Br that are normally obtained from the isotope pattern. Assumed mass errors are between 0.5 and 5 ppm. Assumed errors in the estimation of the numbers of C, S, Cl, and Br are between 10% and 100%. Calculations for the simulations for the different filtering modes have been done with “ac500667h_si_004.zip” in the Supporting Information. For comparison of the results, all masses have been binned to unit mass resolution. The number of compounds found by each mode has been divided by the total number of compounds examined. Figure 1A shows the direct results: the fraction of compounds found for each unit mass bin (for clarity, a log2 mass axis is used) for the 4 filtering modes. Quite a large number of low molecular weight compounds are missed by the filters. For modes 2 and 3, this is for the most part easily explained, since they do not allow inorganic compounds (zero number of Cs). Mode 1 intentionally filters out phosphonium and phosphine type compounds, which are present in quite large numbers in PubChem. Figure 1B represents the improved results when omitting inorganic compounds for modes 2 and 3 and omitting phosphonium and phosphine for mode 1. All unique formulas together with their analysis have been given as an example in the Supporting Information (“ac500667h_si_004.zip”). As described above in the Theory section, mode 2 is limited for smaller molecules. This is also apparent in Figure 1A,B. Using ratios to exclude formulas has a tendency to lose some real solutions below mass 300. When using less strict ratios as in mode 3, this is mostly compensated for. Mode 4 does not have a filter on the ratios. Still, mode 4 misses compounds. This is due to, for instance, halogen-containing compounds in which the halogen is of a higher valence than assumed; these are lost in all 4 modes due to the SENIOR rule. Mode 1 seems to perform somewhere between modes 2 and 3 in Figure 1A. However,



RESULTS AND DISCUSSION

Speed of Searching the PubChem Intermediate. The speed of accessing the PubChem intermediate and retrieving the relevant PubChem IDs has been assessed by analyzing mass 180.06339 (C6H12O6) with a 1 ppm mass error and the elements {C, H, N, O, P, S, F, Cl, Br} as variables. There is only one elemental composition solution. The PubChem intermediate, however, contains IDs of 119 molecules with the solution formula C6H12O6. This run has been done 1000 times in succession in a batch, with and without PubChem searching. Without PubChem searching, this takes 10 s including writing the 1000 csv reports to disk. Including PubChem searching adds 12 s. Of these 12 s, about 10 s is needed to load the ca. 1 Gb of PubChem intermediate files into memory. Therefore, 2 s is needed to get 119 IDs 1000 times and additionally write them as hyperlinks in the csv reports. This extremely fast searching and access is possible because of the mass indexing of IDs, which are stored in memory.

Performance of the Filtering Options. To be able to analyze the performance of elemental composition filtering by the 4 different modes, the masses of the ca. 1 300 000 unique formulas (only those with elements in {C, H, N, O, P, S, F, Cl, Br} and mass between 40 and 1000) selected from the PubChem intermediate have been calculated and subjected to


Figure 2. log2 of the average number of possible formulas per unit mass calculated on the basis of ca. 1.3 million unique PubChem formulas. Blue = mode 1; Red = mode 2; Green = mode 3; Purple = mode 4. Panel A: assuming 0.5 ppm mass error and a 10% error in C, S, Cl, and Br estimation. Panel B: same as panel A, but discarding formulas and solutions with F, Cl, and Br. Panel C: assuming 5 ppm mass error and a 20% error in C, S, Cl, and Br estimation. Panel D: same as panel C, but discarding formulas and solutions with F, Cl, and Br.

error in C, S, Cl, and Br estimation (Figure 2C). Despite the obvious drastic difference in numbers of formulas due to the increased mass resolution and the precision in estimation of C, S, Cl, and Br, both figures show up to 30% fewer formulas for mode 1. Modes 2 and 3 seem to show little effect due to restrictions in ratios. Filtering using elemental ratio rules appears to be relatively insensitive to mass error. If halogen containing formulas are excluded from mass calculations as well as solutions, Figure 2B,D is the result. This simulates a situation as expected for natural compounds. If halogens are assumed not to be present, the number of formula solutions for mode 1 (with 0.5 ppm mass error and 10% error in C and S) goes down to about 4 for masses close to 1000. The relationship between the elemental ratio rules and modes and their effect on the number of formula solutions is further investigated in Figure 3. Figure 3 simulates the effect of filtering in relation to mass error and the error in estimation of C, S, Cl, and Br when comparing modes 1, 2, and 3 to mode 4. In practice, the most important estimation error is that in C (in accordance with Kind and Fiehn10) due to the higher occurrence of this element in formulas. Assuming reasonably accurate isotope patterns (10−20% error), mode 3 (Figure 3C,F, blue and red) has nearly no effect. Under the same conditions,

taking out phosphonium and phosphine type compounds in Figure 1B shows a performance similar to mode 3. To further investigate where and how many compounds are lost, the compounds that fail have been analyzed. The results are shown in Table 2. Of the 1 262 586 unique formulas, 23 950 have not been found by mode 1. Of these, 21 815 have been eliminated as phosphonium/phosphine type compounds; 235 have not survived the LEWIS and SENIOR rules, because they were unnoticed radicals or contained high valence halogens; and 1221 have not been found because one or more of P, N, or S had a higher valence than that set in mode 1 and the compound was fully saturated. In principle, the above failing compounds may be classified as reactive and probably unstable and therefore unlikely to be found in, for instance, LC-MS. This leaves 679 compounds not found due to the rules used for mode 1; this is about 0.05% of the original number of formulas. Looking at the modes in this way has shown that mode 1 outperforms modes 2 and 3 and comes close to mode 4. Figure 2A,C illustrates the difference between the 4 modes in terms of the average number of formula solutions obtained per unit mass bin. This is shown for 0.5 ppm and 10% error in C, S, Cl, and Br estimation (Figure 2A) and for 5 ppm and 20%


Figure 3. Relative reduction of the number of formula solutions by the filtering modes as a function of mass error and error in estimation of C, S, Cl, and Br. Y-axis: Number of formulas for mode 1 (panel A and D), mode 2 (panel B and E) and mode 3 (panels C and F) divided by that found by mode 4. For panels A, B, and C, a mass error of 0.5 ppm has been used; for panels D, E, and F, a mass error of 5 ppm has been used. The error in estimation of C, S, Cl, and Br is color coded as follows: Blue = 10%; Red = 20%; Green = 50%; Purple = 100%.

mode 1 restrictions are less dependent on the isotope pattern estimations. Errors of 50−100% in isotope pattern information (Figure 3A−F, green and purple) more or less simulate a situation in which isotope patterns are very bad or unavailable, which would be the case for compounds with mass signals that are close to noise levels. When the number of Cs is not well determined, the filtering based on C-ratios takes over. Although modes 2 and 3 help in this kind of situation by drastically decreasing the number of solutions, mode 1 again outperforms them because it adds to the filtering as described above. Finally, metaboliteMass.txt was tested as a natural compound database with 99 273 formulas. On the basis of the name and formula, salts were eliminated as well as compounds containing elements other than {C, H, N, O, P, S, F, Cl, Br}. Of the

mode 2 (Figure 3B, blue and red) ensures a maximum reduction of ca. 4% at a 0.5 ppm mass error for masses between 300 and 500. At a 5 ppm mass error, this increases to ca. 12% between mass 200 and 300 (Figure 3E, blue and red). The main reason that these modes have a limited (but helpful) effect is that they are for the most part related to ratios involving C, while at the same time the number of Cs is also restricted by the isotopic pattern. Therefore, selection using ratios is dependent on estimations coming from the analysis of isotope patterns. In comparison, mode 1 appears superior. Mode 1 shows up to 30% fewer solutions toward higher masses (Figure 3A,D, blue and red). This is for a large part the effect of the (N + S + O)/P < 3 restriction in mode 1 and to a lesser extent also of assuming a lower valence for N, P, and S. Both these


Notes

remaining 98 528 compounds, 135 failed LEWIS and 4 failed SENIOR rules; this was due to errors in charges (not always unambiguously noted in formulas), radicals, and typing errors. Table 2 shows the results of all 4 modes. Mode 1 failed on tetrafosmin, dodecyltriphenylphosphonium(1+), and 15-hydroxy-3,7,11-cembratrien-18,2-olide. The first 2 are synthetic phosphonium type compounds, which should probably not be present in this database. The last compound was incorrectly noted as C26H56O30 (this should be C26H56O3). Mode 4 did not fail but also missed the mentioned type error. Mode 3 failed on 18 phosphorylated or sulfated compounds. (See Theory section above.) Mode 2 failed on many smaller compounds and on the same ones as mode 3. Therefore, mode 1 comes out as the best option considering the more limited number of solutions to be expected compared to mode 4.

Performance Comparison to Other Fast Software. Sakurai et al.13 recently reported a very fast relational database for obtaining formulas for masses and discussed other software options. In their work, they precalculated all possible formulas fulfilling LEWIS and SENIOR rules, based on the elements C, H, N, O, P, and S. In total, 400 000 000 formulas are stored, indexed, and retrievable. Besides this, an HR2 based filtered database was precalculated. Since no calculations are necessary anymore, this useful relational database is faster than what is described here. However, the following elements included in the present study are not covered: F, Cl, Br, I, B, Mn, Mg, Mo, Cu, Fe, Ca, Zn, Ni, and Co. Expanding their database for these elements would increase the number of formulas beyond storage capabilities. Also, no restrictions based on the isotope pattern can be applied. This results in including solutions that do not fit the isotope pattern. Furthermore, Sakurai et al. included an option for fast searching of PubChem. However, no curation (salts etc.) of formulas is present as described here, nor is there a direct link to PubChem itself or to elemental composition analysis results. The present study consists of a more integrated approach, which on the one hand is slightly slower but on the other hand will give a smaller list of relevant formula solutions.

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

We thank Jörg Hau (Nestle) for publishing the formula generator HiRes under the GNU Public License (GPL). The current research has been funded by the Ministry of Economic Affairs (project: Foodomics, KB-15-001-020).





CONCLUSION

This study has shown improved filtering of solutions from elemental composition analysis and has combined this with ultrafast PubChem searching. A software tool has been provided with 4 modes of filtering (interface as well as batch processing). Filtering based on ratios is the most effective when no or little information on the isotope pattern is available. In a simulated ideal situation (nonhalogen containing natural compounds; 0.5 ppm mass error; 10% error in estimation of C), masses of 400, 750, and 1000 will have on average only 1, 2, and 4 formula solutions, respectively. Modern FT-MS techniques can routinely achieve this in combination with postprocessing mass-precision enhancement.7 Although this will not lead to actual identification, this study is thought to help in the process of identification by eliminating unlikely chemical formulas and very quickly searching PubChem.



REFERENCES

(1) Dunn, W. B.; Erban, A.; Weber, R. J. M.; Creek, D. J.; Brown, M.; Breitling, R.; Hankemeier, T.; Goodacre, R.; Neumann, S.; Kopka, J.; Viant, M. R. Metabolomics 2013, 9, S44−S66.
(2) Wishart, D. S.; Jewison, T.; Guo, A. C.; Wilson, M.; Knox, C.; Liu, Y.; Djoumbou, Y.; Mandal, R.; Aziat, F.; Dong, E.; Bouatra, S.; Sinelnikov, I.; Arndt, D.; Xia, J.; Liu, P.; Yallou, F.; Bjorndahl, T.; Perez-Pineiro, R.; Eisner, R.; Allen, F.; Neveu, V.; Greiner, R.; Scalbert, A. Nucleic Acids Res. 2013, 41 (Database issue), D801−D807.
(3) Knox, C.; Law, V.; Jewison, T.; Liu, P.; Ly, S.; Frolkis, A.; Pon, A.; Banco, K.; Mak, C.; Neveu, V.; Djoumbou, Y.; Eisner, R.; Guo, A. C.; Wishart, D. S. Nucleic Acids Res. 2011, 39 (Database issue), D1035−1041.
(4) Afendi, F. M.; Okada, T.; Yamazaki, M.; Hirai-Morita, A.; Nakamura, Y.; Nakamura, K.; Ikeda, S.; Takahashi, H.; Altaf-Ul-Amin, M.; Darusman, L. K.; Saito, K.; Kanaya, S. Plant Cell Physiol. 2012, 53, No. e1(1−12).
(5) Wolf, S.; Schmidt, S.; Müller-Hannemann, M.; Neumann, S. BMC Bioinf. 2010, 11, 148−160.
(6) Ridder, L.; van der Hooft, J. J.; Verhoeven, S.; de Vos, R. C.; Bino, R. J.; Vervoort, J. Anal. Chem. 2013, 85, 6033−6040.
(7) Lommen, A.; Gerssen, A.; Oosterink, J. E.; Kools, H. J.; Ruiz-Aracama, A.; Peters, R. J. B.; Mol, H. G. J. Metabolomics 2011, 7, 15−24.
(8) Lommen, A. Anal. Chem. 2009, 81, 3079−3086.
(9) Lommen, A.; Kools, H. J. Metabolomics 2012, 8, 719−726.
(10) Kind, T.; Fiehn, O. BMC Bioinf. 2007, 8, 105−124.
(11) Noury, S.; Silvi, B.; Gillespie, R. J. Inorg. Chem. 2002, 41, 2164−2172.
(12) Senior, J. K. Am. J. Math. 1951, 73, 663−689.
(13) Sakurai, N.; Ara, T.; Kanaya, S.; Nakamura, Y.; Iijima, I.; Enomoto, M.; Motegi, T.; Aoki, K.; Suzuki, H.; Shibata, D. Bioinformatics 2013, 29, 290−291.

ASSOCIATED CONTENT

Supporting Information

Additional information as noted in text. This material is available free of charge via the Internet at http://pubs.acs.org.



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].
