Article pubs.acs.org/ac

Highly Accurate Chemical Formula Prediction Tool Utilizing High-Resolution Mass Spectra, MS/MS Fragmentation, Heuristic Rules, and Isotope Pattern Matching

Tomáš Pluskal,*,† Taisuke Uehara,‡ and Mitsuhiro Yanagida†

† G0 Cell Unit, Okinawa Institute of Science and Technology Graduate University, 1919-1 Tancha, Onna, Okinawa 904-0495, Japan
‡ Biomarkers & Personalized Medicine Core Function Unit, Eisai Co., Ltd., Tsukuba, Ibaraki, Japan



S Supporting Information *

ABSTRACT: Mass spectrometry is commonly applied to qualitatively and quantitatively profile small molecules, such as peptides, metabolites, or lipids. Modern mass spectrometers provide accurate measurements of mass-to-charge ratios of ions, with errors as low as 1 ppm. Even such high mass accuracy, however, is not sufficient to determine the unique chemical formula of each ion, and additional algorithms are necessary. Here we present a universal software tool for predicting chemical formulas from high-resolution mass spectrometry data, developed within the MZmine 2 framework. The tool is based on the use of a combination of heuristic techniques, including MS/MS fragmentation analysis and isotope pattern matching. The performance of the tool was evaluated using a real metabolomic data set obtained with the Orbitrap MS detector. The true formula was correctly determined as the highest-ranking candidate for 79% of the tested compounds. The novel isotope pattern-scoring algorithm outperformed a previously published method in 64% of the tested Orbitrap spectra. The software described in this manuscript is freely available and its source code can be accessed within the MZmine 2 source code repository.

Mass spectrometry (MS) is commonly applied to qualitatively and quantitatively profile small molecules, such as peptides, metabolites, or lipids. Modern mass spectrometers using high-resolution Orbitrap, quadrupole time-of-flight (Q-TOF), or Fourier transform ion cyclotron resonance (FTICR-MS) detectors provide accurate measurements of mass-to-charge ratios of ions, with mean errors as low as 1 ppm, depending on the instrument and calibration.1 Even such high mass accuracy, however, is not sufficient to determine the unique chemical formula of a molecule based on the mass value alone.2 Evaluation of other data, such as isotope distributions and tandem MS (MS/MS) fragmentation is necessary to elucidate the elemental composition and structural information.3 In addition, the search space of potential chemical formulas may be greatly reduced by applying heuristic rules.4 The problem of chemical formula elucidation from mass spectra was addressed previously for FTICR-MS spectra of tomato metabolites by calibration of the exact masses using internal standards followed by isotope pattern comparison,5 or for tandem MS spectra by evaluating fragmentation trees.6 Mass spectrometry vendors provide their own software modules for chemical formula prediction, such as Xcalibur/Mass Frontier (Thermo Fisher Scientific), SmartFormula3D (Bruker Daltonics), MassLynx (Waters), or PeakView (AB SCIEX). The capabilities of these tools vary by vendor. The open-source Multistage Elemental Formula (MEF) tool utilizing MS^n trees was also recently proposed.7 A universal, user-friendly, and easily accessible tool, however, is yet to be established in the mass spectrometry community.

Here we present a universal software tool for predicting chemical formulas from high-resolution MS data, developed within the MZmine 2 framework.8 The program uses a combination of the techniques outlined above and includes a novel algorithm for isotope pattern scoring. Great emphasis was put on the flexibility of the tool: no restrictions regarding resolution, mass accuracy, or ionization adduct type are placed on the characteristics of the input mass spectra. In addition, the prediction is not limited to the most common chemical elements, but any element of the periodic table may be included in the search process. The performance of the tool was evaluated using a metabolomic data set obtained from a fission yeast cell extract. The known chemical formulas of 48 compounds identified previously were compared to the formulas predicted by this tool.

© 2012 American Chemical Society



Received: January 11, 2012. Accepted: April 12, 2012. Published: April 12, 2012.

Analytical Chemistry

dx.doi.org/10.1021/ac3000418 | Anal. Chem. 2012, 84, 4396−4403

EXPERIMENTAL SECTION

Cell Cultivation and Sample Preparation. The metabolome sample analyzed in this study was prepared as described previously.9 Briefly, the wild-type heterothallic haploid 972 strain of the fission yeast Schizosaccharomyces pombe10 was cultivated in Edinburgh Minimal Medium 2 (EMM2).11 Cells from cultures (40 mL, 5 × 10^6 cells/mL) were collected by vacuum filtration and immediately quenched in methanol at −40 °C. The cells were harvested by centrifugation, and a constant amount of internal standards (10 nmol of HEPES and PIPES) was added to each sample. Cells were disrupted using a Multi-Beads Shocker (Yasui Kikai) in 500 μL of 50% methanol. Proteins were removed by filtering with an Amicon Ultra 10-kDa cutoff filter (Millipore), and the samples were concentrated by vacuum evaporation. Finally, each sample was resuspended in 20 μL of 50% acetonitrile, and 1 μL was used for autosampler injection.

LC-MS Analysis. LC-MS data were obtained using a Paradigm MS4 HPLC system (Michrom Bioresources) coupled to an LTQ Orbitrap mass spectrometer (Thermo Fisher Scientific). LC separation was performed on a ZIC-pHILIC column (Merck SeQuant; 150 × 2.1 mm, 5 μm particle size). Acetonitrile (A) and 10 mM ammonium carbonate buffer, pH 9.3 (B), were used as the mobile phase, with gradient elution from 80% A to 20% A in 30 min and 100 μL/min flow rate. Raw data were processed using MZmine version 2.7.2 (ref 8). Detailed data analysis procedures and parameters are described in Table S1, Supporting Information.

Figure 1. Screenshot of the formula prediction tool windows. (A) Peak list in MZmine 2 showing the option to predict a molecular formula. (B) Parameter setup window of the formula prediction tool. (C) Setup of the element counts heuristics. (D) Setup of the RDBE value restrictions. (E) Setup of the isotope pattern filter. (F) Setup of the MS/MS filter. (G) Results of the formula prediction.

RESULTS AND DISCUSSION

The chemical formula prediction tool was implemented as a module for the MZmine 2 framework,8 which provides the core functionality for MS data processing: raw data import, peak detection, MS/MS scan recognition, and isotope pattern detection and comparison. The module’s code was written in the Java programming language. Calculation of theoretical isotope patterns was performed using the Chemistry Development Kit library.12 Figures 1A and 1B show the input window of the chemical formula prediction module. Note that the algorithm does not attempt to automatically determine the correct ionization adduct but rather requires that the user selects it. Prediction of the ionization adduct is a challenging computational problem by itself, but this lies outside the scope of this manuscript. The mass of the ionization adduct is removed from the detected ion mass, and formulas are searched for within the specified tolerance range from the calculated neutral molecular mass. The user may limit the chemical elements and their minimal and maximal counts in the resulting formula and select additional heuristic criteria to reduce the number of possible candidates.

Seven Golden Rules. Kind and Fiehn developed seven golden rules for determining the validity of chemical formulas.4 These include restrictions for elemental counts and ratios, valence heuristics, isotope pattern evaluation, and a trimethylsilylation check for derivatized compounds analyzed by gas chromatography−mass spectrometry. The rules (other than the trimethylsilylation check, which was not relevant for this work) were applied and extended as described below.

Element Counts and Ratios. Apart from setting the allowed chemical elements and their minimum and maximum counts (Figure 1B), the user may select three empirical restrictions (Figure 1C). The “H/C ratio” condition checks whether the ratio of the number of hydrogens to the number of carbons is between 0.1 and 6 (inclusive). The “NOPS/C” condition checks the following ratios: N/C ≤ 4, O/C ≤ 3, P/C ≤ 2, and S/C ≤ 3. Finally, the “multiple element counts” condition restricts the maximum counts for combinations of elements, as defined in Table 3 in Kind and Fiehn, 2007.4 These rules are only applied to formulas that contain at least one carbon atom. It should be noted that the heuristic “HNOPS probability” rule proposed by Kind and Fiehn was not implemented because some common natural metabolites violate that rule. The HNOPS probability rule requires the O/C ratio in the formula to be less than or equal to 1.2, but this requirement is too restrictive in the cases of metabolites such as adenosine triphosphate (ATP, C10H16N5O13P3; O/C = 1.3) or fructose-1,6-bisphosphate (C6H14O12P2; O/C = 2).

Ring Double Bond Equivalents. The ring double bond equivalents (RDBE) value, which estimates the number of rings and unsaturated bonds in a molecule, can be calculated from a chemical formula using the following general equation:13

$$\mathrm{RDBE} = 1 + \frac{1}{2}\sum_i n_i (v_i - 2)$$

where $n_i$ is the number of atoms and $v_i$ the formal valence of the element $i$. Theoretically, each ring or a double bond increases the RDBE value by 1, while each triple bond increases the value by 2. This equation can only be used for formulas composed of elements with a well-defined formal valence (Table S2, Supporting Information). Although a number of exceptions to the RDBE rule have been established,4 the RDBE value is still a useful indicator of the validity of a molecular formula. Figure 1D shows the setup window of the RDBE restrictions in our formula prediction tool. The restrictions are only applied to formulas for which the RDBE value can be calculated. Two optional rules are available:

(1) A range of allowed RDBE values. Kind and Fiehn4 recommended an RDBE upper limit of 40 for common chemical compounds. They also stated that the RDBE should not be negative (as formalized in the SENIOR rules14), although there may be certain exceptions when formal valence states are exceeded. For the evaluation of this tool, we applied default values in the range of 0 to 40.

(2) The RDBE value must be an integer. This condition is a natural implication of the principle of valence balance,14 which states that the number of atoms with an odd valence must be even. Such an assumption is valid for all neutral, nonradical molecules. In the mass spectrometry field, the same principle is often formulated as the “nitrogen rule”, stating that an odd nominal molecular mass indicates an odd number of nitrogen atoms and vice versa, because among common elements only nitrogen has an even nominal mass (14) but odd valence (3).

Isotopic Pattern Distribution. Matching of isotope patterns is crucial for the selection of the best fitting molecular formulas among the possible candidates.2,15 For this purpose, Kind and Fiehn4 recommended comparing the detected vs predicted relative intensities of A + 1 [%], A + 2 [%], A + 3 [%], and A + 4 [%] ions, where A + i means i additional neutrons in the molecule. Such an approach, however, cannot take advantage of modern high-resolution instruments, which are capable of distinguishing fine differences between the isotopes of different elements. The mass difference between the lightest stable isotope of carbon (¹²C) and its next heavy counterpart (¹³C) is 1.00335 Da, and the difference between the lightest stable nitrogen (¹⁴N) and its successor (¹⁵N) is 0.99703 Da. In case of oxygen, the difference between ¹⁶O and ¹⁷O is 1.00422 Da. Figure 2A shows predicted continuous spectra of glutathione (C10H17N3O6S) positive ion [M + H]+ using different mass resolving power settings from 20 000 to 1 000 000. Mass resolving power (R) is defined as

$$R = \frac{M}{\Delta M_{50\%}}$$

where $M$ is mass and $\Delta M_{50\%}$ is the mass difference of two peaks that can be clearly distinguished at 50% of the maximum intensity.16 As apparent from the spectra, the ability to compare the fine features of ultrahigh resolution isotope patterns may provide precise information about the elemental composition. Wang and Gu developed the concept of spectral accuracy,17 which allows for accurate isotope pattern comparison by predicting the exact shape of the expected spectrum using instrument-specific calibration. Such an approach requires the acquisition of raw spectra in continuous mode. For convenience, however, mass spectra are often preprocessed (centroided) by the acquisition software, which usually greatly reduces the size of the raw data files.

We propose a new isotope pattern-scoring algorithm, the development of which was based on two requirements. First, the algorithm should take advantage of the fine elemental isotopic features, if those can be detected in the spectra. Second, the algorithm should be able to evaluate spectra that have already been centroided. The algorithm is based on the following formal definition: Let $p_1(x)$ and $p_2(x)$ be the functions describing the isotope patterns that are being compared. Let

$$k(x) = e^{-x^2/(2R^2)}$$

be a Gaussian function of width $R$, where $R$ depends on the mass precision and resolution of the MS instrument used for data acquisition. First, we calculate the difference $p_\Delta$ of the two isotope patterns:

$$p_\Delta(x) = p_1(x) - p_2(x)$$

Then we group the isotopes of similar masses by performing mathematical convolution using $k(x)$ as a kernel function:

$$\mathrm{res}(x) = (p_\Delta * k)(x)$$
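The construction up to this point (a signed difference of two patterns smoothed by a Gaussian kernel) can be tried numerically. The following Python sketch is not the MZmine 2 implementation, which is written in Java. It assumes centroided patterns given as (m/z, relative intensity) pairs, treats each centroid as an ideal spike so that the convolution reduces to a sum of kernels, and approximates the area under |res(x)| by a Riemann sum on a fixed grid. The function names, the grid step, the kernel width, and the example peak values are all hypothetical.

```python
import math


def gaussian_kernel(x, width):
    """k(x) = exp(-x^2 / (2 * width^2)); width plays the role of R in the text."""
    return math.exp(-(x * x) / (2.0 * width * width))


def residual(p1, p2, width, grid):
    """Evaluate res(x) = (p_delta * k)(x) at every grid point."""
    # p_delta = p1 - p2: the peaks of p2 enter with a negative intensity.
    signed = list(p1) + [(mz, -intensity) for mz, intensity in p2]
    return [
        sum(intensity * gaussian_kernel(x - mz, width) for mz, intensity in signed)
        for x in grid
    ]


def residual_area(p1, p2, width, lo, hi, step=0.0005):
    """Riemann-sum approximation of the integral of |res(x)| over [lo, hi]."""
    n_steps = int(round((hi - lo) / step))
    grid = [lo + j * step for j in range(n_steps + 1)]
    return sum(abs(r) for r in residual(p1, p2, width, grid)) * step


def shift(pattern, delta):
    """Return a copy of the pattern with every m/z value moved by delta."""
    return [(mz + delta, intensity) for mz, intensity in pattern]


# Hypothetical centroided pattern (m/z, relative intensity); illustrative only.
reference = [(308.0911, 1.00), (309.0940, 0.13), (310.0880, 0.05)]
WIDTH = 0.005  # assumed kernel width in m/z units

area_same = residual_area(reference, reference, WIDTH, 307.9, 310.3)
area_small = residual_area(reference, shift(reference, 0.001), WIDTH, 307.9, 310.3)
area_large = residual_area(reference, shift(reference, 0.020), WIDTH, 307.9, 310.3)
print(area_same, area_small, area_large)
```

In this sketch a mass shift much smaller than the kernel width leaves only a small residual area, whereas a shift of several kernel widths leaves the two patterns essentially unmatched.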


Figure 2. Evaluation of isotope pattern similarity. (A) A continuous spectrum showing the isotope pattern of glutathione (chemical formula C10H17N3O6S) predicted by the Xcalibur software (Thermo Fisher Scientific) using different levels of mass resolving power (R). The shape of the signal of the second isotope (309 m/z) is magnified. (B) The algorithm used to calculate the score of the isotope pattern similarity. (C) Detected and predicted isotope patterns of glutathione. Mass difference (Δ in Da) between the detected and predicted isotopes is shown on the left side, and mass difference between individual minor isotopes is shown on the right.

The difference index $\mathrm{diff}_{p_1,p_2}$ of the two isotope patterns is calculated from the residual of the convolution product (lower value represents higher similarity of the isotope patterns):

$$\mathrm{diff}_{p_1,p_2} = \int_0^{\infty} |\mathrm{res}(x)|\,dx$$

For practical purposes, however, this formal definition is not convenient. Therefore, we evaluate the algorithm using a simplified calculation. Figure 2B demonstrates the flow of the simplified algorithm, assuming two centroided isotope patterns $p_1$ and $p_2$, a noise intensity level $I_{\mathrm{noise}}$, and mass tolerance $T_{\mathrm{mass}}$ as the input. The following procedure is followed:

(1) All isotopes below given noise intensity level $I_{\mathrm{noise}}$ are removed from patterns $p_1$ and $p_2$ and the difference $p_\Delta$ is calculated as

$$p_\Delta = p_1 - p_2$$

(2) A window of user-defined width $T_{\mathrm{mass}}$ is moved over the whole m/z range. All isotope peaks fitting within the window are summed together.

(3) The final similarity score is calculated from the remaining peaks as

$$\mathrm{score}_{\mathrm{isotopes}} = \prod_i (1 - |I_i|)$$

where $I_i$ is the intensity of the remaining peak $i$. For two identical isotope patterns, the similarity score will be 100%, while for two completely different patterns, the score will be 0%. Note that the optimal value of the $T_{\mathrm{mass}}$ parameter might be different from the commonly perceived “mass accuracy” of the instrument because different types of isotopes must be considered. For example, even if the mass error of the major isotope is less than 0.001 m/z, the mass difference between minor isotopes can be remarkably higher (Figure 2C). The biggest advantage of the proposed algorithm is its scalability.
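Steps (1) to (3) of the simplified procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code of the MZmine 2 module: each pattern is scaled so that its most intense isotope equals 1 (the text does not spell out the normalization, but the 0% to 100% score range implies relative intensities), the moving window is approximated by greedily grouping sorted peaks that lie within $T_{\mathrm{mass}}$ of the first peak in the group, and the function name and example peak lists are hypothetical.

```python
def isotope_score(p1, p2, noise_level, mass_tol):
    """Simplified similarity score (0..1) of two centroided isotope patterns."""

    def prepare(pattern):
        # Step 1a: drop isotopes below the noise intensity level.
        kept = [(mz, i) for mz, i in pattern if i >= noise_level]
        if not kept:
            raise ValueError("no peaks above the noise level")
        # Assumption: scale so that the most intense isotope equals 1.
        top = max(i for _, i in kept)
        return [(mz, i / top) for mz, i in kept]

    # Step 1b: p_delta = p1 - p2, i.e. the peaks of p2 get a negative sign.
    signed = sorted(prepare(p1) + [(mz, -i) for mz, i in prepare(p2)])

    # Step 2: sum all peaks that fall into one window of width mass_tol.
    sums = []
    window_start = None
    for mz, i in signed:
        if window_start is None or mz - window_start > mass_tol:
            sums.append(i)
            window_start = mz
        else:
            sums[-1] += i

    # Step 3: multiply (1 - |I_i|) over the remaining (summed) peaks.
    score = 1.0
    for total in sums:
        score *= 1.0 - min(abs(total), 1.0)
    return score


# Hypothetical patterns on a common absolute intensity scale; the 311.09 peak
# stands in for noise and is removed by the noise filter.
detected = [(308.0911, 1.00e6), (309.0940, 1.30e5), (310.0880, 5.5e4), (311.0900, 5.0e2)]
predicted = [(308.0911, 1.00e6), (309.0941, 1.267e5), (310.0885, 6.14e4)]

score_close = isotope_score(detected, predicted, noise_level=1.0e3, mass_tol=0.001)
score_same = isotope_score(predicted, predicted, noise_level=1.0e3, mass_tol=0.001)
shifted = [(mz + 5.0, i) for mz, i in predicted]
score_disjoint = isotope_score(detected, shifted, noise_level=1.0e3, mass_tol=0.001)
print(score_close, score_same, score_disjoint)
```

Tightening `mass_tol` is the only change needed to make the comparison stricter, which mirrors the scalability argument made in the text.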


Figure 3. Evaluation of MS/MS fragmentation patterns. (A) MS (left panel) and MS/MS (right panel) spectra showing the precursor ion peak, its isotopes, and MS/MS fragment ions. Neutral loss is calculated as the difference between the precursor mass and the fragment mass. (B) Results of the formula prediction for the glutathione ion peak when the fluorine element was allowed in the prediction. The results without the minimum MS/MS score restriction (left side) and minimum MS/MS score set to 100% (right side) are shown. The correct formula of glutathione is outlined in red. (C) Venn diagrams showing the efficiency of the different types of heuristics for the prediction of formulas of the ATP, glutathione (GSSG) and NAD+ ions. The red circle shows the number of generated formulas that passed through the elemental ratios and RDBE filters. The blue circle shows the number of formulas that passed through the isotope pattern-scoring filter (minimum score 80%). The green circle shows the number of formulas that passed through the MS/MS filter (minimum score 100%). The number indicated in red shows the number of formula candidates that were rejected by the MS/MS filter and would be accepted had the MS/MS filter not been used.

When the mass-resolving power of the instrument is improved, the algorithm will provide more precise results without requiring modification, simply by reducing the value of the $T_{\mathrm{mass}}$ parameter. Figure 1E shows the setting of parameters for the isotope pattern comparison algorithm. The “isotope m/z tolerance” parameter defines the size of the tolerance window ($T_{\mathrm{mass}}$). The “minimum relative abundance” parameter defines the minimum relative intensity of calculated isotopes for predicted isotope patterns (lower value means more precise results but longer calculation times). The “minimum absolute abundance” parameter defines the intensity level under which isotopes are ignored ($I_{\mathrm{noise}}$). The “minimum score” parameter defines the minimum similarity score for the formula to be retained as a good candidate. Because the isotope pattern similarity is the most distinguishing parameter among the candidate formulas, the results generated by the formula prediction tool are sorted using the similarity score by default (Figure 1G).

MS/MS Fragment Interpretation. Tandem MS is a common method for obtaining structural information about analyzed ions,18 typically utilizing collision-induced dissociation. During MS/MS fragmentation, part of the original ion dissociates and the mass of the uncharged part is called the neutral loss (Figure 3A). The neutral loss represents a fragment of the original molecule, so the chemical formula of such a fragment must be a subset of the original chemical formula. When searching for the ion’s chemical formula, each candidate formula may therefore be evaluated using the ion’s MS/MS spectrum. Provided with a centroided MS/MS pattern $P_{\mathrm{MS/MS}}$ of (m/z, intensity) pairs obtained from precursor mass $M_{\mathrm{precursor}}$, a noise intensity level $I_{\mathrm{noise}}$, and mass tolerance $T_{\mathrm{mass}}$, the MS/MS score is calculated for each candidate formula $F_{\mathrm{candidate}}$ using the following algorithm:

(1) Filter out all ions from $P_{\mathrm{MS/MS}}$ below the intensity level $I_{\mathrm{noise}}$. If $P_{\mathrm{MS/MS}}$ contains any isotopes, remove them. Isotopes are defined as ions with mass 1 ± $T_{\mathrm{mass}}$ Da higher than another ion in $P_{\mathrm{MS/MS}}$ with a higher intensity.

$$P_{\mathrm{filtered}} = \{(m/z, \mathrm{intensity}) \in P_{\mathrm{MS/MS}} \mid \mathrm{intensity} > I_{\mathrm{noise}} \text{ and } (m/z, \mathrm{intensity}) \text{ is not an isotope}\}$$


(2) Calculate a set $N$ of neutral losses for all the ions by subtracting the fragment ion mass from the precursor mass $M_{\mathrm{precursor}}$. Small neutral losses (less than 5 Da) are ignored.

$$N = \{M_{\mathrm{neutral}} = M_{\mathrm{precursor}} - m/z \mid (m/z, \mathrm{intensity}) \in P_{\mathrm{filtered}} \text{ and } M_{\mathrm{neutral}} > 5\ \mathrm{Da}\}$$

(3) Generate chemical formulas for each neutral loss $M_{\mathrm{neutral}} \in N$ using the elements and maximum counts of formula $F_{\mathrm{candidate}}$, within the user-defined mass tolerance $T_{\mathrm{mass}}$. No heuristic rules are applied to filter the formulas in this step.

(4) If at least one formula could be found, the neutral loss is considered as interpreted.

(5) Let $n_{\mathrm{interpreted}}$ be the number of all neutral losses from $N$ that could be interpreted.

(6) The MS/MS score of the formula $F_{\mathrm{candidate}}$ is calculated as

$$\mathrm{score}_{\mathrm{MSMS}} = \frac{n_{\mathrm{interpreted}}}{|N|}$$

In the case of [M + H]+ or [M − H]− precursor ions it is reasonable to assume that once the noise peaks have been removed (see step 1), all remaining true MS/MS fragment peaks must be interpretable by the algorithm. A score lower than 100% thus indicates an invalid candidate. That is not the case when [M + Na]+ ions are analyzed, as a mixture of [M − F1 + Na]+ and [M − F2 + H]+ fragment ions might be detected. As a general recommendation, the MS/MS score threshold parameter can be set to 100% in the case of (de)protonated ions but should be set to a lower value in the case of other ionization adducts.

The MS/MS filter is particularly useful when elements without isotopes such as fluorine (F) are allowed. We used the molecular formula prediction tool to find the chemical formula of a 308.092 m/z peak, representing the [M + H]+ ion of glutathione (C10H17N3O6S). The elements C, H, N, O, P, S, F, and Fe were used for the prediction. As shown in the left panel of Figure 3B, without using the MS/MS restriction, the correct formula was fifth among the candidates because four other fluorine-containing formulas had a better isotope pattern similarity to the experimental data than that of the true glutathione formula. The neutral losses of the MS/MS pattern of the glutathione ion, however, could not be completely interpreted using the fluorine-containing formulas, and by enabling the 100% MS/MS interpretation restriction, the true glutathione formula was ranked first (Figure 3B, right panel). To further demonstrate the benefit of the MS/MS filter, we generated all possible candidate formulas for three detected compounds with MS/MS data: ATP, glutathione, and NAD+. We separated the sets of formulas that passed through the different types of heuristic filters and calculated the overlaps of these sets, as shown in Figure 3C. Notably, in the case of ATP, after applying the elemental, RDBE, and isotope pattern heuristics, the MS/MS filter further reduced the number of remaining candidates from 249 to 130 (that is, to 52% of the candidates).

Figure 1F shows the setting of parameters for the MS/MS evaluation algorithm. The “MS/MS m/z tolerance” parameter defines the mass accuracy of the MS/MS scans, which may be different from the mass accuracy of full MS scans, depending on the instrument used. The “MS/MS score threshold” parameter defines the minimum score for the formula to be retained as a good candidate.

Evaluation of the Tool. The formula prediction tool was evaluated using a metabolomic data set of the extract from fission yeast Schizosaccharomyces pombe cells. The detailed procedures for sample preparation and data acquisition were described previously.9 The cells were cultivated in modified Edinburgh Minimal Medium 2.11 The medium contains the following chemical elements: C, H, O, N, P, S, K, Na, Mg, Ca, Cl, I, B, Fe, Mn, Zn, Mo, and Cu (total 18). It is therefore safe to assume that the intracellular metabolites found in S. pombe would also be composed of these elements. Forty-eight compounds previously identified using pure standards were selected for the evaluation, based on a single requirement that their isotope pattern must be clearly detectable in the MS spectra. The raw MS data file used for the evaluation is included with this manuscript in mzXML format19 in Supporting Information. The MS data were obtained using the LTQ Orbitrap mass spectrometer (Thermo Fisher Scientific) with mass resolving power set to 30 000. This resolving power can distinguish individual isotopes but is not sufficient to reveal the fine isotopic features identifying individual elements (see Figure 2A). Therefore, searching the elemental compositions using all 18 possible elements would inevitably generate a large number of false positives. We therefore decided to restrict the allowed elements to C, H, N, O, P, and S (basic building elements of natural metabolites) and elements with characteristic isotopic features, that is B, Cl, Fe, and Mg (maximum one atom of each). The chemical formulas were searched in a 5-ppm or 0.001 Da window from the detected mass after the removal of the ionization adduct. MZmine version 2.7.2 was used for the data processing. The exact parameters used to process the raw data and perform peak detection are provided in Table S1.

For each peak of interest, the chemical formula prediction tool was executed with the parameters specified in Table S3, Supporting Information. The list of considered metabolites and the results of their chemical formula prediction are summarized in Table S4, Supporting Information. The number of possible candidate formulas, which increases with the compound’s mass, could be greatly reduced using the heuristic restrictions presented in this manuscript (e.g., from 62 036 to 1048 in the case of acetyl-CoA; Figure 4A). For compounds smaller than 250 Da, the true formula was always predicted correctly and, in 67% of the cases, as the only candidate conforming to the heuristic rules. When the filtered candidates were sorted by the isotope pattern similarity score, the true formula was ranked first among the candidates in 79% of the cases and it was among the top 10 candidates in all cases, except that of acetyl-CoA (reason discussed below). The accuracy of the prediction decreased with a decrease in the peak intensity (Figure 4B). This was probably caused by the deteriorating quality of the detected isotope patterns in the low intensity range (Figure 4C). This effect is a documented problem of the Orbitrap detector.20 It was particularly apparent in the case of acetyl-CoA, which combined the unfavorable effects of high mass, causing a large number of possible candidates, with low peak intensity, causing poor isotope pattern quality. The low similarity score of the detected and predicted isotope patterns of acetyl-CoA (Figure S6, Supporting Information) resulted in the true formula being ranked 96th among the 1048 filtered candidates.

Evaluation of the Isotope Pattern-Scoring Algorithm. The sorting of the candidate formulas based on the similarity of isotope patterns is crucial for selection of the best candidates.
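The MS/MS scoring steps (1) to (6) described in the MS/MS Fragment Interpretation subsection can be sketched in Python as follows. This is a minimal illustration, not the module's Java code: formula generation for each neutral loss is replaced by a brute-force enumeration of all sub-formulas of the candidate, which is only practical for small formulas. The fragment list is hypothetical (constructed from the losses of H2O, C2H5NO2, and C5H7NO3 from a 308.0911 m/z precursor, plus one isotope peak and one noise peak), and the nitrogen-free competitor C12H19O7S is likewise a hypothetical candidate chosen only because its mass lies close to that of glutathione.

```python
from itertools import product

# Monoisotopic masses (Da) of the elements used in this example.
MONO_MASS = {"C": 12.0, "H": 1.00782503, "N": 14.0030740,
             "O": 15.9949146, "S": 31.9720707}


def has_subformula(neutral_mass, candidate, mass_tol):
    """Steps 3-4: is there a sub-formula of the candidate with this mass?"""
    elements = sorted(candidate)
    for counts in product(*(range(candidate[e] + 1) for e in elements)):
        mass = sum(n * MONO_MASS[e] for e, n in zip(elements, counts))
        if abs(mass - neutral_mass) <= mass_tol:
            return True
    return False


def msms_score(fragments, precursor_mz, candidate, noise_level, mass_tol):
    """Fraction of neutral losses interpretable by the candidate formula."""
    # Step 1: remove noise, then ions 1 +/- mass_tol Da above a stronger ion.
    kept = [(mz, i) for mz, i in fragments if i > noise_level]
    filtered = [
        (mz, i) for mz, i in kept
        if not any(j > i and abs((mz - other_mz) - 1.0) <= mass_tol
                   for other_mz, j in kept)
    ]
    # Step 2: neutral losses larger than 5 Da.
    losses = [precursor_mz - mz for mz, _ in filtered if precursor_mz - mz > 5.0]
    if not losses:
        return None  # nothing to interpret; the score is undefined
    # Steps 3-6: count interpretable losses and normalise by |N|.
    interpreted = sum(has_subformula(loss, candidate, mass_tol) for loss in losses)
    return interpreted / len(losses)


# Hypothetical MS/MS pattern of a 308.0911 m/z [M + H]+ precursor.
fragments = [(290.0805, 200.0), (233.0591, 900.0), (234.0625, 90.0),
             (179.0485, 1000.0), (150.0000, 3.0)]
glutathione = {"C": 10, "H": 17, "N": 3, "O": 6, "S": 1}
competitor = {"C": 12, "H": 19, "O": 7, "S": 1}  # hypothetical, nitrogen-free

score_true = msms_score(fragments, 308.0911, glutathione, 10.0, 0.004)
score_other = msms_score(fragments, 308.0911, competitor, 10.0, 0.004)
print(score_true, score_other)
```

Because the precursor and its fragments carry the same charge carrier in the (de)protonated case, the difference of the two m/z values is used directly as the neutral loss mass.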


Figure 4. Evaluation of the formula prediction results using a metabolomics data set (see Table S4 for numerical data). (A) Scatter plot of 48 analyzed compounds with m/z value on the X-axis and number of possible formulas (red square) or number of filtered candidates (blue circle) on the Y-axis. The acetyl-CoA data point is indicated by an arrow. (B) Scatter plot of 48 analyzed compounds showing peak height on the X-axis and rank of the correct formula among all filtered candidates on the Y-axis. The dashed line indicates the trend of improving rank with increasing intensity. The acetyl-CoA data point is indicated by an arrow. (C) Scatter plot of 48 analyzed compounds showing peak height on the X-axis and isotope pattern similarity score of the correct formula of each peak on the Y-axis. The dashed line indicates the trend of decreasing isotope pattern score with decreasing intensity. The acetyl-CoA data point is indicated by an arrow. (D) A pie chart showing the results of the comparison of the isotope pattern-scoring algorithms of MZmine (version 2.7.2) and SIRIUS (version 0.9). Numerical results are shown in Table S5, Supporting Information.

To evaluate the performance of our algorithm, we compared it with that of the previously published SIRIUS method.21 From Table S4, we selected compounds smaller than 500 Da to maintain the number of candidate formulas at a reasonable level (total 38). For each compound, we used both the MZmine 2 formula prediction tool and the SIRIUS tool (version 0.9) to generate all possible formulas within a 5-ppm or 0.001 Da mass window using elements C, H, N, O, P, and S. No additional heuristic restrictions were applied. The generated formulas were sorted by the isotope pattern scores, and the resulting rankings of the true formulas are listed in Table S5. Our algorithm provided a better ranking than SIRIUS in 64% (61% + 3%) of the cases, while SIRIUS provided a better result in 5% of the cases (Figure 4D). The comparison of the formula ranks generated by MZmine and SIRIUS using a Wilcoxon signed rank test22 confirmed that the ranks provided by MZmine were significantly lower (p-value 0.00014).

The slight differences between the numbers of the generated formulas by MZmine 2 versus SIRIUS (Table S5) were caused by the different methods used to calculate the mass of the neutral molecule from the ion’s m/z value. The use of the different method also caused the SIRIUS software to entirely miss one of the compounds. The experimentally obtained m/z value of the [M + H]+ ion of PIPES (303.0690 m/z) was converted to 302.0617 Da neutral mass by MZmine 2 and 302.0622 Da by SIRIUS. The theoretical monoisotopic mass of PIPES (C8H18N2O6S2) is 302.0606; thus, the neutral mass imprecisely calculated by SIRIUS fell out of the 5-ppm or 0.001 Da tolerance window (5.5 ppm/0.0017 Da difference). A possible explanation of the obtained results could be that the SIRIUS algorithm was developed and optimized using Q-TOF instrument data, which typically provide better isotope pattern accuracy compared to the Orbitrap spectra analyzed in this manuscript.
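The PIPES example can be retraced with a short calculation. The Python sketch below assumes that the "5-ppm or 0.001 Da" window means the larger of the two tolerances, which is consistent with the numbers quoted in the text, and that the [M + H]+ adduct is removed by subtracting the proton mass. The function names are hypothetical, and the neutral masses are the rounded four-decimal values from the text, so derived differences can deviate from the quoted ones in the last digit.

```python
PROTON_MASS = 1.007276  # Da; mass of H+ (hydrogen atom minus one electron)

# Monoisotopic masses (Da) of the elements occurring in PIPES.
MONO_MASS = {"C": 12.0, "H": 1.00782503, "N": 14.0030740,
             "O": 15.9949146, "S": 31.9720707}


def monoisotopic_mass(formula):
    """Sum of monoisotopic element masses for a dict of element counts."""
    return sum(count * MONO_MASS[element] for element, count in formula.items())


def neutral_mass_from_protonated(mz):
    """Remove the [M + H]+ adduct from a singly charged ion."""
    return mz - PROTON_MASS


def within_window(measured, theoretical, ppm=5.0, abs_tol=0.001):
    """Assumed rule: accept if within the larger of the ppm and Da tolerances."""
    window = max(abs_tol, theoretical * ppm * 1e-6)
    return abs(measured - theoretical) <= window


pipes = {"C": 8, "H": 18, "N": 2, "O": 6, "S": 2}
theoretical = monoisotopic_mass(pipes)
neutral = neutral_mass_from_protonated(303.0690)

print(round(theoretical, 4), round(neutral, 4))
print(within_window(302.0617, theoretical))  # value reported for MZmine 2
print(within_window(302.0622, theoretical))  # value reported for SIRIUS
```

Subtracting the proton mass reproduces the 302.0617 Da value quoted for MZmine 2, and at about 302 Da the 5-ppm tolerance (roughly 0.0015 Da) is the wider of the two limits, which is why a 0.0005 Da difference in the adduct handling is enough to move the true formula outside the window.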

CONCLUSIONS

The software tool presented in this manuscript offers a highly accurate method of chemical formula prediction from MS data, which is not restricted to only certain types of chemical elements or a certain type of raw data, but offers a versatile set of algorithms with adjustable parameters. The source code of this tool is freely available in the MZmine 2 source code repository, allowing easy porting of the presented algorithms to other software packages. Evaluation of the software using a real metabolomic data set showed that the sensitivity and mass resolving power of the Orbitrap MS detector, which was used for data acquisition, were sufficient to determine the correct chemical formula of 79% of the 48 analyzed compounds. Incorrect predictions were caused by measurement errors in the isotope intensities obtained by the Orbitrap detector. Among the algorithms presented, the isotope pattern comparison algorithm was crucial for sorting the candidate formulas by relevance. Evaluation of the scoring algorithm showed that it outperformed the previously published SIRIUS method in 64% of the tested cases. Furthermore, the scalability of the developed algorithms ensures that when MS instruments with higher mass resolving power and isotope sensitivity are used, the formula prediction will become more precise simply by adjusting the parameter values, without requiring adjustments to the software.

ASSOCIATED CONTENT

Supporting Information

Table S1: Parameter settings used for data processing with MZmine 2.7.2. Table S2: Formal valence of elements for the calculation of RDBE values. Table S3: Parameters used to perform the chemical formula prediction. Table S4: Results of the formula prediction evaluation. Table S5: Results of the comparison of the isotope pattern-scoring algorithms of MZmine 2 and SIRIUS. Figure S6: Detected and predicted isotope patterns of acetyl-CoA. Additionally, the raw data file used for the evaluation in mzXML format,19 the MZmine 2 project file, the SIRIUS workspace file, and the lists of all the generated chemical formulas used for evaluation are provided. This material is available free of charge via the Internet at http://pubs.acs.org.



REFERENCES

(1) Marshall, A. G.; Hendrickson, C. L. Annu. Rev. Anal. Chem. 2008, 1, 579−99.
(2) Kind, T.; Fiehn, O. BMC Bioinf. 2006, 7, 234.
(3) Kind, T.; Fiehn, O. Bioanal. Rev. 2010, 2, 23−60.
(4) Kind, T.; Fiehn, O. BMC Bioinf. 2007, 8, 105.
(5) Iijima, Y.; Nakamura, Y.; Ogata, Y.; Tanaka, K.; Sakurai, N.; Suda, K.; Suzuki, T.; Suzuki, H.; Okazaki, K.; Kitayama, M.; Kanaya, S.; Aoki, K.; Shibata, D. Plant J. 2008, 54, 949−62.
(6) (a) Böcker, S.; Rasche, F. Bioinformatics 2008, 24, i49−i55. (b) Rasche, F.; Svatoš, A.; Maddula, R. K.; Böttcher, C.; Böcker, S. Anal. Chem. 2011, 83, 1243−51.
(7) Rojas-Chertó, M.; Kasper, P. T.; Willighagen, E. L.; Vreeken, R. J.; Hankemeier, T.; Reijmers, T. H. Bioinformatics 2011, 27, 2376−83.
(8) Pluskal, T.; Castillo, S.; Villar-Briones, A.; Oresic, M. BMC Bioinf. 2010, 11, 395.
(9) Pluskal, T.; Nakamura, T.; Villar-Briones, A.; Yanagida, M. Mol. Biosyst. 2010, 6, 182−98.
(10) Gutz, H.; Heslot, H.; Leupold, U.; Loprieno, N. Handbook Genet. 1974, 1, 395−446.
(11) Mitchison, J. M. Methods Cell Physiol. 1970, 131−165.
(12) Steinbeck, C.; Han, Y.; Kuhn, S.; Horlacher, O.; Luttmann, E.; Willighagen, E. J. Chem. Inf. Comput. Sci. 2003, 43, 493−500.
(13) Pretsch, E.; Bühlmann, P.; Affolter, C. Structure determination of organic compounds: tables of spectral data, 3rd ed.; Springer Verlag: Berlin, 2000; p 425.
(14) Morikawa, T.; Newbold, B. T. Khimiya (Sofiya, Bulg.) 2003, 12, 445−450.
(15) Stoll, N.; Schmidt, E.; Thurow, K. J. Am. Soc. Mass Spectrom. 2006, 17, 1692−9.
(16) Marshall, A. G.; Hendrickson, C. L.; Shi, S. D. Anal. Chem. 2002, 74, 252A−259A.
(17) Wang, Y.; Gu, M. Anal. Chem. 2010, 82, 7055−62.
(18) De Hoffmann, E.; Stroobant, V. Mass spectrometry: principles and applications; Wiley-Interscience: New York, 2007.
(19) Pedrioli, P. G.; Eng, J. K.; Hubley, R.; Vogelzang, M.; Deutsch, E. W.; Raught, B.; Pratt, B.; Nilsson, E.; Angeletti, R. H.; Apweiler, R.; Cheung, K.; Costello, C. E.; Hermjakob, H.; Huang, S.; Julian, R. K.; Kapp, E.; McComb, M. E.; Oliver, S. G.; Omenn, G.; Paton, N. W.; Simpson, R.; Smith, R.; Taylor, C. F.; Zhu, W.; Aebersold, R. Nat. Biotechnol. 2004, 22, 1459−66.
(20) Erve, J. C.; Gu, M.; Wang, Y.; DeMaio, W.; Talaat, R. E. J. Am. Soc. Mass Spectrom. 2009, 20, 2058−69.
(21) Böcker, S.; Letzel, M. C.; Liptak, Z.; Pervukhin, A. Bioinformatics 2009, 25, 218−24.
(22) Wilcoxon, F. Biom. Bull. 1945, 1, 80−83.






AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]; fax: +81-98-966-2890.

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

We thank Matej Orešič for valuable comments on this paper. We appreciate the generous advice of Gunnar Wilken and Radek Motal regarding mathematical analysis.
