Anal. Chem. 2007, 79, 1758-1763

Fundamentals of Molecular Formula Assignment to Ultrahigh Resolution Mass Data of Natural Organic Matter

Boris P. Koch,*,† Thorsten Dittmar,‡ Matthias Witt,§ and Gerhard Kattner†

Department of Chemical Ecology, Alfred Wegener Institute for Polar and Marine Research, Am Handelshafen 12, 27570 Bremerhaven, Germany, Department of Oceanography, Florida State University, OSB 311, Tallahassee, Florida 32306-4320, and Bruker Daltonik GmbH, Fahrenheitstrasse 4, 28359 Bremen, Germany

Ultrahigh-resolution mass spectrometry via the Fourier transform ion cyclotron resonance technique (FT-ICR-MS) allows the identification of thousands of different molecular formulas in natural organic matter and petroleum samples. Molecular formula assignment from mass data is most critical and time-consuming for these samples, and in many cases, several formulas can be determined for the same molecular mass. Therefore, automated procedures are required for an efficient exploitation of the extensive data sets. Here, we revise statements in a recent publication,1 which might result in a misleading impression about our approach of formula assignment in a previous work. We also summarize and categorize existing procedures for formula assignment. In addition, we propose new techniques, which are suitable to be implemented in automated evaluation software. The homologous series approach is extended toward a building block approach that can be applied as a new exclusion criterion for incorrect formula assignments. The examination of stable isotope ratios of individual molecules in natural organic matter can be applied as an additional and intrinsic evaluation for calculated molecular formulas.

The application of ultrahigh-resolution mass spectrometry via the Fourier transform ion cyclotron resonance technique (FT-ICR-MS) led to extensive new insights into the molecular composition of complex natural organic matter.2-10 Natural organic matter and

* Corresponding author. Phone: +49 471 4831 1346. Fax: +49 471 4831 1425. E-mail: [email protected].
† Alfred Wegener Institute for Polar and Marine Research.
‡ Florida State University.
§ Bruker Daltonik GmbH.
(1) Kujawinski, E. B.; Behn, M. D. Anal. Chem. 2006, 78, 4363-4373.
(2) Kujawinski, E. B.; Del Vecchio, R.; Blough, N. V.; Klein, G. C.; Marshall, A. G. Mar. Chem. 2004, 92, 23-37.
(3) Koch, B. P.; Witt, M.; Engbrodt, R.; Dittmar, T.; Kattner, G. Geochim. Cosmochim. Acta 2005, 69, 3299-3308.
(4) Stenson, A. C.; Marshall, A. G.; Cooper, W. T. Anal. Chem. 2003, 75, 1275-1284.
(5) Kim, S.; Simpson, A. J.; Kujawinski, E. B.; Freitas, M. A.; Hatcher, P. G. Org. Geochem. 2003, 34, 1325-1335.
(6) Koch, B. P.; Dittmar, T. Rapid Commun. Mass Spectrom. 2006, 20, 926-932.
(7) Hertkorn, N.; Benner, R.; Frommberger, M.; Schmitt-Kopplin, P.; Witt, M.; Kaiser, K.; Kettrup, A.; Hedges, J. I. Geochim. Cosmochim. Acta 2006, 70, 2990-3010.

1758 Analytical Chemistry, Vol. 79, No. 4, February 15, 2007

petroleum is degraded biomass that occurs mainly in soils, sediments, sedimentary rock, and the oceanic water column. It comprises one of the largest organic carbon reservoirs on earth. Because of the ultrahigh resolution of high-field FT-ICR-MS, several thousand ions with different m/z values can be detected in one mass spectrum, and molecules with the same nominal mass can be distinguished.11 On the basis of the extraordinary mass accuracy and by applying basic tools such as nitrogen rule and calculation of double bond equivalents,4 discrete molecular formulas can be determined for each ion. For formula computation, numbers of atoms are combined iteratively until the resulting total mass matches a given mass window. The mass window is defined by the analytical precision of the instrument. Bulk natural organic matter is composed mainly of C, H, O, and N, with minor contributions of P and S. Other naturally occurring elements are also present in trace amounts. Ideally, all reasonable types of atoms should be considered in FT-ICR-MS analysis leading to just one molecular formula for every distinct m/z value in the mass spectrum. However, even within the high mass accuracy of modern FT-ICR-MS (600 Da, more than 15 different molecular formulas can be calculated for each detected mass within a mass tolerance of 1 ppm (range of elemental composition: C0-∞H0-∞O0-∞N0-30P0-2S0-2). The most challenging problem in the evaluation process of FT-ICR-MS data of natural organic matter samples is the identification of the correct molecular formula among the many theoretically possible solutions. Unambiguous parameters for the decision whether a

(8) Kujawinski, E. B.; Hatcher, P. G.; Freitas, M. A. Anal. Chem. 2002, 74, 413-419.
(9) Kim, S.; Kramer, R. W.; Hatcher, P. G. Anal. Chem. 2003, 75, 5336-5344.
(10) Dittmar, T.; Koch, B. P. Mar. Chem. 2006, 102, 208-217.
(11) Marshall, A. G.; Hendrickson, C. L.; Jackson, G. S. Mass Spectrom. Rev. 1998, 17, 1-35.
10.1021/ac061949s CCC: $37.00

© 2007 American Chemical Society Published on Web 01/09/2007

Figure 1. Suwannee River Fulvic Acid Standard (SRFA II, International Humic Substances Society). ESI, negative mode: number of possible molecular formulas for each detected ion with odd (5029 total peaks with S/N > 3 in the spectrum) and even (3400 total peaks with S/N > 3) nominal m/z. Number of total and odd peaks with at least 1 molecular formula assignment (total assigned, odd assigned) and the respective sum of all possible assignments (possible formulas) are presented. Four different assumptions regarding the number of elements and intensities were used for formula determination: (a) C0-∞H0-∞O0-∞, S/N > 20; (b) C0-∞H0-∞O0-∞; (c) C0-∞H0-∞O0-∞N0-30; (d) C0-∞H0-∞O0-∞N0-30S0-2P0-2 (all S/N > 3). Common assumptions in all four scenarios: mass accuracy 0.3. Assuming that every molecule contains at least one C and one H atom is generally very useful to rule out some false positives. Here, this conservative assumption was already covered by H/C > 0.3. Quantitative validation of even m/z ions is not included, because 13C compounds that contribute substantially to even m/z peaks are not considered.
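The iterative formula computation that underlies Figure 1 can be illustrated with a minimal Python sketch. It is not the authors' software. The function name `candidate_formulas`, the restriction to C, H, N, and O, the use of neutral monoisotopic masses (rather than measured ion m/z), and the integer double bond equivalent filter are assumptions made for illustration; the 1 ppm window and the ratio thresholds H/C > 0.3, O/C ≤ 1, and N/C ≤ 1 are taken from the text.

```python
# Monoisotopic masses (Da) of the most abundant isotopes.
MASS = {"C": 12.0, "H": 1.007825032, "N": 14.003074005, "O": 15.994914620}


def candidate_formulas(neutral_mass, ppm=1.0, max_n=2):
    """Return (formula, exact_mass) pairs within +/- ppm of neutral_mass.

    Hypothetical helper: enumerates CcHhNnOo combinations only.
    """
    tol = neutral_mass * ppm * 1e-6
    hits = []
    for n in range(max_n + 1):
        rest_n = neutral_mass - n * MASS["N"]
        for o in range(int(rest_n // MASS["O"]) + 1):
            rest_o = rest_n - o * MASS["O"]
            for c in range(1, int(rest_o // MASS["C"]) + 1):
                # Hydrogen closes the remaining mass balance.
                h = round((rest_o - c * MASS["C"]) / MASS["H"])
                if h < 1:
                    continue
                mass = (c * MASS["C"] + h * MASS["H"]
                        + n * MASS["N"] + o * MASS["O"])
                if abs(mass - neutral_mass) > tol:
                    continue
                # Conservative element-ratio thresholds quoted in the text.
                if h / c <= 0.3 or o / c > 1 or n / c > 1:
                    continue
                # Assumed filter: DBE = C - H/2 + N/2 + 1 must be a
                # non-negative integer for a neutral molecule.
                twice_dbe = 2 * c - h + n + 2
                if twice_dbe < 0 or twice_dbe % 2:
                    continue
                name = f"C{c}H{h}" + (f"N{n}" if n else "") + (f"O{o}" if o else "")
                hits.append((name, mass))
    return hits
```

Calling `candidate_formulas(408.142032, max_n=0)` with the neutral monoisotopic mass of C20H24O9 should return that formula among the hits. Raising `max_n` or adding further elements enlarges the search space, which is the effect that panels a to d of Figure 1 compare.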

molecular formula is correct or incorrect are extremely important for FT-ICR-MS data evaluation, especially when automated software is used. Over the past years, several rules and assumptions have been established in order to avoid multiple formula assignments for one mass.2-5,12 Manual formula assignment is extremely time-consuming. Therefore, automated postprocessing is important and critical for an efficient exploitation of the FT-ICR-MS data. Recently, Kujawinski and Behn1 presented a computer routine for molecular formula assignment of FT-ICR-MS data from complex natural organic matter samples. This routine is promising and the first published step toward automatically sorting through the multiple formula assignments in order to identify a single solution for each identified mass. One objective of this work is to revise statements in Kujawinski and Behn1 on our previous work.3 In this context, we applied the principles of their approach to our data set3 in order to assess limitations and advantages of different procedures. For an improvement of universally applicable routines, we revisit the fundamental principles of molecular formula assignment from ultrahigh-resolution mass data of natural organic matter. Another objective is to introduce novel rules for the exclusion of elements and the refinement of the homologous series approach.4,12 A new stable carbon isotope approach allows for an independent validation of the calculated molecular formulas and can be implemented into automated routines.

A Priori Exclusion of Elements and “Best Fit”. The number of possible molecular formulas at a given mass window is a function of the number of elements considered in the calculations (Figure 1, Table S-1 Supporting Information). Therefore, as an indispensable first step, the major elements in the sample must be defined a priori. However, even if only the major elements of natural organic matter (C, H, O, N, P, S) for a fulvic acid standard are considered (SRFA II, IHSS, Figure 1), unequivocal formula assignment is not possible (within a mass window of 1 ppm). At 500 Da, ∼10 solutions can be calculated, and for

(12) Hughey, C. A.; Hendrickson, C. L.; Rodgers, R. P.; Marshall, A. G.; Qian, K. N. Anal. Chem. 2001, 73, 4676-4681.

molecules of >1000 Da more than 80 molecular formulas can be calculated for each detected mass. If the calculations are restricted to C, H, O, and N, formula assignments are unequivocal for all masses of 20, no molecular formula containing 1-5 N atoms could be calculated for any of the detected peaks. In other words, the relatively low abundance of even m/z in the spectra indicated that there were no abundant (S/N > 20) N1, N3, N5, etc., compounds in the spectra. Based on the assumption that N2, N4, and N6 compounds (which would show up on odd m/z) are likely to have similar intensities as compounds containing odd numbers of N, these species can be excluded for abundant odd m/z values. At lower signal intensities, several N compounds could be identified that contained one N atom (see also Figure 2). Some compounds with two N atoms were also present at even lower abundances. Based on this information, only peaks with high signal intensities (S/N > 20) were used for the formula assignment. These peaks were essentially all free of N, so that N could be excluded for the formula assessment. This procedure led to an unequivocal

(13) Kim, S.; Rodgers, R. P.; Marshall, A. G. Int. J. Mass Spectrom. 2006, 251, 260-265.
(14) Fu, J. M.; Purcell, J. M.; Quinn, J. P.; Schaub, T. M.; Hendrickson, C. L.; Rodgers, R. P.; Marshall, A. G. Rev. Sci. Instrum. 2006, 77.


identification for all considered peaks over the entire mass range without the need of applying a “homologous series” approach (see below) and certainly without using the best fit. The intensity-based a posteriori approach excludes all peaks in the mass spectrum that potentially contain N and probably other rare elements. As a consequence, fewer peaks in the mass spectrum are identified compared to the a priori exclusion of elements (Figure 1a). The a priori approach calculates molecular formulas for substantially more peaks in the spectrum (Figure 1b), but with a higher risk of wrong assignments. Panels c and d in Figure 1 demonstrate that without any further types of evaluation, peaks from Figure 1b can also be identified as S or P compounds.

“Chemical Building Block” Approach. All published mass spectra of refractory natural organic matter samples exhibit remarkably regular patterns. Most peaks can be organized into “molecular families” connected through constant relative peak distances of 14.015 65 Da, 36.4 mDa, and 0.995 25 Da among several others. These regular patterns in the mass spectra of refractory natural organic matter can be assigned to specific molecular changes within a series and can be advantageously sorted and evaluated by means of a Kendrick mass analysis.4,8,10,12,15 The mass difference of 14.015 65 Da can be attributed to the addition of 12CH2, and the difference of 36.4 mDa can be explained by the exchange of 12CH4 versus 16O, e.g., in C20H24O9 and C19H20O10 (Figure 2, peaks 2 and 3). The exchange of 12CH with 14N results in a mass difference of 0.995 25 Da (Figure 2, peaks 3 and 3a). These chemical families are often referred to as “homologous series”. It should be noted that this expression does not necessarily match the strict definition of homologous series.
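The three peak distances quoted above follow directly from monoisotopic masses, and a Kendrick mass defect groups members of the same CH2 series. The short Python sketch below shows both calculations. The helper names (`formula_mass`, `kendrick_mass_defect`) are hypothetical, and the sign convention used for the Kendrick mass defect (nominal Kendrick mass minus Kendrick mass) is one common choice assumed here, not necessarily the one used by the authors.

```python
# Monoisotopic masses (Da) of the most abundant isotopes.
MASS = {"C": 12.0, "H": 1.007825032, "N": 14.003074005, "O": 15.994914620}


def formula_mass(counts):
    """Exact mass of a formula given as {element: count}."""
    return sum(MASS[el] * n for el, n in counts.items())


# Building-block mass differences discussed in the text.
CH2 = formula_mass({"C": 1, "H": 2})                    # addition of CH2
CH4_VS_O = formula_mass({"C": 1, "H": 4}) - MASS["O"]   # CH4 exchanged for O
CH_VS_N = MASS["N"] - formula_mass({"C": 1, "H": 1})    # CH exchanged for N


def kendrick_mass_defect(mass):
    """Kendrick mass defect on the CH2 scale (CH2 defined as exactly 14)."""
    kendrick_mass = mass * 14.0 / CH2
    return round(kendrick_mass) - kendrick_mass


if __name__ == "__main__":
    print(f"CH2      : {CH2:.5f} Da")
    print(f"CH4 vs O : {CH4_VS_O * 1000:.1f} mDa")
    print(f"CH vs N  : {CH_VS_N:.5f} Da")
    m = formula_mass({"C": 19, "H": 20, "O": 10})
    # Two members of one CH2 series share the same Kendrick mass defect.
    print(kendrick_mass_defect(m), kendrick_mass_defect(m + CH2))
```

Sorting a peak list by Kendrick mass defect is therefore one simple way to collect candidate members of a CH2 series before any formula is assigned; analogous scales can be built for the other building blocks.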
The CH2 series CnH2nO1, for example, is characterized by relative peak distances of 14.015 65 Da and could represent a mixture of ketones, aldehydes, unsaturated ethers, or alcohols within the same series. However, these “functional” relationships (chemical building blocks) are very useful for molecular formula assignment.1,4 At low nominal masses, molecular formula assignment is more reliable, because the number of possible element combinations for one detected peak increases with molecular mass (Figure 1). The molecular formulas for low masses in the spectrum can be extended to higher masses by exploring the chemical relationships along a “homologous series”. However, unless other criteria are taken into account, this approach can be problematic for the mass range beyond 350 Da where unequivocal peak assignment is not always possible (Figure 1). A prerequisite to use the mass difference between two or more peaks for formula assignment is to unequivocally identify one of these peaks, often the smallest member of a series. If this identification is ambiguous, the whole series assignment might be incorrect. Nevertheless, even in the higher mass range, the observed regular patterns in the mass spectra contain valuable information that can be explored for automated formula assignment in a chemical building block approach. For small peaks, less abundant elements (N, P, S, etc.) cannot be excluded a posteriori and several molecular formulas can be calculated even within the lower mass range. For instance, including C, H, O, N, P, and S and thresholds of N/C ≤ 1, O/C ≤ 1, and H/C > 0.3 for the assignment of peak 1 at 407.062 00 m/z in Figure 2 results in four theoretically possible

(15) Kendrick, E. Anal. Chem. 1963, 35, 2146-2154.

Figure 2. (a) Negative electrospray ionization FT-ICR mass spectrum (9.4 T) at 407 and 408 m/z for a marine DOM sample from the Weddell Sea (Antarctica, 700 m water depth). Mass accuracies for all assignments were 500 Da). The error of the 13C-isotope signal intensities was too variable to assess the exact number of carbon atoms in the parent ion. In their study, Cdev ranged from 0 to 93 carbon atoms. Inspired by Stenson et al.’s work,4 we explored our FT-ICR-MS database for thousands of parent ions and their respective 13C112Cn-1 isotope signal. For the assessment of these numbers, we focused on C, H, O compounds and restricted m/z to 3 (2110 ion pairs) was -1.59 C atoms; the mean was -1.89. As expected, Cdev scattered strongly for peaks with relatively low intensities. However, for isotope peaks with S/N > 25 (515 ion pairs), we found that Cdev varied only from -2.61 to +0.54 (total range, 3.15 C atoms). Data points were normally distributed, so that median and mean were both -1.13 (±0.59 standard deviation). Hence, the number of C atoms in a molecule with high 13C-signal intensities can be predicted from the 13C approach with a precision of better than ±1.6 C atoms. In this range of error, the isotope signal can be very useful to eliminate false identifications for abundant ions. For instance, the N6-series described above can be identified as incorrect on the basis of the isotopic validation. However, it should be considered that the proposed Cdev limit of ±1.6 C atoms represents a first estimate based on six samples and might vary with the number of analyses and instrument characteristics. In a few special cases, other isotope information might also be helpful. For example, peak 2 in Figure 2 can be assigned with C17H8N14 as an additional mathematically possible formula. The unlikely compound (which is excluded by the homologous series approach) would create a small (5.2%) and resolvable 15N-isotope signal between peaks 2a and 2b at 408.095 m/z. The absence of this signal can be taken as an additional indication that this formula is incorrect. The isotope approach is especially valuable because it is independent from a priori or a posteriori assumptions and uses only intrinsic unequivocal information from the spectral data.

(19) Mitchell, D. W.; Smith, R. D. Phys. Rev. E 1995, 52, 4366-4386.
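The isotopic validation described above can be sketched in a few lines of Python. For a molecule with n carbon atoms and a natural 13C abundance p, the binomial distribution gives an intensity ratio of n·p/(1 − p) between the ion containing one 13C and the all-12C parent ion, so the ratio can be inverted to predict n. The sketch rests on several assumptions that are not taken from the source: the abundance value p = 0.0107, the sign convention Cdev = predicted minus assigned carbon number, the reading of the ±1.6 C atoms limit as a simple absolute cutoff, and all function names.

```python
P13C = 0.0107  # assumed natural 13C abundance (mole fraction)


def predicted_carbon_number(parent_intensity, isotope_intensity, p13=P13C):
    """Carbon number implied by the 13C1 / 12C-only intensity ratio."""
    return (isotope_intensity / parent_intensity) * (1.0 - p13) / p13


def carbon_deviation(parent_intensity, isotope_intensity, assigned_c):
    """Cdev, assumed here to be predicted minus assigned carbon number."""
    return predicted_carbon_number(parent_intensity, isotope_intensity) - assigned_c


def passes_isotope_check(parent_intensity, isotope_intensity, assigned_c, limit=1.6):
    """True if |Cdev| stays within the limit (1.6 C atoms in the text)."""
    return abs(carbon_deviation(parent_intensity, isotope_intensity, assigned_c)) <= limit


if __name__ == "__main__":
    # Synthetic intensities for an ion that truly contains 20 C atoms.
    parent = 1.0
    isotope = 20 * P13C / (1.0 - P13C)
    for formula, n_c in (("C20H24O9", 20), ("C17H8N14", 17)):
        print(formula, passes_isotope_check(parent, isotope, n_c))
```

With these synthetic intensities the 20-carbon assignment passes and the 17-carbon alternative fails. Real data would also need the S/N restriction on the isotope peak described in the text, and the text reports a mean Cdev of -1.13 for such peaks, so a practical cutoff might be centered on an empirically determined offset rather than on zero.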

Conclusions and Perspectives. In order to achieve unambiguous formula assignment from ultrahigh-resolution mass data, we propose different approaches that reduce the number of considered elements for each peak based on intrinsic information from the mass spectrum. The problematic extrapolation of bulk chemical information to the molecular level can thus be avoided. In addition, all assigned molecular formulas within a given mass window are equally considered, without applying the ambiguous best-fit approach to sort through the possible molecular formulas. The most conservative approach is probably the restriction of formula assignments to mass peaks that do not contain N (a posteriori exclusion of elements). However, this approach excludes all smaller peaks and is only suitable for samples that exhibit a pronounced odd over even m/z pattern. The chemical building block approach can be applied to peaks that contain other heteroatoms. This approach explores the high organization within the mass spectra, which is typical for most natural organic matter samples, to calculate the number of principal chemical building blocks in a molecule. Different from the original homologous series approach, unlikely combinations of atoms can be excluded based on the number of peaks within a molecular family. If only enough chemical building blocks are considered, molecular formula assignment becomes unambiguous. This approach achieves formula assignments for most peaks in a spectrum, but compared to the a posteriori exclusion of elements, it includes the additional assumption that the observed spacing patterns in the spectra are indeed related to chemical building blocks. This assumption is most likely true, because the observed complex space patterns and the high degree of organization within most spectra of natural organic matter cannot be explained by a random combination of elements. 
For fresh biological samples, however, chemical building block approaches are challenging and probably less efficient because these samples exhibit a lower degree of order and only a little information on metabolic and degradative molecular level reactions is available.1

As a conclusion, we can categorize the procedures for ultrahigh resolution mass spectrometry data evaluation into two mandatory (1-2) and two optional (3-4) steps.

(1) Formula calculation on the basis of a most conservative a priori definition of elements: C, H, O, N, P, S, and 13C for natural organic matter. Further elements like Na or Cl might be important depending on the type of ionization used.

(2) Definition of unequivocal exclusion criteria: this comprises the specification of the instrument error, check for double bond equivalents, application of the nitrogen rule, and most conservative thresholds for molecular element ratios.

(3) A posteriori procedures: sorting molecular formulas into homologous series; in case one peak belongs to several homologous series of the same type (e.g., CH4 vs O), the longest complete series is most likely correct (chemical building block approach). An upper threshold for the intensity of nitrogen compounds allows exclusion of N for peaks with intensities beyond this threshold (a posteriori approach for the exclusion of elements).

(4) Exploitation of implicit mass spectral information: calculation of isotope ratios and predicted carbon number for intense peaks (S/N of isotope peak >25). Exclusion of single molecular formulas as well as complete series if Cdev surpasses established limits (in the example given above ±1.6).

These procedures were implemented in our FT-ICR-MS database and can be optionally combined. In the near future, some of the difficulties concerning formula assignment presumably will be resolved by technical improvements of the ICR technique itself. Progress in new cell designs and technical developments for digitizers and amplifiers already deliver general mass accuracies of better than 0.1 ppm even for very complex natural organic matter samples.13 New evaluation methods for FT-ICR-MS data will help to increase the confidence in the assigned formula and to support the general aim to identify as many correct formulas as possible.

ACKNOWLEDGMENT

We thank the associate editor John R. Yates, E. Kujawinski, and an anonymous reviewer for valuable comments and suggestions. This work was financially supported by the Petroleum Research Fund (ACS PRF#41515-G2), the National Oceanic and Atmospheric Administration (NOAA GC05-099), Deutsche Forschungsgemeinschaft (DFG KO 2164/3-1), and the German Academic Exchange Service (DAAD PPP USA 315/ab).

SUPPORTING INFORMATION AVAILABLE

Two tables in ASCII-Format (tab delimited text; mass range ≤750 and >750 m/z; each ∼62 000 rows). Files contain all formulas shown in Figure 1 including peak intensity, S/N, measured and calculated masses, mass accuracy, DBE, and molecular elemental ratios. This material is available free of charge via the Internet at http://pubs.acs.org.

Received for review October 17, 2006. Revised ?????. Accepted December 20, 2006. AC061949S
