Environ. Sci. Technol. 2010, 44, 7576–7582
Multivariate Statistical Approaches for the Characterization of Dissolved Organic Matter Analyzed by Ultrahigh Resolution Mass Spectrometry
Rachel L. Sleighter,† Zhanfei Liu,§,† Jianhong Xue,‡ and Patrick G. Hatcher*,†
Department of Chemistry and Biochemistry, Old Dominion University, Norfolk, Virginia 23529, and Department of Biological Sciences, Virginia Institute of Marine Sciences, Gloucester Point, Virginia 23062
Received February 08, 2010. Revised manuscript received July 29, 2010. Accepted September 01, 2010.
We apply multivariate statistics to explore the large data sets encountered from Fourier transform ion cyclotron resonance mass spectra of dissolved organic matter (DOM). Molecular formula assignments for the individual constituents of DOM are examined by hierarchal cluster analysis (HCA) and principal component analysis (PCA), to measure the relationships between numerous DOM samples. We compare two approaches: (1) using averages of elemental ratios and double bond equivalents calculated from the formulas, and (2) employing individual formulas and either their presence/absence or relative magnitude in each sample. With approach 2, PCA deciphers which of the thousands of formulas are significant to particular samples, and then a van Krevelen diagram highlights what types of compounds are molecular signatures to the samples. Our dual approach, especially approach 2, allows for complex data sets to be more easily interpreted, aiding in the characterization of DOM from various sources. By applying this methodology, clear trends can be delineated, trends that are not apparent from currently employed methods. Terrestrial DOM contains various lignin-derived compounds, tannins, and condensed aromatics. Marine DOM contains aliphatic compounds with heteroatom functionalities, as well as lignin-like molecules.
1. Introduction The application of Fourier transform ion cyclotron resonance mass spectrometry (FTICR-MS) for the characterization of natural organic matter (NOM) has revolutionized our understanding of NOM in soil and water systems (ref 1 and references therein). Humic substances (2-4) were the initial focus of such studies, but more recently the spotlight has expanded to that of dissolved organic matter (DOM) (5-11), to understand how it cycles and transforms in aquatic environments.
* Corresponding author phone: 757-683-6537; 757-683-5310; email: [email protected]. † Old Dominion University. ‡ Virginia Institute of Marine Sciences. § Current address: Marine Science Institute, University of Texas at Austin, Port Aransas, TX 78373.
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 44, NO. 19, 2010
While much has been discovered from these previous studies, the data interpretation required for the large, complex data sets generated by FTICR-MS has yet to be optimized. A single DOM mass spectrum contains thousands of resolved peaks that can be assigned to individual formulas, due to the mass accuracy of FTICR-MS (3, 12). Analyzing these data sets is key to comprehending the makeup of each sample. Visualization schemes, for example, van Krevelen diagrams (5, 9, 11, 13) and Kendrick mass defect plots (3, 8, 10), have been applied for data exploration. High performance liquid chromatography (HPLC) preceding FTICR-MS has also been employed to simplify the acquired data (14-16). The major drawback of visualization diagrams and chromatographic separations is that it can be challenging to compare many samples, each containing thousands of informational data points.
Multivariate statistics, such as hierarchal cluster analysis (HCA) and principal component analysis (PCA), are particularly useful for distinguishing relationships and trends among large numbers of samples. HCA has been used for mass spectral differentiation of oceanic DOM from various depths (9), DOM before and after photoirradiation (17), fractions of DOM collected from HPLC (14, 16), and pore water DOM from riverine DOM (18). To our knowledge, there are only two published papers that combine two multivariate statistical methods to explore samples analyzed by FTICR-MS. Kujawinski et al. (19) use HCA and nonmetric multidimensional scaling (NMS) in combination with indicator species analysis (ISA) to pinpoint molecular source markers for photodegradation of DOM and bacterial metabolism. ISA is employed to identify molecular formulas that correspond to certain groups of samples, requiring the samples to be grouped prior to analysis. Hur et al. (20) utilize HCA and PCA to characterize and differentiate between various crude oil samples.
In our current study, we combine HCA and PCA to examine large data sets of DOM FTICR mass spectra, following the suggestion of Reemtsma (21) that development of statistical methods is needed to compare different NOM samples. Water samples from various aquatic environments were collected, and DOM was isolated by C18 extraction and electrodialysis (ED). Here, small-scale ED was utilized for desalting, analogous to a previously used isolation of DOM (22, 23). HCA and PCA are employed with two different approaches for parameter selection, and our methods group the samples and simultaneously identify their characteristic variables. The purpose of this study is to employ HCA and PCA to assist in data interpretation, with the goal of characterizing large numbers of samples from various natural environments prepared in numerous ways.
10.1021/es1002204 © 2010 American Chemical Society. Published on Web 09/13/2010
2. Experimental Section 2.1. Samples and Preparation. Water samples along a terrestrial to marine transect of the lower Chesapeake Bay were collected at five locations in August 2007 (Figure S1 in the Supporting Information (SI)). Samples from a similar transect were recently analyzed by FTICR-MS (11, 16), but were collected at a different time. Briefly, sampling begins at the Dismal Swamp (DS, Suffolk, VA) and continues north up the Elizabeth River, where two sites were used for collection (Great Bridge (GB), VA and Town Point (TP) Park, Norfolk, VA). Sampling continues to the Chesapeake Bay Bridge (CBB) and concludes at an offshore coastal (OSC) site about 10 miles off the lower Delmarva Peninsula. Water samples were collected from 15 other locations adjacent to the Delaware Bay and Chesapeake Bay in the Atlantic Ocean (SI Figure S1). Depth profiles were collected at stations 12 (2 m, 40 m,
200 m, 350 m) and 13 (2 m, 30 m, 401 m). These samples (20 total) were collected by Niskin bottles on a CTD rosette aboard the R/V Hugh R. Sharp in November 2007. A brown-rotted cedar wood from the DS, that is enriched in lignin and depleted in cellulose (16), was extracted with water and serves as a soluble natural lignin standard (WE). SI Table S1 displays the various physical properties that were measured for each sample. All water samples were filtered and subjected to C18 solid phase extraction (3M, Empore) according to Kim et al. (6) to isolate desalted and concentrated DOM brackish and marine samples. Because the DS water and WE both have high DOM contents and zero salinity, they were directly analyzed without the C18 extraction. The five DOM samples from the Chesapeake Bay transect were additionally desalted by electrodialysis, ED (Harvard Apparatus), as described by Chen et al. (24). Details of the sample preparation, including C18 extraction and desalting by ED, are described in the SI. 2.2. Instrumentation. Samples were continuously infused into an Apollo II electrospray (ESI) ion source of a Bruker Daltonics 12 T Apex Qe FTICR-MS. Negative ion mode was used, varying the ESI voltages along with the ion accumulation times and number of coadded scans for optimization. DS-WW was analyzed three times in 1 day and then again 12, 22, and 31 days later for reproducibility testing. In total, 38 samples [six DS-WW, five transect samples (DS, GB, TP, CBB, OSC) each prepared with C18 and ED, WE-WW and C18, and 20 C18 extracted cruise samples] were analyzed by ESI-FTICR-MS. All mass spectra were calibrated according to previously published protocols (25). A molecular formula calculator developed at the National High Magnetic Field Laboratory (v.1.0 NHMFL, 1998) generated formulas using carbon, hydrogen, oxygen, nitrogen, sulfur, and phosphorus. Only m/z values with a signal-to-noise (S/N) above 3 were used for formula assignments. 
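The peak-acceptance criteria of this section (S/N above 3 and agreement within 1.0 ppm of the theoretical mass) can be illustrated with a minimal sketch. It assumes, as is common for negative-mode ESI but not stated explicitly in the text, that each peak is a singly charged deprotonated ion [M-H]-. The function names and the example formula are hypothetical and are not part of the NHMFL formula calculator.

```python
# Monoisotopic masses (u) of the most abundant isotopes; standard reference values.
MONO_MASS = {"C": 12.0, "H": 1.00782503, "O": 15.99491462,
             "N": 14.00307400, "S": 31.97207117, "P": 30.97376200}
ELECTRON_MASS = 0.00054858


def deprotonated_mz(formula):
    """Theoretical m/z of the singly charged [M-H]- ion (assumed ion type).

    `formula` is a dict of element counts for the neutral molecule,
    e.g. {"C": 10, "H": 12, "O": 5}.
    """
    neutral = sum(MONO_MASS[el] * n for el, n in formula.items())
    # Losing a proton = losing one hydrogen atom and keeping its electron.
    return neutral - MONO_MASS["H"] + ELECTRON_MASS


def ppm_error(measured_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def accept(peak, formula, min_sn=3.0, max_ppm=1.0):
    """peak is a (m/z, S/N) tuple; the default thresholds are those quoted in the text."""
    mz, sn = peak
    return sn > min_sn and abs(ppm_error(mz, deprotonated_mz(formula))) <= max_ppm
```

For example, a hypothetical peak at m/z 211.0612 with S/N 10 would be accepted for the neutral formula C10H12O5, whereas the same formula would be rejected for a peak at m/z 211.0620 (about 3.8 ppm away).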
All assigned formulas agreed within 1.0 ppm of the theoretical mass. Details of the FTICR-MS analyses can be found in the SI, and Table S2 shows the number of assigned molecular formulas and the percentage of peaks that were assigned a formula for each sample. 2.3. Statistical Analysis. We use hierarchal cluster analysis (HCA) and principal component analysis (PCA); the principles of how these two statistical tools are applied to geochemical data are described in detail by Xue et al. (26). HCA results are expressed in a hierarchical tree, a dendrogram. The branching indicates similarity of samples to each other but does not reveal which variables are responsible for the groupings. PCA, which assumes a linear relationship, reduces a multidimensional sample space into fewer dimensions, and the first dimension (i.e., the first principal component, PC1) explains the most variance among the data set. The second (PC2), orthogonal to the first, explains most of the residual variance (27). Generally, at least 10-15 samples are acceptable for PCA analysis, but larger sample sets are superior because it picks out more general character among samples and minimizes relative errors. Scores represent the projections of the original sample onto each PC. Loadings are the projections of all the variables onto each PC and indicate that variable’s contribution to the data variability along each PC. From the PCA results, the samples and variables can be plotted on a two-dimensional PCA projection, a biplot, not only to group the samples, but also to determine relationships between samples and variables. Further explanation of the statistical methods employed can be found in the SI. Two types of FTICR-MS data are used in this multivariate statistical study, after each m/z value was assigned to an individual molecular formula. Because each formula corresponds to a single m/z value, it does not matter which parameter (m/z or formula) is used for subsequent statistical
analysis. We chose to utilize formulas because they provide chemical information. The first type of data was obtained by averaging the characteristics (O/C, H/C, DBE, DBE/C, DBE/O, and C#) of the formulas assigned to the large number of peaks in each sample's spectrum, as discussed below in Section 3.1. The second type of data involves a statistical examination of the individual formulas assigned to the multitude of peaks and is shown as their presence/absence or relative magnitude within each sample, as discussed below in Section 3.2. The HCA and PCA were performed by MATLAB programs (26, 28).
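The per-formula characteristics used for the first type of data can be sketched as follows. This is an illustration rather than the authors' MATLAB code: the DBE expression is the usual one for neutral CHONSP formulas, and the magnitude-weighted average is assumed to take the form sum(x_i * M_i) / sum(M_i), since the actual equations are given only in the SI. The two example formulas are the exceptions listed in the Table 1 footnote, written here as neutral molecules (one H added) on the assumption that the tabulated formulas are deprotonated ions; with that assumption the sketch reproduces the H/C, DBE, and DBE/C values reported there.

```python
def dbe(f):
    """Double bond equivalents of a neutral CHONSP formula (dict of element counts)."""
    return 1 + f.get("C", 0) - f.get("H", 0) / 2 + f.get("N", 0) / 2 + f.get("P", 0) / 2


def descriptors(f):
    """Elemental ratios and DBE-based parameters for one formula."""
    c, h, o = f["C"], f.get("H", 0), f.get("O", 0)
    d = dbe(f)
    return {"O/C": o / c, "H/C": h / c, "DBE": d, "DBE/C": d / c,
            "DBE/O": d / o if o else float("nan"), "C#": c}


def weighted_average(formulas, magnitudes, key):
    """Magnitude-weighted mean of one descriptor (assumed form of the SI equations)."""
    total = sum(magnitudes)
    return sum(descriptors(f)[key] * m for f, m in zip(formulas, magnitudes)) / total


# Neutral forms of the two Table 1 footnote formulas (assumption: [M-H]- ions).
footnote_a = {"C": 39, "H": 22, "O": 7}
footnote_b = {"C": 45, "H": 32, "O": 6}
```

Calling `descriptors(footnote_a)` gives DBE 29, H/C 0.56, and DBE/C 0.74 after rounding, and `descriptors(footnote_b)` gives DBE 30, H/C 0.71, and DBE/C 0.67.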
3. Results and Discussion All mass spectra of DOM are characteristic of those previously published (3, 5, 8, 11), with multiple peaks detected at each nominal mass throughout the overall m/z range of 200-800. In the ensuing discussion, we demonstrate the utility of the dual statistical methodology to significantly increase the information recovery salient to these data sets for DOM. We examine and compare the compositional nature of the DOM isolated by numerous techniques to evaluate the effectiveness and understand the biases involved in the isolation methods. 3.1. Statistical Analysis of Averaged Mass Spectral Parameters. From the assigned formulas representing the multitude of peaks, a set of magnitude-weighted parameters based on each formula (O/Cw, H/Cw, DBEw, DBE/Cw, DBE/ Ow, and C#w) are calculated to describe each of the 38 samples, as discussed previously (11). Equations for the calculations can be found in the SI. While ESI-FTICR-MS is not a quantitative technique, each sample was analyzed on the same instrument with the same optimized parameters. Therefore, each spectrum was biased in an equal fashion, so relative peak magnitudes within the acquired spectra can be compared to each other, but they cannot be related back to concentrations in the original sample, due to varying ionization efficiencies. Comparisons based on relative magnitudes have previously been utilized to compare samples (11, 16, 18-21, 29), as well as by the use of three-dimensional van Krevelen diagrams (5, 7, 9, 11, 14, 30). Table S3, in the SI, shows the calculated parameters for each sample, and these values are used for HCA and PCA. Figure 1a shows the resulting dendrogram based on the parameters in SI Table S3. Samples cluster together based on their correlation value, r. The dendrogram yields two main clusters: one is of terrestrial riverine samples and the other is of estuarine/oceanic samples. 
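Clustering samples by their correlation value r can be sketched with a small agglomerative procedure. The linkage rule used by the authors is not stated in this section, so average linkage is an assumption here, and the four samples with six standardized parameters are invented purely for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_r(data, a, b):
    """Average r over all sample pairs drawn from clusters a and b."""
    pairs = [(i, j) for i in a for j in b]
    return sum(pearson(data[i], data[j]) for i, j in pairs) / len(pairs)


def average_linkage(data):
    """data: {sample name: parameter vector}.

    Repeatedly merges the two most correlated clusters and returns the
    merge history as (members_a, members_b, mean r) tuples.
    """
    clusters = [(name,) for name in data]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda p: mean_r(data, *p))
        merges.append((a, b, mean_r(data, a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical standardized parameters for two riverine and two marine samples.
toy = {
    "riv1": [1.0, 0.9, -1.0, 0.2, 0.1, 0.0],
    "riv2": [0.9, 1.0, -0.9, 0.1, 0.2, 0.1],
    "mar1": [-1.0, -0.8, 1.0, 0.0, -0.1, 0.2],
    "mar2": [-0.9, -1.0, 0.9, 0.1, 0.0, 0.1],
}
```

With these toy data, `average_linkage(toy)` first joins the two riverine samples, then the two marine samples, and only in the last step joins the two groups at a negative mean r, which is the kind of two-cluster structure a dendrogram displays.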
The riverine cluster contains the wood extract (WE), the Dismal Swamp (DS) samples, and the Great Bridge (GB) samples, with r = 0.4. The DS whole water (DS-WW) samples that were analyzed on different days all cluster very similarly, with r > 0.95. This demonstrates a high reproducibility for the FTICR-MS, based on the averaged mass spectral parameters. The C18 extracted DS water (DS-C18), along with the wood extract samples (WE-WW and WE-C18) also group together near the DS-WW. The GB-C18 extract and its electrodialysis (ED) preparation cluster together with the DS-ED sample. The other large cluster is broadly described as estuarine/oceanic DOM. The Town Point (TP) samples (both C18 and ED) cluster closely together, along with the offshore coastal (OSC) ED prepared sample. The Chesapeake Bay Bridge (CBB) samples cluster with the remaining oceanic C18 extracts.
FIGURE 1. Statistical plots using the magnitude-weighted averages shown in SI Table S3. (a) Dendrogram from the cluster analysis and (b) Biplot from the principal component analysis.
To determine which parameters are important for the groupings in Figure 1a, PCA is performed using the parameters of SI Table S3. Figure 1b shows the resulting biplot. Blue data points are scores for the samples, while red data points are the loadings from each variable. The more distal the variable loading is from the origin, the more impact that variable has on the variance of the samples. The loadings for DBEw, DBE/Cw, H/Cw, and C#w are separated on the PC1 axis, indicating that most of the variance among samples is due to their characteristic differences in these four variables. The loadings of the O/Cw and DBE/Ow variables separate on PC2, indicating that variance among samples is also due to these two variables. Because the first 2 PCs explain approximately 85% of the variance among these samples, a linear relationship between the variables is a reasonable assumption in this PCA. Adding a third PC only explains an additional 14% of the variance and is related to the C#w variable. There are no apparent trends among the samples relating to C#w, so the first two PCs are discussed further below. PC1 explains nearly 60% of the variance in the sample and is related to the aromaticity of the samples. Samples that have more aromaticity (i.e., high DBEw and DBE/Cw values) have positive PC1 scores, while samples that are more aliphatic (i.e., high H/Cw values) have negative PC1 scores. The samples in the riverine cluster in the dendrogram of Figure 1a all have positive PC1 scores, which is expected as these samples are all swamp and up-river locations that source from woody vegetation, such as lignin. The DS-WW samples all cluster very tightly together, with a close proximity to the WE samples. The DS-WW and WE-WW samples are color-coded to distinguish them as samples analyzed directly, without any extraction or sample preparation. The estuarine/oceanic cluster of Figure 1a yields samples with negative PC1 scores, indicating a highly aliphatic DOM. The cruise station samples all group together within the circle on the plot. Overall, the variability explained by PC1 correlates with
the trend from aromatic terrestrial samples to aliphatic marine samples. PC2 explains about 25% of the variance and is related to the oxygen content of the samples. Samples with high O/Cw ratios have a negative PC2 score. It is clear that the ED samples (labeled in green) all have higher oxygen contents in relation to their C18 extracted counterparts. In addition, the ED samples as a group follow the aromaticity trend along PC1. Further examination of these high O/Cw compounds reveals that the DS-ED and GB-ED samples contain many compounds that fall into the tannin region of the van Krevelen diagram, while the CBB-ED and OSC-ED samples have compounds that fall into the carbohydrate region. TP-ED has peaks that fall into both of these areas. The van Krevelen diagrams for these five samples prepared with ED are shown in SI Figure S2. Overall, the highly oxygenated compounds that are eliminated during the C18 extractions fall into two different compound categories that differ greatly in their H/C ratios: tannins and carbohydrates. The results shown in this section serve as a quick and efficient way to examine the bulk properties of the molecular formulas assigned to the mass spectra of a large number of samples. Overall trends among the formulas can be discerned, and preliminary results emphasizing the similarities and differences between the types of compounds in the samples can be recognized. However, to create a better understanding of the molecular-level characteristics of a large group of samples, another statistical route must be taken. While this type of analysis is useful for bulk properties, a formula-specific approach can explore these large, complex data sets more effectively to obtain more detailed molecular information.
3.2. Statistical Analysis of Individual Molecular Formulas. Because thousands of formulas can be assigned for a single DOM sample, the 38 samples yield a total of 68 085 formulas, excluding the contribution from 13C isotopes.
With duplicates removed, the final list includes 10 008 individual formulas. To minimize the contribution from peaks with low magnitudes that are near the noise level, we chose the 500 dominant peaks in each sample to represent each sample. This selection of the 500 most abundant peaks corresponds to the selection of peaks having an approximate S/N ratio of at least 10 and a relative magnitude of at least 1%. The formulas assigned to these 500 peaks for each of the 38 samples are compiled (yielding a total of 19 000 formulas), and this list is then condensed to 2143 distinct formulas by removing duplicates. Two data matrices are then created from the list of 2143 formulas, similar to other studies (19, 20). The first matrix is constructed from the relative magnitudes of the 2143 formulas in each of the 38 samples. The second matrix consists of the presence or absence of the 2143 formulas in the 500 most abundant peaks of each of the 38 samples. HCA is then applied to these two matrices. The dendrogram in Figure 2, from the relative magnitude matrix, consists of four major clusters. The first contains the terrestrial, riverine samples of various preparations, while the second consists of the estuarine/oceanic samples desalted by ED. The reproducibility of the FTICR-MS gives r = 0.9 for the six replicates of DS-WW. The third cluster is mostly oceanic samples, but it also contains the estuarine C18 extracted transect samples. The fourth cluster contains the remaining oceanic cruise samples. The dendrogram created from the presence/absence matrix, SI Figure S3, yields only two clusters: the first contains all C18 extracted samples and the samples without any preparation (DS-WW and WE-WW), and the second contains the five samples desalted by ED. In this dendrogram, all six replicates of DS-WW samples group together with r = 0.85. This is slightly lower than that of Figure 2, but reproducibility is still confirmed.
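The construction of the two data matrices can be sketched as below. Two details are assumptions made only for this illustration: relative magnitude is expressed as a percentage of the most intense peak in each sample, and formulas are represented as plain strings. The sample dictionaries are invented, and a cutoff of 2 peaks stands in for the 500 used in the study.

```python
def top_n(peaks, n):
    """peaks: {formula: magnitude}. Keep the n most abundant peaks."""
    ranked = sorted(peaks.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])


def build_matrices(samples, n=500):
    """samples: {sample name: {formula: magnitude}}.

    Returns (formulas, rel, pres): the sorted list of distinct formulas found
    among the top-n peaks of any sample, a relative-magnitude matrix, and a
    presence/absence matrix. Rows follow the sample order, columns the formula order.
    """
    selected = {name: top_n(peaks, n) for name, peaks in samples.items()}
    formulas = sorted({f for peaks in selected.values() for f in peaks})
    rel, pres = [], []
    for name, peaks in samples.items():
        base = max(peaks.values())  # most intense peak (assumed normalization)
        # Relative magnitude uses every assigned peak in the sample ...
        rel.append([100.0 * peaks.get(f, 0.0) / base for f in formulas])
        # ... while presence/absence refers to the top-n peaks only.
        pres.append([1 if f in selected[name] else 0 for f in formulas])
    return formulas, rel, pres


# Hypothetical peak lists for two samples.
toy_samples = {
    "A": {"C10H12O5": 100.0, "C11H14O6": 50.0, "C9H10O4": 5.0},
    "B": {"C11H14O6": 80.0, "C9H10O4": 40.0, "C12H16O7": 2.0},
}
```

With `n=2`, sample A is marked absent for C9H10O4 in the presence/absence matrix even though it retains a small relative magnitude for that formula, which illustrates how the two matrices can carry different information.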
FIGURE 2. Dendrogram from the cluster analysis using the relative magnitudes of the selected 2143 formulas in each sample.
Despite the lack of resolution in this dendrogram, riverine and estuarine/oceanic samples are grouped together in smaller clusters. The general overlap of peaks detected in multiple samples (9, 11) is likely the reason for this lack of resolution. Because of this overlap, presence/absence data alone carry little discriminating information, but by incorporating relative magnitudes, shifts in the magnitude of the peaks can be detected and utilized. While HCA is useful for drawing attention to the broad categories in which these samples fall, PCA gives information about which parameters are important for such groupings. Because the presence/absence matrix did not resolve the samples into more distinct groupings, PCA was applied to the matrix utilizing the relative magnitudes of the 2143 formulas in each of the 38 samples. Figure 3 shows the resulting biplot from the relative magnitudes of the specified 2143 formulas. For ease of viewing, the scores and loadings are shown on separate plots. PC1 and PC2 explain 28% and 19%, respectively, of the variance among the 38 samples based on their differences in the 2143 variables, giving a total of 47% of the variance. Because the variable set is so large, this amount of variance is sufficient to indicate that a linear relationship between the variables is a reasonable assumption. Figure S4 in the SI displays the percentage of variance explained by each additional PC. Adding a third PC only explains an additional 13% of the variance and is not shown in Figure 3 since it further complicates the plot and does not provide additional relevant information. The groupings of the samples in Figure 3a are similar to those in the PCA biplot of Figure 1b, but the samples cluster due to different variables, as highlighted in Figure 3b. The 6 DS-WW samples cluster closely together in quadrant 3 of Figure 3a, highlighting the reproducibility of the FTICR-MS (in agreement with Figures 1, 2, and SI Figure S3).
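How scores, loadings, and the percentage of variance explained arise from a samples-by-variables matrix can be sketched with a small pure-Python PCA. This is a teaching sketch, not the MATLAB routine cited in the text: it extracts components by power iteration with deflation on the covariance matrix, treats the unit-length eigenvector entries as loadings (some software scales them differently), and is only practical for small matrices. The four-sample data set is invented.

```python
from math import sqrt


def center(rows):
    """Subtract each column (variable) mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(x):
    """Sample covariance matrix of already-centered rows."""
    n, p = len(x), len(x[0])
    return [[sum(r[i] * r[j] for r in x) / (n - 1) for j in range(p)] for i in range(p)]


def leading_eigenpair(c, iters=500):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    p = len(c)
    v = [float(i + 1) for i in range(p)]  # arbitrary non-symmetric start vector
    norm = sqrt(sum(a * a for a in v))
    v = [a / norm for a in v]
    for _ in range(iters):
        w = [sum(c[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(a * a for a in w))
        if norm == 0.0:
            break
        v = [a / norm for a in w]
    lam = sum(v[i] * sum(c[i][j] * v[j] for j in range(p)) for i in range(p))
    return lam, v


def pca(rows, k=2):
    """Return ([(explained fraction, loadings), ...], scores) for the first k PCs."""
    x = center(rows)
    c = covariance(x)
    total = sum(c[i][i] for i in range(len(c)))  # total variance
    components = []
    for _ in range(k):
        lam, v = leading_eigenpair(c)
        components.append((lam / total, v))
        # Deflate: remove the variance already explained by this component.
        c = [[c[i][j] - lam * v[i] * v[j] for j in range(len(v))] for i in range(len(v))]
    scores = [[sum(a * b for a, b in zip(row, v)) for _, v in components] for row in x]
    return components, scores


# Hypothetical matrix: 4 samples (rows) by 3 variables (columns).
toy_rows = [[3.0, 0.0, 1.0], [0.0, 1.0, 1.0], [-3.0, 0.0, 1.0], [0.0, -1.0, 1.0]]
```

For these toy data the first two components explain 90% and 10% of the variance; in the study, the corresponding figures for the 38 by 2143 matrix are 28% and 19%.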
To fully understand why the samples’ scores fall into their respective quadrants, we examined the formulas (i.e., the variables’ loadings). Figure 3b is crowded by the 2143 data points, but the loadings distal to the origin are those that impact the variance of the samples most. Densely populated distal regions of the diagram are highlighted by circles labeled as “areas 1-7”. Areas containing variable loadings (formulas) and samples (scores) in the same quadrant indicate a close correspondence of formulas with high relative magnitude. This colocation of loadings and scores determines the relationship between the samples and the variables (the formulas in this case). This biplot created from the PCA, indicating the variables responsible for the groupings, is why PCA has such an advantage over HCA used alone.
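Selecting the loadings that lie far from the origin and relating them to samples in the same quadrant can be sketched as follows. The authors chose their seven areas by circling densely populated distal regions of the plot; the fixed distance threshold and the purely quadrant-based grouping used here are simplifying assumptions, and the loading values are invented.

```python
from math import hypot


def distal_by_quadrant(loadings, min_distance):
    """loadings: {variable: (PC1 loading, PC2 loading)}.

    Keeps variables at least min_distance from the origin and groups them by
    quadrant, numbered conventionally (1: +,+  2: -,+  3: -,-  4: +,-).
    """
    groups = {1: [], 2: [], 3: [], 4: []}
    for name, (x, y) in loadings.items():
        if hypot(x, y) < min_distance:
            continue  # near the origin: little impact on the variance
        if x >= 0 and y >= 0:
            quadrant = 1
        elif x < 0 and y >= 0:
            quadrant = 2
        elif x < 0 and y < 0:
            quadrant = 3
        else:
            quadrant = 4
        groups[quadrant].append(name)
    return groups


# Hypothetical loadings for four formulas.
toy_loadings = {"f1": (0.05, 0.01), "f2": (-0.6, -0.5), "f3": (0.7, 0.1), "f4": (-0.4, 0.6)}
```

Here `distal_by_quadrant(toy_loadings, 0.3)` discards f1 and places f2 in quadrant 3, so f2 would be examined as a candidate signature of any samples whose scores also fall in quadrant 3.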
Area 1 has high positive loadings on PC2 and spans the PC1 axis, corresponding to the ED and OSC-C18 samples that have high scores on PC2 in Figure 3. This indicates that the area 1 formulas are enriched in ED and the OSC-C18 samples. Area 2 has moderately high PC2 loadings and nearly 0 for its PC1 loadings, indicating that these formulas are specifically enriched in the TP-ED, CBB-ED, and OSC-ED samples. Area 3 has large PC1 loadings and low PC2 loadings, corresponding to formulas with high relative magnitudes in the cruise station samples and the CBB-C18 sample. Area 4 has the most negative PC2 loadings and spans the PC1 axis; these formulas are enriched in the DS-WW, DS-C18, and WE-C18 samples. Area 5 has large negative PC1 and PC2 loadings, implicating the DS-WW samples. Area 6 has large negative PC1 values and spans the PC2 axis; the DS-WW, WE-WW, DS-ED, and GB-ED samples are enriched in these formulas. Finally, area 7 has large negative PC1 loadings and positive PC2 loadings, and the DS-ED and GB-ED samples specifically have these formulas in high relative magnitude. The m/z values and their assigned formulas for each of the seven areas are given in SI Table S4. Plotting the formulas in each region on a van Krevelen diagram (Figure 4) shows that the areas group remarkably within certain regions of the diagram. The overall characteristics of each of these areas are presented in Table 1, and the total number of formulas in areas 1-7 accounts for approximately 60% of the original 2143 formulas used for the analysis. Formulas in area 1 (those enriched in the OSC-C18 and ED samples) mostly fall in the upper lignin region, although there are a few formulas that scatter across the plot at higher H/C values. These formulas are mostly CHO-only compounds, but some have NSP functionality. While most plot in the lignin region, the typically high H/C ratios could indicate different types of compounds.
The vast majority of area 2 formulas have H/C > 1.5 but vary across the O/C range. The large amount of heteroatom functionalities in these formulas appears to be specific to the ED preparation of the estuarine/oceanic transect samples. C18 extractions discriminate against heteroatoms due to their polar nature (11), and it seems as though the ED desalting method retains these compounds, in agreement with previous findings (23). Heteroatom functionality increases across the transect from terrestrial to marine waters (11), explaining why these compounds correlate specifically to TP-ED, CBB-ED, and OSC-ED, but not to the DS-ED and GB-ED samples. Area 3, characteristic of the oceanic C18 samples and the CBB-C18 extract, forms a very close grouping in the upper left lignin region. There has been discussion on the overlap of the lignin region with a recently suggested compound class called CRAM, or carboxylic rich alicyclic molecules (13). Having the knowledge of a formula does not indicate a specific structure, because mass spectrometry cannot distinguish between structural isomers. Lignin is not generally thought to be a large component of oceanic DOM, and recent studies have suggested that the lignin can be modified during transport to the ocean such that the lignin biomarkers lose specificity (11, 16). Because the formulas of area 3 peaks are specifically enriched in oceanic samples and devoid in terrestrial samples, perhaps they are attributable to CRAM. This cannot be confirmed at this time, as we can only make speculations based solely on formulas. Formulas in areas 4-7 correspond to the terrestrial samples. Areas 4 and 6 contain formulas that fall into the lignin region of the van Krevelen diagram. The main difference in these groups is the degree of oxygenation. Area 4 corresponds to formulas having low O/C ratios that have high relative magnitudes in the DS-WW, DS-C18, and WE-C18 samples. 
FIGURE 3. Statistical plots using the relative magnitudes of the selected 2143 molecular formulas. (a) Biplot of the scores from the principal component analysis and (b) Biplot of the loadings from the principal component analysis. For simplicity, only data points (without labels) are shown in 3b. Circled areas were chosen for further analysis and are colored according to the van Krevelen diagram in Figure 4.
FIGURE 4. van Krevelen diagram of the selected areas and colored according to the biplot of the loadings shown in Figure 3b. The samples that these areas are enriched in are indicated on the legend.
TABLE 1. Description of Areas Chosen for Further Analysis from the Biplot of the Loadings Shown in Figure 3b and Further Highlighted on the van Krevelen Diagram in Figure 4
area no. | samples enriched in | type of formulas | number of formulas (percentage a) | m/z range | area of van Krevelen | O/C range | H/C range | DBE range | DBE/C range
1 | ED samples, OSC-C18 | some NSP func. | 98 (5%) | 297-557 | upper central lignin | 0.2-… | 1.2-1.9 | 2-11 | 0.1-0.4
2 | TP-ED, CBB-ED, OSC-ED | many NSP func. | 266 (12%) | 247-730 | lipid, carbohydrate, peptide/N-aliphatic | 0.1-1.0 | 1.0-2.2 b | 0-12 b | 0.0-0.6 b
3 | cruise samples, CBB-C18 | CHO-only | 196 (9%) | 309-603 | upper left lignin | 0.2-0.4 | 1.3-1.7 | 4-11 | 0.2-0.4
4 | DS-WW, DS-C18, WE-C18 | CHO-only | 149 (7%) | 233-467 | left central lignin | 0.1-0.4 | 0.8-1.4 | 5-13 | 0.3-0.6
5 | DS-WW, WE-WW, WE-C18 | CHO-only | 210 (10%) | 233-525 | condensed aromatic, lower lignin | 0.2-0.6 | 0.5-1.2 | 6-17 | 0.5-0.8
6 | DS-WW, WE-WW, DS-ED, GB-ED | CHO-only | 145 (7%) | 285-569 | central lignin | 0.4-0.8 | 0.7-1.4 | 5-15 | 0.4-0.7
7 | DS-ED, GB-ED | CHO-only | 142 (7%) | 359-565 | tannin | 0.5-0.9 | 0.7-1.3 | 7-15 | 0.4-0.7
a Percentage is of the 2143 formulas used for the individual molecular formula statistical analysis. b There is an exception for two formulas in this area: C39H21O7 and C45H31O6, giving H/C ratios of 0.56 and 0.71; DBE values of 29 and 30; and DBE/C values of 0.74 and 0.67.
Area 6 contains formulas with higher O/C ratios and implicates DS-WW, WE-WW, DS-ED, and GB-ED. Again, compounds having a high degree of oxygenation are very polar and are only detected in samples that are not fractionated by C18. Area 5 formulas are specific to the DS-WW samples and fall into the condensed aromatic and lower lignin regions. These are compounds that have high DBEs
and are probably related to black carbon, likely due to the numerous fires that have taken place in the Dismal Swamp over the course of many decades. Area 7 formulas fall exclusively in the tannin region of the van Krevelen diagram and are associated with the DS-ED and GB-ED samples. Tannins are polyphenolic compounds that occur in higher plants, generally indicating a terrestrial source. The C18 extraction does not retain these highly polar compounds, but the ED process clearly does. Tannins were also detected in the DS-WW samples, but not with magnitudes as high as in the DS-ED and GB-ED. Overall, this formula specific approach helps to determine which of the numerous assigned molecular formulas are important for characterizing certain types of samples. The van Krevelen diagram of the areas indicated on the loadings biplot (and correlated to the scores of the samples) assists in establishing what types of compounds are characteristic of each group of samples and delineates trends among the formulas assigned to the numerous samples. 3.3. Utility of Statistical Methods. The formula specific method, combining PCA and the van Krevelen diagram, yields the most molecular-level information regarding the compositional differences among a large number of samples analyzed by FTICR-MS. However, assigning formulas to the peaks in 38 individual DOM samples is quite laborious. This method could easily be simplified by performing the HCA and PCA on the relative magnitudes of the m/z values rather than the formulas and determining which m/z values are important for specific samples. One can then assign formulas only to the subset of peaks of significance, rather than to the tens of thousands of peaks detected, drastically reducing the time involved in the characterization. Previous work using multiple multivariate statistical approaches (17, 19, 20) has laid the foundation for a more detailed characterization of DOM analyzed by mass spectrometry. 
Our dual statistical approach has allowed us to further describe changes among various sample preparations and along the land-to-ocean transect that are only subtle and, perhaps, not apparent if we were to employ previous methods without statistics. Most importantly, we are able to clearly differentiate samples prepared by different isolation methods and determine the chemical reasons for selective recovery of various compounds. Also, we recognize the power of this dual statistical approach to segregate compound classes that can directly be correlated to their position in van Krevelen diagrams, further aiding compositional analysis. The use of this statistical methodology in combination with graphical methods like the van Krevelen plot represents a stepwise evolution in the analytical power of ESI-FTICR-MS. Employing this newly discovered power for DOM characterization, one can begin to examine numerous environmental processes with a more discriminating tool. Seasonal changes as well as degradation and bioavailability experiments could likely be streamlined to allow for a more high through-put analysis and more precise determination of the molecular factors responsible for the observed trends in DOM composition. Instrumental studies on other complex mixtures, such as petroleum or sedimentary OM, are also strong candidates for multivariate statistics. However, it should be recognized that the major limitation of these analyses are that the parameters are chosen at the discretion of the researchers. Perhaps as more studies utilize statistical analyses, the DOM community could establish a standard protocol to follow. Furthermore, the creation of an electronic database, as suggested by Reemtsma (21), would make more mass spectral data sets available to share among researchers, thereby allowing for more collaborations to develop based on statistical analysis. VOL. 44, NO. 19, 2010 / ENVIRONMENTAL SCIENCE & TECHNOLOGY
Acknowledgments We thank Hussain Abdulla and the Mulholland research group at ODU for their assistance in sample collection and laboratory measurements, Susan Hatcher and Mahasilu Amunugama at the College of Sciences Major Instrumentation Cluster facility at ODU for their assistance with the FTICR-MS analyses, and the crews of the R/V Fay Slover and the R/V Hugh R. Sharp. This work was funded by the National Science Foundation Chemical Oceanography Program (grant number OCE-0612712).
Supporting Information Available Expanded experimental section, as well as the figures and tables referenced in the manuscript. This material is available free of charge via the Internet at http://pubs.acs.org.
Literature Cited
(1) Sleighter, R. L.; Hatcher, P. G. The application of electrospray ionization coupled to ultrahigh resolution mass spectrometry for the molecular characterization of natural organic matter. J. Mass Spectrom. 2007, 42 (5), 559–574.
(2) Kujawinski, E. B.; Hatcher, P. G.; Freitas, M. A. High-resolution Fourier transform ion cyclotron resonance mass spectrometry of humic and fulvic acids: Improvements and comparisons. Anal. Chem. 2002, 74 (2), 413–419.
(3) Stenson, A. C.; Marshall, A. G.; Cooper, W. T. Exact masses and chemical formulas of individual Suwannee River fulvic acids from ultrahigh resolution electrospray ionization Fourier transform ion cyclotron resonance mass spectra. Anal. Chem. 2003, 75 (6), 1275–1284.
(4) Kramer, R. W.; Kujawinski, E. B.; Hatcher, P. G. Identification of black carbon derived structures in a volcanic ash soil humic acid by Fourier transform ion cyclotron resonance mass spectrometry. Environ. Sci. Technol. 2004, 38 (12), 3387–3395.
(5) Kim, S.; Kramer, R. W.; Hatcher, P. G. Graphical method for analysis of ultrahigh-resolution broadband mass spectra of natural organic matter, the van Krevelen diagram. Anal. Chem. 2003a, 75 (20), 5336–5344.
(6) Kim, S.; Simpson, A. J.; Kujawinski, E. B.; Freitas, M. A.; Hatcher, P. G. High resolution electrospray ionization mass spectrometry and 2D solution NMR for the analysis of DOM extracted by C18 solid phase disk. Org. Geochem. 2003b, 34 (9), 1325–1335.
(7) Kim, S.; Kaplan, L. A.; Benner, R.; Hatcher, P. G. Hydrogen-deficient molecules in natural riverine water samples--evidence for the existence of black carbon in DOM. Mar. Chem. 2004, 92 (1), 225–234.
(8) Kujawinski, E. B.; Del Vecchio, R.; Blough, N. V.; Klein, G. C.; Marshall, A. G. Probing molecular-level transformations of dissolved organic matter: insights on photochemical degradation and protozoan modification of DOM from electrospray ionization Fourier transform ion cyclotron resonance mass spectrometry. Mar. Chem. 2004, 92 (1-4), 23–37.
(9) Koch, B. P.; Witt, M.; Engbrodt, R.; Dittmar, T.; Kattner, G. Molecular formulae of marine and terrigenous dissolved organic matter detected by electrospray ionization Fourier transform ion cyclotron resonance mass spectrometry. Geochim. Cosmochim. Acta 2005, 69 (13), 3299–3308.
(10) Tremblay, L. B.; Dittmar, T.; Marshall, A. G.; Cooper, W. J.; Cooper, W. T. Molecular characterization of dissolved organic matter in a North Brazilian mangrove-fringed estuary by FTICR mass spectrometry and synchronous fluorescence spectroscopy. Mar. Chem. 2007, 105 (1-2), 15–29.
(11) Sleighter, R. L.; Hatcher, P. G. Molecular characterization of dissolved organic matter (DOM) along a river to ocean transect of the lower Chesapeake Bay by ultrahigh resolution electrospray ionization Fourier transform ion cyclotron resonance mass spectrometry. Mar. Chem. 2008, 110 (3-4), 140–152.
(12) Kujawinski, E. B.; Behn, M. D. Automated analysis of electrospray ionization Fourier transform ion cyclotron resonance mass spectra of natural organic matter. Anal. Chem. 2006, 78 (13), 4363–4373.
(13) Hertkorn, N.; Benner, R.; Frommberger, M.; Schmitt-Kopplin, P.; Witt, M.; Kaiser, K.; Kettrup, A.; Hedges, J. I. Characterization of a major refractory component of marine dissolved organic matter. Geochim. Cosmochim. Acta 2006, 70 (12), 2990–3010.
(14) Koch, B. P.; Ludwichowski, K. U.; Kattner, G.; Dittmar, T.; Witt, M. Advanced characterization of marine dissolved organic matter by combining reversed-phase liquid chromatography and FT-ICR-MS. Mar. Chem. 2008, 111 (3-4), 233–241.
(15) Stenson, A. C. Reversed-phase chromatography fractionation tailored to mass spectral characterization of humic substances. Environ. Sci. Technol. 2008, 42 (6), 2060–2065.
(16) Liu, Z.; Sleighter, R. L.; Zhong, J.; Hatcher, P. G. A molecular evaluation of the contribution that lignin makes to coastal waters of the Chesapeake Bay region, using HPLC combined with ultrahigh resolution mass spectrometry. Estuar. Coast. Shelf Sci. 2010, in review.
(17) Dittmar, T.; Whitehead, K.; Minor, E. C.; Koch, B. P. Tracing terrigenous dissolved organic matter and its photochemical decay in the ocean by using liquid chromatography/mass spectrometry. Mar. Chem. 2007, 107 (3), 378–387.
(18) Schmidt, F.; Elvert, M.; Koch, B. P.; Witt, M.; Hinrichs, K. U. Molecular characterization of dissolved organic matter in pore water of continental shelf sediments. Geochim. Cosmochim. Acta 2009, 73 (11), 3337–3358.
(19) Kujawinski, E. B.; Longnecker, K.; Blough, N. V.; Del Vecchio, R.; Finlay, L.; Kitner, J. B.; Giovannoni, S. J. Identification of possible source markers in marine dissolved organic matter using ultrahigh resolution mass spectrometry. Geochim. Cosmochim. Acta 2009, 73 (15), 4384–4399.
(20) Hur, M.; Yeo, I.; Park, E.; Kim, Y.; Yoo, J.; Kim, E.; No, M.; Koh, J.; Kim, S. Combination of statistical methods and Fourier transform ion cyclotron resonance mass spectrometry for more comprehensive, molecular-level interpretations of petroleum samples. Anal. Chem. 2010, 81 (1), 211–218.
(21) Reemtsma, T. Determination of molecular formulas of natural organic matter molecules by (ultra-) high-resolution mass spectrometry: Status and needs. J. Chromatogr., A 2009, 1216 (18), 3687–3701.
(22) Vetter, T. A.; Perdue, E. M.; Ingall, E.; Koprivnjak, J. F.; Pfromm, P. H. Combining reverse osmosis and electrodialysis for more complete recovery of dissolved organic matter from seawater. Sep. Purif. Technol. 2007, 56 (3), 383–387.
(23) Koprivnjak, J. F.; Pfromm, P. H.; Ingall, E.; Vetter, T. A.; Schmitt-Kopplin, P.; Hertkorn, N.; Frommberger, M.; Knicker, H.; Perdue, E. M. Chemical and spectroscopic characterization of marine dissolved organic matter isolated using coupled reverse osmosis-electrodialysis. Geochim. Cosmochim. Acta 2009, 73 (14), 4215–4231.
(24) Chen, H.; Stubbins, A.; Hatcher, P. G. Application of minielectrodialysis to small volume saline sample preparation for FTICR-MS analysis. Limnol. Oceanogr.: Methods 2010, in review.
(25) Sleighter, R. L.; McKee, G. A.; Liu, Z.; Hatcher, P. G. Naturally present fatty acids as internal calibrants for Fourier transform mass spectra of dissolved organic matter. Limnol. Oceanogr.: Methods 2008, 6, 246–253.
(26) Xue, J.; Armstrong, R. A.; Lee, C.; Wakeham, S. G. Using principal components analysis (PCA) along with cluster analysis to study the organic geochemistry of sinking particles in the ocean. Org. Geochem. 2010, accepted.
(27) Lattin, J. M.; Carroll, J. D.; Green, P. E. Analyzing Multivariate Data; Thomson Brooks/Cole: Pacific Grove, CA, 2003.
(28) Middleton, G. V. Data Analysis in the Earth Sciences Using MATLAB; Prentice Hall: Saddle River, 2000.
(29) Sleighter, R. L.; McKee, G. A.; Hatcher, P. G. Direct Fourier transform mass spectra analysis of natural waters with low dissolved organic matter. Org. Geochem. 2009, 40 (1), 119–125.
(30) Kim, S.; Kaplan, L. A.; Hatcher, P. G. Biodegradable dissolved organic matter in a temperate and a tropical stream determined from ultra-high resolution mass spectrometry. Limnol. Oceanogr. 2006, 51 (2), 1054–1063.
ES1002204