Estuarial Fingerprinting through Multidimensional Fluorescence and

Aug 25, 2005 - LFDA model for the assessment of water quality through Microtox® using excitation-emission matrices. Oscar Martinez , Ranga Dabarera ,...
0 downloads 10 Views 532KB Size
Environ. Sci. Technol. 2005, 39, 7560-7567

Estuarial Fingerprinting through Multidimensional Fluorescence and Multivariate Analysis GREGORY J. HALL,† KERIN E. CLOW, AND JONATHAN E. KENNY* Department of Chemistry, Tufts University, 62 Talbot Avenue, Medford, Massachusetts 02155

FIGURE 1. Sampling locations: Mystic Lake (L), Mystic River (R), Boston Harbor (H), and Chelsea River (C) (Reproduced with the permission of the Mystic River Watershed Assoc.).

As part of a strategy for preventing the introduction of aquatic nuisance species (ANS) to U.S. estuaries, ballast water exchange (BWE) regulations have been imposed. Enforcing these regulations requires a reliable method for determining the port of origin of water in the ballast tanks of ships entering U.S. waters. This study shows that a three-dimensional fluorescence fingerprinting technique, excitation emission matrix (EEM) spectroscopy, holds great promise as a ballast water analysis tool. In our technique, EEMs are analyzed by multivariate classification and curve resolution methods, such as N-way partial least squares Regression-discriminant analysis (NPLS-DA) and parallel factor analysis (PARAFAC). We demonstrate that classification techniques can be used to discriminate among sampling sites less than 10 miles apart, encompassing Boston Harbor and two tributaries in the Mystic River Watershed. To our knowledge, this work is the first to use multivariate analysis to classify water as to location of origin. Furthermore, it is shown that curve resolution can show seasonal features within the multidimensional fluorescence data sets, which correlate with difficulty in classification.

Introduction In the past decade, extensive work has been conducted using excitation-emission matrix (EEM) spectroscopy to characterize natural waters, both marine and estuarial, as well as colored dissolved organic matter (CDOM) and humic acids (HA) from both environmental samples and standards (111). At the same time, the statistical methods for analyzing such data, specifically “multi-way” or multivariate statistical methods, such as parallel factor analysis (PARAFAC) and N-way partial least squares regression (N-PLS), have developed into practicable tools for the analysis of such data (1216). This body of work has made it possible to use information-rich EEMs to fingerprint estuaries and to determine physical reasons for outliers in the classification process. Past research on EEMs (also called total luminescence or two-way fluorescence, because of the dependence of fluorescence intensity on the two variables, excitation wavelength and emission wavelength) of CDOM has been performed on many different bodies of water, from the Arabian Sea (5) to Danish estuaries (11) to U.S. rivers (1-3, 9). For each site, the intensity patterns within these EEMs have shown * Corresponding author phone: (617)627-3397; fax: (617)627-3443; e-mail: [email protected]. † Present address: U.S. Coast Guard Academy (ds-1), 27 Mohegan Ave., New London, CT 06320. 7560

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 39, NO. 19, 2005

differences, although in all cases contributions to intensity are assignable to the same general categories of compounds, principally humic acids, fulvic acids, and biologically significant amino acids such as tryptophan and tyrosine. In a given natural water sample, fulvic and humic acid content, relative concentrations of proteins containing fluorescent amino acids, and levels of contamination will vary characteristically, depending on the flora, fauna, and human activity specific to that location. Studies have shown that it may be possible to use these differences to track the movement of groundwater and, more recently, to try to characterize water samples from different locations within an estuary system (11, 17). Until now, classification of samples according to source location has been done using simple ratios of peaks within the EEM (4, 17). This method does not take advantage of all of the features within an EEM that may be characteristic. Recent advances in chemometrics allow for classification models to be built on the basis of the entire covariance represented by a training set of two-way data (12). N-PLS discriminant analysis (NPLS-DA) is such a method, and we show here how water samples can be classified as to source location utilizing this method. We know of only one other application of NPLS-DA. There it was used to analyze surface acoustical wave data of chemical warfare agents (18). A limitation of NPLS-DA models is that the vectors used to classify data are selected on the basis of capturing the most variance and cannot be interpreted as having chemical significance (19). Therefore, other techniques, such as PARAFAC, must be used when searching for understanding of spectral phenomena that occur within the data set. We also show how such a PARAFAC model helps to understand a date-correlated trend within our data set. We analyzed EEM data from water samples taken at different locations in the Mystic River Watershed, including the Chelsea River, Boston Harbor, Mystic Lake, and Mystic River, over the course of an entire year from September 2003 until September 2004. We have shown that we can classify samples according to both location along an estuary and river of origin. In addition, PARAFAC results can show us what chemical features may be causing some of the samples to classify less well than others.

Experimental Section Water Sampling and Preparation. The water samples were collected nearly weekly for a 1-year period from four selected sites in the Mystic River Watershed region (Figure 1). The samples were collected by surface grab or Van-Dorn style sampling bottle from docks or the shoreline and transferred to 125 mL light-protected Nalgene screw-cap bottles, immediately refrigerated and later frozen for long-term preservation. Samples were vacuum filtered with 0.22 µm nylon 10.1021/es0503074 CCC: $30.25

 2005 American Chemical Society Published on Web 08/25/2005

FIGURE 2. N-way partial least squares (NPLS) model. For classification models (NPLS-DA), the Y block is an array of logicals. membrane filters (Millipore). The pH, recorded for all samples, ranged from 7.8 to 8.2. Spectroscopy. All fluorescence measurements were made using the Varian Cary Eclipse fluorescence spectrophotometer with a Xenon flashlamp. Unpurged samples were held in standard 1 cm quartz cuvettes (Hellma). EEMs were gathered by collating emission spectra at a range of excitation wavelengths. Emission spectra were scanned from 220 to 600 nm in 1 nm steps. The excitation wavelengths were stepped by 10 nm from 220 to 600 nm to produce EEMs of dimension 381 × 39. Excitation and emission spectral resolution (5 nm) and lamp voltage (600 V) were held constant in all spectra. Spectra were then corrected for emission spectrometer bias by the method described by Melhuish (20) employing a quartz diffuse scatterer and concentrated (8 g/L in ethylene glycol) Rhodamine B. Spectra were additionally corrected across the excitation axis for lamp source intensity by normalizing to the Rhodamine B excitation spectrum measured at 660 nm emission under identical instrumental conditions. Inner filter effect correction was accomplished as described by Patterson et al. (21) based on beam profile information provided by Varian. In practice, all corrections were applied by a MATLAB program written inhouse. The EEMs corresponding to a sampling location were “stacked” in chronological order to produce a three-way matrix we term date-ordered EEM or DO-EEM. This matrix, of size 31 × 381 × 39 for Mystic Lake (L), 28 × 381 × 39 for Mystic River (R), 28 × 381 × 39 for Boston Harbor (H), and 27 × 381 × 39 for the Chelsea River (C), contains three independent variables (date, emission wavelength, and excitation wavelength) and one dependent variable (fluorescence intensity). NPLS-DA Model. NPLS regression provides a platform for developing calibration models with the ability to classify new data and indicate the degree of certainty (12). This regression method, combined with discriminant analysis (DA), allows for supervised pattern recognition for separation of classes with a high degree of similarity. The NPLS regression algorithm we used, tri-linear PLS2, performs iterative calibration of two or more independent variables. The command is first operated in calibration mode to generate a regression model of the sample sites of interest (12). Second, test samples not used in the model are supplied for use of NPLS in prediction mode. Each calculated class prediction value can be compared with a Bayesian distribution curve to determine the probability that a test sample belongs to a discrete class. NPLS data are introduced in a standard X block of dependent variables, and samples are assigned to class values within a multi-column Y block of independent variables (Figure 2). For this work, the X block is the DO-EEM stack with J and K being the spectroscopic axes and I being the date axis. For discriminant analysis, the Y block consists of discriminate logicals, values of 1 or 0, that assign true classes for X block samples, whether each EEM belongs or does not belong to a model class of interest (22). In this way, the Y block is of dimensions: I (number of samples) × N (number of classes of interest).

FIGURE 3. PARAFAC three-factor model. X represents the DO-EEM, an represents the score for factor n, and bn and cn are the loadings for factor n in the excitation and emission axes. The tri-linear PLS2 algorithm decomposes the spectral fluorescence information and defines latent variables (LV) in X correlated with Y while explaining large amounts of variance in X, in other words maximizing the covariance (12). NPLS-DA performs this type of decomposition on our DO-EEM data to model samples by location. In our case, in Figure 2, X represents the DO-EEM data, Y represents logicals for each EEM along I, and whether they belong to each location N. T represents scores to weight the LVs of number size A. P can be visualized as a stack of two-way latent variables that describe features within the fluorescence. These features are weighted by the values in T to model each slab in X. Q can be interpreted as the relative importance of each latent variable to the classification because Q multiplied by the score in T leads to the Y values predicted by the model. These two equations are solved with the common variable T such that the residuals E and F are minimized. LVs are similar to the principal components of principal component analysis but correlate to latent structures in the spectral data and do not translate directly to representative chemical meaning. The number of LVs that are appropriate must be determined through cross-validation of the model. Once the number of LVs is selected, and the model is calculated, then the percent variance between the samples that is captured by the model can be assessed. A well-fit model will describe about 90% of the variance. Specific information of how this cross-validation is conducted is available in the Supporting Information. NPLS-DA Data Preparation and Methods. Weights maps were constructed for each EEM separately to assign Rayleigh and Raman scatter regions to not-a-number (NaN) and the valuable spectral data to a value of 1 for regression analysis. Initial outliers were identified if positioned beyond one standard deviation from the mean of predicted Y values calculated in separate sample site models. Calibration EEMs were chosen by inclusion of half of the EEMs falling within one standard deviation of the mean Y value while reserving the remainder as test samples for prediction. EEMs falling outside of one standard deviation were included in the prediction set. Class threshold calculations were performed to determine the percent of calibration samples correctly classified by each model. The predicted Y values for any given test sample were compared to probability distribution curves for the model of interest to determine the probability of the sample belonging to the correct class. No preprocessing (mean centering or variance scaling) of data was performed (with the exception of instrument bias removal and absorbance correction). Multiway cross validation was performed using the “leave-one-out” method to exclude one sample per iteration. The appropriate number of latent variables was determined by cross validation. PARAFAC. Parallel factor analysis is a decomposition method for multivariate data. The theory and application of PARAFAC is well covered in the literature and by several excellent reviews (13, 15, 19), so only an overview of the method and important experimental considerations will be discussed here. Whereas NPLS-DA is used for classification, PARAFAC is used to yield chemically relevant “factors”, which can be interpreted as pure component spectra (23). PARAFAC VOL. 39, NO. 19, 2005 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

7561

is a statistical method used to model multivariate data where each element of the model X can be described as: F

xijk )

∑a b c

ij jf kf

+ eijk

f)1

For three-way spectroscopy, a more useful visualization of the model is that of a matrix with dimensions i, j, and k (date, excitation wavelength, and emission wavelength, in our case), that is, the sum of the tensor products of a, b, and c where the subscript refers to the factor number (Figure 3) (15). Residual variation not fit by the model is expressed by E, a matrix of the same dimension as X. One of the most attractive features of PARAFAC is the fact that the solution of the model is unique, the analytical result of which is that given appropriate signal-to-noise ratios, and trilinearity of the data, the “loadings”, or values of the vectors in the spectroscopic dimensions, can be interpreted as pure component spectra, and the “scores”, the elements of the vector a, can be interpreted as the contribution of that spectrum at that sample, or date in our case (23). It should be noted that those contributions that might be thought of as an indication of concentration are only relative to other scores for a particular factor, and should not be directly compared to the scores from other factors. The PARAFAC model has limitations and model development considerations that must be addressed when using it as an analytical tool based on the user’s knowledge of the data and the phenomenon being described. These considerations include permutation indeterminacy (15), selection of the number of factors to model (24), and the detection of outliers (25). Page limitations preclude description of all of the methods to address these considerations, but the literature describes these methods well (13, 15). PARAFAC Data Preparation and Methods. The DO-EEMs used for PARAFAC modeling were identical to those used in the NPLS. PARAFAC used expected values to fill those regions that were set to NaN; these regions represented between 16% and 21% of the data points (first- and second-order scatterer excitation light) in each EEM. Initial PARAFAC modeling was performed with the PARAFAC function. As seen in the text of this Eigenvector company webpage (http://software.eigenvector.com/ toolbox/3_5/), when written in text an underscore is found between the words PLS and Toolbox but not before the version number. It should then read “of PLS_Toolbox 3.0.4 by Eigenvector Research, Inc. Final results, factor quantity, and model validation analysis were run by programs published by Bro for jack-knife analysis (available for free from his website http://www.models.kvl.dk/users/rasmus/). Each software package is a third party toolbox for MATLAB. In all cases, the PARAFAC algorithm was constrained to nonnegative scores and loadings in all three modes and was run to a stop criterion corresponding to a change of the sum square residuals (of the data to the model) of less than 1 × 10-6 as is generally accepted in the literature. Initial modeling ranged from 100 to 70 000 iterations. All final models run by the jack-knife PARAFAC routine reached convergence in less than 150 iterations. Number of factors and outliers along the date axis were determined as described by Bro and Riu (24, 25). Specifics for these analyses are available in the Supporting Information. Factor assignment addressing permutation indeterminacy was done by visual inspection. Preanalysis corrections, NPLS, and PARAFAC models were performed on a Dell Dimension 8300 Desktop (Windows XP Operating System, Intel Pentium 4 CPU with 3.00 GHz Processor, 512 MB RAM) running MATLAB 6.5 by MathWorks Inc. 7562

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 39, NO. 19, 2005

FIGURE 4. Plot of RMSEC and RMSECV versus LV for the L,C NPLSDA model.

FIGURE 5. NPLS-DA model LC plot of calibration and test data in predicted coordinate space. Dashed lines are class separation threshold values corresponding to a probability of 0.50 between each location class.

Results and Discussion Classification (NPLS-DA). Calibration. Two regression models based on calibration data sets of EEMs from several dates at each location were created. In this case, eight EEMs were used for calibration from each of the locations L, R, H, and C, as described in the Experimental section. The first model included only locations L and C, while the second included all four locations. Seasonal fluctuation of the fluorescent components in each site was a theoretical challenge for constructing calibration models with discrimination capabilities. Data previously published by our group demonstrate that calibration models of these data constructed with subsets from any select season are comparable to the models presented here (26). This finding indicates that the seasonal fluctuations are not a significant factor affecting the success of NPLS-DA models on these environmental samples. For this study, calibration samples were scattered over the entire year to represent a reasonable average of any small annual variability. The geographical closeness of the sites of interest presented some challenge for model calibration and prediction of new samples, but was not completely restricting. Model LC, two locations along different rivers, was successful in accurately explaining the variance between the two sites and was strong in predicting the source location of test data. The four location Model LRHC encountered more difficulty defining the differences between sites but remained fairly strong in correctly classifying test data to location of origin. Models can only be used to classify samples from the locations used in the calibration step. During the creation

FIGURE 6. (a) Results from NPLS-DA analysis of the EEMs of water from Mystic Lake (L) and Chelsea River (C). A threshold of 50% probability of classification was used to calculate both correct classifications and false positives. (b) Results from NPLS-DA analysis of the EEMs of water from Mystic Lake (L), Mystic River (R), Boston Harbor (H), and Chelsea River (C). A threshold of 50% probability of classification was used to calculate both correct classifications and false positives. of the models, it was necessary to determine the appropriate number of LVs to be used. This determination is a tradeoff between minimizing the root mean squared error of calibration (RMSEC or the measure of deviation in the model’s fit to the X block data), and the root mean squared error of cross validation (RMSECV or error in the predictive Y block); see Figure 4. For the model including all four sites (model LRHC), six LVs were selected. For the model describing just sites L and C, five latent variables were selected. Following the selection of the number of LVs, model cross validation must be performed to ensure that no one sample influences the model inappropriately, and variance analysis must be performed to ensure that the model describes a sufficient amount of variability in the data (usually 90%). The results of these analyses for these two models are available in the Supporting Information. The final models can be plotted in a space with the same dimensionality as the number of sites included. These plots have axes that correspond to the predicted value for each sample in each Y block column (Figure 5 for LC). In this manner, an intuitive feeling for the separation of the classes (locations) can be acquired. The predicted Y coordinates of both the calibration data used in model creation and test data not included in calibration can be displayed. Classification of Water Samples to Location of Origin. All samples collected during the study not used in the

calibration models were tested in prediction mode. Each test sample is compared to the calibration model, and a set of predictive Y space values are delivered for comparison to an overall probability distribution for that model. The Bayesian distribution curves comparing two classes intersect at a probability of 0.50, where all values above that threshold have a higher probability of being in the class of interest. In predictive mode, the results from Model LC correctly classified 90% of test samples from L and 59% of test samples from C. Also, Model LRHC correctly classified 80% of test data from L, 100% of test data from R, 88% of test data from H, and 82% of test data from C. Despite seasonal variation and the close proximity of locations, test sample classification to the correct geographical site was above a probability of 0.99 or 0.95 for some and above a 0.50 probability for most samples (Figure 6). These results indicate that over a 12-month period, the variability within one sample site did not exceed the spectral differences between the four sites. The classification results show that NPLS-DA of EEM data is a promising technique for the determination of the location of origin of estuarial water samples. The decreased effectiveness of the classification including sites H and C is not surprising because these sites are within 1 mile of each other. In addition, the waters from L, C, and R all mix into the Boston Harbor, which causes difficulty in correctly classifying samples from H. However, given the close proximity of all these sampling locations, our VOL. 39, NO. 19, 2005 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

7563

FIGURE 7. PARAFAC results for spectroscopic modes. Factor 1 most resembles humic acids, which fluoresce in the UV, Factor 2 most resembles visible fluorescing humic acids, and Factor 3 closely resembles tyrosine-containing proteins. demonstrated success in classifying samples to their point of origin is encouraging. We were also interested in which factors may cause classification of certain samples by the model to be problematic. As discussed before, the LV loadings cannot be used to elucidate reasons with physical or chemical significance. Therefore, PARAFAC analysis was performed on these data to see if there were common characteristics of these samples. PARAFAC. To understand the spectroscopic reasons for the variability in this data set, we performed PARAFAC analysis on the DO-EEM for each location. For PARAFAC modeling, the data were split into separate data sets, each 7564

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 39, NO. 19, 2005

data set representing all of the EEMs taken from a particular location, from both the calibration and the test sets. PARAFAC models were then fit for each location independently. The number of factors for each model was determined both through the use of core consistency (CONCORDIA) analysis (24) and by considering redundant factors. For all models calculated in this work, three factors were most appropriate. Details on CONCORDIA analysis are available in the Supporting Information. Following the determination of number of factors, outliers in the data were removed, and the models were refit. The results show the spectra and relative importance with respect to date of the fluorescent components in

FIGURE 8. PARAFAC scores for each factor with respect to date of sample collection. Blue ) Factor 1, green ) Factor 2, red ) Factor 3. Shaded dates indicate dates that categorized poorly in the NPLS-DA model of this same data. With the exception of two dates in location H, all poorly categorized dates clustered around the anomaly in Factor 3. the water samples. Theoretically, the spectra yielded by PARAFAC analysis are the pure component spectra of each component (23), after the number of components is preselected as described above. In reality, because the analysis is constrained by the complex nature of the samples and the resolution of our measurements, each factor found by PARAFAC represents the response of a class of compounds that co-vary with respect to date. Outliers that may arise from errors in analysis or from anomalous samples must be identified so that they do not exert undue influence on the model. For this work, outliers were identified by using resample influence plots (RIP) and identity match plots (IMP) as described by Riu et al. (25). Specifics of this analysis are available in the Supporting Information. Following the removal of outliers, PARAFAC models were calculated for each location. The PARAFAC model provides loadings in each mode that represent the response of each factor in that mode. The PARAFAC results are shown in Figure 7. In all locations, the best-fit PARAFAC models included three factors. Furthermore, these three factors generally resemble each other from location to location. Because each locational model was an independent model, assignment of labels 1, 2, and 3 was done to indicate this resemblance. Factor 1 in all cases manifests as a broad emission spectrum along with a relatively broad excitation spectrum. For the R, L, and C locations, the excitation/emission maximum (ex/em) wavelengths are 250/455 nm. This corresponds well with the description of UV fluorescing humiclike spectra in previous literature (4), which assigned the ex/em pair as 260/400-460 nm (4). At this point, it should be noted that comparison of excitation and emission maxima from study to study should take into account the wide variety of instrumental and inner-filter correction schemes applied,

in addition to the variations in humic substances from different locations. The extensive corrections applied to our data based on the work of Melhuish (20) and MacDonald (21) may make these spectra seem to be blue shifted as compared to previous work. Factor 1 for locations C, L, and R shows a possible secondary excitation peak appearing as a shoulder at about 370 nm, which corresponds to species previously characterized by Coble et al. as visible fluorescing humic acids (4). The spectra for Factor 1 for Boston Harbor seem to be red shifted as compared to Factor 1 from the other locations in both the excitation and the emission modes. No obvious explanation is apparent for this shift, because marine humic acids tend to be blue shifted from freshwater humics (4). However, it is probable that the overall composition of the humics in Boston Harbor at this location is influenced by more sources than just the rivers sampled in this study. Factor 2 is more complex than Factor 1. In the case of locations C, L, and R, this factor seems to be the sum of two chemical contributions. These would show as the same factor in PARAFAC if they co-varied with respect to date and their spectroscopic properties were similar. The sharp peak in these emission spectra at 350 nm corresponds well with tryptophan-containing proteins known to be present in natural water samples, although the excitation maxima at 240 and 300 nm bracket the published excitation maximum for tryptophan at 275 nm (4). The broader features of this factor, which are more prevalent in the river locations closer to the harbor, are most likely contributions from marine humic acids shown as the broad emission signal in the harbor. The ex/em values for Factor 2 in Boston Harbor are 320/390 nm, which correspond to literature values for marine humic acids of 290-310/370-410 nm (4). VOL. 39, NO. 19, 2005 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

7565

Factor 3 very closely resembles tyrosine, an amino acid present in biologically important proteins seen in marine and estuarial samples in many different studies (1, 5, 27). Factor 3 has an ex/em pair of 221-280/306 nm, which corresponds almost exactly with previously published values (4, 27). The secondary peak in the emission spectrum near 600 nm is the second-order response from the monochromator grating. The similarity of the spectra underlying the EEMs from locations L and R, which the NPLS-DA analysis was able to classify well, shows the need to utilize the entire EEM in the classification analysis. Previous work has tried to fingerprint sources of CDOM with peak ratios, but it is shown here that the spectra of CDOM from different locations may be very similar. Therefore, to resolve the small differences present, it is prudent to use the entire data set. The date scores for these factors (Figure 8) produce some interesting results. In each case, the humic acid and tryptophan factors stay fairly constant, while Factor 3 shows a clear peak in June. This peak could represent a biological bloom during this time frame. Previous work has shown that proteins can be released into the water column during phytoplankton blooms (28). NPLS-DA Outlier Date Examination by PARAFAC. The motivation for examination of the PARAFAC analysis in relation to the classification work was to examine possible physical or chemical reasons for certain samples to classify poorly. The powerful synergy of these two methods can be shown by using the PARAFAC results to identify differences in the fluorescent factors in samples that classify poorly during the NPLS-DA analysis, because the PARAFAC factors can be interpreted to have physical or chemical meaning. The shaded regions in Figure 8 represent the bottom 15% of samples with relation to classification model fit. With the exception of two samples in Boston Harbor, all outliers occur in the same time frame as this tyrosine peak. The ability of multivariate chemometric treatment of multidimensional fluorescence to both identify the location of origin of estuarial water and to identify factors that contribute to the variability of the fluorescence of these samples suggests it has great promise as a fingerprinting method. The contribution of interferents, including petroleum pollutants that may be present only in single dates or in a ship’s ballast tanks, is among the difficulties such a method will need to overcome. Ongoing work in this lab involves the addition of a third spectroscopic dimension, fluorescence lifetime, to the data set, forming date-ordered time-resolved excitation emission matrices (DO-TREEMs). The addition of this dimension is expected to enable the identification of spectral contributions of pollutants, which generally have much longer lifetimes than natural products. These features can then be removed and classification improved. In addition, ongoing work is expanding the dataset from just the Boston Harbor watershed to several ports around the United States. Because water from two rivers of close geographic location can be discriminated, it is probable that discrimination between estuaries separated by thousands of miles should also be possible. If this methodology can be proven, it will be an excellent resource for enforcement officials to determine the location of origin of ballast water samples. More generally, these statistical methods are not specific to water samples. We have shown that use of NPLSDA and PARAFAC for general use with EEMs or other trilinear, multivariate data sets is a powerful classification and characterization technique. This marriage of techniques is important to indicate real physical or chemical reasons for difficulties in classification, rather than just examining latent variable differences. Deployment of such techniques for the enforcement of BWE regulations will need to include a protocol for sampling that is guided by the environmental 7566

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 39, NO. 19, 2005

considerations that may make classification more difficult and less certain.

Acknowledgments We would like to thank Hao Chen, Laurie Handwerker, Tom Dionne, Lisa Schupmann, Alex Caplan, and Im Pradipasena for help collecting samples and data. We would like to thank Dr. Martha Laurie of the Coast Guard Marine Safety Laboratory for her advice on aquatic nuisance species enforcement applications, the scientists at Eigenvector Research Inc. for their assistance with chemometric algorithms, and our funding agencies, NOAA under SeaGrant #NA16RG225, internal Tufts University funds, and the U.S. Coast Guard Academy Alexander Trust. The research described herein does not necessarily reflect the position of the U.S. Coast Guard and no official endorsement should be inferred.

Supporting Information Available NPLS-DA model validation and variance-captured results, and PARAFAC outlier and factor quantity analysis. This material is available free of charge via the Internet at http:// pubs.acs.org.

Literature Cited (1) Chen, W.; Westerhoff, P.; Leenheer, J. A.; Booksh, K. Fluorescence excitation - Emission matrix regional integration to quantify spectra for dissolved organic matter. Environ. Sci. Technol. 2003, 37, 5701-5710. (2) Baker, A. Spectrophotometric discrimination of river dissolved organic matter. Hydrol. Processes 2002, 16, 3203-3213. (3) Baker, A. Fluorescence excitation - Emission matrix characterization of river waters impacted by a tissue mill effluent. Environ. Sci. Technol. 2002, 36, 1377-1382. (4) Coble, P. G. Characterization of marine and terrestrial DOM in seawater using excitation emission matrix spectroscopy. Mar. Chem. 1996, 51, 325-346. (5) Coble, P. G.; C., D. C.; Avril, B. Distribution and optical properties of CDOM in the Arabian Sea during the 1995 Southwest Monsoon. Deep-Sea Res., Part II 1998, 45, 2195-2223. (6) Hemmingsen, S. L.; McGown, L. B. Phase-resolved fluorescence spectral and lifetime characterization of commercial humic substances. Appl. Spectrosc. 1997, 51, 921-929. (7) Mobed, J. J.; Hemmingsen, S. L.; Autry, J. L.; McGown, L. B. Fluorescence characterization of IHSS humic substances: Total luminescence spectra with absorbance correction. Environ. Sci. Technol. 1996, 30, 3061-3065. (8) Moberg, L.; Robertsson, G.; Karlberg, B. Spectrofluorimetric determination of chlorophylls and pheopigments using parallel factor analysis. Talanta 2001, 54, 161-170. (9) Newson, M.; Baker, A.; Mounsey, S. The potential role of freshwater luminescence measurements in exploring runoff pathways in upland catchments. Hydrol. Processes 2001, 15, 989-1002. (10) Parlanti, E.; Worz, K.; Geoffroy, L.; Lamotte, M. Dissolved organic matter fluorescence spectroscopy as a tool to estimate biological activity in a coastal zone submitted to anthropogenic inputs. Org. Geochem. 2000, 31, 1765-1781. (11) Stedmon, C. A.; Markager, S.; Bro, R. Tracing dissolved organic matter in aquatic environments using a new approach to fluorescence spectroscopy. Mar. Chem. 2003, 82, 239-254. (12) Bro, R. Multiway calibration. Multilinear PLS. J. Chemom. 1996, 10, 47-61. (13) Bro, R. PARAFAC. Tutorial and applications. Chemom. Intell. Lab. Syst. 1997, 38, 149-171. (14) Andersson, C. A.; Bro, R. The N-way Toolbox for MATLAB. Chemom. Intell. Lab. Syst. 2000, 52, 1-4. (15) Andersen, C. M.; Bro, R. Practical aspects of PARAFAC modeling of fluorescence excitation-emission data. J. Chemom. 2003, 17, 200-215. (16) Andersson, C.; Bro, R. Special issue: Multiway analysis. J. Chemom. 2000, 14, 103-103. (17) Yan, Y.; Li, H.; Myrick, M. L. Fluorescence fingerprint of waters: Excitation-emission matrix spectroscopy as a tracking tool. Appl. Spectrosc. 2000, 54, 1539-1542.

(18) Shaffer, R. E.; Rose-Pehrsson, S. L.; McGill, R. A. Multiway analysis of preconcentrator-sampled surface acoustic wave chemical sensor array data. Field Anal. Chem. Technol. 1998, 2, 179-192. (19) Bereton, R. G. Chemometrics: Data analysis for the laboratory and chemical plant; Wiley: Hoboken, NJ, 2003. (20) Melhuish, W. H. Calibration of Spectrofluorimeters for Measuring Corrected Emission Spectra. J. Opt. Soc. Am. 1962, 52, 1256. (21) MacDonald, B. C.; Lvin, S. J.; Patterson, H. Correction of fluorescence inner filter effects and the partitioning of pyrene to dissolved organic carbon. Anal. Chim. Acta 1997, 338, 155162. (22) Perez-Enciso, M.; Tenenhaus, M. Prediction of clinical outcome with microarray data: a partial least squares discriminant analysis (PLS-DA) approach. Hum. Genet. 2003, 112, 581-592. (23) Sidiropoulos, N. D.; Bro, R. On the uniqueness of multilinear decomposition of N-way arrays. J. Chemom. 2000, 14, 229239. (24) Bro, R.; Kiers, H. A. L. A new efficient method for determining the number of components in PARAFAC models. J. Chemom. 2003, 17, 274-286.

(25) Riu, J.; Bro, R. Jack-knife technique for outlier detection and estimation of standard errors in PARAFAC models. Chemom. Intell. Lab. Syst. 2003, 65, 35-49. (26) Clow, K. E.; Hall, G. J.; Chen, H.; Kenny, J. E. Spectral fingerprinting and classification by location of origin of natural waters by multidimensional fluorescence; SPIE 5586; SPIE: Philadelphia, PA, 2004. (27) Du, H.; Fuh, R. C. A.; Li, J. Z.; Corkan, L. A.; Lindsey, J. S. PhotochemCAD: A computer-aided design and research tool in photochemistry. Photochem. Photobiol. 1998, 68, 141-142. (28) Sommerville, K.; Preston, T. Characterisation of dissolved combined amino acids in marine waters. Rapid Commun. Mass Spectrom. 2001, 15, 1287-1290.

Received for review February 15, 2005. Revised manuscript received July 11, 2005. Accepted July 19, 2005. ES0503074

VOL. 39, NO. 19, 2005 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

7567