A Comparison of Positive Matrix Factorization and the Weighted

Oct 26, 2011 - A comparison of the results obtained by MCR-ALS, MCR-WALS, and PMF for raw and scaled environmental data collected during a monitoring ...
1 downloads 10 Views 3MB Size
ARTICLE pubs.acs.org/est

A Comparison of Positive Matrix Factorization and the Weighted Multivariate Curve Resolution Method. Application to Environmental Data Ivana Stanimirova,† Roma Tauler,‡ and Beata Walczak†,* † ‡

Department of Analytical Chemistry, Institute of Chemistry, The University of Silesia, 9 Szkolna Street,40-006 Katowice, Poland Institute of Environmental Assessment and Water Studies, IDAEA-CSIC, C/Jordi Girona 18-26,08034 Barcelona, Spain ABSTRACT:

In recent years, positive matrix factorization, PMF, has gained popularity in environmental sciences and it has been recommended by the U.S. Environmental Protection Agency as a general modeling tool in air quality control. Among the attractive features contributing to its popularity is that measurement uncertainty information can be incorporated into the PMF model, which allows the handling of missing measurements and data below the reporting limits. In addition, the solutions obtained from PMF obey constraints such as the non-negativity of the source compositions and source contributions of samples that make their interpretation physically meaningful. A less popular multivariate curve resolution method based on a weighted alternating least-squares algorithm, MCR-WALS, also incorporates the measurement error information and non-negativity constraints, which makes this method a potential tool when obtaining composition and contribution profiles of environmental data. Both methods use the same loss function, but they differ in the way the profiles are obtained. The goal of this study was to compare the performance of PMF with the performance of MCR-WALS for data sets simulated with different correlation and error structures. The results showed that the profiles extracted by both methods are virtually the same for data with different error structures.

’ INTRODUCTION Over the past several years there has been an increased interest in developing and applying chemometric approaches that allow for the incorporation of the measurement uncertainty information in the chemometric analysis of the collected chemical data. This has been driven by the possibility of improving the extraction of chemical information including a priori knowledge about sampling error, instrumentation noise or other possible sources of variation. Maximum likelihood principal component analysis, MLPCA, multivariate curve resolution-weighted alternating least-squares, MCR-WALS and positive matrix factorization, PMF, are some of the methods that have been proposed in the literature14 as counterparts of the classic principal components analysis, PCA,5,6 multivariate curve resolution-alternating least-squares, MCR-ALS7,8 and traditional factor analysis.9,10 The general goal of all of these methods is to solve a bilinear problem which can be expressed r 2011 American Chemical Society

mathematically in the following way: X ¼ GFT þ E

ð1Þ

In other words, with these methods the original high-dimensional experimental data organized as a matrix X (m  n) with m rows (samples or objects) and n columns (chemical species or variables) is decomposed into two lower-rank matrices G (m  p) and FT (p  n) the p components of which, in general, are related to the chemical information in the data, whereas matrix E (m  n) contains information about the experimental error, that is, the part of variation not explained by the model of definite complexity p. Different constraints can be applied to the G and FT Received: March 27, 2011 Accepted: October 26, 2011 Revised: October 23, 2011 Published: October 26, 2011 10102

dx.doi.org/10.1021/es201024m | Environ. Sci. Technol. 2011, 45, 10102–10110

Environmental Science & Technology matrices under various assumptions for the error matrix, E, which implies the use of different algorithms in order to perform the decomposition described by eq 1. In PCA and classic factor analysis, the columns of the scores matrix G, often called principal components or latent factors, are obtained by maximizing the variance of the projected data and are constrained to be orthogonal to each other, whereas the rows of the loadings matrix FT are orthonormal, that is, orthogonal and normalized to unit length. Obtaining a unique solution of the bilinear problem is an attractive feature in chemical studies; however, neither method provides such a solution due to the rotational freedom of factors. Furthermore, including the orthogonality constraints into the score and loading matrices often results in identified sources that are difficult to interpret or are physically unrealistic. In order to enrich the interpretability of the obtained factors, either orthogonal (varimax, quartimax, equimax) or oblique rotations9 can be applied. Another more general solution to the bilinear modeling problem can be obtained using the multivariate curve resolution approach, MCR. In contrast to PCA, the factors found with MCR are not orthogonal and in general an infinite number of solutions for the contribution and composition matrices, G and FT, exist. However, the imposition of various constraints like non-negativity, unimodality, closure or others to the G and/or FT matrices via the alternating least-squares, ALS, algorithm gives the possibility of obtaining a unique or nearly unique solution with a physically meaningful interpretation. The MCR-ALS approach with non-negativity constraints has found applications in kinetic studies,11 modeling of environmental data,12,13 analysis of microarray data,2 resolving signals of coeluting mixture components in chromatography, etc.14,15 A general problem with the classic methods is that the decomposition defined by eq 1 is only optimal when the errors for all measurements in X are independent and identically distributed (i.i.d.) normal.16 Even though the measurements obtained from spectroscopic and chromatographic methods are, in the majority of cases, precise with low and relatively uniform uncertainties, this assumption is often not fulfilled when analyzing complex natural samples such as environmental samples. With these types of data the measurement uncertainty varies systematically with the magnitude of the signal. Therefore, chemical components present in low concentrations/contents may be neglected during chemometric analysis with the classic methods, although they have the same signal-to-noise ratio as those chemical components present in high concentrations/contents. A straightforward way of dealing with this problem is to apply approaches that offer the possibility of including information about the magnitude of the measurement errors during the modeling process. In recent years, positive matrix factorization, PMF, has gained popularity in the field of environmentalsciences4,1722 and has been recommended by the U.S. Environmental Protection Agency as a modeling tool in air quality control. An EPA-supported version of the PMF model and user’s guide are available from the U.S. EPA Web site (http://www.epa.gov/heasd/products/pmf/pmf. html). On the other hand, MCR with weighted alternating leastsquares (MCR-WALS) has been successfully used to analyze time course DNA-microarrays data2 and in aerosol source apportioning studies.13 The objective function in PMF and MCRWALS is defined in the same way, but the two methods differ algorithmically. Therefore, in this work we investigated how the algorithmic differences may influence the contribution and composition profile matrices obtained from both methods. A comparison of the results obtained by MCR-ALS, MCR-WALS,

ARTICLE

and PMF for raw and scaled environmental data collected during a monitoring campaign of aerosol pollution was recently presented in Tauler et al.13 In this work, we offer a comprehensive comparative study of the performances of PMF and MCR-WALS using data sets simulated with different correlation and error structures and a real environmental data set. In the following section, the algorithmic aspects of MCRWALS and PMF are described in detail starting with a comparison of customary MCR-ALS and its weighted version. Then, attention is focused on the simulation study in which the performances of the methods are compared for data sets with different error structures. The data sets used in this study are described in the Experimental Section. The results of the study are presented and discussed in the Results and Discussion Section.

’ THEORY As was already mentioned in the Introduction, MCR-ALS, MCR-WALS, and PMF aim to solve the general bilinear model defined by eq 1. With MCR-ALS the contribution and composition profile matrices, G and FT of a definite number of factors, p, are found by minimizing the sum of squared residuals, SSR, via the alternating least-squares algorithm. SSR ¼

m

n

∑ ∑ ðxij  ~xijÞ2 i¼1 j¼1

ð2Þ

In this equation, xij is the measurement of the j-th (j = 1, ..., n) chemical component for the i-th (i = 1, ..., m) sample and ~xij is the measurement predicted by the model of a given complexity. The ALS algorithm is designed to work when both matrices G and FT are not known in advance. This is possible by simplifying the problem of finding the solution of the bilinear problem in eq 1 by performing two simpler least-squares calculations in an alternating way, which means estimating one matrix given the estimation of the other under suitable constraints. Thus, after the selection of the model’s complexity, p, the next step of the algorithm is to initialize either the G or FT matrix. Let us assume that the FT matrix is to be initialized. This can be done in several possible ways. The elements of FT can be chosen completely at random, which may slow down the convergence of the algorithm and thus finding an optimal solution is not guaranteed. Another way may be to randomly select p rows from the original data X or to use, for example, the needle searching method,23 the orthogonal projection approach, OPA,24 or the simple-touse interactive self-modeling analysis (SIMPLISMA).25 No matter which of the initialization approaches is chosen, it is highly recommended that the algorithm be run several times in order to ensure the reliability of the contribution and composition profile matrices obtained. Once the FT matrix is initialized, the elements of the G matrix are found using the following projection: G ¼ XFðFT FÞ1

ð3Þ

An easy way to incorporate the non-negativity constraint is to set the negative elements of G to zero. In our work, we used the fast non-negativity least-squares algorithm, FNNLS, proposed by Bro and de Jong,26,27 which minimizes the sum of squared residuals in X so that all elements of G are equal to or larger than zero. The estimation obtained for G is then used to recalculate the 10103

dx.doi.org/10.1021/es201024m |Environ. Sci. Technol. 2011, 45, 10102–10110

Environmental Science & Technology

ARTICLE

profiles matrix FT under the non-negativity constraint: FT ¼ ðGT GÞ1 GT X

ð4Þ

At each iteration, the two steps described by eqs 3 and 4 are performed sequentially and the sum of squared residuals in X is estimated using the interim model. The iterative procedure continues as long as the convergence criterion is not fulfilled, that is, the difference in the SSR values obtained in two consecutive iterations is not smaller than a predefined number (e.g., 108). Different constraints mentioned earlier can be incorporated within the ALS algorithm, but only the non-negativity of G and FT was considered in this work. Recently, further work has been done in understanding and calculating the rotation ambiguity of the solution obtained from MCR-ALS.2830 Unlike MCR-ALS, the core of the MCR-WALS algorithm is the inclusion of information about error variances in order to account for the level and structure of the noise. MCR-WALS2 minimizes the weighted sum of residuals, WSSR, expressed by WSSR ¼

m

n

∑∑ i¼1 j¼1

ðx ij  ~x ij Þ2 σ2ij

ð5Þ

where σij is the calculated standard deviation of the measurement. This objective function holds for uncorrelated errors and the respective maximum likelihood projections implemented in the steps of the ALS algorithm can be performed in the following way. Suppose that matrix Σ of dimension m  n contains the error variances for the measurements. After the initialization of FT, the rows of the original data X, xi., are projected on FT as follows: T 1 1 T x i: ¼ xi: S1 i FðF Si FÞ F

ð6Þ

Here, Si is a diagonal matrix, the diagonal elements of which are those of the i-th row of Σ. This maximum likelihood projection makes it likely that the row measurements with larger uncertainties are downweighted more and therefore, have less influence in the final model. The next step of the algorithm is to find the elements of the G matrix according to eq 3 using X* under nonnegativity or other constraints. Then in order to re-estimate FT, the maximum likelihood projection of the columns of X, x.j, is executed in the space of G: 1 T 1 x :j ¼ GðGT S1 j GÞ G Sj x :j

ð7Þ

The matrix Sj is the respective error covariance matrix the diagonal elements of which are those of the j-th column of Σ. Finally, the leastsquares solution for FT is found under the desired constraints using X* in eq 4. All of these steps are repeated iteratively as long as the convergence criterion is not fulfilled. Similar to the ALS algorithm presented earlier, the difference in the values of the loss function in two consecutive iterations can be used as a stopping criterion. Positive matrix factorization, PMF, is an iterative algorithm31 which, in general, minimizes the weighted sum of squares defined by eq 5. When non-negativity constraints are imposed on the G and FT matrices, this objective function is enhanced by regularization and logarithmic penalty terms, and is expressed in the following way: Q ¼

m

n

e2

m

p

m

p

p

n

ij  α ∑ ∑ logg ik  β ∑ ∑ log f kj ∑ ∑ 2 i ¼ 1 j ¼ 1 σ ij i¼1 k¼1 k¼1 j¼1

þγ

In this expression, α and β are the penalty coefficients which prevent the elements of G and FT from becoming negative, whereas the regularization coefficients, γ and δ, reduce the rotational freedom in the obtained factors. The coefficients ci and dj are added in order to remove the scaling differences in the rows in G and in the columns of FT. To solve the nonlinear leastsquares problem expressed by eq 8, the classic GaussNewton method is used. The algorithm can only be applied to minimize the sum of squared function values and therefore, the logarithmic penalty functions need to be represented in quadratic forms. The core of the iterative algorithm is the update of the elements of the G and FT matrices with two respective incremental matrices, the (m + n)  p elements of which are obtained by solving a system of m  n equations taking into account the logarithmic and penalty terms. The iterative procedure continues as long as the difference in values of the sum of weighted squared residuals obtained in two consecutive iterations is not smaller than a predefined value, for example, 108. To start the iterative procedure, the elements of G, F, α, β, γ, and δ need to be initialized, while the scaling coefficients ci and dj are estimated during the iterative procedure. The speed of reaching the minimum value of the objective function and the quality of the solution obtained depend on the initialization and effective optimization of the many parameters of the PMF model. Specifically, in order to perform the PMF calculations in this work we used the algorithm described by Lu and Wu31 and the programming guide presented therein to optimize the different steps within the iterative procedure. As the authors report, this algorithm may differ in the way the optimization of the PMF parameters is performed from the one proposed by Paatero and Tapper3,32 and the one implemented in the software that can be freely downloaded from the webpage (http://www.epa.gov/heasd/products/pmf/pmf. html) of the U.S. Environmental Protection Agency. Details related to the initialization and optimization of the PMF parameters which were used in this study will be described in the Results and Discussion section. For comparative reasons, results obtained using the EPA PMF program for solving the PMF problem, are also presented. The so-called multilinear engine (ME-2) program developed by Paatero33 is the core of the available U.S. EPA software.

p

n

∑ ∑ ci g2ik þ δ k∑¼ 1 j∑¼ 1 dj f 2kj i¼1 k¼1

ð8Þ

’ EXPERIMENTAL SECTION All calculations, except EPA PMF, using in-house implemented routines were performed with MATLAB 7.0 (R14) on a personal computer (Intel(R) Pentium(R) M, 1.60 GHz with 2GB RAM) using the Microsoft Windows XP (service pack 2) operating system. A general MATLAB code for the MCR-ALS can also be obtained.34 The available EPA PMF software (v3.0) was used to obtain results from the multilinear engine, ME-2, program. Simulation of the Data Sets. The capability of the methods described earlier was investigated using several simulated data sets. Four data sets, each of rank four (i.e., with four factors) containing 30 samples and 50 variables were simulated. Each errorfree matrix, Y, was generated by multiplying a 30  4 matrix, G, of elements from a log-normal distribution (mean = 0.01, standard deviation = 1) by a 4  50 matrix, FT, also drawn from a lognormal distribution with the same parameters. The pairwise correlation between the underlying composition profiles in FT was set to 0.0, 0.3, 0.6, or 0.9. Simulation of the composition profiles 10104

dx.doi.org/10.1021/es201024m |Environ. Sci. Technol. 2011, 45, 10102–10110

Environmental Science & Technology

ARTICLE

Figure 1. Root mean square, RMS, errors averaged for the contribution, G, profiles obtained from (a) MCR-ALS, (b) PMF, (c) EPA PMF, (d) MCRWALS and for the composition profiles, FT, obtained from (e) MCR-ALS, (f) PMF, (g) EPA PMF, and (h) MCR-WALS for data sets with independent and identically distributed (i.i.d.) normal measurement errors. The radius and the intensity of color (in grayscale) of each circle marker are proportional to the magnitude of the estimated RMS error.

with a joint distribution of a definite correlation structure was performed using a copula function.35 Errors at different levels and with various structures were added to each of the four error-free data sets with different correlation structures. Here, we studied only cases of uncorrelated measurement errors.16 First, measurement errors that were i.i.d. normal with standard deviations of 1%, 5%, or 10% of the maximum value of the corresponding error-free data matrix were considered. For this purpose, each 30  50 matrix of standard deviations (1%, 5%, or 10%) was multiplied elementby-element by a matrix, the elements of which were generated from normal distribution, N(0,1), (mean = 0, standard deviation = 1). Each error matrix was then added to each of the error-free data sets to obtain 12 data set configurations with different correlation structures and different levels of noise. Three data sets were simulated from each type resulting in a total of 36 sets. Next, the most common case where the measurements had uncorrelated proportional errors was considered. Three matrices of standard deviations with fixed proportions of 5%, 10%, and 30% of the measurement values were simulated. Again each of the three standard deviation matrices was multiplied by a matrix with normally distributed elements and the error matrices were added to each of the error-free data to obtain 12 data sets with proportional error structures. For consistency of the study, three data sets were also generated from each type. Finally, data sets with constant and proportional errors were also simulated. The elements of each matrix of standard deviations, σij, in this case, were obtained as the square root of the sum of the squares of the constant part, a, and the proportional part,

byij, using the following measurement error model: qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi σ ij ¼ a2 þ b2 y2ij

ð9Þ

The constant part was taken to be 1% of the maximum value of the error-free data, Y, while the proportional part was calculated as 5%, 10%, or 30% of the elements, yij, in the error-free data matrix. As was done previously, the final error matrix in each case was found by multiplying element-by-element the matrix of normally distributed numbers by the matrix of standard deviations and was added to the error-free data sets giving 12 data set configurations. Again three data sets were considered from each configuration. A total of 108 data sets were simulated in order to compare the capability of the methods. Description of a Real Data Set. The real environmental data set used here for an illustrative purpose is a part of a larger data set which was analyzed previously using other chemometric approaches.36,37 The data set contains concentrations of 13 chemical components (K+, Ca2+, Mg2+, NO3, SO42‑ C, Cd, Cu, Fe, Mn, Pb, V, Zn) and the particulate matter mass measured in spring, summer, autumn, and winter, in five different particle size fractions (PM0.04PM0.1, PM0.1PM0.4, PM0.4PM1.6, PM1.6 PM6.4, PM6.4PM25) at Arnoldstein, which is located in the Austrian province of Carynthia. The water-soluble ions K+, Ca2+, Mg2+, NO3, SO42‑ were determined using two ion-chromatographic systems, whereas the concentrations of heavy metals were analyzed using atomic absorption spectrometry.38 The analyzed data set is of the dimensions 20  14 and the measurements are presented as values obtained by averaging the respective parameters analyzed in quadruplicate samples of the five size fractions 10105

dx.doi.org/10.1021/es201024m |Environ. Sci. Technol. 2011, 45, 10102–10110

Environmental Science & Technology

ARTICLE

Figure 2. Root mean square, RMS, errors averaged for the contribution, G, profiles obtained from (a) MCR-ALS, (b) PMF, (c) EPA PMF, (d) MCRWALS and for the composition profiles, FT, obtained from (e) MCR-ALS, (f) PMF, (g) EPA PMF, and (h) MCR-WALS for data sets with measurements that have independent proportional errors. The radius and the intensity of color (in grayscale) of each circle marker are proportional to the magnitude of the estimated RMS error.

collected in four seasonal sampling campaigns. Uncertainties of the data set were estimated as the standard deviations of those mean values and were considered in the analyses with PMF, EPA PMF, and MCR-WALS. In fact, these uncertainties are associated with the different sources of variability related to sampling procedures, preparation, and measurement. In general, a proportional error model is considered.

’ RESULTS AND DISCUSSION First, the results of the simulation study will be discussed and then a comparison of the performance of PMF and MCR-WALS will be presented for the real data set described earlier. Results from the Simulation Study. All data sets were simulated with complexity four. In order to limit the problem of the rotational ambiguity of the solutions, non-negativity constraints were imposed to G and FT and the known composition profiles FT were used as an initialization matrix in each of the approaches. The PMF algorithm by Lu and Wu is not very sensitive to the selected initial values because the optimization of the regularization and penalty coefficients is performed in a dynamic way. The solution with the lowest sum of weighted squared residuals was used for the final comparison. Since the true contribution and composition profiles are known beforehand, one way to compare the performance of the methods is to estimate how well those profiles are recovered by each of the methods run under the same constraints. Therefore, as a figure of merit in this comparison, the root mean square, RMS, difference between each profile obtained from each of the methods and the respective true profile is calculated and then averaged over the four G or FT factors. To be precise, one RMS error value

characterizes the resolution of four G or FT profiles for each method. All profiles including the true ones are normalized to unit length in order to obtain RMS errors, at comparable magnitudes. For each data combination of correlation, level and type of noise, three data sets were simulated and therefore, the final RMS errors are presented as mean values of the triplicate estimations. Another, but similar method of a comparison is to use the correlation coefficient estimated between the respective profiles obtained from each of the two methods, but the visualization of such a comparison is difficult to display for all of the data variants considered in this simulation study, and therefore, only a discussion of such results will be provided further in the text. The influence of noise and correlation levels on the root mean square, RMS, errors averaged for the contribution and composition profiles obtained from MCR-ALS, PMF, EPA PMF, or MCR-WALS for data sets with i.i.d normal errors are presented in Figure 1. The horizontal axis of the maps shows the level of noise, whereas the vertical axis presents the strength of the pairwise correlation between the underlying composition profiles, FT. The radius and the intensity of the color (in grayscale) of each circle marker are proportional to the magnitude of the estimated RMS error. Comparable RMS error values were obtained for the contribution and composition profiles (see Figure 1), which indicated the same performance of all methods for data sets with uniformly distributed measurement errors. As was expected, the profiles are recovered with higher RMS errors when data have higher levels of measurement errors. It is worth remembering that errors at a level of 5% or 10% of the maximum value of the 10106

dx.doi.org/10.1021/es201024m |Environ. Sci. Technol. 2011, 45, 10102–10110

Environmental Science & Technology

ARTICLE

Figure 3. Root mean square, RMS, errors averaged for the contribution, G, profiles obtained from (a) MCR-ALS, (b) PMF, (c) EPA PMF, (d) MCRWALS and for the composition profiles, FT, obtained from (e) MCR-ALS, (f) PMF, (g) EPA PMF, and (h) MCR-WALS for data sets with a constant and proportional error parts. The paired numbers on the horizontal axis of the maps show the magnitudes of the constant and proportional error parts. The radius and the color (in grayscale) of each circle marker are proportional to the magnitude of the estimated RMS error.

corresponding error-free data matrix can already be considered as an extreme case under normal operating conditions. It is also not surprising that a higher noise and higher correlation would imply greater difficulty in recovering the true profiles by any method as can be seen from the magnitude of the RMS errors shown in Figure 1. The pairwise correlation coefficients calculated between the corresponding profiles obtained from PMF and MCRWALS are ca. 0.99, which supports the conclusion that the same contribution and composition profiles are resolved for data with i.i.d. measurement errors with the two methods. Somewhat lower values, that is, ca. 0.96 of the correlation coefficients are estimated between the respective contributions and composition profiles of PMF and EPA PMF, which indicates some small differences in both solutions. The results for the estimated RMS errors for data sets with measurements that have uncorrelated proportional errors are presented in Figure 2. Compared with a case of i.i.d. normal errors in the data, the MCR-ALS method (see Figure 2a and e) presents greater challenges in recovering the true profiles for data with higher levels of proportional errors (e.g., 30%) than PMF, EPA PMF, or MCR-WALS, which is indicated by the higher values of the RMS errors observed for MCR-ALS as compared to those for PMF, EPA PMF, or MCR-WALS. This is to be expected since the uncertainty information is not included into the steps of the ALS algorithm. For lower levels of proportional noise, up to 5%, MCR-ALS presents RMS errors comparable with the RMS errors obtained from PMF, EPA PMF, or MCR-WALS. Although the MCR-WALS shows somewhat lower recovery RMS errors than PMF and EPA PMF (compare Figure 2b and c with Figure 2d and Figure 2f and g with Figure 2h), the performance of the methods is comparable. The pairwise correlation coefficients

calculated between the corresponding profiles from PMF and MCR-WALS algorithms as well as between the respective profiles obtained from EPA PMF and MCR-WALS are close to 0.99, which also indicates that virtually the same profiles are recovered. The real advantage of the methods incorporating the uncertainty information in obtaining the solution is demonstrated for data sets with constant and proportional error parts (see Figure 3). Compared to the recovery RMS errors obtained from PMF, EPA PMF, and MCR-WALS, relatively higher values of those RMS errors in all data sets simulated with different levels of measurement errors and correlation are found for MCR-ALS (compare Figure 3a with Figure 3b, c, and d as well as Figure 3e with Figure 3f, g, and h). Again, there is the tendency, which was observed earlier, that the recovery of contribution and composition profiles in the data with a higher correlation and higher measurement errors presents greater difficulties for all methods. The same magnitude of RMS errors is observed on Figure 3b, c, and d as well as on Figure 3f, g, and h for PMF, EPA PMF, and MCR-WALS, respectively, which suggests, once more, that virtually the same contribution and composition profiles are obtained. To support the latter finding from the study, the correlation coefficients between the respective profiles of each of the two methods were calculated once again and found to be above 0.98 in all cases. Results from the Real Data Study. The aim of this study was to compare the contribution and composition profiles obtained from MCR-ALS, MCR-WALS, and the PMF algorithm by Lu and Wu for real noisy environmental data, rather than to interpret the profiles obtained, which has already been done in several studies published elsewhere.36,37 Since the results of repeated measurements were available, information about the uncertainty was incorporated during the analyses with MCR-WALS and 10107

dx.doi.org/10.1021/es201024m |Environ. Sci. Technol. 2011, 45, 10102–10110

Environmental Science & Technology

ARTICLE

Figure 4. Comparison of (a) the first contribution (g1) profiles, (b) the second contribution (g2) profiles, (c) the third contribution (g3) profiles, (d) the first composition (f1) profiles, (e) the second composition (f2) profiles, and (f) the third composition (f3) profiles obtained from the MCR-ALS, PMF, and MCR-WALS methods applied to the real data set.

PMF. The data set of interest was first preprocessed to eliminate the differences in measurement units by scaling each variable to a unitary standard deviation. The uncertainty data were also scaled using the standard deviations of the original measurement data. This procedure gives the variables the same importance during the analysis. It is worth noting that centering was not applied, as this is usually performed on environmental data in order to remove the offset, because non-negativity constraints were imposed on both contribution and composition profiles. The first step in an analysis is to decide on the number of factors in the model. Here, one can use the classic PCA method to account for the amount of variability explained by each factor in the data. Due to the weighting scheme in MCR-WALS and PMF, the selection of complexity can be performed using a plot of the sum of weighted residuals as a function of the model’s complexity. With PCA, three factors explain 92.5% of the variance, whereas with MCR-WALS and PMF a relatively large reduction in the weighted sum of squares residuals for models with complexities larger than three was not observed. Therefore, the solutions with three factors were further considered in this study. As was mentioned earlier, the initialization step in the algorithms can be performed in different ways. In order to guarantee the optimality of the solution, the numerical values of the three composition profiles, FT (3  14), were generated completely at random at each rerun (e.g., 300 times) of the algorithms. The final solution is the one with the lowest value of the respective objective function. The results of the comparative study are presented in Figure 4. All of the panels in Figure 4 show the highly overlapping contribution and composition profiles obtained from MCRWALS (dash-dotted red line) with the respective profiles found with PMF (solid black line). This is also indicated by the values of the pairwise correlation coefficients presented in Table 1, which are ca. 0.99.

Table 1. Pairwise Correlation Coefficients Calculated between the Respective Contribution (gk) and Composition (fk) Profiles Obtained from the MCR-ALS, MCR-WALS, and PMF Models Applied to Measurement Data with Complexity Three g1

g2

g3

f1

f2

f3

MCR-WALS/PMF

0.989

0.999

0.999

0.995

0.998

0.999

MCR-ALS/PMF

0.616

0.952

0.929

0.558

0.844

0.814

MCR-ALS/MCR-WALS

0.708

0.961

0.937

0.606

0.864

0.821

Thus, both methods found virtually the same profiles for the studied data. Since the data had a high level of proportional errors, it was expected that the profiles obtained from the traditional MCR-ALS approach would be different in comparison with those obtained from MCR-WALS and PMF. These differences are also reflected in the relatively lower values of the correlation coefficients calculated between the MCR-ALS profiles and the respective PMF or MCR-WALS profiles. The general conclusion of the simulation and the real data studies is that MCR-WALS and PMF (or EPA PMF) extract virtually the same contribution and composition profiles for data sets with i.i.d. normal measurement errors and for data with measurements that have uncorrelated proportional errors or constant and proportional error parts. This was confirmed by the high values of the correlation coefficients (r > 0.98) calculated between the respective contribution factors obtained from MCRWALS and PMF or MCR-WALS and EPA PMF as well as between the composition profiles obtained, using each of the two methods. As was expected, a higher level of noise and a higher correlation would present greater difficulty in recovering the true profiles by either method. With the conventional MCR-ALS method, higher recovery errors for the profiles extracted for data 10108

dx.doi.org/10.1021/es201024m |Environ. Sci. Technol. 2011, 45, 10102–10110

Environmental Science & Technology with proportional errors or data with a constant and proportional error parts were obtained in comparison with MCR-WALS, EPA PMF, and PMF. The major outcome of this study is that the MCR-WALS, EPA PMF, and PMF methods give the same results for the major factors. MCR-WALS and the two versions of the PMF method share some common attractive features such as the possibility to (i) incorporate information about the measurement uncertainty, which makes them capable of handling missing values and data below the reporting limits and to (ii) impose different constraints on the obtained solutions. However, compared to MCR-WALS, PMF, and EPA PMF are considerably complicated algorithms, which require the optimization of a number of parameters. The convergence speed and the quality of the solution greatly depend on the way in which the algorithmic steps are optimized.31 Moreover, the imposition of different constraints like selectivity, local rank, equality and inequality or unimodality (i.e., for reactions and chromatographic profiles), in PMF is also not an easy task. MCR-WALS as an alternating least-squares algorithm is a simpler procedure that does not require the optimization of many initial parameters and the implementation of the abovementioned constraints is not difficult. Nevertheless, both methods share some common problems, which require a further study like the choice of the model’s complexity, obtaining optimal profiles and controlling the rotational freedom of the solution.39 In general, having knowledge about the measurement errors in the data, one could choose the model’s complexity using MLPCA.1

’ AUTHOR INFORMATION Corresponding Author

*Phone: +48-32-359-1219; fax: +48-32-259-9978; e-mail: beata@ us.edu.pl.

’ ACKNOWLEDGMENT I.S. gratefully acknowledges the financial support of the Human Capital Programme at the University of Silesia, Katowice, Poland. We are grateful to the anonymous reviewers of this article for their insightful comments. ’ REFERENCES (1) Wentzell, P. D.; Andrews, D. T.; Hamilton, D. C.; Faber, K.; Kowalski, B. R. Maximum likelihood principal component analysis. J. Chemom. 1997, 11, 339–366. (2) Wentzell, P. D.; Karakach, T. K.; Roy, S.; Martinez, M. J.; Allen, C. P.; Werner-Washburne, M. Multivariate curve resolution of time course microarray data. BMC Bioinf. 2006, 7, 343. (3) Paatero, P.; Tapper, U. Positive matrix factorization: A nonnegative factor model with optimal utilization of error estimates of data values. Environmetrics 1994, 5, 111–126. (4) Juntto, S.; Paatero, P. Analysis of daily precipitation data by positive matrix factorization. Environmetrics 1994, 5, 127–144. (5) Wold, S.; Esbensen, K.; Geladi, P. Principal component analysis. Chemom. Intell. Lab. Syst. 1987, 2, 37–52. (6) Massart, D. L.; Vandeginste, B. G. M.; Buydens, L. M. C.; de Jong, S.; Lewi, P. J.; Smeyers-Verbeke, J. Handbook of chemometrics and Qualimetrics: Part A; Elsevier: Amsterdam, The Netherlands, 1997. (7) Rutan, S. C.; de Juan, A.; Tauler, R. Introduction to multivariate curve resolution. In Comprehensive Chemometrics; Brown, S. D., Tauler, R., Walczak, B., Eds.; Elsevier: Amsterdam, 2009; Vol.2, pp 249. (8) de Juan, A.; Tauler, R. Chemometrics applied to unravel multicomponent processes and mixtures. Revisiting latest trends in multivariate resolution. Anal. Chim. Acta 2003, 500, 195–210.

ARTICLE

(9) Malinowski, E. R. Factor Analysis in Chemistry; John Wiley & Sons: New York, NY, 1991. (10) Armstrong, J. S. Derivation of theory by means of factor analysis or Tom Swift and his electric factor analysis machine. Am. Stat. 1967, 21, 17–21. (11) Maeder, M.; Neuhold Yorck-Michael. Kinetic modeling of multivariate measurements with nonlinear regression. In Practical Guide to Chemometrics; Gemperline, P. J., Ed.; CRC Press: Taylor & Francis Group: New York, 2006; pp 218. (12) Pere-Trepat, E.; Lacorte, S.; Tauler, R. Alternative calibration approaches for LC-MS quantitative determination of coeluted compounds in complex environmental mixtures using multivariate curve resolution. Anal. Chim. Acta 2007, 595, 228–237. (13) Tauler, R.; Viana, M.; Querol, X.; Alastuey, A.; Flight, R. M.; Wentzell, P. D.; Hopke, P. K. Comparison of the results obtained by four receptor modelling methods in aerosol source apportionment studies. Atmos. Environ. 2009, 43, 3989–3997. (14) Gemperline, P. J. Target Transformation Factor Analysis with linear inequality constraints applied to spectroscopic-chromatographic data. Anal. Chem. 1986, 58, 2656–2663. (15) Gemperline, P. J.; Cash, E. Advantages of soft versus hard constraints in self-modeling curve resolution problems. Alternating least squares with penalty functions. Anal. Chem. 2003, 75, 4236–4243. (16) Wentzell, P. D. Other topics in soft-modeling: maximum likelihood-based soft-modeling methods. In Comprehensive Chemometrics; Brown, S. D., Tauler, R., Walczak, B., Eds.; Elsevier: Amsterdam, 2009; Vol. 2, pp 507. (17) Kim, E.; Hopke, P. K. Improving source identification of fine particles in a rural northeastern U.S. area utilizing temperature-resolved carbon fractions. J. Geogr. Res. 2004, 109, D09204. (18) Paatero, P.; Tapper, U.; Aalto, P.; Kulmala, M. J. Matrix factorization methods for analysing diffusion battery data. Aerosol Sci. 1991, 22, 273–276. (19) Anttila, P.; Paatero, P.; Tapper, U.; J€arvinen, O. Source identification of bulk wet deposition in Finland by positive matrix factorization. Atmos. Environ. 1995, 29, 1705–1718. (20) Brinkman, G.; Vance, G.; Hannigan, M. P.; Milford, J. B. Use of synthetic data to evaluate positive matrix factorization as a source apportionment tool for PM2.5 exposure data. Environ. Sci. Technol. 2006, 40, 1892–1901. (21) Logue, J. M.; Small, M. J.; Robinson, A. L. Identifying priority pollutant sources: apportioning air toxics risks using positive matrix factorization. Environ. Sci. Technol. 2009, 43, 9439–9444. (22) Yakovleva, E.; Hopke, P. K.; Wallace, L. Receptor modeling assessment of particle total exposure assessment methodology data. Environ. Sci. Technol. 1999, 33, 3645–3652. (23) de Juan, A.; van den Bogaert, B.; Cuesta Sanchez, F.; Massart, D. L. Application of the needle algorithm for exploratory analysis and resolution of HPLC-DAD data. Chemom. Intell. Lab. Syst. 1996, 33, 133–145. (24) Cuesta Sanchez, F.; Toft, J.; van den Bogaert, B.; Massart, D. L. Orthogonal projection approach applied to peak purity assessment. Anal. Chem. 1996, 68, 79–85. (25) Windig, W.; Guilment, J. Interactive self-modeling mixture analysis. Anal. Chem. 1991, 63, 1425–1432. (26) Bro, R.; de Jong, S. A. A fast non-negativity-constrained least squares algorithm. J. Chemometr. 1997, 11, 393–401. (27) FCNNLS.M; http://www.cc.gatech.edu/∼hpark/software/ fcnnls.m (accessed on October 21, 2011). (28) Tauler, R. Calculation of maximum and minimum band boundaries of feasible solutions for species profiles obtained by multivariate curve resolution. J. Chemometr. 2001, 15, 627–646. (29) Abdollahi, H.; Maeder, M.; Tauler, R. Calculation and meaning of feasible band boundaries in multivariate curve resolution of a twocomponent system. Anal. Chem. 2009, 81, 2115–2122. (30) Jaumot, J.; Tauler, R. A user friendly MATLAB program for the evaluation of rotation ambiguities in multivariate curve resolution. Chemom. Intell. Lab. Syst. 2010, 103, 96–107. 10109

dx.doi.org/10.1021/es201024m |Environ. Sci. Technol. 2011, 45, 10102–10110

Environmental Science & Technology

ARTICLE

(31) Lu, J.; Wu, L. Technical details and programming guide for a general two-way positive matrix factorization algorithm. J. Chemometr. 2004, 18, 519–525. (32) Paatero, P. Least squares formulation of robust non-negative factor analysis. Chemom. Intell. Lab. Syst. 1997, 37, 23–35. (33) Paatero, P. The multilinear engine—A table-driven least squares program for solving multilinear problems, including the n-way parallel factor analysis model. J. Comput. Graphical Stat. 1999, 8, 1–35. (34) http://www.ub.edu/mcr/web_mcr/download.html (accessed on March 21, 2011). (35) Nelsen, R. B. An Introduction to Copulas; Springer: New York, NY, 2006. (36) Stanimirova, I.; Simeonov, V. Modeling of environmental fourway data from air quality control. Chemom. Intell. Lab. Syst. 2005, 77, 115–121. (37) Kroonenberg, P. M. Applied Multiway Data Analysis; John Wiley & Sons: Hoboken, NJ, 2008. (38) Lavric, T. Composition and Sources of Aerosols (PM10) in the border region CarynthiaSloveniaItaly. Ph.D. Dissertation, Technical University of Vienna, Vienna, 2002. (39) Paatero, P.; Hopke, P.; Song, X.-H.; Ramadan, Z. Understanding and controlling rotations in factor analytic models. Chemom. Intell. Lab. Syst. 2002, 60, 253–264.

10110

dx.doi.org/10.1021/es201024m |Environ. Sci. Technol. 2011, 45, 10102–10110