Environ. Sci. Technol. 2008, 42, 6015–6021

Impact of Species Uncertainty Perturbation on the Solution Stability of Positive Matrix Factorization of Atmospheric Particulate Matter Data

WILLIAM F. CHRISTENSEN*,† AND JAMES J. SCHAUER‡

Department of Statistics, Brigham Young University, Provo, Utah 84602, and Environmental Chemistry and Technology Program, University of Wisconsin-Madison, Madison, Wisconsin 53706

Received January 9, 2008. Revised manuscript received May 8, 2008. Accepted May 22, 2008.

Statistical measures for evaluating the similarity of different source apportionment solutions are proposed. The sensitivity of positive matrix factorization to small perturbations in species measurement uncertainty estimates is examined using fine particulate matter measurements on organic carbon, elemental carbon, ions, and metals at the St. Louis-Midwest Supersite. A perturbed uncertainty matrix is created by multiplying each original uncertainty value by a random multiplier generated from a log-normal distribution with a mean of 1 and a standard deviation (and CV) equal to either 0.25, 0.50, or 0.75. The relative errors in reproducing the average contribution estimates from the perturbed data are generally highest for the gasoline exhaust, with the relative error (expressed as a percentage of the “true” value) exceeding 30% for all three perturbation scenarios. The most stable estimates of average source contribution were associated with secondary sulfate and secondary nitrate, with relative errors always less than 4%. Averaged over all 10 sources, the average values for our measure of relative error for the three scenarios are 8%, 14%, and 17%, respectively. Relative errors associated with day-to-day estimates of source contributions can be more than double the size of the relative errors associated with estimates of average source contributions, with errors for four of 10 source contributions exceeding 30% for the largest-perturbation scenario. The stability of source profile estimates in our simulation varies greatly between sources, with a mean correlation between perturbed gasoline exhaust profiles and the true profile equal to only 59% for the largest-perturbation scenario. The process used for evaluation is a tool that may be used to assess the stability of solutions in source apportionment studies.

* Corresponding author e-mail: [email protected]. † Brigham Young University. ‡ University of Wisconsin-Madison.

Published on Web 07/22/2008. 10.1021/es800085t CCC: $40.75. © 2008 American Chemical Society. Environmental Science & Technology, Vol. 42, No. 16, 2008.

1. Introduction

Pollution source apportionment (PSA or SA) refers to the problem of extracting estimates of pollution source contributions and pollution source profiles from ambiently measured data. In this discussion, we focus on the analysis of the chemical composition of atmospheric fine particulate matter including organic carbon (OC), elemental carbon (EC), ions, and metals. A brief but very useful introduction to PSA can be found in ref 1. When at least approximate knowledge about pollution source profiles is available, the problem is referred to as a chemical mass balance (CMB) problem, and regression techniques have been traditionally applied. An overview of approaches to the CMB problem within the framework of the measurement error model is given in ref 2. However, it is often the case that sources are not well-defined, and in such cases, factor analysis tools have been used. One of the most commonly used factor analysis approaches is positive matrix factorization (PMF), which was introduced by Paatero and Tapper (3) and developed by Paatero (4), Hopke et al. (5), Eberly (6), and others. PMF is based on the equation

$$X = GF + E \tag{1}$$

where G is an n × p matrix containing pollution source contributions for the p sources, F is a p × m matrix whose rows are the profiles for the p different sources, X is an n × m matrix of measurements of m different chemical species observed at n times, and E is an n × m matrix of measurement errors associated with X. Thus, for example, the concentration of species j observed at time i, $x_{ij}$, measured at a receptor can be explained as

$$\underset{1 \times 1}{x_{ij}} = \underset{1 \times p}{g_{i\cdot}}\;\underset{p \times 1}{f_{\cdot j}} + \underset{1 \times 1}{e_{ij}} \tag{2}$$

where $g_{i\cdot}$ is the ith row of G and $f_{\cdot j}$ is the jth column of F. PMF constrains G and F to be non-negative, thus satisfying the constraints necessary for realistic pollution source apportionment models. PMF solves the factor analysis equations by iteratively computing F and G via the minimization of

$$Q = \sum_{i=1}^{n} \sum_{j=1}^{m} \hat{e}_{ij}^{2} / s_{ij}^{2} \tag{3}$$

where $\hat{e}_{ij} = x_{ij} - \hat{g}_{i\cdot}\hat{f}_{\cdot j}$ and $s_{ij}$ is the standard deviation associated with each data point (3). Two challenges exist with respect to the specification of the values of $s_{ij}$. First, the $s_{ij}$ values are often derived from measurement protocols as opposed to being estimated via replication. Consequently, $s_{ij}$ may be a poor estimate of the measurement error standard deviation. Second, even if the measurement error standard deviations $s_{ij}$ are adequately estimated, the actual sizes of the measurement errors in the E matrix in eq 1 will only be adequately described by $s_{ij}$ when the proper values for G and F are used in the fitted model. For example, suppose the correct model for X is X = GF where G has 10 columns (sources) and F has 10 rows. If we use the model

$$X = \ddot{G}\ddot{F}$$

where $\ddot{G}$ has nine columns (sources) and $\ddot{F}$ has nine rows, then the variance of $\{x_{ij} - \ddot{g}_{i\cdot}\ddot{f}_{\cdot j}\}$ will in general be unequal to $s_{ij}^{2}$ because of the model error.

Before evaluating the impact of $s_{ij}$ on the stability of PMF solutions, we first devise methods for quantifying and evaluating solution stability and similarity. Section 2 addresses this topic and proposes metrics to be used throughout the manuscript. In section 3, we assess the impact that small changes in $s_{ij}$ have on estimates. We conclude with a discussion of this study’s implications about the interpretation of pollution source apportionment studies and some recommendations for evaluating solution stability.
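As a concrete reading of eq 3, the following minimal sketch evaluates Q for a given pair of factor matrices. It only computes the objective; it does not perform the constrained minimization that PMF carries out, and it ignores robust mode. The function name `pmf_objective` and the toy matrices are hypothetical and chosen purely for illustration.

```python
def pmf_objective(X, G, F, S):
    """Return Q = sum_ij ((x_ij - g_i. f_.j) / s_ij)^2 for nested-list matrices."""
    n, m, p = len(X), len(X[0]), len(F)
    q = 0.0
    for i in range(n):
        for j in range(m):
            # Fitted value for species j at time i: row i of G times column j of F.
            fitted = sum(G[i][k] * F[k][j] for k in range(p))
            q += ((X[i][j] - fitted) / S[i][j]) ** 2
    return q


# Toy data (hypothetical): n = 2 samples, m = 3 species, p = 2 sources.
G = [[1.0, 2.0], [0.5, 1.5]]
F = [[0.2, 0.3, 0.5], [0.6, 0.1, 0.3]]
X = [[1.5, 0.5, 1.0], [1.0, 0.4, 0.7]]
S = [[0.1, 0.1, 0.1], [0.1, 0.1, 0.1]]

q = pmf_objective(X, G, F, S)

# Doubling every uncertainty divides Q by four without changing the residuals.
S_doubled = [[2.0 * s for s in row] for row in S]
q_doubled = pmf_objective(X, G, F, S_doubled)
```

Because each squared residual is weighted by $1/s_{ij}^{2}$, changing individual $s_{ij}$ values changes which data points dominate the fit, which is why the specification of $s_{ij}$ matters for the solution.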

2. Measures of Solution Similarity

Before we are able to evaluate the stability of PSA solutions, we first must consider how to quantify the notion of estimate variability in the PMF setting. Because the interpretation of SA solutions depends jointly on the estimated profiles and the estimated contributions, we consider stability measures related to both of these PMF outputs. Suppose we are interested in comparing two different sets of contributions $\tilde{G}$ and $\hat{G}$, where $\tilde{G}$ represents the source contributions from a standard (or “true”) solution and $\hat{G}$ represents the contributions from an alternative solution. The simplest measure of the similarity between the $\hat{k}$th column of $\hat{G}$ and the $\tilde{k}$th column of $\tilde{G}$ (denoted $\hat{g}_{\cdot\hat{k}}$ and $\tilde{g}_{\cdot\tilde{k}}$, respectively) is the simple linear correlation

$$\mathrm{corr}\{\hat{g}_{\cdot\hat{k}}, \tilde{g}_{\cdot\tilde{k}}\} \tag{4}$$

In order to define a single value to summarize the similarity of $\hat{G}$ to $\tilde{G}$, we calculate eq 4 for all pairings of $\hat{g}_{\cdot\hat{k}}$ and $\tilde{g}_{\cdot\tilde{k}}$ ($\hat{k} = 1, \ldots, \hat{p}$; $\tilde{k} = 1, \ldots, \tilde{p}$), where the numbers of sources associated with the two solutions are $\hat{p}$ and $\tilde{p}$, respectively. Assuming $\hat{p} = \tilde{p} = p$, we then consider all p! ways of matching the p sources in $\hat{G}$ to the p sources in $\tilde{G}$. For each permutation, we average the correlation values associated with each of the p pairs of sources and then define $\mathrm{corr}^{*}\{\hat{G}, \tilde{G}\}$ to be the maximum of the p! averages associated with the p! permutations. Thus, $\mathrm{corr}^{*}\{\hat{G}, \tilde{G}\}$ is a single-number summary of the similarity of the two sets of contributions after finding the most optimistic one-to-one matching of the sources.

The advantage of using eq 4 is that, when one source in $\hat{G}$ is essentially a summation of two different sources in $\tilde{G}$ (or vice versa), such a relationship will be readily apparent. The problem with using eq 4 in the context of this problem is that we are usually interested in measures of stability or reproducibility. Specifically, we wish to evaluate the degree to which the source contribution estimates in $\hat{G}$ yield the same interpretation as the source contribution estimates in $\tilde{G}$ (assuming $\hat{p} = \tilde{p} = p$). For this purpose, the correlation can be a poor measure of similarity. For example, two different estimates of a source’s contribution can be highly correlated yet vastly dissimilar in the size of measured values. Consequently, when comparing contribution estimates in this study, we generally prefer the relative average absolute error (RAAE), which we define first for the $\hat{k}$th column of $\hat{G}$ and the $\tilde{k}$th column of $\tilde{G}$:

$$\mathrm{RAAE}\{\hat{g}_{\cdot\hat{k}}, \tilde{g}_{\cdot\tilde{k}}\} = \frac{\frac{1}{n}\sum_{i=1}^{n} |\hat{g}_{i\hat{k}} - \tilde{g}_{i\tilde{k}}|}{\frac{1}{n}\sum_{i=1}^{n} \tilde{g}_{i\tilde{k}}} \tag{5}$$
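The contrast between eq 4 and eq 5 can be made concrete with a small sketch. The helper names `pearson` and `raae` and the toy contribution vectors are hypothetical; the second argument of `raae` plays the role of the standard column $\tilde{g}_{\cdot\tilde{k}}$.

```python
from math import sqrt


def pearson(a, b):
    """Simple linear correlation of two equal-length sequences (eq 4)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


def raae(g_hat, g_ref):
    """Eq 5: average absolute error divided by the mean of the standard column."""
    n = len(g_ref)
    aae = sum(abs(h - r) for h, r in zip(g_hat, g_ref)) / n
    return aae / (sum(g_ref) / n)


# Hypothetical example: the alternative estimate is exactly twice the standard.
g_ref = [1.0, 2.0, 3.0, 4.0]
g_hat = [2.0 * v for v in g_ref]

r = pearson(g_hat, g_ref)  # perfectly correlated
e = raae(g_hat, g_ref)     # yet the typical error is as large as the mean contribution
```

Here the correlation is 1 while the RAAE is 1 (a 100% relative error), which is the situation described above: highly correlated estimates that are vastly dissimilar in size.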

Note that RAAE for $\hat{g}_{\cdot\hat{k}}$ differs from the AAE (found in the numerator of eq 5) in that RAAE divides AAE by the mean of $\tilde{g}_{\cdot\tilde{k}}$ (the contribution vector considered the “gold standard” in the comparison). We use eq 5 instead of the alternative expression

$$\frac{1}{n}\sum_{i=1}^{n} \frac{|\hat{g}_{i\hat{k}} - \tilde{g}_{i\tilde{k}}|}{\tilde{g}_{i\tilde{k}}}$$

because the latter is a poor measure of stability when $\tilde{g}_{i\tilde{k}}$ is equal or nearly equal to zero. To define a single value to summarize the similarity of $\hat{G}$ to $\tilde{G}$ in the context of RAAE, we use an approach that is identical to that used in defining $\mathrm{corr}^{*}\{\hat{G}, \tilde{G}\}$ above. That is, we calculate eq 5 for all pairings of $\hat{g}_{\cdot\hat{k}}$ and $\tilde{g}_{\cdot\tilde{k}}$ and then consider all p! ways of matching the p sources in $\hat{G}$ to the p sources in $\tilde{G}$. For each of the p!

permutations, we average the RAAE values associated with each of the p pairs of sources and then define $\mathrm{RAAE}^{*}\{\hat{G}, \tilde{G}\}$ to be the minimum of all p! averages.

We are also interested in the similarity between the profile matrices ($\hat{F}$ and $\tilde{F}$) associated with the two solutions of interest. For this purpose, there is a clear deficiency when using the standard linear correlation between profiles (rows) in $\hat{F}$ and $\tilde{F}$ (denoted $\hat{f}_{\hat{k}\cdot}$ and $\tilde{f}_{\tilde{k}\cdot}$). Pollution source profiles used in most PSA studies include one or more species that dominate the other species in size. For example, when OC, EC, sulfate, and nitrate are among the measured species, the profiles for many important sources will be dominated by one or more of these species. However, it is often the case that one of the less-dominant species will be the most important species for the identification of a source. For example, it may be the case that most of the ambient zinc particulates observed at a receptor are originating from a zinc smelter. Nevertheless, zinc may account for only a small fraction of a zinc smelter’s emissions (which may be dominated by OC, for example). In such a situation, the linear correlation coefficient will often be a poor metric for matching the zinc smelter profiles in two different solutions. In fact, the linear correlation coefficient will indicate a high degree of similarity with any other profile containing a high percentage of OC.

In order to avoid this type of ambiguity in matching profiles from different solutions, we use the profiles together with the associated contributions in order to create an explained species mass matrix. Note that each row of $\hat{F}$ represents a source profile, with elements summing to 1. The explained mass matrix $\hat{H}$ has the same dimension as the $\hat{F}$ matrix, where the jth column of $\hat{H}$ gives the fraction of the explained mass of species j originating from each source k. Specifically, the (k, j) element of $\hat{H}$ is

$$\hat{h}_{kj} = \frac{\hat{\mu}_{k}\hat{f}_{kj}}{\sum_{l=1}^{p} \hat{\mu}_{l}\hat{f}_{lj}} \tag{6}$$

where $\hat{\mu}_{k} = (1/n)\sum_{i=1}^{n} \hat{g}_{ik}$. Thus, if 45% of the estimated chlorine is originating from an estimated zinc smelter source, then the element of $\hat{H}$ associated with the zinc smelter row and the chlorine column would be 0.45. Note that two sources might be considered similar if this modified view of the source profiles (found in the rows of $\hat{H}$) is similar.

Once we have obtained the explained mass matrices for the two solutions ($\hat{H}$ and $\tilde{H}$), we define the similarity of the $\hat{k}$th profile (row) in $\hat{F}$ and the $\tilde{k}$th profile in $\tilde{F}$ using the explained mass correlation (EMC):

$$\mathrm{EMC}\{\hat{f}_{\hat{k}\cdot}, \tilde{f}_{\tilde{k}\cdot}\} = \mathrm{corr}\{\hat{h}_{\hat{k}\cdot}, \tilde{h}_{\tilde{k}\cdot}\} \tag{7}$$

Note that, technically, EMC is a function of not only $\hat{f}_{\hat{k}\cdot}$ and $\tilde{f}_{\tilde{k}\cdot}$ but also $\hat{g}_{\cdot\hat{k}}$ and $\tilde{g}_{\cdot\tilde{k}}$.

As with our comparison of $\hat{G}$ with $\tilde{G}$, we would like a single value to summarize the similarity of $\hat{F}$ with $\tilde{F}$. Thus, we calculate eq 7 for all pairings of $\hat{f}_{\hat{k}\cdot}$ and $\tilde{f}_{\tilde{k}\cdot}$ and then consider all p! ways of matching the p profiles in $\hat{F}$ to the p profiles in $\tilde{F}$. For each permutation, we average the EMC values associated with each of the p pairs of sources and then define $\mathrm{EMC}^{*}\{\hat{F}, \tilde{F}\}$ to be the maximum of the p! averages associated with the p! permutations. Thus, $\mathrm{EMC}^{*}\{\hat{F}, \tilde{F}\}$ quantifies the similarity of the two profile matrix estimates after finding the most optimistic one-to-one matching of the profiles.
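The explained mass matrix of eq 6 and the permutation search behind the starred measures can be sketched as follows. This is an illustrative brute-force version under stated assumptions, not the authors' code: the names `explained_mass`, `best_matching`, and `emc_star` are hypothetical, matrices are plain nested lists, and every explained-mass row is assumed to be non-constant so that the correlation is defined.

```python
from itertools import permutations
from math import sqrt


def pearson(a, b):
    """Simple linear correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


def explained_mass(G, F):
    """Eq 6: H[k][j] = mu_k * F[k][j] / sum_l mu_l * F[l][j]."""
    n, p, m = len(G), len(F), len(F[0])
    mu = [sum(G[i][k] for i in range(n)) / n for k in range(p)]
    col_total = [sum(mu[l] * F[l][j] for l in range(p)) for j in range(m)]
    return [[mu[k] * F[k][j] / col_total[j] for j in range(m)] for k in range(p)]


def best_matching(score, maximize=True):
    """Search all p! one-to-one matchings; score[a][b] pairs alternative a with standard b.

    Returns (average score, permutation), where permutation[a] is the matched standard index.
    """
    p = len(score)
    pick = max if maximize else min
    return pick(
        (sum(score[a][perm[a]] for a in range(p)) / p, perm)
        for perm in permutations(range(p))
    )


def emc_star(G_hat, F_hat, G_ref, F_ref):
    """EMC*: best average explained mass correlation (eq 7) over all matchings."""
    H_hat, H_ref = explained_mass(G_hat, F_hat), explained_mass(G_ref, F_ref)
    p = len(F_ref)
    score = [[pearson(H_hat[a], H_ref[b]) for b in range(p)] for a in range(p)]
    return best_matching(score, maximize=True)


# Hypothetical check: the alternative solution is the standard with its two sources swapped.
G_ref = [[1.0, 3.0], [2.0, 1.0], [3.0, 2.0]]
F_ref = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
G_hat = [[row[1], row[0]] for row in G_ref]
F_hat = [F_ref[1], F_ref[0]]

emc_best, matching = emc_star(G_hat, F_hat, G_ref, F_ref)
```

The same `best_matching` helper gives corr* when `score` holds column correlations, and RAAE* when it holds pairwise RAAE values and `maximize=False`. Enumerating p! permutations becomes slow in pure Python for p = 10 (about 3.6 million matchings); since the criterion is a sum over matched pairs, a linear assignment solver would be expected to return the same optimum far more quickly.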

3. Impact of Uncertainty Measurement Perturbation on PMF Solutions

The stability of modern CMB and PMF source apportionment models to model assumptions has been investigated to different degrees using real atmospheric particulate matter data sets. Lough and Schauer (7) found that the largest uncertainty associated with the apportionment of gasoline and diesel vehicles in molecular marker CMB models is the characterization of smoking vehicles. They conclude that the gas/diesel split is only practically feasible when “smokers” are properly characterized. In a study of sensitivity to the choice of biomass burning profiles, the standard deviation of estimates associated with different profiles was a little over 30% of the annual average contribution (8). This finding was corroborated by ref 9. The CMB apportioned mass of OC to food cooking emissions varies by a factor of 9 depending on the choice of food cooking profiles (10). In the PMF setting, stability has mostly been investigated in the context of species selection and the exclusion of extreme events. The specific choice of input species was not found to influence the nature of the sources chosen in a seven-source model, but the ability to consistently estimate the food cooking, gasoline, and diesel contributions was affected by the specific choice of input species (11).

In our study, we wish to quantify the stability of solutions at an even more fundamental level than the previous studies just cited, reflecting the unavoidable instability inherent in a source apportionment solution after fixing the apportionment method (PMF in our case), input species, model choice, and program settings. We begin by considering a data set obtained from the St. Louis-Midwest Supersite in East St. Louis, Illinois, during the period of May 2001 to May 2003. Details on data collection and chemical analyses can be found in section 2.1 of ref 12. This data set contains roughly 2 years of the daily chemical composition for fine particulate matter including OC, EC, ions, and metals.

In order to mimic the procedure used in sophisticated source apportionment studies in the literature, we follow the protocol for data manipulation outlined in ref 12, including the procedures for screening species for inclusion, determining species uncertainty values, handling missing values, and adapting values below the method detection limit. Attempting to follow the protocol in ref 12 precisely, we create a data set for analysis which contains 706 observations for 33 species, including fractionated OC and EC, as well as ions and metals. We use PMF to identify 10 factors affecting the St. Louis airshed which roughly correspond to the sources identified in ref 12. (In keeping with commonly used terminology, we use “source” and “factor” interchangeably but recognize that a factor/estimated source usually does not neatly align with a single actual pollution source.) Although the protocol for using PMF (including the value of FPEAK, the use of robust mode, and the handling of outlier thresholds) is the same as that described in ref 12, we obtained slightly different estimates for the average contribution to PM2.5 mass for the sources. A comparison of the source contributions from our analysis and from ref 12 can be found in Figure 1. We note that using the same matrix of FKEY values as described in ref 12 (to reduce the contribution of sulfate to the nonferrous metal processing sources) had virtually no impact on our solutions. Consequently, none of the analyses reported here involve the use of F-element pulling.

FIGURE 1. Average contribution to the PM2.5 mass for each of 10 sources from Lee et al. (12) (black bars) and from the current reanalysis (gray bars). The Lee et al. analysis uses F-element pulling, while the current analysis does not.

In order to illustrate the impact of small uncertainty measurement perturbations on PMF solutions obtained from these data, we consider the use of slightly perturbed uncertainty matrices using differing degrees of perturbation. Specifically, let U be the assumed uncertainty matrix whose (i, j) element is denoted $u_{ij}$. The perturbed uncertainty matrix ($\dot{U}$) lets $\dot{u}_{ij} = u_{ij} m_{ij}$, where each $m_{ij}$ is a random draw from a log-normal distribution with a mean of 1 and a standard deviation of $\sigma_m \in [0.01, 1]$. (The details on generating from the log-normal distribution are discussed in ref 13.) Note that, instead of multiplying an entire column or matrix by a single multiplier m, this algorithm allows the uncertainty associated with each measurement to be uniquely perturbed, reflecting the nature of measurement errors introduced into the data in practice. To illustrate the possible values for the multiplier $m_{ij}$ when using different values for $\sigma_m$, Table 1 summarizes the distribution for the perturbation multiplier $m_{ij}$ when $\sigma_m$ = 0.25, 0.50, and 0.75.

TABLE 1. Summary Statistics for the Distribution of $m_{ij}$ when $\sigma_m$ = 0.25, 0.50, and 0.75 (a)

                percentiles
σm      first   25th    50th    75th    99th
0.25    0.55    0.82    0.97    1.15    1.72
0.50    0.30    0.65    0.89    1.23    2.68
0.75    0.17    0.51    0.80    1.26    3.78

(a) The mean of $m_{ij}$ is 1 for all three cases.

PMF solutions were obtained using the original uncertainty matrix (U) and each of the perturbed uncertainty matrices ($\dot{U}$). Let F and G be the 10-source PMF solution using the original uncertainty matrix (shown in Figure 1), and let $\hat{F}$ and $\hat{G}$ denote the 10-source solution using a perturbed uncertainty matrix. For each solution obtained, we then calculate $\mathrm{RAAE}^{*}\{\hat{G}, G\}$ and $\mathrm{EMC}^{*}\{\hat{F}, F\}$ as described in section 2. Note that $\mathrm{RAAE}^{*}\{\hat{G}, G\}$ quantifies the magnitude of the average daily deviation from the original PMF estimate G.
That is, we treat G as our “gold standard” and refer to the deviations $\hat{G} - G$ as “errors.”

Because we are often interested in the average contribution of a source (instead of the day-to-day contributions), we also calculate these averages $\mu_{\hat{g}_{\cdot\hat{k}}} = (1/706)\sum_{i=1}^{706} \hat{g}_{i\hat{k}}$ and $\mu_{g_{\cdot k}} = (1/706)\sum_{i=1}^{706} g_{ik}$, k = 1,..., 10, and we calculate the absolute deviation between $\mu_{\hat{g}_{\cdot\hat{k}}}$ and $\mu_{g_{\cdot k}}$. Figure 2 gives boxplots of the average contributions ($\mu_{\hat{g}_{\cdot\hat{k}}}$) associated with the 50 perturbed uncertainty matrices created using $\sigma_m$ = 0.25, 0.50, and 0.75. The average source contribution obtained using the original uncertainty matrix ($\mu_{g_{\cdot k}}$) is denoted on each boxplot with an “X”. As can be seen in Figure 2, the average contribution estimates for most sources are not highly variable, the exceptions being the sources for gasoline exhaust, carbon-rich sulfate, and diesel/railroad for $\sigma_m \geq 0.50$. However, letting each uncertainty value be uniquely perturbed can yield some systematic deviations (or biases) from the original source contribution estimates; note that the value of $\mu_{g_{\cdot k}}$ is not always found in the middle of the distribution of $\mu_{\hat{g}_{\cdot\hat{k}}}$.

A practical way to compare the stability of average contribution estimates across sources is to consider the average absolute deviation between $\mu_{\hat{g}_{\cdot\hat{k}}}$ and $\mu_{g_{\cdot k}}$, expressed as a percentage of $\mu_{g_{\cdot k}}$. When using the subtlest degree of perturbation in the simulation ($\sigma_m$ = 0.25), the range of this measure of relative error for the 10 sources is [0.3%, 35%], with the gasoline exhaust contribution estimate having the largest relative error. Averaged over all 10 sources, the average value for this measure of relative error is 8% for the case using $\sigma_m$ = 0.25, 14% for the case using $\sigma_m$ = 0.50, and 17% for the case using $\sigma_m$ = 0.75.

A more complete story is illustrated in Figure 3, which gives the $\mathrm{RAAE}\{\hat{g}_{\cdot\hat{k}}, g_{\cdot k}\}$ values for each of the 10 resolved sources (after the columns of $\hat{G}$ are sorted to match the columns of G). Note that, unlike the estimates of average contribution illustrated in Figure 2, RAAE measures the relative error in the day-to-day estimates of each pollution source. For each of the values $\sigma_m$ = 0.25, 0.50, and 0.75, the secondary nitrate and secondary sulfate sources are the most accurately and consistently estimated (in terms of relative deviation from $g_{\cdot k}$) with mean RAAE values below 7% for all values of $\sigma_m$. On the other extreme, the gasoline exhaust source estimates have the largest relative errors, with a mean RAAE as high as 63% when $\sigma_m$ = 0.75. (For comparison, errors of this magnitude are roughly twice the size of the relative errors observed in ref 8 when using competing source profiles to estimate biomass contributions in the molecular marker CMB setting.)

FIGURE 2. Average source contributions (µg/m3) obtained when using different randomly adjusted uncertainty matrices. The three plots correspond to the cases when each uncertainty is multiplied by a random draw from a log-normal distribution with a mean of 1 and a standard deviation of (a) $\sigma_m$ = 0.25, (b) $\sigma_m$ = 0.50, and (c) $\sigma_m$ = 0.75. The average source contribution when using the original uncertainty matrix is denoted on each boxplot with an “X”.

FIGURE 3. Values of $\mathrm{RAAE}\{\hat{g}_{\cdot\hat{k}}, g_{\cdot k}\}$ obtained when using different randomly adjusted uncertainty matrices. The three plots correspond to the cases when each uncertainty is multiplied by a random draw from a log-normal distribution with a mean of 1 and a standard deviation of (a) $\sigma_m$ = 0.25, (b) $\sigma_m$ = 0.50, and (c) $\sigma_m$ = 0.75. The horizontal gray line denotes the overall mean of the RAAE values.

The overall means of the RAAE values

(averaging across all 10 sources) for the three perturbation scenarios ($\sigma_m$ = 0.25, 0.50, and 0.75) are 10%, 18%, and 25%, respectively. That is, for these perturbations of the uncertainty values, the size of a typical source contribution estimation error is roughly 1/10 to 1/4 the size of the average contribution for the source.

For each source type, Table 2 compares the typical sizes of errors when estimating day-to-day and overall average contributions. As previously noted, there are some sources that are virtually unaffected by uncertainty perturbation, especially secondary sulfate and secondary nitrate. As expected, the typical relative error when estimating a daily source contribution is always greater than the typical relative error when estimating the source’s overall average contribution. However, for half (five) of the sources considered, the error magnitude for the estimates of the day-to-day contributions is less than 1.5 times the error magnitude for the overall average contribution (regardless of the value of $\sigma_m$). This relative similarity in the size of the day-to-day error estimates and the errors in the estimates of the overall average is a result of the systematic bias observed in Figure 2.

TABLE 2. Error Magnitudes of Contribution Estimates (Proportion of “True” Value) (a)

                        σm = 0.25             σm = 0.50             σm = 0.75
source                  day-to-day  average   day-to-day  average   day-to-day  average
lead smelting           0.03        0.03      0.07        0.07      0.18        0.15
secondary sulfate       0.01        0.00      0.03        0.01      0.06        0.03
copper smelting         0.05        0.05      0.12        0.12      0.19        0.19
gasoline exhaust        0.37        0.35      0.61        0.40      0.63        0.30
zinc smelting           0.08        0.07      0.16        0.16      0.17        0.14
carbon-rich sulfate     0.11        0.09      0.20        0.16      0.31        0.12
soil                    0.05        0.03      0.10        0.05      0.22        0.11
steel processing        0.05        0.03      0.16        0.15      0.37        0.35
diesel/railroad         0.19        0.18      0.27        0.24      0.32        0.26
secondary nitrate       0.01        0.00      0.03        0.01      0.05        0.02

(a) For each of the levels of $\sigma_m$ used in the perturbation of the uncertainty values, the table gives measures of the relative magnitudes of errors in source contribution estimates. Error magnitudes for day-to-day estimates of $g_{\cdot k}$ are quantified using $\mathrm{RAAE}\{\hat{g}_{\cdot\hat{k}}, g_{\cdot k}\}$ in eq 5, and error magnitudes for the estimates of the average source contribution ($\mu_{g_{\cdot k}}$) are quantified using the average absolute deviation between $\mu_{\hat{g}_{\cdot\hat{k}}}$ and $\mu_{g_{\cdot k}}$, expressed as a proportion of $\mu_{g_{\cdot k}}$.

Figure 4 gives the $\mathrm{EMC}\{\hat{f}_{\hat{k}\cdot}, f_{k\cdot}\}$ values for each of the 10 resolved sources (after the rows of $\hat{F}$ are sorted to match the rows of F). With respect to explained mass, the profiles exhibiting the most instability are the gasoline exhaust and carbon-rich sulfate sources, with median EMC values of 63% and 72% for the case when $\sigma_m$ = 0.75. For context, Figure 5 gives the values of the carbon-rich sulfate source profile when $\sigma_m$ = 0.75. For each of the three values of $\sigma_m$, the overall means of the EMC values (averaging across all 10 sources) are 0.99, 0.97, and 0.90, respectively. That is, using the definition of profile similarity that we introduce in eq 7, the average correlations between the profiles in the perturbed and unperturbed analyses are 0.99, 0.97, and 0.90.

To give context to these calculations based on uncertainty perturbation, Figure 6 illustrates the comparative impact of applying the perturbation multiplier ($m_{ij}$) to the uncertainty matrix U and to the measurement matrix X. Note that, while the impact of perturbing the uncertainties can be substantial, a given value of $\sigma_m$ yields a much larger impact on the estimates when the multipliers are applied to the measurement matrix as opposed to the uncertainty matrix.
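The perturbation step used throughout this section can be sketched as follows. The paper defers the details of the log-normal generator to ref 13, so the moment-matching parameterization below is an assumption: it chooses the log-scale parameters so that the multiplier has mean 1 and standard deviation $\sigma_m$. The function names and the small uncertainty matrix are hypothetical. Under this parameterization the analytic medians are 0.97, 0.89, and 0.80, in agreement with the 50th-percentile column of Table 1.

```python
import math
import random
from statistics import NormalDist


def lognormal_params(sd, mean=1.0):
    """Log-scale (mu, sigma) of a log-normal with the requested mean and standard deviation."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    return math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2)


def perturb_uncertainties(U, sd, rng):
    """Multiply each u_ij by its own independent log-normal multiplier m_ij."""
    mu, sigma = lognormal_params(sd)
    return [[u * rng.lognormvariate(mu, sigma) for u in row] for row in U]


def multiplier_quantile(sd, prob):
    """Analytic quantile of the multiplier distribution."""
    mu, sigma = lognormal_params(sd)
    return math.exp(mu + sigma * NormalDist().inv_cdf(prob))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
U = [[0.10, 0.20, 0.05], [0.12, 0.18, 0.04]]  # hypothetical uncertainty matrix
U_dot = perturb_uncertainties(U, sd=0.50, rng=rng)

medians = [round(multiplier_quantile(sd, 0.50), 2) for sd in (0.25, 0.50, 0.75)]
```

Each perturbed matrix `U_dot` would then be passed, together with the unchanged measurement matrix, to whatever PMF implementation is in use; that step is outside the scope of this sketch.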

4. Implications for Source Apportionment Studies

In this manuscript, we consider the impact of species uncertainty on PMF solutions. We develop approaches for quantifying the similarity between two sets of pollution source apportionment estimates and consider the utility of these metrics for evaluating solution stability. We consider uncertainty perturbation as a tool for evaluating the solution stability of PMF.

The data-perturbation approach illustrated herein is one that could be of benefit for researchers interested in evaluating source apportionment solution stability. It is well-accepted that most specifications of measurement uncertainty are approximate at best and do not reflect a complete synthesis of the factors leading to measurement error. Consequently, running multiple analyses with the same measurement matrix, model, and program settings, but with slightly perturbed uncertainty matrices, can be of value in quantifying the proportion of solution instability that is due to factors that are “downstream” from the selection of source apportionment approach, input species, program settings, and specific model. Specifically, we recommend choosing $\sigma_m^{(\max)}$ to represent the maximum possible coefficient of variation for the uncertainty associated with each measurement. One can then replicate the approach outlined in section 3 using $\sigma_m \in [0.01, \sigma_m^{(\max)}]$ to generate perturbed versions of the uncertainty matrices. Researchers should look for large values of RAAE in eq 5 and small values of EMC in eq 7, as these are potential indications of a spurious fit. As in our analyses, one would also inspect the distribution of the estimates of the average contribution for each source, with problems indicated by large variation or systematic biases in the estimates obtained from the perturbed data sets.

In addition to the perturbation approach discussed here for evaluating solution stability, many researchers use more than one source apportionment approach to lend legitimacy to their models and conclusions (e.g., CMB, conventional factor analysis, and other statistical approaches). As previously noted, our approach complements the practice of using a suite of source apportionment methods in that our approach indicates the amount of solution instability that is due to factors that are “downstream” from the selection of source apportionment methods. The perturbation approach can be applied to any source apportionment method.

We note that the analyses herein represent a fairly optimistic view of solution stability because they focus only on mild perturbations of the uncertainty values used in PMF without considering errors in the measured values, uncertainty associated with choosing the number of sources, or other model-fit problems. Also, applying the same degree of perturbation to the measurement matrix (as opposed to the uncertainty matrix) results in much higher volatility in the estimates of both source contributions and profiles. Our simulation is based only on the data from the St. Louis Supersite, so we cannot draw firm generalizations to all PMF or source apportionment analyses. However, we believe these data and associated model based on ref 12 to be typical of PMF analyses using elemental data. We therefore draw several tentative conclusions from the simulation based on the St. Louis data.
First, the errors associated with the average source contributions remain relatively stable for most sources. Specifically, when estimating the average source contribution, perturbing the uncertainties using a value of $\sigma_m$ as small as 0.25 yields relative errors between 0.3% and 35%, with all but the sources for gasoline exhaust and diesel/railroad having relative errors less than 10%. However, as the degree of perturbation increases to the level of the $\sigma_m$ = 0.75 case, relative errors for average source contributions increase to the point where all sources except for secondary sulfate and secondary nitrate have relative errors greater than 10%.

Second, the relative errors associated with day-to-day estimates of source contributions can be more than double the size of the relative errors associated with estimates of average source contributions. In the $\sigma_m$ = 0.75 case, relative errors for day-to-day estimates of four different source contributions were greater than 30% (see Table 2).

Third, the stability of source profile estimates in our simulation varies greatly between stable sources such as the secondary nitrate and secondary sulfate sources, and the often erratic profile estimates of gasoline exhaust and carbon-rich sulfate (see Figure 4). For example, under the scenario with $\sigma_m$ = 0.75, the gasoline exhaust source profile yields values of $\mathrm{EMC}\{\hat{f}_{\hat{k}\cdot}, f_{k\cdot}\}$ that are below 0.32 roughly one-quarter of the time with a mean value of 0.59.

Fourth, point sources and secondary formation sources tend to be more robust to small errors when specifying the uncertainty matrix. Estimates of poorly separated source groups (e.g., gas/diesel/carbon-rich sulfate) tend to be the most volatile when subjected to perturbations in the uncertainty matrix. In our exploration of this data perturbation method, the stability of solutions was predictably decreased by decreasing the sample size and by including additional species with low signal-to-noise ratios.

FIGURE 4. Values of $\mathrm{EMC}\{\hat{f}_{\hat{k}\cdot}, f_{k\cdot}\}$ obtained when using different randomly adjusted uncertainty matrices. The three plots correspond to the cases when each uncertainty is multiplied by a random draw from a log-normal distribution with a mean of 1 and a standard deviation of (a) $\sigma_m$ = 0.25, (b) $\sigma_m$ = 0.50, and (c) $\sigma_m$ = 0.75. The horizontal gray line denotes the overall mean of the EMC values.

FIGURE 5. Values of source profile for a carbon-rich sulfate source when using $\sigma_m$ = 0.75. The source profile values obtained from the PMF analysis of the unperturbed data are denoted on each boxplot with an “X”.

FIGURE 6. (a) Values of $\mathrm{RAAE}\{\hat{g}_{\cdot\hat{k}}, g_{\cdot k}\}$ when using simulated data and $\sigma_m \in [0.01, 1]$. The black line denotes the RAAE values when the uncertainty matrix is perturbed, and the gray line denotes the RAAE values when the same method of perturbation is applied to the measurement matrix. (b) Same as plot a but using values of $\mathrm{EMC}\{\hat{f}_{\hat{k}\cdot}, f_{k\cdot}\}$.
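The check on average contributions recommended in this section can be sketched as below. The sketch assumes that the PMF fits have already been run (one per perturbed uncertainty matrix) and that the columns of each perturbed contribution matrix have already been matched to the columns of the reference solution. The function names and the two-source toy data are hypothetical.

```python
def mean_contributions(G):
    """Average contribution of each source (column means of an n x p nested list)."""
    n = len(G)
    return [sum(row[k] for row in G) / n for k in range(len(G[0]))]


def average_relative_error(G_ref, perturbed_Gs):
    """Per source: mean over runs of |mu_hat_k - mu_k|, as a proportion of mu_k."""
    mu_ref = mean_contributions(G_ref)
    per_run = [mean_contributions(G) for G in perturbed_Gs]
    return [
        sum(abs(mu[k] - mu_ref[k]) for mu in per_run) / len(per_run) / mu_ref[k]
        for k in range(len(mu_ref))
    ]


# Hypothetical reference solution and two perturbed-run solutions (n = 2, p = 2).
G_ref = [[2.0, 1.0], [4.0, 3.0]]
runs = [
    [[2.2, 1.0], [4.4, 3.0]],
    [[1.8, 1.5], [3.6, 3.5]],
]

rel_err = average_relative_error(G_ref, runs)
```

In this toy case the first source has a relative error of 0.10 and the second 0.125. The second source is also consistently overestimated across runs, which is the kind of systematic bias that an inspection of the distribution of average contributions is meant to reveal.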

Acknowledgments

This work was supported in part by the STAR Research Assistance Agreement No. RD-83216001-0 awarded by the U.S. Environmental Protection Agency. The article has not been formally reviewed by the EPA. The views expressed in this document are solely those of the authors, and the EPA does not endorse any products or commercial services mentioned in this publication. The authors thank Dr. Jay Turner for assistance in obtaining the data from the EPA St. Louis-Midwest Supersite. The authors also thank the editors and reviewers for helpful comments, which improved the manuscript.

Supporting Information Available

This material is available free of charge via the Internet at http://pubs.acs.org.

Literature Cited

(1) Hopke, P. K. An introduction to receptor modeling. In Receptor Modeling for Air Quality Management; Hopke, P. K., Ed.; Elsevier: Amsterdam; pp 1–10.
(2) Christensen, W. F.; Gunst, R. F. Measurement error models in chemical mass balance analysis of air quality data. Atmos. Environ. 2004, 38, 733–744.
(3) Paatero, P.; Tapper, U. Positive Matrix Factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 1994, 5, 111–126.
(4) Paatero, P. Least squares formulation of robust non-negative factor analysis. Chemom. Intell. Lab. Syst. 1997, 37, 23–35.
(5) Hopke, P. K.; Xie, Y.; Paatero, P. Mixed multiway analysis of airborne particle composition data. J. Chemom. 1999, 13, 343–352.
(6) Eberly, S. EPA PMF 1.1 User’s Guide; U.S. Environmental Protection Agency: Washington, DC, 2005.
(7) Lough, G. H.; Schauer, J. J. Sensitivity of source apportionment of urban particulate matter to uncertainty in motor vehicle emissions profiles. J. Air Waste Manage. Assoc. 2007, 57, 1200–1213.
(8) Sheesley, R. J.; Schauer, J. J.; Zheng, M.; Wang, B. Sensitivity of molecular marker-based CMB models to biomass burning source profiles. Atmos. Environ. 2007, 41, 9050–9063.
(9) Subramanian, R.; Donahue, N. M.; Bernardo-Bricker, A.; Rogge, W. F.; Robinson, A. L. Contribution of motor vehicle emissions to organic carbon and fine particle mass in Pittsburgh, Pennsylvania: Effects of varying source profiles and seasonal trends in ambient marker concentrations. Atmos. Environ. 2006, 40, 8002–8019.
(10) Robinson, A. L.; Subramanian, R.; Donahue, N. M.; Bernardo-Bricker, A.; Rogge, W. F. Source apportionment of molecular markers and organic aerosol. 3. Food cooking emissions. Environ. Sci. Technol. 2006, 40, 7820–7827.
(11) Shrivastava, M. K.; Subramanian, R.; Rogge, W. F.; Robinson, A. L. Sources of organic aerosol: Positive matrix factorization of molecular marker data and comparison of results from different source apportionment models. Atmos. Environ. 2007, 41, 9353–9369.
(12) Lee, J. H.; Hopke, P. K.; Turner, J. R. Source identification of airborne PM2.5 at the St. Louis-Midwest Supersite. J. Geophys. Res. 2006, 111, D10S10, doi:10.1029/2005JD006329.
(13) Lingwall, J. W.; Christensen, W. F. Pollution source apportionment using a priori information and Positive Matrix Factorization. Chemom. Intell. Lab. Syst. 2007, 87, 281–294.
