Environ. Sci. Technol. 2003, 37, 2460-2476
Advanced Factor Analysis of Spatial Distributions of PM2.5 in the Eastern United States PENTTI PAATERO Department of Physical Sciences, University of Helsinki, Box 64, FIN-00014 Helsinki, Finland PHILIP K. HOPKE* AND JANJIRA HOPPENSTOCK Department of Chemical Engineering, Clarkson University, P.O. Box 5705, Potsdam, New York 13699-5705 SHELLY I. EBERLY U.S. Environmental Protection Agency, Research Triangle Park, North Carolina 27711
This work analyzes PM2.5 24-h average concentrations measured every third day at over 300 locations in the eastern United States during 2000. The non-negative factor analytic model, Positive Matrix Factorization, has been enhanced by modeling the dependence of PM2.5 concentrations on temperature, humidity, pressure, ozone concentrations, and wind velocity vectors. The model comprises 12 general factors, augmented by 5 urban-only factors intended to represent excess concentration present in urban locations only. The computed factor components or concentration fields are displayed as concentration maps, one for each factor, showing how much each factor contributes to the average concentration at each location. The factors are also displayed as flux maps that illustrate the spatial movement of PM2.5 aerosol, thus enabling one to pinpoint potential source areas of PM2.5. The quality of the results was investigated by examining how well the model reproduces especially high concentrations of PM2.5 on specific days at specific locations. Delimiting the spatial extent of all such factors that exhibit a clear regional maximum surrounded by an almost-zero outer domain lowered the uncertainty in the computed results.
Introduction In 1997, the U.S. Environmental Protection Agency (EPA) promulgated new regulations that established National Ambient Air Quality Standards (NAAQS) for particulate matter with aerodynamic diameters less than 2.5 µm (PM2.5). As part of the program to implement these standards, a network of ambient mass monitoring sites was established and equipped with Federal Reference Method (FRM) samplers for PM2.5 (1). These samplers began functioning at many sites at the beginning of 1999, and by the end of 1999, over 1000 sites were in operation. The measurement of PM2.5 is a labor-intensive operation in that it requires equilibration and weighing of each filter, transport to the field stations to install the filters into the FRM samplers, removal of the filters after exposure, re* Corresponding author telephone: (315)268-3861; fax: (315)268-6654; e-mail:
[email protected]. 2460
9
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 37, NO. 11, 2003
equilibration, and a second weighing. Because of this burden, sampling is generally performed only every third day. The EPA has undertaken a program of analysis to provide improved understanding of sources and transport of PM2.5 over the eastern United States. Such improved understanding will be necessary for developing control strategies to attain or maintain levels of air quality that meet the PM2.5 NAAQS, estimating trends in PM2.5 concentrations, and evaluating the design of the PM2.5 FRM monitoring network. It is the purpose of this paper to examine the application of a new approach to factor analysis, an extended variant of Positive Matrix Factorization (PMF), to spatial data to gain such understanding.
Background Factor analytic models have been used for receptor modeling for a long time, first based on the Principal Component Analysis (PCA) approach (2-6) including those where external constraints are imposed on the solutions (7, 8). Recently, models based on the non-negatively constrained PMF approach have gained popularity (9-16). In the analysis of temporal-spatial data, factor analytic models have often been called Empirical Orthogonal Functions (EOF). EOF has been widely used in meteorology to analyze spatially distributed data such as barometric pressure measured across a spatial domain (17-21) with a summary of the application of PCA to meteorological spatial fields given by Preisendorfer (22). Given that the transport process and large-scale meteorological patterns can be modeled using linear structures such as principal components, it then makes sense to then model pollutant transport with similar techniques (23-27). In these studies, the pattern of gaseous pollutants (SO2, O3) and particulate sulfate across regions of various geographical scales have been modeled. Such models consider the data as organized in a matrix. One dimension (row numbers) corresponds to time while the second dimension (column numbers) corresponds to the two spatial dimensions. Data in one row of the matrix correspond to observations made in a single day at all locations of the measuring network. Similarly, data in one column represent the time series measured at one location. In the EOF approach, the matrix is analyzed similarly to a PCA, although the name of the method is different. The data matrix is denoted by X and its elements by xij with i ) 1, ..., I and j ) 1, ..., J. The equation for EOF can be written in matrix form as
X ) GFT + E
(1)
In component form, the equation is P
xij )
∑g
ip fjp
+ eij
(2)
p)1
The columns of matrix F represent spatial coefficients, each column of G contains a time series, and E is the matrix of residuals that are not fitted by the model. The task of the fitting is to determine the elements of G and F so that the norm of E is minimized. Each pair of corresponding columns of G and F represents a ”concentration field”, an entity that has a certain time behavior and a certain spatial pattern. In the framework of EOF, the columns of G are constrained to be orthogonal to each other and similarly the columns of F. The fitting is usually achieved through the use of a singular value decomposition (SVD). Equations 1 and 2 can also be 10.1021/es0261978 CCC: $25.00
2003 American Chemical Society Published on Web 05/01/2003
FIGURE 1. Map of the eastern United States indicating the spatial domain within which the PMF analysis was performed and the locations of PM2.5 monitors. solved using PMF (28). The additional constraints imposed on the system are different however. In PMF, the elements of G and F are usually required to be non-negative. Also, the weight of each data value xij is individually adjustable in the definition of the function that is to be minimized. The present work describes a model of PM2.5 that utilizes auxiliary or parametric variables that have been measured concurrently with the PM2.5 data values. The fitting of the model is performed using the Multilinear Engine (ME-2) program (29). The analysis presented here builds on the methodology of Paatero and Hopke (30). In that paper, a number of auxiliary variables, such as wind speed and wind direction, were used to improve the source apportionment factor model. In the present study, the auxiliary variables depend on place and time while influencing the dependent variable PM2.5 and are consequently called parametric variables. Thus, this model of PM2.5 includes one component that depends only on time, one component that depends only on location, and another component that depends on both time and location.
Data Description PM2.5 measurements collected using FRM samplers every third day of 2000 at a series of sites were analyzed. The spatial domain of measurements consists of the area between 32 and 43° N and 72 and 96° W (Figure 1). The monitoring data were declustered because of the density of monitors in urban areas and sparse distribution of monitors in rural areas. The resulting data set consists of 304 locations. Some of the locations consist of a single station, while others represent the median of several clustered stations. Missing values were imputed by spatially interpolating neighboring locations. Of the 304 locations, 83 were classified as rural, and the rest
were classified as urban. See the Supporting Information for a detailed description of the processing of the PM2.5 measurements. The following parametric variables were used: 24-h average temperature, T (°F); 24-h average specific humidity, S (g/kg); 24-h average pressure difference, P (mb, ) deviation from the 2000 annual mean pressure at that location); ozone daily maximum 8-h average concentration, Z (ppb); and 6 AM-9 AM average wind velocity vector (Vx,Vy) (m/s). Ozone data were provided for May-October, inclusive, only. Other combinations of meteorological variables were not tried in a systematic way. Thus, one should not assume that the current set of meteorological variables would be optimal. For examples, using relative humidity instead of specific humidity, using PM instead of AM wind data, or using back trajectories instead of wind directions might provide a better or more interpretable solution. The meteorological and ozone variables typically were not monitored at the geographical locations of the PM2.5 stations. The nearest available meteorological station and ozone station were assigned to each PM2.5 location. The 304 PM2.5 locations were assigned to approximately 100 meteorological stations, with an average separation less than 30 mi. A very small number of missing values occur for some meteorological stations. The corresponding PM2.5 data have been omitted from the analysis as if they are missing values.
Model for the PM2.5 Analysis The mathematical model is defined by the following general equation: P
xij )
∑m
p)1
P
ijp
gip fjp + eij )
∑r
ijp
+ eij
(3)
p)1
VOL. 37, NO. 11, 2003 / ENVIRONMENTAL SCIENCE & TECHNOLOGY
9
2461
FIGURE 2. Results of the analysis for general factor number 1. The map, top of the figure, displays the time-averaged contribution of this factor in different locations in the eastern United States. The time-series diagram, bottom of the figure, displays the location-averaged contribution of this factor in different days. The smaller diagrams represent the dependence of this factor on the parametric variables: wind velocity, temperature, humidity, pressure, and ozone concentration. The matrices G and F, consisting of unknown factor elements, have a similar meaning as in eq 1. In contrast, the values mijp are not unknown adjustable values but functionals that depend on the values of parametric variables at location j on day i. The functionals can in principle be defined in different ways. In this work, the dependence is defined as a product of effects related to different parametric variables:
mijp ) Tp(Tij)Sp(Sij)Pp(Pij)Zp(Zij)Vp(VijX,VijY)
(4)
The known values Tij, Sij, Pij, Zij, VijX, and VijY indicate the meteorological and ozone values on day i at location j. However, the values themselves are not used directly. They are used to establish a limited number of intervals for each variable such that reasonable numbers of data values fall into each interval and that all of the intervals, taken together, cover the range of variation for each variable as described 2462
9
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 37, NO. 11, 2003
by Paatero and Hopke (30). Tp, Sp, Pp, Zp, and Vp are functionals represented as matrices, consisting of unknown values to be estimated during the fitting process. Each of these matrices has p columns, one for each of the p factors. The definitions of the first four functionals are structurally similar. For each of the p factors, each functional has a unique definition, and the definition is the same for all locations and for all times within that factor. Intuitively, this can be understood as defining the shapes of functionals in an attempt to mirror laws of physics and chemistry. Those laws are the same for all times and all places. During the iteration, the shapes of the functionals are determined numerically so that the best possible fit (smallest possible values of the residuals) is achieved. In practice, the functionals are implemented as tables that define the value of any functional (e.g., T1) for all values of temperature T, creating a dependence T1(T). The values contained in the tables that determine the
FIGURE 3. Results of the analysis for general factor number 2. The map, top of the figure, displays the time-averaged contribution of this factor in different locations in the eastern United States. The time-series diagram, bottom of the figure, displays the location-averaged contribution of this factor in different days. The smaller diagrams represent the dependence of this factor on the parametric variables: wind velocity, temperature, humidity, pressure, and ozone concentration. functionals are called parametric factors. In the fitting algorithm, they are treated similarly as the customary factor elements gip and fjp. The functionals Zp(Z), describing the dependence of concentration of PM2.5 on ozone concentration, are omitted from eq 4 for those days (in winter) when ozone data are not available. The fifth functional V is slightly different: it defines each of the p values V1 to Vp as a two-way table, depending on the two wind velocity components (Vx, Vy). A solution to the model in eq 3 is that which minimizes the sum-of-squares expression Q, where Q is defined as I
Q ) Q m + Qa )
J
∑∑(e /σ ) ij
ij
i)1 j)1 I J
2
+ Qa ) P
∑∑((x - ∑m ij
i)1 j)1
p)1
ijp
gip fjp)/σij)2 + Qa (5)
The values σij are uncertainties associated with each original data value xij where i ) 1, ..., I and j ) 1, ..., J. In the present work, each σij was specified by the expression σij ) 1 µg/m3 + 0.08xij. This expression is based on the work of Juntto and Paatero (31). The symbol Qa denotes the auxiliary sum of squares that is created by the auxiliary equations and used for regularization and normalization, to be discussed later. The numerical values obtained for Qm and Qa were typically 28 000 and 7800, respectively. The main Q value (Qm) approximately equals the number of data values to be fitted. This indicates that the specified standard deviation values σij approximately correspond to the level of unexplained residuals eij in this data set. In contrast to simpler factor analytic models, multilinear models have many local minima of the sum of squares expression Q. When the iteration is started from different random initial values, different solutions will be obtained. VOL. 37, NO. 11, 2003 / ENVIRONMENTAL SCIENCE & TECHNOLOGY
9
2463
FIGURE 4. Results of the analysis for general factor number 3. The map, top of the figure, displays the time-averaged contribution of this factor in different locations in the eastern United States. The time-series diagram, bottom of the figure, displays the location-averaged contribution of this factor in different days. The smaller diagrams represent the dependence of this factor on the parametric variables: wind velocity, temperature, humidity, pressure, and ozone concentration. It is necessary to perform multiple computations. From the results obtained, those solutions that have the lowest Q values are examined, and the solutions that offer a meaningful interpretation are chosen. Rotational ambiguity in a factor analysis can be limited by the presence of zeros in the factors. If there are a sufficient number of zero values in all left factors and all right factors, then the rotational status of the solution tends to be locked (32, 33). In the ME model, one might expect more problems because there are many factors so that it is likely there will not be enough zeros present to block all of the possible rotations. However, there are two special aspects of the ME model that tend to further minimize the possibility of rotations. (i) In the normal PMF, the presence of a single zero in the solution only blocks the net rotation in one direction (i.e., in the direction that would make the element in question 2464
9
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 37, NO. 11, 2003
negative). In the present ME model, a large number of spatial elements have been fixed to zero (the small elements around a central blob). These fixed-to-zero elements exert a double action; they prevent net rotations in both ways, both up and down. Thus, these zeros are more valuable than “spontaneous” zeros. (ii) The presence of products of parametric factors tends to eliminate rotational ambiguity in two-dimensional factor analytic models in the same way like the third dimension in PARAFAC models eliminates rotational ambiguity in threedimensional models. A detailed discussion of this principle goes beyond the present paper and will be analyzed in future work. Despite the reasons given for likely low rotational ambiguity, one should not assume that there should be no ambiguity whatsoever. Minor rotational adjustments to the solutions will be possible. However, it is believed that these adjustments
FIGURE 5. Results of the analysis for general factor number 4. The map, top of the figure, displays the time-averaged contribution of this factor in different locations in the eastern United States. The time-series diagram, bottom of the figure, displays the location-averaged contribution of this factor in different days. The smaller diagrams represent the dependence of this factor on the parametric variables: wind velocity, temperature, humidity, pressure, and ozone concentration. are so small that the essential characteristics of the factors will not change. It is of interest to know how the PM2.5 concentrations in urban locations differ from the concentrations in surrounding rural areas. The conceptual framework is that there is a component of PM2.5 that will be present equally in neighboring rural and urban locations. In addition, there will be recently released primary PM2.5 that occurs mostly near its origins, which typically are in the urban areas. In the model presented here, a number of factors were reserved to represent the components of PM2.5 that occur only in urban locations. This result was achieved using the following technique. For the last five factors (p ) 13, ..., 17), the spatial factor elements fjp were forced to be zero for all locations that were determined to be rural locations j. These factors are called urban factors. The other factors (p ) 1, ..., 12) are called general factors. This total number of factors and the division between urban and general was determined
through a large number of trials in which a number of indicator variables such as the Q value and the scaled residuals were examined as well as the interpretability of the final results. There are no direct methods for identifying the number of factors nor their separation into the two factor classes, and thus, a large number of runs were made to arrive at what was determined to be the best solution. The term urban excess denotes the increase of PM2.5 in urban areas with respect to rural background stations. One should note that there might be more urban excess than what is contained in the urban factors. The explanation is that the contributions rijp for the general factors (p ) 1, ..., 12) may well have larger values at urban locations than in rural ones. The urban factors only explain such part of the urban excess that has different behavior than the general factors with respect to time or with respect to parametric variables. The dilution effect is one example of such a different behavior: the concentration of locally emitted PM2.5 will be VOL. 37, NO. 11, 2003 / ENVIRONMENTAL SCIENCE & TECHNOLOGY
9
2465
FIGURE 6. Results of the analysis for general factor number 5. The map, top of the figure, displays the time-averaged contribution of this factor in different locations in the eastern United States. The time-series diagram, bottom of the figure, displays the location-averaged contribution of this factor in different days. The smaller diagrams represent the dependence of this factor on the parametric variables: wind velocity, temperature, humidity, pressure, and ozone concentration. lower if the wind is stronger, while the spread-out PM2.5 concentration will not be significantly influenced by the wind speed. The separation of the urban factors has not been without problems in these computations. The reason is that in some areas, particularly near some edges of the domain, there are large numbers of urban locations without adjacent rural stations. In such areas, the urban-only factors are able to explain all locations. This distorts the separation so that in these areas the urban-only factors tend to explain a significant fraction of the general PM2.5 too. The paradoxical result emerges that, in order to understand what is going on in the urban areas, more rural stations are needed.
Results In any customary factor analysis, the results in the form of the columns of factor matrices G and F contain the 2466
9
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 37, NO. 11, 2003
dependence of each factor on time and space. The columns can be plotted as such. The only consideration is how the column vectors are normalized since either G or F columns can be normalized. Experience has shown that, with the multilinear model given in eq 3, displaying columns of G or F can be quite misleading because the parametric variables may be partially collinear with time and also with the spatial coordinates. For example, the average humidity depends on time of year. Thus, the columns of G do not correspond well to the behavior of the factor with respect to time of year because indirect or hidden time dependence is present as the dependence of the factor on humidity. A better picture is obtained by using the contributions rijp, defined in eq 3. Each value of rijp ) mijpgip fjp indicates how much the pth factor contributes to the data value xij. The contributions have the dimension of the original data values (i.e., µg m-3).
FIGURE 7. Results of the analysis for general factor number 6. The map, top of the figure, displays the time-averaged contribution of this factor in different locations in the eastern United States. The time-series diagram, bottom of the figure, displays the location-averaged contribution of this factor in different days. The smaller diagrams represent the dependence of this factor on the parametric variables: wind velocity, temperature, humidity, pressure, and ozone concentration. Time-Averaged Contributions. The time-averaged contributions of each factor p are obtained so that the values rijp are averaged over time (i.e., over the first index i). Plotted in a map, the time-averaged contributions represent the spatial patterns of factors. Although the urban factors improve the fit of the data, their spatial patterns are disconnected and, hence, are not interpretable. Parametric Factors. The values mijp in eq 3 are dimensionless, as are the five functionals in eq 4. The values of a functional specify how much the contribution to PM2.5, by the factor in question, increases/decreases with different values of the parametric variable. For example, the contributions of factor 9 (Figure 10) increase with increasing humidity. Thus, the values of the functional S9(S) are above 1 for high values of humidity (e.g., S >10 g/kg) and below unity for lower values of humidity. For each functional, its average value has been normalized to unity for each factor. Typically, the values of functionals
are defined for 12 values of the argument, a parametric variable (humidity, temperature, etc.). For between argument values, the nearest defined value is used. If there is no dependence [e.g., factors 3 and 4 (Figures 4 and 5) do not seem to depend on the ozone concentration], then all values of the functional approximately equal 1 for the factor in question, Z3(Z) ≈ 1 and Z4(Z) ≈ 1. The functionals are plotted versus their arguments. Thus, the argument axis shows the dimension of the parametric variable (g/kg, °F, mb, ppb, or m/s). The wind velocity functionals Vp depend on two variables, wind velocity components Vx and Vy. It is difficult to display this dependence. In Figures 2-14, the square diagrams represent the dependence with axes marked x wind and y wind. Each diagram contains a grid of 11 by 11 dots. Each dot corresponds to a range of values for the x wind component and to a range of values for the y wind component. For example, the upper right-hand dot corresponds to x wind VOL. 37, NO. 11, 2003 / ENVIRONMENTAL SCIENCE & TECHNOLOGY
9
2467
FIGURE 8. Results of the analysis for general factor number 7. The map, top of the figure, displays the time-averaged contribution of this factor in different locations in the eastern United States. The time-series diagram, bottom of the figure, displays the location-averaged contribution of this factor in different days. The smaller diagrams represent the dependence of this factor on the parametric variables: wind velocity, temperature, humidity, pressure, and ozone concentration. and y wind stronger than 4 m/s (i.e., strong winds traveling toward the northeast). As another example, the centermost dot corresponds to “no wind”, that is, both wind components below 0.3 m/s. In the diagram, the dots are placed at the center of the range of values they represent. The area of each dot represents the value of the parametric factor Vp corresponding to the range of wind velocity values. As an example, for some factors, the centermost dots in the diagram tend to be large. This pattern indicates that factor p predicts large concentrations for low wind speeds. In some diagrams, the dots are small in the lowest rows, that is, y wind < 0. This indicates that strong northern winds (winds blowing toward south) tend to carry low concentrations. The upper right-hand corners of Figures 2-14 contain another small diagram with axes marked x wind and y wind. Two arrows are shown in each of these diagrams. The end point of the thin arrow indicates the location of the center of mass of the dots in the wind factor diagram. This 2468
9
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 37, NO. 11, 2003
location reflects those wind speeds where factor p assigns largest concentrations (e.g., if the dots are large on the upper half of the diagram, then the thin arrow will point up, indicating that on the average the concentrations are large when winds are going toward north). The thick arrow represents the average flux density (averaged over all locations and all times) corresponding to factor p (see below). Flux Density of PM2.5. It is useful to know how each factor explains the movement of PM2.5 aerosol across the region. In that way, the relationships between sources and affected regions may be better understood. Inspection of the parametric wind velocity factors V1(Vx,Vy) to Vp(Vx,Vy) may sometimes be misleading: a high value in the wind factor table may be of little significance if the wind vector corresponding to the large value only occurs infrequently. A better picture is obtained by considering the flux density of PM2.5. The flux density vector φ is defined as the product of concentration and velocity. The flux density contribution vector φijp of factor
FIGURE 9. Results of the analysis for general factor number 8. The map, top of the figure, displays the time-averaged contribution of this factor in different locations in the eastern United States. The time-series diagram, bottom of the figure, displays the location-averaged contribution of this factor in different days. The smaller diagrams represent the dependence of this factor on the parametric variables: wind velocity, temperature, humidity, pressure, and ozone concentration. p to data point xij is φijp ) (rijpVijx, rijpVijy). By averaging the components of φijp with respect to time and location, the average (overall) flux density vector of factor p is obtained. By only averaging with respect to time, one obtains the flux density map of each factor p. Figures 2-14 show the flux densities as directional arrows whose lengths are proportional to the magnitudes of time-averaged flux density vectors. The proportionality constant is the same in all maps. The flux density vectors in the pth map show the spatial details of the movement of that fraction of PM2.5 that corresponds to factor p. The dimension of the flux density vector is µg/m3 × m/s, which simplifies to µg m-2 s-1. This dimension suggests that average flux density indicates the average mass that traverses a surface of 1m2 in 1 s, when the surface is vertical and perpendicular to the average flux direction. In this definition, averaging has been done over the period of measurements that in this case is 1 yr. The longest arrows, as seen for factors 10 and 12, correspond to approximately 20 µg m-2 s-1. These
flux density maps can provide additional insights into the source/receptor relationships. The flux of PM2.5 from a specific region is a measure of the net amount of PM2.5 that is transported from the region (net amount is the difference of amounts that are transported out and in). In principle, the flux is obtained by integrating the flux density through a closed surface that encloses the region. The flux density is well-known near the surface because both PM2.5 concentration and wind velocity are known. If wind velocity is known throughout the mixing layer, then flux transported by the mixing layer can be calculated because it may be assumed that PM2.5 concentration is constant throughout the layer. A quantitative estimate is thus obtained for the PM2.5 emission from the enclosed region. However, the estimation of the fluxes is beyond the scope of this current study. Numerical Results. The spatial patterns, parametric factors, and flux densities for the 12 general factors are shown VOL. 37, NO. 11, 2003 / ENVIRONMENTAL SCIENCE & TECHNOLOGY
9
2469
FIGURE 10. Results of the analysis for general factor number 9. The map, top of the figure, displays the time-averaged contribution of this factor in different locations in the eastern United States. The time-series diagram, bottom of the figure, displays the location-averaged contribution of this factor in different days. The smaller diagrams represent the dependence of this factor on the parametric variables: wind velocity, temperature, humidity, pressure, and ozone concentration. in Figures 2-13. The results for one of the five urban factors are presented in Figure 14. The other urban factors are presented in the Supporting Information for this paper. From the spatial patterns, it can be seen that the factors are generally concentrated in different subregions of the spatial domain. This is not always the case though since the parametric variables can cause separation of factors. For example, factors 1 and 2 (Figures 2 and 3) are generally centered over the south-central states, but the factors are associated with different wind speed, direction, and specific humidity. Factor 1 shows a net N-NW wind while factor 2 has very low wind speeds and essentially no average direction. Also, factor 1 is associated with specific humidity less than 3 g/kg whereas factor 2 is associated with specific humidity in the range of 6-10 g/kg. Similarly, examination of factors 8 and 11 (Figures 9 and 12) shows that these factors cover similar geographical areas, but the two factors have different 2470
9
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 37, NO. 11, 2003
temperature dependencies. Factor 8 increases for higher temperatures (summer) while factor 11 is higher when temperatures are low (winter). Recent factor analysis studies of sources in the northeastern United States (14-16) have found two sources that can be related to midwestern coalfired power plants with one representing maximum photochemical conversion of sulfur dioxide to sulfate and one representing minimum conversion. Factor 8 may be comparable to the maximum photochemical factor while factor 11 is the minimum activity factor. Factor 6 (Figure 7) shows a dependence on low barometric pressure and higher temperatures while factor 9 depends on high relative humidity but low temperature. Factor 7 is high for low wind speeds and low humidity, suggesting that it may represent stagnant winter conditions. Examination of the flux density maps provides additional insight into the interpretation of the spatial factors. For ex-
FIGURE 11. Results of the analysis for general factor number 10. The map, top of the figure, displays the time-averaged contribution of this factor in different locations in the eastern United States. The time-series diagram, bottom of the figure, displays the location-averaged contribution of this factor in different days. The smaller diagrams represent the dependence of this factor on the parametric variables: wind velocity, temperature, humidity, pressure, and ozone concentration. ample, examination of factor 10 (Figure 11) shows the highest flux densities of PM2.5 in the area of northeastern Ohio. There are significant sources of particulate precursors such as sulfur dioxide around the Ohio River Valley. However, it takes time to convert the sulfur dioxide into particulate sulfate. It can be seen that the flux density increases for the sampling locations as one moves from southwest to northeast across Ohio. Similar patterns can be observed in factors 9, 10, and 12 (Figures 10, 11, and 13, respectively) where factors 9, 12, and 10 cover areas from west to east. In each case, large sulfur dioxide sources exist to the south that may give rise to the increasing flux across the locations from south to north. Factor 4 (Figure 5) shows the flux moving into the analysis domain from the source region to the south and west (eastern Texas and central Louisiana). The flux densities in this figure appear to be anomalous since there are sharp changes in direction over short distances. Such behavior could be the result of the influence of local topography on the prevailing
wind directions or errors in the original meteorological data from the single meteorology measurement station in this area from which all of the data were drawn. Factors 8 and 11 (Figures 9 and 12, respectively) show the transport along the eastern coastal plain showing the flux from the midwestern power plants in factor 8 (Figure 9). The effects of sources around New York City on the downwind New England region appear in factor 11 (Figure 12). Factor 3 also suggests transport from the midwest across North and South Carolina possibly from the power plants in the area of the Tennessee Valley. Factors 6 and 9 (Figures 7 and 10, respectively) show transport from south to north in the northwestern corner of the analysis domain with the amount of PM2.5 increasing in the vicinity of Chicago but not showing strong fluxes into or out of this subregion. As noted above, factor 9 appears to be a winter (colder weather) factor, and factor 6 is related to low barometric pressures and higher temperatures. The flux VOL. 37, NO. 11, 2003 / ENVIRONMENTAL SCIENCE & TECHNOLOGY
9
2471
FIGURE 12. Results of the analysis for general factor number 11. The map, top of the figure, displays the time-averaged contribution of this factor in different locations in the eastern United States. The time-series diagram, bottom of the figure, displays the location-averaged contribution of this factor in different days. The smaller diagrams represent the dependence of this factor on the parametric variables: wind velocity, temperature, humidity, pressure, and ozone concentration. density pattern appears to radiate from the St. Louis area and dissipate across the midwest. Factor 5 (Figure 6) appears to be related to high humidity and low wind speeds and may represent stagnant Bermuda high episodes that produce significant amounts of secondary aerosol mass. Factor 7 (Figure 8) also involves low wind speeds and low humidity and suggests the influence of stagnant, cold winter air.
Performance of the Multilinear Model This section describes special techniques used to derive the more interpretable solutions to eq 3 presented above as well as the evaluation of the model predictions for high-concentration observations. Smoothing of Parametric Factors. Various multilinear models tend to exhibit instability as more error is introduced in the data. That is, the factors tend to assume shapes without 2472
9
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 37, NO. 11, 2003
physical reasonableness, such as random-looking oscillations between large and small values. This phenomenon appears similar to such instability that is observed when performing regression or least-squares fitting when the basis vectors are almost collinear. A well-known solution to this problem is called regularization or ridge regression. Additional terms are introduced in the least-squares expression so that these terms tend to damp the meaningless oscillations while introducing as little bias as possible. Instability was observed with the initial solutions of the PM2.5 multilinear model. As an example, the shapes of temperature or humidity factors could contain two strong maxima and two deep minima. Smoothing equations were then introduced in the model. These equations specify that the first and/or second differences of all parametric factors should be equal to zero. The standard deviations assigned to these equations were adjusted so that the smoothing equations
FIGURE 13. Results of the analysis for general factor number 12. The map, top of the figure, displays the time-averaged contribution of this factor in different locations in the eastern United States. The time-series diagram, bottom of the figure, displays the location-averaged contribution of this factor in different days. The smaller diagrams represent the dependence of this factor on the parametric variables: wind velocity, temperature, humidity, pressure, and ozone concentration. for one parametric factor matrix typically contributed 100 units to the value of Q (24). In this way, almost all of the parametric factors assumed plausible shapes. At the same time, the main Q value (Qm) of the fit increased, but this increase was relatively small, on the order of 1000 units. This increase is less than the increase of Q if one factor is omitted from the model. The introduction of smoothing equations brings a certain aspect of arbitrariness to the model. This same arbitrariness has been the subject of much debate when considering ridge regression. However, it is well-known that useful results are obtained if this arbitrariness is accepted. Smoothing the G (Time) Factors. As discussed above, there is collinearity between time and the parametric variables. For this reason, some variation of PM2.5 concentration may be explained either by the parametric factors or by variation of G values. It is more useful to have the
parametric factors explain the variation of PM2.5, because then it may be possible to understand the underlying chemical or physical processes. For this reason, strong smoothing was applied to the columns of G. This smoothing attempts to enforce day-to-day smooth behavior of the columns of G. Although this smoothing appears to produce some useful results, its effect was not as strong as anticipated. The time factors and also the corresponding locationaveraged contributions still show strong variation whose origins are not clear. Thus, the usefulness of the smoothing of G was questionable. Limiting Spatial Extent of Individual Factors. The spatial shape of most factors consisted originally of a clearly limited high-concentration region or “blob” plus low (nonzero) concentration values scattered all around the domain. The widespread, noise-like stray concentrations have nothing to do with the physical or meteorological reasons that have VOL. 37, NO. 11, 2003 / ENVIRONMENTAL SCIENCE & TECHNOLOGY
9
2473
FIGURE 14. Results of the analysis for urban factor number 1. The map, top of the figure, displays the time-averaged contribution of this factor in different locations in the eastern United States. The time-series diagram, bottom of the figure, displays the location-averaged contribution of this factor in different days. The smaller diagrams represent the dependence of this factor on the parametric variables: wind velocity, temperature, humidity, pressure, and ozone concentration. created the central high-intensity area. It is believed that random fluctuations in the data are the cause of noise-like spatial coefficients. Thus, the result will be more informative if the noise coefficients are eliminated. For this reason, the final results were computed in the following way. The spatial factor maps were inspected and the clear blobs were determined. Latitude-longitude limits in the form of rectangular “boxes” were specified around the blobs, allowing ample margins between the box and the blob. No spatial coefficients were allowed to be nonzero outside the boxes. For each factor, the box was specified individually. There are a few factors that have a wide spatial distribution without any clearly defined blob. For those factors, no spatial constraints or especially mild constraints were specified. In the case of the urban factors, spatial constraints could not be imposed because of the irregularity in the spatial distribution of urban areas. 2474
9
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 37, NO. 11, 2003
The spatial constraints caused the stray spatial components to decrease as well inside the box so that the distinction of the blob and the surrounding flat domain became even clearer. This result supports the interpretation that the stray components were originally caused by a time-space rotation that was “undone” by specifying the limiting boxes. The increase of Q, caused by the spatial limits, is small (less than 2000 units). Also, no unnatural details are seen. On the other hand, a parsimonious solution (containing as few fitted values as possible) is generally preferable. Thus, it is suggested that the latitude-longitude-limited solution is preferable to the original solution that was computed without spatial limits. The question of spatial constraining is connected with the question of analyzing as large domains as possible. To avoid edge effects, it would be useful to analyze large domains within a single model. A large domain means that a large
TABLE 1. Summary of Goodness of Fitting High Valuesa no. of cases
a
outcome
L ) 50
L ) 60
L ) 70
alarm OK alarm failed false alarm
23 29 6
5 9 0
2 6 0
L is the limit.
number of factors is needed in order to represent the different conditions in different parts of the domain. If each factor is allowed to carry stray components all around the domain, then the proportion of noise-like terms will increase in comparison to the structured part of the model, eventually weakening the structure of the model and preventing the analysis of large domains. It is believed that with increasing domain size, the spatial constraining will become a must so that no successful detailed analysis is possible without spatial constraining of stray components. Goodness of Fit of High Values. Factor analytic techniques are basically geared for analyzing the typical or average behavior. Thus, it is necessary to check the quality of the fit of the highest concentrations. This review of these high points is important because the 24-h NAAQS for PM2.5 focuses on the highest concentrations, specifically on the annual 98th percentile values. High values also have a significant influence on the annual arithmetic mean, which is the subject of the other PM2.5 air quality standard. Table 1 illustrates how PM2.5 values above 50, 60, and 70 µg m-3 are fitted (the allowed limit for the 24-h NAAQS is 65 µg m-3). The row marked “alarm OK” indicates the numbers of successfully identified high-concentration cases where both the data value and the corresponding fitted value exceed the limit (L) specified on the top line. The other rows indicate numbers of failed cases. “Alarm failed” means cases where the data value exceeded the limit while the fitted value did not (i.e., no alarm was sounded). “False alarm” indicates cases where the fitted value exceeded the limit without reason (i.e., when the data value did not exceed the limit). In a successful fit, the numbers of failures should be small in comparison to the numbers of successes. Table 1 shows that the highest values are not especially well fitted. This is a common problem in least-squares regression. As more parameters are included in the model, the smaller will be the degree of underfitting, but it will always be the case that the highest values will necessarily be underfitted. For this model, there are 23 + 29 ) 52 data values above 50 µg m-3, of which 23 (44%) get a fit above 50. There are 14 data values above 60 µg m-3, of which 5 (36%) get a fit above 60, and 8 values above 70 µg m-3, of which 2 (25%) get a fit above 70. Utility of Complex Model Analysis. Although this analysis is complex in terms of its implementation and its interpretation, the ability to see the patterns of variation of PM2.5 “at a glance” is valuable. A deeper understanding of PM2.5 distributions cannot be obtained without treating each major chemical constituent of PM2.5 separately. A multilinear analysis of the aggregate PM2.5 mass cannot offer such deep understanding. However, the entire PM2.5 data set is huge, and it is not trivial to extract essential information from such a data set. The present analysis offers a compact view of the PM2.5 mass concentration fields. The parametric factors provide indications of the reasons behind the observed concentrations. The temperature dependency provides indication of seasonality of the measured mass. The urban factors indicate that their concentration is subject to dilution by wind and hence is due to local sources.
The flux density maps show at a glance the major transport patterns of PM2.5 aerosol. For example, they show the increase in particle mass as the air moves from the regions of the gaseous precursor (SO2) and is converted in sulfate. Recognition of this combination of transport and transformation is necessary in order that control procedures can be targeted to significant causes of PM2.5 aerosol. The ME model was able to provide a good fit to only 5 of the 14 highest values. However, since this model is fitting the data over a wide spatial domain, uniformly good fits to local high values would not be expected. Thus, this approach would seem more satisfactory for exploring spatial patterns of average concentrations related to the annual NAAQS rather than predicting occasional high single-day values that would lead to violations of the 24-hour PM NAAQS.
Acknowledgments The work at Clarkson University and the University of Helsinki was supported by the U.S. Environmental Protection Agency under Contract ID-6632-NTEX. Although the research described in this paper has been funded in part by the United States Environmental Protection Agency, the views expressed herein are solely those of the authors and do not represent the official policies or positions of the U.S. EPA.
Supporting Information Available Additional information including the four figures describing the other urban factors and additional details of the data processing of the PM2.5 data. This material is available free of charge via the Internet at http://pubs.acs.org.
Literature Cited (1) U.S. EPA. Code of Federal Regulations, Part 50, Title 40, 1997; Fed. Regist. 1997, 62, 38651. (2) Blifford, I. H.; Meeker, G. O. Atmos. Environ. 1967, 1, 147-157. (3) Prinz B.; Stratmann, H. Staub-Reinhalt Luft 1968, 28, 33-39. (4) Hopke, P. K.; Gladney, E. S.; Gordon, G. E.; Zoller, W. H.; Jones, A. G. Atmos. Environ. 1976, 10, 1015-1025. (5) Gaarenstroom, P. D.; Perone, S. P.; Moyers, J. P. Environ. Sci. Technol. 1977, 11, 795-800. (6) Thurston, G. D.; Spengler, J. D. Atmos. Environ. 1985, 19, 9-26. (7) Henry, R. C.; Kim, B. M. Chemom. Intell. Lab. Syst. 1989, 8, 205-216. (8) Henry, R. C.; Lewis, C. W.; Collins, J. F. Environ. Sci. Technol. 1994, 28, 823-832. (9) Anttila, P.; Paatero, P.; Tapper, U.; Jarvinen, O. Atmos. Environ. 1995, 29, 1705-1718. (10) Chueinta, W.; Hopke P. K.; Paatero, P. Atmos. Environ. 2000, 34, 3319-3329. (11) Huang, S.; Rahn, K. A.; Arimoto, P. Atmos. Environ. 1999, 33, 2169-2185. (12) Lee, E.; Chan, C. K.; Paatero, P. Atmos. Environ. 1999, 33, 32013212. (13) Paterson, K. G.; Sagady, J. L.; Hooper, D. L.; Bertman, S. B.; Carroll, M. A.; Shepson, P. B. Environ. Sci. Technol. 1999, 33, 635-641. (14) Poirot, R. L.; Wishinski, P. R.; Hopke, P. K.; Polissar, A. V. Environ. Sci. Technol. 2001, 35, 4622-4636. (15) Polissar, A. V.; Hopke, P. K.; Poirot, R. L. Environ. Sci. Technol. 2001, 35, 4604-4621. (16) Song, X. H.; Polissar, A. V.; Hopke, P. K. Atmos. Environ. 2001, 35, 5277-5286. (17) Paegle, J. N.; Haslam, R. B. J. Appl. Meteorol. 1982, 21, 127-138. (18) Obled, C.; Creutin, J. D. J. Appl. Meteorol. 1986, 25, 1189-1204. (19) Chang, J. J.-C.; Mak, M. J. Atmos. Sci. 1993, 50, 613-630. (20) Fraeddrich, K.; Pawson, S.; Wang, R. J. Atmos. Sci. 1993, 50, 3357-3365. (21) Chemg, X.; Nitsche, G.; Wallace, J. M. J. Climate 1995, 8, 17091713. (22) Preisendorfer, R. W. Principal Component Analysis in Meteorology and Oceanography; Elsevier: Amsterdam, 1988; 425 pp. (23) Peterson, J. T. Atmos. Environ. 1970, 4, 501-518. (24) Eder, B. K.; Davis, J. M.; Bloomfield, P. Atmos. Environ. 1993, 27A, 2645-2668. (25) Henry, R. C. J. Air Waste Manage. Assoc. 1997, 47, 216-219. VOL. 37, NO. 11, 2003 / ENVIRONMENTAL SCIENCE & TECHNOLOGY
9
2475
(26) (27) (28) (29) (30)
Henry, R. C. J. Air Waste Manage. Assoc. 1997, 47, 220-225. Henry, R. C. J. Air Waste Manage. Assoc. 1997, 47, 226-230. Paatero, P. Chemom. Intell. Lab. Syst. 1997, 37, 23-35. Paatero, P. J. Comput. Graph. Stat. 1999, 8, 854-888. Paatero, P.; Hopke, P. K. Chemom. Intell. Lab. Syst. 2002, 60, 25-41. (31) Juntto, S.; Paatero, P. Environmetrics 1994, 5, 127-144. (32) Anderson, T. W. An Introduction to Multivariate Statistical Analysis, 2nd ed.; Wiley-Interscience: New York, 1984; 704 pp.
2476
9
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 37, NO. 11, 2003
(33) Paatero, P.; Hopke, P. K.; Song, X. H.; Ramadan, Z. Chemom. Intell. Lab. Syst. 2002, 60, 253-264.
Received for review September 30, 2002. Revised manuscript received March 24, 2003. Accepted March 27, 2003. ES0261978