Environ. Sci, Technol. 1994,28, 1023-1030
Multiple Site Receptor Modeling with a Minimal Spanning Tree Combined with a Neural Network Dletrich Wienke,? Nlng Gao, and Philip K. Hopke’ Department of Chemistry, Clarkson University, Box 5810, Potsdam, New York 13699-5810
A combination of two pattern recognition methods has been developed that allows the generation of geographical emission maps from multivariate environmental data. In such a projection into a visually interpretable subspace by a Kohonen self-organizing feature map, the topology of the higher dimensional variables space can be preserved, but parts of the information about the correct neighborhood among the sample vectors will be lost. This loss can partly be compensated for by an additional projection of Prim’s minimal spanning tree into the trained neural network. This new environmental receptor modeling technique has been adapted for multiple sampling sites. The behavior of the method has been studied using simulated data. Subsequently, the method has been applied to mapping data sets from the Southern California Air Quality Study (SCAQS). The projection of 17 chemical variables measured at up to eight sampling sites provided a two-dimensional, visually interpretable, geographically reasonable arrangement of air pollution sources in the South Coast Air Basin.
Introduction For the past 2 decades, source-receptor relationship models have been well-studied and developed into widely recognized, important means for air quality study and air quality control policy development and implementation (1). Receptor modeling methods employ data sets generated by ambient air quality monitoring networks that provide ambient concentrations of a number of chemical species. If there are multiple monitoring sites in the network, the data can be arranged as a three-dimensional data block, Crimp Such a data block contains measurements of m physical and chemical properties in n samples collected during different sampling periods a t r multiple sampling sites. Recently developed three-dimensional modeling methods such as three mode factor analysis (TMFA) (2)and direct trilinear decomposition (DTDMR) ( 3 ) do not provide any geographic information of the identified emission sources. Another problem with these eigenvalue-based methods is that they require synchronized sample collections a t all of the sampling sites. There are often problems of missing samples or partial data either that cannot be used or that require estimates for the missing values. An alternative method has been developed that incorporates meteorological information (wind direction, wind speed, etc.) into its algorithm and, therefore, can provide information on the locations of possible contributing sources. This technique is potential source contribution function (PSCF) analysis ( 4 , 5 ) . A PSCF analysis combines
* Author to whom correspondence should be addressed; e-mail address:
[email protected]. Present address: Catholic University of Nijmegen, Laboratory for Analytical Chemistry, Toernooiveld 1,6525 ED Nijmegen, The Netherlands. 0013-936X/94/0928-1023$04.50/0
0 1994 American Chemical Society
the ambient concentrations of a single chemical species with meteorologicaldata in the form of backward air parcel trajectories and is able to yield a grided map identifying the possible geographic locations of emission sources for this particular species. Currently, there is no receptor modeling method that can analyze a three-dimensional data block as a whole and yield both source profiles and geographical information on the identified emission sources. A technique that could possibly overcome these disadvantages has been recently introduced by Wienke and Hopke (6, ‘7) for the evaluation of single receptor site measurements. This new receptor site modeling technique is based on the combination of two well-established but distinct pattern recognition techniques. By the projection of Prim’s minimal spanning tree (8)into a Kohonen selforganizing neural network (9, 10) that has been trained with measurements from a single receptor site, the geographic direction, the emission level, and the emission profile for several sources have been obtained (5, 6). In the present work, the use of Prim’s minimal spanning tree projected into a Kohonen self-organizing feature map is extended to the simultaneous evaluation of multiple receptor sites measurements for a number of chemical species. The results of the mapping of a data set from the Southern California Air Quality Study (SCAQS) (11)are compared with results previously obtained using principal component analysis (PCA) and PSCF analysis (12).
Methodology for Multiple Site Data Figure 1shows a schematic diagram of a model system and its projection into a rectangular array of neural network nodes. The method assumes that there are r receptor sites, x l , with 1 = 1 ... r, distributed over a given geographical area. An unknown number,p, of air pollution sources, Sp, with unknown geographic locations emit patterns of chemical constituents that are determined as m particulate or gaseous species. Neighboring pollution sources and meteorological factors (wind speed, wind direction, wind duration, rain, etc.) change the pure source profile a, to a multivariate vector of sample concentrations c,. This variation of a pure source profile, akj with k = 1 ... p and j = 1 ... m, during its transport in atmosphere can be considered as being weighted by the mass contribution of the particular source, bjk, according to
where eij represents noise. If written in a matrix notation that is transposed to the classical receptor model (eq l), the matrix Apmof p source profiles is weighted by a source contribution matrix Bnpaccording to
If the n samples were taken simultaneously a t r receptor Environ. Sci. Technol.,Vol. 28, No. 6,
1994
1023
Flgure 1. Model arrangement of a sampling network of five sites (X) situated between three emission sources (S). The arrows symbolize distinct sample vectors, c,,,, that have been emitted by a source as source profile, weighted by the meteorological conditions, and finally transported through the atmosphere to a sampling site. The geographical area can be superimposed onto the nodes of an internal Kohonen neural network.
sites, all of the individual r data matrices C,, can be combined to a three-dimensional data block (tensor),C,,,. However, this rather ideal case requires synchronized measurements between all r sampling sites. This constraint can only be fulfilled by a fully synchronized sampling network. If not, the data cube would be incomplete in several of the i = 1,...,n rows. The present model does not require synchronized measurements. For the present neural network approach, the individual r data matrices, C,,,, with distinct sample numbers nl, are combined to a two-dimensional total sample data matrix CN, with r
N =Enl
(3)
1=1
The total number of samples, N , from the r sampling sites can be arranged in this matrix in an arbitrarily chosen order. The present model needs a further assumption: As the number of identical species analyzed at all of the sampling site increases, the probability increases that a given source profile will become mixed with species from neighboring emission sources. With this increasing probability of specific weighing through contamination from nearby sources, the probability also increases that this geographical information of source proximity will become encoded as chemical information in the sample vectors. To illustrate this assumption, assume a single sampling site is affected by two power plants and a lead smelter arranged in a triangle, where the lead smelter is located closer to one of the two power plants. If only the combustion products SO2 and NO, are measured and both power plants emit an identical ratio of them (for example, from the use of the same type of coal), they would be indistinguishable without using any additional geographical or meteorologicalinformation. However, the additional analysis of Pb could solve this problem because the geographic information is coded by a distinct mixing of both source profiles of the power plants with the profile of the lead smelter. The arrows for the sample vectors, c,, in Figure 1 have different orientation, length, and thickness. These differences symbolize the distinct weighing of the source profiles. The more that distinct 1024
Environ. Sci. Technol., Vol. 28, No. 6, 1994
meteorological situations exist, then more of the diverse geographic information is coded in these chemical sample vectors by distinctively created mixtures of source profiles. For the internal representation of the present neural network model in the computer, the geographical area in Figure 1 (bottom) is thought to be superimposed onto a two-dimensional Kohonen neural network containing u neurons (Figure 1,top). This superposition creates a twodimensional data projection that can be compared with a real geographical map. In Figure 1, a square of 8 X 8 neurons has been chosen (u = 64). Alternative twodimensional neural arrangements can be, for example, a circle, a rectangular or an ellipse. In reality, the u neurons are only an index expression for t = 1,..., u weight vectors, w, of the same length m as the N sample vectors. The essence of Kohonen’s algorithm is a repeated comparison of the i = 1, ..,, N sample vectors, ci,, with these t = 1,...,u weight vectors, w,, using a distance metric suchas the Euclidian distance or the correlation coefficient. During each single comparison, a winner, t , among all u weight vectors can be found that has the highest degree of similarity (smallest distance) to Cim. After that, the j = 1,...,m elements, wtj(old),of the winning weight vector t are adapted a small step closer to the elements C i j of the actual ith input sample vector using the Kohonen learning rule
The learning rate, 7, is chosen as a positive real number with a value less than 0.1. Together with the winner t, further weight vectors are modified around the winner within a limited topological neighborhood of radius R. As topological neighborhoods, squares, hexagons, or circles have been used (5,6,9,10). In the beginning, R is large but decreases slowly with the training time. After this adaption, the subset of weight vectors within R became slightly more similar to the actual sample input vector in terms of the distance metric employed. A comparison of all N sample vectors with all u neural weight vectors and their modifications is called one epoch. This process is repeated over many epochs yielding a selforganized behavior of the N samples in the low-dimensional neural array. The samples form a visual topological structure due to their arrangement in the original mdimensional variables space. The final result of this training process is the collection of subsets of sample vectors aggregated to some of the weight vectors. The number of these “loaded” neurons is u with u less than or equal to u. The trained weight vector for such an aggregated cluster of input vectors comes numerically very close to their mean vector. The remaining u-u weight vectors do not have any assoicated input vectors. Their corresponding neurons are defined as “unloaded neurons. A t this point, only an unsupervised pattern recognition result similar to those obtainable from hierarchical cluster analysis, nonlinear mapping, or principal component scores plots has been obtained. However, despite the interesting fact that the topology of the m-space will be preserved by this special type of projection, the quantitative interpretation of such a Kohonen map remains difficult especially when information about the correct neighborhood disappears during such a projection into a 2-D plane. The
interpretability of these results can be improved by adding a procedure that allows the simultaneous visualization of global relationships between loaded neurons in the map. The minimal spanning tree (8) is now added in order to provide the visualization of global relationships between loaded neurons in the map. Wienke and Hopke (5)provide a complete description of the combined algorithm. Details of the calculation and the choices for the training parameters are given by Wienke and Hopke (6).
Experimental Section This multiple receptor site modeling method was tested in a computer simulation experiment and for areal ambient data set from the SCAQS'87 study. Computer Simulation Experiments. The model system is shown in Figure 1. This system serves as the basis for the simulations. There are p = 3 sources with source profiles, a1 = [lo, 10, 1, 2, 2, 11,a2 = 11, 1, 1, 10, 10,101, and a3 = [2,5,10,1,1,2] for m = 6 species. These profiles were weighted according to eq 2 with several ~ different different source contribution matrices, B Nand associated error terms EN,^. Particular choices of values ~ made based on distinct weather situations in B N were symbolizedas arrows of different length (duration of wind), line thickness (wind speed), and orientation (wind direction). Large values in only one column, for example, represent a 'directed storm'. Filling the matrix with small random numbers codes a weak, rather diffuse and not well-directed air movement. According to the closeness of a particular sampling site to a specific source, the weights, Ank3and Bnk3were increased. Each of the r = 5 sampling sites have a block of nl = 21 sampling periods giving N = 125 data vectors, c16. Errors were added as Gaussian-distributed random numbers with the mean value of 0. The size of the noise term Ejvmwas varied from 0 % to 20 % of the maximum value of C N ~ . Description of Experimental Data. The data set used in this modeling study has been obtained from the Southern California Air Quality Study (SCAQS'87) (11) that contains ambient air concentrations of 17 chemical species analyzed from gaseous and fine fraction particulate (PM2.5) samples collected using the SCAQS samplers (13) during 4-,5-, or 7-h sampling intervals on 11 days in the summer of 1987 a t sampling sites in Anaheim, Azusa, Burbank, Claremont, downtown Los Angeles, Hawthorne, Long Beach, and Rubidoux. The measured species included "03, NO3-, "3, NH4+, S02, S042-, C1-, Na, Mg, Si, P, S, K, Ca, Ti, Fe, and Zn. The SCAQS samplers were designed and built for SCAQS and are discussed in Wolff et al. (13). The fine fraction anions, Sod2-, Nos-, C1-, and SO2 analyzed as S042were determined using ion chromatography (IC). The fine fraction ammonium species NH4+ and NH3 analyzed as NH4+ were measured using colorimetry. Trace elements were determined using X-ray fluorescence (XRF). For the IC and colorimetry samples, the missing values were replaced with the mean concentration values of the respective species. For the XRF samples, the below detection limit concentration measurements were replaced with the values obtained by multiplying the reported detection limit by a uniformly distributed random number between 0 and 1. The missing values were also replaced by the mean concentrations of the species. A detailed study of the sources, transportation, transformation, and
fate of gaseous and particulate species measured a t the Burbank, Claremont, and Rubidoux sampling sites is presented in Gao et al. (13). Data Preprocessing. The m columns of the raw N X m data matrix of N = 440 samples and m = 17 species were scaled to unit length to give the different concentration levels of the species a comparable contribution (single scaling). In addition, for the training of the Kohonen network, the N rows were scaled to unit length (double scaling). In this way the sample vectors were classified with respect to their relative concentration ratios of the species but not according their absolute concentration level. A detailed discussion of the effect of several scaling techniques and similarity measures on the results of the combined mapping techniques are given by Wienke and Hopke (5).
Results and Discussion Simulation Experiment. The first simulation experiment was carried out without additional noise (E125,3 = 0). Four large groups of aggregated objects can be found in the four corners of the map of the trained 5 X 5 Kohonen neural network shown in Figure 2a. However, the minimal spanning tree shows that rather three large clusters are present. Two clusters are closer to each other (upper left corner and lower half of the map). A quantitative edge length analysis within the minimal spanning tree shows that the three clusters form a topological structure of a rather irregular triangle. A semiquantitative visualization is provided by the line thicknesses in Figure 2a,b. This result agrees with the underlying geographic model for the three emission sources in Figure 1. The medians of the concentration profiles for the clusters (Table 1)show that the clusters indeed represent the simulated emission sources. The addition of an increasing amount of noise up to 20 3'% does not significantly disturb the correct topological arrangement of the three clusters in the Kohonen map (Figure 2b). Their concentration profiles (Table 1)reflect the emission pattern of the sources. However, small interstitial clusters are observed that form a geographic transition area, influenced simultaneously by several emission sources. These transition areas connect the clusters with each other by a simultaneous coupling over the whole neural network area. This simultaneous coupling is provided mathematically by the topological radius, R , that yields a correct overall image of the geographical reality in such a map. In a third simulation experiment, a 'tornado' has been simulated by a weighing of the source profile matrix, A3,6, with a completely random matrix, B,, for the source contributions. This kind of uniform mixing is not a likely meteorological event. By utilizing a number of sampling intervals over time, contributions from unusual events will be usually averaged out over a long measurement time at an air pollution sampling site. However, even in this extreme case, three clusters were found. Only two of them showed the correct averaged profiles of the two distinct emission sources. The third cluster represented a transition area. However, this result suggests that Prim's minimal spanning tree mapped into a Kohonen neural network is a robust combination even for the analysis of data from extreme, unlikely meteorological conditions. Results for the SCAQS Data. A subset of the total 440 X 17 SCAQS data matrix has recently been characEnviron. Sci. Technol., Vol. 28, No. 6, 1994
1025
c-9 -py &\ ‘L3J1
(37)
i/:
Figure 2. Map of a 125 X 6 matrix of artificial data, C125,6, that results from the model shown in Figure 1 with 0 % added error (a, top) and 20% added error (b, bottom) using a 5 X 5 Kohonen neural network. The loaded neurons are connectedbased on a Prim’sminimal spanning tree. [Training parameters (5,6): matrix is scaled to unit length by row, Euclidian distance measure, Ro = 9, Re = 3, nel = 3000, nez = 1000, kl = 0.1, kz = 0.008.1 Numbers in the loaded neurons equal the number of associated samples. Three line thicknesses symbolize the three categoriesof the tree’s edge lengths (thick = ‘close’, medium = ‘normal’, thin = ‘distant’).
Table 1. Simulation Results for Increasing Noise for Model System with Five Sampling Sites and Three Sources added cluster no. noise 0%
5%
10%
20%
i ii iii i ii iii i ii iii i ii iii
median profile of the cluster 0.21 0.20 0.18 0.18 0.55 0.56
source profile 0.16 0.22 0.55
a1,6
0.62 0.22 0.13
0.63 0.45 0.15
0.14 0.79 0.14
0.65 0.22 0.12
0.67 0.46 0.15
0.11 0.20 0.78 0.18 0.13 0.56
0.20 0.18 0.56
0.14 0.25 0.57
0.67 0.20 0.12
0.65 0.45 0.15
0.17 0.77 0.14
0.21 0.19 0.55
0.20 0.20 0.56
0.17 0.29 0.55
a1,6
0.68 0.19 0.08
0.65 0.45 0.17
0.09 0.78 0.15
0.20 0.20 0.54
0.18 0.19 0.58
0.17 0.21 0.55
a1,6
a2,6
a3,6 a1,6 a2,6
a3,6 a2,~ a3,6
82.6
a3,6
terized by Gao et al. (13) using PCA and PSCF analysis. On the basis of 165 measurements taken at the three sampling sites (Burbank, Claremont and Rubidoux),they detected a number of distinctive source regions. Three near coastal areas (Los Angeles, Hawthorn, Long Beach), 1026
Environ. Sci. Technol., Vol. 28, No. 6, 1994
Flgure 3. Map of the 165 sample subset of measured ambient concentrations from the SCAQS’87 data set onto a 6 X 6 Kohonen neuralnetwork (a, top) and onto a 19 X 19 network (b, bottom)followed by the connectionof loaded neurons based on Prim’s minimal spanning tree results.
for example, showed significant sources of “03. An extended area of SO2 emissions has been found close to the Pacific Ocean to the southwest of Los Angeles in the Long Beach area. In the area around Rubidoux, increased concentrations of NHs and NH4+ were found. The PCA results suggested a ‘marine aerosol factor’ with typical seasalt components such as Na, Mg, and C1, and ‘industrial factors’ with significant loadings of Zn and P could be observed at each site. For comparison and validation of this new receptor site modeling technique, the same 165 X 17 subset as examined by Gao et al. (13) has been mapped as shown in Figure 3a,b. By comparing the two maps, it can be seen that the minimal spanning tree unfolds and relaxes if it is given more space. Second, most of the highly loaded neurons group in the corners and along the outer edges. Third, the projected minimal spanning tree connects the loaded
..................... , .
.......................................... jj 6’0 l
i 2 0 I 1 : :
0
0
0 :
0 :
:
0
0
0
0
010 0 ; o 1
0
0
L
0 0
;
I
0
0
0
0
1
0
0
0
1&2
111.1
1 0 0 1 I 0 0 0 I / 1 * 1 ~ * 1 ~ 1 * ’ 2 - 2
o