Environ. Sci. Technol. 2009, 43, 4694–4700

Cluster Analysis of Rural, Urban, and Curbside Atmospheric Particle Size Data

DAVID C. S. BEDDOWS, MANUEL DALL’OSTO,† AND ROY M. HARRISON*

National Centre for Atmospheric Science, Division of Environmental Health and Risk Management, The School of Geography, Earth and Environmental Sciences, University of Birmingham, Edgbaston, Birmingham B15 2TT, United Kingdom

Received November 5, 2008. Revised manuscript received February 26, 2009. Accepted March 8, 2009.

Particle size is a key determinant of the hazard posed by airborne particles. Continuous multivariate particle size data have been collected using aerosol particle size spectrometers sited at four locations within the UK: Harwell (Oxfordshire); Regents Park (London); British Telecom Tower (London); and Marylebone Road (London). These data have been analyzed using k-means cluster analysis, which was found to be the preferred technique among four partitional clustering methods: Fuzzy, k-means, k-median, and Model-Based clustering. Using cluster validation indices, k-means clustering was shown to produce clusters with the smallest size, furthest separation, and, importantly, the highest degree of similarity between the elements within each partition. Using k-means clustering, the complexity of the data set is reduced, allowing characterization of the data according to the temporal and spatial trends of the clusters. At Harwell, the rural background measurement site, the cluster analysis showed that the spectra may be differentiated by their modal diameters and average temporal trends, showing high counts during either the day-time or night-time hours. Likewise for the urban sites, the cluster analysis differentiated the spectra into a small number of size distributions according to their modal diameter, the location of the measurement site, and time of day. The responsible aerosol emission, formation, and dynamic processes can be inferred according to the cluster characteristics and correlation to concurrently measured meteorological, gas phase, and particle phase measurements.

* Corresponding author phone: +44 121 414 3494; fax: +44 121 414 3709; e-mail: [email protected].
† Present address: Centre for Climate & Air Pollution Studies, National University of Ireland, Galway, University Road, Galway, Ireland.

1. Introduction
The health effects of airborne particulate matter are widely accepted to be particle size dependent. Not only does particle size determine the locus of deposition in the respiratory tract but also the health impact of deposited particles is likely to depend critically upon the size distribution. Smaller (ultrafine) particles are able to penetrate the pulmonary epithelium, reaching the interstitium, and also present greater surface area per unit mass than coarser particles. The shape of the particle size distribution can provide valuable insights into the formation mechanisms of particles (e.g., nucleation) as well as the geographic source areas (1, 2). Hence, the assessment of particle size information, in the form of a particle size spectrum, is important in the study of atmospheric aerosol dynamics and is also important in studies of human health.

Both physical and chemical measurements are routinely made on atmospheric particles, and univariate analysis (investigating how a single measured quantity varies with a second) can be easily carried out. Bulk measurements such as particle number, PM10, and PM2.5 tend to be of this nature, whereby a single measurement representing the bulk material within a cubic meter of air is cyclically measured (3-11). As the detail needed to address policy relevant issues such as human health impacts increases, the degree of complexity of the field measurement and subsequent data analysis method will increase. Data fractionated according to their chemical makeup and particle size association will naturally improve the representation of the aerosol composition. Indeed, temporal trends in the bulk quantities can continue to be a core feature of the measurement, flagging up episodic events, e.g. exceedance of air quality objectives; however, once the measurements become size-fractionated, the extra degrees of freedom imply a necessity to account for variation of the patterns across these dimensions (12). Fractionated mass measurements like those of Allen et al. (13) can be carried out using cascade impactors which divide the data into 2 to 10 size ranges. Using coarse and fine size fractions, Singh et al. (14) also demonstrated apportionment of chemical measurements to the resuspension of dust, vehicular activity, and industrial power plants and oil refineries. Dillner et al.
(15, 16) carried out similar source apportionment using Ward’s method of hierarchical clustering to attribute groups of elements with similar size distributions (0.056 < Dp < 1.8 µm) to sources such as automobile catalysts, fluid catalytic cracking unit catalysts, fuel oil burning, a coal-fired power plant, and high-temperature metal working. In the work presented here, the partitional clustering of particle number count spectra is considered, rather than size fractionated mass, which is biased toward higher particle diameter. However, faced with a massive number of spectra whose size range may potentially extend from 12 nm up to >10 µm if an SMPS-APS combination is used (17), it is helpful to be able to reduce them into manageable groups with similar characteristics before attempting a generic interpretation, and for this purpose cluster analysis methods have been used.

10.1021/es803121t CCC: $40.75 © 2009 American Chemical Society. Published on Web 03/30/2009

2. Methods
The spectra considered consist of values of dN/dlog(Dp) measured across 106 size fractions, Dp, of a TSI Scanning Mobility Particle Sizer (SMPS) spanning the size range from 12 nm up to 661 nm. When considering such data using statistical applications, two coarse levels of descriptive, exploratory, and confirmatory data analyses can be identified. The first level is found in the cluster analyses discussed, where natural groups are found in the data according to the patterns measured by the values of dN/dlog(Dp) across the log(Dp) scale. Source recognition is achieved in the first instance at the rural site, by correlating the temporal trends of the resulting cluster counts to concurrently measured chemical and meteorological variables, which are used to assign the clusters to long-range transport, photochemical processes, and local anthropogenic activity. In the second instance, we consider data from three locations in central London and demonstrate the ability of cluster analysis to characterize the spectra measured at the three different sites, those being urban background, urban curbside, and urban aloft. Classification of the spectra according to these naturally found groups can then be correlated to either concurrently measured parameters or measurement location, both demonstrated in this work with the data collected at the UK sites of Harwell and London. This macroscopic approach makes the assumption that each measurement site has its own characteristic particle size spectra. From this, the data can be divided according to location type before the more exploratory, invasive analyses are considered. The latter include the following: principal component analysis; factor analysis (18); modal analysis (19); and positive matrix factorization (20), where the particle size distributions are characterized as a superposition of distributions associated with specific sources.

2.1. Cluster Tendency, Methods, and Validation Statistics. The mathematical details of the methods discussed are available in the Supporting Information. The use of cluster analysis was justified in this work using a Cluster Tendency test (21), which calculated a Hopkins Index of 0.20 ± 0.16, implying that the particle size data formed natural clusters. The choice of k-means clustering was made from a selection of the partitional cluster packages, described within the taxonomy of cluster analysis (22, 23), and available within the Statistical Programming Environment R (24); these were as follows: Fuzzy; k-means; k-median; and Model-Based clustering (24-29).

TABLE 1. Validation Statistics Calculated for the 4 Partitional Clustering Methods, Applied to the SMPS Data^a

method | avg diam | avg btwn | avg within | wb ratio | avg silwidth | dunn index | hg index
k-means | 0.5140 | 0.2786 | 0.05515 | 0.1927 | 0.3265 | 0.000563 | 0.2830
Fuzzy r = 1.07 | 0.5111 | 0.2767 | 0.05646 | 0.1988 | 0.3039 | 0.000550 | 0.2706
k-medians | 0.5445 | 0.2763 | 0.05910 | 0.2089 | 0.2824 | 0.000525 | 0.2657
Fuzzy r > 1.07 | 0.8237 | 0.2771 | 0.09744 | 0.3508 | 0.0387 | 0.000375 | 0.2499
MBC (9 VVV) | 1.2648 | 0.2798 | 0.15442 | 0.5541 | 0.0688 | 0.000480 | 0.2171

^a Averaged over cluster number settings 2 to 30 measured at the rural UK AURN monitoring site at Harwell, Oxfordshire (Supporting plots are given in Figure S1).
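The cluster tendency test mentioned in Section 2.1 can be illustrated with a small sketch. This is not the authors' implementation (their calculations were done in R and are detailed in the Supporting Information); the function name `hopkins_statistic`, the sample size `m`, the use of plain Euclidean distances, and the convention that values well below 0.5 indicate clustering are all assumptions made for illustration. Some texts define the complementary quantity, for which values near 1 indicate clustering.

```python
import numpy as np

def hopkins_statistic(X, m=30, seed=0):
    """Hopkins-type cluster tendency statistic (illustrative convention).

    Returns W / (U + W), where W sums the nearest-neighbour distances of m
    sampled observations and U sums the nearest-observation distances of m
    points drawn uniformly from the bounding box of the data. Under this
    (assumed) convention, values near 0.5 suggest no structure and values
    well below 0.5 suggest natural clusters.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    rng = np.random.default_rng(seed)

    # m real observations, sampled without replacement
    idx = rng.choice(n, size=m, replace=False)
    # m artificial points, uniform over the bounding box of the data
    artificial = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))

    # distance from each artificial point to its nearest real observation
    u = np.linalg.norm(artificial[:, None, :] - X[None, :, :], axis=2).min(axis=1)

    # distance from each sampled observation to its nearest *other* observation
    dist = np.linalg.norm(X[idx][:, None, :] - X[None, :, :], axis=2)
    dist[np.arange(m), idx] = np.inf  # exclude the point itself
    w = dist.min(axis=1)

    return w.sum() / (u.sum() + w.sum())


# Synthetic data only: two tight blobs versus structureless uniform noise.
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(0.0, 0.02, size=(150, 2)),
                   rng.normal(1.0, 0.02, size=(150, 2))])
uniform = rng.uniform(0.0, 1.0, size=(300, 2))
h_blobs = hopkins_statistic(blobs)
h_uniform = hopkins_statistic(uniform)
```

Repeating the calculation over several random samples would give a mean and spread for the statistic, which is one plausible reading of a value reported with an uncertainty.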
The methods were compared using cluster validation indices (28), which identified k-means clustering as the preferred technique, producing clusters with the smallest size, furthest separation, and, importantly, the highest degree of similarity between the elements within each partition. Table 1 summarizes the cluster validity values obtained by applying each of the methods using 2 to 30 clusters. The data set consisted of a sample of 5000 hourly smps_h,m measurements averaged from SMPS spectra collected over 15 min cycles at the rural UK AURN monitoring site at Harwell, Oxfordshire in 2005, and the spectra were all normalized to the vector length before analysis. The distance based measurements include the following: cluster diameter (diam, the maximum separation of the SMPS spectra within the cluster); cluster separation (btwn, the minimum distance between SMPS spectra of neighboring clusters); the average distance within clusters (avg within); and the average distance between clusters (avg btwn). From these measurements, the cluster validation indices may be derived; the simplest, the wb ratio, is the ratio of the average distance within clusters to the average distance between clusters. Other indices considered included the Hubert-Gamma Index (hg) and the Davies-Bouldin Index. In this work, however, the Dunn index (dunn index) was found to be the most useful; it is a function of the ratio of the minimum cluster separation to the maximum cluster diameter, implying that the larger the Dunn index, the more compact and well separated are the clusters within the space. A second useful measurement is the Silhouette width (silwidth), which is a measure of the similarity of the SMPS spectra within a cluster.

Each column of Table 1 gives the average result for one cluster validation index, and each row corresponds to one clustering technique. Prior to averaging, the values were plotted against cluster number for each technique, and all except the Dunn Index showed trends consistent with those seen in Table 1 (see Figure S1). The Dunn Index values were more varied across the cluster number domain, but it could generally be seen that k-means and Fuzzy clustering (using r = 1.07) presented the highest values. Furthermore, across the range of interest (either side of the optimum of 10 clusters) the Dunn index for k-means was the highest, with the exceptions of cluster number = 8, 9, and 14, where Fuzzy clustering with r = 1.07 presented the highest value. Note that for the Fuzzy clustering algorithm two different values of the exponent r were chosen to produce (i) as close to a mutually exclusive cluster result as possible, using r = 1.07, and (ii) a clustering result with a degree of overlap (1.07 < r < 2.0).

Table 1 ranks the clustering methods in order of preference, increasing from the bottom to the top entry, showing k-means clustering to be the preferred technique and Model-Based Clustering (MBC) to be the least successful. Considering the values from bottom to top in each column, the decrease in the average cluster diameter and the trendless pattern of the average distance between clusters (avg btwn) indicate a progressively tighter packing of the SMPS spectra. This is confirmed by the decreasing trend of both the average distance within each cluster (avg within) and the wb ratio (avg within/avg btwn), implying that k-means is the preferred choice. The similarity of the elements of all the clusters, judged using the average silhouette width, also increases traversing up the table. Further confirmation of the hierarchy of preference shown in Table 1 is given by the increasing trends of the Dunn Index and the Hubert-Gamma Index, also included. Although MBC was the least successful method in this case, it has to be borne in mind that MBC assumes that the clusters are normally distributed. If the data fail a normality test, then transforms can potentially be applied in order to meet this criterion for MBC.
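The distance-based quantities defined above (avg within, avg btwn, wb ratio, and the Dunn index) can be sketched for a labelled data set as follows. This is a minimal illustration under stated assumptions, not the R routine used in the study: the function name `validation_indices` is hypothetical, Euclidean distance is assumed, the spectra are assumed to have been normalized already, and the averages are taken over all point pairs, which may differ from the weighting used by the original package.

```python
import numpy as np

def validation_indices(X, labels):
    """Distance-based cluster validation quantities for a hard partition.

    avg_within  : mean distance over all pairs sharing a cluster
    avg_between : mean distance over all pairs in different clusters
    wb_ratio    : avg_within / avg_between (smaller is better)
    dunn        : minimum between-cluster separation divided by the
                  maximum cluster diameter (larger is better)
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)

    # full Euclidean distance matrix between all observations
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))

    same = labels[:, None] == labels[None, :]
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)  # count each pair once

    within = D[same & upper]     # pairs inside the same cluster
    between = D[~same & upper]   # pairs straddling two clusters

    return {
        "avg_within": within.mean(),
        "avg_between": between.mean(),
        "wb_ratio": within.mean() / between.mean(),
        "dunn": between.min() / within.max(),
    }


# Toy example: two clusters of two points each, 5 units apart.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
indices = validation_indices(X, [0, 0, 1, 1])
```

In the toy example the maximum cluster diameter is 1 and the minimum separation is 5, so the Dunn index is 5. Evaluating such a function on the partitions obtained for each cluster number would give curves of the kind described for Figure S1.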

3. Results
3.1. k-Means Cluster Analysis of UK Rural Data. The SMPS data from the rural Harwell site (mobility diameter range 12 to 437 nm) were analyzed using the preferred k-means clustering. In order to choose the optimum number of clusters the Dunn Index was used. The other metrics, although useful in comparing techniques, showed either a gradual increase or decrease in value with cluster number, making it difficult to choose a cluster number. On the other hand, when plotted against increasing cluster number, the Dunn Index did show a general decreasing trend in the baseline values from 6.6 × 10^-4 to 3.6 × 10^-4, but there were distinct maxima with Dunn Index values equal to 8.5 × 10^-4, 8.3 × 10^-4, and 7.4 × 10^-4 for 4, 10, and 16 clusters, respectively. The choice of cluster number was then made based on the competing need to have a high Dunn Index and a cluster number great enough to divide the data up into manageable sets of SMPS data, i.e. 10 or 15 clusters. The average SMPS spectra of the resulting clusters, and a table showing how many of the spectra are classified into each cluster, are arranged in Figure 1. Supporting Information Figure S2 presents a Cluster-Proximity Diagram which shows how the clusters are arranged relative to each other based on the Silhouette width, which is used to identify which elements within the data set of H spectra are least dissimilar to those within each cluster but do not belong to it. In general the clusters are bimodal. Clusters 4, 6, 7, 8, and 10 (placed to the left of the Cluster-Proximity Diagram) have modal diameters less than 0.05 µm and can be characterized as being nucleation-dominated spectra, as there are no local traffic sources. Clusters 6, 7, and 8 also show a similar shaped accumulation mode, unlike cluster 8 where the accumulation mode is more prominent together with the nucleation mode. The remaining clusters tend toward being characterized as consisting of accumulation mode spectra with modal diameter at and above 0.05 µm.

FIGURE 1. Average rural k-means clustered spectra from Harwell annotated with the names of variables to which the temporal profiles are highly correlated and the average diurnal cycle of the hourly counts.

The temporal evolution of the clusters over a typical day reveals further structure in the clustering results. Figure 1 divides the clusters into three groups depending on the hours over which the cluster counts maximize. From 00:00 to 23:00 of the average day, the hourly ratio of the number of counts for clusters 1, 2, 3, and 6 to the hourly total count of all the clusters follows a valley-shaped trend descending from, and rising to, a small plateau of high values occurring during the night. Similarly, the hourly ratio of the number of counts for clusters 4, 8, and 9 follows a clipped-peak shape, with a plateau of high values during the daytime and evening. The remaining clusters, numbered 5, 7, and 10, do not follow such distinct temporal patterns as the other clusters and form a subgroup, with cluster 7 growing after 10:00 and peaking at 00:00 and clusters 5 and 10 having the stronger counts during the morning and afternoon hours.

The results can be related to atmospheric processes if we consider the correlation between the temporal trends of the clusters and the temporal trends of the other gas-phase (O3, NO, NO2, NOX, and SO2) and particle-phase (PM2.5, PM10, nitrate, elemental carbon, and organic carbon) species and meteorological data (wind speed and wind direction) measured at the site (Table S3). Clusters 1, 2, and 3 show a correlation with PM10 and an inverse correlation with ozone. All fall in concentration in the afternoon when the boundary layer deepens and ozone tends to maximize. They exhibit a distribution firmly centered in the accumulation mode, which typically occurs in association with regional air pollutant transport. Clusters 4 and 8 show the greatest contribution of nucleation mode particles and are associated with elevated wind speeds, ozone, and sulfur dioxide and low PM10, and peak during the afternoon. This is consistent with particles formed by nucleation of sulfuric acid formed by oxidation of sulfur dioxide by hydroxyl radical (formed by ozone photolysis) in the presence of a low condensation sink due to high wind speed (30). Cluster 6 shows an association with organic carbon and nitrogen dioxide and may be due to a combustion source. The other cluster types show less clear-cut associations with other components and can probably arise from a number of different production/advection scenarios.

FIGURE 2. Average urban k-means cluster spectra and the SMPS deployment sites used during the REPARTEE campaign in London.

3.2. k-Means Cluster Analysis of UK Urban Data. The SMPS data collected during the 2007 REPARTEE field campaign held in London during the months of October and November provided a large urban data set for analysis. This is one of the few campaigns in which concurrent particle size measurements have been carried out at multiple sites in a major city. Considering the sites (Figure 2), the urban background measurement was located in the Inner Circle of Regents Park (RP) (ca. 2 km^2), situated ca. 700 m from the curbside measurement site on one of London’s major highways, Marylebone Road (MR). The aloft measurements were taken approximately 1.2 km from Regents Park, from the 35th floor of the British Telecom Tower (BT; 60 Cleveland Street), at a height of 170 m above the ground. The SMPS systems were deployed at the three locations measuring particles with diameters over the size range 15.1 to 661 nm. At the Regents Park and BT tower locations, a DMA TSI 3080 and CPC TSI 3022A were deployed, and at the Marylebone Road site a DMA TSI 3080 and CPC TSI 3776 were in operation. The three systems were intercompared, and the data collected were corrected to account for the particle losses due to the different sampling lines. The measurements were made concurrently at the 3 different sites (5 min interval at RP and BT; 10 min interval at MR).
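Spectra recorded at different intervals (5 and 10 min here) can be brought onto a common time base by block averaging, and then normalized to unit vector length as described in Section 2.1. The sketch below illustrates one way of doing this; the function names `block_average` and `unit_length`, the representation of time as hours since the first midnight, and the toy two-bin spectra are assumptions for illustration, not the processing code used in the campaign.

```python
import numpy as np

def block_average(hours, spectra, block_hours=6.0):
    """Average spectra falling in the same fixed-length time block.

    hours   : time of each spectrum, in hours since the first midnight
    spectra : array of shape (n_spectra, n_size_bins)
    Returns the block indices that contain data and the mean spectrum of
    each such block (block 0 covers 00-06, block 1 covers 06-12, ...).
    """
    hours = np.asarray(hours, dtype=float)
    spectra = np.asarray(spectra, dtype=float)
    block_id = np.floor(hours / block_hours).astype(int)
    ids = np.unique(block_id)
    means = np.vstack([spectra[block_id == b].mean(axis=0) for b in ids])
    return ids, means

def unit_length(spectra):
    """Scale each spectrum (row) so that its Euclidean norm equals 1."""
    spectra = np.asarray(spectra, dtype=float)
    return spectra / np.linalg.norm(spectra, axis=1, keepdims=True)


# Toy data: six two-bin "spectra" spread over the first 14 hours.
hours = np.array([0.0, 1.0, 5.5, 6.0, 7.0, 13.0])
spectra = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0],
                    [2.0, 0.0], [4.0, 0.0], [0.0, 7.0]])
ids, means = block_average(hours, spectra)
normalised = unit_length(means)
```

Normalizing after averaging means the clustering compares the shapes of the averaged spectra rather than their absolute loadings.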
The field study was conducted between 17-10-07 and 09-11-07, and approximately 7000 SMPS size distributions were measured. These size distributions were averaged over 6 h intervals (00-06, 06-12, 12-18, 18-00), reducing their number from 20,000 to 259, which were subsequently normalized by their vector length and cluster analyzed. Using these long averages, the characteristics of the size distributions peculiar to each site (road, tower, park) are derived using cluster analysis. From this one can decide which days or periods are of interest for future analyses (e.g., PCA, PMF, Modal-Analysis, etc.), using hourly averaged data which are more comparable to traffic, meteorological (including LIDAR measurements), and gas-phase and particle-phase data. The Dunn Index for the results of the k-means cluster analysis for different cluster numbers showed a clear maximum for 15 clusters, the results of which are presented in Figure 2 and arranged according to the patterns identified in the Cluster-Proximity diagram (Figure S4). Characteristics of the spectral clusters are given in Table 2. Marylebone Road may be considered as a major particle line-source producing a size distribution characteristic of traffic, with the highest loadings in the nucleation mode. Clusters 1 (observed during the day and night) and 7 (observed at night) are observed solely at this site. This characteristic distribution of the traffic source is also linked to spectra detected in Regents Park during the day (clusters 4 and 14) and night (cluster 12) as well as at the BT tower at night (cluster 8), in which the traffic mode at 25 nm is less prominent. We have seen no evidence in earlier work with highly time-resolved data for large-scale nucleation processes as a source of particles in London, and therefore the nucleation mode is attributed to traffic emissions. Five of the cluster types were detected only at the BT Tower. Clusters 2, 11, and 15 were similar to each other, having an accumulation mode at about 100 nm.
However, clusters 2 and 11 presented very different temporal profiles: cluster 2 was observed during night time, and cluster 11 was observed mainly during the afternoon. Cluster 15, which is very similar to cluster 2, was also observed only at the BT tower at night. This observation suggests that the regional background is characterized by a very prominent accumulation mode and is observed at nighttime when the boundary layer depth is low, separating regional aerosol measured aloft from ground-based emissions. Cluster 11, measured on the tower during daytime, shows influences of aged local emissions upon which the regional peak is superimposed. Clusters 5 and 6, measured on the BT tower at night, are reminiscent of average spectra reported for the rural Harwell site, dominated by Harwell clusters 1, 2, and 9 (the most abundant Harwell clusters), and hence also represent regional background, perhaps heavily aged. The particle number concentrations are, however, notably higher. Even using spectra averaged over 6 h, clusters characteristic of the site and specific periods of the day are observed. In particular, clusters 1 and 7 are observed only at the Marylebone Road site. Moreover, on inspection cluster 1 shows a diurnal pattern linked to fresh traffic emissions. The spectra within cluster 7 are observed to occur mainly during the night. Besides a component of nanoparticles from local emissions, it also shows the influence of the regional background caused by an elevated relative humidity and reduced urban boundary layer height. Similarly, the clusters peculiar to the observations at the BT tower (2, 6, 11, and 15) show considerable aging and are related to the regional background. Cluster 12, which is seen at both MR and RP, is an example of a similar boundary layer composition occurring at both sites. It is not observed at the elevated BT tower site.

TABLE 2. Characteristics of Spectral Clusters in London

cluster | sites of detection | period of day | physical description (see Figure 2) | interpretation
1 | MR | all | mode at 25 nm and smaller mode at 100 nm, high loadings | traffic emissions imposed on low regional background
2 | BT | night | large accumulation mode at 100 nm | regional background (less aged)
3 | detected mainly at RP with a few spectra collected at BT | day | small loadings and smaller mode at 25 nm | aged traffic aerosol with regional background
4 | detected mainly at RP with a few spectra collected at MR | day | small loadings at 25 and 100 nm | aged traffic aerosol
5 | BT | night | low number concentrations | regional background with aged road traffic
6 | BT | night | low number concentrations | regional background
7 | MR | evening | mode at 25 nm and smaller mode at 100 nm, high loadings | traffic emissions plus regional background
8 | MR/RP/BT | night | broad mode | regional background with small traffic emissions component
9 | RP | midday | mode at 25 and 100 nm at the same magnitude | regional background with aged traffic emissions
10 | RP/BT | night | large accumulation mode | regional background
11 | BT | all | large accumulation mode at 100 nm | regional background
12 | MR/RP | night | large mode at 25 nm, small mode at 100 nm | traffic emissions imposed on low regional background
13 | RP/BT | day | large mode at 25 nm and at 100 nm | similar contributions of traffic emissions and regional background
14 | MR/RP | day | mode at 25 nm and smaller mode at 100 nm, high loadings | traffic emissions and regional background
15 | BT | night | large accumulation mode at 100 nm | regional background (less aged)

4. Discussion
Particle size spectra collected from various UK monitoring sites have been characterized using cluster analysis. We have systematically dealt with issues concerning the suitability of cluster analysis, which method to use, and how to choose the number of clusters. To address the point as to whether SMPS data should be treated with a cluster analysis method, a cluster tendency test was carried out using the Hopkins Index to show that the data did indeed form natural clusters for the algorithms to partition. To choose which method was the most suitable for the SMPS data, cluster validation indices were used. By averaging the cluster validation indices, measured using the results obtained with 2 to 30 clusters for each technique, a comparison was made which concluded that k-means clustering was the preferred partitional clustering method. The averaged indices showed that the k-means clusters had the smallest size and highest separation and, more importantly, showed the highest similarity of the elements within each of the clusters. Using the Dunn Index, calculated for k-means using 2 to 30 clusters, an optimum number of 10 clusters was chosen for the rural background data, from which the average spectrum and hourly count for an average day for each cluster were deduced. The spectra were observed to exhibit distinctive profiles of the cluster counts averaged over a one day cycle, maximizing during either the day or night time hours. These are explicable in terms of formation mechanisms and atmospheric processes using measured correlations between the temporal trends of the cluster counts across an average day and concurrently measured gas phase, particle phase, and meteorological values.

This work was extended to consider data simultaneously measured in an urban environment over a month long campaign in London. Treating the data as one whole data set, the Dunn Index calculated for 2 to 20 clusters showed a maximum for 15 clusters. The data provided evidence for the aging of aerosols, with the nucleation mode-dominated spectra from Marylebone Road evolving into the coarser spectra measured in Regents Park, representative of the urban background. The analysis also identified some common spectra measured in Regents Park and from the BT Tower, but all but one of the clusters measured at Marylebone Road were not detected at the BT tower, indicating a rapid transformation of the particle distribution from Marylebone Road (having a prominent nucleation mode) to the BT Tower, whose spectra were more dominated by the loadings in the accumulation mode. Periods were also evident when the boundary layer was shallow and the tower site was isolated from ground-level emissions.

In summary, a method has been proposed based on cluster analysis in which high dimensional SMPS data collected from various sites can be greatly simplified into spectral clusters which can be used to characterize the data set. It has also been shown that more than one SMPS data set can be analyzed in such a way as to characterize the particle size distribution measured simultaneously at three different sites, showing differences in the modal diameters of the spectra, the loadings, and the period when most frequently observed. It is envisaged that this analysis could be extended to include data on other monitored particle sources around a major city as part of a source-apportionment procedure. The clustered data clearly show large variations in size distribution, both spatially and temporally, within a small area of London. Given such cluster results, the structure making up the patterns of the particle size spectra, through the superposition of individual sources, can be investigated using the techniques discussed, such as Principal Component Analysis, Modal-Analysis, and Positive Matrix Factorization.

Implications for Human Health.
The implications of particle chemical composition and size association for effects on human health have yet to be fully elucidated (31). A recent epidemiological study carried out in London (32) has shown particle number count, which is dominated by the contribution of nanoparticles, to be correlated with cardiovascular health outcomes, while particle mass metrics such as PM10 show an association with respiratory disease. This may be related to the regional deposition of particles in the respiratory system and, in the case of nanoparticles, to their ability to penetrate the pulmonary interstitium and ultimately to enter the bloodstream. Consequently, the health impact of an aerosol with a size distribution such as clusters 1 and 14 in Figure 2 which are rich in nanoparticles may be quite different from that of clusters 2 and 15 in Figure 2 with few nanoparticles but a large accumulation mode. The peak efficiency of particle deposition in the alveolar region of the deep lung is around 30 nm, a size readily delivered by clusters 1 and 14. On the other hand, particles of 100 nm, as in clusters 2 and 15, do not deposit efficiently and are liable to be breathed out (33).

Acknowledgments
This research was funded in part by the UK Department for Environment, Food and Rural Affairs under contract No. CPEA 28 and by the European Union under the EUSAAR project. The REPARTEE campaign was funded by the NERC National Centre for Atmospheric Science and the BOC Foundation.

Supporting Information Available
Auxiliary data calculated by the k-means clustering provided the nearest neighboring cluster for each element. Using this information the positioning of the clusters was deduced, and the percentage of the clusters surrounding any one cluster was calculated for the Harwell and London data. This information is illustrated using Cluster-Proximity Diagrams. Also included are the validation indices plotted against the cluster number for each cluster method, prior to the calculation of the average values reported in Table 1, and the mathematical details supporting the calculations. This material is available free of charge via the Internet at http://pubs.acs.org.

Literature Cited
(1) Charron, A.; Birmili, W.; Harrison, R. M. Fingerprinting particle origins according to their size distribution at a UK rural site. J. Geophys. Res. 2008, 113, D07202.
(2) Tunved, P.; Ström, J.; Jansson, H.-C. An investigation of processes controlling the evolution of the boundary layer aerosol size distribution properties at the Swedish background station Aspvreten. Atmos. Chem. Phys. 2004, 4, 2581–2592.
(3) Jones, A. M.; Harrison, R. M. Assessment of natural components of PM10 at UK urban and rural sites. Atmos. Environ. 2006, 40, 7733–7741.
(4) Lin, P.; Hu, M.; Wu, Z.; Niu, Y.; Zhu, T. Marine aerosol size distributions in the springtime over China adjacent seas. Atmos. Environ. 2007, 41, 6784–6796.
(5) Longley, I. D.; Inglis, D. W. F.; Gallagher, M. W.; Williams, P.; Allan, J. D.; Coe, H. Using NOX and CO monitoring data to indicate fine aerosol number concentrations and emission factors in three UK conurbations. Atmos. Environ. 2005, 39, 5157–5169.
(6) Mäkelä, J. M.; Koponen, I. K.; Aalto, P.; Kulmala, M. One-year data of submicron size modes of tropospheric background aerosol in southern Finland. J. Aerosol Sci. 2000, 31, 595–611.
(7) Querol, X.; Alastuey, A.; Ruiz, C. R.; Artiñano, B.; Hansson, H. C.; Harrison, R. M.; Buringh, E.; ten Brink, H. M.; Lutz, M.; Bruckmann, P.; Straehl, P.; Schneider, J. Speciation and origin of PM10 and PM2.5 in selected European cities. Atmos. Environ. 2004, 38, 6547–6555.
(8) Vallius, M.; Lanki, T.; Tiittanen, P.; Koistinen, K.; Ruuskanen, J.; Pekkanen, J. Source apportionment of urban ambient PM2.5 in two successive measurement campaigns in Helsinki, Finland. Atmos. Environ. 2003, 37, 615–623.
(9) Viana, M.; Querol, X.; Alastuey, A. Chemical characterisation of PM episodes in NE Spain. Chemosphere 2006, 62, 947–956.
(10) Yin, J.; Harrison, R. M. Pragmatic mass closure study for PM1.0, PM2.5 and PM10 at roadside, urban background and rural sites. Atmos. Environ. 2007, 42, 980–988.
(11) Zhao, W.; Hopke, P. K.; Prather, K. A. Comparison of two cluster analysis methods using single particle mass spectra. Atmos. Environ. 2007, 42, 881–892.
(12) Charron, A.; Harrison, R. M.; Quincey, P. What are the sources and conditions responsible for exceedences of the 24 h PM10 limit value (50 µg m^-3) at a heavily trafficked London site. Atmos. Environ. 2007, 41, 1960–1975.
(13) Allen, A. G.; Nemitz, E.; Shi, J. P.; Harrison, R. M.; Greenwood, J. C. Size distributions of trace metals in atmospheric aerosols in the United Kingdom. Atmos. Environ. 2001, 35, 4581–4591.
(14) Singh, M.; Jaques, P. A.; Sioutas, C. Size distribution and diurnal characteristics of particle-bound metals and receptor sites of the Los Angeles Basin. Atmos. Environ. 2002, 36, 1675–1689.
(15) Dillner, A. M.; Schauer, J. J.; Christensen, W. F.; Cass, G. R. A quantitative method for clustering size distributions of elements. Atmos. Environ. 2005, 39, 1525–1537.
(16) Christensen, W. C.; Dillner, A. M.; Schauer, J. J.; Reese, C. S. Clustering composition vectors using uncertainty information. Environmetrics 2007, 18, 859–869.
(17) Shi, J. P.; Harrison, R. M.; Evans, D. Comparison of ambient particle surface area measurement by epiphaniometer and SMPS/APS. Atmos. Environ. 2001, 35, 6193–6200.
(18) Charron, A.; Harrison, R. M. Primary particle formation from vehicle emissions during exhaust dilution in the roadside atmosphere. Atmos. Environ. 2003, 37, 4109–4119.
(19) Angus, E. L.; Young, D. T.; Lingard, J. J. N.; Smalley, R. J.; Tate, J. E.; Goodman, P. S.; Tomlin, A. S. Factors influencing number concentrations, size distributions and modal parameters at a roof-level and roadside site in Leicester, UK. Sci. Total Environ. 2007, 386, 65–82.
(20) Emma, P.-T.; Eugene, K.; Paatero, P.; Hopke, P. K. Source apportionment of time and size resolved ambient particulate matter measured with a rotating DRUM impactor. Atmos. Environ. 2007, 41, 5921–5933.
(21) Jain, A. K.; Dubes, R. C. Algorithms for Clustering Data; Prentice Hall: 1988; ISBN 0-13-022278-X.

9

4699

(22) Jain, A. K.; Murty, M. N.; Flynn, P. J. Data Clustering: A Review. ACM Comp. Surv. 1999, 31, 264–323. (23) Hopke, P. K. Review: Evolution of chemometrics. Anal. Chim. Acta 2003, 500, 365–377. (24) R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2008; ISBN 3-90005107-0. (25) Maechler, M.; Rousseeuw, P.; Struyf, A.; Hubert, M. Cluster analysis basics and extensions; R package, 2005. (26) Leisch, F. flexclust: Flexible Cluster Algorithms, R package version 1.2-3; 2007. (27) Fraley, C.; Raftery, A. mclust: Model-Based Clustering/Normal Mixture Modeling, R package version 3.1-3; 2008. (28) Hennig, C. fpc: Fixed point clusters, clusterwise regression and discriminant plots, R package version 1.2-3; 2007.

4700

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 43, NO. 13, 2009

(29) Hartigan, J. A.; Wong, M. A. A k-means clustering algorithm. App. Statist. 1979, 28, 100–108. (30) Alam, A.; Shi, J. P.; Harrison, R. M. Observations of new particle formation in urban air. J. Geophys. Res. 2003, 108, 4093–4107. (31) Harrison, R. M.; Yin, J. Particulate matter in the atmosphere: Which particle properties are important for its effects on health. Sci. Total Environ. 2000, 249, 85–101. (32) Atkinson, R. W.; Fuller, G. W.; Anderson, H. R.; Harrison, R. M.; Armstrong, B. Differential toxicity of airborne particulate matter (size and chemistry) in London. Epidemiology Submitted for publication. . (33) Dept of Health, Committee on the Medical Effects of Air Pollutants, Non-Biological Particles and Health, HMSO, 1995.

ES803121T