Learning Principal Component Analysis by Using ... - ACS Publications

Article pubs.acs.org/jchemeduc

Learning Principal Component Analysis by Using Data from Air Quality Networks

Luis Vicente Pérez-Arribas, María Eugenia León-González, and Noelia Rosales-Conrado*

Departamento de Química Analítica, Facultad de Ciencias Químicas, Universidad Complutense de Madrid, Madrid 28040, Spain

Supporting Information

ABSTRACT: With the final objective of using computational and chemometrics tools in chemistry studies, this paper shows the methodology and interpretation of principal component analysis (PCA) using pollution data from different cities. It describes how students can obtain data on air quality and process them to extract additional information about pollution sources, climate effects, and social influences on pollution levels by using PCA, a powerful chemometrics tool. The paper could also be useful for students interested in environmental chemistry and pollution interpretation: this statistical method is a simple way to display visually as much as possible of the total variation of the data in a few dimensions, and it is an excellent tool for looking into normal pollution patterns.

KEYWORDS: Graduate Education/Research, Upper-Division Undergraduate, Environmental Chemistry, Computer-Based Learning, Chemometrics



© XXXX American Chemical Society and Division of Chemical Education, Inc.

INTRODUCTION

Modern methods of automatic analysis provide opportunities to collect large amounts of data in a short time. Taking a big city like Madrid as an example, 24 stations provide every hour more than 20 parameters related to air quality and atmospheric measures such as humidity, wind, etc., generating around 12,000 items of data per day. There are different ways of dealing with this extensive amount of data. One approach is to ignore most of them and to use only a few data (e.g., nitrogen dioxide (NO2) data from one or several specific stations on certain days). Another, more interesting approach is to treat all the data by applying multivariate analysis methods, whose main objective in analytical chemistry is the grouping and classification of objects (in this case, measured parameters, stations, days, etc.), as well as the modeling of relationships between the different analytical data. The methods of multivariate analysis make it possible to establish correlations between different parameters and, at the same time, to find correlations among the amounts of several pollutants.

Principal component analysis (PCA),1−3 like many multivariate methods of analysis, is based on data reduction: it takes into account the correlation between the data and represents in a simple way the location of the elements in a reduced coordinate system. This is possible because only a small number of parameters are significant in a set of data.4 Consequently, PCA can be used to make distinctions between data sets that are highly correlated. It has been used extensively in chemistry because it proves to be a very useful aid for data interpretation and classification.5−11 Basically, PCA is a factor model and an unsupervised method used to simplify a data set.

An unsupervised learning algorithm (such as PCA) finds patterns and regularities without direct human supervision.2 It is a mathematical tool that, by reducing the data dimensionality, makes visualization possible while retaining as much as possible of the information present in the original data. Thus, PCA transforms the original measured variables into new, uncorrelated variables termed “principal components” (PCs). The first PC accounts for the maximum of the total variance. The second is uncorrelated with the first and accounts for the maximum of the residual variance, and so on. The number of extracted PCs that are considered significant defines the optimal complexity of the model.9 In general, PCA decomposes the data into score and loading vectors. The directions of the loadings are chosen so that they maximize the variation spanned by each vector, with the maximum variation in the first component and a decreasing amount of variation in the subsequent orthogonal components. The results of a principal component analysis can be interpreted by plotting scores and loadings.2,12 The combination of loading and score plots in biplots will often reveal which observations are connected to which variables. In short, the main goal of PCA is to reduce redundant information in a set of data by transforming the correlated variables into a new set of uncorrelated variables, the principal components. Usually, two or three of those principal components provide a good summary of all of the

Received: July 22, 2016 Revised: December 16, 2016


DOI: 10.1021/acs.jchemed.6b00550 J. Chem. Educ. XXXX, XXX, XXX−XXX

Journal of Chemical Education


Figure 1. Air pollution control station sites in the Madrid region.

original variables by visually displaying the total variation in the data in a few dimensions.

On the other hand, environmental chemistry and atmospheric pollution are often topics of interest for many chemistry students, who, in general, are very interested in environmental matters and the effect of pollution on the planet. It is possible to use available air quality data to learn the principles and fundamentals of principal component analysis. According to European regulations, member states shall ensure that the general public, as well as organizations related to health care, environment, or industry, is informed of the ambient air quality adequately and in a timely fashion. In addition, this information shall be made available free of charge by means of any easily accessible media, including the Internet or any other appropriate means of telecommunication.13 Similar regulations exist in other countries of the Americas or Asia. Consequently, local authorities in most of the biggest cities supply pollution data, such as concentrations of certain pollutants, in real time.

This paper shows, step by step, how students, assisted by the teacher, can obtain data on air quality in different cities or regions and process them to obtain additional information related to the sources, climate effects, and social aspects of pollution by using the powerful chemometrics tool principal component analysis (PCA). The information obtained can be used to distinguish between primary pollutants, which are emitted directly from a source, and secondary pollutants, which are formed when other pollutants react in the atmosphere. Ozone is an example of a secondary pollutant: it is formed when volatile organic compounds, carbon monoxide, and nitrogen oxides react in the presence of sunlight.



EXPERIMENTAL SECTION

Data Collection

Several groups of students can collect air quality data for Madrid city (Spain), the Madrid region, London (UK), and Paris (France). Data corresponding to Madrid city were supplied by the Atmospheric Protection Service of the Madrid Council, available from its Website,14 while those corresponding to the Madrid region were supplied by the Atmospheric Quality Area-Air Quality Network of the Autonomous Region Authorities.15 In


Table 1. Air Quality Control Stations, Population, and Concentration Parameter Values Used in This Work

| Location Number | Station Location | Population (a) | [NO]/(μg m−3) | [NO2]/(μg m−3) | [PM10]/(μg m−3) | [O3]/(μg m−3) |
|---|---|---|---|---|---|---|
| 1 | Getafe | 173,057 | 85.3 | 76.6 | 40.0 | 15.3 |
| 2 | Leganés | 186,696 | 13.5 | 88.5 | 40.0 | 16.1 |
| 3 | Alcalá de Henares | 200,768 | 82.9 | 57.0 | 39.3 | 11.4 |
| 4 | Fuenlabrada | 195,864 | 59.0 | 75.0 | 40.4 | 12.6 |
| 5 | Móstoles | 205,712 | 48.5 | 72.5 | 29.3 | 10.5 |
| 6 | Torrejón de Ardoz | 126,878 | 35.5 | 46.6 | 36.5 | 13.4 |
| 7 | Alcorcón | 170,336 | 50.9 | 73.8 | 32.6 | 14.2 |
| 8 | Coslada | 88,847 | 132.6 | 74.6 | 49.6 | 8.2 |
| 9 | Colmenar Viejo | 47,445 | 32.4 | 63.9 | 20.5 | 28.5 |
| 10 | Aranjuez | 57,792 | 9.7 | 23.7 | 19.8 | 30.8 |
| 11 | Collado-Villalba | 62,587 | 87.2 | 72.0 | 39.8 | 9.3 |
| 12 | Arganda del Rey | 55,307 | 31.0 | 38.1 | 19.3 | 18.3 |
| 13 | Villarejo de Salvanés | 7,301 | 2.4 | 19.9 | 19.5 | 42.0 |
| 14 | S. Martín de Valdeiglesias | 8,516 | 4.2 | 19.7 | 23.7 | 25.7 |
| 15 | Rivas-Vaciamadrid | 80,483 | 58.1 | 56.2 | 30.7 | 12.8 |
| 16 | Guadalix de la Sierra | 6,057 | 8.0 | 23.1 | 16.1 | 29.4 |
| 17 | Algete | 20,102 | 3.4 | 27.0 | 15.2 | 41.0 |
| 18 | Valdemoro | 72,265 | 48.1 | 52.3 | 32.8 | 17.2 |
| 19 | El Atazar | 98 | 1.0 | 6.7 | 8.8 | 54.9 |
| 20 | Villa del Prado | 6,506 | 3.9 | 24.8 | 23.2 | 24.0 |
| 21 | Orusco de Tajuña | 1,283 | 1.0 | 9.1 | 12.4 | 74.2 |
| 22 | Madrid (E. Aguirre) | 3,165,235 | 75.0 | 83.3 | 27.0 | 12.4 |
| 23 | Madrid (Farolillo) | 3,165,235 | 51.0 | 64.0 | 25.2 | 13.5 |
| 24 | Madrid (Casa de Campo) | 3,165,235 | 35.0 | 54.7 | 19.9 | 15.9 |

(a) Demographic data correspond to January 1, 2014.19

both cases, data were supplied in metadata text files that students transferred to an Excel sheet and then transformed into large data sets, compiled in tables, from which the relevant information was extracted. Information about air quality and pollutants from London (UK) is available from the London Air Quality Network Website,16 which supplies air pollution data corresponding to the City and Greater London. Students can choose to download either data for one given site or data for one species from up to six sites, and then select the appropriate download page; the data are supplied in CSV format, which is easy to convert into an Excel table. Measurements are stored as 15 min means, but it is also possible to obtain data for longer averaging periods representing the mean concentration over 1, 8, or 24 h. Pollution data from Paris (France) are available on the AIRPARIF Website,17 where students can select several stations, add them to their basket, and, when all selections have been made, obtain the whole set of pollution data from an FTP server as hourly data in a CSV file. Some air pollution data from the aforementioned sources are provided in the Supporting Information as Excel files.
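As a rough illustration of the averaging step described above, the following sketch reduces hourly readings in CSV form to daily means using only the Python standard library. The column names (`date`, `hour`, `NO2`), the sample values, and the rule that an empty field marks a missing reading are all assumptions made for this example; the files supplied by each network have their own layouts and validity flags.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical hourly export: column names and values are invented for
# illustration and do not reproduce the layout of any network's files.
SAMPLE_CSV = """date,hour,NO2
2014-11-19,00,60.0
2014-11-19,01,
2014-11-19,02,80.0
2014-11-20,00,40.0
2014-11-20,01,50.0
"""


def daily_means(csv_text, value_column):
    """Return {date: mean of the valid readings found in value_column}."""
    readings = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row[value_column].strip()
        if not raw:
            continue  # assumption: an empty field marks a missing reading
        readings[row["date"]].append(float(raw))
    return {day: mean(values) for day, values in readings.items()}


means = daily_means(SAMPLE_CSV, "NO2")
for day, value in sorted(means.items()):
    print(day, round(value, 1))
```

Skipping missing readings is only one possible treatment and is not necessarily the one used for the data in this work.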

Data Preparation and Statistical Treatment

Once data have been downloaded and turned into data tables, it is necessary to extract from them the relevant information to be treated, such as daily data from certain dates, monthly data averages, etc. These operations are easy to carry out using spreadsheets, whose contents must then be transferred to the datasheet of the software used to carry out the principal component analysis. In addition, missing or nonvalid data in the raw pollution information should be properly processed before performing PCA (see Supporting Information). Principal component analysis can be done using a variety of commercially available software packages, such as Minitab, SAS, The Unscrambler, or IBM SPSS. In this work, we have used StatGraphics Centurion XVI, version 16.1.03 (both 32- and 64-bit versions are available), a powerful, intuitive software program for data analysis, data visualization, statistical modeling, and predictive analytics. At the official Website, it is possible to get more information about this computer package or later versions, or to download a trial version.18

MULTIVARIATE ANALYSIS AND DISCUSSION

After this general information, it is worth recalling that the chemometric value of principal component analysis lies in data simplification and dimension reduction. PCA is then applied to air pollution data from the Madrid region. The air quality of the Madrid region is monitored by means of 47 automatic stations, 23 of them under the regional authorities’ supervision and the remaining 24 under the Madrid city local authority. Twenty-four of these stations provide information about the amounts of nitrogen dioxide (NO2), nitric oxide (NO), ground-level ozone (O3), and particulate matter (PM10), all related to road traffic, and all 24 were selected for this work. Figure 1 shows the geographical location of the 24 air pollution control stations used for this work. Table 1 includes the locations of the stations chosen and the pollution data measured on November 19, 2014 (daily mean). This set of data will be used to carry out the multivariate analysis. (Data can be found in the Supporting Information.)

Estimating the Number of Principal Components

The first decision that must be made before carrying out a PCA is whether to use the raw data or to first standardize each variable to zero mean and unit variance (centering and scaling). This is a very important decision because, if the variables are not standardized and one variable has a much larger variance than the others, this variable will dominate the first principal component (PC).1
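The effect of centering and scaling can be checked on two columns of Table 1. The sketch below is illustrative only: it uses the sample standard deviation (n − 1 in the denominator), which is an assumption about the convention rather than a statement of what the commercial software does. With these values the raw NO variance comes out roughly an order of magnitude larger than that of PM10, whereas both standardized columns have unit variance.

```python
from statistics import mean, stdev

# NO and PM10 daily means (ug/m3) for the 24 stations, copied from Table 1.
NO = [85.3, 13.5, 82.9, 59.0, 48.5, 35.5, 50.9, 132.6, 32.4, 9.7, 87.2, 31.0,
      2.4, 4.2, 58.1, 8.0, 3.4, 48.1, 1.0, 3.9, 1.0, 75.0, 51.0, 35.0]
PM10 = [40.0, 40.0, 39.3, 40.4, 29.3, 36.5, 32.6, 49.6, 20.5, 19.8, 39.8, 19.3,
        19.5, 23.7, 30.7, 16.1, 15.2, 32.8, 8.8, 23.2, 12.4, 27.0, 25.2, 19.9]


def standardize(values):
    """Center to zero mean and scale to unit (sample) standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


raw_variances = {"NO": stdev(NO) ** 2, "PM10": stdev(PM10) ** 2}
z_no, z_pm10 = standardize(NO), standardize(PM10)
scaled_variances = {"NO": stdev(z_no) ** 2, "PM10": stdev(z_pm10) ** 2}

print("raw variances:   ", raw_variances)
print("scaled variances:", scaled_variances)
```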


Table 2. Eigenvalues and Explained Variances for the Pollution Data in Table 1

| Component | Eigenvalue, λ | Explained Variance/% | Cumulative Variance/% |
|---|---|---|---|
| 1 | 3.2710 | 81.77 | 81.77 |
| 2 | 0.3308 | 8.27 | 90.05 |
| 3 | 0.2110 | 5.27 | 95.32 |
| 4 | 0.1872 | 4.68 | 100.00 |

coefficient values of the principal components, scaled so that the length of a data vector is unaltered by the change. From these coefficients, also known as “component weights” or the “loadings matrix”, the computer program calculates the scores for the PCs selected. These scores represent the coordinates in the new reduced-dimensional system, and they can be represented graphically in a scatter plot showing the relationship between the different air pollution control stations in the Madrid region (Figure 3).
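The step from component weights to scores can be sketched with the data in Table 1. The code below is a minimal illustration, not the procedure of the commercial software: it standardizes the four variables, builds their correlation matrix, finds the first set of component weights by power iteration (a technique swapped in here because it needs no external library), and computes each station's score as the weighted sum of its standardized values. The sign convention and the use of the sample standard deviation are assumptions, so scores from other programs may differ in sign or scale.

```python
from math import sqrt
from statistics import mean, stdev

NAMES = ("NO", "NO2", "PM10", "O3")
# Daily means (ug/m3) from Table 1, one tuple per station (nos. 1-24).
DATA = [
    (85.3, 76.6, 40.0, 15.3), (13.5, 88.5, 40.0, 16.1), (82.9, 57.0, 39.3, 11.4),
    (59.0, 75.0, 40.4, 12.6), (48.5, 72.5, 29.3, 10.5), (35.5, 46.6, 36.5, 13.4),
    (50.9, 73.8, 32.6, 14.2), (132.6, 74.6, 49.6, 8.2), (32.4, 63.9, 20.5, 28.5),
    (9.7, 23.7, 19.8, 30.8), (87.2, 72.0, 39.8, 9.3), (31.0, 38.1, 19.3, 18.3),
    (2.4, 19.9, 19.5, 42.0), (4.2, 19.7, 23.7, 25.7), (58.1, 56.2, 30.7, 12.8),
    (8.0, 23.1, 16.1, 29.4), (3.4, 27.0, 15.2, 41.0), (48.1, 52.3, 32.8, 17.2),
    (1.0, 6.7, 8.8, 54.9), (3.9, 24.8, 23.2, 24.0), (1.0, 9.1, 12.4, 74.2),
    (75.0, 83.3, 27.0, 12.4), (51.0, 64.0, 25.2, 13.5), (35.0, 54.7, 19.9, 15.9),
]


def standardize(column):
    """Center to zero mean and scale to unit sample standard deviation."""
    m, s = mean(column), stdev(column)
    return [(v - m) / s for v in column]


# Rows of standardized values, one row per station.
Z = list(zip(*(standardize(col) for col in zip(*DATA))))
n, p = len(Z), len(NAMES)

# Correlation matrix R = Z^T Z / (n - 1).
R = [[sum(row[i] * row[j] for row in Z) / (n - 1) for j in range(p)]
     for i in range(p)]

# Power iteration: repeated multiplication by R converges to the eigenvector
# with the largest eigenvalue, i.e. the component weights of the first PC.
w = [1.0, 0.0, 0.0, 0.0]
for _ in range(500):
    w = [sum(R[i][j] * w[j] for j in range(p)) for i in range(p)]
    norm = sqrt(sum(x * x for x in w))
    w = [x / norm for x in w]
if w[NAMES.index("NO2")] < 0:  # the overall sign is arbitrary; fix it here
    w = [-x for x in w]

eigenvalue = sum(w[i] * R[i][j] * w[j] for i in range(p) for j in range(p))
scores = [sum(z * weight for z, weight in zip(row, w)) for row in Z]

print("PC1 weights:", dict(zip(NAMES, (round(x, 3) for x in w))))
print("PC1 eigenvalue:", round(eigenvalue, 4))
print("PC1 score, station 8:", round(scores[7], 2))
print("PC1 score, station 21:", round(scores[20], 2))
```

With the sign fixed so that the NO2 weight is positive, a traffic-dominated station such as no. 8 receives a positive first-PC score and a rural station such as no. 21 a negative one.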

Thus, it is strongly recommended to standardize, making all variables carry equal weight. The PCA software generates a table of eigenvalues and their associated eigenvectors. The eigenvectors with the largest eigenvalues correspond to the dimensions that have the strongest correlation in the data set; the eigenvectors provide us with information about the patterns in the data. Table 2 shows these eigenvalues and their explained variances, obtained from the standardized Madrid pollution data in Table 1. Three criteria are commonly used to decide how many principal components to retain. According to the percentage of explained variance criterion, the first two principal components explain more than 90% of the data variance. The eigenvalue-one criterion is based on the fact that the average eigenvalue of centered and scaled data is just 1, so only components with eigenvalues greater than 1 are considered important. Consequently, as can be seen in Table 2, only the first principal component is significant. Finally, the Scree test is based on the fact that the residual variance levels off when the proper number of principal components is obtained; therefore, as shown in Figure 2, the curve drops steeply between the first and the second principal components and then levels off, so that the third and fourth principal components are not significant.
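The three criteria can be checked by simple arithmetic on the eigenvalues in Table 2. The following sketch is illustrative: the 90% threshold is taken from the figure quoted in the text rather than from any fixed rule, and the list of successive drops is only a numerical stand-in for reading the Scree plot by eye.

```python
# Eigenvalues copied from Table 2 (standardized Madrid data, four variables).
eigenvalues = [3.2710, 0.3308, 0.2110, 0.1872]

total = sum(eigenvalues)  # equals the number of variables for scaled data
explained = [100 * ev / total for ev in eigenvalues]

cumulative = []
running = 0.0
for share in explained:
    running += share
    cumulative.append(running)

# Explained-variance criterion: smallest number of PCs reaching the threshold.
THRESHOLD = 90.0  # percent; an illustrative choice
n_by_variance = next(i + 1 for i, c in enumerate(cumulative) if c >= THRESHOLD)

# Eigenvalue-one criterion: keep PCs whose eigenvalue exceeds the average (1).
n_by_eigenvalue = sum(1 for ev in eigenvalues if ev > 1.0)

# Scree test: successive drops in eigenvalue; one large drop followed by small
# ones marks the point where the curve levels off.
drops = [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]

print("explained variance (%):", [round(e, 2) for e in explained])
print("PCs by explained variance:", n_by_variance)
print("PCs by eigenvalue-one:", n_by_eigenvalue)
print("successive drops:", [round(d, 4) for d in drops])
```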

Figure 3. Scores plot for the first two principal components. Each point represents an air pollution control station of the Madrid region. For identification see Table 1, and for location in the Madrid region see Figure 1. The first principal component explains 81.8% of the total variance and the second 8.3%.

As can be seen, the stations placed in Madrid city or in large towns of the metropolitan area, and also in medium-sized towns (around 60,000 inhabitants or more) close to the principal motorway (red squares), have positive scores for the first PC (right side of the graph). Stations placed in rural areas or in medium- and small-sized towns (fewer than 50,000 inhabitants) far away from Madrid city and its metropolitan area (green squares) have negative scores for the first PC (left side of the graph). Station no. 24 is a special case because, although it belongs to the Madrid city municipality, it is located inside the Casa de Campo, one of the largest parks in the city (around 1722 ha). This explains why the air pollution measured at this station is less affected by road traffic. In summary, the scores diagram reveals that the air pollution control stations in the Madrid region fall into two distinct groups, a fact that is not readily apparent from the original data in Table 1.

The correlation and importance of the variables included in a PCA study (i.e., NO2, NO, O3, and PM10) must be analyzed by plotting the component weights in a principal component loading graph (Figure 4). This is usually done by drawing a line from the origin of the coordinate axes to the point defined by each pair of component weights. From this plot, information about the correlation of the variables can be inferred: the smaller the angle between the two lines representing a given pair of variables, the higher the correlation between them. In this case there is an important correlation between NO and particulate matter (PM10). On the other

Figure 2. Scree plot for the principal component model of the pollution data.

Since the interpretation of PCA results is usually carried out by visualizing the components’ scores and loadings, and since, in the present case, one criterion suggests a single principal component while the others suggest that two principal components are enough to explain the overall system performance, it is strongly advisable to choose the first two principal components.

Graphical Interpretation of Principal Components

After a decision has been made regarding the number of PCs to keep, the next step is to look at the structure of these new reduced variables and to discuss the qualitative information provided by principal component analysis. The commercial software used in this work offers this information as a table of


behavior similar to that of station no. 10. A similar conclusion can be drawn from the position in the graph of the score representing station no. 8 in relation to nitric oxide (NO).

Exploring Other Possibilities

In the previous sections, we have seen how to perform a principal component analysis (PCA) and how to interpret the results of such analysis by using data from the Madrid region in Spain. Students may also explore other possibilities based on data related to pollution from other cities or regions.20 These could then be correlated with, for example, climate or seasonal effects, including the effect of sunlight that facilitates photochemical reactions between some of the air pollutants, and social aspects like changes in the air pollution in big cities due to the weekend effect. Figure 6 shows the seasonal effect of the

Figure 4. Loadings plot for the first two principal components. Each line represents one of the variables included in the study of the Madrid region pollution.

hand, uncorrelated features are orthogonal to each other. Finally, when the lines representing a pair of variables point in opposite directions (e.g., O3 and NO2), those factors that favor the appearance of one variable favor the disappearance of the other, and vice versa. In this example, the reason is the reaction of tropospheric ozone with the NO emitted by engines, giving NO2. The size of a loading in relation to the considered principal component is a measure of the importance of that variable for the PC model. Loadings at the origin of the coordinate system or close to it represent unimportant features.

Scores and loadings can be interpreted jointly in a so-called biplot, which contains scores and loadings superimposed in the same graph (Figure 5). This type of graph provides at a glance all the information from the scores and loadings graphs and, moreover, allows one to appreciate the relationship between both.
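The relationships read from the loadings plot can be cross-checked against the raw data. The sketch below computes Pearson correlation coefficients from the columns of Table 1 and converts each one into the angle whose cosine equals that coefficient. Treating this angle as the one seen in the plot is only an approximation, since a two-dimensional loadings plot is a projection; the helper name `pearson` is introduced here for illustration.

```python
from math import acos, degrees, sqrt

# Daily means (ug/m3) for stations 1-24, copied from Table 1.
NO = [85.3, 13.5, 82.9, 59.0, 48.5, 35.5, 50.9, 132.6, 32.4, 9.7, 87.2, 31.0,
      2.4, 4.2, 58.1, 8.0, 3.4, 48.1, 1.0, 3.9, 1.0, 75.0, 51.0, 35.0]
NO2 = [76.6, 88.5, 57.0, 75.0, 72.5, 46.6, 73.8, 74.6, 63.9, 23.7, 72.0, 38.1,
       19.9, 19.7, 56.2, 23.1, 27.0, 52.3, 6.7, 24.8, 9.1, 83.3, 64.0, 54.7]
PM10 = [40.0, 40.0, 39.3, 40.4, 29.3, 36.5, 32.6, 49.6, 20.5, 19.8, 39.8, 19.3,
        19.5, 23.7, 30.7, 16.1, 15.2, 32.8, 8.8, 23.2, 12.4, 27.0, 25.2, 19.9]
O3 = [15.3, 16.1, 11.4, 12.6, 10.5, 13.4, 14.2, 8.2, 28.5, 30.8, 9.3, 18.3,
      42.0, 25.7, 12.8, 29.4, 41.0, 17.2, 54.9, 24.0, 74.2, 12.4, 13.5, 15.9]


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r_no_pm10 = pearson(NO, PM10)  # variables whose lines lie close together
r_o3_no2 = pearson(O3, NO2)    # variables whose lines point in opposite directions

for label, r in (("NO vs PM10", r_no_pm10), ("O3 vs NO2", r_o3_no2)):
    # acos(r) is the angle between two unit vectors whose cosine equals r.
    print(label, "r =", round(r, 2), "angle =", round(degrees(acos(r))), "deg")
```

A clearly positive coefficient corresponds to an acute angle and a clearly negative one to an obtuse angle, in line with the reading of Figure 4 given in the text.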

Figure 6. Three-dimensional biplot showing the seasonal effect on the air pollution in London (UK). Explained variance for each principal component: 53.9% for the first, 25.4% for the second, and 11.5% for the third principal component.

air pollution in London (UK) and how the predominant contributions to air pollution vary with the season. (Refer to the Supporting Information for the pollution data.) The data analyzed by PCA correspond to the monthly averages for NO2, NO, O3, CO, SO2, and PM10 measured at Westminster-Marylebone Road during 2014. In this case, two or three principal components (PCs) are needed to explain the overall system performance, depending on the criterion used for this estimation. Since the commercial software used here provides three-dimensional plot facilities, a three-dimensional biplot has been used for data interpretation. In Figure 6, the coldest months (autumn and winter) fall on the right side of the graph, between the loadings corresponding to the NO2, NO, CO, SO2, and PM10 pollutants, which are related to traffic and the use of heating systems. On the other hand, the warmer months (spring and summer) appear on the left side, indicating lower levels of primary air pollutants but higher ground-level ozone. This is because, on sunny days, concentrations of ozone can increase, leading in some cases to summertime “smog”, a type of air pollution that occurs during hot weather in built-up urban areas.

As mentioned earlier, road traffic in the center of big cities is usually much lower during weekends than on weekdays, with the expected effect on air pollution levels. This weekend effect is shown in Figure 7a, a biplot representing several days in August 2015 in Paris (France). (Pollution data can be found in the Supporting Information.) Measurements of the air pollution were carried out by the air pollution station named Paris Centre, located at Igor Stravinsky Square. Since weather has a significant effect on the

Figure 5. Biplot for the simultaneous characterization of the scores and loadings of two principal components. Explained variance for each principal component, as in Figure 3.

The proximity of objects to a loading vector reflects the importance of that variable for building the principal component model. In this example, the position of station no. 21, close to the loading corresponding to O3 and at the end of this vector, shows that it is the air pollution control station that measured the highest level of ozone, as can be seen in Table 1. Other air pollution stations placed in rural areas, such as nos. 19 and 13, also show a high level of ozone. In general, ozone pollution tends to be highest in the countryside, away from the Madrid metropolitan area. This is because certain pollutants that are more prevalent in urban areas (e.g., NO) are present in smaller proportions in these areas. Other rural stations (nos. 14, 16, and 20) show


ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available on the ACS Publications website at DOI: 10.1021/acs.jchemed.6b00550.

Details about missing data processing (PDF)
Air quality and atmospheric measures from Madrid Council, 19/11/2014 (XLSX)
London air quality data, 2014 (XLSX)
Paris Centre pollution data from the AIRPARIF Website, August 2015 (XLSX)

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

ORCID

Noelia Rosales-Conrado: 0000-0002-7984-8340

Notes

The authors declare no competing financial interest.



REFERENCES

(1) Miller, J. N.; Miller, J. C. Statistics and Chemometrics for Analytical Chemistry, 6th ed.; Prentice Hall: London, UK, 2011.
(2) Kellner, R.; Mermet, J. M.; Otto, M.; Valcárcel, M.; Widmer, H. M. Analytical Chemistry: A Modern Approach to Analytical Science, 2nd ed.; Wiley-VCH: Weinheim, Germany, 2004.
(3) Otto, M. Chemometrics: Statistics and Computer Application in Analytical Chemistry, 2nd ed.; Wiley-VCH: Weinheim, Germany, 2007.
(4) De Lorenzi Pezzolo, A. To see the World in a grain of sand: Recognizing the origin of sand specimens by diffuse reflectance infrared Fourier transform spectroscopy and multivariate exploratory data analysis. J. Chem. Educ. 2011, 88, 1304−1308.
(5) Reinholds, I.; Bartkevics, V.; Silvis, I. C. J.; van Ruth, S. M.; Esslinger, S. Analytical techniques combined with chemometrics for authentication and determination of contaminants in condiments: A review. J. Food Compos. Anal. 2015, 44, 56−72.
(6) Mostert, M. M. R.; Ayoko, G. A.; Kokot, S. Application of chemometrics to analysis of soil pollutants. TrAC, Trends Anal. Chem. 2010, 29 (5), 430−445.
(7) Madsen, R.; Lundstedt, T.; Trygg, J. Chemometrics in metabolomics-A review in human disease diagnosis. Anal. Chim. Acta 2010, 659 (1−2), 23−33.
(8) Saurina, J. Characterization of wines using compositional profiles and chemometrics. TrAC, Trends Anal. Chem. 2010, 29 (3), 234−245.
(9) Guillén-Casla, V.; Rosales-Conrado, N.; León-González, M. E.; Pérez-Arribas, L. V.; Polo-Díez, L. M. Principal component analysis (PCA) and multiple linear regression (MLR) statistical tools to evaluate the effect of E-beam irradiation on ready-to-eat food. J. Food Compos. Anal. 2011, 24 (3), 456−464.
(10) Rusak, D. A.; Brown, L. M.; Martin, S. D. Classification of Vegetable Oils by Principal Component Analysis. J. Chem. Educ. 2003, 80, 541−543.
(11) Besalú, E. From Periodic Properties to a Periodic Table Arrangement. J. Chem. Educ. 2013, 90, 1009−1013.
(12) Liland, K. H. Multivariate Methods in Metabolomics - from Pre-Processing to Dimension Reduction and Statistical Analysis. TrAC, Trends Anal. Chem. 2011, 30 (6), 827−841.
(13) The European Parliament and the Council. Directive 2008/50/EC of 21 May 2008 on Ambient Air Quality and Cleaner Air for Europe; Official Journal of the European Union, 11.6.2008, L 152.
(14) Concejalía de Medioambiente. Sistema de Vigilancia de la Calidad del aire. http://www.mambiente.munimadrid.es/sica/scripts/index.php (accessed Dec 2016).

Figure 7. Weekend effect on air pollution in (a) Paris (France) and (b) Madrid (Spain). The data points (squares) correspond to the first 23 days of August 2015. Red squares correspond to Sundays and holidays, blue-filled squares to Saturdays, and white-filled squares to the rest of the days. Explained variance: (a) 55.2% and 29.6% for the first and second principal components, respectively; (b) 67.4% and 20.4%.

air pollution, only the first 23 days of the month have been included in the analysis, due to the greater stability of the weather during that period. Sundays, holidays (15 August), and Saturdays fall at the center or left side, opposite to the loading corresponding to NO2 and CO, factors which are strongly related to the air pollution caused by road traffic. Similar results can be seen in Figure 7b, which represents the PCA biplot for the same dates in Madrid (Spain). Pollution data used in the PCA are the daily averages measured at Escuelas Aguirre air pollution control station, located in the city center.
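Before the daily averages are plotted, each day has to be assigned to one of the three groups distinguished in Figure 7. A minimal sketch of this labeling step is shown below, using Python's standard `datetime` module. The function name and category labels are hypothetical, and giving the holiday precedence over the day of the week (15 August 2015 fell on a Saturday) is an assumption about how the figure was prepared.

```python
from datetime import date

HOLIDAYS = {date(2015, 8, 15)}  # 15 August, the holiday named in the text


def day_category(day):
    """Label a date with one of the three groups used in Figure 7."""
    if day.weekday() == 6 or day in HOLIDAYS:  # Monday is 0, Sunday is 6
        return "sunday_or_holiday"
    if day.weekday() == 5:
        return "saturday"
    return "weekday"


days = [date(2015, 8, d) for d in range(1, 24)]  # first 23 days of August 2015
labels = {day.day: day_category(day) for day in days}

for category in ("sunday_or_holiday", "saturday", "weekday"):
    print(category, [d for d, c in labels.items() if c == category])
```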



CONCLUSION

Principal component analysis (PCA) is a statistical technique that has found application in finding patterns in high-dimensional data. PCA is a very useful aid for the interpretation of data sets. Its methodology and usefulness can be taught using data sets of air pollution measurements from different cities that students can download from different Internet pages. This statistical method is a simple way to visually display as much as possible of the total variation of the data in a few dimensions, and it is an excellent tool for looking into normal pollution patterns. Consequently, PCA can also be useful for students interested in environmental chemistry and pollution interpretation. These students can search for social or economic activities that can explain the different behaviors of the air pollution.


(15) Atmospheric Quality Area-Air Quality Network. http://gestiona.madrid.org/azul_internet/html/web/2.htm?ESTADO_MENU=2 (accessed Dec 2016).
(16) Air London. http://www.londonair.org.uk/london/asp/datadownload.asp (accessed Dec 2016).
(17) Airparif. http://www.airparif.asso.fr/en/ (accessed Dec 2016).
(18) Statgraphics Centurion. http://www.statgraphics.com/ (accessed Dec 2016).
(19) Instituto Nacional de Estadística. http://www.ine.es/jaxiT3/Tabla.htm?t=2881&L=1 (accessed Dec 2016).
(20) Shibata, H.; Branquinho, C.; McDowell, W. H.; Mitchell, M. J.; Monteith, D. T.; Tang, J.; Arvola, L.; Cruz, C.; Cusack, D. F.; Halada, L.; Kopácek, J.; Máguas, C.; Sajidu, S.; Schubert, H.; Tokuchi, N.; Záhora, J. Consequence of altered nitrogen cycles in the coupled human and ecological system under changing climate: The need for long-term and site-based research. Ambio 2015, 44, 178−193.
