Pollution Trees: Identifying Similarities among Complex Pollutant

Article pubs.acs.org/est

Pollution Trees: Identifying Similarities among Complex Pollutant Mixtures in Water and Correlating Them to Mutagenicity

Weiwei Zheng,†,⊥ Xia Wang,†,⊥ Dajun Tian,†,⊥ Hao Zhang,† Weidong Tian,‡ Melvin E. Andersen,§ Yuxin Zheng,∥ Xin Sun,∥ Songhui Jiang,† Zhaojin Cao,∥ Gengsheng He,† and Weidong Qu*,†

† Key Laboratory of Public Health Safety, Ministry of Education, Department of Environmental Health, School of Public Health, Fudan University, Shanghai 200032, China
‡ Institute of Biostatistics, School of Life Sciences, Fudan University, Shanghai 20043, China
§ Institute for Chemical Safety Sciences, The Hamner Institutes for Health Sciences, Research Triangle Park, North Carolina 27709, United States
∥ National Institute of Occupational Health and Poison Control, Chinese Center for Disease Control & Prevention, Beijing 100050, China

ABSTRACT: There are relatively few tools available for computing and visualizing similarities among complex mixtures and for correlating the chemical composition clusters with toxicological clusters of mixtures. Using the “intersection and union ratio (IUR)” and other traditional distance matrices on contaminant profiles of 33 specific water samples, we used “pollution trees” to compare these mixtures. The “pollution trees” constructed by neighbor-joining (NJ), maximum parsimony (MP), and maximum likelihood (ML) methods allowed comparison of similarities among these samples. The mutagenicity of each sample was then mapped to the “pollution tree”. The IUR-distance-based measure proved effective in comparing chemical composition and compound level differences between mixtures. We found a robust “pollution tree” containing seven major lineages with certain broad characteristics: treated municipal water samples were different from raw water samples, and untreated rural drinking water samples were similar to local water sources. The IUR-distance-based tree was more highly correlated to mutagenicity than were trees based on the other distance matrices, the MP/ML methods, or groupings by sampling group, region, or water type. IUR-distance-based “pollution trees” may become important tools for identifying similarities among real mixtures and examining chemical composition clusters in a toxicological context.



© 2012 American Chemical Society

INTRODUCTION

While conducting risk assessments for complex mixtures remains challenging,1 there are signs of some advances in methods.2 Improved detection technologies have revolutionized analysis of pollutants in water and air, allowing identification of individual compounds and characterization of the multiple components in environmental samples. However, evaluating the biological activity and the toxicological interactions of components in complex pollutant streams still poses significant challenges for assessing the health risk of real-world mixtures.3,4 There are still only a few tools for analyzing very large environmental data sets and correlating health effects to complex pollution profiles. Though toxicological concerns about chemical mixtures date back to at least the 1950s,5−7 it was not until 1986 that general guidelines for chemical mixture risk assessment became available from the U.S. EPA (United States Environmental Protection Agency), with subsequent revisions in 2000.8,9 The Society for Risk Analysis 2005 Annual Meeting discussed

information on current methods for chemical mixtures' health risk assessment.10 Additionally, the World Health Organization (WHO) issued a report on risk assessment of combined exposures to multiple chemicals.11 Some specific approaches have looked at ways to identify profiles of complex mixtures in the environment and assess their health risks.12−22 However, few studies worked with real-world mixtures,23 limiting their practical application.3,24,25 The U.S. EPA9 has emphasized the need for methods to determine “sufficiently similar mixtures” on the basis of chemical composition.4,26,27 One path forward to deal with real-world complex mixtures is to apply a “comparative method” that would cluster the complex mixtures on the basis of chemical composition and component levels and then

Received: February 22, 2012. Revised: June 1, 2012. Accepted: June 4, 2012. Published: June 4, 2012.
dx.doi.org/10.1021/es300728q | Environ. Sci. Technol. 2012, 46, 7274−7282

Environmental Science & Technology


Table 1. Group, Region, Sampling Period, Water Type, and Sample Size of Water Samples

a The colors illustrate the samples in each group and their corresponding symbols in the figures.

between two complex mixtures can be derived from assessing both the compounds present and the differences in their concentrations among the samples. The relationships among samples based on their complex composition can then be visualized in a tree-like form following the application of various algorithms. We hypothesized that the tree-building methods and “pollution trees” could be useful in displaying the similarities and differences between complex mixtures on the basis of chemical composition. To create measures for comparison, we defined and calculated the data matrix and distance matrix of real-world mixtures extracted from water samples using different measurements and then constructed pollution profile relationships by various tree-building methods from bioinformatics. Furthermore, we mapped and correlated the “pollution tree” lineages to mutagenic characteristics. Our “pollution tree” revealed sensible relationships among complex mixtures that correlated with their mutagenic potency.

investigate whether the chemical composition clusters correlate with toxicological clustering.2 This approach is similar, in principle, to that used for analyzing single compounds.9 A challenge in comparing and clustering real complex mixtures lies in the difficulty of dealing with thousands of compounds at various concentrations to create a pollution profile for the mixture. Some attempts have evaluated the similarity of disinfection byproduct (DBP) mixtures in drinking water.28−30 This work focused on specific DBP mixtures in water. The researchers selected the characteristics of the input and output water that were considered most important in affecting the degree of similarity, such as the indices of total organic carbon (TOC), total organic halogen (TOX), and total trihalomethanes (TTM). They used these summary metrics to account for unknown chemicals in the complex mixture, compositional changes, and possible interactions among chemicals.28−30 Other methods, such as principal component analysis (PCA), have been used to evaluate similarities between real-world mixtures directly based on the chemical composition of compounds in the samples.3,24,31 Because PCA mainly focuses on the correlated variables and provides only simple information on sample clustering, it does not readily reflect the relationship between any two mixtures. The traditional tools of statistics may be limited in correlating the toxicological effects to the whole pollution profiles, given the high dimensionality of real-world mixtures. Recent research on environmental forensic classification using cluster analysis tools provides new opportunities for clustering complex mixtures.25 Cluster analysis depends essentially on traditional distance measures (such as Euclidean distance and the Pearson correlation coefficient) among samples. These distances make more statistical and mathematical sense than simply using environmental, toxicological, or biological assumptions about the mixtures.
More effective and robust distance measurements are obviously needed to compare similarities or differences of chemical composition and compound levels between complex mixtures. Successful application of tree-building methods in historical linguistics32,33 suggests a path forward for using these tools in environmental and toxicological analysis of mixtures. Moreover, mapping of biological tree lineages has clearly revealed pathogen phylogeography, ecological community structure, culture diffusion, and even human origin.35−38 A “pollution tree” idea and approach has promise to show relationships among different water samples. The relationship



MATERIALS AND METHODS

Extraction of Mixtures from Water Samples. A total of 33 water samples were collected from 2 rural regions and 4 municipal regions in South China (Table 1). Water types in rural regions included surface water and groundwater. In municipal regions, water source, raw water, and water samples in the treatment process were collected in the water plants (Table 1). Organic compounds were extracted from water samples by XAD-2 resins.39

Pollution Profile Analysis. The whole pollution profiles of extracted mixtures were analyzed by gas chromatography−mass spectrometry (GC-MS) using published methods.39 A GC 2010 (Shimadzu Instruments, Japan) with a 30-m fused silica capillary column (0.25 mm i.d., 0.25 μm film thickness; Agilent, USA) was used for sample introduction into the mass spectrometer. The GC oven program was as follows: 40 °C for 1 min, 30 °C/min to 130 °C (for 3 min), 12 °C/min to 180 °C, 7 °C/min to 240 °C, and 12 °C/min to 300 °C (for 5 min).39 A Mass 2010 (Shimadzu Instruments) quadrupole mass spectrometer operated in the EI mode (70 eV) obtained mass spectra. The ion source temperature was 280 °C and the GC interface temperature was 300 °C. A scan range of 40−450 m/z was used for full scan analysis of samples.

Mutagenicity Assay. The mutagenicity tests on organic extracts were performed using the standard plate-incorporation Ames Salmonella typhimurium assay.40 The Salmonella


Table 2. Detailed Information on the Different Distance Matrices of Mixtures

Matrix: “Intersection and union ratio” (IUR) matrix
Similarity: ratio of the number of “intersection” compounds to the number of “union” compounds
Distance: 1 − similarity, weighted by the CV_k of common compound levels
Formula:
$$d_{ij} = 1 - \frac{\sum_{k=1}^{m}\left(1 - \frac{CV_k}{2}\right)}{M}, \qquad M = n_i + n_j - m$$
$d_{ij}$: distance between mixture samples $S_i$ and $S_j$
$m$: the number of common compounds between $S_i$ and $S_j$
$M$: the number of “union” compounds between $S_i$ and $S_j$
$n_i$ or $n_j$: the total number of detected peaks in $S_i$ or $S_j$
$CV_k$: coefficient of variation of the $k$th common compound between $S_i$ and $S_j$

Matrix: Euclidean distance matrix
Distance: Euclidean distance
Formula:
$$d_{ij} = \sqrt{\sum_{k=1}^{n}\left(s_{ik} - s_{jk}\right)^2}$$
$s_{ik}$ or $s_{jk}$: the peak area of the $k$th compound in $S_i$ or $S_j$
$n$: the total number of detected compounds in $S_i$ or $S_j$

Matrix: Pearson correlation distance matrix
Similarity: Pearson correlation coefficient
Distance: 1 − similarity
Formula:
$$\rho_{ij} = \frac{\sum_{k=1}^{n}\left(s_{ik} - \bar{s}_i\right)\left(s_{jk} - \bar{s}_j\right)}{\sqrt{\sum_{k=1}^{n}\left(s_{ik} - \bar{s}_i\right)^2 \sum_{k=1}^{n}\left(s_{jk} - \bar{s}_j\right)^2}}, \qquad d_{ij} = \left|1 - \rho_{ij}\right|$$
$\rho_{ij}$: Pearson correlation coefficient between $S_i$ and $S_j$
$\bar{s}_i$ or $\bar{s}_j$: the average peak area of all compounds in $S_i$ or $S_j$

Matrix: Cosine correlation distance matrix
Similarity: Cosine correlation coefficient
Distance: 1 − similarity
Formula:
$$\cos(\theta_{ij}) = \frac{S_i \cdot S_j}{\lVert S_i \rVert \times \lVert S_j \rVert}, \qquad d_{ij} = 1 - \cos(\theta_{ij})$$
$\cos(\theta_{ij})$: the Cosine correlation coefficient between $S_i$ and $S_j$
$S_i \cdot S_j$: the dot product between vectors $S_i$ and $S_j$
$\lVert S_i \rVert$ or $\lVert S_j \rVert$: the norm (magnitude) of vector $S_i$ or $S_j$
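The three traditional measures in Table 2 can be illustrated with a minimal Python sketch. The authors computed them with statistical software; the function names and the toy peak-area vectors below are hypothetical and serve only to show how each formula is evaluated for two samples with the same compound columns.

```python
import math

def euclidean_distance(si, sj):
    """Square root of the summed squared peak-area differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(si, sj)))

def pearson_distance(si, sj):
    """|1 - rho|, where rho is the Pearson correlation of the two profiles."""
    n = len(si)
    mean_i, mean_j = sum(si) / n, sum(sj) / n
    num = sum((a - mean_i) * (b - mean_j) for a, b in zip(si, sj))
    den = math.sqrt(sum((a - mean_i) ** 2 for a in si)
                    * sum((b - mean_j) ** 2 for b in sj))
    return abs(1 - num / den)

def cosine_distance(si, sj):
    """1 - cos(theta), the angle being taken between the two profile vectors."""
    dot = sum(a * b for a, b in zip(si, sj))
    norms = math.sqrt(sum(a * a for a in si)) * math.sqrt(sum(b * b for b in sj))
    return 1 - dot / norms

# Toy profiles (hypothetical values): sample_b is sample_a scaled by 2.
sample_a = [1.0, 2.0, 3.0]
sample_b = [2.0, 4.0, 6.0]
print(euclidean_distance(sample_a, sample_b))
print(pearson_distance(sample_a, sample_b))
print(cosine_distance(sample_a, sample_b))
```

In this toy case the Pearson and Cosine distances are essentially zero even though the peak areas differ by a factor of 2, which shows that these two measures respond to the shape of a profile rather than to its absolute levels.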

Data Matrix Conversion. After establishing the data matrix (X), the presence or absence of each compound was coded as “1” or “0”, respectively, to produce a binary matrix (Y) of all matched and included compounds in X.32,33 The converted data matrix (Y) was then used for character-feature-based tree-building methods.

Distance Matrix. In similarity comparison and tree construction, the critical step is defining and calculating the “distance” or “similarity” matrix of all mixtures. To quantitatively represent the “similarity” or “dissimilarity” between complex mixtures, we defined an “intersection and union ratio distance” (IUR distance). For comparison between every two mixtures, the “similarity” can be denoted by the ratio of the number in the “intersection compound” group to the number in the “union compound” group. The “intersection” is the number of common compounds between the two mixtures. The “union” is the sum of all detected peaks in the two mixtures minus the number of “intersection” compounds. An increased ratio of “intersection” compounds to “union” compounds reflects increasing similarity between mixtures. However, the differences in common compound levels should also be considered in the “similarity” comparison. We calculated the coefficient of variation (CV) of each “intersection” compound between every two mixtures: the “intersection and union ratio” was weighted by the CV of each

typhimurium strains TA 98 and TA 100 with and without S9 were used for the assay. Mutagenicity was expressed as revertants per liter of water sample extracts, and mutagenic potency was calculated from the slopes of the regression lines of the dose−response curves at three doses with three replicates at each dose.

Common Compound Matching. Following earlier work,3 we wrote programs and macrocommands in SAS 9.2 (ref 41) for automatically matching the resolved spectra. Since retention times do not give sufficient information for identifying compounds, the resolved mass spectra were combined with retention times to ascertain that the same compound was represented by the same identification number in all samples.3,24 As did Eide et al.,3 we evaluated similarity between spectra based on several criteria: (1) peaks appeared within 4 min, (2) the 10 most significant intensities of each resolved mass spectrum were employed to avoid disturbances from small noise mass information, and (3) a similarity index of 0.8 was set.3,24 Finally, we calculated the integrated areas of the remaining resolved chromatograms. In the resulting data matrix (X), each row represented a water sample and each column denoted one compound, the latter identified by its mean retention time. Each value in X represented the peak area of the compound in the extract.
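The matching criteria above can be sketched in Python as follows. This is a hedged illustration, not the authors' SAS programs: the text does not state how the similarity index was computed, so the cosine similarity used here is an assumption, as are the function names, the dictionary layout of a peak, and the example spectra. The retention-time window (4 min), the 10 most intense ions, and the 0.8 threshold are taken from the text.

```python
import math

def top_peaks(spectrum, n=10):
    """Keep the n most intense ions of a spectrum given as {m/z: intensity}."""
    strongest = sorted(spectrum.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return dict(strongest)

def similarity_index(spec_a, spec_b, n=10):
    """Cosine similarity between two reduced spectra (assumed index, see lead-in)."""
    a, b = top_peaks(spec_a, n), top_peaks(spec_b, n)
    dot = sum(a[mz] * b[mz] for mz in a if mz in b)
    norms = (math.sqrt(sum(v * v for v in a.values()))
             * math.sqrt(sum(v * v for v in b.values())))
    return dot / norms if norms else 0.0

def same_compound(peak_a, peak_b, rt_window=4.0, threshold=0.8):
    """Two peaks match if their retention times are close and their spectra agree."""
    close_in_time = abs(peak_a["rt"] - peak_b["rt"]) <= rt_window
    return close_in_time and (
        similarity_index(peak_a["spectrum"], peak_b["spectrum"]) >= threshold
    )

# Hypothetical peaks from two samples: retention time in minutes, spectrum as m/z -> intensity.
peak_1 = {"rt": 10.0, "spectrum": {57: 100, 71: 60, 85: 30}}
peak_2 = {"rt": 10.5, "spectrum": {57: 95, 71: 65, 85: 25}}
peak_3 = {"rt": 10.2, "spectrum": {91: 100, 65: 20}}
print(same_compound(peak_1, peak_2), same_compound(peak_1, peak_3))
```

Peaks judged to be the same compound would then share one column of the data matrix X, with the integrated peak area as the cell value.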


Figure 1. “Pollution tree” of complex mixtures in water samples constructed by IUR-distance-based NJ methods. Color patterns of branches in the tree represented 7 lineages. Respective symbols denote various types of water samples. Different colors of these symbols show the respective groups of water samples referred to in Table 1. Values near each branch (in blue) are the bootstrap proportions as a percentage.

“Pollution Trees” Construction. The neighbor-joining (NJ) method was used to construct trees of complex mixtures in MEGA 4.1 (ref 43) based on the IUR and other distance matrices. The binary data matrix (Y) was used for constructing “pollution trees” by maximum parsimony (MP) and maximum likelihood (ML) methods in PAUP 4.0 beta 10 (ref 44). Topological robustness was investigated using 10 000 nonparametric bootstrap replicates.

Correlation of Pollution Profiles to Mutagenicity. Using scatter plots, we analyzed the clustering of the mutagenicity of mixtures and mapped the clusters to their pollution lineages from the “pollution tree”. The correlation between pollution lineages of mixtures and their mutagenicity was computed in Stata 10.0 (ref 45).
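For readers unfamiliar with neighbor joining, the following Python sketch shows the standard algorithm applied to a small distance matrix. It is an illustration only and is not the MEGA implementation used in the study; the function name, the nested-parentheses node labels, and the five-sample toy matrix are hypothetical, and bootstrap resampling is not included.

```python
def neighbor_joining(dist, names):
    """Return the successive joins (node_a, node_b, branch_a, branch_b) and the final edge."""
    d = [list(map(float, row)) for row in dist]
    nodes = list(names)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        row_sums = [sum(row) for row in d]
        # Choose the pair that minimizes Q(i, j) = (n - 2) d(i, j) - r_i - r_j.
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best
        # Branch lengths from the joined pair to the new internal node.
        len_i = 0.5 * d[i][j] + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
        len_j = d[i][j] - len_i
        joins.append((nodes[i], nodes[j], len_i, len_j))
        # Distances from the new node to every remaining node.
        keep = [k for k in range(n) if k != i and k != j]
        to_new = [0.5 * (d[i][k] + d[j][k] - d[i][j]) for k in keep]
        d = [[d[a][b] for b in keep] + [to_new[x]] for x, a in enumerate(keep)]
        d.append(to_new + [0.0])
        nodes = [nodes[k] for k in keep] + ["(" + nodes[i] + "," + nodes[j] + ")"]
    return joins, (nodes[0], nodes[1], d[0][1])

# Toy symmetric distance matrix for five hypothetical samples.
names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
joins, last_edge = neighbor_joining(dist, names)
for step in joins:
    print(step)
print(last_edge)
```

In the study, the input would be the 33 × 33 IUR (or Euclidean, Pearson, or Cosine) distance matrix, and each bootstrap replicate would repeat the tree construction on resampled data to obtain the branch support values.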

common compound. The maximum value of the CV, $2^{1/2}$, occurs when the peak area of the common compound is zero (not detected) in one sample. Accordingly, CV/2 never exceeds $2^{1/2}/2$, and 1 − CV/2 is always positive. We used (1 − CV/2) to adjust the IUR distance formula. This adjusted IUR distance reflects not only the similarity in compound composition but also the similarity in compound level between two samples. For instance, if the CV value increases (which indicates increased deviation in compound level between two samples), the IUR distance value will correspondingly increase, showing that the two samples are less similar. A detailed description and the formula are given in Table 2. Additionally, we calculated traditional “distance” and “similarity” measures for mixtures. We used SPSS 18.0 (ref 42) to calculate the Euclidean distance between every two mixtures (Table 2). Also, “similarity” between mixtures was defined separately by the general Pearson and Cosine correlation coefficients (Table 2). The relevant distance was then calculated according to the formulas in Table 2.
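The IUR distance described above and in Table 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each sample is represented as a dictionary mapping a compound identifier to a positive peak area (only detected peaks are listed), and the CV of a common compound is computed from the sample standard deviation of its two peak areas, which reproduces the maximum of $2^{1/2}$ mentioned in the text. The function names and example values are hypothetical.

```python
import math

def cv_pair(area_a, area_b):
    """Coefficient of variation of two peak areas (sample SD divided by the mean)."""
    mean = (area_a + area_b) / 2
    sd = abs(area_a - area_b) / math.sqrt(2)  # sample SD of two values
    return sd / mean

def iur_distance(sample_i, sample_j):
    """IUR distance between two non-empty samples given as {compound_id: peak_area}."""
    common = set(sample_i) & set(sample_j)
    m = len(common)                               # "intersection" compounds
    union_size = len(sample_i) + len(sample_j) - m  # M = n_i + n_j - m
    weighted = sum(1 - cv_pair(sample_i[k], sample_j[k]) / 2 for k in common)
    return 1 - weighted / union_size

# Hypothetical samples: c2 has identical levels, c3 differs threefold.
sample_1 = {"c1": 100.0, "c2": 200.0, "c3": 50.0}
sample_2 = {"c2": 200.0, "c3": 150.0, "c4": 80.0}
print(iur_distance(sample_1, sample_2))
```

Two identical samples give a distance of 0, and two samples with no compounds in common give a distance of 1, so the measure combines the composition overlap with the level differences of the shared compounds.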



RESULTS AND DISCUSSION

Pollution Profiles of Complex Mixtures in Water. The sampling period, region, and sample size of water samples and their respective groups were characterized by different color patterns (Table 1 and SI Table S1). The numbers of detected


to the neighboring branches; the surface water also clustered at the adjacent branches (Figure 1). For samples collected during identical or similar periods, pollution profiles of local water samples of the same type were more similar to each other than to water samples of different types. The cluster patterns of mixtures denoted by the “pollution tree” are consistent with the known relationships among these water samples, further supporting the use of the IUR distance matrix for constructing the “pollution tree”. Our “pollution tree” showed that real mixtures extracted from water samples collected in the same region during identical or close times clustered according to the type of sample (Figure 1). Pollution profiles of water samples in different portions of the treatment process of the same water plant were more similar to each other than to local natural water bodies. The main reason is that all water plants use chlorination for drinking water disinfection, and disinfection byproducts emerge in the water during the chlorination treatment process.46 In rural areas, the untreated underground water showed pollution profiles similar to local surface water. This observation suggests that local surface water pollution might adversely affect residents’ drinking water.

Mutagenic Effects Analysis. None of the mixtures were mutagenic in the TA 100 strain, either in the presence or absence of S9. With the TA 98 strain, extracts of water samples from municipal water plants (groups 1, 2, and 5) had higher mutagenic potency than water from the rural areas (groups 3, 4, 6, and 7) (SI Table S1). Water samples of HS in 2009 (group 6) showed the highest mutagenicity (38.1−111.8) (SI Table S1). XS mixtures separately sampled and extracted in 2005.4 and 2005.7 (groups 1 and 2) had similar mutagenicity (9.67−31.47). However, SK samples (group 8) had much lower mutagenicity (2.49−7.14).
For rural samples, mutagenic characteristics of pollution mixtures from water samples in KN during 2009 (groups 6 and 7) were different from those of mixtures in BB during 2008 (groups 3 and 4).

Mapping and Correlation of Biological Activity to the “Pollution Trees”. Figure 2 shows the scatter plot of mixtures based on their mutagenic potency to TA 98 with (Y-axis) and without S9 (X-axis). The plot was mapped to the groups (sampling region and period) of mixtures and their pollution profile lineages in the “pollution tree”. Some mixtures clustered together: despite the fact that they were from different locations, they showed similar mutagenic potency. There were also mixtures belonging to different groups that showed similar mutagenic potency (Figure 2). The pollution profile lineages constructed by the “pollution tree” mapped more closely to mutagenicity than they did to sample group (Figure 2). For instance, the circle in the lower right of Figure 2 encloses only two samples, both belonging to lineage 3 in the IUR tree (Figure 1). These two samples were both river water samples, and their chemical composition cluster was different from that of other samples collected in the same region during the same or close periods (Figure 1). Also, the mutagenic cluster of these two isolated samples was not close to the mutagenic cluster of other samples in the same region during the same or close periods (Figure 2). In summary, the clusters of samples based on their pollution profile lineages corresponded more closely to the mutagenic clusters than did clusters based on spatial and temporal properties (Figure 2). Therefore, the chemical composition cluster was more biologically important than other factors such as spatial, temporal, or source-related properties.

peaks in extracts from 33 water samples ranged from 245 to 387. The total number of common compounds (those appearing in at least two samples) in all mixtures was 1068, based on similarity matching among all resolved peaks using both retention time and mass spectrum information. Therefore, both the X and Y matrices had the size 33 × 1068. Most of the peak areas of common compounds were higher than 100 000, suggesting that the common compounds between the mixtures were present at relatively high concentrations.

Common Compounds and Distance Matrices. On average, 135 common compounds (ranging from 108 to 187) were matched between every two complex mixtures. The number of “union” compounds between mixtures ranged from 496 to 560. Therefore, the common compounds accounted for 1/4 to 1/3 of all “union” compounds between every two mixtures. These results indicate high diversity of pollution profiles among these water samples. The “intersection and union ratio” distance matrix was obtained by weighting the ratio by the level differences of each common compound between every two mixtures according to the formula in Table 2. The distance between every two mixtures ranged from 0.757 to 0.873. The ranges of the calculated Euclidean, Pearson correlation, and Cosine correlation distance matrices were 0.019−0.353, 0.274−1, and 0.272−1, respectively.

NJ Trees Based on Euclidean, Pearson, and Cosine Correlation Distance. Based on the different distance matrices, the “pollution trees” were built by NJ methods to construct pollution profile relationships between these water samples. These trees showed different topological structures (Figure 1 and SI Figures S1−S3). The trees based on Euclidean, Pearson correlation, and Cosine correlation distance did not cluster the samples by water source (SI Figures S1−S3). Some samples from different regions during various periods were in the same or neighboring branches in these trees (SI Figures S1−S3).
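As a rough plausibility check on the IUR distance range reported in this section, the Table 2 formula can be bounded using only the stated share of common compounds. Each weight $1 - CV_k/2$ lies between $1 - 2^{1/2}/2 \approx 0.293$ and 1, so for any pair of mixtures

$$1 - \frac{m}{M} \;\le\; d_{ij} \;\le\; 1 - \left(1 - \frac{\sqrt{2}}{2}\right)\frac{m}{M}.$$

With $m/M$ between about 1/4 and 1/3, as stated above, the lower bound lies between $1 - 1/3 \approx 0.667$ and $1 - 1/4 = 0.75$, and the upper bound lies between $1 - 0.293/3 \approx 0.902$ and $1 - 0.293/4 \approx 0.927$. The reported range of 0.757−0.873 falls inside the overall interval of roughly 0.667−0.927. These bounds are loose, because they assume nothing about the actual CV values of the individual pairs.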
In all the trees built using these other distance matrices, isolated mixtures were observed (SI Figures S1−S3).

IUR, MP, and ML Trees Identified Relationships among Water Samples. The “pollution tree” constructed based on the IUR distance matrix using NJ methods had a topological structure similar to the MP and ML trees based on the Y matrix and contained 7 distinct lineages (Figure 1 and SI Figure S4), which showed different topological characteristics compared to the other NJ trees (SI Figures S1−S3). There were no isolated mixtures in this tree (Figure 1). The pollution profiles of water samples in the same group (identical sampling period and region) were not entirely clustered together in the same branch (lineage). However, mixtures extracted from the same region at nearly the same sampling time did cluster in the same or neighboring branches (Figure 1). They had more similar pollution profiles than mixtures sampled from different regions at different times. Pollution profiles of complex mixtures in these water samples showed temporal−spatial cluster trends. All of these clusters of mixtures were supported by relatively high bootstrap proportion values (Figure 1). Pollution profiles of water samples collected from municipal water plants clustered in three separate lineages (lineages 1, 2, and 5) (Figure 1 and SI Figure S4). In each lineage, mixtures from the same water plant during the same or close periods clustered together based on water type. Pollution profiles of mixtures in finished water were more analogous to those of mixtures in the treatment process water than to those in raw water (or water source). With the treatment process, the differences between pollution mixtures in treated and untreated water samples increased. In rural areas, complex mixtures in underground water belonged


Effectiveness, Reliability, and Robustness of “Pollution Tree” Methods. (1). GC-MS Analysis. In our study, GC-MS analysis measured sample composition. We could not identify compounds with high polarity and low volatility. These analytes require liquid chromatography−mass spectrometry (LC-MS). Nonetheless, the GC-MS data alone were successful in showing similarities among these water samples. (2). Compound Components in the Data Matrix. More quantitative and qualitative analytical methods using standards and calibration curves are only available for investigating specific compounds. They are less adequate for resolving the entire pollution profiles of samples because it is not possible to identify and have standards for every analyte. Nonetheless, after matching peaks based on retention times and peak areas, we can still establish a data matrix containing all peaks and their relative levels. The CV of peak areas can be used to compare the relative concentration of each peak between mixtures without knowing the identity or absolute concentration of the analyte. (3). Comparison to the Other Approaches. Partial least-squares (PLS) analysis has been used to cluster mixtures on a toxicological basis.3,24 Real-world complex mixtures usually contain thousands of components, and Eide et al.3 discussed the limitation of applying PLS to variable matrices of such high dimensionality. We used PLS to correlate the chemical composition matrix to the mutagenicity of mixtures. However, the correlation coefficient was very low (r = 0.309). In bioinformatics, the “mapping” methodologies provide a possible tool for constructing the relationship between pollution profiles of much higher dimension and toxicological effects.
The better correlation of chemical clusters represented by the IUR tree to the toxicological effects (mutagenicity) (Table 3) indicates that the “pollution tree” method framework is a much more effective and easily explained way of clustering mixtures on the basis of this particular toxicological end point.

IUR-Distance-Based Tree is the Most Effective and Robust of All Trees. (1). IUR is the Most Reasonable Distance Matrix. The distance matrix was central to creating the “pollution tree” structures by distance-based tree-building methods. We defined an IUR distance matrix that represented compound composition differences and the differences in concentration of each common compound between every two mixtures. Other measures of distance, such as Euclidean distance, Pearson correlation, and Cosine correlation, mainly reflect the mathematical and statistical features of vectors more than the environmental characteristics of real mixtures. (2). Comparison to MP and ML Trees. The character-feature-based methods (MP and ML) using converted binary data compare the presence or absence information at each compound site of the pollution profiles between samples. The MP and ML trees can then effectively represent the source-related, geographical, or other property-related clusters. Our constructed MP and ML trees showed a topological structure similar to that of the IUR tree (see Figure 1 and SI Figure S4), as well as similar temporal−spatial patterns and source-related clusters of the real-world mixtures. However, the absolute or relative concentration of components is not represented by the MP and ML trees. This deficiency limits the application of MP and ML trees in the context of toxicology, as demonstrated by the lower correlation with mutagenicity of the chemical composition clusters represented by the MP/ML trees compared with the IUR tree (Table 3). Components that were common between every two mixtures accounted for only 1/4 to 1/3 of all “union” compounds.

Figure 2. Scatter plot of mutagenicity of mixtures mapped to groups and pollution lineages in the “pollution tree” constructed by IUR-distance-based NJ methods. Pollution tree lineages are denoted by different colored symbols. The 7 circles respectively enclose the samples belonging to the 7 different lineages in the IUR tree. The numeric labels beside the symbols indicate the sample groups (SI Table S1). The vertical axis represents the mutagenicity on TA98 + S9 and the horizontal axis represents the mutagenicity on TA98 − S9.

The pollution profile lineage constructed by the “intersection and union ratio” distance matrix correlated significantly with the mutagenic potency (r = 0.909, P < 0.01) (Table 3).

Table 3. Correlation Relationship between Pollution Profile Lineages of Mixtures and Their Mutagenicity (Coefficient r)

Variable                                                                        TA98 + S9   TA98 − S9   TA98 ± S9a
Pollution profile lineage (“intersection and union ratio” distance matrix)      0.858       0.918       0.909
Pollution profile lineage (Euclidean distance matrix)                           0.421       0.487       0.473
Pollution profile lineage (Pearson correlation distance matrix)                 0.501       0.495       0.501
Pollution profile lineage (Cosine correlation distance matrix)                  0.592       0.463       0.480
Pollution profile lineage (MP and ML trees)                                     0.618       0.686       0.654
Group (sampling region and period)                                              0.621       0.650       0.633
Region                                                                          0.542       0.589       0.572
Water type                                                                      0.201       0.298       0.231

a The canonical correlation between the pollution lineage and mutagenicity of strain TA98 with and without S9.

This correlation was higher than that based on trees constructed from any of the other three distance matrices (0.473−0.501), by the MP and ML methods (0.654), or by sampling group (0.633), region (0.572), or water type (0.231) (Table 3). The temporal−spatial and source-related attributes were also mapped to the chemical composition clusters and showed the temporal−spatial cluster trends of the samples (Figure 1). The mutagenicity clusters correlated much more strongly with the chemical composition clustering than with the other property-based clusters, such as sampling region, time, and water type. In this case, at least, the IUR-distance-based “pollution tree” correlates closely with a specific toxicological property of the mixtures.


Just as is true in the biological sciences, environmental scientists deal increasingly with enormous amounts of information (including increasing numbers of contaminants in environmental samples and the more complex chemical nature of the components in these environmental mixtures). Our results indicate that the “pollution tree” approach is a useful computational and visualization tool for determining similarities and differences among real complex mixtures. “Pollution tree” approaches should have broad application with various real-world, complex mixtures.

Therefore, many positions of the binary data matrix (Y) were zero. Similarly, Eide et al.3 constructed a data matrix based on compound matching for pattern recognition in the toxicological evaluation of mixtures, and 70% of the values in the total matrix were zero. While the large number of null values may confound some approaches for comparing mixtures,3 the IUR-distance-based tree is not affected by the large number of null values in the comparison between every two samples. In addition, the IUR-distance-based tree can effectively measure the similarity of chemical composition, including compound level deviation between mixtures, and is more relevant to their biological activity. We therefore conclude that, among all the trees, the IUR tree should be the primary choice for assessing the toxicological risk of mixtures.

Sources of Biological Activity in Complex Mixtures. For complex mixtures in the environment, the correlation between pollution profiles and biological effects resembles the relationship between the “pollution type” of mixtures and their “toxicological type” or “function”. In this study, the “pollution tree” method framework successfully clustered the whole chemical composition of mixtures and showed its relevance to a toxicological end point. In contrast, other approaches focused on specific components or compound classes in the complex mixtures rather than the whole “pollution type”.17,18 Of course, for other samples and biological effects, the whole chemical composition clusters represented by the “pollution tree” may not correspond to the toxicological clusters of mixtures. In these cases, a “pollution tree” would still provide useful information, showing that specific contaminants rather than the whole pollution profiles are likely responsible for the toxicological effects.
We expect that, with increasing usage, it will be possible to refine the ability of the “pollution-tree” approach to determine whether the sample type or individual components themselves are the main source of biological activity in the water samples.

Perspectives. Our results represent the first attempt to use bioinformatics tree-building methods to highlight similarities and differences in pollutant composition between real complex mixtures. The constructed IUR, MP, and ML “pollution trees” clearly highlighted similarities and differences between mixtures containing thousands of compounds. The pollution profile lineages were mapped to both environmental and toxicological characteristics, i.e., sample mutagenicity, suggesting that general similarities in composition of these particular mixtures correlate with this specific measure of biological activity. The theory of tree-building models and the algorithms from bioinformatics have advantages in analyzing complex biological issues because they are more concordant with the chemical and biological properties of the mixtures. However, these tree methods have so far seen little application in the environmental sciences for assessing similarities of exposures and of complex mixtures. Given our success in examining the family tree of the water samples, we believe that these methods are widely applicable for investigating the temporal and spatial evolution trends of compounds in environmental mixtures, estimating the evolutionary rate of contaminants, and determining the “ancestor” (source of pollution).25 One caveat of this study is that we mapped only a single end point, mutagenicity, to the “pollution trees”. In the future, other toxicological end points can also be mapped to see the extent to which other responses of the mixtures map to the chemical composition clusters provided by these “pollution trees”.
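The statement above that null values do not affect the IUR-based comparison can be illustrated with a minimal sketch. It assumes a simplified distance of the form 1 − |A ∩ B| / |A ∪ B| computed on binary presence/absence profiles; this is an illustrative assumption and not necessarily the exact IUR definition used in this study, which also accounts for compound-level deviation. The function names, the sample vectors, and the simple-matching comparison are hypothetical.

```python
def iur_distance(a, b):
    """Assumed simplified form: 1 - |A ∩ B| / |A ∪ B| on binary profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    inter = sum(1 for x, y in zip(a, b) if x and y)   # compounds in both samples
    union = sum(1 for x, y in zip(a, b) if x or y)    # compounds in either sample
    if union == 0:
        return 0.0  # two empty profiles are treated as identical
    return 1.0 - inter / union


def simple_matching_distance(a, b):
    """Fraction of positions that differ; joint absences count as agreement."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    mismatches = sum(1 for x, y in zip(a, b) if bool(x) != bool(y))
    return mismatches / len(a)


# Hypothetical profiles: 1 = compound detected, 0 = not detected.
sample_a = [1, 1, 0, 1]
sample_b = [1, 0, 1, 1]

# Append 96 compounds that are absent from both samples (shared null values).
padding = [0] * 96
padded_a = sample_a + padding
padded_b = sample_b + padding

d_small = iur_distance(sample_a, sample_b)
d_padded = iur_distance(padded_a, padded_b)
smd_small = simple_matching_distance(sample_a, sample_b)
smd_padded = simple_matching_distance(padded_a, padded_b)

print(d_small, d_padded)      # the ratio-based distance is unchanged by shared zeros
print(smd_small, smd_padded)  # the matching-based distance shrinks as zeros are added
```

In this sketch the ratio-based distance ignores positions where both samples are zero, so adding undetected compounds leaves it unchanged, whereas a measure that counts joint absences as agreement makes sparse samples look progressively more similar.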



ASSOCIATED CONTENT

Supporting Information

Additional description of results including figures and tables. This information is available free of charge via the Internet at http://pubs.acs.org/



AUTHOR INFORMATION

Corresponding Author

*Tel.: 86-21-54237203; fax: 86-21-64045165; e-mail: wdqu@fudan.edu.cn.

Author Contributions

⊥These authors contributed equally to this work.

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

This project was supported by National Key Technology R&D Program in the 11th Five Year Plan (2006BAI19B02 and 2008ZX07421-004), National Natural Science Foundation of China (30972438 and 30771770), Key Project of National High-tech R&D Program of China (863 Program) (2008AA062501), and “Dawn” Program of Shanghai Education Commission (07SG01). We gratefully thank Professor Yang Zhong (School of Life Sciences, Fudan University) for his crucial review and discussion. The suggestions and critical reviews from three anonymous reviewers are greatly appreciated.



NOMENCLATURE

CV: coefficient of variation
IUR: intersection and union ratio
ML: maximum likelihood
MP: maximum parsimony
NJ: neighbor-joining
PCA: principal component analysis



REFERENCES

(1) Lang, L. Strange brew: Assessing risk of chemical mixtures. Environ. Health Perspect. 1995, 103, 142−145.
(2) Teuschler, L. K. Deciding which chemical mixtures risk assessment methods work best for what mixtures. Toxicol. Appl. Pharmacol. 2007, 223, 139−147.
(3) Eide, I.; Neverdal, G.; Thorvaldsen, B.; Grung, B.; Kvalheim, O. M. Toxicological evaluation of complex mixtures by pattern recognition: Correlating chemical fingerprints to mutagenicity. Environ. Health Perspect. 2002, 110 (suppl 6), 985−988.
(4) Monosson, E. Chemical mixtures: considering the evolution of toxicology and chemical assessment. Environ. Health Perspect. 2005, 113, 383−390.
(5) Finney, D. J. Probit Analysis, 2nd ed.; Cambridge University Press: U.K., 1952.


dx.doi.org/10.1021/es300728q | Environ. Sci. Technol. 2012, 46, 7274−7282


(6) Smyth, H. F.; Weil, C. S.; West, J. S.; Carpenter, C. P. An exploration of joint toxic action: Twenty-seven industrial chemicals intubated in rats in all possible pairs. Toxicol. Appl. Pharmacol. 1969, 14, 340−347.
(7) Smyth, H. F.; Weil, C. S.; West, J. S.; Carpenter, C. P. An exploration of joint toxic action. II. Equitoxic versus equivolume mixtures. Toxicol. Appl. Pharmacol. 1970, 17, 498−503.
(8) U.S. EPA. Guidelines for the Health Risk Assessment of Chemical Mixtures; EPA/630/R-98/002; U.S. Environmental Protection Agency: Washington, DC, 1986.
(9) U.S. EPA. Supplementary Guidance for Conducting Health Risk Assessment of Chemical Mixtures; EPA/630/R-00/002; U.S. Environmental Protection Agency: Washington, DC, 2000.
(10) Teuschler, L. K.; Mumtaz, M.; Rice, G. E.; Hertzberg, R. C. Methods and Guidance on Health Risk Assessment of Chemical Mixtures; Society for Risk Analysis 2005 Annual Meeting, Orlando, FL, 2005.
(11) WHO. Assessment of Combined Exposures to Multiple Chemicals: Report of a WHO/IPCS International Workshop on Aggregate/Cumulative Risk Assessment; World Health Organization: Geneva, Switzerland, 2007.
(12) Bostrøm, E.; Engen, S.; Eide, I. Mutagenicity testing of organic extracts of diesel exhaust particles after spiking with PAHs. Arch. Toxicol. 1998, 72, 645−649.
(13) Eide, I.; Johnsen, H. G. Mixture design and multivariate analysis in mixture research. Environ. Health Perspect. 1998, 106 (suppl 6), 1373−1376.
(14) Feron, V. J.; Groten, J. P.; Jonker, D.; Cassee, F. R.; van Bladeren, P. J. Toxicology of chemical mixtures: Challenges for today and the future. Toxicology 1995, 105, 415−427.
(15) Feron, V. J.; Cassee, F. R.; Groten, J. P. Toxicology of chemical mixtures: International perspective. Environ. Health Perspect. 1998, 106 (suppl 6), 1281−1289.
(16) Gardner, H. S.; Brennan, L. M.; Toussaint, M. W.; Rosencrance, A. B.; Boncavage-Hennessey, E. M.; Wolfe, M. J. Environmental complex mixture toxicity assessment. Environ. Health Perspect. 1998, 106 (suppl 6), 1299−1305.
(17) Liao, K. H.; Dobrev, I. D.; Dennison, J. E.; Andersen, M. E.; Reisfeld, B.; Reardon, K. F.; Campain, J. A.; Wei, W.; Klein, M. T.; Quann, R. J.; Yang, R. S. H. Application of biologically based computer modeling to simple or complex mixtures. Environ. Health Perspect. 2002, 110 (suppl 6), 957−963.
(18) Meadows, S. L.; Gennings, C.; Carter, W. H.; Bae, D. S. Experimental designs for mixtures of chemicals along fixed ratio rays. Environ. Health Perspect. 2002, 110 (suppl 6), 979−983.
(19) Rice, G.; Teuschler, L. K.; Speth, T. F.; Richardson, S. D.; Miltner, R. J.; Schenck, K. M.; Gennings, C.; Hunter, E. S.; Narotsky, M. G.; Simmons, J. E. Integrated disinfection by-products research: Assessing reproductive and developmental risks posed by complex disinfection by-product mixtures. J. Toxicol. Environ. Health, Part A 2008, 71, 1222−1234.
(20) Richardson, S. D.; Thruston, A. D.; Krasner, S. W.; Weinberg, H. S.; Miltner, R. J.; Schenck, K. M.; Narotsky, M. G.; McKague, A. B.; Simmons, J. E. Integrated disinfection by-products mixtures research: Comprehensive characterization of water concentrates prepared from chlorinated and ozonated/postchlorinated drinking water. J. Toxicol. Environ. Health, Part A 2008, 71, 1165−1186.
(21) Simmons, J. E.; Richardson, S. D.; Speth, T. F.; Miltner, R. J.; Rice, G.; Schenck, K. M.; Hunter, E. S.; Teuschler, L. K. Development of a research strategy for integrated technology-based toxicological and chemical evaluation of complex mixtures of drinking water disinfection byproducts. Environ. Health Perspect. 2002, 110 (suppl 6), 1013−1024.
(22) Verhaar, H. J. M.; Morroni, J. R.; Reardon, K. F.; Hays, S. M.; Gaver, D. P.; Carpenter, R. L.; Yang, S. H. A proposed approach to study the toxicology of complex mixtures of petroleum products: The integrated use of QSAR, lumping analysis and PBPK/PD modeling. Environ. Health Perspect. 1997, 105 (suppl 1), 179−195.
(23) Teuschler, L.; Klaunig, J.; Carney, E.; Chambers, J.; Conolly, R.; Gennings, C.; Giesy, J.; Hertzberg, R.; Klaassen, C.; Kodell, R.; Paustenbach, D.; Yang, R. Support of the science-based decisions concerning evaluation of the toxicology of mixtures: A new beginning. Regul. Toxicol. Pharmacol. 2002, 36, 34−39.
(24) Eide, I.; Neverdal, G.; Thorvaldsen, B.; Shen, H.; Grung, B.; Kvalheim, O. Resolution of GC-MS data of complex PAC mixtures and regression modeling of mutagenicity by PLS. Environ. Sci. Technol. 2001, 35, 2314−2318.
(25) McGregor, L. A.; Gauchotte-Lindsay, C.; Daéid, N. N.; Thomas, R.; Kalin, R. M. Multivariate statistical methods for the environmental forensic classification of coal tars from former manufactured gas plants. Environ. Sci. Technol. 2012, 46, 3744−3752.
(26) Simmons, J. E. Chemical mixtures: Challenge for toxicology and risk assessment. Toxicology 1995, 105, 111−119.
(27) Teuschler, L. K.; Hertzberg, R. C. Current and future risk assessment guidelines, policy, and methods development for chemical mixtures. Toxicology 1995, 105, 137−144.
(28) Bull, R. J.; Rice, G.; Teuschler, L. K. Determinants of whether or not mixtures of disinfection by-products are similar. J. Toxicol. Environ. Health, Part A 2009, 72, 437−460.
(29) Feder, P. I.; Ma, Z. J.; Bull, R. J.; Teuschler, L. K.; Schenck, K. M.; Simmons, J. E.; Rice, G. Evaluating sufficient similarity for disinfection by-product (DBP) mixtures: Multivariate statistical procedures. J. Toxicol. Environ. Health, Part A 2009, 72, 468−481.
(30) Feder, P. I.; Ma, A. J.; Bull, R. J.; Teuschler, L. K.; Rice, G. Evaluating sufficient similarity for drinking-water disinfection byproduct (DBP) mixtures with bootstrap hypothesis test procedures. J. Toxicol. Environ. Health, Part A 2009, 72, 494−504.
(31) Stumpe, B.; Engel, T.; Steinweg, B.; Marschner, B. Application of PCA and SIMCA statistical analysis of FT-IR spectra for the classification and identification of different slag types with environmental origin. Environ. Sci. Technol. 2012, 46, 3964−3972.
(32) Gray, R. D.; Atkinson, Q. D. Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 2003, 426, 435−439.
(33) Pagel, M.; Atkinson, Q. D.; Meade, A. Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature 2007, 449, 717−721.
(34) Nei, M.; Kumar, S. Molecular Evolution and Phylogenetics; Oxford Press: New York, 2000.
(35) Alexandrou, M. A.; Oliveira, C.; Maillard, M.; McGill, R. A. R.; Newton, J.; Creer, S.; Taylor, M. I. Competition and phylogeny determine community structure in Müllerian co-mimics. Nature 2011, 469, 84−88.
(36) Ke, Y.; Su, B.; Song, X.; Lu, D.; Chen, L.; Li, H.; Qi, C.; Marzuki, S.; Deka, R.; Underhill, P.; Xiao, C.; Shriver, M.; Lell, J.; Wallace, D.; Wells, R. S.; Seielstad, M.; Oefner, P.; Zhu, D.; Jin, J.; Huang, W.; Chakraborty, R.; Chen, Z.; Jin, L. African origin of modern humans in East Asia: A tale of 12000 Y chromosomes. Science 2001, 292, 1151−1153.
(37) Wallace, R. G.; HoDac, H.; Lathrop, R. H.; Fitch, W. M. A statistical phylogeography of influenza A H5N1. Proc. Natl. Acad. Sci., U.S.A. 2007, 104, 4473−4478.
(38) Wen, B.; Li, H.; Lu, D.; Song, X.; Zhang, F.; He, Y.; Li, F.; Gao, Y.; Mao, X.; Zhang, L.; Qian, J.; Tan, J.; Jin, J.; Huang, W.; Deka, R.; Su, B.; Chakraborty, R.; Jin, L. Genetic evidence supports demic diffusion of Han culture. Nature 2004, 431, 302−305.
(39) Chen, L.; Zhou, Y.; Wu, Y. L.; Zhang, H.; Wang, X.; Zheng, W. W.; Liu, L.; Jiang, S. H.; Qu, W. D.; Zhao, J. W. Status of trace organic pollution in the network water came from Huangpu River [in Chinese]. J. Hygiene Res. 2008, 3, 137−143.
(40) Maron, D. M.; Ames, B. N. Revised methods for the Salmonella mutagenicity test. Mutat. Res. 1983, 113, 173−215.
(41) SAS Institute Inc. SAS, version 9.2; SAS Institute Inc: Cary, NC, 2008.
(42) IBM Corporation. SPSS, version 18.0; IBM Corporation: Armonk, NY, 2009.
(43) Tamura, K.; Dudley, J.; Nei, M.; Kumar, S. MEGA, version 4.1, 2008; http://www.megasoftware.net/mega4/index.html.


(44) Swofford, D. L. PAUP*. Phylogenetic Analysis Using Parsimony (* and Other Methods), version 4.0; Sinauer Associates: Sunderland, MA, 2002.
(45) Stata Corporation. Stata, version 10.0; Stata Corporation: College Station, TX, 2007.
(46) Richardson, S. D.; Plewa, M. J.; Wagner, E. D.; Schoeny, R.; DeMarini, D. M. Occurrence, genotoxicity, and carcinogenicity of regulated and emerging disinfection by-products in drinking water: A review and roadmap for research. Mutat. Res. 2007, 636, 178−242.
