Prediction of the Sorption of Organic Compounds into Soil Organic

Oct 11, 2006 - A new model to estimate the soil-water partition coefficient of non-ionic organic compounds normalized to soil organic carbon, Koc, fro...
0 downloads 10 Views 107KB Size
Environ. Sci. Technol. 2006, 40, 7005-7011

Prediction of the Sorption of Organic Compounds into Soil Organic Matter from Molecular Structure GERRIT SCHU ¨ U ¨ R M A N N , * ,†,‡ RALF-UWE EBERT,† AND RALPH KU ¨ HNE† Department of Ecological Chemistry, UFZ Centre for Environmental Research, Permoserstrasse 15, 04318 Leipzig, Germany, and Institute for Organic Chemistry, Technical University Bergakademie Freiberg, Leipziger Strasse 29, 09596 Freiberg, Germany

A new model to estimate the soil-water partition coefficient of non-ionic organic compounds normalized to soil organic carbon, Koc, from the two-dimensional molecular structure is presented. Literature data of log Koc for 571 organic chemicals were fitted to 29 parameters with a squared correlation coefficient r2 of 0.852 and a standard error of 0.469 log units. The application domain includes the atom types C, H, N, O, P, S, F, Cl, and Br in various important compound classes. The multilinear model contains the variables molecular weight, bond connectivity, molecular E-state, an indicator for nonpolar and weakly polar compounds, and 24 fragment corrections representing polar groups. The prediction capability is evaluated through an initial twostep development using an 80%:20% split of the data into training and prediction, cross-validation, permutation, and application to three external data sets. The discussion includes separate analyses for subsets of H-bond donors and acceptors as well as for nonpolar and weakly polar compounds. Comparison with existing models including linear solvation energy relationships illustrates the superiority of the new model.

Introduction Sorption of waterborne compounds to solid matrices reduces their volatilization as well as their mobility and thus decreases both vertical leaching through soils as well as horizontal transport in natural streams. For organic compounds, sorption also affects their susceptibility for aquatic photolysis, their bioavailability for aquatic organisms, and their biodegradation. At the molecular level, sorption results from dispersion forces (London forces), dipole-induced dipole forces (Debye forces), dipole-dipole forces (Keesom forces), and hydrogen bonding and in the case of ionic sorbates also from Coulomb interactions. For non-ionic organic compounds, the soilwater distribution coefficient Kd is usually normalized to the organic carbon content foc yielding Koc ) Kd/foc. It reflects the fact that soil organic carbon (SOC) is the major sorption domain for neutral organic compounds (1) and that in this case linear absorption into SOC is the major component, * Corresponding author phone: +49-341-235-2309; fax: +49-341235-2401; e-mail: [email protected]. † Department of Ecological Chemistry. ‡ Institute for Organic Chemistry. 10.1021/es060152f CCC: $33.50 Published on Web 10/11/2006

 2006 American Chemical Society

despite growing evidence that the overall sorption process is more complex (2). Existing models to predict log Koc from molecular structure have been reviewed recently (3). They include methods employing log Kow (octanol/water partition coefficient) or water solubility (4), molecular connectivity indices without or with fragment correction factors and increment methods (5-7), and linear solvation energy relationship (LSER) models (8, 9). There is also a promising quantum chemical approach (10), but so far its statistical performance is inferior to conventional methods. Collection of a set of 571 log Koc data from literature sources (4, 9, 11-13) and application of various general-purpose methods that use only information from two-dimensional (2D) molecular structure revealed that their actual prediction performances leave room for improvement. This prompted us to develop a new model, restricting the parameters to 2D structural information to enable simple and transparent applications as well as interpretation of Koc in terms of structural driving forces. As will be shown below, the new model has a wide application range covering atom types C, H, N, O, P, S, F, Cl, and Br in various chemical functionalities, was validated with stringent statistical tests (cross-validation and permutation alone and in combination (14) as well as external prediction), and outperforms existing methods with respect to both calibration and prediction statistics.

Materials and Methods Data Compilation. Experimental log Koc data were collected for a set of 571 compounds taken from Huuskonen (4), using the following five data sources: 349 data stem from Sabljic´ et al. (11) including consideration of its erratum (eight of which are identical to the Nguyen et al. (9) data set), 85 data from Lohninger et al. 1994 (12), 111 data from the CHEMFATE database (13), and 26 data from ref 9, resulting in the total number of 571 compounds used for the model development. Most of these data drew from studies with some quality control effort as described in the respective literature sources, with the Nguyen data (9) being the one most critically evaluated. The complete data set is given in the Supporting Information. For compounds with multiple data from ref 9, we used medians instead of arithmetic means because the data generally did not obey normal distribution. Experimental sorption data may vary considerably also after normalization to organic carbon (15). This holds true even when the data selection is subject to rigorous evaluation criteria, as is illustrated by a correspondingly generated data collection (9). Here, among 31 compounds with several log Koc entries, 12 compounds have Koc ranges over 0.5 log units when considering different laboratories and soils. It follows that at present, restriction to rigorously validated data would still include log Koc variations above 0.5 log units but reduce severely the chemical domain as basis for the model development. Our approach is a compromise between maximizing the range of data available for the model development and confining the data to those in accord with strict quality criteria. Thus, the resultant data probably contain some uncertainty beyond the experimental error that is estimated to be ca. 0.4 log units. In addition, three mostly external data sets were used for testing the predictive performance of the model. The remaining 41 compounds from ref 9 that had not been included in the present model development, 48 nonpolar and monofunctional compounds (see below) with data collected by Sabljic´ (6) that had also been left out for the present model development, and 93 compounds (of which VOL. 40, NO. 22, 2006 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

7005

TABLE 1. Parameters and Application Domain To Calculate log Koc from Molecular Structurea General Parameters value

description regression constant molecular weight bond connectivity index  molecular E-state

0.936 0.00321 0.255 -0.0139

max

0.075 0.00062 0.023 0.0034

na 32 0 4.7

na 665 18.2 103.8

value

[$(c)]:c([$(c)])[$(c)] [F,Cl,Br,I]-c1ccccc1 [OH]-[$(C)] [$(C)]-O-[$(C)] [$(#6)]-O-[$(c)] [$(#6)]-C()O)-O-[$(#6)] [$(#6)]-O-C)O NX3-CR0 NX3-c1R1:aR1:aR1:aR1:aR1:aR1 n1R1:cR1:cR1:cR1:cR1:cR1 n1:n:c:c:c:c1, n1:c:n:c:c:c1, n1:c:c:n:c:c1 n1:n:n:c:c1, n1:n:c:n:c1 NX3-C()O)-[(#6)] NX3-CR0()O)-[#7]X3 N1-C()O)-N-C()O)-[C,N])C1 N1)C(-O)-N)C(-O)-[C,N])C1 O-N)[$(*) O)SX3 c-S()O)()O)-N-C)O O)S()O)-N-C()O)-c [$(A)]-[O,S]-[$(P)O)] [$(*)]-[O,S]-[$(P)O)]-[$(*)]∼[$(a)] [$(*)]-[O,S]-[$(P)S)

0.130 0.191 -0.620 -0.310 -0.200 -0.26 -0.327 -0.35 -0.227 -0.50 -0.61 -0.90 -0.857 -0.939 -1.19

Fragmental Correction Factors 0.022 184 56 0...8 0.022 347 166 0...8 0.092 29 27 0...2 0.081 32 27 0...2 0.054 84 71 0...2 0.10 17 13 0...2 0.057 74 57 0...3 0.11 15 12 0...2 0.048 89 65 0...3 0.19 5 4 0...2 0.15 11 11 0...1 0.15 12 12 0...1 0.071 65 64 0...2 0.080 55 55 0...1 0.22 5 5 0...1

-0.87 -0.76 -1.85

0.14 0.19 0.20

-0.339 0.043 56 -0.63 0.10 11 -0.064 0.032 104

na [OH]-C)O O)SX4)O OX2-C()O)-NX3

0.395 -1.026 -1.22 -0.737

0.077 0.099 0.17 0.092

10 7 7

ncps

min

SMARTS code (29)

SD

ntot

SD

9 7 7

nmin...max

explanation fused aromatic carbon halogen at aromatic carbon ring OH at aliphatic C -O- between aliphatic C -O- attached to at least one aromatic C CdO (keto group) COO (ester group) >N-, -NH-, NH2- at acyclic C >N-, -NH-, NH2- at nonfused aromatic ring pyridine ring diazine ring triazole ring amid group (N not at S or CdO) urea group (acyclic) uracil type structure

0...2 O-N) 0...1 sulfoxid group 0...1 sulfuron structure at arom. ring

23 0...3 O or S at PdO (aliphatic phosphate) 6 0...3 O or S at PdO (aromatic phosphate) 34 0 or 2...6 O or S at PdS

Indicators 81 81 30 30 9 9 35 35

na 0...2 0...2 0...3

indicator Inwp, see text COOH (carboxylic acid) sulfon group (not in sulfonamides) carbamate group

a The numbers n , n tot cps, and nmin...max denote the total fragment occurrence in the training set, the number of compounds containing the fragment, and the minimum...maximum count in a single compound, respectively. Fragment correction factors have to be applied for each occurrence of the respective group. In contrast, indicators are to be counted only once per molecule regardless of the actual number of occurrences.

63 compounds are also in our model data set but with different data) with data collected by Baker et al. (16). Model Development. In the first phase of the model development, the compounds were split into a training set of 457 compounds (80%) and a prediction set of 114 compounds (20%). The latter was built as a stratified random sample to ensure that all major compound classes are represented. Preliminary multilinear regression of log Koc of the training set compounds on molecular weight, first-order molecular connectivity index 1χ, the respective valence-corrected index 1 v χ , molecular E-state as sum of all atomic E-states (17), and the bond connectivity index  (18) revealed the linear combination of , E-state, and molecular weight as the most promising starting combination with regard to statistics, accounting already for 53% of the log Koc variation. Subsequently, 24 polar fragment corrections were derived in a stepwise manner through analyses of calculation errors of the consecutively improved model, as long as these could be associated with the occurrence of specific structural features of the compounds. Detailed analysis of the performance then revealed a systematic log Koc underestimation for nonpolar and simple monofunctional compounds (hydrocarbons, halogenated hydrocarbons, monofunctional aromatic alcohols, and amines without other heteroatoms) by ca. 0.4 log units. Accordingly, a respective indicator variable (Inwp, see below) was added as the 28th model parameter. Note that in an early study of 7006

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 40, NO. 22, 2006

modeling log Koc through 1χ, 72 nonpolar and respective monofunctional compounds could be combined into one linear regression equation with r2 ) 0.95 (6) despite the fact that intermolecular interaction types differ for different compound classes (e.g., phenols may participate in hydrogen bonding, which contrasts with simple hydrocarbons). Training set (80%) calibration and external prediction using the initial validation subset (20%) yielded similar statistics (see below), demonstrating the significant predictive power of the new model. Subsequently, the two subsets were merged to derive the final model equation. All further statistics refer to this combined model data set of 571 compounds. Application Domain. The data set compounds are neutral organics (except for partial ionization of acids and bases at soil pH) that include the following atom types: C, H, N, O, P, S, F, Cl, and Br. Because the training set contained no organoiodine compounds, I is not included, but it is likely that a simple extension (see below) will provide reasonable estimates for compounds with iodine attached to aliphatic or aromatic carbon. Although the 571 compounds cover a wide range of chemical classes, the chemical domain is restricted by the actually observed frequencies of occurrences of the substructures associated with the 25 fragment corrections. Accordingly, the chemical domain is specified in terms of allowed atom types as mentioned above, through value ranges of the three continuous model parameters (see columns min and max of Table 1), through frequency ranges of occurrence

of substructural features associated with fragment corrections (see column nmin...max and legend of Table 1), and by the target value (log Koc) range observed for the 571 compounds (see log Koc data in the Supporting Information). In future updates, a more comprehensive characterization of the application domain may be achieved through application of the concept of atom-centered fragments (ACFs) (19-21). Model Validation. The prediction capability of the final model was evaluated using cross-validation (10 runs, each leaving out 10% of the compounds), permutation (12 runs with varying degrees of target value scrambling (14)), and external prediction with three additional data sets as mentioned above. Validation statistics are expressed in terms of the predictive squared correlation coefficient q2 and its associated standard error (SE) and bias (14). Comparison to Existing Models. The following six methods have been included in comparative analyses: Sabljic´ model 1 (Sabljic´ 1) with 1χ and 17 polar compound classspecific correction factors (6), Sabljic´ model 2 (Sabljic´ 2) as decision tree model (11) employing 19 different regression equations based on log Kow and one 1χ-based regression equation (using KOWWIN (22) to calculate log Kow from molecular structure), PCKOCWIN (23) as an updated version of ref 5 that was based on 1χ and 27 polar fragment corrections, the Tao model consisting of 74 fragment values and 24 structural correction factors (7), the Poole model as linear solvation energy relationship (LSER) employing octanolcalibrated solute basicity values (8), and the Nguyen model as another LSER employing water-calibrated solute basicity values (9). All calculations were performed with respective model implementations in our ChemProp software (24) except for the PCKOCWIN predictions of log Koc and the KOWWIN predictions of log Kow.

Results and Discussion Model. The final multilinear regression model for predicting log Koc has the general form

log Koc )

∑ a P + ∑b F + ∑ c I i i

i

j j

j

k k

+d

k

with 3 continuous variables Pi (molecular weight MW, bond connectivity  and molecular E-state), 21 fragment correction factors Fj, 4 structural indicator variables Ik, and 1 regression constant as listed in Table 1. Using 571 compounds, the calibration statistics are r2 ) 0.852, SE ) 0.469, average absolute error ) 0.361, and bias ) 0.000. Initial model derivation based on 457 training compounds (80%) yielded the same parameter selection with similar regression coefficients and calibration statistics (r2 ) 0.853, SE ) 0.464). External prediction on the initial validation set of 114 compounds (20%) lead to q2 ) 0.834, SE ) 0.512, and bias ) -0.066, indicating the statistical robustness of the final multilinear model. Formal consideration of the degrees of freedom (29 model parameters) yields r2 ) 0.845 and SE ) 0.481. However, structural fragments yield nonzero contributions only to (often small) subsets of the total compound set (e.g., only 27 of the 571 compounds have an aliphatically bound OH group, see Table 1), which is a major difference to continuous variables (e.g., MW) and makes a proper treatment of the degrees of freedom in increment models difficult. The number of actually used model variables ranges from 3 (only the 3 continuous variables, no fragment or indicator variable) to 8, with an average of 4.55 actually used model variables per compound. Accordingly, the subsequent statistical analysis focuses on simulation techniques to assess the robustness and prediction capability, without explicit consideration of the degrees of freedom.

The positive MW increment indicates that as a general trend, increasing molecular size increases the affinity of organic waterborne chemicals for soil organic matter, which results from both the energy penalty of cavity formation associated with aqueous solvation and the increased van der Waals attraction with humic substances and biopolymers as major components of the OC content of natural soils (2, 25). MW corresponds to the molecular volume term of the LSER approach. Log Koc also increases with increasing bond connectivity index  (18) of the compounds, and interestingly  alone accounts already for 43% of the total log Koc variation of the 571 compounds. By definition,  increases with molecular size but decreases with increasing branching in a systematic manner, both of which holds also true for 1χ. It follows that like 1χ (26),  may be interpreted as reflecting the geometric accessibility (as far as geometry can be defined through topological information) of compounds for intermolecular interactions. As such,  can be understood as molecular size corrected for geometric accessibility with respect to van der Waals contacts. A respective parametrization of the van der Waals accessible part of the molecular surface does not appear to be addressed through LSER parameters. While  and MW are substantially intercorrelated (r2 ) 0.72), both are statistically highly significant alone and in linear combination for predicting log Koc, and their regression coefficients remain essentially constant during cross-validation. The molecular E-state (17) is calculated from so-called intrinsic states of atoms in a molecule that reflect their valence-state electronegativities. The intrinsic state of an atom relates its number of π and lone pair electrons to its adjacency (degree of branching). It thus increases with increasing polarizability of its valence electrons and with increasing accessibility (reciprocal of adjacency). For a given atom, the electronic influence of all other atoms is then accounted for as intramolecular perturbation that depends on both their topological distances and their intrinsic states. Because increasing difference in electronegativity of atoms in a molecule increases its polarity, the latter is also reflected in the E-state. In our model, the molecular E-state term serves to reduce log Koc with increasing polarity and polarizability of the compounds. As such, it corrects  for differences in both the polarity and polarizability of the compounds, and the coefficient sign is again in accord with expectation. E-state is significantly intercorrelated with both MW (r2 ) 0.67) and  (r2 ) 0.73). However, as with MW and  also the E-state regression coefficient remains essentially constant during cross-validation, and its contribution to the model alone and in combination with the other variables is highly significant. While the covariance could be eliminated through multivariate projection techniques, we prefer to keep the model technically simple, considering the statistical robustness (cross-validation) and prediction power (external validation) as demonstrated below. For the remaining 25 structural parameters, the largest intercorrelation r2 values are 0.14 (aromatically bound halogen vs Inwp) and 0.11 (ester group vs E-state and vs ), with all other intercorrelations being below r2 ) 0.10. The indicator variable Inwp (subscript “nwp” stands for nonpolar and weakly polar) discriminates hydrocarbons, Cland Br-substituted hydrocarbons, and monofunctional aromatic alcohols and amines from other heterosubstituted organics. It reflects the fact that without this parameter, the model would yield a systematic underestimation of the soil affinity for nonpolar and weakly polar compounds. While this parameter has a high statistical significance, it merges simple alkylated aromatics with weak hydrogen bond acceptors (chlorinated aliphatics and aromatics) and monofunctional compounds acting both as H bond donor and acceptor (aromatic alcohols and amines). Future studies may VOL. 40, NO. 22, 2006 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

7007

TABLE 2. Method Statistics for the Total Set of 571 Compounds and its Subsets SE

bias

max error

max + error

0.469 1.167 0.541 0.719 0.552 0.790 0.934

0.000 +0.506 -0.086 +0.104 +0.014 +0.012 +0.095

-1.88 -2.20 -2.11 -2.09 -2.88 -4.00 -3.53

1.76 4.61 1.77 2.94 1.96 3.34 3.48

Subset with Nonpolar and Weakly Polar Compounds 0.852 0.847 0.428 0.534 0.783 0.753 0.502 0.678 0.823 0.812 0.426 0.592 0.797 0.785 0.452 0.635 0.860 0.847 0.428 0.534 0.827 0.763 0.485 0.665 0.820 0.750 0.490 0.683

0.000 -0.054 -0.095 -0.036 -0.045 -0.010 +0.051

-1.11 -2.03 -1.65 -1.87 -1.20 -1.09 -1.52

1.76 1.95 1.77 1.82 1.56 2.82 3.07

287 216 286 287 249 287 287

Subset with Stronger Hydrogen Bond Donor Sites 0.806 0.803 0.299 0.394 0.411 -0.943 0.719 1.123 0.668 0.662 0.356 0.504 0.606 0.416 0.466 0.679 0.682 0.610 0.354 0.506 0.515 0.204 0.620 0.793 0.498 -0.020 0.699 0.897

+0.029 +0.499 -0.054 +0.122 +0.031 -0.076 +0.105

-1.16 -1.08 -2.11 -2.09 -2.88 -3.34 -3.01

1.47 4.61 1.10 2.94 1.96 1.66 2.40

203 110 196 203 178 203 203

0.744 0.354 0.709 0.610 0.681 0.651 0.588

-0.041 +0.860 -0.129 +0.135 +0.016 +0.145 +0.098

-1.88 -2.20 -1.89 -1.83 -2.42 -2.07 -3.53

1.25 3.73 1.09 2.81 1.94 2.74 3.48

method

no. of valid results

r2

new model Sabljic´ 1 Sabljic´ 2 PCKOCWIN Tao Poole Nguyen

571 407 563 571 508 571 571

0.852 0.543 0.804 0.721 0.806 0.717 0.657

new model Sabljic´ 1 Sabljic´ 2 PCKOCWIN Tao Poole Nguyen

81 81 81 81 81 81 81

new model Sabljic´ 1 Sabljic´ 2 PCKOCWIN Tao Poole Nguyen new model Sabljic´ 1 Sabljic´ 2 PCKOCWIN Tao Poole Nguyen

abs av error

q2 0.852 0.092 0.799 0.653 0.788 0.581 0.415

Total Set 0.361 0.800 0.399 0.508 0.403 0.593 0.698

Subset Without Hydrogen Bond Donor Sites 0.741 0.422 0.537 -1.066 1.177 1.499 0.693 0.447 0.575 0.420 0.592 0.804 0.628 0.462 0.621 0.373 0.598 0.836 -0.020 0.778 1.066

provide an opportunity to investigate the mechanistic background of this finding further. Among the 24 fragment parameters except for Inwp, 22 decrease log Koc through polar fragments as indicated by the negative sign of the corresponding increment values, and two quantify a group-specific preference for SOC. The largest hydrophilic correction of -1.85 log units is provided by the sulfuron group (see Table 1), and there are five more fragments that yield an average decrease in soil sorption of organic compounds by more than 0.8 log units. Inspection of Table 1 reveals that all hydrophilic corrections are associated with electron-rich heteroatoms in various functionalities and thus reflect the increased propensity of respective compounds for favorable dipole-dipole and hydrogen bond acceptor interactions with water. The latter is in accord with a negative regression coefficient of hydrogen bond basicity in current LSER models (8, 9), while only the Nguyen model includes a respective polarity term (9). Interestingly, both LSER models contain a positive excess molar refraction parameter that is often interpreted to represent phase interactions with solute n and π electrons. However, the presently identified hydrophilic fragments all contain heteroatoms with lone pairs (O, N, S), indicating a negative (soil-avoiding) contribution through n electrons. It suggests that solute lone pairs have both a favorable (dispersion interaction) and unfavorable (H-bond acceptance) component to the compound partitioning from water into SOC. Fused aromatic carbon slightly increases Koc by 0.130 log units, probably because of the decreased availability of its π electrons for hydrogen bond acceptor interactions with water. Aromatically bound halogen (F, Cl, Br) yields the largest positive correction for log Koc, which reflects its greater hydrophobicity as compared to aliphatically bound halogen. Accordingly, applying the halogen correction factor of 0.191 7008

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 40, NO. 22, 2006

also for I attached to aromatic carbon is likely to yield (together with the positive MW contribution) reasonable log Koc estimates for organoiodine compounds, keeping in mind that these do formally not belong to the chemical domain of the model. Statistical Performance Compared to Literature Methods. In Table 2, calibration and prediction statistics of the new model and six literature methods are summarized for the total set of 571 compounds as well as for three subsets: 81 nonpolar and weakly polar compounds (Inwp ) 1), 287 polar compounds with stronger hydrogen bond donor sites (OH, NH2, NH, SH; except compounds with Inwp ) 1), and 203 polar compounds without such H donor sites, of which 201 contain stronger hydrogen bond acceptor sites (O, N except as ≡N, -S-, )S; again excluding compounds with Inwp ) 1). For the two LSER models, only calculated parameters (27, 28) have been used in this comparative analysis to increase their application range and make them fully comparable to the other methods that all yield predictions directly from molecular structure. The LSER performance for the (small) subsets of compounds with experimental Abraham parameters will be discussed below. For all seven methods, r2 quantifies the calibration performance, and q2 quantifies the prediction performance of the methods in their original versions. Note that when testing a given model Y ) F(X), r2 is not appropriate to evaluate its absolute performance, because r2 automatically corrects for any bias and any systematic error in the slope, which corresponds to refitting the model according to Y’ ) a * Y + b ) a * F(X) + b. All further statistical parameters (average absolute error, bias, standard error, maximum negative and maximum positive error) refer to the original (not refitted) models. Because our new model was derived through calibration with all 571 compounds, its total set q2 is equal to r2 by definition.

For the total set, the new model outperforms all six literature methods with respect to all statistical parameters. The Sabljic´ 1 model can be applied to only 407 compounds due to restricted definitions of polar corrections introduced for specific compound classes (6), the Sabljic´ 2 model suite (11) cannot be applied to 8 compounds because their (calculated) log Kow is outside the required range, and the fragment list of the Tao model (7) excludes 63 compounds of the present data. While Sabljic´ 1 cannot compete with any of the other models, Sabljic´ 2 as the only model involving log Kow (through a decision tree approach with 19 class-specific submodels, augmented by one 1χ-based regression equation) shows the overall second best performance, followed by PCKOCWIN (23). The two LSER models yield only moderate statistics, and both standard error and average absolute error are largest for the Nguyen model (9). As noted above, however, the LSER results in Table 2 refer to calculated Abraham parameters. It suggests that the overall moderate LSER performance is mainly due to the limited quality of the calculated Abraham parameters, keeping in mind that for more complex compounds, experimental Abraham parameters are often not available. Interestingly, the second greatest difference between calibration and prediction is observed for the Nguyen model, suggesting that its original training set was too narrow with regard to the chemical domain. For the subset of 81 nonpolar and weakly polar compounds (Inwp ) 1), all seven methods show similar statistics. Our new model yields slightly larger errors on the average than for the total set, but it is still superior to all others except for the Tao model. Note, however, that these 81 compounds are structurally quite simple and do not offer much room for improving the model except through introduction of other parameters not yet considered so far. The differences in performance are particularly striking for 287 compounds with strong hydrogen bond donor sites. Here, the new model is significantly better than all other methods. Sabljic´ 1 yields the overall poorest performance with an even negative q2 (indicating that the model prediction is inferior to taking simply the subset average of log Koc as predicted value for all compounds), while Sabljic´ 2 is again the second best model. Both LSER models also show quite significant differences between calibration (r2) and prediction (q2), and the second largest standard errors (0.793, 0.897) are provided by the LSER approach. Here, the Poole model systematically underestimates Koc by ca. 0.08 log units and in this respect differs from the Nguyen model. A possible explanation is the different treatment of the LSER hydrogen bond donor parameter. According to the Poole model, increasing hydrogen bond acidity decreases log Koc, while the opposite holds for the Nguyen model. However, in the latter model the respective regression coefficient does not appear to be statistically significant when comparing its reported value (+0.15) with its reported standard error ((0.15). The last part of Table 2 refers to the subset of 203 polar compounds without strong hydrogen bond donor sites, of which 201 compounds are strong hydrogen bond acceptors. Again the new model yields the best overall statistics, while the standard error is now significantly larger than for the hydrogen bond donors (0.537 vs 0.394). As for the total set, Sabljic´ 2 is the second best model followed by PCKOCWIN, Sabljic´ 1 is far off with respect to all statistical parameters, and both LSER models (when employing calculated Abraham parameters) appear to be of little use for predicting log Koc, probably because of the limited quality of the calculated Abraham parameters. Overall, the statistics summarized in Table 2 show that the new model is statistically quite robust and that its performance across chemical functionalities is quite balanced

as compared to the results achieved with the literature methods. The distribution of calculated vs experimental log Koc data is plotted in the Supporting Information (Figure S1, left) together with the respective distribution of calculation error vs experimental log Koc (Figure S1, right). For the new model, 50% of the calculation errors are below 0.28 log units and 74% are below 0.50. The respective error 50% percentiles of PCKOCWIN, Sabljic´ 2, Tao, Poole, and Nguyen model are 0.35, 0.30, 0.30, 0.46, and 0.53 log units, and with these models 65%, 53%, 70%, 71%, and 48% of the compounds yield prediction errors below 0.5 log units. Interestingly, the calculation errors of the methods are not highly intercorrelated (except for the two LSER models). Taking our new model as well as PCKOCWIN, the Tao model, and the Poole model as examples, the intercorrelation r2 values are all below 0.33 (new model vs Tao model), and the r2 values of the new model vs PCKOCWIN, Sabljic´ 2, and the Poole model are 0.26, 0.002, and 0.16, respectively. It suggests that as a general trend, the method errors are mainly caused by model-specific features and less by possibly systematic experimental log Koc errors. Coming back to the LSER approach, one might argue that the only moderate performance of the Nguyen model could be mainly due to data quality problems, keeping in mind that for its development, only literature Koc data in accord with stringent quality protocols had been used. Note, however, that both LSER models show similar (moderate) statistics, while the Poole model was derived from less rigorously evaluated data. Moreover, restriction of the present data to the subset of compounds with experimental Abraham parameters yields significantly improved performances for both LSER models, as will be discussed in more detail below. Finally, three of the five Nguyen model parameters are significantly intercorrelated for their 75 training set compounds, yielding r2 values of 0.68, 0.76, and 0.60 for E vs S, E vs V, and S vs V (E ) excess molar refraction, S ) polarity/ polarizability, V ) molecular volume). It reflects the limited structural variability of the training set and indicates its correspondingly limited suitability as a basis for statistically robust QSAR models. Validation. In addition to the initial external validation (see above), three further validation strategies were employed. First, cross-validation was performed, leaving out randomly selected 10% of the compounds (57-58 compounds) for a total of 10 runs. The resultant statistics are q2 ) 0.830, SE ) 0.503, and bias ) -0.002, which is pretty close to the initial prediction set statistics (q2 ) 0.834, SE ) 0.512) and superior to the total set prediction statistics of all literature methods. Second, a permutation test (14) was performed 12 times with varying degrees of scrambling (changing association between target values and compounds), ranging from about 12.5% reallocated target values up to 100% permutation. Because different degrees of permutation may result in similar correlations between permuted and original data, we select q2 (scrambled vs original log Koc data) as an indirect but relevant measure of the degree of target value scrambling. For models based on parameters truly related to the target property, increasing scrambling would decrease calibration and cross-validation statistics, while good statistics with permuted data would indicate overfitting or chance correlation or both. The results show that with increasing the degree of permutation as expressed through the scrambling q2 (actual degree of correlation between originally ordered and permuted data), the calibration and prediction performances decrease systematically up to near 100% permutation (scrambling q2 < 0; see Figure S2 of the Supporting Information). Moreover, the difference between calibration and prediction increases with increasing scrambling. These findings demonstrate that the new model is based on a proper VOL. 40, NO. 22, 2006 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

7009

TABLE 3. Method Statistics for the Subset with Available Experimental LSER Parameters method

no. of valid results

r2

abs av error

q2

SE

bias

max error

max + error

-0.047 -0.093 -0.069

-1.21 -1.26 -1.12

1.92 1.60 1.76

-0.016 -0.081 -0.093 -0.046

-1.47 -1.23 -1.12 -1.21

2.55 1.74 1.76 1.92

Subset with Experimental Abraham Parameters for Poole Model Poole model exptl LSER param calcd LSER param new model for this subset

107 107 107

0.932 0.911 0.920

0.923 0.900 0.916

0.310 0.384 0.349

0.432 0.494 0.451

Subset with Experimental Abraham Parameters for Nguyen Model Nguyen model exptl LSER param calcd LSER param new model Poole model (exptl param)

108 108 108 99

0.907 0.902 0.918 0.930

0.889 0.890 0.912 0.922

relationship between the model parameters and log Koc and that its architecture has a good prediction capability. Third, two external data sets and a third data set with partly external compounds as described above were used for external prediction. The detailed statistics are listed in Table S6 of the Supporting Information. The first external set with 41 Nguyen compounds not included in the present training set yields q2 values from 0.898 (Sabljic´ 2) to 0.970 (Tao, Poole) and standard errors between 0.254 (Tao) and 0.473 (Sabljic´ 2). The respective results for the new model are q2 ) 0.934 and SE ) 0.379, which is somewhat better than the total set calibration (Table 2) as well as the associated leave-10%-out cross-validation statistics (see above). Note further that for this LSER subset of validated experimental log Koc data (9), the Tao model yields statistics very close to the LSER models. The second external set contains 48 nonpolar and weakly polar compounds (6) with Inwp ) 1 as defined above. For these relatively simple and typically hydrophobic compounds, the new model (q2 ) 0.853, SE ) 0.307) is inferior to the literature methods (q2 ) 0.812-0.937, SE ) 0.2020.348) except for Sabljic´ 2 but still yields a performance slightly superior to its total set statistics that outperformed all other methods. It shows also that the present model is more balanced in its performance across different compound classes than all other methods. For the Baker data set of 93 compounds (16), all methods yield negative biases of around -0.3 log units and standard errors from 0.524 (Tao) to 0.812 (Sabljic´ 1) log units, with intermediate results for the new model (0.613) and the two LSER models (0.606 and 0.676, respectively). Here, the calibration r2 (that corrects for bias) ranges from 0.702 (Sabljic´ 2) to 0.861 (Tao). Among the 93 compounds, 63 are also in the present training set, 14 in the Sabljic´ set, and 29 in the Nguyen set. Linear regression of the Baker log Koc data on the ones of these three subsets yields q2 values of 0.919, 0.624, and 0.950, with biases of -0.185, -0.371, and -0.141 log units, respectively. These results suggest that the Baker log Koc values are systematically lower than the ones from the other data sets. Interestingly, the Baker data were reported as critically evaluated and selected to conform to rigorous criteria (16). In any case, the present model yields r2 and q2 values (0.867, 0.816) comparable to its performance with the other data sets and again outperforms all other methods except the Tao model (0.912, 0.861) and the Poole model (0.864, 0.819); the Tao model, however, excludes two compounds because of missing fragments. Experimental LSER Parameters. According to major literature sources with experimental Abraham parameters, such data could be collected for 107 of the 571 compounds when employing octanol-calibrated solute basicity as needed for the Poole model (8). For 108 compounds, the parameter combination relevant for the Nguyen model (9) was available, 7010

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 40, NO. 22, 2006

0.368 0.397 0.361 0.318

0.518 0.515 0.461 0.440

and for 99 of these the octanol-calibrated basicity was also available. The statistical performances for these subsets are summarized in Table 3 for the two LSER models employing experimental or calculated LSER parameters and for our new model. As expected, the LSER statistics are significantly better than for all 571 compounds (see Table 2), which is however still true when using calculated LSER parameters. The latter indicates that both subsets typically contain more simple compounds where the increment estimation (27, 28) of LSER parameters yields reasonable values. Moreover, for both subsets our new model is competitive with both LSER models (also when the latter employ experimental input parameters) and in particular yields low predictive standard errors. These findings indicate that with respect to environmental chemicals, an important current shortcoming of the LSER approach is the lack of experimental input parameters that would enable a solid prediction for the property of interest. Such parameters are more often available for more simple compounds, but in these cases other prediction methods work fine as well, including those increment methods that allow one to estimate the Abraham parameters from molecular structure. By contrast, for more complex compounds LSER parameters are likely to be missing, and in those cases the quality of predicted Abraham parameters using currently available increment methods appears to be limited. It follows that for the modeling situation in practice, the LSER approach issfor the time beingslimited in its scope for predicting fate properties of environmental chemicals. Outliers. The overall largest outlier is 2,3,7,8-tetrachlorodibenzodioxin (TCDD) (#149 in the Supporting Information) with a reported experimental log Koc of 6.50 and a model prediction of 4.62 (deviation ) -1.88 log units). Here, the LSER models yield deviations of -1.67 (Poole) and -2.09 (Nguyen), and PCKOCWIN and Sabljic´ 1 come the closest with deviations still above 1 order of magnitude (-1.33 and -1.20 log units, respectively). Outliers may also reflect data errors, as could be expected for benzo(ghi)perylene (#20) as an extended planar aromatic compound whose reported experimental log Koc of 4.61 is overestimated by 1.76 log units. The other two outliers are hexabromobiphenyl (#47) and 2,6-dichlorobenzamide (#345). However, the calibration statistics are not significantly affected when omitting individual outliers, and so all 4 compounds remained included. Overall, the prediction performance of our new model is much more balanced with respect to the total set and the various internal subsets as well as all external prediction sets than all six literature methods including the LSER models.

Acknowledgments This study was partly funded by the European Commission (Integrated Project NOMIRACLE, Contract No. 003956GOCE).

Supporting Information Available Complete list of experimental and calculated values for the training set and the three validation sets, including chemical names, CAS numbers, data sources, descriptors, subset categories, calculated log Koc values for the models under analysis, a separate table with detailed statistics for the external data sets, and figures showing the data distribution and permutation performance of the new model. This material is available free of charge via the Internet at http:// pubs.acs.org.

Literature Cited (1) Chiou, C. T. Partition and Adsorption of Organic Contaminants in Environmental Systems; Wiley: Hoboken, NJ, U.S.A., 2002; 257 pp. (2) Cornelissen, G.; Gustafsson, O ¨ .; Bucheli, T. D.; Jonker, M. T. O.; Koelmans, A. A.; van Noort, P. C. M. Extensive sorption of organic compounds to black carbon, coal, and kerogen in sediments and soils: Mechanisms and consequences for distribution, bioaccumulation, and biodegradation. Environ. Sci. Technol. 2005, 18, 6881-6895. (3) Doucette, W. J. Quantitative structure-activity relationships for predicting soil-sediment sorption coefficients for organic chemicals. Environ. Toxicol. Chem. 2003, 22, 1771-1788. (4) Huuskonen, J. Prediction of soil sorption coefficient of a diverse set of organic chemicals from molecular structure. J. Chem. Inf. Comput. Sci. 2003, 43, 1457-1462. (5) Meylan, W. M.; Howard, P. H.; Boethling, R. S. Molecular topology/fragment contribution method for predicting soil sorption coefficients. Environ. Sci. Technol. 1992, 26, 15601567. (6) Sabljic´, A. On the prediction of soil sorption coefficients of organic pollutants from molecular structure: Application of molecular topology model. Environ. Sci. Technol. 1987, 21, 358366. (7) Tao, S.; Piao, H.; Dawson, R.; Lu, X.; Hu, H. Estimation of organic carbon normalized sorption coefficient (KOC) for soils using the fragment constant method. Environ. Sci. Technol. 1999, 33, 2719-2725. (8) Poole, S. K.; Poole, C. F. Chromatographic models for the sorption of neutral organic compounds by soil from water and air. J. Chromatogr., A 1999, 845, 381-400. (9) Nguyen, T. H.; Goss, K. U.; Ball, P. W. Polyparameter linear free energy relationships for estimating the equilibrium partition of organic compounds between water and the natural organic matter in soils and sediments. Environ. Sci. Technol. 2005, 39, 913-924. (10) Klamt, A.; Eckert, F.; Diedenhofen, M. Prediction of soil sorption coefficients with a conductor-like screening model for real solvents. Environ. Toxicol. Chem. 2002, 21, 2562-2566. (11) Sabljic´, A.; Gu ¨ sten, H.; Verhaar, H.; Hermens, J. QSAR modelling of soil sorption. Improvements and systematics of log KOC vs. log KOW correlations, Chemosphere 1995, 31, 4489-4514. Corrigendum: Chemosphere 1996, 33, 2577. (12) Lohninger, H. Estimation of soil partition coefficients of pesticides from their chemical structure. Chemosphere 1994, 29, 1611-1626.

(13) Syracuse Research Cooperation. CHEMFATE database. http:// www.syrres.com/esc/chemfate.htm. (14) Ku ¨ hne, R.; Ebert, R.-U.; Schu ¨u ¨ rmann, G. Prediction of the temperature dependency of Henry’s Law constant from chemical structure. Environ. Sci. Technol. 2005, 39, 6705-6711. (15) Burkhard, L. P. Estimating dissolved organic carbon partition coefficients for nonionic organic chemicals. Environ. Sci. Technol. 2000, 34, 4663-4668. (16) Baker, J. R.; Mihelcic, J. R.; Luehrs, D. C.; Hickey, J. P. Evaluation of estimation methods for organic carbon normalized sorption coefficients. Water Environ. Res. 1997, 69, 136-145. (17) Kier, L. B.; Hall, L. H. Molecular Structure Description; Academic Press: San Diego, CA, U.S.A., 1999; 286 pp. (18) Estrada, E. Edge adjacency relationships and a novel topological index related to molecular volume. J. Chem. Inf. Comput. Sci. 1995, 35, 31-33. (19) Ku ¨ hne, R.; Kleint, F.; Ebert, R.-U.; Schu ¨u ¨ rmann, G. Calculation of compound properties using experimental data from sufficiently similar chemicals In Software Development in Chemistry; Gasteiger, J., Ed.; Gesellschaft Deutscher Chemiker: Frankfurt, Germany, 1996; pp 125-134. (20) Dimitrov, S.; Dimitrova, G.; Pavlov, T.; Dimitrova, N.; Patlewicz, G.; Niemela, J.; Mekenyan, O. A Stepwise approach for defining the applicability domain of SAR and QSAR models. J. Chem. Inf. Model 2005, 45, 839-849. (21) Ku ¨ hne, R.; Ebert, R.-U.; Schu ¨u ¨ rmann, G. Model selection based on structural similarity - method description and application to water solubility prediction. J. Chem. Inf. Model 2006, 46, 636641. (22) Meylan, W. M. KOWWIN 1.67 (EPI-Suite v.3.12); Syracuse Research Corporation: Syracuse, NY, U.S.A., 2000. (23) Meylan, W. M. PCKOCWIN 1.66 (EPI-Suite v.3.12); Syracuse Research Corporation: Syracuse, NY, U.S.A., 2000. (24) Schu ¨u ¨ rmann, G.; Ku ¨ hne, R.; Kleint, F.; Ebert, R.-U.; Rothenbacher, C.; Herth, P. A software system for automatic chemical property estimation from molecular structure. In Quantitative Structure-Activity Relationships in Environmental Sciences VII; Chen, F., Schu ¨u ¨ rmann, G., Eds.; SETAC Press: Pensacola, FL, U.S.A., 1997; pp 93-114. (25) Sutton, R.; Sposito, G. Molecular structure in soil humic substances: The new view. Environ. Sci. Technol. 2005, 39, 9009-9015. (26) Kier, L. B.; Hall, L. H. Intermolecular accessibility: The meaning of molecular connectivity. J. Chem. Inf. Model. 2000, 40, 792795. (27) Abraham, M. H.; McGowan, J. C. The use of characteristic volumes to measure cavity terms in reversed phase liquid chromatography. Chromatographia 1987, 23, 243-246. (28) Platts, J. A.; Butina, D.; Abraham, M. H.; Hersey, A. Estimation of molecular linear free energy relation descriptors using a group contribution approach. J. Chem. Inf. Comput. Sci. 1999, 39, 835845. (29) Daylight Chemical Information Systems, Inc. http:// www.daylight.com/dayhtml_tutorials/languages/smarts.

Received for review January 23, 2006. Revised manuscript received September 5, 2006. Accepted September 7, 2006. ES060152F

VOL. 40, NO. 22, 2006 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

7011