Environ. Sci. Technol. 1989, 23, 708-713
Prediction of Aqueous Solubility of Organic Chemicals Based on Molecular Structure. 2. Application to PNAs, PCBs, PCDDs, etc.
Nagamany N. Nirmalakhandan and Richard E. Speece*
Vanderbilt University, Nashville, Tennessee 37235 Quantitative structure-activity relationship (QSAR) techniques are used to develop a simple and robust predictive model for aqueous solubility of a wide range of organic chemicals. A basic model, developed originally from a training set of 145 compounds, is developed further to cover over 300 additional compounds. The predictive capability of this model is clearly demonstrated on various classes of chemicals, notably on 45 PCBs that are of considerable environmental concern. Apart from being the largest solubility data set ever modeled by a single QSAR equation, some of the other distinct features of this model are as follows: use of simple molecular descriptors calculable by consistent algorithms for all classes of chemicals; no reliance on experimental data input; well-demonstrated statistical robustness and predictive ability; and applicability over 12 log units range in solubility.
Introduction Aqueous solubility is an important property of organic chemicals that is required in many scientific studies and engineering applications. From an environmental perspective, it is directly related to the steady-state concentration of a chemical pollutant in the aqueous phase, which in turn would control the pollutant's toxicity, biosorption, bioaccumulation, etc., and transfer to other phases. From an engineering point of view, it is a vital input in process and reactor design, plant operation, pollution control, and management decision making. In spite of such wide and critical applications, solubility data are scarce and are not readily available, particularly for many compounds of environmental concern. In many instances, reported data are contradictory, even for commonly encountered chemicals. Because of this, there is a growing interest in the development of predictive methods to estimate solubility data. Availability of good predictive methods can also aid in rationalizing and corroborating interlaboratory data, in devising optimal experimental conditions for laboratory measurement, in designing chemicals for a desired application, in screening new chemicals for further scrutiny, and in estimating other related properties. The QSAR approach has been found to be a useful one in developing such predictive models and has become increasingly accurate and popular. In our previous papers (1,2), we reported on the application of QSAR techniques in the development of a new, simplified approach to model solute-solvent interactions. 
Using molecular connectivity indexes, $\chi$, and a polarizability parameter, $\Phi$, both calculable purely from molecular structure without any experimental inputs, this QSAR-based approach was shown to be an efficient one in the prediction of two important physicochemical properties: the aqueous solubility of 200 miscellaneous organic chemicals, spanning 6 log units, in the first study (1), and Henry's constant for a separate set of 200 others, spanning 7 log units, in the second one (2). In this paper, we report further on the utility of our solubility model in rationalizing interlaboratory data and show how to apply the approach to cover a wider range of organic chemicals. The usefulness of the solubility model is further demonstrated by applying it in a predictive mode, over a span of 12 log units, on additional fresh testing sets of 115 compounds of mixed classes, congeneric sets of 45 PCBs and 25 PAHs, and 65 other miscellaneous compounds not included in the derivation of the preliminary model reported earlier (1).
Environ. Sci. Technol., Vol. 23, No. 6, 1989
0013-936X/89/0923-0708$01.50/0 © 1989 American Chemical Society
Preliminary Model In our original study (1), a generalized preliminary model for solubility was developed from a training set of 145 compounds containing alcohols, halogenated aliphatics, and aromatics. The experimental values used in that study were in gr/gr % units. We have now converted all the data into mol/L units for easy comparison with other QSAR models reported in the literature. On this basis, the preliminary model now takes the form
$$\log S = 1.506 + 1.715\,{}^{0}\chi - 1.469\,{}^{0}\chi^{v} + 1.015\,\Phi \qquad (1)$$
n = 145; r = 0.970; $r^2$ = 0.941; SE = 0.311
In the above model, the first two parameters, $^{0}\chi$ and $^{0}\chi^{v}$, are the zero-order simple and valence molecular connectivity indexes, calculated by using an algorithm slightly modified (1) from that originally proposed by Hall and Kier (3). The third parameter, $\Phi$, introduced as the modified polarizability parameter, was derived from the polarizability parameter $\phi$ proposed by Ketelaar (4). The connectivity indexes have been shown to encode information relating to the surface area of the solute as well as its polarizability, both of which are known to correlate with solubility. However, as detailed in our first study, connectivity indexes alone were not sufficient to account for all the variance in the solubility data of a diverse set, and a polarizability parameter had to be included to explain a substantial portion of the variance. Since some polarizability information appeared to be duplicated by the $\phi$ parameter proposed by Ketelaar and the $\chi$ indexes, we derived a modified $\Phi$, which along with the $\chi$ indexes could yield the best model for solubility. This $\Phi$ was derived from a group contribution method by optimizing the coefficients of certain atomic and group contents of the solute. Once these coefficients are determined from a sufficiently large data set, they are held constant thereafter for all testing sets of compounds. Based on the original training set, this $\Phi$ was established to be
$$\Phi = -0.963(\text{no. of Cl}) - 0.361(\text{no. of H}) - 0.767(\text{no. of double bonds})$$
When compounds with structures and substituents vastly differing from those employed in the training set are to be modeled, appropriate terms have to be added to this $\Phi$ expression as shown later.
Improvement of the Preliminary Model Since reporting the original study, we have screened the training data set to improve the overall quality of the data set as well as our model, using the jackknifed r values, $r_j$. As suggested by Dietrich et al. (5) and Cornish-Bowden and Wong (6), the jackknifed r is a simple and useful indicator of the outlying tendency of any particular case in a data set of n cases. For any given compound $C_j$, its corresponding $r_j$ is determined by deleting that compound from the regression analysis and calculating the resulting correlation coefficient, r, of the original model using the remaining n - 1 cases. Thus, cases with unduly high $r_j$ values could be suspected to be "outliers", and those with low values could be considered "influential points". From the jackknifed r values of the 145 compounds in the training set, two compounds appeared to be obvious outliers: 1-bromo-3-chloropropane with $r_j$ = 0.974, and 1,2,4,5-tetrabromobenzene with $r_j$ = 0.971. Compared to the average r of 0.970 of the 145 cases, even a slightly higher jackknife value of 0.971 is a positive indicator of outlying tendency. Even though this outlying tendency could be due to different mechanistic behavior of the corresponding compounds, because of the similarity of these two compounds to other members of the data set, the experimental observations were suspected to be erroneous. A literature search revealed alternate experimental values: log S = -1.844 and -5.554 vs the original values of log S = -0.939 and -5.894, respectively, for the two compounds. These alternate values are closer to the predicted values of log S = -2.085 and -5.195. Based on this consideration, the original data base was amended, resulting in the overall r of the model increasing from 0.970 to 0.973 and the standard error decreasing from 0.311 to 0.300. The training data set was further scrutinized by analyzing the residuals. First, 1,2,3-tribromobenzene was found to have a high residual of 0.91. When the original source of this data point was examined, we noticed that this value was an estimated one and not an experimentally determined value. Therefore, this point was not included in any of the following statistical analyses. Compounds having residuals over ±0.4 log unit were then identified and the literature was searched for alternate experimental values for them. While reported data were scarce, alternate data were found for some of these compounds that matched the predicted values more closely.
For instance, in the case of tetrachloroethylene, the value used was log S = -2.076, whereas another value of log S = -2.530 from the literature matched the predicted value of log S = -2.536, reducing the error for this compound from -0.51 to -0.006. These changes were considered reasonable and justifiable, and the data base was amended accordingly to yield a more precise model:
$$\log S = 1.465 + 1.758\,{}^{0}\chi - 1.465\,{}^{0}\chi^{v} + 1.016\,\Phi \qquad (2)$$
n = 144; r = 0.975; $r^2$ = 0.949; SE = 0.281
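The jackknife screening described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code or data: the function names, the random descriptors, and the planted error are all assumptions, and the sketch assumes that the regression is refitted each time a case is deleted.

```python
import numpy as np

def multiple_r(X, y):
    """Multiple correlation coefficient r of an ordinary least-squares fit."""
    A = np.column_stack([np.ones(len(y)), X])      # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return np.sqrt(1.0 - ss_res / ss_tot)

def jackknifed_r(X, y):
    """r_j for every case j: delete case j, refit, and record r of that fit."""
    n = len(y)
    keep = np.ones(n, dtype=bool)
    r_j = np.empty(n)
    for j in range(n):
        keep[j] = False
        r_j[j] = multiple_r(X[keep], y[keep])
        keep[j] = True
    return r_j

# Synthetic stand-in for a training set (hypothetical numbers, three descriptors).
rng = np.random.default_rng(0)
X = rng.uniform(2.0, 12.0, size=(30, 3))
y = 1.5 + X @ np.array([1.7, -1.4, 1.0]) + rng.normal(0.0, 0.3, 30)
y[7] += 5.0                                        # plant one grossly wrong "observation"

r_all = multiple_r(X, y)                           # r with all cases included
r_j = jackknifed_r(X, y)
suspect = int(np.argmax(r_j))                      # case whose deletion raises r the most
print(f"overall r = {r_all:.4f}; highest r_j = {r_j[suspect]:.4f} at case {suspect}")
```

A case whose deletion raises r above the overall value is a candidate outlier in the sense used in the text; the sketch only flags such a case and does not decide whether the data point should be replaced.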
We concede that preferential acceptance or rejection of any given data point based purely on statistical grounds is not entirely justifiable. However, in the absence of any other valid method to corroborate contradictory experimental data, we list below the following features of our analysis to justify our approach:
1. On the basis of the premise that physicochemical properties of a chemical are dependent on its molecular structure, and considering that our model was derived solely by using structural features, it is reasonable to use the model to screen the data.
2. When the calculated structural parameters were replaced randomly by arbitrary nonstructural numbers of similar magnitude, the model failed completely. Likewise, when arbitrary random numbers of similar magnitude were used instead of the solubility data with the calculated structural parameters, again the model failed completely.
3. The model predicted solubility of new testing compounds of structural similarity with nearly the same precision as that for the training set and did not disregard any single point as an outlier. However, when compounds with new structural features or mechanistic behaviors not adequately represented in the training set were tested, the predictive errors were rather high, showing again the integrity and validity of the model.
The above exercises clearly indicate that our parameters do in fact, in tandem, describe the systematic variation of solubility with molecular structure. Thus, the model could be used with confidence to screen the data set for consistency and in corroborating, rationalizing, and reconciling inconsistent data. The statistical robustness of the model and the results were demonstrated with a variety of tests such as deletion tests, subset deletion tests, principal-component analysis, etc. (1). The predictive capability of this model was earlier demonstrated by applying it on a testing set of 55 compounds containing ethers, esters, etc. The correlation between the experimental and predicted values was very satisfactory with an r of 0.984 and standard error of 0.29. In the following sections, the basic model is applied to various classes of compounds containing specific functional groups and structures.

Table I. Stepwise Derivation of Equation 6

| step | variable entered | F to remove | value of coeff (a) | correln coeff | RMS resid |
|---|---|---|---|---|---|
| 1 | intercept | | 0.795 | 0.699 | 1.207 |
| | $^{0}\chi^{v}$ | 355.288 | -0.547 (0.029) | | |
| 2 | intercept | | 1.698 | 0.811 | 0.989 |
| | $^{0}\chi^{v}$ | 303.222 | -0.438 (0.025) | | |
| | $\Phi$ | 182.296 | 0.287 (0.021) | | |
| 3 | intercept | | 1.464 | 0.984 | 0.306 |
| | $^{0}\chi^{v}$ | 3520.591 | -1.367 (0.018) | | |
| | $\Phi$ | 5322.552 | 1.001 (0.014) | | |
| | $^{0}\chi$ | 6098.652 | 1.662 (0.028) | | |

(a) Numbers within parentheses equal standard error.

Application of the General Model to Larger Data Sets To illustrate the generality of this approach, we first use a data set of 31 aliphatic and aromatic compounds, 17 of them fluorinated and 14 iodinated. Since these heteroatoms were not adequately represented in the original data base, additional terms had to be included in the $\Phi$ term to account for the fluorine and iodine atoms. When this was done, the overall model fitted the original 145 compounds, the 55 members of the earlier testing set, and these 31 new compounds quite satisfactorily with n = 230, r = 0.975, and SE = 0.286. The polarizability term is now given by
$$\Phi = -0.963(\text{no. of Cl}) - 0.361(\text{no. of H}) - 0.767(\text{no. of double bonds}) - 2.620(\text{no. of F}) + 1.474(\text{no. of I}) \qquad (3)$$
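As a worked illustration of eq 3, consider a hypothetical monofluorinated benzene (C6H5F). This compound is chosen only for illustration and is not taken from the paper's data sets. The calculation assumes, as the $\Phi$ values tabulated for aromatic compounds in Table II suggest, that a benzene ring is counted as three double bonds and that the five ring hydrogens are counted:

```latex
\begin{aligned}
\Phi &= -0.963(0) - 0.361(5) - 0.767(3) - 2.620(1) + 1.474(0) \\
     &= -1.805 - 2.301 - 2.620 \\
     &= -6.726
\end{aligned}
```

Under these assumptions the single fluorine atom contributes about as much to $\Phi$ as the three ring double bonds combined.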
It is of interest to note that the $\Phi$ term carries all the halogen atoms except bromine. A closer inspection of the numerical weighting of the halogens (-2.620, -0.963, 0.0, and +1.474 for F, Cl, Br, and I, respectively) reveals that not only do they follow the same sequence as in the periodic table, but they also increase inversely in almost the same manner as their electronegativity factors on the Pauling scale: 4.0, 3.0, 2.8, and 2.5, respectively, for F, Cl, Br, and I (4). This pattern of weighting shows that the coefficients of the $\Phi$ term, in fact, encode solubility-related information in a systematic manner and are not arbitrary numbers of statistical significance only. The validity of this approach was further examined by testing the model on a set of 27 alkanes and alkenes. The anomalous behavior of these compounds has been observed previously by other workers in this area (3, 7, 8). Even though these compounds differ from all other compounds used in this study because of their inability to take part in hydrogen bonding, the above model predicted the solubility of these compounds well, but with a constant
systematic error. However, when an indicator, A, was used to identify such compounds and included in the polarizability parameter $\Phi$ with a coefficient of -1.24, the generalized model predicted with the same precision for the entire data set of 257 compounds:
$$\log S = 1.354 + 1.653\,{}^{0}\chi - 1.336\,{}^{0}\chi^{v} + 1.004\,\Phi \qquad (4)$$
n = 257; r = 0.973; $r^2$ = 0.947; SE = 0.288
where
$$\Phi = -0.963(\text{no. of Cl}) - 0.361(\text{no. of H}) - 0.767(\text{no. of double bonds}) - 2.620(\text{no. of F}) + 1.474(\text{no. of I}) - 1.24A$$

Table II. Observed vs Predicted log S for 107 PNAs, PCBs, PCDDs, etc. (S in mol/L)

| no. | compd | ref | $^{0}\chi$ | $^{0}\chi^{v}$ | $\Phi$ | obsd log S | pred log S | error |
|---|---|---|---|---|---|---|---|---|
| 1 | indan | 12 | 5.43 | 5.43 | -5.91 | -3.03 | -2.97 | -0.07 |
| 2 | 1-methylnaphthalene | 13 | 6.54 | 6.54 | -7.46 | -3.70 | -4.21 | 0.51 |
| 3 | 2-methylnaphthalene | 13 | 6.54 | 6.54 | -7.46 | -3.75 | -4.21 | 0.46 |
| 4 | 1,3-dimethylnaphthalene | 13 | 6.54 | 6.54 | -8.17 | -4.29 | -4.93 | 0.64 |
| 5 | 1,4-dimethylnaphthalene | 13 | 6.54 | 6.54 | -8.17 | -4.14 | -4.93 | 0.79 |
| 6 | 1,5-dimethylnaphthalene | 13 | 6.54 | 6.54 | -8.17 | -4.66 | -4.93 | 0.27 |
| 7 | 2,3-dimethylnaphthalene | 13 | 6.54 | 6.54 | -8.17 | -4.72 | -4.93 | 0.21 |
| 8 | 2,6-dimethylnaphthalene | 13 | 6.54 | 6.54 | -8.17 | -4.89 | -4.93 | 0.04 |
| 9 | 1,4,5-trimethylnaphthalene | 13 | 8.38 | 8.38 | -8.89 | -4.91 | -5.14 | 0.23 |
| 10 | 1-ethylnaphthalene | 13 | 7.25 | 7.25 | -8.17 | -4.16 | -4.73 | 0.57 |
| 11 | 2-ethylnaphthalene | 10 | 7.25 | 7.25 | -8.17 | -4.29 | -4.73 | 0.44 |
| 12 | 2-methylanthracene | 13 | 8.85 | 8.85 | -9.70 | -6.69 | -5.83 | -0.87 |
| 13 | 9-methylanthracene | 13 | 8.85 | 8.85 | -9.70 | -5.87 | -5.83 | -0.04 |
| 14 | 9,10-dimethylanthracene | 13 | 9.62 | 9.62 | -10.42 | -6.57 | -6.34 | -0.23 |
| 15 | 1-methylfluorene | 14 | 8.25 | 8.25 | -8.93 | -5.22 | -5.22 | 0.00 |
| 16 | 3-methylcholanthrene | 14 | 12.11 | 12.11 | -12.68 | -7.96 | -7.91 | -0.05 |
| 17 | 1-chloronaphthalene | 10 | 6.54 | 6.67 | -7.33 | -3.93 | -4.26 | 0.33 |
| 18 | 2-chloronaphthalene | 10 | 6.54 | 6.67 | -7.33 | -4.12 | -4.26 | 0.14 |
| 19 | 1-bromonaphthalene | 10 | 6.54 | 7.50 | -6.35 | -4.32 | -4.43 | 0.11 |
| 20 | 2-bromonaphthalene | 10 | 6.54 | 7.50 | -6.36 | -4.40 | -4.43 | 0.03 |
| 21 | 6-chloro-10-methyl-1,2-benzanthracene | 10 | 11.77 | 11.90 | -12.56 | -7.44 | -8.07 | 0.63 |
| 22 | naphthalene | 11 | 5.61 | 5.61 | -6.72 | -3.60 | -3.73 | 0.14 |
| 23 | fluorene | 11 | 7.32 | 7.32 | -8.21 | -4.93 | -4.75 | -0.17 |
| 24 | phenanthrene | 11 | 7.77 | 7.77 | -8.98 | -5.16 | -5.40 | 0.24 |
| 25 | pyrene | 11 | 8.77 | 8.77 | -9.75 | -6.13 | -5.89 | -0.24 |
| 26 | fluoranthene | 11 | 8.77 | 8.77 | -9.75 | -5.90 | -5.89 | 0.00 |
| 27 | triphenylene | 11 | 9.92 | 9.92 | -11.24 | -6.73 | -7.07 | 0.35 |
| 28 | benz[a]anthracene | 11 | 9.92 | 9.92 | -11.24 | -7.28 | -7.07 | -0.21 |
| 29 | benzo[a]pyrene | 11 | 10.92 | 10.92 | -12.00 | -7.80 | -7.56 | -0.24 |
| 30 | benzo[ghi]perylene | 11 | 11.92 | 11.92 | -12.77 | -9.02 | -8.06 | -0.96 |
| 31 | dibenz[a,h]anthracene | 11 | 12.08 | 12.08 | -13.49 | -8.74 | -8.74 | 0.00 |
| 32 | coronene | 11 | 12.92 | 12.92 | -13.54 | -9.32 | -8.55 | -0.77 |
| 33 | acenaphthene | 11 | 6.88 | 6.88 | -7.45 | -4.50 | -4.11 | -0.39 |
| 34 | benzo[b]fluorene | 11 | 9.48 | 9.48 | -10.47 | -6.83 | -6.42 | -0.40 |
| 35 | benzo[b]fluorene | 11 | 9.48 | 9.48 | -10.47 | -7.09 | -6.42 | -0.67 |
| 36 | benzo[e]pyrene | 11 | 10.92 | 10.92 | -12.00 | -7.72 | -7.56 | -0.16 |
| 37 | benzo[b]fluoranthene | 11 | 10.92 | 10.92 | -12.00 | -8.23 | -7.56 | -0.66 |
| 38 | benzo[j]fluoranthene | 11 | 10.92 | 10.92 | -12.00 | -8.00 | -7.56 | -0.44 |
| 39 | benzo[k]fluoranthene | 11 | 10.92 | 10.92 | -12.00 | -8.50 | -7.56 | -0.94 |
| 40 | biphenyl | 9 | 6.77 | 6.77 | -8.21 | -4.34 | -4.91 | 0.57 |
| 41 | 2-CB | 9 | 7.69 | 7.83 | -8.81 | -4.63 | -5.44 | 0.82 |
| 42 | 3-CB | 9 | 7.69 | 7.83 | -8.81 | -5.16 | -5.44 | 0.28 |
| 43 | 4-CB | 9 | 7.69 | 7.83 | -8.81 | -5.33 | -5.45 | 0.11 |
| 44 | 2,2'-PCB | 9 | 8.62 | 8.88 | -9.42 | -5.45 | -5.97 | 0.52 |
| 45 | 2,4-PCB | 9 | 8.62 | 8.88 | -9.42 | -5.20 | -5.97 | 0.77 |
| 46 | 2,4'-PCB | 9 | 8.62 | 8.88 | -9.42 | -5.56 | -5.97 | 0.42 |
| 47 | 2,5-PCB | 9 | 8.62 | 8.88 | -9.42 | -5.60 | -5.97 | 0.37 |
| 48 | 2,6-PCB | 9 | 8.62 | 8.88 | -9.42 | -5.63 | -5.97 | 0.34 |
| 49 | 4,4'-PCB | 9 | 8.62 | 8.88 | -9.42 | -6.62 | -5.97 | -0.65 |
| 50 | 2,2',5-PCB | 9 | 9.54 | 9.94 | -10.02 | -6.62 | -6.50 | -0.12 |
| 51 | 2,3',4'-PCB | 9 | 9.54 | 9.94 | -10.02 | -6.52 | -6.50 | -0.02 |
| 52 | 2,4,4'-PCB | 9 | 9.54 | 9.94 | -10.02 | -6.59 | -6.50 | -0.09 |
| 53 | 2,4,5-PCB | 9 | 9.54 | 9.54 | -10.02 | -6.45 | -6.50 | 0.05 |
| 54 | 2,4',5-PCB | 9 | 9.54 | 9.94 | -10.02 | -6.44 | -6.50 | 0.07 |
| 55 | 2,4,6-PCB | 9 | 9.54 | 9.94 | -10.02 | -6.06 | -6.50 | 0.45 |
| 56 | 3,3',4-PCB | 9 | 9.54 | 9.94 | -10.02 | -7.22 | -6.50 | -0.72 |
| 57 | 2,2',3,3'-PCB | 9 | 10.46 | 11.00 | -10.62 | -6.93 | -7.03 | 0.10 |
| 58 | 2,2',3,5'-PCB | 9 | 10.46 | 11.00 | -10.62 | -6.23 | -7.03 | 0.80 |
| 59 | 2,2',4,4'-PCB | 9 | 10.46 | 11.00 | -10.62 | -6.73 | -7.03 | 0.30 |
| 60 | 2,2',4,5'-PCB | 9 | 10.46 | 11.00 | -10.62 | -7.25 | -7.03 | -0.22 |
| 61 | 2,2',5,5'-PCB | 9 | 10.46 | 11.00 | -10.62 | -7.28 | -7.03 | -0.25 |
| 62 | 2,2',6,6'-PCB | 9 | 10.46 | 11.00 | -10.62 | -8.03 | -7.03 | -1.00 |
| 63 | 2,3',4,4'-PCB | 9 | 10.46 | 11.00 | -10.62 | -6.91 | -7.03 | 0.12 |
| 64 | 2,3,4,5-PCB | 9 | 10.46 | 11.00 | -10.62 | -7.18 | -7.03 | -0.15 |
| 65 | 2,3',4',5-PCB | 9 | 10.46 | 11.00 | -10.62 | -7.25 | -7.03 | -0.22 |
| 66 | 3,3',4,4'-PCB | 9 | 10.46 | 11.00 | -10.62 | -7.41 | -7.03 | -0.38 |
| 67 | 2,2',3,4,5-PCB | 9 | 11.38 | 12.05 | -11.22 | -7.52 | -7.56 | 0.04 |
| 68 | 2,2',3,4,5'-PCB | 9 | 11.38 | 12.05 | -11.22 | -7.86 | -7.56 | -0.30 |
| 69 | 2,2',3,4,6-PCB | 9 | 11.38 | 12.05 | -11.22 | -7.44 | -7.56 | 0.13 |
| 70 | 2,2',4,5,5'-PCB | 9 | 11.38 | 12.05 | -11.22 | -7.89 | -7.56 | -0.33 |
| 71 | 2,3,4,5,6-PCB | 9 | 11.38 | 12.05 | -11.22 | -7.38 | -7.56 | 0.19 |
| 72 | 2,2',3,3',4,4'-PCB | 9 | 12.31 | 13.11 | -11.82 | -8.91 | -8.09 | -0.82 |
| 73 | 2,2',3,3',4,5-PCB | 9 | 12.31 | 13.11 | -11.82 | -8.63 | -8.09 | -0.53 |
| 74 | 2,2',3,3',5,6-PCB | 9 | 12.31 | 13.11 | -11.82 | -8.60 | -8.09 | -0.50 |
| 75 | 2,2',3,3',6,6'-PCB | 9 | 12.31 | 13.11 | -11.82 | -7.78 | -8.09 | 0.32 |
| 76 | 2,2',4,4',5,5'-PCB | 9 | 12.31 | 13.11 | -11.82 | -8.48 | -8.09 | -0.38 |
| 77 | 2,2',4,4',6,6'-PCB | 9 | 12.31 | 13.11 | -11.82 | -8.52 | -8.09 | -0.42 |
| 78 | 2,2',3,3',4,4',6-PCB | 9 | 13.23 | 14.17 | -12.43 | -8.26 | -8.62 | 0.36 |
| 79 | 2,2',3,4,5,5',6-PCB | 9 | 13.23 | 14.17 | -12.43 | -8.93 | -8.62 | -0.30 |
| 80 | 2,2',3,3',4,4',5,5'-PCB | 9 | 14.15 | 15.22 | -13.03 | -9.54 | -9.16 | -0.38 |
| 81 | 2,2',3,3',5,5',6,6'-PCB | 9 | 14.15 | 15.22 | -13.03 | -9.38 | -9.16 | -0.22 |
| 82 | 2,2',3,3',4,4',5,5',6-PCB | 9 | 15.08 | 16.28 | -13.63 | -9.77 | -9.69 | -0.09 |
| 83 | 2,2',3,3',4,5,5',6,6'-PCB | 9 | 15.08 | 16.28 | -13.63 | -9.41 | -9.69 | 0.27 |
| 84 | 2,2',3,3',4,4',5,5',6,6'-PCB | 9 | 16.00 | 17.34 | -14.23 | -10.38 | -10.22 | -0.16 |
| 85 | phenol | 15 | 4.38 | 3.83 | -4.11 | 0.00 | -0.69 | 0.68 |
| 86 | 2-chlorophenol | 15 | 5.30 | 4.89 | -4.71 | -0.66 | -1.22 | 0.57 |
| 87 | 2,4-dichlorophenol | 15 | 6.23 | 5.94 | -5.31 | -1.56 | -1.73 | 0.17 |
| 88 | 2,4,6-trichlorophenol | 15 | 7.15 | 7.00 | -5.91 | -2.39 | -2.27 | -0.12 |
| 89 | pentachlorophenol | 15 | 9.00 | 9.46 | -7.12 | -3.52 | -3.81 | 0.29 |
| 90 | 3,5-dimethylphenol | 15 | 6.23 | 5.67 | -5.55 | -1.46 | -1.60 | 0.14 |
| 91 | 2,4-dimethylphenol | 15 | 6.23 | 5.67 | -5.55 | -1.25 | -1.60 | 0.35 |
| 92 | 2-chlorodibenzo-p-dioxin | 16 | 8.95 | 8.49 | -9.92 | -5.24 | -5.37 | 0.14 |
| 93 | 2,7-dichlorodibenzo-p-dioxin | 16 | 9.88 | 9.55 | -10.52 | -6.00 | -5.91 | -0.10 |
| 94 | 1,2,3,4-tetrachlorodibenzo-p-dioxin | 16 | 11.72 | 11.76 | -11.72 | -7.14 | -7.10 | -0.03 |
| 95 | 1,3,6,8-tetrachlorodibenzo-p-dioxin | 16 | 11.72 | 11.76 | -11.72 | -7.09 | -7.10 | 0.01 |
| 96 | 1,2,3,4,7-pentachlorodibenzo-p-dioxin | 16 | 12.64 | 12.82 | -12.32 | -7.79 | -7.64 | -0.16 |
| 97 | heptachlorodibenzo-p-dioxin | 16 | 14.49 | 14.93 | -13.53 | -8.88 | -8.70 | -0.18 |
| 98 | 2-butanone | 17 | 3.91 | 3.61 | -2.64 | 0.28 | 0.32 | -0.05 |
| 99 | 2-octanone | 17 | 6.74 | 6.44 | -5.53 | -2.05 | -1.79 | -0.26 |
| 100 | 2-nonanone | 17 | 7.44 | 7.15 | -6.25 | -2.58 | -2.34 | -0.24 |
| 101 | 3-hexanone | 7 | 5.32 | 5.02 | -4.09 | -0.83 | -0.74 | -0.09 |
| 102 | 2,4-dimethyl-3-pentanone | 7 | 6.35 | 6.06 | -4.81 | -1.30 | -1.19 | -0.11 |
| 103 | aniline | 8 | 4.38 | 3.96 | -3.47 | -0.41 | -0.22 | -0.19 |
| 104 | o-toluidine | 8 | 5.30 | 4.88 | -4.19 | -0.82 | -0.69 | -0.13 |
| 105 | m-toluidine | 8 | 5.30 | 4.88 | -4.19 | -0.85 | -0.69 | -0.17 |
| 106 | o-chloroaniline | 8 | 5.30 | 5.02 | -4.07 | -1.53 | -0.76 | -0.77 |
| 107 | m-chloroaniline | 8 | 5.30 | 5.02 | -4.07 | -1.37 | -0.76 | -0.61 |
The slight variations in the intercept and the three coefficients in eq 4 are due to the larger and more diverse set of compounds covered. In spite of these slight variations, considering the fact that the general model was derived by using only 145 compounds, it is satisfying to note that the general form of the model remained unchanged with the quality of the model still intact. It is also to be noted that, once the coefficients in the $\Phi$ expression are established, they remain fixed thereafter and do not change with the addition of new compounds to the data base. At this point, another testing set of 57 compounds was assembled to test the predictive capability of eq 4. This set of compounds contained multiple structural and heteroatom features that were only included individually in the training set. For instance, cyclopropyl ethyl ether, cyclohexanol, and 2-bromoethyl acetate are typical compounds that carry multiple features that were represented in the original data base individually by ethers, alcohols, halogens, etc. The predicted log S values for these compounds agreed very well with the experimental values, with errors within ±0.30 log unit for 73% of these 57 compounds, and within ±0.47 log unit for 96% of them. When this testing set was merged with the training set of 257 compounds, the same basic model fitted all 314 compounds to the same level of precision:
$$\log S = 1.318 + 1.643\,{}^{0}\chi - 1.351\,{}^{0}\chi^{v} + 0.991\,\Phi \qquad (5)$$
n = 314; r = 0.974; $r^2$ = 0.949; SE = 0.302
So far, the approach and the model have been shown to be applicable to compounds with two basic forms of structural skeletons: the aromatic ring and the aliphatic chain, with various degrees of branching and substitution with simple heteroatoms. It is of importance to check the model's applicability to compounds with specific functional groups and complex structural forms. To demonstrate the flexibility and utility of our approach in handling new compounds, new training sets of data for 10 amines, 6 aldehydes, 12 ketones, and 14 nitro compounds were gathered to represent various common functional groups. To represent complex structures, data on 9 polychlorinated dibenzo-p-dioxins were selected from the 15 for which data have been reported in the literature (16). The remaining six were reserved to check the predictive ability later. Since these compounds are structurally and mechanistically very different from those considered so far, additional terms had to be added to the $\Phi$ expression. As mentioned earlier, this is done by keeping the coefficients already derived constant, and adding additional terms to the $\Phi$ expression to represent the new compounds, so as to get the best statistical fit to the entire data set. The following equation results, with K indicating ketones or aldehydes, and D dioxins:
$$\log S = 1.464 + 1.662\,{}^{0}\chi - 1.367\,{}^{0}\chi^{v} + 1.001\,\Phi \qquad (6)$$
n = 365; r = 0.984; $r^2$ = 0.967; SE = 0.306
where
$$\Phi = -0.963(\text{no. of Cl}) - 0.361(\text{no. of H}) - 0.767(\text{no. of double bonds}) - 2.620(\text{no. of F}) + 1.474(\text{no. of I}) - 1.24A + 1.014K + 0.636(\text{no. of NH}_2) + 0.833(\text{no. of NH}) - 1.695(\text{no. of NO}_2) - 1.823D$$
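The group-contribution expression for $\Phi$ and eq 6 can be evaluated with a few lines of code. The sketch below is illustrative only: the function and key names are invented here, the connectivity indexes are taken as given inputs (the modified algorithm for computing them is described in ref 1 and is not reproduced), and the counting conventions in the comments are inferred from the $\Phi$ values tabulated in Table II rather than stated in the text.

```python
# Coefficients of the Phi expression that accompanies eq 6, as printed above.
PHI_COEFF = {
    "Cl": -0.963, "H": -0.361, "double_bonds": -0.767,
    "F": -2.620, "I": 1.474,
    "A": -1.24,      # indicator for alkanes and alkenes
    "K": 1.014,      # indicator for ketones or aldehydes
    "NH2": 0.636, "NH": 0.833, "NO2": -1.695,
    "D": -1.823,     # indicator for dioxins
}

def phi(**counts):
    """Sum coefficient * count over the supplied atom/group counts."""
    unknown = set(counts) - set(PHI_COEFF)
    if unknown:
        raise KeyError(f"no coefficient for: {sorted(unknown)}")
    return sum(PHI_COEFF[name] * n for name, n in counts.items())

def log_s_eq6(chi0, chi0v, phi_value):
    """log S (S in mol/L) from eq 6, given the two indexes and Phi."""
    return 1.464 + 1.662 * chi0 - 1.367 * chi0v + 1.001 * phi_value

# Inferred conventions: the tabulated Phi values are reproduced when each
# aromatic ring is counted as three double bonds and only hydrogens bonded
# to carbon are counted.
phi_biphenyl = phi(H=10, double_bonds=6)              # Table II lists -8.21
phi_2cdd = phi(Cl=1, H=7, double_bonds=6, D=1)        # Table II lists -9.92
phi_aniline = phi(H=5, double_bonds=3, NH2=1)         # Table II lists -3.47

# Indexes for 2-chlorodibenzo-p-dioxin taken from Table II (rounded values).
print(round(phi_2cdd, 3), round(log_s_eq6(8.95, 8.49, phi_2cdd), 2))
```

Because the descriptors in Table II are rounded to two decimals, values computed this way may not reproduce the tabulated predictions exactly.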
Figure 1. Comparison between observed and calculated log S (S in mol/L). [Plot for eq 6; n = 365; std. error = 0.306. Horizontal axis: calculated log S. Legend entries: 53 aromatics, 87 aliphatics, 13 ethers, 47 esters, 35 alkanes etc., 10 cyclics.]
It is to be noted that this equation covering all the 365 compounds is essentially the same in form and quality as that obtained originally with the basic training set of 145 compounds. Even though $\Phi$ is a lumped parameter containing 10 terms, not all the terms are always used for any given chemical. Since more than 350 compounds belonging to over 10 different classes are modeled by using these 10 terms indirectly, this lumping approach does not in any way violate any statistical standards. (In fact, such lumping of parameters has been widely used and is accepted by all QSAR practitioners, as in the case of the log P approach, where over 30 parameters may be lumped in its estimation. Many QSAR models using log P have been reported for much smaller numbers of congeneric compounds belonging to just one class.) We propose this as our general model for aqueous solubility, which can be expected to predict satisfactorily for compounds containing heteroatoms, functional groups, and structures similar to those in the training sets used, either individually or in combination. To illustrate the statistical significance of the individual terms and the overall model, its stepwise derivation is shown in Table I. The quality of fit between the observed and calculated log S values for the different classes of compounds is shown in Figure 1. Predictive Ability of the General Model The validity of our approach and the predictive ability of our general model are demonstrated below on various sets of a congeneric series and miscellaneous compounds such as PCBs, PNAs, PCDDs, phenols, etc., which are of considerable importance in environmental contamination. Even though many of these compounds are not directly represented per se in the training set, their basic structural features are adequately represented, and therefore, the model could be expected to predict satisfactorily without any parameter adjustment.
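Agreement between observed and predicted values on a testing set is summarized in this paper by a correlation coefficient and a standard error. The sketch below shows one way such figures could be computed, using the six PCDD rows of Table II (nos. 92 to 97) as input. The function name is illustrative, and the root-mean-square error is used as a simple stand-in for the standard error; the paper does not state the exact formula it used.

```python
import numpy as np

# Observed and predicted log S (mol/L) for the six PCDDs, Table II nos. 92-97.
obs = np.array([-5.24, -6.00, -7.14, -7.09, -7.79, -8.88])
pred = np.array([-5.37, -5.91, -7.10, -7.10, -7.64, -8.70])
tabulated_error = np.array([0.14, -0.10, -0.03, 0.01, -0.16, -0.18])

def agreement(observed, predicted):
    """Return per-compound errors, Pearson r, and root-mean-square error."""
    err = observed - predicted                  # same sign convention as Table II
    r = np.corrcoef(observed, predicted)[0, 1]
    rms = np.sqrt(np.mean(err ** 2))
    return err, r, rms

err, r, rms = agreement(obs, pred)
print(np.round(err, 2), round(float(r), 3), round(float(rms), 3))
```

The errors recomputed from the two-decimal table entries can differ from the tabulated error column in the last digit, presumably because the tabulated errors were formed before rounding.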
All the reported solubility data for 45 PCBs (9) containing up to 10 chlorine substitutions were tested, and the agreement between the observations and predictions was found to be quite satisfactory, with an r of 0.955 and standard error of 0.287. Further, 38 PNAs containing up to six fused rings were tested and again the agreement was found to be reasonable. Table II lists the observed and predicted log S data for all the compounds tested, and Figure 2 illustrates the quality of the prediction, from which the agreement between the observed and predicted log S can be seen to be very good, with r = 0.987 and SE = 0.382. The validity of our approach is amply demonstrated by the results of this predictive test. The fact that the original training set did not contain any PCBs or PNAs adds further credence to our model. This also shows that our model parameters are richer in information content relating to aqueous solubility when compared to others such as log P (7, 10), total surface area (9, 11), etc., which have been used by many researchers in deriving QSAR models for selected congeneric sets of compounds. The robustness of the general model is further validated when this testing data set is merged with the previous set of 365 miscellaneous compounds: the same basic model fitted all the 470 chemicals with an r of 0.99 and a standard error of 0.332:
$$\log S = 1.543 + 1.638\,{}^{0}\chi - 1.374\,{}^{0}\chi^{v} + 1.003\,\Phi \qquad (7)$$
n = 470; r = 0.990; $r^2$ = 0.980; SE = 0.332
Figure 2. Comparison between observed and predicted log S (eq 6) (S in mol/L). [Plot; horizontal axis: predicted log S.]
Conclusions The studies reported in the previous paper and this one show that aqueous solubility can be predicted with a high degree of confidence by using the simple approach developed herein. The approach employs error-free descriptors which are calculable purely from molecular structure, using consistent algorithms, without requiring any experimental data input whatsoever. The approach itself is flexible in that it can be used to accommodate different classes of compounds. The general aqueous solubility model developed from a training set of 145 compounds was eventually tested successfully on an additional 325 compounds. The model explains over 98% of the variation in solubility data of the 470 compounds studied. Spanning over 12 log units and covering liquid and solid alkanes, alkenes, alcohols, esters, ethers, cyclos, amines, aldehydes, ketones, nitros, aliphatics and aromatics with halo substitutions, PNAs, PCBs, PCDDs, etc., the unexplained variance due to imperfections in the model and the experimental errors in the data is only 2%.
This represents the largest solubility data base ever to have been successfully modeled by a single QSAR equation with an overall r of 0.990 and standard error of 0.332.
Literature Cited
(1) Nirmalakhandan, N.; Speece, R. E. Environ. Sci. Technol. 1988, 22, 328.
(2) Nirmalakhandan, N.; Speece, R. E. Environ. Sci. Technol. 1988, 22, 1349.
(3) Hall, L. H.; Kier, L. B.; Murray, W. J. J. Pharm. Sci. 1975, 64, 1974.
(4) Horvath, A. L. Halogenated Hydrocarbons; Marcel Dekker, Inc.: New York, 1982.
(5) Dietrich, W. S.; Dreyer, N. D.; Hansch, C. J. Med. Chem. 1980, 23, 120.
(6) Cornish-Bowden, A.; Wong, J. T. Biochem. J. 1978, 175, 969.
(7) Hansch, C.; Quinlan, J. E.; Lawrence, G. L. J. Org. Chem. 1968, 33, 347.
(8) Chiou, C. T.; Schmedding, D. W. Environ. Sci. Technol. 1982, 16, 4.
(9) Opperhuizen, A.; Gobas, F. A. P. C.; Van der Steen, J. M. D.; Hutzinger, O. Environ. Sci. Technol. 1988, 22, 638.
(10) Yalkowsky, S. H.; Valvani, S. C.; Mackay, D. Residue Rev. 1983, 85, 43.
(11) Baker, R. J.; Donelan, B. J.; Peterson, L. J.; Acree, W. E., Jr.; Tsai, C.-c. Phys. Chem. Liq. 1987, 16, 279.
(12) Yalkowsky, S. H.; Valvani, S. C. J. Pharm. Sci. 1980, 69, 912.
(13) Mackay, D.; Shiu, W. Y. J. Chem. Eng. Data 1977, 22, 399.
(14) Miller, M. M.; Wasik, S. P.; Huang, G. L.; Shiu, W. Y.; Mackay, D. Environ. Sci. Technol. 1985, 19, 522.
(15) Water Related Environmental Fate of 129 Priority Pollutants. EPA Report 440/4-79-029b, 1979; Vol. II.
(16) Shiu, W. Y.; Doucette, W.; Gobas, F. A. P. C.; Andren, A.; Mackay, D. Environ. Sci. Technol. 1985, 22, 651.
(17) Tewari, Y. B.; Miller, M. M.; Wasik, S. P.; Martire, D. E. J. Chem. Eng. Data 1982, 27, 451.

Received July 25, 1988. Accepted February 16, 1989.

Environ. Sci. Technol. 1989, 23, 713-722
Evaluation of Mass Transfer Parameters for Adsorption of Organic Compounds from Complex Organic Matrices
Edward H. Smith and Walter J. Weber, Jr.*
Environmental and Water Resources Engineering, The University of Michigan, Ann Arbor, Michigan 48109
The short-bed adsorber (SBA) technique has been demonstrated to be an effective method for estimation of mass transport parameters for adsorption of target organic compounds from otherwise organic-free background waters. This work evaluates the procedure for the more pertinent circumstance in which a water or wastewater not only contains target organic species but also contains complex and uncharacterized dissolved organic matter. The SBA is compared with other parameter estimation methods for adsorption of two target compounds from different background waters. A system-specific modeling approach is found to accommodate the variable impacts of different background waters on the equilibrium and kinetic relationships of the target species. Verification studies reveal that mass transfer parameters determined by the SBA technique generally yield more accurate predictions of fixed-bed adsorber breakthrough profiles for target compounds than do those determined by the other methods evaluated.
Introduction Dual resistance rate models have undergone extensive testing and refinement for describing and predicting the adsorption of organic substances by microporous adsorbents in fixed-bed reactor systems (1). The advantages that have been thus gained by refinements in model formulation and numerical solution techniques can potentially be negated, however, by the inaccuracies and uncertainties yet associated with the evaluation of equilibrium and mass transport coefficients required for model simulation and subsequent scale-up. Estimation of reliable mass transfer parameters has been a particular challenge for research aimed at characterizing adsorption processes for such heterogeneous systems as those commonly encountered in environmental field applications. Techniques for evaluating external and intraparticle mass transfer, the two mass transport steps considered important in the development of adsorber models, have, logically, been developed in simple systems of one or more target species in otherwise organic-free background water. It remains to examine the applicability of these parameter estimation techniques for
the more pertinent situation of waters and wastewaters containing complex humic and fulvic materials and other uncharacterized dissolved organic matter as well as the particular target organic compounds of interest. External, or film, mass transfer coefficients have often been estimated by using semiempirical correlations developed from experimental data for particle-fluid mass transfer processes measured for specific solutes and solid particulates. A wide range of correlations have been published, each distinguished by an observed functional relationship between the dimensionless Reynolds, Schmidt, and Sherwood numbers. The primary system parameters incorporated into these dimensionless terms are the mass flow rate and void space in the bed. Important solute-solid information includes the free liquid diffusivity of the sorbate and a characteristic length parameter related to the solid particles, usually the equivalent particle diameter, which may be modified by an empirically determined shape factor. Several attempts have been made to deduce a generalized working correlation by nonlinear analysis of all reported film mass transfer data (2,3). The two major advantages of a literature correlation model are that the film diffusion coefficient can be determined without experimental effort and that it provides a means for evaluating external mass transfer independently of other physical and chemical processes in the system. 
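The dimensionless groups named above can be written out explicitly. The sketch below uses the textbook definitions with the particle diameter as the characteristic length and the superficial velocity as the velocity scale. These choices, the function names, and the numerical values are assumptions made for illustration: individual literature correlations may instead use interstitial velocity, mass flow rate, or a void-fraction correction, and none of the numbers come from the paper.

```python
import math

def reynolds(rho, velocity, d_particle, mu):
    """Re = rho * v * d / mu (particle Reynolds number)."""
    return rho * velocity * d_particle / mu

def schmidt(mu, rho, diffusivity):
    """Sc = mu / (rho * D), with D the free liquid diffusivity of the sorbate."""
    return mu / (rho * diffusivity)

def sherwood(k_f, d_particle, diffusivity):
    """Sh = k_f * d / D, with k_f the film mass transfer coefficient."""
    return k_f * d_particle / diffusivity

def film_coefficient(sh, d_particle, diffusivity):
    """Invert the Sherwood number to recover k_f once a correlation supplies Sh."""
    return sh * diffusivity / d_particle

# Hypothetical water-like conditions in SI units.
re = reynolds(rho=1000.0, velocity=0.01, d_particle=1e-3, mu=1e-3)
sc = schmidt(mu=1e-3, rho=1000.0, diffusivity=1e-9)
k_f = film_coefficient(sh=20.0, d_particle=1e-3, diffusivity=1e-9)
print(re, sc, k_f)
```

A literature correlation of the kind described would supply Sh as a function of Re and Sc; the last helper then converts that Sh into a film coefficient for the bed of interest.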
Inherent difficulties in applying these models to granular activated carbon adsorption systems are that (1) nearly all are developed with solids that are different in chemical and physical character than activated carbon and thus ignore potential impacts of surface topography and roughness on film-controlled mass transfer (4, 5), (2) calculations of free liquid diffusivity and of film diffusion do not incorporate interactions between target contaminants and other background species in solution, and (3) values computed by different correlations vary significantly, and there are no established criteria for determining which correlation may be best suited to a particular system. Film diffusion may also be evaluated by various model calibration techniques by using system-specific experimental data. For example, the film diffusion coefficient, $k_f$, may be determined from column breakthrough data by