Ind. Eng. Chem. Res. 2002, 41, 6623-6633
CORRELATIONS

Group-Contribution-Based Estimation of Octanol/Water Partition Coefficient and Aqueous Solubility

Jorge Marrero and Rafiqul Gani*

CAPEC, Department of Chemical Engineering, Technical University of Denmark, Building 227, DK-2800 Lyngby, Denmark
New methods for the estimation of the octanol/water partition coefficient (Kow) and aqueous solubility (Ws) at ambient temperature are presented. The property values are estimated by using a three-level group-contribution estimation approach requiring only molecular structural information. The primary level uses contributions from simple first-order groups that allow for the description of a wide variety of organic compounds, whereas the higher levels (second- and third-order groups) involve polyfunctional and structural groups that provide more information about molecular fragments whose description through first-order groups is not possible. The group-contribution values were calculated by linear regression analysis using a data set of 9560 values for Kow and 2087 values for Ws. The data set covers compounds ranging from C3 to C70, including large and heterocyclic compounds. Compared to other currently used group-contribution methods, the new methods make significant improvements in accuracy, with logarithm-unit average absolute errors of 0.24 for Kow and 0.46 for Ws.

Introduction

The n-octanol/water partition coefficient is the ratio of the concentration of a chemical in n-octanol to that in water in a two-phase system at equilibrium.
This ratio is used as a measure of lipophilicity, and its logarithm, log Kow, is one of the key parameters in quantitative structure-activity relationship (QSAR) studies.1-3 Kow is also used to provide valuable information for the overall understanding of the uptake, distribution, biotransformation, and elimination of a wide variety of chemicals.4-6 Water solubility (Ws), on the other hand, is defined as the maximum amount of a chemical that dissolves in pure water at a specific temperature, and it is also an important parameter in QSAR as well as environmental studies.7-9 Thus, there is an increasing need for reliable techniques that can be used to estimate the values of these properties, as it is not always practical or possible to obtain them from experimental measurements.

For the estimation of log Kow and log Ws, different approaches have been developed, and new methods continue to be proposed. Several works have reported the use of either correlation equations or neural networks with parameters derived from quantum mechanics to estimate log Kow and/or Ws.10-15 The main limitation of these approaches is that it can be difficult and sometimes even impossible to obtain the electronic properties used as input parameters in the estimations of log Kow or Ws. This limitation, and the fact that the predictive capability of these methods is somewhat restricted, considerably constrains the use of these methods in tasks such as computer-aided molecular design and process/product synthesis, where the estimation of property values for thousands of compounds (many of them new) might be necessary.

An alternative approach is to apply group-contribution (GC) methods. These methods are computationally very efficient and can also be very accurate for a wide range of compounds.16 For the estimation of log Kow, the GC methods reported by Hansch and Leo17 and, more recently, by Meylan and Howard18 appear to be the most popular. For the estimation of aqueous solubility at ambient temperature, the more common approach has been to apply correlation equations in which Ws is calculated from known values of log Kow and the normal melting temperature Tm.19,20 These estimation equations are quite accurate but require accurate values of log Kow and Tm, which might not always be available.

In this work, we present new group-contribution methods for the estimation of log Kow and log Ws that require only molecular structural information and apply to a very large range of compounds. These new methods are based on a multilevel group-contribution approach that has been used previously, with excellent results, in the development of estimation methods for a considerable number of properties of pure organic compounds, including normal melting temperature, normal boiling temperature, critical constants, enthalpy of fusion, and others.16 Motivated by the good results obtained in the previous work, our efforts have focused on developing new GC methods with the ultimate objective of providing more accurate and reliable estimations of log Kow and log Ws for a wide range of chemical substances, including large and complex compounds of interest in pharmaceutical and environmental studies.

* Corresponding author. E-mail: [email protected]. Telephone number: (+45)45252882. Fax number: (+45)45932906.
10.1021/ie0205290 CCC: $22.00 © 2002 American Chemical Society Published on Web 11/14/2002
Ind. Eng. Chem. Res., Vol. 41, No. 25, 2002
Table 1. First-Order Groups and Their Contributions along with Sample Assignments
no. (1-86) | group | example | log Kow | SE^{a,b} | log Ws | SE^{b,c}
CH3 CH2 CH C CH2=CH CH=CH CH2=C CH=C C=C CH≡C C≡C aCH aC fused with aromatic ring aC fused with non-aromatic subring aC except as above aN in aromatic ring aC-CH3 aC-CH2 aC-CH aC-C aC-CH=CH2 aC-CH=CH aC-C=CH2 aC-C≡CH aC-C≡C OH aC-OH COOH aC-COOH CH3CO CH2CO CHCO CCO aC-CO CHO aC-CHO CH3COO CH2COO CHCOO CCOO HCOO aC-COO aC-OOCH aC-OOC COO except as above CH3O CH2O CH-O C-O aC-O CH2NH2 CHNH2 CNH2 CH3NH CH2NH CHNH CH3N CH2N aC-NH2 aC-NH aC-N NH2 except as above CH=N C=N CH2CN CHCN CCN aC-CN CN except as above CH2NO2 CHNO2 CNO2 aC-NO2 NO2 except as above ONO2 HCONHCH2 CONH2 CONHCH3 CONHCH2 CON(CH3)2 CONCH3CH2 CON(CH2)2 CONHCO CONCO aC-CONH2 aC-NH(CO)H
n-tetracontane (2) n-tetracontane (38) 2-methylpentane (1) 2,2-dimethylbutane (1) 1-hexene (1) 2-hexene (1) 2-methyl-1-butene (1) 2-methyl-2-butene (1) 2,3-dimethyl-2-butene (1) 1-pentyne (1) 3-decyne (1) benzene (6) naphthalene (2) indane (2) benzophenone (1) pyridine (1) toluene (1) ethylbenzene (1) cumene (1) tert-butylbenzene (1) styrene (1) 1-propenylbenzene (1) α-methylstyrene (1) phenylacetylene (1) 1-phenyl-1-propyne (1) 1,4-butanediol (2) phenol (1) pentanoic acid (2) benzoic acid (1) 2-butanone (1) 3-pentanone (1) 2,4-dimethyl-3-pentanone (1) 2,2,4,4-tetramethyl-3-pentanone (1) acetophenone (1) 1-hexanal (1) benzaldehyde (1) butyl acetate (1) methyl butyrate (1) ethyl isobutyrate (1) ethyl 2,2-dimethyl propionate (1) propyl formate (1) methyl benzoate (1) phenyl formate (1) phenyl acetate (1) ethyl acrylate (1) methyl butyl ether (1) di-n-butyl ether (1) sec-butyl ether (1) tert-butyl ether (1) methyl phenyl ether (1) ethylamine (1) sec-butylamine (1) tert-butylamine (1) dimethylamine (1) dipropylamine (1) diisopropylamine (1) methyldiethylamine (1) triethylamine (1) aniline (1) N-methyl aniline (1) N,N-dimethyl aniline (1) cyclobutylamine acetaldazine (2) ketazine (2) propionitrile (1) isobutyronitrile (1) 2,2-dimethylpropionitrile (1) benzonitrile (1) acrylonitrile (1) 1-nitropropane (1) 2-nitropropane (1) 2-methyl-2-nitropropane (1) nitrobenzene (1) nitrocyclohexane (1) n-butyl nitrate (1) ethylformamide (1) butyramide (1) methylacetamide (1) ethylacetamide (1) dimethylacetamide (1) methylethylacetamide (1) diethylacetamide (1) diacetamide (1) methyldiacetamide benzamide N-phenylformamide (1)
0.257 78 0.450 05 0.465 31 0.748 06 0.511 34 0.758 03 0.733 37 0.701 01 0.796 44 -0.320 01 0.430 81 0.216 94 0.364 01 0.339 82 0.331 52 -0.498 33 0.642 29 0.565 09 0.754 38 0.963 77 1.275 75 0.936 95 0.939 88 0.902 34 1.254 31 -1.096 58 -0.025 44 -0.883 14 0.140 76 -0.480 75 -0.174 07 0.204 53 0.256 51 -0.175 31 -0.633 06 -0.117 82 -0.490 06 -0.317 51 -0.522 03 0.154 21 -0.887 47 -0.036 86 -0.367 66 -0.266 05 -0.474 98 -0.350 70 -0.123 97 -0.033 33 0.923 07 0.018 73 -1.413 22 -1.922 14 -1.139 49 -0.741 95 -0.983 92 -0.377 16 -0.460 04 -0.655 52 -0.292 10 0.360 82 0.188 04 -0.815 10 0.280 70 0.804 52 -0.573 04 -0.571 77 0.242 21 -0.080 22 -0.101 22 -0.233 57 -0.172 17 -0.548 63 0.144 07 -0.425 43 -0.096 41 -1.733 07 -1.444 27 -0.456 69 -0.835 43 -0.365 98 -0.824 94 -0.633 88 -2.110 29 -0.976 42 -1.045 12 -0.405 50
0.117 47 0.056 58 0.131 16 0.194 44 0.172 70 0.178 04 0.233 79 0.188 91 0.339 70 0.195 46 0.239 51 0.070 64 0.089 33 0.084 12 0.117 74 0.098 41 0.107 05 0.123 07 0.171 86 0.205 85 0.440 53 0.187 66 0.389 09 0.579 30 0.394 01 0.120 49 0.111 60 0.148 47 0.162 31 0.172 92 0.181 41 0.248 97 0.284 97 0.160 73 0.241 64 0.221 42 0.179 31 0.155 32 0.213 19 0.257 40 0.440 97 0.148 80 0.579 30 0.170 48 0.134 65 0.158 79 0.144 88 0.253 13 0.392 67 0.111 12 0.202 42 0.235 61 0.443 77 0.198 26 0.182 90 0.239 33 0.157 60 0.196 59 0.117 44 0.147 91 0.201 39 0.148 95 0.200 76 0.210 80 0.221 52 0.410 04 0.411 77 0.169 60 0.175 30 0.327 86 0.371 04 0.445 80 0.117 49 0.164 17 0.270 78 0.492 48 0.166 11 0.184 98 0.164 26 0.209 32 0.260 12 0.235 47 0.410 43 0.579 86 0.186 62 0.305 96
-5.944 17 -5.779 18 -5.549 66 -5.162 26 -10.859 55 -10.788 85 -10.499 65 -10.624 20 -9.600 40 -10.281 21 -12.354 65 -5.190 40 -5.230 00 -5.229 94 -5.198 05 -5.070 59 -11.011 06 -10.631 42 -10.360 32 -10.111 12 -15.795 01 -16.026 30 -15.966 77 -15.467 68 -15.535 57 -5.801 15 -11.010 26 -16.828 40 -22.220 05 -16.588 23 -17.117 42 -16.072 62 -16.257 10 -16.029 85 -11.007 55 -16.121 27 -22.694 05 -22.394 91 -22.052 00 -22.388 09 -17.013 00 -21.808 00 -21.647 34 -16.929 62 -11.374 75 -11.159 87 -11.380 73 -10.541 61 -10.941 66 -10.732 52 -11.885 81 -9.421 10 -10.616 30 -10.016 56 -11.041 53 -10.632 51 -10.075 54 -10.902 99 -11.056 61 -10.547 09 -5.799 13 -11.164 41 -10.172 86 -15.266 75 -14.998 22 -15.295 57 -10.759 93 -23.287 17 -23.290 88 -22.782 89 -23.073 88 -19.017 16 -24.054 06 -16.555 13 -22.061 05 -22.698 44 -26.947 67 -26.936 05 -26.457 30 -27.568 03 -27.390 46 -21.938 52 -21.179 41
1.635 13 1.579 37 1.521 57 1.461 49 2.193 09 2.151 80 2.151 80 2.109 79 2.066 75 2.109 81 2.066 90 1.521 57 1.461 47 1.461 47 1.461 44 1.578 23 2.193 07 2.151 80 2.109 78 2.066 83 2.635 36 2.601 18 2.601 16 2.566 61 2.531 35 1.739 09 2.271 35 2.829 39 3.184 52 2.766 70 2.734 11 2.701 17 2.667 84 2.667 72 2.271 39 2.701 16 3.240 35 3.212 55 3.184 54 3.156 26 2.829 40 3.156 28 3.156 29 2.797 53 2.349 22 2.310 76 2.271 69 2.231 87 2.231 83 2.311 65 2.272 54 2.232 83 2.311 59 2.272 58 2.232 69 2.272 54 2.232 72 2.232 74 2.192 24 2.150 95 1.688 00 2.192 26 2.150 96 2.668 54 2.600 60 2.600 44 2.150 94 3.267 34 3.239 76 3.212 14 3.212 00 2.860 26 3.320 57 2.798 30 3.213 19 3.185 21 3.580 37 3.555 28 3.529 93 3.554 05 3.528 82 3.156 94 3.156 94
Table 1. (Continued)
no. (87-168) | group | example | log Kow | SE^{a,b} | log Ws | SE^{b,c}
aC-N(CO)H aC-CONH aC-NHCO aC-NCO NHCONH NH2CONH NH2CON NHCON NCON aC-NHCONH2 aC-NHCONH NHCO except as above CH2Cl CHCl CCl CHCl2 CCl2 CCl3 CH2F CHF CHF2 CF2 CF3 CCl2F HCClF CClF2 aC-Cl aC-F aC-I aC-Br I- except as above Br- except as above F- except as above Cl- except as above CHNOH CNOH aC-CHNOH OCH2CH2OH OCHCH2OH OCH2CHOH CH2SH CHSH CSH aC-SH -SH (except as above) CH3S CH2S CHS CS aC-S SO SO2 SO3 (sulfite) SO3 (sulfonate) SO4 (sulfate) aC-SO aC-SO2 PO3 (phosphite) PHO3 (phosphonate) PO3 (phosphonate) PHO4 (phosphate) PO4 (phosphate) aC-PO4 aC-P CO3 (carbonate) C2H3O C2O CH2 (cyc) CH (cyc) C (cyc) CH=CH (cyc) CH=C (cyc) C=C (cyc) CH2=C (cyc) NH (cyc) N (cyc) CH=N (cyc) C=N (cyc) O (cyc) CO (cyc) S (cyc) SO2 (cyc)
N-methyl-N-phenylmethanamide (1) N-methylbenzamide (1) N-(2-methylphenyl)acetamide (1) phenylmethylacetamide (1) N,N′-dimethylurea (1) methylurea (1) N,N-dimethylurea (1) trimethylurea (1) tetramethylurea (1) phenylurea (1) N,N′-diphenylurea N-chloroacetamide (1) 1-chlorobutane (1) 2-chloropropane (1) 2-chloro-2-methylpropane (1) 1,1-dichloroethane (1) 2,2-dichloropropane (1) 1,1,1-trichloroethane 1-fluorobutane (1) 2-fluorobutane (1) 1,1-difluoroethane (1) perfluorohexane (4) hexafluoroethane (2) tetrachloro-1,2-di.F.ethane (2) 1-Cl-1,2,2,2-tetrafluoroethane (1) 1,2-dichlorotetrafluoroethane (2) chlorobenzene (1) hexafluorobenzene (6) iodobenzene (1) bromobenzene (1) iodoethane (1) bromoethane (1) benzyl fluoride (1) ethyl chloroacetate (1) propionaldehyde oxime (1) diethyl ketoxime (1) phenyl oxime (1) 2-ethoxyethanol (1) 2-ethoxy-1-propanol (1) 1-methoxy-2-propanol (1) ethanethiol (1) 2-propanethiol (1) 2-methyl-2-propanethiol (1) benzenethiol (1) cyclohexanethiol (1) dimethyl sulfide (1) diethyl sulfide (1) diisopropyl sulfide (1) di-tert-butyl sulfide (1) phenyl methyl sulfide (1) dimethyl sulfoxide (1) dimethyl sulfone (1) dimethyl sulfite (1) dimethyl sulfonate (1) dimethyl sulfate (1) phenyl methyl sulfoxide (1) diphenyl sulfone (1) triethyl phosphite (1) dimethylphosphonate (1) trimethylphosphonate (1) diethyl phosphate (1) trimethyl phosphate (1) triphenyl phosphate (1) triphenylphosphine (1) diethyl carbonate (1) ethyl oxirane (1) trimethyl oxirane (1) cyclopentane (5) methylcyclopentane (1) 1,1-dimethylcyclohexane (1) cyclobutene (1) 1-methylcyclopentene (1) 1,2-dimethylcyclopentene (1) methylene cyclohexane (1) cyclopentimine (1) N-methylpyrrolidine (1) imidazole (1) 2-methyl-1H-imidazole (1) tetrahydropyran (1) cyclobutanone (1) 2-methylthiophene (1) cyclobutadiene sulfone (1)
-0.630 84 -0.609 90 -0.340 63 -0.980 99 -1.122 44 -1.275 46 -0.209 20 -0.638 20 -1.626 44 -0.580 55 -0.197 45 -1.115 98 0.516 13 0.845 87 0.753 63 0.914 76 0.993 98 1.829 67 -0.022 95 -0.085 13 1.449 61 0.671 10 0.874 69 1.599 79 0.683 61 1.105 07 0.896 34 0.360 42 0.820 10 1.019 63 0.692 62 0.399 37 -0.247 92 0.268 73 -0.025 78 -0.495 55 -0.018 55 -1.135 97 -0.665 51 -0.706 10 0.379 63 -0.087 48 -0.033 25 0.546 92 0.640 78 0.511 21 0.484 69 0.584 17 1.710 29 0.608 25 -0.940 94 -0.660 57 0.353 73 -0.960 34 -0.818 64 -0.794 49 -0.474 35 -1.756 64 -2.054 70 -0.781 62 -2.094 20 -1.785 77 -0.894 43 0.828 34 -0.969 86 -0.547 63 -0.201 87 0.173 89 0.327 05 0.305 81 0.371 03 0.558 72 0.733 41 -0.216 19 -0.450 91 -0.623 15 -0.191 66 0.032 64 -0.385 24 -0.353 43 0.377 46 -1.227 31
0.410 63 0.163 52 0.147 21 0.211 34 0.244 59 0.262 43 0.487 40 0.328 80 0.402 40 0.255 50 0.204 09 0.127 63 0.170 28 0.312 32 0.579 60 0.198 25 0.492 47 0.260 16 0.274 33 0.389 96 0.281 25 0.253 55 0.183 87 0.447 19 0.443 77 0.376 37 0.080 76 0.125 79 0.196 45 0.136 28 0.217 88 0.157 68 0.127 43 0.120 42 0.294 42 0.313 64 0.306 73 0.256 62 0.413 53 0.200 23 0.358 56 0.481 73 0.583 42 0.357 04 0.359 32 0.219 44 0.190 93 0.488 36 0.393 43 0.178 74 0.271 06 0.191 99 0.581 86 0.357 00 0.579 32 0.265 21 0.147 65 0.579 91 0.579 91 0.259 37 0.579 41 0.300 43 0.282 76 0.488 57 0.359 48 0.265 05 0.580 95 0.079 11 0.090 87 0.136 60 0.130 12 0.111 18 0.146 16 0.293 85 0.119 62 0.120 76 0.140 63 0.115 91 0.105 23 0.102 97 0.141 60 0.174 88
-21.057 13 -21.596 12 -20.466 46 -22.432 38 -22.333 73 -21.260 28 -20.104 75 -26.396 89 -27.798 91 -16.321 82 -19.582 79 -19.195 05 -19.666 88 -32.937 20 -34.850 37 -46.917 41 -12.336 36 -11.900 28 -20.438 12 -19.649 58 -27.581 64 -40.787 52 -26.712 01 -34.467 72 -19.261 90 -12.408 22 -54.387 25 -36.692 13 -49.757 26 -31.182 31 -7.325 22 -14.059 08 -16.641 97 -15.874 41 -22.335 94 -22.578 29 -22.135 91 -18.650 83 -17.687 75 -17.397 26 -12.757 72 -18.670 32 -18.745 61 -19.277 35 -17.689 79 -17.208 89 -24.891 05 -31.686 90 -31.277 66 -36.953 97 -29.520 62 -29.150 96 -34.766 78 -40.755 23 -22.754 73 -16.190 26 -15.669 66 -5.536 22 -5.309 56 -4.780 54 -10.347 65 -10.160 67 -10.336 87 -9.977 48 -5.648 25 -4.495 82 -10.755 71 -10.566 35 -6.005 72 -11.068 62 -12.497 82 -24.280 83
3.128 39 3.128 39 3.099 64 3.212 64 3.240 44 3.184 66 3.156 27 3.554 74 3.529 44 2.766 03 2.966 29 2.935 89 2.905 36 3.863 20 3.839 99 4.587 95 2.423 43 2.386 28 3.011 93 2.982 04 3.503 00 4.257 17 3.463 78 3.898 37 2.905 24 2.348 29 4.970 21 4.042 90 4.750 48 3.769 50 1.838 06 2.510 88 2.798 32 2.766 00 3.295 24 3.267 98 3.267 90 2.894 09 2.831 37 2.831 48 2.425 00 2.894 06 2.862 93 2.799 53 2.799 59 2.923 52 3.375 26 3.773 18 3.773 30 4.133 18 3.678 04 3.747 48 4.109 57 4.361 70 3.266 61 2.766 72 2.701 02 1.579 37 1.521 55 1.461 47 2.151 82 2.109 75 2.066 86 2.151 73 1.634 03 1.578 22 2.192 24 2.151 00 1.686 77 2.231 82 2.387 84 3.375 23
a SE = standard errors of log Kow first-order group-contribution values. b Standard errors calculated through eqs 9-11. c SE = standard errors of log Ws first-order group-contribution values.
In our GC methods, the estimation is performed at three levels. The basic level has a large set of simple groups that allow for the representation of a wide variety of organic compounds. However, these groups only partially capture proximity effects and are unable to distinguish among isomers. For this reason, the first level of estimation is intended to deal with simple and monofunctional compounds. The second level involves groups that permit a better description of proximity effects and differentiation among isomers. The second level of estimation is consequently intended to deal with polyfunctional, polar or nonpolar, compounds of medium size, C3-C10, and aromatic or cycloaliphatic compounds with only one ring and several substituents. The third level includes groups that provide more structural information about molecular fragments of compounds whose description is insufficient through the first- and second-order groups. The third level of estimation allows the properties of complex heterocyclic and large (C10-C70) polyfunctional acyclic compounds to be estimated.

Development of the New Methods

In the new methods, the molecular structure of a compound is considered to be a collection of three types of groups: first-order groups, second-order groups, and third-order groups. The description of a given compound through these groups is based on the following rules (a more detailed description is given by Marrero and Gani16):

1. In the first level, groups describing the entire molecule must be selected. For example, CH3COCH(CH3)CH2COCH(CH3)OCH3 is described in the following way: (1) CH3CO, (1) CH2CO, (1) CH3-O-, (2) CH3, and (2) CH. In the case of aromatic substituents, groups of type aC-R must be chosen. For example, methoxybenzene is described by (1) aC-O-, (5) aCH, and (1) CH3. The same molecular fragment cannot be represented by more than one group. For example, dimethylethylurea is represented by (1) NHCON, (3) CH3, and (1) CH2. The use of groups such as CH2NH or CH3N would be wrong because the nitrogen atoms would be included more than once.

2. In the second- and third-order groups, the entire molecule does not need to be described by groups, and more than one group can represent the same molecular portion. For example, nitrocyclohexane has only CHcyc-NO2 as a second-order group, and cyclohexylethylacrylate is represented by the second-order groups CHcyc-OOC, CHn=CHm-COO, and CH2-CHn=CHm. Contrary to the case of first-order groups, there can be molecules that do not need any second-order or third-order groups (e.g., methoxybenzene). There can also be compounds that do not need any second-order groups but do need third-order groups, such as diphenylamine, for which the third-order group aC-NH-aC is needed.

The model equations for the properties considered in this work (log Kow and log Ws) are the following
$$\log K_{ow} = 0.543 + \sum_i N_i \log K_{ow}(\mathrm{I})_i + w \sum_j M_j \log K_{ow}(\mathrm{II})_j + z \sum_k O_k \log K_{ow}(\mathrm{III})_k \tag{1}$$

$$\log W_s = 4.856 + 0.385 M_w + \sum_i N_i \log W_s(\mathrm{I})_i + w \sum_j M_j \log W_s(\mathrm{II})_j + z \sum_k O_k \log W_s(\mathrm{III})_k \tag{2}$$

where log Kow is the decimal logarithm of the n-octanol/water partition coefficient; log Ws is the decimal logarithm of the water solubility, given in milligrams per liter; and Mw is the molecular weight, given in grams per mole. The terms log Kow(I)i, log Kow(II)j, and log Kow(III)k are the contributions to log Kow of the first-order group of type i, the second-order group of type j, and the third-order group of type k, which occur Ni, Mj, and Ok times in the molecule, respectively; log Ws(I)i, log Ws(II)j, and log Ws(III)k are the corresponding contributions to log Ws.

In the first level of estimation, the constants w and z in eqs 1 and 2 are assigned values of 0 because only first-order groups are employed. In the second level, the constants w and z are assigned values of 1 and 0, respectively, because only first- and second-order groups are involved, and in the third level of estimation, both w and z are set to values of 1.

The determination of the adjustable parameters in the property-estimation equations, that is, the contribution values and universal constants in eqs 1 and 2, has been divided into a three-step regression procedure:

1. A first regression is carried out to determine the contributions of first-order groups and the universal constants in eqs 1 and 2 with w and z set to 0. The regression models for log Kow and log Ws in this first step are consequently the following
$$\log K_{ow} = \sum_i N_i \log K_{ow}(\mathrm{I})_i + K \tag{3}$$

$$\log W_s = \sum_i N_i \log W_s(\mathrm{I})_i + a M_w + W \tag{4}$$
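As an illustration of how the model is evaluated once the contributions are known, the following minimal sketch computes log Kow from eq 1 at the different estimation levels (level 1 corresponds to the form of eq 3). The function name `log_kow` and its argument layout are hypothetical and not part of the original method description. The contribution values are the leading entries of the log Kow columns of Tables 1 and 2 as they appear in this text (CH3, CH2, CH, and (CH3)2CH), applied to 2-methylpentane.

```python
K_CONST = 0.543  # universal constant of eq 1, as reported in this work


def log_kow(first, second=(), third=(), level=3):
    """Evaluate eq 1 at estimation level 1, 2, or 3.

    Each argument is an iterable of (occurrences, contribution) pairs
    for the first-, second-, and third-order groups of one compound.
    """
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2, or 3")
    w = 1 if level >= 2 else 0  # switch for second-order groups
    z = 1 if level == 3 else 0  # switch for third-order groups
    total = K_CONST + sum(n * c for n, c in first)
    total += w * sum(m * c for m, c in second)
    total += z * sum(o * c for o, c in third)
    return total


# 2-methylpentane: 3 CH3, 2 CH2, 1 CH (first order); 1 (CH3)2CH (second order).
# Values read from the log Kow columns of Tables 1 and 2 in this text.
first = [(3, 0.25778), (2, 0.45005), (1, 0.46531)]
second = [(1, 0.02607)]

level1 = log_kow(first, second, level=1)
level2 = log_kow(first, second, level=2)
print(f"level 1: {level1:.5f}, level 2: {level2:.5f}")
```

With these inputs the first-level estimate is about 2.68 and the second-level estimate about 2.71; because no third-order group occurs in this compound, the third level returns the second-level value unchanged.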
The left-hand sides of eqs 3 and 4 are experimental values of log Kow and log Ws. As a result of this first regression step, the universal constants K, a, and W are obtained (K = 0.543 ± 0.012, a = 0.385 ± 0.001, W = 4.856 ± 1.476) as well as the contribution values log Kow(I)i and log Ws(I)i of the first-order groups, which are listed in Table 1, together with their corresponding standard errors (SEs).

2. Then, w is set to 1, z is set to 0, and another regression is performed to calculate the contributions log Kow(II)j and log Ws(II)j of second-order groups, using the first-order contribution values [log Kow(I)i and log Ws(I)i] and the universal constants calculated in the previous step. The regression models for log Kow and log Ws in this second step are consequently the following
$$\log K_{ow} = 0.543 + \sum_i N_i \log K_{ow}(\mathrm{I})_i + \sum_j M_j \log K_{ow}(\mathrm{II})_j \tag{5}$$

$$\log W_s = 4.856 + 0.385 M_w + \sum_i N_i \log W_s(\mathrm{I})_i + \sum_j M_j \log W_s(\mathrm{II})_j \tag{6}$$

The left-hand sides are again experimental values of log Kow and log Ws. The contribution values log Kow(II)j and log Ws(II)j of the second-order groups are obtained through this second regression step and are listed in Table 2, together with their corresponding standard errors (SEs).

3. Finally, both w and z are set to unity, and the contributions log Kow(III)k and log Ws(III)k of the third-order groups are determined by using the following regression models with the universal constants and contribution values calculated in the preceding regression steps
$$\log K_{ow} = 0.543 + \sum_i N_i \log K_{ow}(\mathrm{I})_i + \sum_j M_j \log K_{ow}(\mathrm{II})_j + \sum_k O_k \log K_{ow}(\mathrm{III})_k \tag{7}$$

$$\log W_s = 4.856 + 0.385 M_w + \sum_i N_i \log W_s(\mathrm{I})_i + \sum_j M_j \log W_s(\mathrm{II})_j + \sum_k O_k \log W_s(\mathrm{III})_k \tag{8}$$

The left-hand sides are again experimental values of log Kow and log Ws. The contribution values log Kow(III)k and log Ws(III)k of the third-order groups are determined through this last regression step and are shown in Table 3, together with their corresponding standard errors (SEs).

This stepped regression scheme ensures the independence of contributions of first, second, and third order. In addition, the contributions of the higher levels act as corrections to the approximations of the lower levels. The regression steps were performed by using linear optimization, and the objective function was to minimize the sum of squares of the differences between the experimental and estimated values of the properties.

The experimental data used in the regression analysis were obtained from a comprehensive data bank of property values developed at CAPEC-DTU21 through a systematic search of different data sources. The log Kow data cover a total of 9560 compounds, and the Ws data (calculated as log Ws) cover a total of 2087 compounds. Property data values have been included in this collection only after a rigorous analysis of their reliability.

The statistical significance of group-contribution values, obtained from the multilevel regression scheme, has been evaluated by using the corresponding standard errors (SEs) listed in Tables 1-3. The standard errors (SEs) were calculated by using the following expressions
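$$\mathrm{SE} = \frac{S(\log K_{ow})}{S(\beta_j)} \tag{9}$$

$$S(\log K_{ow}) = \sqrt{\frac{\sum_{i=1}^{N} \left(Y_{\mathrm{exp},i} - Y_{\mathrm{est},i}\right)^2}{N - P}} \tag{10}$$

$$S(\beta_j) = \sqrt{\sum_{i=1}^{N} \left(\gamma_{i,j} - \gamma_{\mathrm{avg},j}\right)^2} \tag{11}$$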
where N is the number of compounds involved in a regression step, Yexp,i represents the experimental value of the property of the ith compound, Yest,i represents the estimated value of the property of the ith compound, P is the number of groups involved in a regression step, βj represents the contribution value of the jth group, γi,j gives the number of times the jth group occurs in the ith compound, and γavg,j is the average frequency of occurrence of the jth group in the N compounds.

From the SE values given in Tables 1-3, Student t-tests22 of statistical significance were performed for each of the regressed contribution values. In this test, a group-contribution value (βj) is accepted with 95% confidence if the quotient t = βj/SE is greater than 0.002 (for log Kow group-contribution values) or 1.964 (for log Ws group-contribution values). Therefore, the statistical significance of each of the group-contribution values reported in Tables 1-3 can be easily verified through this test. For most groups, the corresponding standard errors were very small, and therefore, the regressed contribution values passed the t-tests. In the few cases where a group-contribution value failed the statistical significance test, the standard errors were fairly small, and the contribution values were near 0. Thus, the uncertainty in the prediction for compounds containing these groups will also be reasonably small. The groups for which the statistical significance test failed correspond only to second- and third-order groups and are marked in bold in Tables 2 and 3.

Results and Discussion

Applications of the new method for the estimation of log Kow and log Ws are highlighted through two illustrative examples (given in the Appendix). The two examples highlight the representation of the molecular structures at different levels, as well as the calculation of the property values at different levels. For each property, the standard deviation, the average absolute error, and the correlation coefficient r2 for the first-, second-, and third-level approximations are given in Table 4. The number of experimental values used in the first regression step is also given.
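Returning to the significance test based on eqs 9-11, a minimal sketch of the calculation is given here. It assumes the reading of eqs 10 and 11 as square roots of the summed squares, and all function names and data are hypothetical, chosen only for illustration. The quotient is compared in absolute value, which is an assumption made so that negative contributions are handled; the text states the quotient t = βj/SE without it.

```python
import math


def residual_std(y_exp, y_est, n_groups):
    """Eq 10: residual standard deviation with N - P degrees of freedom."""
    n = len(y_exp)
    if n <= n_groups:
        raise ValueError("need more compounds (N) than groups (P)")
    ss = sum((e - c) ** 2 for e, c in zip(y_exp, y_est))
    return math.sqrt(ss / (n - n_groups))


def occurrence_spread(occurrences):
    """Eq 11: root of the summed squared deviations of the occurrences of group j."""
    avg = sum(occurrences) / len(occurrences)
    return math.sqrt(sum((g - avg) ** 2 for g in occurrences))


def standard_error(y_exp, y_est, n_groups, occurrences):
    """Eq 9: standard error of the contribution of group j."""
    return residual_std(y_exp, y_est, n_groups) / occurrence_spread(occurrences)


def is_significant(contribution, se, t_crit):
    """Compare |beta_j / SE| with a critical value (absolute value is an assumption)."""
    return abs(contribution / se) > t_crit


# Hypothetical data: five compounds and one group (P = 1).
y_exp = [1.0, 2.0, 3.0, 4.0, 5.0]  # synthetic "experimental" values
y_est = [1.1, 1.9, 3.2, 3.8, 5.0]  # synthetic estimated values
gamma_j = [0, 1, 2, 1, 1]          # occurrences of group j in each compound

se = standard_error(y_exp, y_est, n_groups=1, occurrences=gamma_j)
print(f"SE = {se:.4f}")
print(is_significant(-0.5, se, t_crit=1.964))
```

The critical value 1.964 used in the last line is the one quoted above for the log Ws contributions; any other threshold can be passed through `t_crit`.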
As successive (higher) estimation levels are applied, the standard deviation (STD) and average absolute error (AAE) decrease and the correlation coefficient (r2) increases, reflecting the improvement in the goodness of fit obtained from the higher-level regression steps. The statistics given in Table 4 for the second- and third-level approximations encompass all of the data points, even those corresponding to compounds in which no second- or third-order groups occur (and that consequently were not used in the second- and third-level regression steps). Therefore, the average deviations given for the third-level approximation characterize the global results of the three sequential approximations. A comparison of the average deviations obtained from the second- and third-level regression steps alone is given in Table 5, which also includes the actual number of data points used in each step, that is, the number of compounds in which second- and third-order groups occur. For each set of compounds, the average deviations corresponding to both the current and previous step are presented to illustrate the improvement in accuracy achieved in each step. The reliability of the estimation equations obtained from the regression steps has been tested for each
Table 2. Second-Order Groups and Their Contributions^{a} along with Sample Assignments
no. (1-50) | group | example | log Kow | SE^{b,c} | log Ws | SE^{c,d}
(CH3)2CH (CH3)3C CH(CH3)CH(CH3) CH(CH3)C(CH3)2 CHn=CHm-CHp=CHk (k, m, n, p = 0, 1, 2) CH3-CHm=CHn (m, n = 0, 1, 2) CH2-CHm=CHn (m, n = 0, 1, 2) CHp-CHm=CHn (m, n = 0, 1, 2; p = 0, 1) CHCHO or CCHO CH3COCH2 CH3COCH or CH3COC CHCOOH or CCOOH CH3COOCH or CH3COOC CO-O-CO CHOH COH NCCHOH or NCCOH OH-CHn-COO (n = 0, 1, 2) CHm(OH)CHn(OH) (m, n = 0, 1, 2) CHm(OH)CHn(NHp) (m, n, p = 0, 1, 2) CHm(NH2)CHn(NH2) (m, n = 0, 1, 2) CHm(NH)CHn(NH2) (m, n = 1, 2) H2NCOCHnCHmCONH2 (m, n = 1, 2) CHm(NHn)-COOH (m, n = 0, 1, 2) HOOC-CHn-COOH (n = 1, 2) HOOC-CHn-CHm-COOH (n, m = 1, 2) HO-CHn-COOH (n = 1, 2) CH3-O-CHn-COOH (n = 1, 2) HS-CH-COOH HS-CHn-CHm-COOH (n, m = 1, 2) NC-CHn-CHm-CN (n, m = 1, 2) OH-CHn-CHm-CN (n, m = 1, 2) COO-CHn-CHm-OOC (n, m = 1, 2) OOC-CHn-CHm-COO (n, m = 1, 2) NC-CHn-COO (n = 1, 2) COCHnCOO (n = 1, 2) CHm-O-CHn=CHp (m, n, p = 0, 1, 2, 3) CHm=CHn-F (m, n = 0, 1, 2) CHm=CHn-Br (m, n = 0, 1, 2) CHm=CHn-Cl (m, n = 0, 1, 2) CHm=CHn-CN (m, n = 0, 1, 2) CHn=CHm-COO-CHp (m, n, p = 0, 1, 2, 3) CHm=CHn-CHO (m, n = 0, 1, 2) CHm=CHn-COOH (m, n = 0, 1, 2) aC-CHn-X (n = 1, 2; X = halogen) aC-CHn-NHm (n = 1, 2; m = 0, 1, 2) aC-CHn-O (n = 1, 2) aC-CHn-OH (n = 1, 2) aC-CHn-CN (n = 1, 2) aC-CHn-CHO (n = 1, 2)
2-methylpentane (1) 2,2,4,4-tetramethylpentane (2) 2,3,4-trimethylpentane (2) 2,2,3,4,4-pentamethylpentane (2) 1,3-butadiene (1)
0.026 07 0.018 06 0.140 13 -0.175 44 0.089 58
0.002 88 0.005 08 0.038 25 0.026 88 0.016 78
-0.040 21 -0.349 44 -0.115 51 0.221 91 -0.201 33
0.016 01 0.156 32 0.035 06 0.077 03 0.035 09
0.073 81
0.006 26
-0.233 72
0.033 44
2-methyl-2-butene (3)
-0.032 17
0.003 70
0.000 75
0.002 36
0.076 22
0.017 97
0.299 39
0.041 89
2-methylbutyl aldehyde (1) 2-pentanone (1) 3-methyl-2-pentanone (1) 2-methyl butanoic acid (1) isopropyl acetate (1) propanoic anhydride (1) 2-butanol (1) 2-methyl-2-butanol (1) 2-hydroxypropionitrile (1) ethyl lactate (1)
-0.125 00 -0.155 97 -0.025 09 0.099 50 -0.028 52 -0.240 00 0.022 47 -0.154 78 -0.070 00 0.299 07
0.032 29 0.011 32 0.023 36 0.011 81 0.016 93 0.064 57 0.006 25 0.016 30 0.064 57 0.031 52
0.347 48 0.425 50 0.720 01 0.715 08 0.149 45 -1.124 02 0.555 13 0.738 42 -2.220 11
0.140 72 0.150 05 0.317 48 0.202 54 0.048 22 0.208 46 0.109 34 0.257 09 0.228 35
ethylene glycol (1)
-0.030 97
0.005 80
-0.340 51
0.023 77
2-amino-1-butanol (1)
0.012 39
0.003 83
0.129 35
0.032 00
ethylenediamine (1)
0.240 00
0.064 57
-1.345 99
0.572 25
-0.120 78
0.029 10
-
-
0.054 64
0.037 29
-
-
1,4-pentadiene (2) 3-methyl-1-butene (1)
diethylenetriamine (1) butanediamide (1)
-0.240 93
0.011 39
-0.061 24
0.013 83
malonic acid (1)
0.155 00
0.045 66
-0.159 23
0.154 85
succinic acid (1)
-0.098 69
0.028 23
-0.147 42
0.022 88
2-hydroxyisobutyric acid (1)
0.094 04
0.020 20
0.044 45
0.021 76
methoxyacetic acid (1)
0.097 45
0.065 15
-
-
2-mercaptopropionic acid (1) β-thiolactic acid (1)
0.050 00 -0.062 72
0.064 57 0.038 05
0.876 25 0.582 26
0.171 49 0.572 35
1,2-dicyanoethane (1)
-0.390 00
0.064 57
-0.485 56
0.193 45
3-hydroxypropanenitrile (1)
-0.260 00
0.064 57
0.201 48
0.047 33
ethylene glycol diacetate (1)
-0.126 14
0.027 69
0.041 98
0.572 30
dimethylsuccinate (1)
-0.147 50
0.032 29
-0.202 25
0.031 42
methylcyanoacetate (1)
-0.220 00
0.064 57
-
-
methylacetoacetate (1)
0.206 83
0.040 75
0.151 05
0.031 78
ethyl vinyl ether (1)
-0.207 01
0.026 50
-0.409 74
0.221 25
1-fluoro-1-propene (1)
-0.036 00
0.028 88
-0.118 92
0.024 82
1-bromo-1-propene (1)
0.042 27
0.016 02
-0.310 43
0.043 77
1-chloro-2-methyl propene (1)
0.125 98
0.008 43
-0.082 32
0.033 91
L-alanine
(1)
acrylonitrile (1)
-0.027 41
0.008 84
1.173 25
0.018 94
ethyl acrylate (1)
-0.051 54
0.004 70
0.415 69
1.067 79
propenaldehyde (1)
-0.159 64
0.026 37
0.457 84
0.108 10
acrylic acid (1)
0.085 90
0.013 84
-0.355 69
0.069 67
benzyl bromide (1)
0.008 30
0.003 60
-0.207 26
0.016 74
benzylamine (1)
-0.104 55
0.006 66
0.373 39
0.007 35
benzyl ethyl ether (1)
-0.130 08
0.006 70
-0.253 22
0.009 76
benzyl alcohol (1)
-0.074 59
0.004 04
0.274 24
0.004 99
benzyl cyanide (1)
-0.256 28
0.015 10
-0.239 00
0.037 70
0.220 00
0.064 57
phenyl acetaldehyde (1)
-
-
Ind. Eng. Chem. Res., Vol. 41, No. 25, 2002 6629 Table 2. (Continued) group 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
aC-CHn-COOH (n = 1, 2); aC-CHn-CO (n = 1, 2); aC-CHn-S (n = 1, 2); aC-CHn-NO2 (n = 1, 2); aC-CHn-CONH2 (n = 1, 2); aC-CHn-OOC (n = 1, 2); aC-CHn-COO (n = 1, 2); aC-SO2-OH; aC-CH(CH3)2; aC-C(CH3)3; aC-CF3; (CHn=C)cyc-CHO (n = 0, 1, 2); (CHn=C)cyc-COO-CHm (n, m = 0, 1, 2, 3); (CHn=C)cyc-CO (n = 0, 1, 2); (CHn=C)cyc-CH3 (n = 0, 1, 2); (CHn=C)cyc-CH2 (n = 0, 1, 2); (CHn=C)cyc-CN (n = 0, 1, 2); (CHn=C)cyc-Cl (n = 0, 1, 2); CHcyc-CH3; CHcyc-CH2; CHcyc-CH; CHcyc-C; CHcyc-CH=CHn (n = 1, 2); CHcyc-C=CHn (n = 1, 2); CHcyc-Cl; CHcyc-F; CHcyc-OH; CHcyc-NH2; CHcyc-NH-CHn (n = 0, 1, 2, 3); CHcyc-N-CHn (n = 0, 1, 2, 3); CHcyc-CN; CHcyc-COOH; CHcyc-CO; CHcyc-NO2; CHcyc-S; CHcyc-O; CHcyc-COO; CHcyc-OOC; Ccyc-CH3; Ccyc-CH2; Ccyc-OH; >Ncyc-CH3; >Ncyc-CH2; AROMRINGs1s2; AROMRINGs1s3; AROMRINGs1s4; AROMRINGs1s2s3; AROMRINGs1s2s4; AROMRINGs1s3s5; AROMRINGs1s2s3s4; AROMRINGs1s2s3s5; AROMRINGs1s2s4s5; PYRIDINEs2; PYRIDINEs3; PYRIDINEs4; PYRIDINEs2s3; PYRIDINEs2s4; PYRIDINEs2s5; PYRIDINEs2s6; PYRIDINEs3s4; PYRIDINEs3s5; PYRIDINEs2s3s6
example phenyl acetic acid (1)
log Kow
SEb,c
SEc,d
log Ws
0.052 63
0.005 77
0.145 58
0.006 05
phenyl acetone (1)
-0.014 85
0.008 55
-0.436 70
0.056 16
benzyl methyl sulfide (1)
-0.089 07
0.020 90
0.318 42
0.063 07
phenyl nitromethane (1)
0.066 11
0.045 67
-
-
-0.114 56
0.016 19
-
-
0.086 40
0.007 86
-0.357 28
0.014 09
methyl phenyl acetate (1)
-0.174 34
0.009 63
1.070 18
0.068 52
benzenesulfonic acid (1) cumene (1) tert-butylbenzene (1) perfluorotoluene (1) furfural (1)
-0.590 00 0.130 90 0.069 30 -0.012 89 -0.002 04
0.064 57 0.008 56 0.008 41 0.011 96 0.022 84
0.630 79 -0.507 38 -0.210 83 0.752 33 0.339 09
0.118 42 0.369 08 0.033 48 0.016 03 0.016 29
methyl furanyrate (1)
0.025 27
0.008 78
1.660 01
0.190 32
2-acetylfuran (1)
0.085 21
0.006 92
-0.012 01
0.057 18
1,2-dimethylcyclopentene (2)
-0.049 24
0.003 59
0.403 69
0.029 47
2-ethylfuran (1)
-0.073 77
0.005 06
0.475 69
0.027 17
3-cyanofuran (1)
0.025 32
0.010 02
0.260 95
0.037 78
2-chlorofuran (1)
0.113 50
0.005 43
-0.126 13
0.003 64
methylcyclopentane (1) ethylcyclohexane (1) isopropylcyclopentane (1) tert-butylcyclohexane (1) vinylcyclopentane (1)
-0.035 23 0.018 32 -0.001 59 0.098 50 0.245 13
0.004 73 0.004 22 0.010 21 0.019 72 0.016 78
-0.132 21 -0.240 73 0.090 01 -0.357 59 -0.237 34
0.001 76 0.003 75 0.017 05 0.020 79 0.019 79
limonene (1)
-0.110 26
0.046 28
-0.022 43
0.042 40
chlorocyclopentane (1) fluorocyclohexane (1) cyclohexanol (1) cyclohexylamine (1) N-methylcyclohexylamine (1)
-0.045 05 -0.092 41 0.059 22 -0.294 03 -0.002 94
0.002 04 0.009 73 0.002 85 0.013 94 0.009 33
0.028 89 -1.472 65 -0.178 38 1.554 87 1.053 17
0.011 74 0.001 66 0.024 64 0.015 32 5.723 63
dimethylcyclohexanamine (1)
0.052 21
0.018 81
0.422 50
0.099 88
-0.440 00 0.201 62 0.032 76 0.059 70 -0.148 07 0.002 36 -0.251 68 -0.116 73 0.007 35 0.049 00 0.000 79 0.015 57 -0.017 61 0.005 61 0.054 00 0.017 40 0.006 29 0.007 78 0.151 43 -0.058 50 -0.040 15 -0.036 76 -0.156 05 -0.139 29 -0.161 59 -0.068 78 0.091 00 -0.068 03 -0.129 50 -0.300 00 -0.065 00 -0.168 41
0.064 57 0.013 28 0.007 83 0.016 71 0.013 99 0.006 99 0.017 18 0.009 02 0.003 66 0.005 09 0.007 37 0.004 78 0.002 79 0.002 83 0.004 01 0.001 78 0.004 84 0.002 38 0.008 10 0.006 38 0.004 19 0.006 01 0.008 16 0.009 80 0.007 93 0.019 61 0.011 09 0.014 05 0.017 41 0.064 57 0.045 66 0.065 06
0.032 09 0.292 32 -0.076 61 0.393 81 0.207 81 0.074 61 0.392 20 -0.280 54 -0.212 09 0.311 75 0.091 52 -0.386 00 -0.164 59 0.081 39 0.124 87 -0.178 49 0.023 06 -0.086 88 -0.001 06 0.806 72 0.318 51 0.154 34 0.392 69 0.645 04 0.092 56 1.532 35 1.399 59
0.083 92 0.030 00 0.019 39 0.008 63 0.007 71 0.026 72 0.014 94 0.022 51 0.012 79 0.098 81 0.010 28 0.012 86 0.014 32 0.005 33 0.001 95 0.006 46 0.002 08 0.002 05 0.004 82 0.001 52 0.023 25 0.026 52 0.246 76 0.031 75 0.007 40 0.020 37 0.211 55
phenyl ethanamide (1) benzyl acetate (1)
cyanocyclopentane (1) cyclopropanecarboxylic acid (1) methyl cyclohexyl ketone (1) nitrocyclohexane (1) methyl cyclopentyl sulfide (1) methoxycyclohexane (1) ethyl cyclobutyrate (1) cyclohexyl acetate (1) 1,1-dimethylcyclohexane (2) 1-ethyl-1-methyl-cyclopentane (1) 1-methylcyclopentanol (1) N-methyl-2-pyrrolidone (1) N-ethylpyrrole (1) 2-methylphenol (1), 2-ethyltoluene (1) 3-methylphenol (1), 3-ethyltoluene (1) 4-methylphenol (1), 4-ethyltoluene (1) 1,2,3-trimethylbenzene (1) 1,2,4-trihydroxybenzene (1) 3,5-diethyltoluene (1) 3-ethyl-1,2,4-trimethylbenzene (1) 1,2,3,5-tetramethylbenzene (1) 1,2,4,5-tetramethylbenzene (1) 2-methylpyridine (1) 3-methylpyridine (1) 4-methylpyridine (1) 2,3-dimethylpyridine (1) 2,4-dimethylpyridine (1) 2,5-dimethylpyridine (1) 2,6-dimethylpyridine (1) 3,4-dimethylpyridine (1) 3,5-dimethylpyridine (1) 2,3,6-trimethylpyridine (1)
a Values in bold correspond to group-contribution values that failed the Student's t-test. b SE = standard errors of log Kow second-order group-contribution values. c Standard errors are calculated through eqs 9-11. d SE = standard errors of log Ws second-order group-contribution values.
Table 3. Third-Order Groups and Their Contributionsa along with Sample Assignments log Kow
SEb,c
1,5-pentanedioic acid (1)
-0.425 00
0.016 47
0.464 81
0.042 16
4-aminobutanol (1)
-0.050 00
0.032 94
-
-
1,5-diaminopentane (1)
0.400 00
0.032 94
-
-
indene (1), acenaphthylene (2)
0.004 40
0.002 81
0.613 27
0.048 68
biphenylene (2), biphenyl (1) cyclohexylbenzene (1)
0.036 67 -0.048 84
0.001 73 0.002 31
-0.271 03 0.356 01
0.084 32 0.059 63
tetralin (2), indane (2) bibenzyl (1)
-0.030 08 0.130 00
0.001 55 0.023 29
0.470 61 -1.172 27
0.017 57 0.007 23
0.256 63
0.009 16
-0.092 49
0.013 12
cyclohexyl cyclohexane (1) hexahydroindan (2), decalin (2) spiropentane (1) diphenylmethane (1)
0.027 08 -0.005 70 0.013 67 -0.028 00
0.011 80 0.001 10 0.001 45 0.002 22
0.031 00 0.059 88 -0.050 54 0.333 90
0.007 86 0.084 32 0.049 10 0.027 08
1,2-diphenylethylene (1)
-0.156 67
0.019 02
-0.617 98
0.004 61
difuranyl methane (1)
0.203 29
0.019 13
-
-
benzophenone (1) benzyl phenone (1)
0.025 00 0.055 00
0.007 02 0.023 29
0.700 24 0.247 00
0.006 46 0.011 00
phenyl-2-furanyl-methanone (1)
-0.098 75
0.011 65
-0.312 70
0.059 63
phenolphthalein (1) 1,4-diphenyl-1,4-butanedione (1)
-0.007 50 -0.061 59
0.001 61 0.008 27
0.027 07 0.287 36
0.037 93 0.084 32
cyclohexyl phenyl methanone (1)
0.608 29
0.033 02
-1.485 82
0.034 43
N-phenyl benzamide (1)
0.090 95
0.006 12
-1.011 01
0.006 41
N,N′-diphenylurea (1)
0.235 00
0.016 47
0.158 97
0.028 54
N-phenonyl piperidine (1) dibenzothiophene (2) diphenyl sulfide (1) diphenyl sulfone (1)
0.474 08 0.026 38 -0.060 00 -0.253 12
0.010 99 0.002 58 0.014 73 0.006 12
-0.782 55 -0.019 49 0.175 95
0.059 91 0.029 87 0.059 63
carbazole (2)
-0.013 74
0.001 73
0.092 39
0.086 91
diphenylamine (1) phenyl-3-pyrazole (1) benzoxazole (1)
0.123 31 -0.021 89 0.023 15
0.006 12 0.002 76 0.002 20
0.150 97 1.495 06 0.101 46
0.013 44 0.042 16 0.010 29
benzoisoxazole (1)
-0.177 45
0.005 28
-0.648 33
0.025 47
benzyl phenyl ether (1)
-0.003 04
0.005 30
-1.405 10
0.026 29
0.041 50
0.002 57
-0.151 24
0.016 69
benzyl ether (1)
-0.260 00
0.032 94
0.483 54
0.024 03
benzoxazole (1) naphthalene (2) 1-methylnaphthalene (1) 2,7-dimethylnaphthalene (2) 2,3-dimethylnaphthalene (1) 1,4-dimethylnaphthalene (1) 1,2-dimethylnaphthalene (1) 1,3-dimethylnaphthalene (1) phenalene (3), pyrene (2) anthracene (1) 9-methylanthracene (1) 9,10-dimethylanthracene (1) phenanthrene (1), pyrene (2) quinoline (1) isoquinoline (1) acridine (1)
0.037 73 0.022 94 -0.022 92 -0.012 88 0.058 84 0.033 77 0.044 81 -0.035 72 0.025 72 0.078 82 -0.052 67 -0.053 95 0.058 47 -0.126 59 -0.088 60 -0.271 49
0.001 66 0.001 51 0.002 64 0.001 59 0.003 98 0.006 45 0.007 49 0.004 24 0.004 06 0.006 81 0.011 26 0.023 88 0.003 16 0.003 87 0.008 90 0.010 26
-0.243 30 -0.080 61 -0.194 82 -0.167 46 -0.043 70 -0.260 29 0.547 12 -0.111 65 -0.138 80 -0.653 27 0.032 50 0.028 51 0.074 96 0.331 56 1.089 46 1.421 32
0.084 32 0.009 66 0.084 32 0.005 19 0.006 56 0.009 90 0.010 52 0.017 11 0.028 85 0.038 40 0.022 92 0.012 69 0.022 69 0.043 99 0.060 94 0.008 66
group 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
HOOC-(CHn)m-COOH (m > 2; n = 0, 1, 2); NH2-(CHn)m-OH (m > 2; n = 0, 1, 2); NH2-(CHn)m-NH2 (m > 2; n = 0, 1, 2); aC-(CHn=CHm)cyc (fused rings) (n, m = 0, 1); aC-aC (different rings); aC-CHncyc (different rings) (n = 0, 1); aC-CHncyc (fused rings) (n = 0, 1); aC-(CHn)m-aC (different rings) (m > 1; n = 0, 1, 2); aC-(CHn)m-CHcyc (different rings) (m > 0; n = 0, 1, 2); CHcyc-CHcyc (different rings); CH multiring; C multiring; aC-CHm-aC (different rings) (m = 0, 1, 2); aC-(CHm=CHn)-aC (different rings) (m, n = 0, 1, 2); (CHm=C)cyc-CHp-(C=CHn)cyc (different rings); aC-CO-aC (different rings); aC-CHm-CO-aC (different rings) (m = 0, 1, 2); aC-CO-(C=CHn)cyc (different rings) (n = 0, 1); aC-COcyc (fused rings); aC-CO-(CHn)m-CO-aC (different rings) (m > 0; n = 0, 1, 2); aC-CO-CHncyc (different rings) (n = 0, 1); aC-CO-NHn-aC (different rings) (n = 0, 1); aC-NHnCONHm-aC (different rings) (n, m = 0, 1); aC-CO-Ncyc (different rings); aC-Scyc (fused rings); aC-S-aC (different rings); aC-SOn-aC (different rings) (n = 1, 2, 3, 4); aC-NHncyc (fused rings) (n = 0, 1); aC-NH-aC (different rings); aC-(C=N)cyc (different rings); aC-(N=CHn)cyc (fused rings) (n = 0, 1); aC-(CHn=N)cyc (fused rings) (n = 0, 1); aC-O-CHn-aC (different rings) (n = 0, 1, 2); aC-O-aC (different rings); aC-CHn-O-CHm-aC (different rings) (n, m = 0, 1, 2); aC-Ocyc (fused rings); AROM.FUSED[2]; AROM.FUSED[2]s1; AROM.FUSED[2]s2; AROM.FUSED[2]s2s3; AROM.FUSED[2]s1s4; AROM.FUSED[2]s1s2; AROM.FUSED[2]s1s3; AROM.FUSED[3]; AROM.FUSED[4a]; AROM.FUSED[4a]s1; AROM.FUSED[4a]s1s4; AROM.FUSED[4p]; PYRIDINE.FUSED[2]; PYRIDINE.FUSED[2-iso]; PYRIDINE.FUSED[4]
example
1-cyclopentyl-3-phenylpropane (1)
diphenyl ether (1)
SEc,d
log Ws
a Values in bold correspond to group-contribution values that failed the Student's t-test. b SE = standard errors of log Kow third-order group-contribution values. c Standard errors are calculated through eqs 9-11. d SE = standard errors of log Ws third-order group-contribution values.
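Table 4. Global Comparison of Consecutive First-, Second-, and Third-Order Approximations^a

| property | data points | STD first | STD second | STD third | AAE first | AAE second | AAE third | r2 first | r2 second | r2 third |
|---|---|---|---|---|---|---|---|---|---|---|
| log Kow | 9560 | 0.42 | 0.38 | 0.34 | 0.35 | 0.27 | 0.24 | 0.95 | 0.96 | 0.97 |
| log Ws^b | 2087 | 0.65 | 0.60 | 0.55 | 0.53 | 0.48 | 0.46 | 0.90 | 0.92 | 0.93 |

a The statistics are defined as

$$\mathrm{STD} = \sqrt{\frac{\sum (X_{\mathrm{est}} - X_{\mathrm{exp}})^2}{N}}, \qquad \mathrm{AAE} = \frac{1}{N}\sum \left| X_{\mathrm{est}} - X_{\mathrm{exp}} \right|, \qquad r^2 = \frac{\sum (X_{\mathrm{est}} - \bar{X}_{\mathrm{exp}})^2}{\sum (X_{\mathrm{exp}} - \bar{X}_{\mathrm{exp}})^2}$$

where N is the number of data points, Xest is the estimated value of the property, Xexp is the corresponding experimental value, and \bar{X}exp is the mean of the experimental values. b log Ws is given in logarithmic units of milligrams per liter (STD and AAE corresponding to log Ws are given in the same units).

Table 5. Comparison of Average Deviations^a for First-, Second-, and Third-Order Approximations

| property | first level: data points | first level: first-order deviation | second level: data points | second level: first-order deviation | second level: second-order deviation | third level: data points | third level: second-order deviation | third level: third-order deviation |
|---|---|---|---|---|---|---|---|---|
| log Kow | 9560 | 0.35 | 6158 | 0.36 | 0.24 | 3082 | 0.30 | 0.23 |
| log Ws | 2087 | 0.53 | 1412 | 0.55 | 0.45 | 616 | 0.50 | 0.45 |

a Deviations are expressed as logarithmic units of average absolute errors (logarithmic units of milligrams per liter for log Ws).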
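The statistics defined in footnote a of Table 4 can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names are hypothetical, and the r2 function assumes that both sums in the r2 expression are taken about the mean of the experimental values. The sample numbers are the third-order log Kow estimates and the experimental values of the two appendix compounds (Tables 8 and 9).

```python
import math


def std(est, exp):
    """Root-mean-square deviation: sqrt(sum((Xest - Xexp)^2) / N)."""
    return math.sqrt(sum((e - x) ** 2 for e, x in zip(est, exp)) / len(exp))


def aae(est, exp):
    """Average absolute error: sum(|Xest - Xexp|) / N."""
    return sum(abs(e - x) for e, x in zip(est, exp)) / len(exp)


def r2(est, exp):
    """Variation of the estimates relative to that of the experiments.

    Assumption: both sums are taken about the experimental mean.
    """
    mean_exp = sum(exp) / len(exp)
    explained = sum((e - mean_exp) ** 2 for e in est)
    total = sum((x - mean_exp) ** 2 for x in exp)
    return explained / total


# Illustration only: two data points are far too few for meaningful statistics.
estimated = [2.41, 6.78]      # third-order log Kow estimates (Tables 8 and 9)
experimental = [2.53, 6.79]   # experimental log Kow values (Tables 8 and 9)

print(f"STD = {std(estimated, experimental):.3f}")
print(f"AAE = {aae(estimated, experimental):.3f}")
```

The residual J of eq 12 has the same root-mean-square form as STD, so `std(full_fit, partial_fit)` would evaluate it if the two argument lists held the full-regression and partial-regression estimates of the excluded compounds.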
Table 6. Comparison of Accuracy^a between the Meylan-Howard18,19 Method and the New GC Method

| property | data points | STD (MH^b) | STD (new^c) | AAE (MH^b) | AAE (new^c) | r2 (MH^b) | r2 (new^c) |
|---|---|---|---|---|---|---|---|
| log Kow | 11 085 | 0.44 | 0.43 | 0.30 | 0.28 | 0.95 | 0.95 |
| log Ws | 2325 | 0.61 | 0.65 | 0.46 | 0.50 | 0.92 | 0.88 |

a See expressions for STD, AAE, and r2 in footnote a of Table 4. b MH = Meylan and Howard18 and Meylan et al.19 method.
c new = new GC method.

property by performing a least-squares analysis in which a randomly selected subset of the N experimental data points was excluded from the full data set and for which the group parameters were regressed separately. Then, the mean-square residual J, defined by

$$J = \sqrt{\frac{\sum_i (X_i - Y_i)^2}{N}} \qquad (12)$$

was calculated. In eq 12, N is the number of data points excluded from the full data set, Xi is the property value of the compound i estimated by the full regression, and Yi is the property value of the same compound estimated by the partial regression. In the case of log Kow, 1915 data points were used in the partial regression, that is, N = 1915, whereas for log Ws, 420 data points (N = 420) were used in the partial regression. For both properties, the residuals are smaller than the estimation errors reported in Table 4, confirming the reliability of the method.

To illustrate the ability of the new group-contribution (GC) method to estimate accurate values of log Kow, a comparison with the results obtained from the GC method reported by Meylan and Howard18 was carried out. This method was chosen because it exhibits a range of applicability similar to that of the new method presented in this work and also because it can be easily evaluated by using version 1.66 of the program KowWin, which can be freely downloaded from the official web site of the U.S. Environmental Protection Agency.23 The comparison was performed using a data set including log Kow values of 11 085 compounds, and the average results are shown in Table 6. The estimation method for log Kow presented in this work exhibits clearly better results. The data set used in this comparison included 2245 compounds that were not used in the full regression step for the estimation of the group-contribution parameters of the new GC method. For these compounds, the new GC method presents STD, AAE, and r2 values of 0.44, 0.31, and 0.93, respectively, which are comparable to the statistics given in Table 4 for the data set of the original 9560 compounds employed to calculate the group-contribution values for log Kow. This illustrates the predictive capability of the new GC method.

With respect to log Ws, a comparison was also performed using a data set with experimental values of 2325 compounds, and the average results are shown in Table 6. The data set used in this comparison included 238 compounds that were not used in the full regression step for the estimation of the group-contribution values. For these compounds, the new GC method presents STD, AAE, and r2 values of 0.67, 0.52, and 0.85, respectively, which are satisfactory in comparison to the statistics given in Table 4 for the data set of the original 2087 compounds employed to calculate the group-contribution values for log Ws. This is also an example of the predictive capability of the new GC method for log Ws.

The new GC method for log Ws was not compared with other group-contribution-based methods because no method with a similar range of applicability was found in the literature. Instead, the new GC method was compared with the method of Meylan et al.19 This method employs a correlation function that relates log Kow and Tm with log Ws. The results of the comparison are also listed in Table 6, and it can be seen that the two methods present similar levels of performance even though the method of Meylan et al.19 uses experimental
Table 7. Estimation of log Ws by the Meylan-Howard19 Method Using Estimated Values of Tm and log Kow

| Tm^a-c | log Kow^a,d,e | STD^f | AAE^f |
|---|---|---|---|
| exp | exp | 0.40 | 0.30 |
| exp | MH | 0.47 | 0.35 |
| exp | MG | 0.44 | 0.33 |
| JR | exp | 0.63 | 0.50 |
| JR | MH | 0.66 | 0.54 |
| JR | MG | 0.66 | 0.54 |
| MGTm | exp | 0.60 | 0.47 |
| MGTm | MH | 0.63 | 0.51 |
| MGTm | MG | 0.61 | 0.50 |

a exp = experimental values. b JR = values of Tm estimated using the Joback and Reid24 method. c MGTm = values of Tm estimated using the Marrero and Gani16 method. d MH = values of log Kow estimated using the Meylan and Howard18 method. e MG = values of log Kow estimated using the method presented in this work. f See expressions for STD and AAE in footnote a of Table 4.
values of Tm and log Kow as input data whereas the new GC method requires only molecular structural information as input data.

As mentioned above, one influential factor to take into account when the correlation function of Meylan et al.19 is employed for estimating log Ws is that it requires values of Tm and log Kow as input data. Because experimental data for these properties might not always be available, it is important to analyze the performance of this correlation function when estimated values of Tm and log Kow are used as input data instead of experimental values. A comparison of estimated log Ws values involving 980 compounds (mostly heterocyclic compounds that are in the solid state at ambient temperature) with various combinations of experimental and estimated values for Tm and log Kow is illustrated through the data listed in Table 7. It is evident that the correlation equation is very sensitive to the influence of Tm values and, to a lesser extent, to the influence of log Kow values. However, if the use of estimated values were the only choice in a practical application, it would be advisable to employ at least the Tm values estimated from the Marrero and Gani16 method because it appears to give acceptable results. It should be noted, however, that other more accurate predictive methods for Tm, if available, could also be used to obtain similar or even better results.

To facilitate the application of the presented group-contribution methods, a computer program was developed. By using this program, it is possible to draw a molecular structure and automatically obtain the estimated values of log Kow and log Ws as well as estimated values of other important properties whose estimation methods are described elsewhere.16 Also, the group assignments corresponding to the first-, second-, and third-level estimations are automatically generated by this program, together with atomic descriptions of the molecule for direct transfer of data to higher-level calculations, such as molecular modeling.

Conclusion

New group-contribution methods for the estimation of the octanol/water partition coefficient and aqueous solubility have been developed using three different sets of functional groups, one for a first-order approximation and two successive higher-order approximations for refining the estimations for complex, large, and heterocyclic compounds. Compared to other currently used estimation methods, the proposed method exhibits a better accuracy for a wider range of compounds of interest to chemical, biochemical, pharmaceutical, agrochemical, and environmental applications. A computer program has also been developed for automatic group representation and estimation of the considered properties as well as other properties of interest in chemical, biochemical, agrochemical, pharmaceutical, and environmental studies.

Appendix

To illustrate the application of the proposed method, the estimation of the octanol/water partition coefficient

Table 8. Example 1. Estimations^a of log Kow and log Ws of Triamcinolone Acetonide
First-Order Estimations

| first-order groups | occurrences | log Kow contribution | log Ws contribution |
|---|---|---|---|
| CH3 | 4 | 0.257 78 | -5.944 17 |
| OH | 2 | -1.096 58 | -5.801 15 |
| CH2CO | 1 | -0.174 07 | -17.117 42 |
| -F | 1 | -0.247 92 | -7.325 22 |
| CH2 (cyc) | 4 | 0.173 89 | -5.536 22 |
| CH (cyc) | 4 | 0.327 05 | -5.309 56 |
| C (cyc) | 5 | 0.305 81 | -4.780 54 |
| CH=CH (cyc) | 1 | 0.371 03 | -10.347 65 |
| CH=C (cyc) | 1 | 0.558 72 | -10.160 67 |
| O (cyc) | 2 | -0.385 24 | -6.005 72 |
| CO (cyc) | 1 | -0.353 43 | -11.068 62 |
| ΣiNiY(I)i | | 1.754 62 | -170.696 |

log Kow = 1.754 62 + 0.543 = 2.30 (first-order approximation, error = 0.23)
log Ws = -170.696 + 4.856 + 0.385Mw = 1.45 (first-order approximation, error = 0.13)

Second-Order Estimations

| second-order groups | occurrences | log Kow contribution | log Ws contribution |
|---|---|---|---|
| CHcyc-OH | 1 | 0.059 22 | -0.178 38 |
| Ccyc-CH3 | 2 | 0.014 69 | 0.149 21 |
| ΣiNiY(II)j | | 0.073 92 | -0.029 16 |

log Kow = 1.754 62 + 0.073 92 + 0.543 = 2.37 (second-order approximation, error = 0.16)
log Ws = -170.696 - 0.029 16 + 4.856 + 0.385Mw = 1.42 (second-order approximation, error = 0.10)

Third-Order Estimations

| third-order groups | occurrences | log Kow contribution | log Ws contribution |
|---|---|---|---|
| CH multiring | 3 | -0.017 11 | 0.179 64 |
| C multiring | 4 | 0.054 68 | -0.202 15 |
| ΣiNiY(III)k | | 0.037 57 | -0.022 51 |

log Kow = 1.754 62 + 0.073 92 + 0.037 57 + 0.543 = 2.41 (third-order approximation, error = 0.12)
log Ws = -170.696 - 0.029 16 - 0.022 51 + 4.856 + 0.385Mw = 1.39 (third-order approximation, error = 0.07)

Estimations Using Meylan and Howard Method18,19

log Kow = 2.69 (error = 0.16)

a Experimental values: log Kow = 2.53, log Ws (mg/L) = 1.32 (Mw = 434.51 g/mol).
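The arithmetic of Example 1 can be retraced with a short Python sketch. It is an illustration under stated assumptions rather than the authors' program: the constants 0.543, 4.856, and 0.385 are read directly from the worked equations of Table 8, the function and variable names are hypothetical, and the second- and third-order terms are taken as the level sums reported in the table because the rows listed there already include the occurrence counts.

```python
# Constants as they appear in the worked equations of Table 8.
KOW_CONST = 0.543
WS_CONST = 4.856
WS_MW_COEFF = 0.385
MW = 434.51  # g/mol, triamcinolone acetonide (Table 8, footnote a)

# First-order groups: (occurrences, log Kow contribution, log Ws contribution).
FIRST_ORDER = {
    "CH3": (4, 0.25778, -5.94417),
    "OH": (2, -1.09658, -5.80115),
    "CH2CO": (1, -0.17407, -17.11742),
    "-F": (1, -0.24792, -7.32522),
    "CH2 (cyc)": (4, 0.17389, -5.53622),
    "CH (cyc)": (4, 0.32705, -5.30956),
    "C (cyc)": (5, 0.30581, -4.78054),
    "CH=CH (cyc)": (1, 0.37103, -10.34765),
    "CH=C (cyc)": (1, 0.55872, -10.16067),
    "O (cyc)": (2, -0.38524, -6.00572),
    "CO (cyc)": (1, -0.35343, -11.06862),
}

# Reported sums of the second- and third-order contributions (Table 8).
HIGHER_ORDER_KOW = [0.07392, 0.03757]
HIGHER_ORDER_WS = [-0.02916, -0.02251]


def estimate(level):
    """Return (log Kow, log Ws) for approximation level 1, 2, or 3."""
    kow = sum(n * c for n, c, _ in FIRST_ORDER.values())
    ws = sum(n * c for n, _, c in FIRST_ORDER.values())
    kow += sum(HIGHER_ORDER_KOW[: level - 1])
    ws += sum(HIGHER_ORDER_WS[: level - 1])
    return kow + KOW_CONST, ws + WS_CONST + WS_MW_COEFF * MW


for level in (1, 2, 3):
    log_kow, log_ws = estimate(level)
    print(f"level {level}: log Kow = {log_kow:.2f}, log Ws = {log_ws:.2f}")
```

With these inputs the sketch gives log Kow values of 2.30, 2.37, and 2.41 and log Ws values of 1.45, 1.42, and 1.39 for the three levels, in agreement with Table 8.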
and aqueous solubility is provided using two example compounds, and the results are presented in Tables 8 and 9. The experimental data and estimations through Meylan and Howard methods18,19 are also provided.
Table 9. Example 2. Estimations^a of log Kow and log Ws of 2,2′,3,5,5′-Pentachlorobiphenyl

First-Order Estimations

| first-order groups | occurrences | log Kow contribution | log Ws contribution |
|---|---|---|---|
| aCH | 5 | 1.084 70 | -25.952 00 |
| aC | 2 | 0.663 04 | -10.396 10 |
| aC-Cl | 5 | 4.481 70 | -96.309 50 |
| ΣiNiY(I)i | | 6.229 44 | -132.658 00 |

log Kow = 6.229 44 + 0.543 = 6.77 (first-order approximation, error = 0.02)
log Ws = -132.658 + 4.856 + 0.385Mw = -2.12 (first-order approximation, error = 0.19)

Second-Order Estimations

| second-order groups | occurrences | log Kow contribution | log Ws contribution |
|---|---|---|---|
| AROMRINGs1s2s4 | 1 | 0.007 78 | 0.124 87 |
| AROMRINGs1s2s3s5 | 1 | -0.040 15 | -0.086 88 |
| ΣiNiY(II)j | | -0.032 38 | 0.037 99 |

log Kow = 6.229 44 - 0.032 38 + 0.543 = 6.74 (second-order approximation, error = 0.05)
log Ws = -132.658 + 0.037 99 + 4.856 + 0.385Mw = -2.08 (second-order approximation, error = 0.21)

Third-Order Estimations

| third-order groups | occurrences | log Kow contribution | log Ws contribution |
|---|---|---|---|
| aC-aC (different rings) | 1 | 0.036 67 | -0.271 03 |
| ΣiNiY(III)k | | 0.036 67 | -0.271 03 |

log Kow = 6.229 44 - 0.032 38 + 0.036 67 + 0.543 = 6.78 (third-order approximation, error = 0.01)
log Ws = -132.658 + 0.037 99 - 0.271 03 + 4.856 + 0.385Mw = -2.35 (third-order approximation, error = 0.04)

Estimations Using Meylan and Howard Method18,19

log Kow = 6.98 (error = 0.19)
log Ws (mg/L) = -1.86 (error = 0.44)

a Experimental values: log Kow = 6.79, log Ws (mg/L) = -2.31 (Mw = 326.44 g/mol).
Literature Cited

(1) Hansch, C.; Leo, A.; Hoekman, D. Fundamentals and Applications in Chemistry and Biology. In Exploring QSAR; American Chemical Society: Washington, DC, 1995.
(2) Lipinski, C.; Lombardo, F.; Dominy, B. W.; Feeney, P. Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings. Adv. Drug Delivery Rev. 1997, 23, 3.
(3) McFarland, J. W. Quantitative structure-activity relationships among macrolide antibacterial agents: in vitro and in vivo potency against Pasteurella multocida. J. Med. Chem. 1997, 40, 1340.
(4) Briggs, G. Theoretical and experimental relationships between soil adsorption, octanol-water partition coefficients, water solubilities, bioconcentration factors, and the parachor. J. Agric. Food Chem. 1981, 29, 1050.
(5) Borman, S. New QSAR Techniques Eyed for Environmental Assessments. Chem. Eng. News 1990, 68, 20.
(6) De Bruijn, J.; Hermans, J. Uptake and elimination kinetics of organophosphorus pesticides in the guppy (Poecilia reticulata): Correlations with the octanol/water partition coefficient. Environ. Toxicol. Chem. 1991, 10, 791.
(7) LaHann, T. R.; DeKrey, L. J.; Tarr, B. D. Capsaicin analgesia: Predictions based on physicochemical parameters. Proc. West. Pharmacol. Soc. 1989, 32, 201.
(8) Baker, E. A.; Hayes, A. L.; Butler, R. C. Physicochemical properties of agrochemicals: Their effect on foliar penetration. Pestic. Sci. 1992, 34, 167.
(9) Ertepinar, H.; Gok, Y.; Geban, O.; Ozden, S. A QSAR study of the biological activities of some benzimidazoles and imidazopyridines against Bacillus subtilis. Eur. J. Med. Chem. 1995, 30, 171.
(10) Murray, J. S.; Lane, P.; Brinck, T.; Paulsen, K.; Grice, M. E.; Politzer, P. Relationships of Critical Constants and Boiling Points to Computed Molecular Surface Properties. J. Phys. Chem. 1993, 97, 9369.
(11) Clark, T.; Breindl, A.; Rauhut, G. A Combined Semiempirical MO/Neural Net Technique for Estimating 13C Chemical Shifts. J. Mol. Model. 1995, 1, 22.
(12) Breindl, A.; Beck, B.; Clark, T.; Glen, R. C. Prediction of the n-Octanol/Water Partition Coefficient, logP, Using a Combination of Semiempirical MO-Calculations and a Neural Network. J. Mol. Model. 1997, 3, 142.
(13) Parham, M.; Hall, L. Accurate Prediction of n-Octanol/Water Partition Coefficient Using Neural Network Algorithms and E-State Atom Indices. Presented at the IBC Drug Discovery Conference, Boston, MA, Aug 16-19, 1999; Paper B23.
(14) Huuskonen, J. Estimation of Aqueous Solubility for a Diverse Set of Organic Compounds Based on Molecular Topology. J. Chem. Inf. Comput. Sci. 2000, 40, 773.
(15) Beck, B.; Breindl, A.; Clark, T. QM/NN QSPR Models with Error Estimation. J. Chem. Inf. Comput. Sci. 2000, 40, 1046.
(16) Marrero, J.; Gani, R. Group-Contribution Based Estimation of Pure Component Properties. Fluid Phase Equilib. 2001, 183, 183.
(17) Hansch, C.; Leo, A.; Elkins, D. Partition Coefficients and Their Uses. Chem. Rev. 1971, 71, 524.
(18) Meylan, W. M.; Howard, P. H. Atom/fragment contribution method for estimating octanol-water partition coefficients. J. Pharm. Sci. 1995, 84, 83.
(19) Meylan, W. M.; Howard, P. H.; Boethling, R. S. Improved method for estimating water solubility from octanol/water partition coefficient. Environ. Toxicol. Chem. 1996, 15, 100.
(20) Ran, Y.; Yalkowsky, S. H. Prediction of Drug Solubility by General Solubility Equation (GSE). J. Chem. Inf. Comput. Sci. 2001, 41, 354.
(21) Nielsen, T. L.; Abildskov, J.; Harper, P. M.; Papaeconomou, I.; Gani, R. The CAPEC Database. J. Chem. Eng. Data 2001, 46, 1041.
(22) Chatterjee, S.; Hadi, A.; Price, B. Regression Analysis by Example; Wiley: New York, 2000.
(23) U.S. Environmental Protection Agency: www.epa.gov/oppt/exposure/docs/episuite.htm.
(24) Joback, K. G.; Reid, R. C. Estimation of Pure-Component Properties from Group Contributions. Chem. Eng. Commun. 1987, 57, 233.

Received for review July 17, 2002
Revised manuscript received October 16, 2002
Accepted October 18, 2002
IE0205290