Anal. Chem. 1992, 64, 1350-1355
Computer-Assisted Study of the Relationship between Molecular Structure and Henry's Law Constant
Charles J. Russell,† Steven L. Dixon, and Peter C. Jurs*
Department of Chemistry, 152 Davey Laboratory, The Pennsylvania State University, University Park, Pennsylvania 16802
Computer-assisted methods have been used to develop a predictive model correlating 63 molecular structures with the log of Henry's law constant. A five-variable model using structural descriptors is developed with an R value of 0.978 and standard deviation of 0.375 log units. The model indicates that Henry's law constant may be approximated as a linear function of factors related to bulk, lipophilicity, and polarity.
INTRODUCTION The partitioning of a compound between the phases of a gaseous-aqueous binary system is of fundamental importance in chemical thermodynamics. Henry's law constant, H, is a quantitative expression of the compound's partitioning nature between the two phases. For low aqueous concentrations of a compound, H is defined as
$$H = f/X \qquad (1)$$
where f is the fugacity of the compound in the gas phase and X is its mole fraction in the aqueous phase. Thus, a compound with a high H partitions more into the gas phase, and a compound with a low H partitions more into the aqueous phase.
Even for common compounds, log H values have been experimentally determined with only limited success. Mackay and Shiu¹ have critically reviewed the literature for log H data of 167 compounds and report only 40 direct experimental measurements of Henry's law constant. The log H data for the remainder of the compounds were calculated by dividing vapor pressure data by solubility data, a method which sometimes leads to significantly reduced accuracy. Nirmalakhandan and Speece² noted, for example, that the reported values of log H for chloroform and trichloroethylene vary from -0.543 to -0.954 and from -0.113 to 0.555, respectively.
Since accurate log H data are difficult to obtain experimentally, alternative methods have been employed to predict log H from molecular structure. Two quantitative structure-property relationship (QSPR) studies have provided accurate models for the prediction of log H from molecular structure. First, Hine and Mookerjee⁶ reported an empirically based group and bond contribution scheme relating log H to molecular structure with a standard error of 0.42 log units for 245 compounds. Since 70 group and 34 bond contribution factors were used in the scheme, the nature of the relationship between molecular structure and log H is difficult to interpret. Second, Nirmalakhandan and Speece² developed a predictive equation which employed descriptors calculated from molecular structure alone. They achieved a standard error of 0.262 log units for 180 compounds. Their final equation contained a valence connectivity index, $^1\chi_p^v$ (described by Kier and Hall⁷), an indicator variable for the presence of an electronegative atom, and an 11-factor statistically optimized molecular polarizability descriptor, $\Phi$ (described by Horvath⁸). Since the formula used by Nirmalakhandan and Speece for computing $\Phi$ was obtained by regressing the 11 factors against log H, its identification as a polarizability descriptor is questionable. Thus, the nature of the relationship between molecular structure and log H is again difficult to interpret. In summary, two investigations have demonstrated the existence of a strong relationship between molecular structure and log H, but the models do not allow for a clear interpretation of this relationship.
There were two primary goals in the present investigation. The first goal was to use computer-assisted methods to develop a model strongly correlating molecular structure with log H using structural descriptors containing interpretable information. The second goal was to determine if the model could lend insight to the relationship between molecular structure and log H.
† Present address: Department of Chemistry, Millikin University, 1184 West Main Street, Decatur, IL 62522.
(1) Mackay, D.; Shiu, W. Y. J. Phys. Chem. Ref. Data 1981, 10, 1175-1199.
(2) Nirmalakhandan, N. N.; Speece, R. E. Environ. Sci. Technol. 1988, 22, 1349-1357.
(3) Technical Data Services, Inc., New York, NY.
(4) McConnel, G.; Ferguson, D. M.; Pearson, C. R. Endeavour 1975, 34, 13-18.
(5) Lincoff, A. H.; Gossett, J. M. Presented at International Symposium on Gas Transfer at Water Surfaces, Cornell University, Ithaca, NY, 1984.
(6) Hine, J.; Mookerjee, P. K. J. Org. Chem. 1975, 40, 292-298.
0003-2700/92/0384-1350$03.00/0
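The vapor-pressure/solubility route mentioned in the Introduction can be illustrated with a short calculation. This is only a minimal sketch under stated assumptions: ideal-gas behavior, a sparingly soluble solute, and the common dimensionless convention in which H is the ratio of gas-phase to aqueous-phase molar concentration. The function name and the input values are hypothetical and are not taken from the data used in this study.

```python
import math

R_GAS = 0.082057  # gas constant in L atm / (mol K)

def log_h_dimensionless(p_atm, s_mol_per_l, t_kelvin=298.15):
    """log10 of the gas/water concentration ratio (assumed convention)."""
    # Ideal-gas concentration of the solute in the vapor phase, mol/L.
    c_gas = p_atm / (R_GAS * t_kelvin)
    # Dimensionless H is the gas concentration over the aqueous solubility.
    return math.log10(c_gas / s_mol_per_l)

# Hypothetical inputs: vapor pressure 0.2 atm, solubility 1.0e-3 mol/L.
example = log_h_dimensionless(p_atm=0.2, s_mol_per_l=1.0e-3)
print(round(example, 3))
```

Because log H depends on the difference log p - log S, a tenfold error in either measured quantity shifts the result by a full log unit, which is one way to see why indirectly derived values can scatter widely.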
EXPERIMENTAL SECTION The procedure used in this investigation is outlined in the flow diagram presented in Figure 1. All computations involving descriptor generation and statistical modeling were carried out using the ADAPT software package⁹,¹⁰ on a Sun 4/110 workstation. The unique feature of the ADAPT methodology is that a large number (>100) and a variety of structural descriptors are generated initially. The choice of descriptors computed is based on a knowledge that they have been found to be important in numerous other QSPR investigations covering a variety of physicochemical properties. The use of such a large descriptor pool supposes no a priori knowledge of the relationship between structure and Henry's law constant. Subsequent descriptor analysis and regression analysis ultimately lead to a single model judged to be the best based on statistical and nonstatistical criteria. This methodology is but one of many that can be used to uncover structure-property relationships.
Figure 1. Procedural flow diagram for the QSPR study (stages include descriptor generation, descriptor analysis, regression analysis, and model validation).
Data Set. The data set for this investigation consisted of a diverse set of 72 organic compounds. Table I contains the experimental log H values (in dimensionless form) at 25 °C, all of which were taken from Hine and Mookerjee⁶ and references therein. The majority of the Henry's law constants were obtained through division of vapor pressure data by solubility data. Seven compounds that are fairly representative of the entire set were removed so that they could be used as an external prediction set during model validation. The compounds selected as the prediction set were butylbenzene, 1-pentanol, dimethyl ether, propionic acid, butyl acetate, pentylamine, and diethylamine. The remaining 65 compounds were used as the training set for developing statistical models.
Table I. Names and log H Values for the Compounds Used in This Study (H in Dimensionless Form)ᵃ
no.  compound            log H    no.  compound              log H
  1  n-hexane             1.87     37  butyl acetate         -1.87
  2  n-heptane            1.92     38  pentyl acetate        -1.80
  3  n-octane             2.12     39  hexyl acetate         -1.66
  4  methylcyclohexane    1.25     40  isobutyl acetate      -1.73
  5  cyclohexene          0.27     41  methyl butyrate       -2.08
  6  1-octene             1.59     42  ethylamine            -3.38
  7  1-pentyne            0.01     43  propylamine           -3.30
  8  1,4-pentadiene       0.69     44  butylamine            -3.21
  9  ethylbenzene        -0.45     45  pentylamine           -3.00
 10  propylbenzene       -0.39     46  hexylamine            -2.96
 11  butylbenzene        -0.29     47  propionitrile         -2.82
 12  m-xylene            -0.59     48  butyronitrile         -2.67
 13  naphthalene         -1.77     49  nitrobenzene          -3.02
 14  anthracene          -3.14     50  2-chloropropane       -0.18
 15  ethanol             -3.59     51  1-chlorobutane        -0.10
 16  1-propanol          -3.56     52  chlorobenzene         -0.74
 17  1-butanol           -3.46     53  1-bromopropane        -0.41
 18  1-pentanol          -3.29     54  1-bromobutane         -0.30
 19  1-hexanol           -3.20     55  1,2-dichloroethane    -1.27
 20  1-heptanol          -3.12     56  1,4-dioxane           -3.70
 21  2-propanol          -3.48     57  fluoromethane         -0.16
 22  cyclohexanol        -3.63     58  dimethyl sulfide      -1.13
 23  phenol              -4.79     59  diethyl sulfide       -1.05
 24  diethyl ether       -1.28     60  methanethiol          -0.91
 25  dibutyl ether       -0.61     61  ethanethiol           -0.95
 26  dimethyl ether      -1.39     62  thiophenol            -1.87
 27  propionaldehyde     -2.52     63  dimethylamine         -3.14
 28  butyraldehyde       -2.33     64  diethylamine          -2.98
 29  2-pentanone         -2.58     65  piperidine            -3.74
 30  2-heptanone         -2.23     66  triethylamine         -2.22
 31  2-octanone          -2.11     67  trimethylamine        -2.37
 32  acetic acid         -4.91     68  N-methylpiperidine    -2.85
 33  propionic acid      -4.74     69  pyridine              -3.44
 34  butyric acid        -4.66     70  4-methylpyridine      -3.61
 35  ethyl acetate       -2.26     71  2,4-dimethylpyridine  -3.56
 36  propyl acetate      -2.09     72  2-methylpyrazine      -4.04
ᵃ log H data taken from Hine and Mookerjee.⁶
(7) Kier, L. B.; Hall, L. H. Molecular Connectivity in Structure-Activity Analysis; Research Studies: Hertfordshire, England, 1986.
(8) Horvath, A. L. Halogenated Hydrocarbons; Marcel Dekker: New York, 1982.
(9) Stuper, A. J.; Brugger, W. E.; Jurs, P. C. Computer Assisted Studies of Chemical Structure and Biological Function; Wiley-Interscience: New York, 1979.
(10) Jurs, P. C.; Chou, J. T.; Yuan, M. In Computer-Assisted Drug Design; Olson, E. C., Christoffersen, R. E., Eds.; The American Chemical Society: Washington, D.C., 1979; pp 103-129.
(11) Antoine, C. Compt. Rend. 1888, 107, 681.
(12) Thomson, G. W. Chem. Rev. 1946, 38, 1-39.
(13) Andon, R. J. L.; Cox, J. D.; Herington, E. F. G. J. Chem. Soc. 1954, 3188-3196.
(14) McAuliffe, C. J. Phys. Chem. 1966, 70, 1267-1275.
© 1992 American Chemical Society
ANALYTICAL CHEMISTRY, VOL. 64, NO. 13, JULY 1, 1992
Experimental Uncertainties. Once the data set for a QSPR investigation is established, it is important to have an estimate of the average uncertainty in the experimental data. The primary reason for this is to ensure that the modeling process does not overfit the data. For log H, such an estimate is difficult to obtain due to the large number of experimental sources and the omission of experimental uncertainties in many of the original papers. However, since the log H values were obtained primarily through the combination of vapor pressure data and solubility data, a rough estimate of the uncertainty may be arrived at by examining limiting experimental uncertainties in vapor pressures and solubilities. Compounds with low aqueous solubilities and/or low vapor pressures should have the greatest associated uncertainty in log H. If, for some compound A, the vapor pressure $p_A$ is not much greater than 1 atm and the solubility $S_A$ is not much greater than 1 mol/L, then eq 1 may be replaced by
$$H_A = k\,p_A/S_A \qquad (2)$$
where k is a suitable conversion factor between mole fraction and molar concentration. This formula leads to
$$\log H_A = \log p_A - \log S_A + \log k \qquad (3)$$
Changes $dp_A$ and $dS_A$ govern the change $d(\log H_A)$ according to
$$d(\log H_A) = dp_A/p_A - dS_A/S_A \qquad (4)$$
Thus, if $\sigma_{p_A}$ is the uncertainty in the vapor pressure and $\sigma_{S_A}$ is the uncertainty in the solubility, then a standard technique of error propagation yields
$$\sigma_{\log H_A} = \left[(\sigma_{p_A}/p_A)^2 + (\sigma_{S_A}/S_A)^2\right]^{1/2} \qquad (5)$$
Hence, the uncertainty in log H depends only on the relative uncertainties in the vapor pressure and solubility. This formulation of course ignores any systematic errors inherent in the approach of combining vapor pressures and solubilities to arrive at Henry's law constant.
Vapor pressures p used in the calculation of log H are most often computed from an empirically derived Antoine equation¹¹,¹²
$$\log p = A - B/(T + C) \qquad (6)$$
Here, T is the temperature in °C and the constants A, B, and C are determined from vapor pressure measurements at three or more temperatures. The formula in eq 6 in general provides a very accurate estimate of p over a specified range of temperatures or pressures. For example, an Antoine equation for isooctane¹² yielded an average deviation of only 0.05 mmHg over the observed vapor pressure range 100-1500 mmHg. This degree of accuracy approaches that of the actual experimental vapor pressure measurements. Realizing that errors from the Antoine equation are usually small in absolute value, examination of compounds with low vapor pressures should uncover the largest relative uncertainties in p. Of the compounds in the data set, the pyridines are among those exhibiting the lowest room temperature vapor pressures. Andon and co-workers¹³ compared Antoine-calculated and observed vapor pressures for a number of pyridines near room temperature. The observed vapor pressures for these compounds covered a range of 4-16 mmHg. Andon found an average error of 1.6% in the Antoine values and a maximum error of 3%. This analysis is by no means comprehensive, but it is probably not unreasonable to assume an uncertainty of approximately 2% in the vapor pressure data.
Solubility is subject to considerably more uncertainty than vapor pressure. For the present data set, the hydrocarbons exhibit by far the lowest aqueous solubilities and therefore should carry the largest relative uncertainties for solubility. McAuliffe¹⁴ used a gas-liquid partition chromatographic technique to measure aqueous solubilities for a variety of saturated and unsaturated hydrocarbons. Most of the hydrocarbon solubilities for the present study were taken from this source. For each measured solubility, McAuliffe reported a standard deviation for a series of trials. Ignoring systematic errors and utilizing the reported standard deviations as uncertainties yields an average relative
uncertainty of 7% for the solubilities of the first nine compounds (Table I) in the present work. With approximate uncertainties of 2% and 7% for the vapor pressure and solubility data, respectively, eq 5 yields an absolute uncertainty of 0.07 log H units (combining the two relative uncertainties in quadrature gives $(0.02^2 + 0.07^2)^{1/2} \approx 0.073$). Even though limiting vapor pressures and solubilities were used to arrive at this value, it should still be considered a lower bound on the uncertainty of log H, since systematic errors were ignored throughout. Therefore, for this data set, an uncertainty of 0.07 log units should be used as a guideline to prevent overfitting the data.
Structure Entry. Molecular structures for each compound were generated by sketching them on a graphics terminal. Minimum-energy conformations for 68 of the 72 compounds were arrived at using Allinger's molecular mechanics force field.¹⁵ The MM2 program lacked parameters for phenol, nitrobenzene, chlorobenzene, and thiophenol. For these structures, geometries were optimized using the MOPAC software package with the MNDO Hamiltonian.¹⁶
Descriptor Generation. A total of 165 structural descriptors were calculated. The descriptors may be grouped into the following four categories: (1) topological, (2) geometric, (3) electronic, and (4) charged-partial surface area. Table II contains a summary of the types of descriptors included in each category. Topological descriptors were computed by considering only the types of atoms and bonds present and the connection table for each compound. Geometric descriptors utilized the actual three-dimensional structures of the molecules. The majority of the electronic descriptors were obtained using an empirical atomic charge scheme known as CHARGE.¹⁷,¹⁸ This method computes partial atomic charges using an electronegativity-based algorithm for σ charges and a simple Hückel molecular orbital calculation for π charges. The CHARGE scheme has been parameterized to yield accurate electric dipole moments. Atomic charges from this method were combined with solvent-accessible atomic van der Waals surface areas to yield a variety of charged-partial surface area descriptors.¹⁹ The method of Pearlman²⁰ was utilized for the surface area calculations.
Table II. Summary of the Structural Descriptors Calculated at the Beginning of the Study
category                      number computed  ref    description
topological                   28               7      simple and valence-corrected connectivity indices
                              36                      atom, bond, fragment, and substructure counts
                              5                25,26  weighted sums of self-avoiding connected paths from molecular graphs
                              5                27     weighted path descriptors from atomic IDs
                              6                28     molecular shape indices
geometric                     2                20     solvent-accessible molecular surface areas and volumes based on van der Waals atomic radii
                              2                       maximum and minimum length to breadth ratios for the molecule
                              14                      moments of inertia, ratios of moments, and radius of gyration, with and without hydrogens
                              3                       symmetry descriptors based on the number of atoms with unique environments
                              1                29     three-dimensional Wiener index
electronic                    1                30     empirically calculated molar refractivity
                              1                31     empirically calculated molecular polarizability
                              36               17,18  atomic charges on most positive and most negative atoms, electric dipole moment, and variations of charges on heteroatoms
charged-partial surface area  25               19     combinations of atomic charges and solvent-accessible van der Waals atomic surface areas
total                         165
(15) Burkert, U.; Allinger, N. L. Molecular Mechanics; ACS Monograph Series: Washington, D.C., 1982.
(16) Quantum Chemistry Program Exchange No. 455 (Version 6.0). Dewar, M. J. S.; Thiel, W. J. Am. Chem. Soc. 1977, 99, 4899-4907.
(17) Abraham, R. J.; Smith, P. E. J. Comput. Chem. 1988, 9, 288-297.
(18) Dixon, S. L.; Jurs, P. C. J. Comput. Chem., in press.
(19) Stanton, D. T.; Jurs, P. C. Anal. Chem. 1990, 62, 2322-2329.
(20) Quantum Chemistry Program Exchange No. 413. Pearlman, R. S. In Physical Chemical Properties of Drugs; Sinkula, A. A., Valvani, S. C., Eds.; Marcel Dekker, Inc.: New York, 1980; Chapter 10.
Descriptor Analysis. After descriptors were generated, they were analyzed for content of discriminating information. All descriptors with more than 70% zero or identical values were eliminated from further consideration since they would encode little valid or discriminating information. One descriptor from each pair correlated at r ≥ 0.95 was eliminated to avoid the
duplication of information. In this case, the descriptor which was easier to calculate or the one possessing more direct physical meaning was retained. During this process, a total of 63 descriptors were eliminated. To further reduce the overlap of information, the remaining pool of 102 descriptors was ranked on the basis of mutual orthogonality using a vector space descriptor analysis technique.²¹ Starting with a single descriptor that was highly correlated with log H and adding an additional descriptor from the pool at each step, a Gram-Schmidt orthogonalization procedure²² was used to build a series of orthonormal bases of increasing dimension. At each step, the descriptor possessing the largest component orthogonal to the current descriptor space was removed from the pool and incorporated into the basis. The process was terminated when the projection angle between the chosen descriptor and the current descriptor space was less than 1°. This corresponds to a space that accounts for more than 99.9% of the variance in each of the descriptors remaining in the pool. The entire analysis was performed repeatedly, each time with a different initial descriptor. A basis of about 20 descriptors was found to be all that was necessary to achieve the 1° projection angle criterion. With few exceptions, the same set of 20 descriptors was incorporated into the basis, independent of the choice of initial basis vector. These descriptors were retained for regression analysis and are described in detail in Table III.
Regression Analysis. The leaps-and-bounds regression algorithm²³ was used to find the subsets of descriptors most highly correlated with log H. Variations of the subsets were then manually regressed on the dependent variable log H with an interactive regression analysis program. The regression models developed were analyzed for quality by a combination of the following criteria: highest R value, lowest standard deviation of regression, smallest number of descriptors, and most interpretable descriptors.
Table III. Descriptors Retained for Submission to Regression Analysis
classification                 label     definition/formula
connectivity                   S6P       $^6\chi_p$: simple 6th-order path connectivity index
                               V1        $^1\chi_p^v$: valence 1st-order path connectivity index
                               V5P       $^5\chi_p^v$: valence 5th-order path connectivity index
fragment                       NHEAVYᵃ   number of heavy atoms
                               NCl       number of chlorines
                               NBND      number of bonds between heavy atoms
weighted path                  WTPD      (molecular ID)/NHEAVY
surface area                   SA        total solvent-accessible surface area
moment of inertiaᵇ             MOM1      (2nd moment)/(3rd moment)
atomic charge                  QPOS      charge on the most positive atom
                               QNEG      charge on the most negative atom
                               HPOS      charge on the most positive hydrogen
                               QPREL     QPOS/(total positive charge)
                               QHETᵃ     (total charge on heteroatoms)/(number of heteroatoms)
                               QREL      (total charge on heteroatoms)/(number of atoms)
                               QRELSQᵃ   QREL$^2$
charged-partial surface areaᶜ  FPSA      $(1/\mathrm{SA})\sum (Q_i^+)(\mathrm{SA}_i^+)$
                               FNSA      $(1/\mathrm{SA})\sum (Q_i^-)(\mathrm{SA}_i^-)$
                               WPSAᵃ     $(\mathrm{SA})\sum \mathrm{SA}_i^+$
                               RNCSᵃ     $(\mathrm{QNEG})(\mathrm{SA}_{\mathrm{mneg}})/(\sum Q_i^-)$
ᵃ Descriptor was used in final regression equation. ᵇ Computed from heavy atoms only. ᶜ $Q_i^+$ and $\mathrm{SA}_i^+$ refer to the charge and surface area, respectively, of the ith positively charged atom; $Q_i^-$ and $\mathrm{SA}_i^-$ are defined analogously.
(21) Topliss, J. G.; Edwards, R. P. J. Med. Chem. 1979, 22, 1238-1244.
(22) Bradley, G. L. A Primer of Linear Algebra; Prentice-Hall, Inc.: New Jersey, 1975.
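The orthogonality-based ranking described under Descriptor Analysis can be sketched in a few lines. This is an illustrative reading of the procedure and not the ADAPT implementation: the function name `rank_by_orthogonality`, the mean-centering and unit scaling of each descriptor, and the synthetic demonstration data are all assumptions. The sketch also assumes that constant descriptors have already been removed, as the 70% filter would ensure.

```python
import numpy as np

def rank_by_orthogonality(X, start, angle_tol_deg=1.0):
    """Greedy Gram-Schmidt selection of mutually orthogonal descriptors."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # center each descriptor
    Xc = Xc / np.linalg.norm(Xc, axis=0)    # scale columns to unit length
    selected = [start]
    Q = Xc[:, [start]]                      # orthonormal basis built so far
    remaining = [j for j in range(Xc.shape[1]) if j != start]
    while remaining:
        cand = Xc[:, remaining]
        resid = cand - Q @ (Q.T @ cand)     # part orthogonal to the basis
        norms = np.linalg.norm(resid, axis=0)   # equals sin(angle to basis)
        best = int(np.argmax(norms))
        angle = np.degrees(np.arcsin(min(norms[best], 1.0)))
        if angle < angle_tol_deg:           # nothing new left in the pool
            break
        Q = np.hstack([Q, resid[:, [best]] / norms[best]])
        selected.append(remaining.pop(best))
    return selected

# Synthetic pool: three independent columns plus one redundant combination.
rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 20))
X_demo = np.column_stack([a, b, c, a + b])
order = rank_by_orthogonality(X_demo, start=0)
print(order)
```

In this toy pool the fourth column is an exact linear combination of the first two, so the selection stops after three descriptors regardless of which of the redundant columns is picked first. Repeating the call with different `start` values mirrors the repeated analysis with different initial descriptors.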
RESULTS AND DISCUSSION By examining R values and standard deviations of regression, a number of statistically superior models containing from five to seven descriptors were identified. A single five-variable regression yielded an R value essentially as high as the six- and seven-variable fits and had the lowest standard deviation of regression. For ease of interpretation, this five-variable fit was chosen as the best model. It can be summarized as
$$\log H = (-0.513 \pm 0.038)\,\mathrm{NHEAVY} + (0.0370 \pm 0.0025)\,\mathrm{WPSA} + (0.0361 \pm 0.0052)\,\mathrm{RNCS} + (10.1 \pm 0.4)\,\mathrm{QHET} - (251 \pm 58)\,\mathrm{QRELSQ} + 0.82 \pm 0.30 \qquad (7)$$
R = 0.970, n = 65, s = 0.433
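The comparison of candidate models by R and s can be illustrated with an exhaustive best-subset search. Note that this swaps in brute-force enumeration for the leaps-and-bounds algorithm named in the Experimental Section, which reaches the best subsets without fitting every combination. The function names and the synthetic data are assumptions made for illustration only.

```python
import itertools
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with intercept; returns coefficients, R, and s."""
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sse = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    n, p = A.shape                       # p counts the intercept as well
    R = np.sqrt(max(0.0, 1.0 - sse / sst))
    s = np.sqrt(sse / (n - p))           # standard deviation of regression
    return coef, R, s

def best_subsets(X, y, size):
    """Fit every subset of the given size; sort by ascending s."""
    results = []
    for cols in itertools.combinations(range(X.shape[1]), size):
        _, R, s = fit_ols(X[:, list(cols)], y)
        results.append((s, R, cols))
    return sorted(results)

# Synthetic example: y depends on columns 0 and 2 only.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 5))
y_demo = 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 2] + 0.05 * rng.normal(size=40)
top = best_subsets(X_demo, y_demo, size=2)[0]
print(top[2])
```

Running the search for several subset sizes and comparing the leading entries reproduces the kind of trade-off described here, where a smaller model is preferred when R is essentially unchanged and s is lowest.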
Definitions for these descriptors appear in Table III. The uncertainties on the regression coefficients correspond to 95% confidence intervals. It should be noted that the first four descriptors in eq 7 were present in all of the potential models. Thus, the decision to retain only one additional descriptor did not have a profound effect on the structure of the final model.
Analysis of the model showed that chlorobenzene and thiophenol were statistical outliers. These compounds were identified as outliers because they exceeded three or more of the diagnostic cut-off values for six standard statistical tests known as: (1) residual, (2) standardized residual, (3) studentized residual, (4) leverage, (5) DFITS statistic, and (6) Cook's distance.²⁴ Briefly, the residual is simply the difference between the observed experimental value and the calculated value. A standardized residual is the residual divided by the standard deviation of the regression equation. The studentized residual is the residual for one observation divided by its own standard deviation. Leverage is a measure of the relative influence a single observation has on the regression. The DFITS statistic is used to describe the difference in the fit of the equation caused by removal of an observation. Finally, Cook's distance measures the change in regression coefficients due to the removal of an observation.
(23) Furnival, G. M.; Wilson, R. W. Technometrics 1974, 16, 499-511.
(24) Belsley, D. A.; Kuh, E.; Welsch, R. E. Regression Diagnostics; Wiley: New York, 1980.
(25) Randić, M. Comput. Chem. 1979, 3, 5-13.
(26) Wiener, H. J. Am. Chem. Soc. 1947, 69, 17-20.
(27) Randić, M. J. Chem. Inf. Comput. Sci. 1984, 24, 164-175.
(28) Kier, L. B. Quant. Struct.-Act. Relat. Pharmacol., Chem. Biol. 1986, 5, 1-7.
(29) Trinajstic, N.; Nikolic, S. J. Math. Chem. 1989, 3, 299-309.
(30) Vogel, A. I. Elementary Practical Organic Chemistry, Part 2: Qualitative Organic Analysis; John Wiley and Sons: New York, 1966; p 24.
(31) Miller, K. J.; Savchic, J. A. J. Am. Chem. Soc. 1979, 101, 7206-7213.
The outliers, chlorobenzene and thiophenol, were two of the four compounds that were modeled by the MOPAC program. The other two compounds modeled by MOPAC, phenol and nitrobenzene, contained values well accounted for by the model, however. Compared to the other members of the training set, the outliers were the only benzene derivatives with an attached substituent beyond the second period. Their higher polarizability and tendency to interact with the adjacent π system through d orbitals may have caused a resonance effect not accounted for in the model and not represented well in the training set. Since the outliers significantly influenced the model in a way not accounted for by the model, they were eliminated from consideration in subsequent work, leaving a revised training set of 63 compounds.
The best model from the regression analysis was rebuilt using the revised training set of 63 compounds. This yielded the following model:
$$\log H = (-0.547 \pm 0.033)\,\mathrm{NHEAVY} + (0.0402 \pm 0.0023)\,\mathrm{WPSA} + (0.0360 \pm 0.0045)\,\mathrm{RNCS} + (10.1 \pm 0.4)\,\mathrm{QHET} - (215 \pm 51)\,\mathrm{QRELSQ} + 0.73 \pm 0.26 \qquad (8)$$
R = 0.978, n = 63, s = 0.375
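The six outlier diagnostics named earlier can be computed for any least-squares fit with standard textbook formulas. The sketch below is a minimal illustration and makes several assumptions: the function name is hypothetical, the studentized residual is taken in its internally studentized form, the DFITS statistic is computed from the externally studentized residual, and no cut-off values are applied because the text does not list them. The data are synthetic, with one deliberately shifted observation.

```python
import numpy as np

def regression_diagnostics(X, y):
    """Per-observation diagnostics for an OLS fit with intercept."""
    A = np.column_stack([np.asarray(X, dtype=float), np.ones(len(y))])
    n, p = A.shape
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    e = y - A @ coef                                   # residuals
    h = np.einsum("ij,ji->i", A, np.linalg.pinv(A))    # hat-matrix diagonal
    s2 = float(e @ e) / (n - p)                        # regression variance
    # Variance estimate with observation i left out (closed form).
    s2_del = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
    t_ext = e / np.sqrt(s2_del * (1 - h))
    return {
        "residual": e,
        "standardized": e / np.sqrt(s2),
        "studentized": e / np.sqrt(s2 * (1 - h)),
        "leverage": h,
        "dffits": t_ext * np.sqrt(h / (1 - h)),
        "cooks": e**2 * h / (p * s2 * (1 - h) ** 2),
    }

# Synthetic straight-line data with the last point shifted upward.
rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x + 0.01 * rng.normal(size=20)
y[19] += 1.0
d = regression_diagnostics(x, y)
print(int(np.argmax(d["cooks"])))
```

A useful sanity check on any such implementation is that the leverages sum to the number of fitted parameters, here two (slope and intercept).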
In this model, no compounds were identified as statistical outliers using the criteria previously discussed. The calculated values and residuals for the training set compounds are shown in Table IV. Figure 2 is a plot of the calculated vs observed values for log H. Several validation techniques other than statistical outlier detection were used to test the quality of the model. First, the residuals were plotted against the calculated values to check for a relationship between the two. As shown in Figure 3, the residuals were not a function of the calculated value. Second, the pairwise correlations of descriptors were checked. As shown in Table V, these correlations were well below the cut-off value of r ≥ 0.95 used in this study. Third, the correlation of combinations of descriptors was checked. This test for multicollinearities was done by regressing each descriptor in the model against the remaining descriptors used in the model. The highest multicollinearity, at R = 0.766, was also well below the cut-off value used in this study. Finally, the model was used to estimate log H values for compounds in the external prediction set. Calculated log H values and residuals for these compounds are given in Table VI. The standard deviation of the prediction error was 0.402 log units, which is within acceptable limits when compared to the training set error of 0.375 log units. A plot of the calculated vs observed log H values for the prediction set is shown in Figure 4. It should be noted that external prediction is one of the most valuable of all model validation techniques. Accurate external predictions are very strong evidence for the stability of regression coefficients and therefore the absence of chance correlations in model development.
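The multicollinearity test described above, in which each model descriptor is regressed against the remaining ones, can be sketched as follows. The function name and the synthetic descriptor matrix are assumptions for illustration; the actual descriptor values for the training set are not reproduced here.

```python
import numpy as np

def multicollinearity_r(X):
    """Multiple correlation R of each column regressed on all the others."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    out = np.empty(m)
    for j in range(m):
        target = X[:, j]
        # Remaining descriptors plus an intercept column.
        others = np.column_stack([np.delete(X, j, axis=1), np.ones(n)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        sst = ((target - target.mean()) ** 2).sum()
        out[j] = np.sqrt(max(0.0, 1.0 - (resid @ resid) / sst))
    return out

# Synthetic matrix: column 2 is nearly the sum of columns 0 and 1.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 4))
X_demo[:, 2] = X_demo[:, 0] + X_demo[:, 1] + 0.01 * rng.normal(size=200)
r = multicollinearity_r(X_demo)
print(np.round(r, 3))
```

In the synthetic case the three entangled columns each give R close to 1 while the independent column gives a small R, even though no single pairwise correlation need reach the 0.95 cut-off. This is why the check on combinations of descriptors complements the pairwise test.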
The model indicates that the log of Henry's law constant may be approximated as a multiple linear function of (1) the number of heavy atoms, (2) the weighted positively charged surface area, (3) the relative surface area on the most negative atom, (4) the average charge on heteroatoms, and (5) the squared relative charge on heteroatoms. A direct cause-and-effect relationship may not be concluded based on statistical correlations alone. However, guided by interpretation of the model variables, it
Table VI. Comparison of Observed and Calculated log H for Prediction Compounds (H in Dimensionless Form)
no.  compound         obsvd   calcd   residual
  1  butylbenzene     -0.29   -0.23   -0.06
  2  1-pentanol       -3.29   -3.19   -0.10
  3  dimethyl ether   -1.39   -2.12    0.73
  4  propionic acid   -4.74   -4.14    0.60
  5  butyl acetate    -1.87   -2.12    0.25
  6  pentylamine      -3.00   -3.25    0.25
  7  diethylamine     -2.98   -2.57   -0.41
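The text does not state which formula produced the reported prediction error of 0.402 log units. As a minimal sketch, assuming the residual column of Table VI exactly as tabulated, two common summaries can be compared: the sample standard deviation of the residuals about their mean (n - 1 denominator) and the root-mean-square error. The variable names are illustrative.

```python
import math

# Residuals for the seven prediction compounds, copied from Table VI.
residuals = [-0.06, -0.10, 0.73, 0.60, 0.25, 0.25, -0.41]

n = len(residuals)
mean = sum(residuals) / n

# Root-mean-square error: spread of the residuals about zero.
rmse = math.sqrt(sum(r * r for r in residuals) / n)

# Sample standard deviation: spread about the mean residual.
sample_sd = math.sqrt(sum((r - mean) ** 2 for r in residuals) / (n - 1))

print(round(sample_sd, 3), round(rmse, 3))
```

With these inputs the sample standard deviation is about 0.403, close to the reported value, and the root-mean-square error is about 0.414. Either figure is comparable to the training set standard deviation of 0.375 log units.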
Figure 4. Plot of calculated log H vs observed log H for the external prediction set.
and into the vapor phase. The relative negative surface area descriptor (RNCS) can best be described as a factor which is inversely related to bulk for polar compounds. Values for RNCS are generally small for hydrocarbons and only become significant for compounds with a heteroatom. Within a given non-hydrocarbon series (e.g., carboxylic acids), RNCS is observed to decrease with increasing chain length or molecular bulk. Considering the fact that RNCS receives a positive coefficient in the model, it could be concluded that this factor encodes for the forcing of larger polar compounds out of the gas phase and into the aqueous phase. Thus, it appears that RNCS may represent the influence of bulk on the partitioning of polar compounds.
The average heteroatom charge (QHET) and the square of the relative charge due to heteroatoms (QRELSQ) both encode polarity directly. These descriptors are nonzero only when heteroatoms are present, and they become larger in absolute value for compounds with highly negative heteroatoms. The nonzero values for QHET are always negative, so the fact that it has a positive coefficient in the regression is consistent with the aqueous affinity and lower log H of polar compounds. The term QRELSQ encodes information similar to that of QHET, but the regression coefficient is negative because the squaring operation makes all of the values non-negative.
CONCLUSIONS
A model that highly correlates molecular structure with the log of Henry's law constant has been developed for a diverse set of organic compounds. The model consists of five structural descriptors which encode information apparently related to each compound's bulk, lipophilicity, and polarity. These factors were arrived at through interpretation of the descriptors, which was possible primarily because there were a small number of descriptors with direct physical meaning. Log H values for an external set of compounds have been predicted with reasonable accuracy using the model equation. Computer-assisted methods have thus provided insight into the relationship between structure and Henry's law constant and have led to the development of a predictive equation for log H. Examination of a larger data set using similar computer-assisted methods may provide a deeper understanding of the factors contributing to gaseous-aqueous partitioning and may provide a more universal equation for predicting log H.
ACKNOWLEDGMENT C.J.R. acknowledges the support provided during the course of this work by the REU program of the National Science Foundation.
RECEIVED for review October 14, 1991. Revised manuscript received March 9, 1992. Accepted March 16, 1992.
Registry No. Butylbenzene, 104-51-8; 1-pentanol, 71-41-0; dimethyl ether, 115-10-6; propionic acid, 79-09-4; butyl acetate, 123-86-4; pentylamine, 110-58-7; diethylamine, 109-89-7.