Chapter 36
Quantitative Structure-Activity Relationship Models for Predicting Aqueous Solubility: Comparison of Three Major Approaches
Downloaded by IOWA STATE UNIV on February 15, 2017 | http://pubs.acs.org Publication Date: December 7, 1990 | doi: 10.1021/bk-1990-0416.ch036
Nagamany N. Nirmalakhandan and Richard E. Speece Environmental and Water Resources Engineering, Vanderbilt University, Nashville, TN 37235
Three major approaches to the prediction of aqueous solubility of organic chemicals using Quantitative Structure-Activity Relationship (QSAR) techniques are reviewed. The rationale behind six QSAR models derived from these three approaches and the quality of their fit to the experimental data are summarized. Their utility and predictive ability are examined and compared on a common basis. Three of the models employed the octanol-water partition coefficient as the primary descriptor, while two others used the solvatochromic parameters. The sixth model utilized a combination of connectivity indexes and a modified polarizability parameter. Considering ease of use, predictive ability, and range of applicability, the model derived from the connectivity-polarizability approach appears to have the greatest utility.
Several excellent QSAR models for predicting aqueous solubility have been proposed during the past few years. Many of them covered small sets of selected classes of congeneric compounds, while few covered a wide variety of compounds. The first major attempt at developing a QSAR model for aqueous solubility was by Hansch et al (1), who used the octanol-water partition coefficient, P, to derive a simple linear equation for the solubility of 156 organic liquid solutes. Since then, several semi-theoretical and empirical models using log P have been reported, though for smaller numbers and particular classes of compounds (2-5). The second approach, developed by Kamlet and co-workers (6,7), was of a more fundamental nature. Known as Linear Solvation Energy Relationships (LSER), this approach uses solvatochromic parameters to model the solute-solvent interactions in the solution process. The third approach, developed by the current authors (8-10), uses molecular connectivity indexes, χ, and a modified polarizability parameter, Φ, to model the solute-solvent interactions.
In statistical terms, these models have been reported to be very strong, as shown by their high correlation coefficients, r. However, except in a very few cases, the utility value and the predictive ability of many of these models have not been demonstrated by their respective authors. Such features of QSAR models should be made available so that end-users can select an appropriate model depending on ease of use, reliability, degree of accuracy required, and their own expertise. This paper attempts to provide a comparative analysis of six important models for solubility developed from the three major approaches, as reported by four groups of workers.
These six models were selected for evaluation because all of them employed 100 or more compounds in their training sets and reported r > 0.95 with standard error < 0.3, which makes all of them strong candidates for many practical applications.
0097-6156/90/0416-0478$06.00/0 © 1990 American Chemical Society
Melchior and Bassett; Chemical Modeling of Aqueous Systems II ACS Symposium Series; American Chemical Society: Washington, DC, 1990.
THE THREE MAJOR APPROACHES
In this section, the reasoning behind each of the three approaches is outlined, and the models derived from the respective approaches are presented.
The log P Approach.
The rationale for this approach was based on the similarity between the dissolution of an organic solute in water and its partitioning between two solvents. Thus, the equilibrium between an organic solute and its saturated aqueous solution was thought to be similar to the partitioning of the solute between itself and water, and a linear relationship between log S (S = solubility in moles/L) and log P was sought. The study by Hansch et al (1) reported the following QSAR model:

[Model 1]: for aliphatic and aromatic liquid solutes
$$\log(1/S) = 1.339 \log P - 0.978 \tag{1}$$
n = 156; r = 0.935.

where n is the number of solutes evaluated. A thermodynamic justification for this model has been presented by Hansch et al (1). By considering the chemical potential of the solute and ignoring non-ideality, they derived a theoretical equation of the form:
$$\log(1/S) = \log P + \frac{\mu^{\circ}(\mathrm{oct}) - \mu(\mathrm{l})}{2.303\,RT} \tag{2}$$
where $\mu^{\circ}(\mathrm{oct})$ is the chemical potential of the solute in one mole of ideal octanol solution, and $\mu(\mathrm{l})$ is that of the pure liquid solute. Reasonable agreement between equations (1) and (2) can be seen in their form and in the sign and magnitude of the coefficient of the log P term.

Following the approach introduced by Hansch et al, Yalkowski and Valvani (11) and Yalkowski et al (12) developed a semi-theoretical equation to cover both liquid and solid solutes, using an entropy of fusion term, $\Delta S_f$, and a melting point term, MP:

[Model 2]: for aliphatic and aromatic liquid and solid solutes
$$\log S = -\log P - 1.11\,\Delta S_f\,(MP - 25)/1364 + 0.54 \tag{3}$$
n = 167; r = 0.994.

[Model 3]: for aromatic liquid and solid solutes
$$\log S = -0.944 \log P - 0.01\,MP + 0.323 \tag{4}$$
n = 164; r = 0.977.
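As a minimal sketch, the three log P models can be written as plain functions. The function and argument names are illustrative and not from the original studies. The units are assumptions: MP is taken to be in degrees Celsius (as the "MP - 25" term suggests) and the entropy of fusion in cal/(mol K). The chapter does not say how the melting-point terms are handled for solutes that are liquid at 25 °C, so the equations are evaluated exactly as written.

```python
def log_s_model1(log_p: float) -> float:
    """Model 1, eq (1): log(1/S) = 1.339 log P - 0.978 (liquid solutes)."""
    return -(1.339 * log_p - 0.978)


def log_s_model2(log_p: float, delta_s_f: float, mp_c: float) -> float:
    """Model 2, eq (3): liquid and solid solutes.

    delta_s_f: entropy of fusion (assumed units: cal/(mol K)).
    mp_c: melting point (assumed units: degrees Celsius).
    """
    return -log_p - 1.11 * delta_s_f * (mp_c - 25.0) / 1364.0 + 0.54


def log_s_model3(log_p: float, mp_c: float) -> float:
    """Model 3, eq (4): aromatic liquid and solid solutes."""
    return -0.944 * log_p - 0.01 * mp_c + 0.323


# Hypothetical solute; the input values are chosen only for illustration.
log_p, delta_s_f, mp_c = 3.0, 13.5, 80.0
print("Model 1 log S:", round(log_s_model1(log_p), 3))
print("Model 2 log S:", round(log_s_model2(log_p, delta_s_f, mp_c), 3))
print("Model 3 log S:", round(log_s_model3(log_p, mp_c), 3))
```

All three functions return log S with S in moles/L; Model 1 is reported as log(1/S), so its sign is flipped here to allow direct comparison.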
The coefficients of the log P term in Models 2 and 3 are in very good agreement with that of the theoretically derived equation (equation 2).

The LSER Approach. Kamlet et al (6,7), who introduced and developed this approach, use a linear combination of three energy terms to model solubility-related properties, SP, in solute-solvent systems. The parameters used to quantify these energy terms are called the solvatochromic parameters. The first energy term, called the 'cavity term', is a measure of the free energy necessary to separate the solvent molecules (by overcoming the solvent-solvent interactions) to provide a suitably sized cavity for the solute. The second term, called the 'dipolar term', is a measure of the exoergic energy associated with the solute-solvent interactions (e.g. dipole-dipole). The third term, the 'hydrogen bonding term', accounts for the exoergic effects of complexation between systems capable of taking part in hydrogen bonding.
Using appropriate solvatochromic parameters, these three energy terms are combined to model a solubility-related property, SP, of various solutes in a given solvent (e.g. aqueous solubility):
$$SP = SP_0 + m\,V_I/100 + s\,\pi^* + a\,\alpha_m + b\,\beta_m \tag{5}$$
where $SP_0$, m, s, a and b are constant coefficients, and $V_I$, $\pi^*$, $\alpha_m$ and $\beta_m$ are the solvatochromic parameters relating to the solute.
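Equation (5) is a plain linear combination, which the following minimal sketch evaluates. The function name and the numerical values in the example are hypothetical and serve only to show the arithmetic; they are not fitted coefficients or established solvatochromic parameters.

```python
def lser_property(sp0: float, m: float, s: float, a: float, b: float,
                  v_i: float, pi_star: float,
                  alpha_m: float, beta_m: float) -> float:
    """Evaluate eq (5): SP = SP0 + m*V_I/100 + s*pi* + a*alpha_m + b*beta_m."""
    cavity = m * v_i / 100.0                 # cavity term
    dipolar = s * pi_star                    # dipolar term
    h_bond = a * alpha_m + b * beta_m        # hydrogen bonding terms
    return sp0 + cavity + dipolar + h_bond


# Hypothetical coefficients and solute parameters, for illustration only.
sp = lser_property(sp0=0.5, m=-5.0, s=1.0, a=0.0, b=4.0,
                   v_i=60.0, pi_star=0.4, alpha_m=0.0, beta_m=0.5)
print("SP =", round(sp, 3))
```

Keeping the three energy contributions as separate intermediate values makes it easy to inspect which term dominates for a given solute.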
The LSER approach has been very successfully applied by Kamlet et al to model many physicochemical properties and biological activities. Their models for aqueous solubility were (6,7):

[Model 4]: for aliphatic liquid solutes
$$\log S = 0.05 - 5.85\,V_I/100 + 1.09\,\pi^* + 5.23\,\beta_m \tag{6}$$
n = 115; r = 0.994.

[Model 5]: for aromatic liquid and solid solutes
$$\log S = 0.57 - 5.58\,V_I/100 + 3.85\,\beta_m - 0.011\,(MP - 25) \tag{7}$$
n = 70; r = 0.991.

The quality of fit of these LSER models is excellent; the error is comparable to the uncertainties in experimental methods of determining solubility, and the models have been claimed to have "reached the level of exhaustive fit" (6).

The Connectivity-Polarizability Approach. In this approach, the solute-solvent interactions are modeled using the polarizability and the molar volume of the solute. Polarizability, Φ, is in turn modeled by Ketelaar's method (13), in which an atomic contribution scheme is employed. Molar volume is modeled by molecular connectivity indices, χ, which are calculated using slightly modified versions (9) of the algorithms originally proposed by Kier and Hall (14,15). These indices encode information on the molecular topology and its heteroatom content. They have been shown to correlate well with the solute's molar volume and polarizability (14,15). Since polarizability information is duplicated by χ and Φ, a combination of these two parameters is used to model aqueous solubility, by optimizing the atomic contributions to Φ and deriving a modified polarizability parameter, Φ. The basic QSAR model derived on this basis was (8):

[Model 6]: for aliphatic and aromatic liquid and solid solutes
$$\log S = 1.465 + 1.758\,{}^0\chi - 1.465\,{}^0\chi^v + 1.01\,\Phi \tag{8}$$
n = 145; r = 0.975.

where Φ = -0.963 (N of Cl) - 0.361 (N of H) - 0.767 (N of double bonds). The above model has been verified on many testing sets of miscellaneous compounds, and the model fitted a total of 470 compounds, including ethers, esters, PCBs, PNAs, PCDDs, etc., with an r of 0.99 and a standard error of 0.33 (8-10).
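Once the two connectivity indexes are known, Model 6 reduces to simple arithmetic, as in the minimal sketch below. The function names are illustrative. The zero-order indexes are taken as inputs, because the modified Kier and Hall algorithms used by the authors (9) are not reproduced in this chapter. The example values are hypothetical and do not describe a real compound.

```python
def modified_polarizability(n_cl: int, n_h: int, n_double_bonds: int) -> float:
    """Phi = -0.963 (N of Cl) - 0.361 (N of H) - 0.767 (N of double bonds)."""
    return -0.963 * n_cl - 0.361 * n_h - 0.767 * n_double_bonds


def log_s_model6(chi0: float, chi0_v: float, phi: float) -> float:
    """Model 6, eq (8), with chi0 and chi0_v supplied by the caller."""
    return 1.465 + 1.758 * chi0 - 1.465 * chi0_v + 1.01 * phi


# Hypothetical inputs, for illustration only.
phi = modified_polarizability(n_cl=0, n_h=12, n_double_bonds=0)
print("Phi   =", round(phi, 3))
print("log S =", round(log_s_model6(chi0=5.0, chi0_v=4.5, phi=phi), 3))
```

Because every input follows from the molecular structure alone, no experimental measurement is needed to apply the model.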
COMPARISON BETWEEN THE SIX MODELS
An overall summary of the above six models is shown in Table I. In this table, the descriptors are classified into four "types": experimental, assigned, estimated and calculated. In some models, descriptors are assigned numerical values depending on the solute; if assigned incorrectly, these could lead to erroneous results. The difference between "estimated" and "calculated" is that an error may be associated with an estimated value, whereas rigid algorithms are used in determining "calculated" values, yielding firm values. In multiple regression analyses, if the basic assumption of error-free independent variables is violated, invalid models may result. Therefore, models using "calculated" descriptors would be preferable to those using experimental, assigned or estimated ones.
Statistical Qualities Of The Models. Considering the unexplained variance, UV, in the experimental data, the goodness of fit is seen to be superior (UV < 5%) in Models 2, 3, 4, 5 and 6. The fit of Model 1 (UV > 10%), though the weakest, is quite remarkable in that only one descriptor was used to cover 156 compounds, while the other models employ two or more descriptors. Adjusted r² is a statistical indicator which can be used to compare data sets of different numbers of chemicals and descriptors on an equitable basis. Models 2 and 5, which rank high on this basis, need an experimental input, the errors of which might affect not only the accuracy of the result but also the statistical validity of the model itself. The last two columns of Table I give a direct indication of the reliability of the different models. They show the number and percent of compounds for which the error in fitting is greater than 0.3 log units (i.e. a factor of 2). On this basis of comparison, the LSER approach (Models 4 and 5) ranks very high, while the log P approach for the aromatic compounds (Model 3) appears to be the least reliable.

Utility Value Of The Models. The utility value of a QSAR model depends primarily on the nature of the descriptors used in developing the model. The availability of the descriptors, ease of calculation, accuracy or consistency of their values, applicability to new compounds, and ability to represent the structural and atomic variations in the chemicals are some of the relevant factors to be considered in comparing different descriptors. Based on these factors, and considering that the connectivity-polarizability approach does not require any experimental data, Model 6 appears to be the most desirable, followed to a lesser degree by the log P approach. In the connectivity-polarizability approach, simple and rigid algorithms are used which can be applied to all classes of chemicals.
The major limitation of the parameters in this approach is that they cannot differentiate between isomeric members of certain congeneric series.

Methods for estimating log P have been firmly established, and fragment constants and substituent factors are available for most atomic combinations. However, these estimation methods ignore the effects of substituent interactions, and corrections for such effects are not currently available. Thus, estimated log P can be used confidently for compounds containing mono-functional substitutions, while for others with mixed substitution the results may be questionable. Because of this, the developers of these estimation methods have recommended that experimental log P values, rather than estimated ones, be used wherever possible (16). However, because experimental values are often unavailable, many QSAR workers continue to use the estimated ones. In fact, in deriving Model 1, Hansch et al (1) used estimated log P values for 133 compounds and experimental log P values for only 23 compounds. Yalkowski and Valvani (11,12) used estimated log P values for all compounds in deriving their two models.

In the LSER approach, values have been assigned for some compounds; for others, rules are becoming available for estimating the parameters. In many cases, established values are not yet available, severely limiting the utility of this otherwise attractive model. The solvatochromic parameters are at an evolving stage, and currently, established values for about 600 chemicals are believed to be available. The rules for their estimation are far from being rigid or complete, and many exceptions and special cases have to be taken into account, which demands considerable expertise and insight. The practical implementation of this approach would need a relatively large database of parameters as well as rigid parameter estimation rules. However, if established values or consistent rules for their estimation become available, log S could be calculated using hand calculators, whereas all the other approaches would need a computer program with some form of graphic input capability to describe the structure.

TABLE I. Summary of Solubility Models Derived From Three Major Approaches

| Model | Study (Year) | Solutes | Chemicals in Data Set | Descriptors*: Ex | As | Es | Ca | Total | Corr. Coeff. r | Adjusted r² | Standard Error | Range** | Predictions where error > 0.3: N | % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Hansch et al (1968) | Mixed Liquids | 156 | - | - | 1 (log P) | - | 1 | 0.935 | 0.873 § | n/r | 6 | 47 | 30 |
| 2 | Yalkowski and Valvani (1980) | Mixed Solids & Liquids | 167 | 1 (MP) | 1 (ΔSf) | 1 (log P) | - | 3 | 0.994 | 0.987 § | 0.242 | 8 | 43 | 26 |
| 3 | Yalkowski et al (1983) | Aromatic Solids & Liquids | 164 | 1 (MP) | - | 1 (log P) | - | 2 | 0.977 | 0.953 § | n/r | 11 | 94 | 57 |
| 4 | Kamlet et al (1986) | Aliphatic Liquids | 115 | - | 3 (SCP) | - | - | 3 | 0.994 | 0.987 § | 0.153 | 8 | 8 | 7 |
| 5 | Kamlet et al (1986) | Aromatic Solids & Liquids | 70 | 1 (MP) | 2 (SCP) | - | - | 3 | 0.992 | 0.979 § | 0.216 | 8 | 12 | 17 |
| 6 | Nirmalakhandan and Speece (1988) | Mixed Solids & Liquids | 470 | - | - | - | 3 (⁰χ, ⁰χᵛ, Φ) | 3 | 0.990 | 0.976 | 0.327 | 12 | 122 | 25 |

* Types of descriptors: Ex = Experimental; As = Assigned; Es = Estimated; Ca = Calculated. MP = melting point; P = octanol/water partition coefficient; SCP = solvatochromic parameters; ΔSf = entropy of fusion; ⁰χ, ⁰χᵛ = molecular connectivity indexes; Φ = polarizability parameter.
** Range of log S shown in orders of magnitude; n/r = not reported in original study; § = not reported in original study, but calculated in this study.
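The statistical indicators used in this comparison can be recomputed from the values in Table I, as in the minimal sketch below. The chapter does not state which formula was used for the adjusted r², so the standard definition, 1 - (1 - r²)(n - 1)/(n - k - 1), is assumed here, and small differences from the tabulated values should be expected. The function names and the dictionary layout are illustrative.

```python
def unexplained_variance(r: float) -> float:
    """Fraction of variance not explained by the fit: UV = 1 - r^2."""
    return 1.0 - r ** 2


def adjusted_r2(r: float, n: int, k: int) -> float:
    """Standard adjusted r^2 for n compounds and k descriptors (assumed form)."""
    return 1.0 - (1.0 - r ** 2) * (n - 1) / (n - k - 1)


def percent_above(n_above: int, n: int) -> float:
    """Percent of compounds whose fitting error exceeds 0.3 log units."""
    return 100.0 * n_above / n


# model: (r, chemicals in data set, total descriptors, N with error > 0.3),
# all taken from Table I.
MODELS = {
    1: (0.935, 156, 1, 47),
    2: (0.994, 167, 3, 43),
    3: (0.977, 164, 2, 94),
    4: (0.994, 115, 3, 8),
    5: (0.992, 70, 3, 12),
    6: (0.990, 470, 3, 122),
}

for model, (r, n, k, n_above) in MODELS.items():
    print(model,
          round(100 * unexplained_variance(r), 1),   # UV in percent
          round(adjusted_r2(r, n, k), 3),
          round(percent_above(n_above, n)))

# An error of 0.3 log units corresponds to a factor of about 2 in S.
print(round(10 ** 0.3, 2))
```

For Model 1 this reproduces an unexplained variance above 10% and an adjusted r² close to the tabulated 0.873.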
Range Of Applicability. Considering the heterogeneity of the training set, Model 6 (connectivity-polarizability approach), covering the largest number of aliphatic and aromatic solid and liquid solutes with just one equation, appears to be very broad-based. In terms of the range of numerical values of S covered, Model 6 again ranks high, covering over 12 orders of magnitude. The LSER approach requires two markedly different equations for the aliphatic and aromatic compounds. However, within each class, the LSER models are the only ones reported that contain amines, nitro compounds, etc. in the training set. Even though log P can be used to predict solubility for aliphatic and smaller aromatic molecules, as shown by Models 1 and 2, a separate equation, as shown by Model 3, gives a better fit for the larger substituted aromatic molecules.

Physical Significance Of The Descriptors. When comparing different QSAR models, one of the important points to be considered is the physical significance of the descriptors. Among those used in QSAR models for solubility, the solvatochromic parameters seem to be the most fundamental and physically significant. They are useful in understanding the solution process (at a molecular level) by identifying and resolving it into three steps. Such knowledge could be beneficial in understanding other solute-solvent related phenomena (e.g. solubility in blood). A significant feature of this approach is that, for the first time, solubility has been resolved quantitatively into more fundamental properties of the solute. Further, since the solvatochromic parameters ($V_I/100$, $\pi^*$, $\beta_m$) are scaled to be roughly the same order of magnitude, the relative importance of each parameter in governing solubility is clearly shown by its coefficient in these equations. The fact that the same solvatochromic parameters are used to model a variety of solubility-related properties shows that this approach is sound and fundamentally very strong.

The connectivity indices and the polarizability parameters, however, relate a solute's solubility directly to its molecular structure, and thus could be more useful in the design and evaluation of new chemicals. A particular drawback of the polarizability parameter used here is that, unlike the LSER descriptors, it is not universally applicable to all solute-solvent interactions. It has to be defined and optimized for each property being studied. The log P descriptor is purely empirical and does not carry any direct mechanistic significance in relation to the solute's molecular structure. Further, since Model 1 is significantly improved by including melting point data, it can be noted that log P alone does not encode sufficient information relating to aqueous solubility.

Predictive Ability Of The Models. The applicability of the models to estimating the aqueous solubility of new compounds is one of the important points to check when comparing different models. In practice, the models can be verified by testing on existing compounds which were not included in deriving the original model. In this regard, the predictive ability of the connectivity-polarizability model and the flexibility of the approach in accommodating new compounds have been amply demonstrated (8-10). One of the main features of this approach is that compounds with mixed substitutions and structures can also be satisfactorily modeled, provided the substituents and structures are adequately represented individually in the training set. To evaluate the predictive ability of the models discussed above, we have assembled a testing set of ten compounds not included in the original training sets, but similar to them in structure and heteroatom content. The performance of each of the three approaches could thus be compared on a common basis.
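Each model in the predictive test is summarized by an average error and an absolute average error over the compounds to which it applies. The minimal sketch below computes both for Model 6 from the experimental and predicted log S values listed in Table II. The sign convention (experimental minus predicted) is an assumption inferred from the tabulated errors, and the variable names are illustrative.

```python
# Experimental log S and Model 6 predictions for the ten test compounds,
# in the order listed in Table II.
exp_log_s = [-2.42, -2.50, -2.55, -0.63, -0.61,
             -5.63, -7.25, -7.78, -8.26, -9.42]
pred_model6 = [-2.47, -2.37, -2.38, -0.85, -0.68,
               -5.98, -7.05, -8.12, -8.65, -9.71]

# Assumed convention: error = experimental - predicted.
errors = [e - p for e, p in zip(exp_log_s, pred_model6)]

average_error = sum(errors) / len(errors)
absolute_average_error = sum(abs(x) for x in errors) / len(errors)

print(round(average_error, 2), round(absolute_average_error, 2))
```

The signed average shows any systematic bias, while the absolute average shows the typical size of the error; a model can have a near-zero signed average and still predict individual compounds poorly.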
The rationale for picking these particular ten compounds is as follows. The first three are alcohols, which have been adequately represented in the training sets of all three approaches. The next two compounds are halogenated esters, which carry combinations of structures, atoms and functional groups represented in the training sets of all three approaches. Finally, the five PCBs represent typical compounds of environmental concern, a class for which there is a severe lack of data. The results of this predictive test are shown in Table II. While this test cannot be considered the ultimate test for a universal solubility model, it is hoped that it will help in evaluating the different approaches and in identifying future research areas in this and related topics.

TABLE II. Results of Predictive Tests on Six Models

| Compound | Exp. log S* | Log P Approach: Model 1 | Model 2 | Model 3 | LSER Approach: Model 4 | Model 5 | Con-Pol Approach: Model 6 |
|---|---|---|---|---|---|---|---|
| 1,1-Diethyl pentanol | -2.42 | -2.82 (0.40) | -2.30 (-0.12) | n/a | -1.94 (-0.48) | n/a | -2.47 (0.05) |
| 3,5,5-Trimethyl hexanol | -2.50 | -2.43 (-0.07) | -2.01 (-0.49) | n/a | -2.14 (-0.36) | n/a | -2.37 (-0.14) |
| 2,6-Dimethyl-3-heptanol | -2.55 | -2.61 (0.06) | -2.14 (-0.41) | n/a | -2.23 (-0.32) | n/a | -2.38 (-0.17) |
| 2-Bromo ethyl acetate | -0.63 | -0.41 (-0.22) | -0.42 (-0.21) | n/a | -1.18 (0.55) | n/a | -0.85 (0.22) |
| 2-Chloro ethyl acetate | -0.61 | -0.37 (-0.24) | -0.51 (-0.10) | n/a | -1.28 (0.67) | n/a | -0.68 (0.07) |
| 2,6-Cl biphenyl | -5.63 | n/a | -4.73 (-0.90) | -4.67 (-0.96) | n/a | -5.47 (-0.16) | -5.98 (0.34) |
| 2,2',4,5'-Cl biphenyl | -7.25 | n/a | -7.17 (-0.08) | -6.84 (-0.41) | n/a | -6.77 (-0.48) | -7.05 (-0.21) |
| 2,2',3,3',6,6'-Cl biphenyl | -7.78 | n/a | -7.44 (-0.34) | -7.05 (-0.73) | n/a | -8.56 (0.78) | -8.12 (0.34) |
| 2,2',3,3',4,4',6-Cl biphenyl | -8.26 | n/a | -7.61 (-0.65) | -7.20 (-1.06) | n/a | -9.15 (0.89) | -8.65 (0.39) |
| 2,2',3,3',4,5,5',6,6'-Cl biphenyl | -9.42 | n/a | -9.89 (0.47) | -9.20 (-0.22) | n/a | -10.82 (1.40) | -9.71 (0.29) |
| Average error | | -0.01 | -0.28 | -0.68 | 0.01 | 0.49 | 0.12 |
| Absolute average error | | 0.20 | 0.38 | 0.68 | 0.48 | 0.74 | 0.22 |

* S in moles/L; n/a = not applicable. Each model entry gives the predicted log S followed by the error in parentheses.

Models 2 and 6 are applicable to all ten compounds selected, while the other four models are applicable only to sub-groups. Of the former two, Model 6 predicts reasonably well, with the error averaging 0.12 log units. Models 3 and 5 performed poorly for all the compounds tested. Model 1 predicted very well for the five compounds tested. Although the average error of Model 4 was 0.01, its predictions were inconsistent (absolute average error of 0.48 log units).

CONCLUSIONS
The six models discussed here have their own merits and demerits. From the analysis reported above, the following general conclusions can be drawn. Model 1, based on the log P approach, appears to be the simplest to use if either experimental or estimated log P values are readily available. In the absence of experimental log P data, the result has to be accepted with a high degree of uncertainty, because estimated log P values can introduce additional errors. If log P and melting point data are available, then Models 2 and 3 could yield more accurate results, particularly if the solute is a solid at room temperature. The LSER approach (Models 4 and 5) has great potential in predicting solubility, but at present its practical applicability is limited by the unavailability of appropriate parameters or of a firm set of rules for their estimation.
If further research is focused on rectifying these shortcomings, and if larger data sets could be tested, the LSER approach would probably emerge as the most suitable tool for predicting aqueous solubility. Model 6, based on the connectivity-polarizability approach, has the advantage of requiring neither experimental data nor in-depth knowledge of solute-solvent interaction parameters. Further, the wide coverage and the good quality of fit of this model imply greater confidence in its predictive ability.

Literature Cited
1. Hansch et al, J. Org. Chem., 1968, 33, 347-350.
2. Miller et al, J. Chem. Eng. Data, 1984, 29, 184-190.
3. Mackay et al, Chemosphere, 1980, 9, 701-711.
4. Baker et al, Phys. Chem. Liq., 1987, 16, 279-292.
5. Baker et al, Quant. Struct. Act. Relat., 1984, 3, 10-16.
6. Kamlet et al, Jour. Pharm. Sci., 1986, 75, 338-349.
7. Kamlet et al, Jour. Phy. Chem., 1987, 91, 1996-2004.
8. Nirmalakhandan, N.; Speece, R. E., Env. Sci. & Technol., 1988, 22, 328-338.
9. Nirmalakhandan, N. Ph.D. Thesis, Drexel University, Philadelphia, 1988.
10. Nirmalakhandan, N.; Speece, R. E., Env. Sci. & Technol., 1989, 23, 708-713.
11. Yalkowski, S. H.; Valvani, S. C., Jour. Pharm. Sci., 1980, 69, 912-922.
12. Yalkowski et al, Residue Review, 1983, 85, 43-55.
13. In Horvath, A. L. Halogenated Hydrocarbons, Marcel Dekker, Inc., NY, 1982.
14. Kier, M. J.; Hall, L. H., Molecular Connectivity in Chemistry and Drug Design, Academic Press, NY, 1976.
15. Kier, M. J.; Hall, L. H., Molecular Connectivity in Structure Activity Analysis, Research Studies Press Ltd., England, 1986.
16. Leo et al, J. Med. Chem., 1975, 18, 865-868.

RECEIVED August 24, 1989