Article Cite This: J. Phys. Chem. B XXXX, XXX, XXX−XXX
pubs.acs.org/JPCB
Predicting Octanol−Water Partition Coefficients: Are Quantum Mechanical Implicit Solvent Models Better than Empirical FragmentBased Methods? Varun Kundi and Junming Ho* School of Chemistry, University of New South Wales, Sydney, NSW 2052, Australia
Downloaded via UNIV AUTONOMA DE COAHUILA on July 28, 2019 at 17:21:45 (UTC). See https://pubs.acs.org/sharingguidelines for options on how to legitimately share published articles.
S Supporting Information *
ABSTRACT: In this work, we examined the performance of contemporary quantum mechanical implicit solvent models (SMD, SM8, SM12, and ADF-COSMO-RS) and empirical fragment-based methods for predicting octanol−water partition coefficients (log Pow). Two test sets were chosen: the first is composed of 34 organic molecules from a recent study by Mobley et al. J. Chem. Theory Comput, 2016, 12, 4015−4024, and the second set is based on a collection of 55 fluorinated alkanols and carbohydrates from Linclau et al. Angew. Chem., Int. Ed., 2016, 55, 674−678. Our analysis indicates that the errors in the solvation free energies of implicit models are reasonably systematic in both solvents such that there is substantial cancellation of errors in the calculation of transfer free energies. Overall, implicit solvent models performed very well across the two test sets with mean absolute errors (MAEs) of about 0.6 log unit and are superior to explicit solvent simulations (GAFF and GAFF-DC). Interestingly, the best performers were empirical fragment-based methods, including ALOGP and miLOGP with significantly lower MAEs (0.2 to 0.4 log unit). The ALOGP method was further tested against the recent SAMPL6 log Pow challenge consisting of 11 drug-like molecules where it obtained an MAE of 0.32 log unit compared to the best-performing COSMOtherm model (0.31 log unit).
■
require significantly less time and resources3 and have become the method of choice for log Pow estimation. There exist a number of computational approaches for calculating log Pow, which can be broadly divided into (1) empirical fragment-based methods, (2) implicit solvation models, and (3) explicit solvent statistical free energy methods. In fragment-based methods, the log Pow is assembled from contributions arising from the fragments (atoms or functional groups) that make up the molecule and is therefore computationally very efficient. This approach is entirely empirical as the fragment coefficients and correction factors are normally derived by regression or neural network models against a large dataset of experimental values (typically >10,000 compounds) such as the PHYSPROP database.4 Examples of such methods include ALOGP,5−7 ALOGPS,8−11 miLOGP,12,13 MLOGP,14−16 WLOGP,17 and XLOGP3.16,18 For this reason, the success of these methods depends critically on the similarity between the test molecules and the training set. For example, these methods may not yield reliable predictions for ion-bound (e.g., transition metal complexes) systems or organometallic compounds, are unable to discriminate between isomers (e.g., diastereomers) or different
INTRODUCTION The equilibrium distribution of a solute in a system composed of two completely immiscible or partially miscible solvents is described by the partition coefficient and is written as the ratio of the concentrations of solute in the two solvents. P=
[solute]organic phase [solute]aqueous phase
(1)
This differs from the distribution coefficient (D), which takes into account all ionized and neutral forms of the solute. One of the most common organic solvents for the study of biological systems is 1-octanol as it provides a simple model of more complex systems such as lipid molecules in biological membranes. The resulting partition coefficient is used as a measure of the lipophilicity of the solute and is normally reported as the logarithm of P, that is, log Pow. In Lipinski’s famous “Rule of 5”, the log Pow is identified as one of the key physicochemical parameters for determining the membrane permeability of drug-like compounds.1 However, the number of compounds for which log Pow measurements are needed is significantly higher compared to the available experimental log Pow data.2 This is due, in part, to the fact that experimental determination of log Pow normally entails synthesis and purification and is less straightforward for nonUV-active compounds. On the other hand, in silico methods © XXXX American Chemical Society
Received: April 30, 2019 Revised: June 25, 2019 Published: June 25, 2019 A
DOI: 10.1021/acs.jpcb.9b04061 J. Phys. Chem. B XXXX, XXX, XXX−XXX
Article
The Journal of Physical Chemistry B conformations of the same molecule, and cannot account for nonstandard medium effects (e.g., ionic strength). On the other hand, implicit and explicit solvent models are physicsbased methods that are broader in scope, and they predict log Pow through the calculation of absolute or relative solvation free energies of the solute in the respective solvents. In explicit solvent models, molecular dynamics (MD) or Monte Carlo (MC) simulations are carried out in conjunction with statistical free energy methods (e.g., umbrella sampling or thermodynamic perturbation) to determine the absolute or relative solvation free energies.19 These are the most expensive of the three methods because of the need for extensive configurational sampling. However, they are also considered the most robust because solute−solvent interactions are modeled explicitly, and the quality of these simulations is amenable to systematic improvement through the use of more robust force fields or potentials (e.g., hybrid quantum mechanics/molecular mechanics (QM/MM) potentials). By comparison, in implicit solvent models, the solvent degrees of freedom are averaged out so that the solvation free energy of a solute in a given conformation is determined from the energies of the corresponding stationary points on the gasphase and solution-phase potential energy surface, that is, only two points in the configuration space, and are therefore computationally very efficient. Quantum mechanical (QM) implicit solvent models such as SMD,20 SM8/12,21,22 and COSMO-RS23 have been optimized at relatively modest levels of the electronic structure theory such as Hartree−Fock (HF) and/or density functional theory (DFT) methods to reproduce experimental solvation free energies of solutes in aqueous and organic solvents. This approach can be further sped up through the use of molecular mechanics potential energy functions in lieu of QM calculations. For example, the generalized Born/ surface area (GB/SA) methods have been implemented with the OPLS-AA and the Merck molecular force fields (MMFFs) for predicting log Pow.24,25 Similarly, the COSMOfrag26 method was also developed to circumvent expensive DFT calculations (in COSMO-RS) through the use of sigma profiles of precalculated molecules within a database of more than 40,000 compounds. There have also been a number of studies that adopt a hybrid strategy for predicting log Pow by using MD simulations to generate an ensemble of conformations that are used to calculate implicit solvation free energies.27,28 All three classes of methods, particularly, fragment-based methods, have been widely used for the prediction of log Pow. For example, Tetko and co-workers have carried out an exhaustive study of fragment- or topological-based prediction methods and found that the performance of these models deteriorates when they are applied to in-house datasets compared to public datasets.2 The better performance for the public dataset has been attributed to the fact that a significant fraction of these compounds are also present in the training sets for some of the log Pow prediction methods. Quite recently, explicit solvent simulations have also been applied to predict the partition coefficients of n-alkanes,29,30 polychlorinated biphenyls and nitroaromatics,31,32 imidazolium-based ionic liquids,33 amino acids,34 and organic solutes.35−41 Potoff and co-workers reported calculated log Pow values of a dataset of 12 nitroaromatic compounds with a mean absolute deviation (MAD) of 0.2 log unit using the transferable potentials for phase equilibria (TraPPE).31 Using the general AMBER force field (GAFF) and dielectric-corrected GAFF (GAFF-DC), Mobley’s group also recently calculated log Pow
values for a more diverse set of functional groups (41 solutes) with MAEs42 of 0.73 and 0.71 log unit, respectively.37 Interestingly, there are fewer studies that directly assess the performance of implicit solvation models for the calculation of log Pow, although this can be determined from any models that have been parameterized to calculate solvation free energies in water and 1-octanol. Truhlar and co-workers developed SMx models (e.g., x = 5CR,43 5.42,44 5.42R,45 SMD,20 and SM8/ 1221,22 models) for the calculation of solvation and transfer free energies in aqueous and organic solvents;46,47 however, their performance for predicting log Pow remains largely unexplored. Michaliḱ and Lukeš examined the use of SMD, IEF-PCM, and CPCM implicit solvent models in conjunction with various DFT methods to calculate log Pow values of 27 alkyl alcohols and found that predictions made by the latter two models were in poor agreement with the experiment presumably because they have been designed mainly for the prediction of aqueous solvation free energies.48 Other implicit solvent models that can predict solvation free energies in 1octanol (and/or general organic solvents) include the Miertus−Scrocco−Tomasi (MST),49 generalized Born/surface area (GB/SA),50 as well as the COSMO-RS23 and COSMOfrag26 methods. While in silico methods are a promising route toward log Pow prediction, there are currently no clear guiding principles to choose among the bewildering array of different methods. In particular, explicit solvent protocols require considerably more computational resources than implicit solvent and fragmentbased methods, but it is unclear if the extra cost necessarily translates to higher accuracy. While a previous study suggests that this is not the case,41 it was based on a relatively small sample of seven solutes, and a broader study is needed to determine the generality of this finding. Recently, a wide variety of computational methods were also used to predict cyclohexane−water distribution coefficients (log D) in the SAMPL5 challenge.51 As noted above, predicting log D is even more challenging because of the need for accurate pKa values and ionization states of the solute. In this paper, we present an assessment study that directly compares the performance of the above-mentioned methods on a common dataset of experimental log Pow values. The first part of this paper will first focus on 34 organic solutes from the Mobley dataset where log Pow values calculated using explicit solvent (GAFF) simulations are available for comparison. A selection of commonly used QM implicit solvent models (SMD, SM8, SM12, and COSMO-RS) and fragment-based methods (ALOGP,5,7,52 ALOGPS,8−11 iLOGP,24,53 miLOGP,12,13 MLOGP,14−16 WLOGP,16,17 XLOGP3,16,18 and Silicos-IT LOGP16,54) is examined in this work. These fragment-based methods are chosen on the basis of recent benchmarking studies where these methods generally performed the best compared to other similar well-established empirical methods.2,24 The second part of the paper focuses on a dataset of fluorinated alcohols and carbohydrates where their log Pow values were only recently reported and are therefore unlikely to be present in the training set of the empirical models. As such, this dataset should provide an objective assessment of these methods. Finally, the best model(s) identified through this study was further applied to the very recent SAMPL6 log Pow challenge55 consisting of 11 kinase inhibitor fragment-like small molecules. It is envisaged that this work will allow chemists to make informed decisions B
DOI: 10.1021/acs.jpcb.9b04061 J. Phys. Chem. B XXXX, XXX, XXX−XXX
Article
The Journal of Physical Chemistry B
Figure 1. Molecules from dataset M.
concerning the choice of log Pow prediction method for their molecules of interest.
other hand, a significant portion of dataset L is composed of new measurements determined recently where only 14 (out of the 55) solutes have previously measured log Pow values.56 All experimental log Pow values were measured at 298 K. Theory and Methods. The partition coefficient log Pow is related to the transfer free energy between 1-octanol and water
■
EXPERIMENTAL DESIGN AND COMPUTATIONAL DETAILS Datasets. We considered 34 organic molecules from the dataset recently studied by Mobley and co-workers37 and 55 fluorinated molecules from Linclau et al.56 These are labeled as datasets M and L, and the chemical structures of these compounds are provided in Figures 1 and 2, respectively. As shown, the dataset M consists of a broad range of functionalities, including amines, ketones, aliphatic and aromatic alcohols, and amides; however, the experimental log Pow values for these compounds were reported as early as 1971,57−59 and it is likely that these molecules are part of the training set for the empirical fragment-based methods. On the
log Pow =
−ΔΔGtransfer RT ln(10)
(2)
where ΔΔGtransfer is simply the difference in free energy of solvation of the solute in the respective solvents (ΔGw and ΔGo). ΔΔGtransfer = ΔGo − ΔGw
(3)
The individual solvation free energies can be calculated using QM implicit solvent models60,61 C
DOI: 10.1021/acs.jpcb.9b04061 J. Phys. Chem. B XXXX, XXX, XXX−XXX
Article
The Journal of Physical Chemistry B
Figure 2. Molecules from dataset L.
ΔGS* = Esoln(R soln) − Egas(R gas) + Gnes(R soln)
2X/6-31G(d) level, and the COSMO-RS values were based on the BP/TZP method. Geometry optimizations and systematic conformer searches were carried out to locate the lowest energy conformer in the gas phase and in 1-octanol and water; the latter solution-phase calculations were performed in SMD. The SM8, SM12, and COSMO-RS values were obtained from single-point calculations on the SMD geometries. Vibrational analysis confirmed that all optimized geometries have zero imaginary frequencies and are true minimum-energy structures. All SMD and gas-phase geometry optimizations were carried out using the Gaussian16 program,65 while the SM8/SM12 and COSMO-RS calculations were performed using QChem66 and ADF,63,67−69 respectively. It should be noted that the ADF implementation of COSMO-RS is based on the 1998
(4)
where Esoln(Rsoln) and Egas(Rgas) are the solution- and gasphase electronic energies calculated on the lowest-energy geometries in the respective phases so that effects due to solvation change in geometry are included. Gnes(Rsoln) represents the nonelectrostatic contributions (cavitation and dispersion−repulsion). The superscript “*” indicates that these solvation free energies are calculated at 298 K and with respect to a standard state of 1 M in the gas and solution phases. All QM implicit solvent models were applied at the levels of theory that are consistent with their parameterization. The SMD, SM8, and SM12 (Merz−Kollman62,63 and CM5 charges64) solvation free energies were obtained at the M06D
DOI: 10.1021/acs.jpcb.9b04061 J. Phys. Chem. B XXXX, XXX, XXX−XXX
Article
The Journal of Physical Chemistry B
Table 1. Signed Errors (kJ mol−1) in Solvation Free Energies for 1-Octanol and Water Using SMD, SM8, SM12, and COSMORS Solvation Modelsa signed error (calculated − experiment)/kJ mol−1 SMD
b
SM8b
SM12-MKb
SM12-CM5b
COSMO-RSb
solute
1-octanol
water
1-octanol
water
1-octanol
water
1-octanol
water
1-octanol
water
acetonitrile acetophenone aniline anisole anthracene benzene benzonitrile bromobenzene bromomethane chloroform cyclohexane dibromomethane dichloromethane dimethyl disulfide dimethyl ether fluorobenzene hydrazine m-cresol m-xylene nitrobenzene o-cresol o-xylene p-bromophenol p-cresol p-xylene piperidine pyridine tetrafluoromethane tetrahydrofuran toluene trimethylamine urea 1,2-dichlorobenzene 1,2-dichloroethene 1,4-dichlorobenzene 1,4-dioxane 2-bromopropane 2-chloropropane 2-methyl-1-nitrobenzene 2-methylaniline 2-methylpropane 2-methylpropene 2-methylpyrazine 2-methylpyridine 2-nitropropane tetrahydropyran tetramethylsilane thioanisole thiophenol propene propionitrile propyne ethyne ethane ethene MSE (ΔGS)c MAE (ΔGS)c
[−4.36] 0.02 −2.49 3.01 8.76 −1.83 1.88 −0.46 −2.97 −4.04 2.03 −4.75 −7.10 −0.65 2.05 0.43 [−5.46] [1.29] 3.01 8.92 3.07 1.80 5.83 4.24 2.77 4.74 1.96 [−1.08] [−1.05] 0.86 2.11 0.93 1.76 1.19 −0.03 −1.57 −4.47 −5.53 8.16 −0.27 0.40 2.01 [4.24] 4.62 −0.83 −0.68 7.06 1.71 −0.84 −0.21 [−3.59] −2.57 −2.41 0.66 −0.02 0.59 2.70
[2.55] 0.87 −0.81 0.69 3.75 −1.48 4.19 −1.75 −1.07 −1.30 1.07 −1.64 −3.62 1.03 4.00 0.45 [2.30] [−1.47] 1.84 6.19 1.11 0.97 0.82 1.61 1.84 6.04 1.64 [1.27] [3.48] 0.31 1.22 3.32 −0.18 1.74 −1.53 −0.05 −2.42 −2.59 4.72 1.42 −0.50 0.91 [−1.13] 3.82 −0.70 4.19 3.83 1.19 −1.26 −0.47 [3.92] −1.24 −1.78 0.07 0.17 0.94 1.90
[−2.52] −2.92 0.21 2.36 5.87 −2.24 2.87 −0.80 1.33 −0.99 −1.22 −0.70 [−1.88] 1.41 1.82 0.18 −0.34 2.84 1.77 1.89 4.68 0.87 8.20 5.65 1.35 2.19 −0.57 −2.09 [−3.14] 0.01 0.65 −4.73 1.45 5.65 0.78 [−2.54] −0.94 −1.59 3.11 2.32 −0.19 3.20 1.13 1.47 −1.14 [−2.21] 6.71 3.62 2.29 [1.34] [−0.58] 1.88 6.97 −0.24 2.82 1.04 2.24
[1.65] 0.01 3.12 0.55 4.35 −1.06 4.41 −1.99 1.15 1.43 −3.18 1.31 [1.07] 0.74 1.98 0.19 7.00 1.47 0.31 1.42 4.17 0.07 4.41 4.34 0.16 3.53 1.17 −0.85 [2.27] −0.27 −2.56 −4.73 −0.60 4.47 −1.82 [1.22] −0.46 −0.28 1.87 5.21 −3.76 −0.59 −0.57 2.61 0.79 [3.52] −0.79 3.15 2.82 [−1.26] [4.58] 1.08 4.80 −3.13 0.21 1.12 2.08
−6.02 −7.88 −11.52 −6.88 −10.30 −10.83 −5.27 −5.85 −3.24 −6.01 2.92 −3.67 −8.32 −5.18 0.18 −8.09 −10.57 −8.36 −6.21 −1.94 −6.82 −6.43 −0.49 −5.68 −6.22 3.09 −7.90 [−2.75] −1.10 −9.25 −4.44 −11.31 −5.37 −2.13 −6.11 −1.43 −1.15 −3.84 −1.17 −7.10 −0.73 −2.62 −1.57 −4.49 −2.12 0.63 [2.05] −8.07 −8.20 −5.52 −3.86 −13.77 −16.31 0.74 −6.85 −5.11 5.46
−2.59 −6.13 −7.93 −10.88 −11.63 −9.44 −4.66 −6.28 −2.88 −5.25 −0.36 −1.19 −6.51 −6.86 −2.93 −7.35 −3.30 −9.82 −7.47 −4.61 −6.90 −6.99 −3.55 −7.17 −7.15 2.48 −7.30 0.43 −0.60 −9.41 −9.24 −10.12 −7.64 −3.75 −8.21 −6.58 −0.71 −3.55 −4.78 −3.61 −4.57 −5.79 −5.79 −4.42 −3.92 1.42 [−4.00] −8.84 −10.67 −7.41 −0.26 −13.21 −16.62 −1.77 −8.31 −5.76 5.91
−5.20 −3.47 [−2.82] −0.88 2.89 −4.26 0.56 −1.07 0.29 −3.08 0.11 −0.70 −3.64 −0.36 −0.69 −1.38 [−2.57] −0.30 0.29 0.40 1.44 −0.49 6.14 2.60 0.09 2.05 −2.43 −4.35 −2.29 −1.80 −2.10 −10.34 −0.34 2.55 −0.90 −1.90 0.25 −1.46 [1.73] 0.09 −0.48 [1.67] 0.02 0.40 −3.35 [−1.51] [5.36] 1.07 0.80 −0.49 [−2.86] −1.41 1.63 −1.04 0.10 −0.68 1.84
−2.04 −1.71 [2.67] −3.53 2.88 −2.40 1.48 −0.83 0.87 −1.29 −3.74 2.49 −0.99 −1.47 −2.30 −0.42 [7.40] −0.87 −0.91 −3.08 2.07 −0.95 4.10 2.11 −0.81 3.05 −0.46 −3.13 0.41 −1.61 −5.15 −7.91 −1.80 1.57 −1.99 −2.60 0.75 −0.82 [−2.52] 5.11 −5.02 [−1.95] −2.59 1.43 −6.07 [1.06] [−1.44] 1.12 −0.41 −2.70 0.64 −0.66 1.67 −4.31 −1.64 −0.71 2.25
−1.86 −2.05 −2.48 −0.88 3.62 −0.93 −1.89 −1.69 −0.54 −1.86 −1.11 −0.96 −1.69 −0.20 −0.21 −1.72 −10.44 −3.97 −0.07 −0.39 −2.13 −0.74 −1.67 −1.28 −0.25 1.35 −0.91 −6.97 −2.61 −0.32 2.41 −8.73 −1.89 −0.17 −3.36 −2.13 −2.51 −2.68 −1.68 −0.80 −1.39 0.36 1.25 −0.84 −1.95 −3.10 −6.82 −0.12 0.34 −0.77 −1.96 −1.51 −3.24 0.67 −1.37 −1.66 1.97
−3.00 −2.86 0.79 −0.21 2.34 −0.39 −1.85 −2.03 −0.99 −0.47 −1.11 0.11 −0.87 2.30 0.74 −0.97 −11.92 −1.93 0.24 −0.36 0.85 0.00 −1.02 0.71 0.30 6.41 −0.15 −2.35 −0.36 0.06 8.90 −12.06 −2.68 −0.96 −3.82 −2.87 −2.25 −1.86 −2.68 2.79 −1.45 −0.41 −0.63 0.20 −0.27 0.04 −5.71 0.68 −0.13 −1.02 −1.22 −1.28 −2.92 −0.27 −1.52 −0.93 1.92
E
DOI: 10.1021/acs.jpcb.9b04061 J. Phys. Chem. B XXXX, XXX, XXX−XXX
Article
The Journal of Physical Chemistry B Table 1. continued signed error (calculated − experiment)/kJ mol−1 SMD solute MAE (ΔΔGtransfer)c AEmax (ΔΔGtransfer)c
1-octanol 2.15 7.72
b
SM8b water
1-octanol
SM12-MKb water
1-octanol
2.11 7.51
2.06 7.27
water
SM12-CM5b 1-octanol 2.10 9.97
water
COSMO-RSb 1-octanol
water
1.28 6.49
a Square brackets denote solutes that display errors in solvation free energies that are larger than 1 kJ mol−1 and with opposite sign in 1-octanol and water. bWith the exception of COSMO-RS (BP/TZP), all models were applied at the M06-2X/6-31G(d) level. cMSE, mean signed error; MAE, mean absolute error; AEmax, maximum absolute error.
version,70 and there have been several published and unpublished revisions to the original COSMO-RS method, which are implemented in the COSMOtherm program.71 The performance of the two implementations will be compared for the SAMPL6 dataset of 11 drug-like molecules (vide infra). We also calculated the log Pow using a number of fragmentbased methods, viz., ALOGPS, 8 − 1 1 miLOGP, 1 2 , 1 3 ALOGP, 5,7,52 iLOGP, 16 XLOGP3, 16,18 MLOGP, 14−16 WLOGP,16,17 and Silicos-IT LOGP.16,54 A brief description of each method is provided below: 1. iLOGP: This method was developed in 2014 by Daina et al.24 in which the GB/SA methods have been implemented with the OPLS-AA and the Merck molecular force fields (MMFFs) for predicting log Pow. As such, this is an implicit solvent model that is trained on 11,993 molecules, and the method is available through the SwissADME web tool.16,53 2. ALOGP: This is an atom-based approach developed by Ghose and Crippen in 19865−7,52 and was trained on 893 molecular structures. 3. miLOGP: This is a fragment-based method developed by Molinspiration.12,13 In this method, the log Pow is calculated as a sum of group/fragment contributions and correction factors, and it was trained on more than 12,000 molecular structures. 4. ALOGPS: This is a topological method, which is based on the Associative Neural Network (ASNN) model using 75 E-state indices.8−11 The training of this method was performed using 12,908 molecules from the PHYSPROP database. 5. WLOGP: This is an atom-based approach developed by Wildman and Crippen16,17 in which a new atom-type classification system is used. Using a training set of 9920 molecules, they determined 68 atomic contributions to log Pow. 6. MLOGP: This topological method is developed by Moriguchi and co-workers,14−16 which is trained on 1230 molecular structures. This method is based on a regression constant and a linear relationship having 13 parameters, which describe the topology of the molecule. 7. Silicos-IT LOGP: It is a hybrid method based on 27 fragments and 7 topological descriptors.16,54 8. XLOGP3: This is the third version of the XLOGP.16,18,72 This model is based on two correction factors and a system of 87 fragments. In this method, a library of 8199 reference compounds with measured experimental log Pow values is used. The input structure is compared with the molecular structures in the library to identify a similar reference structure. Corrections are applied to fragments of the input structure that are
■
different from those of the reference structure to obtain an estimate of log Pow.
RESULTS AND DISCUSSION How Systematic Are the Errors in Implicit Solvent Models? Before applying these models to calculate log Pow, it
Figure 3. Lowest-energy conformers of M4 and M6 in the respective phases.
is of interest to examine whether the errors in implicit solvent models are systematic with respect to the calculation of solvation free energies in 1-octanol and water. This is because these models can generally estimate solvation free energies of neutral solutes with an average accuracy of ∼5 kJ mol−1,20,21,73 which already corresponds to ∼0.9 log unit error in log Pow at room temperature. As indicated in eq 3, cancellation of systematic errors should give rise to more accurate log Pow predictions. For a randomly selected set of 55 molecules with experimental solvation free energies in 1-octanol and water, the signed errors in the calculated solvation free energies were determined for each model and are shown in Table 1. Errors larger than 1 kJ mol−1 and with an opposite sign in 1-octanol and water are shown in square brackets. These data indicate that the errors are remarkably systematic for all implicit solvent models, particularly, the SM12-MK and COSMO-RS models. Of the data points displaying errors with opposite sign (in square brackets), only 3, 1, 1, 2, and 0 molecules for the SMD, SM8, SM12-MK, SM12-CM5 and COSMO-RS, respectively, F
DOI: 10.1021/acs.jpcb.9b04061 J. Phys. Chem. B XXXX, XXX, XXX−XXX
Article
The Journal of Physical Chemistry B
Table 2. Experimental and Calculated Log Pow Values Using the SMD Solvation Model and M06-2X/6-31G(d) Level of Theory. Unsigned Errors and Difference in Errors (Log Unit) for Selected Molecules Before and After Performing Conformer Searches in the Respective Phases (1-Octanol and Water). log Pow molecule
experiment
calculateda
calculatedb
unsigned errora
unsigned errorb
difference in error
M4 M6 M31 M32 M33 L8 L9 L11 L12 L13 L21 L25 L29 L30 L32 L33 L34 L35 L36 L39 L40 L41 L46 L47 L48 L49 L50 L51 L52 L53 L54
0.51 0.88 2.75 2.43 3.29 −0.29 −0.26 0.11 0.41 1.20 0.09 1.98 1.82 0.52 0.71 0.87 0.97 1.22 1.72 −1.01 −0.97 −0.11 −2.47 −2.37 −2.36 −2.21 −2.11 −1.45 −1.11 −0.58 −0.32
1.95 1.44 2.61 2.28 3.29 0.83 0.58 0.97 0.86 1.23 1.06 1.86 2.11 1.87 1.51 1.77 1.51 1.85 2.10 0.00 0.08 0.42 −3.23 −1.62 −1.42 −1.51 −1.68 −0.65 −0.70 −0.76 −0.57
1.54 1.04 2.54 2.15 3.34 0.77 0.58 0.97 0.86 1.23 1.00 1.77 1.97 1.87 1.43 1.73 1.51 1.75 1.74 0.14 0.08 0.42 −1.88 −1.68 −1.49 −1.79 −1.73 −1.01 −0.77 −0.62 −0.70
1.44 0.56 0.14 0.15 0.00 1.12 0.84 0.86 0.45 0.03 0.97 0.12 0.29 1.35 0.80 0.90 0.54 0.63 0.38 1.01 1.05 0.53 0.76 0.75 0.94 0.70 0.43 0.80 0.41 0.18 0.25
1.03 0.16 0.21 0.28 0.05 1.06 0.84 0.86 0.45 0.03 0.91 0.21 0.15 1.35 0.72 0.86 0.54 0.53 0.02 1.15 1.05 0.53 0.59 0.69 0.87 0.42 0.38 0.44 0.34 0.04 0.38
0.40 0.41 −0.08 −0.13 −0.05 0.06 0.00 0.00 0.00 0.00 0.06 −0.10 0.15 0.00 0.08 0.05 0.00 0.11 0.36 −0.14 0.00 0.00 0.17 0.06 0.07 0.28 0.05 0.36 0.07 0.14 −0.13
a
Solvation free energies in 1-octanol and water calculated on gas phase global minimum conformer. bSolvation free energies in 1-octanol and water calculated on lowest energy conformers in the respective phases.
showed an error of more than 5.7 kJ mol−1, i.e. 1 log unit, in the transfer free energy (ΔΔGtransfer). Furthermore, the largest mean absolute error (MAE) in transfer free energy displayed by the SMD model is only 2.2 kJ mol−1 (or 0.4 log unit), which is significantly smaller than the mean errors in the calculated solvation free energies reported for these models. The maximum absolute error (AEmax) in transfer free energy is ∼8 kJ mol−1 (1.4 log unit), and for the SMD, SM8, and SM12 models, this corresponds to the hydrazine molecule. From this analysis, it also appears that aliphatic molecules containing the cyano (−CN) group, such as acetonitrile and propionitrile, are problematic for the SMD model. While the SM12-MK model performed very well in terms of error cancellation, it is worth noting that the errors in the absolute solvation free energies for aromatic species (e.g., aniline and pyridine) and alkynes were significantly larger compared to other implicit models. This is reflected in the 2to 3-fold higher MAE values in absolute solvation free energies (ΔGS) for this model (Table 1). Interestingly, these errors are significantly reduced when CM5 charges are employed with the SM12 model, although the MAE for the transfer energies remains similar to SM12-MK. On the basis of this result, we
recommend the use of CM5 charges with the SM12 model for the calculation of solvation free energies. The Use of Lowest-Energy Conformations Improves the Accuracy of Log Pow Calculation. As noted in Figure 1, a number of molecules (e.g., M4 and M6) in the dataset M possess multiple rotatable bonds and may exist in more than one conformation. For solute M4, the conformation in the gas phase differs appreciably from its conformations in water and 1-octanol (Figure 3). If the gas-phase conformer was used for the calculation of solvation free energies in 1-octanol and water, this gives rise to a log Pow value of 1.95 log unit, which differs by 1.44 log unit from the experimental value of 0.51. This error is reduced by 0.40 log unit when the lowest-energy conformers in the respective solvents are used. Table 2 lists the molecules from both datasets where the global minimum energy geometries in 1-octanol and/or water are different to that in the gas phase. For these molecules, the log Pow was determined using two approaches: (a) on gasphase global minimum conformers and (b) on lowest-energy geometries in the respective phases. The difference in the prediction error using the two approaches is provided in the last column, and a positive value implies that approach (b) improved the accuracy of calculated log Pow and vice versa. G
DOI: 10.1021/acs.jpcb.9b04061 J. Phys. Chem. B XXXX, XXX, XXX−XXX
Article
The Journal of Physical Chemistry B
Figure 4. (a, b) Mean absolute errors (MAEs) of all the methods used in this work for both datasets. MAE values are indicated on the top of each bar.
Figure 5. (a, b) Maximum absolute error (AEmax) values of all the methods used in this work for both datasets. The solute IDs corresponding to the AEmax values are indicated on the top of each bar.
that, in cases where the use of approach (b) did not lead to an improvement, the results are rarely worse off by more than 0.1 log unit. These results emphasize the importance of carrying out conformer searches in the respective solvents, and this
Accordingly, it is encouraging to note that the use of approach (b) resulted in a modest improvement in the accuracy of log Pow prediction. Notable examples are solutes M4, M6, L36, and L51. Another encouraging observation is H
DOI: 10.1021/acs.jpcb.9b04061 J. Phys. Chem. B XXXX, XXX, XXX−XXX
Article
The Journal of Physical Chemistry B
state MM-to-QM/MM (B3LYP/GAFF) corrections for predicting the partition coefficients of amino acids in water− chloroform.74 For the test set of amino acids, the authors found that QM/MM corrections did not necessarily provide an improvement over the MM results. This was attributed to errors in the QM/MM approximation (e.g., treatment of the nonbonded interactions between the QM and MM regions). Another possible source of error in both implicit and explicit solvent models is the assumption of complete immiscibility; that is, the calculation of solvation free energies in wet 1octanol may improve the calculated log Pow values.41 In any case, the current results indicate that QM implicit solvent models are cost-efficient alternatives to explicit solvent models for the calculation of log Pow. It is of interest to examine the solutes that have large prediction errors by the implicit solvent models. Table 3 provides an overview of the solutes with absolute errors that are greater than 1.0 log unit for each model. The molecules appearing frequently amongst the models are denoted by asterisks, and it is interesting to note that solutes M4, M10, and M13 appear in nearly all the solvent models (except COSMO-RS). This is presumably because the Minnesota SMD, SM8, and SM12 models were trained using similar datasets. As shown in Figure 1, M4, M10, and M13 are molecules that display intramolecular hydrogen bonding in the gas phase, 1-octanol and/or water. Notably, the signed errors (calculated − experiment) are mostly negative, implying that the models underestimate the lipophilicity (or overestimate the hydrophilicity) of these molecules. As mentioned above, a possible explanation of this systematic error is that experimental data involve wet 1-octanol, and it is likely that the water molecules in the organic phase could associate to the highly polar parts of these solutes, increasing their solubility in the organic medium. These data suggest that implicit solvent prediction of molecules that display such features should be checked using complementary methods. Linclau Dataset Is More Challenging for FragmentBased Methods. Figure 4a,b compares the MAEs for the Mobley and Linclau datasets. As shown, the fragment-based methods outperformed both implicit and explicit solvent models for the dataset M where XLOGP3 performed best with a mean absolute error (MAE) of 0.05 log unit. Additionally, all fragment-based methods have MAE values that are typically 50% lower compared to implicit and explicit solvent models. Interestingly, the performance of these methods was noticeably poorer against the Linclau dataset, although their overall
Table 3. Molecules that Displayed Errors of More than 1.0 Log Unit Using Implicit Solvation Models for Dataset Ma SMD (M062X)
SM8
SM12 (CM5)
ADF-COSMO-RS
M13 (−1.81)* M4 (1.03)* M10 (−1.01)*
M4 (1.96)* M13 (−1.48) M26 (1.48)
M4 (1.47)* M10 (−1.13)* M19 (−1.06) M20 (−1.05)*
M6 (1.65) M29 (1.34) M33 (1.33) M7 (1.30) M4 (1.24)* M32 (1.24) M3 (1.13)
a
Values in parentheses refer to signed errors. The molecules appearing frequently (in at least two columns) are denoted by asterisks.
procedure is used for all the implicit solvent calculations presented in this work. Figures 4 and 5 compare the mean and maximum absolute errors for the various models for the Mobley and Linclau datasets, which reveal a number of interesting trends. Implicit Solvent Models Perform Better than Explicit Solvent Models. For dataset M, Mobley’s group performed explicit solvent free energy perturbation (FEP) calculations with the GAFF and GAFF-DC force fields,37 which we could compare with our implicit solvent calculations. It is interesting to note that the MAE for the implicit solvent models are generally smaller compared to the explicit solvent results (∼0.65 log unit). Of all the implicit solvation models, SMD and SM12 models showed the best performance with MAEs of