Reparametrization of the COSMO Solvent Model for Semiempirical

5 days ago - Research of the year C&EN's most popular stories of the year Molecules of the year Science that delighted... SCIENCE CONCENTRATES ...
5 downloads 0 Views 552KB Size
Article Cite This: J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

pubs.acs.org/jcim

Reparametrization of the COSMO Solvent Model for Semiempirical Methods PM6 and PM7 Kristian Krí̌ ž†,‡ and Jan Ř ezać *̌ ,† †

Institute of Organic Chemistry and Biochemistry, Czech Academy of Sciences, 166 10 Prague, Czech Republic Department of Physical and Macromolecular Chemistry, Faculty of Science, Charles University, Hlavova 8, 128 40 Prague 2, Czech Republic



Downloaded via IOWA STATE UNIV on January 5, 2019 at 18:53:35 (UTC). See https://pubs.acs.org/sharingguidelines for options on how to legitimately share published articles.

S Supporting Information *

ABSTRACT: An accurate description of solvation effects is of high importance in modeling biomolecular systems. Our main interest is to find an accurate yet efficient solvation model for semiempirical quantum-mechanical methods applicable to large protein−ligand complexes in the context of computer-aided drug design. We present a survey of readily available methods and a new reparametrization of the COSMO solvent model for PM6 and PM7 calculations in MOPAC. We have tested the reparametrized method on validation data sets of small drug-like molecules for which experimental solvation free energies are available as well as on a set of large model systems of the active site of carbonic anhydrase II interacting with a series of ligands for which experimental affinity values are known. In both cases, there is a significant improvement in accuracy after the reparametrization and the addition of a nonpolar term to the COSMO solvent model.



INTRODUCTION An accurate description of solvation effects on the properties of molecules is fundamental in computational modeling of the systems of interest in pharmaceutics, biochemistry, chemistry, and other fields. Two principal ways of tackling this task involve the use of either explicit or implicit solvent models. Where the former approach attempts to describe solvent plausibly in terms of individual solvent molecules, the latter approach approximates the solvent as a continuum that affects the solute molecules. In quantum-mechanical (QM) calculations, the use of explicit solvation (which also requires proper sampling of the configuration space) is often too demanding, which makes the application of implicit solvent models more practical. At the basic level, the solvent model must describe the electrostatic interaction between the solute and solvent, and this term enters the quantum-mechanical calculation of the solute. The nonpolar contributions are usually approximated by parametric expressions independent of the electronic structure. The most advanced theories such as 3D-RISM1 or COSMO-RS2 use statistical mechanics to model the response of the solvent in more detail. While there are many elaborate implicit solvent models for standard computational methods such as Hartree−Fock (HF) or density functional theory (DFT), less attention in this respect has been paid to semiempirical quantum-mechanical (SQM) methods. However, because of their efficiency, these methods are again gaining attention as they are being © XXXX American Chemical Society

successfully applied to calculations of large, complex systems such as biomolecules. In a proper description of biomolecules, an accurate solvent model is a prerequisite for results that are relevant either biologically or in drug design, for instance. We have recently applied SQM methods in computer-aided drug design for the calculations of protein−ligand interactions.3,4 When the latest corrections for noncovalent interactions are used,5 SQM methods can yield sufficiently accurate interaction energies in the gas phase, and our calculations suggest that it is the solvent model that limits the overall accuracy. To address this issue, we have extended the COSMO solvent model and parametrized it specifically for the use with PM6 and PM7 semiempirical methods.6,7 In this paper, we describe this parametrization, compare the result with other methods available, and demonstrate its application to the calculation of protein−ligand interactions.



CONTINUUM SOLVENT MODELS FOR SQM METHODS Only some of the numerous solvent models published are available for use with semiempirical QM methods (in this work, we consider both the classical MNDO-based SQM methods and the self-consistent charge density functional tight binding method, SCC-DFTB) either because of theoretical Received: October 2, 2018

A

DOI: 10.1021/acs.jcim.8b00681 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling limitations or, most frequently, because of the lack of a suitable software implementation. In general, the available models belong to the family of PCM (polarizable continuum model) and C-PCM/COSMO (conductor-PCM/conductor-like screening model) methods with their further extensions such as SMD (solvent model based on density).8−11 These models belong to the class of ASC (apparent surface charge models), in which the electrostatic effects of a solvent are approximated by a charge distribution on the surface of a cavity surrounding the molecule. In the original implementation of PCM,8 now sometimes referred to as D-PCM, the solvent continuum is treated as a dielectric. Both COSMO10 and C-PCM (conductor-like polarizable continuum model)9 treat the solvent as a conductor, which simplifies the calculation at the cost of an error that is negligible for solvents with a high dielectric constant. In the SCC-DFTB methods, the more approximate generalized Born model12 has been used as well,13,14 as it can be easily implemented and is computationally efficient. For a thorough review of solvent models, see ref 15. When a solvent model in a formulation developed for ab initio or DFT methods is applied to a SQM method (which is possible in some general purpose software packages), it often becomes more computationally demanding than the SQM calculation itself, and such a combination is not applicable to large molecules, where the use of SQM methods can be expected. It is obvious that, for practical applications, only an efficient implementation of a solvent model tailored specifically for SQM methods is useful. We have found that for systems with hundreds or thousands of atoms the only viable alternative in terms of performance is the implementation of the COSMO solvent model in MOPAC16 because it can be used along with the MOZYME17 linear scaling algorithm (for actual timing, see the Timing section). For these reasons, COSMO in MOPAC is our method of choice when the aim is a fast, yet accurate, assessment of the binding of a ligand to a whole protein (such as in scoring). The primary parametrization has been performed for the PM6 method,6 which we prefer for most applications (when combined with D3H4 corrections,5 it yields very accurate interaction energies), but we also report the parametrization for PM77 (for a comparison of PM6 and PM7 in terms of the description of noncovalent interactions, see ref 18).

tions to the solvation free energy.15 This nonpolar term can be at its simplest expressed empirically as a product of the total surface area of the cavity A (solvent−accessible surface) and an effective surface tension coefficient ξ:

METHODOLOGICAL ADVANCES The accuracy of COSMO in MOPAC is reasonably good when tested against experimental solvation free energies, but not as good as some more elaborate models published since its development. Besides the formulation of the method and its parametrization, it does not include a nonpolar term describing solute−solvent interactions other than that of electrostatic nature. In our application to biomolecules, it is now the accuracy of the solvent model that limits the overall accuracy of the calculations, rather than the error of the SQM method itself (when corrections for noncovalent interactions are used). This has motivated us to develop the COSMO solvent model further by adding a nonpolar term and optimizing also the remaining parameters in the model (the atomic radii defining the cavity). There are various formulations of the nonpolar term Enp, a term collecting dispersion, repulsion, and cavitation contribu-

a

Enp = ξA

(1)

The value of the solvent-accessible surface area of the cavity is taken from the underlying COSMO calculation used in the calculation of the electrostatic contribution. The nonpolar term is then added to the total energy of the solvated system. We also tried adding another term proportional to the volume of the solute, but its contribution was found to be negligible. The COSMO cavity in MOPAC is constructed from the van der Waals radii of atoms constituting the molecule. It is represented by a grid of discrete points on the surface of the union of the atomic spheres. To smooth the cusps between atoms, the following algorithm is used: The radius of the solvent molecule (an empirical parameter, in the present implementation 1.3 Å) is added to each atomic radius during the initial construction of the grid on each atom. The points inside the overlapping spheres are discarded, and the remaining points are then moved back to the original van der Waals radii.10 These radii are very important parameters, determining the resulting solvation energy. In the implementation of COSMO in MOPAC, a set of van der Waals radii from an earlier work was used (in Table 1, they can be compared with the widely Table 1. Atomic radii (in Ångstroms) used in the Original COSMO Implementation in MOPAC, Newly Optimized COSMO2 Method for PM6 and PM7, and Bondi van der Waals Radii as a Referencea) H C N O P S F Cl Br I ξ



COSMO

PM6/COSMO2

PM7/COSMO2

Bondi

1.08 1.53 1.48 1.36 1.75 1.70 1.30 1.65 1.80 2.05 

0.828 1.821 1.904 1.682 2.118 2.369 1.602 1.911 2.178 2.276 0.046

0.929 1.699 1.913 1.686 2.242 2.346 1.528 1.901 2.211 2.062 0.042

1.20 1.70 1.55 1.52 1.80 1.80 1.47 1.75 1.85 1.98 

Effective surface tension parameter ξ (kcal/mol/Å2).

used radii of Bondi19). This set of radii is probably not optimal because it was later shown that better results can be achieved by using slightly larger radii (obtained by scaling van der Waals radii by a factor of 1.1−1.3) or optimizing these radii as free parameters (such as in the SMD and related models).11 Here, we have adopted the latter approach as it offers more flexibility, which is needed for compensating the errors inherent to the underlying SQM method (such as limited molecular polarizability caused by the use of the minimal basis set). The complete method is reparametrized for water solvent on experimental solvation free energy values from the Minnesota Solvation Database.20 B

DOI: 10.1021/acs.jcim.8b00681 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling



REFERENCE DATA AND COMPUTATIONAL DETAILS The data set used for the parametrization of atomic radii was retrieved from Truhlar et al.20 Their Minnesota Solvation Database version 2012 contains experimental solvation energies in water for 533 different compounds and provides the corresponding molecular geometries. Seven compounds have been removed because the SQM calculations in the gas phase were difficult to converge, and 57 more have been removed because they overlapped with the SAMPL1 data set21 used for validation (see below). The resulting set of 469 compounds (331 neutral, 138 ionic) are labeled as MNSol* (the list of the compounds used is provided in the Supporting Information). The first data set used for validation was derived from Mobley’s data set of the experimental solvation energies of organic molecules,22 from which we removed all compounds included in the training set. The resulting data set contains 266 compounds (geometries are provided in a Supporting Information zip file (ci8b00681_si_002)) and is labeled Mobley266. The molecular geometries were taken from the original paper.22 Since we are particularly interested in the application of the method to computer-aided drug design, the remaining two validation sets cover specifically drug-like molecules. We use the data published for the SAMPL challenge, SAMPL121 and SAMPL423 sets. These sets consist of experimental values for the hydration of 63 and 50 small neutral drug-like molecules, respectively. As no geometries were provided along with the experimental data, we built them ourselves and optimized at the DFT/COSMO level. On these geometries, we recalculated the solvation free energies using the advanced COSMO-RS solvent model, and discarded those compounds where the disagreement between experimental and computed results was larger than the expected error of this method because, in these cases, the proposed geometry is likely to be a conformer different from the one studied experimentally. Additionally, four more compounds were removed from the SAMPL4 data set because of an overlap with the training set. Our SAMPL1 and SAMPL4 data sets thus comprise 53 and 42 compounds, respectively; their lists, as well as the optimized geometries used in this work are provided in Supporting Information files ci8b00681_si_001.pdf, ci8b00681_si_004, and ci8b00681_si_005, respectively. Another small set of 10 charged molecules (denoted here as C10) was built for the validation of the method since neither Mobley266 nor SAMPL data sets contain ionic compounds. Experimental solvation energies were taken from ref 24, where we found 6 cations and 4 anions not included in our training set. The geometries of these compounds were prepared same way as the geometries of SAMPL1 and 4 data sets and are p r o v i d e d a s a S u p p o r t i n g I n f o r m a t i o n z i p fil e (ci8b00681_si_006). The geometries of the compounds in the SAMPL1, SAMPL4, and C10 data sets were optimized using the B3LYP/def-QZVP setup with the COSMO solvent model and the default settings as implemented in Turbomole.25 As an example of a highly accurate implicit solvent model, we have used the COSMO-RS method2 based on DFT calculations in Turbomole (B-P/def-2TZVPD) and the corresponding set of COSMO-RS parameters as implemented in the COSMOtherm software.26 COSMO-RS (RS stands for

real solvents) is an extension of COSMO that supplements the method by van der Waals contacts of molecular segments with water and includes the interaction between segments of solute molecules by means of statistical thermodynamics. Besides the COSMO method10 implemented in MOPAC and applicable to any SQM method available therein, we have tested the following SQM methods and solvent model combinations: DFTB3/CPCM and DFTB3/SMD27 as implemented in GAMESS28 (with the default 3OB parameter set29 setup used for the DFTB327 calculations) as well as PM6/PCM and PM6/CPCM as implemented in Gaussian 09.30 Calculations of the data sets were automated using the Cuby framework.31,32



REPARAMETRIZATION OF COSMO FOR PM6 AND PM7 The parameters of the respective element radii together with an effective surface tension parameter ξ were optimized to give the minimal root-mean-square error (RMSE) in the MNSol* training set. A gradient optimization algorithm was used; the gradient of the minimized quantity with respect to the parameters was obtained using finite differences. The optimization was repeated multiple times with different initial values of the parameters to ensure that the solution represents the global minimum (within physical bounds of the parameters). The same protocol was repeated with PM6 and PM7, giving rise to two method-specific sets of parameters. When the PM6/COSMO2 parameters were applied to PM7, the results remained very similar, but reparametrization for PM7 allowed further reduction of the error. In the first step, the radii of H, C, O, N, and the ξ parameter were parametrized separately on a subset of compounds containing only these four elements. Their original COSMO radii (Table 1, on the left) were used as the starting point, while the initial value for ξ was chosen to be 0.05 kcal/mol/Å2 in the final parametrization (we have tested different starting values including zero, but the parametrization always led close to this value). In the second step, we optimized the radii for the remaining elements (S, P, and halogens) and reoptimized ξ with the H, C, O, and N radii held fixed. The value of ξ did not change much in this second optimization. The new parametrization (including ξ) is labeled COSMO2. The final parameters are listed in Table 1. If parameters for elements other than those optimized are needed, we recommend scaling the Bondi vdW19 radius by 1.16 for PM6 and 1.15 for PM7, which are the values that we have obtained by averaging the scaling factors for optimized nonhydrogen elements. Since the resulting parameter sets for PM6 and PM7 are rather similar, only the first one is discussed in more detail. Except for hydrogen, all the optimized radii were larger than the originally proposed ones. The situation was similar regardless of the starting value of the parameters or the optimization protocol used. This is in line with previous findings9 indicating that better results can be obtained when the van der Waals radii are scaled up slightly. When the radius of H was kept at the original value of 1.08, the results were slightly worse. We assume that the reduction of the radius (which makes the interaction of the atom with the solvent stronger) compensates the limited polarizability of hydrogen in SQM methods (which do not include polarization functions on hydrogen in the basis set). The radius of sulfur (2.369 Å) C

DOI: 10.1021/acs.jcim.8b00681 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

the AM1/COSMO setup yields rather good results in some data sets, but it still fails in the SAMPL1 set with an error of 8.9 kcal/mol. The error in the MNSol* set is also rather large. Surprisingly enough, the PM6/PCM method (as implemented in Gaussian) yields very good results for neutral molecules, although it calculates only the electrostatic part of the solvation energy. This is likely a result of an error compensation between underestimated electrostatics (due to the limited polarizability shared by all SQM methods) and the missing positive cavitation energy. When the more elaborate SMD model (which is an extension of the PCM calculation) is used, the results in the training sets become worse because the model was not developed for SQM methods (although it performs slightly better in the MNSol* set, which is practically identical to the training set used for the development of SMD). DFTB3 results are very similar. The best results are achieved with the C-PCM′ model, which neglects nonpolar terms; when these are added in the complete DFTB3/C-PCM calculation, the results become substantially worse. Again, SMD yields better results than complete C-PCM, but slightly worse than when only the electrostatic contribution is considered. These results clearly indicate that the application of implicit solvent models to SQM methods requires special attention. Models that perform well when combined with ab initio or DFT methods can not be directly applied to SQM methods which behave differently. If one does not want to rely on the error compensation, SQM-specific reparametrization of the solvent model, such as the newly developed COSMO2, is necessary. Because the validation set contains only a limited number of ionic species, we performed also a separate error analysis for neutral and ionic species in the training set. In general, the methods compared here suffer from large errors in ionic systems (Figure 1, Table S1 in Supporting Information).

exhibits the largest deviation from the standard van der Waals value. When the sulfur radius was kept at its original value, the RMSE rose significantly to 5.5 Å. This is, however, not unique to our modelfor instance, the corresponding radius in the SMD model (as implemented in Gaussian)30 is similarly large (2.49 Å). Parametrization without the nonpolar term (setting ξ to zero) yielded a RMSE of 3.46 kcal/mol, while its inclusion brought about a substantial improvement with a RMSE of 2.65 kcal/mol. The optimal value of ξ is larger than what is used in molecular mechanics. This can be explained by the fact that the electrostatic term is overestimated in less polar systems because it has to compensate the limited polarizability in the polar ones, and it is balanced by a stronger nonpolar contribution.



RESULTS, SOLVATION FREE ENERGIES OF SMALL MOLECULES The accuracy of the new PM6/COSMO2 and PM7/COSMO2 solvation models, as well as of other solvent models applicable to SQM methods was tested on the Mobley266, SAMPL1, and SAMPL4 data sets introduced in the previous section. The results are summarized in Table 2. Table 2. Errors in Solvation Free Energies (RMSE against experimental values, in kcal/mol) in the Training Set (MNSol*) and Four Validation Setsa Methods PM6/COSMO2 PM6/COSMO PM6/PCM PM6/SMD PM7/COSMO2 PM7/COSMO AM1/COSMO DFTB3/C-PCM DFTB3/C-PCM′ DFTB3/SMD DFT/COSMO-RS

MNSol* Mobley266 2.65 4.31 6.35 4.80 2.62 3.96 4.80 9.92 6.16 4.86 2.54

2.81 3.72 2.24 3.03 2.54 3.44 2.26 6.40 2.31 2.65 1.08

SAMPL1

SAMPL4

C10

5.08 9.07 3.93 4.91 3.73 6.07 8.89 12.87 3.66 4.85 1.88

2.44 4.01 1.92 2.59 1.92 3.21 2.26 8.96 2.84 3.15 1.59

2.18 2.88 8.38 8.39 2.28 2.87 4.68 12.55 8.01 5.57 3.57

a

COSMO2 is the new parametrization including the nonpolar term. C-PCM′ denotes the C-PCM model excluding the default cavitation and dispersion terms.

COSMO2 exhibits a systematic improvement over the original parametrization in all the validation sets. Importantly, the unacceptable RMSE of 9 kcal/mol for PM6/COSMO in the SAMPL1 set was reduced to about 5 kcal/mol for PM6/ COSMO2, and it is as low as 3.7 kcal/mol for PM7/ COSMO2. In the ionic systems (C10 set), it provides an additional small improvement to the already very good results of the original parametrization of COSMO. The description of ionic species is discussed in detail below. Overall, COSMO2 now performs as well as the best of its peers. PM7/COSMO2 is the most accurate of the semiempirical methods tested. In the training set, it yield results comparable to PM6/COSMO2, but it provides slightly better estimates in the SAMPL test sets. This is likely caused by the better electronic structure parameters in PM7 because the method has been parametrized to a larger and more robust training set. It should be noted that the original implementation of COSMO in MOPAC had been developed for the AM1 method, and although it contains no optimized parameters, they were probably selected to match the method. Therefore,

Figure 1. Errors of PM6/COSMO2 solvation free energies in the training set plotted separately for neutral molecules (gray), cations (red), and anions (blue).

However, one has to consider that the solvation energies for charged species are much larger than for neutral ones, which makes the errors of about 5 kcal/mol still acceptable if expressed in relative terms. Besides the overall error, it is also useful to look at its systematic part. For COSMO-RS, as well as PM6 and PM7 with COSMO2, it is small because these methods were parametrized also on ions. PM6 and PM7 with the original COSMO also perform rather well. The other SQM D

DOI: 10.1021/acs.jcim.8b00681 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

of the timing and resources needed for such calculations). Because we work with reliable geometries taken from a previous work (X-ray structure with added and optimized hydrogen atoms), the score is calculated in the fixed geometry as a sum of the interaction energy and a change of solvation free energy upon binding

methods exhibit different trends for cations and anions. Solvation energies of cations are always underestimated (not enough negative). Those of anions are either overestimated slightly (SMD methods), close to no systematic deviation (PM7 or PM6 with COSMO) or underestimated, too (but less than those of cations, PCM methods). Our new parametrization seems to absorb a substantial amount of these errors because it provides by far the best estimates of ionic solvation energies at the SQM level. It is obvious that the empirical parametrization of the solvent model specific to SQM methods can compensate for some errors inherent to the SQM calculation itself. Results for larger realistic model systems that include charged sites support the transferability of the current parametrization (see the following section). The best results for charged systems that constitute a part of the training data set (Figure 1) were obtained by PM6/ COSMO2, closely followed by PM7/COSMO2. Consistently, for the C10 data set, the COSMO2 parametrizations for PM6 and PM7 were also the two methods yielding the most accurate estimates. The next most accurate estimates were provided by PM6/COSMO and PM7/COSMO with the original parameters. This might be because they were developed specifically for SQM calculations. Since the remaining methods use parameters derived for ab initio or DFT calculations, they do not perform so well when applied to SQM calculations, in which the description of the strong solute−solvent interaction is even more difficult than in neutral systems. For neutral systems, PM6/COSMO2 and PM7/COSMO2 were most accurate when tested on the training data set (Figure 1, neutral), although most of the other methods yielded similarly small errors.

score = ΔEint + ΔΔGsolv

(2)

We have previously shown that in a series of similar inhibitors the other contributions (deformation energy, entropy, etc.) can be neglected.4 We consider two versions of the model systems. First, we use the complete experimental structure including crystal water molecules (for the details of the preparation of the systems including the determination of the position of hydrogens, see ref 33). Second, we removed the water molecules from the geometry because in many cases (a low-resolution X-ray structure or an in silico modeled structure) the information on theirs positions is not available. The results, summarized as R2 of the correlation between the computed score and experimental binding free energies and as the corresponding predictive index PI, are listed in Table 3. Table 3. Correlation Coefficient (R2) and Predictive index (PI) in the Series of Carbonic Anhydrase II Inhibitors for Scoring Functions Combining Different Methods for the Calculation of Gas-Phase Interaction Energy (ΔEint) and Change of Solvation Free Energy upon Binding (ΔΔGsolv). a With water molecules

Method



APPLICATION TO DRUG DESIGN: CARBONIC ANHYDRASE II In order to test the reparametrized COSMO model in a realistic scenario, we applied it to the calculations of protein− ligand interactions for computer-aided drug design. We have recently published a very good model system for thata series of inhibitors of carbonic anhydrase for which we have highresolution X-ray structures as well as a consistent set of experimentally measured binding free energies.33 We have already tested a SQM-based scoring function (a computed estimate of the binding free energy) based on DFTB3-D3H4X interaction energy and the solvation energy calculated using PM6/COSMO,33,34 which provided satisfactory correlation with the experiment (while many other scoring functions failed). Unlike in the original paper, we have now selected only one model geometry for each inhibitor (the one that yields the strongest binding in the best calculations performed in the original paper). The resulting data set is provided in a Supporting Information zip file (ci8b00681_si_003) and will be made available as a predefined data set in the Cuby framework.31,32 Here, we test scoring functions based on PM6-D3H4, PM7, and DFTB3-D3H4 interaction energies (PM6 has problems with the description of the zinc ion in the active site, whereas PM7 and especially DFTB3 yield better results), combined with solvation calculated using PM6 and PM7 with COSMO and COSMO2 solvent models. Because of the size of the systems (1400−1500 atoms), a COSMO calculation with MOZYME is the only practical SQM method with a solvent model applicable (see the following section for the discussion

No water molecules

ΔEint

ΔΔGsolv

R2

PI

R2

PI

PM6-D3H4 PM6-D3H4 PM7 PM7 DFTB3-D3H4 DFTB3-D3H4 DFTB3-D3H4

PM6/COSMO PM6/COSMO2 PM7/COSMO PM7/COSMO2 PM6/COSMO PM6/COSMO2 PM7/COSMO2

0.25 0.37 0.48 0.49 0.62 0.66 0.67

0.57 0.61 0.63 0.58 0.82 0.87 0.83

0.21 0.40 0.28 0.33 0.45 0.75 0.60

0.58 0.74 0.68 0.68 0.73 0.88 0.81

a

Results obtained with and without explicit structural waters are listed separately.

The predictive index35 is a measure of correlation that also takes into account the ordering of the data; its values can range from 1 (perfect correlation, correct ordering) through 0 (random distribution, no correlation) to −1 (perfect correlation with a negative slope, i.e., reverse ordering). In all cases, the COSMO2 solvent model improves the results, although the difference is rather small in the model including crystal water molecules. Nevertheless, the new setup surpasses all the methods tested on this model previously (including the standard scoring methods used in the field). What is more important is the very large improvement observed in the model without the crystal water molecules. Moreover, the DFTB3-D3H4+PM6/COSMO2 setup in the model without water molecules yields the best overall result, with significantly better correlation with the experiment (R2 = 0.75) than when the water molecules are included (R2 = 0.66). This may be explained by the limitation of the use of explicit water moleculesin that case, only their contribution to the interaction energy is included, while other finite-temperature thermodynamic contributions are not covered. On the other hand, an implicit model fitted to the free energies of solvation E

DOI: 10.1021/acs.jcim.8b00681 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

be met with the MOZYME linear scaling algorithm implemented in MOPAC, which is compatible with the COSMO solvent model. To improve the accuracy of these calculations, we present a novel reparametrization of COSMO for PM6 and PM7, named COSMO2. Its accuracy in neutral systems is comparable to the best methods in this class, and it has a significant advantage in the description of ionic systems. PM7/COSMO2 is slightly more accurate and can be thus recommended for calculations of solvation energies. In more complex systems, were both solvation and noncovalent interactions are important; PM6D3H4/COSMO2 may be preferred. Our main motivation for the development of COSMO2 was the application of SQM methods in computer-aided drug design. We have tested it on a model of the active site of carbonic anhydrase II with a series of inhibitors for which a detailed experimental reference is available. The use of COSMO2 improves the correlation of the computed estimates of the binding free energies with the experiment, which is very significant in a model that neglects crystal waters. The results obtained with the combination of DFTB3-D3H4 interaction energies and PM6/COSMO2 or PM7/COSMO2 solvation are better than any scoring function reported previously.33

covers also these effects (although approximately). This is very encouraging for potential applications to systems where the position of the water molecules is not known. These results also indicate that the new PM6/COSMO2 model yields rather accurate results.



TIMING For practical applications to SQM calculations of large systems, the additional time needed for the computation of the solvent model becomes very important, as it may be the limiting step. In order to demonstrate that, we have compared the performance of PM6 in the gas phase and with a full-featured PCM solvent (as implemented in Gaussian) and with the COSMO solvation tailored specifically for SQM calculations (available in MOPAC). The tests were performed on artificial polyalanine chains of a defined length (30, 60, and 100 alanine residues, each capped by an aspartate at the N end and a lysine at the C end; 337, 637, and 1037 atoms, respectively). The results are summarized in Table 4. Table 4. Computation Time (in minutes) for PM6 Calculations without and with COSMO and PCM Solvent Models in a Series of Polyalanine Helices



Chain length Method MOPAC MOPAC (MOZYME) Gaussian MOPAC/COSMO MOPAC/COSMO (MOZYME) Gaussian/PCM

30 0.1 0.1 0.6 0.2 0.2 407

60 5.4 0.2 6.1  0.4 2767

ASSOCIATED CONTENT

S Supporting Information *

100 10.4 0.4 31.0  1.2 11311

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.8b00681.

In the larger systems, the MOZYME linear scaling algorithm provides a significant advantage, but all the gas-phase calculations are sufficiently fast. When the PCM solvent model, developed for ab initio calculations of smaller molecules, is applied to a respective gas-phase calculation, the calculation times grow by several orders of magnitude. Moreover, the larger calculations require gigabytes of memory, which becomes the limiting factor in even larger systems. The COSMO implementation in MOPAC is more efficient, but it has a hard limit on the number of atoms in the systems, as a result of which it can only be applied to the smallest model here. The program, however, suggests using the MOZYME algorithm in the case of systems exceeding this particular size. Not only does MOZYME improve the scaling of the SQM method itself, but the partitioning of the molecule exploited by the MOZYME algorithm is also used in the COSMO calculation, which allows the calculation to scale well even to systems with thousands of atoms. The loss of accuracy is negligible, in the 30-alanine chain, and the difference between the solvation energies calculated with and without MOZYME is about 0.5 out of the total ΔGsolv of 295 kcal/mol.



Table of the errors used to construct Figure 1, list of compounds in the training set and reference experimental binding free energies of the carbonic anhydrase II complexes (PDF) Geometries for dataset SAMPL1 (ci8b00681_si_004.zip) Geometries for dataset SAMPL4 (ci8b00681_si_005.zip) Geometries for dataset C10 (ci8b00681_si_006.zip) Geometries for dataset Mobley266 (ci8b00681_si_002.zip) Geometries for dataset CA (ci8b00681_si_003.zip)

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. ORCID

Kristian Kříž: 0000-0001-9072-0949 Jan Ř ezác:̌ 0000-0001-6849-7314 Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS We acknowledge the support from the Czech Science Foundation, Grant No. P208/16-11321Y, and from the European Regional Development Fund; OP RDE; Project: Chemical biology for drugging undruggable targets (ChemBioDrug) (No. CZ.02.1.01/0.0/0.0/16_019/0000729). This work is part of the Research Project RVO 61388963 of the Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences.



CONCLUSIONS The combination of efficient semiempirical quantum-mechanical methods with implicit solvation models enables a realistic description of large molecular systems (with up to thousands of atoms) in their native environment. However, the solvation model must be chosen carefully with respect to both computational efficiency and accuracy. The first criterion can F

DOI: 10.1021/acs.jcim.8b00681 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling



(22) Mobley, D. L.; Dill, K. A.; Chodera, J. D. Treating Entropy and Conformational Changes in Implicit Solvent Simulations of Small Molecules. J. Phys. Chem. B 2008, 112, 938−946. (23) Guthrie, J. P. SAMPL4, a Blind Challenge for Computational Solvation Free Energies: the Compounds Considered. J. Comput.Aided Mol. Des. 2014, 28, 151−168. (24) Lee, S.; Cho, K.-H.; Lee, C. J.; Kim, G. E.; Na, C. H.; In, Y.; No, K. T. Calculation of the Solvation Free Energy of Neutral and Ionic Molecules in Diverse Solvents. J. Chem. Inf. Model. 2011, 51, 105−114. (25) TURBOMOLE v6.6, 2015. http://www.turbomole.com (accessed 2018-11-28). (26) Klamt, A.; Eckert, F. COSMOtherm, Version C1.1, Revision 01.01, 2001. http://www.cosmologic.de/products/cosmotherm.html (accessed 2018-11-28). (27) Elstner, M.; Porezag, D.; Jungnickel, G.; Elsner, J.; Haugk, M.; Frauenheim, T.; Suhai, S.; Seifert, G. Self-Consistent-Charge DensityFunctional Tight-Binding Method for Simulations of Complex Materials Properties. Phys. Rev. B: Condens. Matter Mater. Phys. 1998, 58, 7260. (28) Schmidt, M. W.; Baldridge, K. K.; Boatz, J. A.; Elbert, S. T.; Gordon, M. S.; Jensen, J. H.; Koseki, S.; Matsunaga, N.; Nguyen, K. A.; Su, S.; Windus, T. L.; Dupuis, M.; Montgomery, J. A. General Atomic and Molecular Electronic Structure System. J. Comput. Chem. 1993, 14, 1347−1363. (29) Gaus, M.; Goez, A.; Elstner, M. Parametrization and Benchmark of DFTB3 for Organic Molecules. J. Chem. Theory Comput. 2013, 9, 338−354. (30) Frisch, M. J.; Trucks, G. W.; Schlegel, H. B.; Scuseria, G. E.; Robb, M. A.; Cheeseman, J. R.; Scalmani, G.; Barone, V.; Mennucci, B.; Petersson, G. A.; Nakatsuji, H.; Caricato, M.; Li, X.; Hratchian, H. P.; Izmaylov, A. F.; Bloino, J.; Zheng, G.; Sonnenberg, J. L.; Hada, M.; Ehara, M.; Toyota, K.; Fukuda, R.; Hasegawa, J.; Ishida, M.; Nakajima, T.; Honda, Y.; Kitao, O.; Nakai, H.; Vreven, T.; Montgomery, J. A., Jr.; Peralta, P. E.; Ogliaro, F.; Bearpark, M.; Heyd, J. J.; Brothers, E.; Kudin, K. N.; Staroverov, V. N.; Kobayashi, R.; Normand, J.; Raghavachari, K.; Rendell, A.; Burant, J. C.; Iyengar, S. S.; Tomasi, J.; Cossi, M.; Rega, N.; Millam, N. J.; Klene, M.; Knox, J. E.; Cross, J. B.; Bakken, V.; Adamo, C.; Jaramillo, J.; Gomperts, R.; Stratmann, R. E.; Yazyev, O.; Austin, A. J.; Cammi, R.; Pomelli, C.; Ochterski, J. W.; Martin, R. L.; Morokuma, K.; Zakrzewski, V. G.; Voth, G. A.; Salvador, P.; Dannenberg, J. J.; Dapprich, S.; Daniels, A. D.; Farkas, Ö .; Ortiz, J. V.; Cioslowski, J.; Fox, D. J. Gaussian 09, revision D.01; Gaussian, Inc.: Wallingford, CT, 2009. . (31) Ř ezác,̌ J. Cuby: An Integrative Framework for Computational Chemistry. J. Comput. Chem. 2016, 37, 1230−1237. (32) Ř ezác,̌ J. Cuby 4, software framework for computational chemistry, 2015. http://cuby4.molecular.cz/ (accessed 2018-11-28). (33) Pecina, A.; Brynda, J.; Vrzal, L.; Gnanasekaran, R.; Hořejší, M.; Eyrilmez, S. M.; Ř ezác,̌ J.; Lepšík, M.; Ř ezácǒ vá, P.; Hobza, P.; Majer, P.; Veverka, V.; Fanfrlík, J. Ranking Power of the SQM/COSMO Scoring Function on Carbonic Anhydrase II−Inhibitor Complexes. ChemPhysChem 2018, 19, 873−879. (34) Pecina, A.; Haldar, S.; Fanfrlik, J.; Meier, R.; Ř ezác,̌ J.; Lepsik, M.; Hobza, P. The SQM/COSMO Scoring Function at the DFTB3D3H4 Level: Unique Identification of Native Protein-Ligand Poses. J. Chem. Inf. Model. 2017, 57, 127−132. (35) Pearlman, D. A.; Charifson, P. S. Are Free Energy Calculations Useful in Practice? A Comparison with Rapid Scoring Functions for the p38 MAP Kinase Protein System. J. Med. Chem. 2001, 44, 3417− 3423.

REFERENCES

(1) Ten-no, S.; Hirata, F.; Kato, S. A Hybrid Approach for the Solvent Effect on the Electronic Structure of a Solute Based on the RISM and Hartree-Fock Equations. Chem. Phys. Lett. 1993, 214, 391− 396. (2) Klamt, A. Conductor-Like Screening Model for Real Solvents - a New Approach to the Quantitative Calculation of Solvation Phenomena. J. Phys. Chem. 1995, 99, 2224−2235. (3) Lepšík, M.; Ř ezác,̌ J.; Kolár,̌ M.; Pecina, A.; Hobza, P.; Fanfrlík, J. The Semiempirical Quantum Mechanical Scoring Function for In Silico Drug Design. ChemPlusChem 2013, 78, 921−931. (4) Pecina, A.; Meier, R.; Fanfrlík, J.; Lepšík, M.; Ř ezác,̌ J.; Hobza, P.; Baldauf, C. The SQM/COSMO Filter: Reliable Native Pose Identification Based on the Quantum-Mechanical Description of Protein−Ligand Interactions and Implicit COSMO Solvation. Chem. Commun. 2016, 52, 3312−3315. (5) Ř ezác,̌ J.; Hobza, P. Advanced Corrections of Hydrogen Bonding and Dispersion for Semiempirical Quantum Mechanical Methods. J. Chem. Theory Comput. 2012, 8, 141−151. (6) Stewart, J. J. P. Optimization of Parameters for Semiempirical Methods V: Modification of NDDO Approximations and Application to 70 Elements. J. Mol. Model. 2007, 13, 1173−1213. (7) Stewart, J. J. P. Optimization of Parameters for Semiempirical Methods VI: More Modifications to the NDDO Approximations and Re-Optimization of Parameters. J. Mol. Model. 2013, 19, 1−32. (8) Miertuš, S.; Scrocco, E.; Tomasi, J. Electrostatic Interaction of a Solute With a Continuum. A Direct Utilizaion of AB Initio Molecular Potentials for the Prevision of Solvent Effects. Chem. Phys. 1981, 55, 117−129. (9) Barone, V.; Cossi, M. Quantum Calculation of Molecular Energies and Energy Gradients in Solution by a Conductor Solvent Model. J. Phys. Chem. A 1998, 102, 1995−2001. (10) Klamt, A.; Schüürmann, G. COSMO: a New Approach to Dielectric Screening in Solvents With Explicit Expressions for the Screening Energy and its Gradient. J. Chem. Soc., Perkin Trans. 2 1993, 0, 799−805. (11) Marenich, A. V.; Cramer, C. J.; Truhlar, D. G. Universal Solvation Model Based on Solute Electron Density and on a Continuum Model of the Solvent Defined by the Bulk Dielectric Constant and Atomic Surface Tensions. J. Phys. Chem. B 2009, 113, 6378−6396. (12) Still, W. C.; Tempczyk, A.; Hawley, R. C.; Hendrickson, T. Semianalytical Treatment of Solvation for Molecular Mechanics and Dynamics. J. Am. Chem. Soc. 1990, 112, 6127−6129. (13) Xie, L.; Liu, H. The Treatment of Solvation by a Generalized Born Model and a Self-Consistent Charge-Density Functional Theory-Based Tight-Binding Method. J. Comput. Chem. 2002, 23, 1404−1415. (14) Hou, G.; Zhu, X.; Cui, Q. An Implicit Solvent Model for SCCDFTB with Charge-Dependent Radii. J. Chem. Theory Comput. 2010, 6, 2303−2314. (15) Tomasi, J.; Mennucci, B.; Cammi, R. Quantum Mechanical Continuum Solvation Models. Chem. Rev. 2005, 105, 2999−3094. (16) Stewart, J. J. P. MOPAC, 2016. http://openmopac.net/ (accessed 2018-11-28). (17) Stewart, J. J. P. Application of the PM6Method to Modeling Proteins. J. Mol. Model. 2009, 15, 765−805. (18) Hostaš, J.; Ř ezác,̌ J.; Hobza, P. On the Performance of the Semiempirical Quantum Mechanical PM6 and PM7Methods for Noncovalent Interactions. Chem. Phys. Lett. 2013, 568, 161−166. (19) Bondi, A. van der Waals Volumes and Radii. J. Phys. Chem. 1964, 68, 441−451. (20) Marenich, A. V.; Kelly, C. P.; Thompson, J. D.; Hawkins, G. D.; Chambers, C. C.; Giesen, D. J.; Winget, P.; Cramer, C. J.; Truhlar, D. G. Minnesota Solvation Database Version 2012, 2012. https://comp. chem.umn.edu/mnsol/ (accessed 2018-11-28). (21) Guthrie, J. P. A Blind Challenge for Computational Solvation Free Energies: Introduction and Overview. J. Phys. Chem. B 2009, 113, 4501−4507. G

DOI: 10.1021/acs.jcim.8b00681 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX