Letter pubs.acs.org/JPCL
Robust Predictive Power of the Electrostatic Term at Shortened Intermolecular Distances Karol M. Langner,† Wiktor Beker,‡ and W. Andrzej Sokalski*,‡ †
Leiden Institute of Chemistry, Leiden University, The Netherlands Institute of Theoretical and Physical Chemistry, Wrocław University of Technology, Poland
‡
S Supporting Information *
ABSTRACT: At distances shorter than equilibrium, electrostatic interactions seem to be a more robust indicator of relative molecular dimer stability than more accurate electronic structure approaches. We arrive at this conclusion by investigating the nonparametric correlation between reference interaction energies at equilibrium geometries (coupled cluster with singles, doubles, and perturbative triples at the complete basis set limit, ΔECBS,ref CCSD(T)) and its various approximate values obtained at a range of distances for a training set of 22 biologically relevant dimers. The reference and other costly methods start to fail to reproduce the equilibrium ranking of dimer stabilities when the intermolecular distance is shortened by more than 0.2 Å, but the full electrostatic component (includes penetration) maintains a high success rate. Such trends provide a new perspective for any applications where inaccurate structures are used out of necessity, such as the scoring of ligands docked to enzyme active sites. SECTION: Molecular Structure, Quantum Chemistry, and General Theory
F
rom the form of intermolecular potentials,1 one can learn that forces between neighboring molecules depend strongly on their separation. To estimate stabilization energies, it therefore seems important to operate on adequate structures, but obtaining accurate coordinates is a notorious problem in quantum chemistry, requiring sufficiently saturated basis sets and a proper account of correlation effects.2 Truly reliable ab initio results are still available for only the smallest of molecular systems; a notable example is force fields for water built from first principles, which are now capable of reproducing properties for the dimer as well as the liquid state.3 Most force fields, however, are soberly contingent on experimental data and not surprisingly misguide the geometry of molecules; average errors here can range from 0.2 Å to as much as 3 Å.4 Quantum chemical methods also entail many sources of inaccuracy,2,5,6 an interesting one being basis set superposition error (BSSE); equilibrium distances obtained with ab initio methods for dimers7−10 and biomolecules11−13 can be reduced by up to 0.4 Å. Finally, crystallographic structures, which are often the starting point for computational studies, carry their own well-known inaccuracies. We wish to consider two unpopular but necessary questions in this context. What happens to the interaction energy when complexes are significantly displaced from their equilibrium geometry? Can the relative stability at equilibrium, with such distorted input, be recovered? The first question is of a fundamental type. A short answer, after Feynman, is that the interaction energy tends to zero at large separations and sharply rises when any two atoms approach each other. The second question is related, but more pragmatic. It epitomizes the problems in a typical structure© 2012 American Chemical Society
based drug design scenario: a set of ligand−receptor geometries is known to have defects, and we can locate neither the errors nor their magnitude, but we would nonetheless like to infer about their stability, at least in a relative sense. We probe this matter statistically, asking how well different approximations to the interaction energy, obtained at various intermolecular separations (we employ the distance between centers of mass dCOM) reproduce the ranking of accurate values at equilibrium (denoted by req). As reference values we take the S22 data set from Jurečka et al.14 comprising state-of-the-art coupled cluster results extrapolated to the complete basis set (CBS) limit, ΔECBS,ref CCSD(T), for a set of 22 molecular dimers of biological importance (Figure 1). The correlation of these is evaluated with the most accurate available literature values obtained for geometries significantly out of equilibrium,15 which extend the reference to 14 intermolecular separations (−0.8 < dCOM−req < 10.0).16 We likewise evaluate statistical correlation for theory levels defined by a hybrid variation− perturbation (HVP) partitioning of the interaction energy.17,18 Since the publication of S22, larger reference data sets have been made available, most notably the S66 data set.19,20 These feature dissociation curves and angular displacements, but the shortest intermolecular distances are not as extreme as those in the report by Fusti Molnar et al.15 and therefore do not lend themselves to as clear a demonstration of the effect sought by us here. Received: August 8, 2012 Accepted: September 10, 2012 Published: September 10, 2012 2785
dx.doi.org/10.1021/jz301146v | J. Phys. Chem. Lett. 2012, 3, 2785−2789
The Journal of Physical Chemistry Letters
Letter
Figure 1. Overview of the small molecule dimers in the S22 reference set and the corresponding intermolecular interaction energies at four levels of theory. The data set includes seven hydrogen-bonded dimers, eight complexes with dominant dispersion interactions, and seven mixed dimers, whose chemical identities and coordinates can be found in the Supporting Information here and in their original publications.14,15
results, to some extent, from the nonlinear behavior of the intermolecular potential, especially at short distances. A simple, informative statistical measure is the success rate of prediction Npred (Figure 2), calculated for all pairs of dimers (there are 231) and any two computational methods (one is the reference at equilibrium). Each dimer pair is considered to be a success only if the difference in dimer interaction energies does not
A great number of applications, including high-throughput screening and comparative docking of inhibitors into enzyme active sites, focus on relative stabilities, not energy values themselves. This moves us to use nonparametric statistics,21 which operate on ranks instead of exact values and therefore address the main issue at hand: which complex is more stable? Such an abstraction of the statistical correlation also frees the 2786
dx.doi.org/10.1021/jz301146v | J. Phys. Chem. Lett. 2012, 3, 2785−2789
The Journal of Physical Chemistry Letters
Letter
separation. Overall, ΔECBS CCSD(T) and ΔEMP2 as well as the intermediate Hartree−Fock (ΔERHF) and Heitler−London (ΔE(1) HL) levels all show qualitatively similar trends. Only the first-order electrostatic interaction energy ΔEel(1) exhibits sustained strong correlation at shortened intermolecular separations, with Npred remaining above 90% down to −0.8 Å from equilibrium. It is important to reflect on this result because electrostatic effects have long held a leading role in explaining the energetics of biomolecule processes such as enzyme catalysis.22 One curious case we encountered recently23 involved ligands docked by force fields into the urokinase active site, where several nearest contacts were up to 0.5 Å shorter than those in more accurate MP2-based geometries. MP2 interaction energies at the force-field geometries failed to correlate with experimental activities, but the electrostatic component of the interaction energy ΔE(1) el surprisingly exhibited much better correlation. Whereas that result could have been accidental given its limited scope, the statistical analysis presented here is more suggestive. Our working explanation for this observation is the weaker distance dependence of the electrostatic multipole term compared with the exponential vanishing of the exchange component ΔE(1) ex . We also point out that the first-order uncorrelated electrostatic term ΔE(1) el does not converge to CBS ΔECCSD(T) at long-range (1.5 Å above req), where the interaction energy is dominated by nonpenetrative, multipole electrostatic effects. This difference can only be due to intramolecular correlation and could be fixed by employing atomic multipole expansions based on correlated electronic densities.24 In addition to Npred (Figure 2), we have considered other statistical measures (several are presented in Table 1) that provide some additional insight. For example, it is interesting that the Pearson coefficient for the electrostatic component
Figure 2. Success rate of predicting the relative stability of molecular dimers at equilibrium, using several levels of theory at various intermolecular distances from equilibrium.
change sign between methods, that is, if the ranking of relative stability is preserved. As expected, around the equilibrium distance and at larger distances where dCOM > req, the most accurate interaction energies (ΔECBS CCSD(T) and the second-order Møller−Plesset energy ΔEMP2) adequately reproduce the reference equilibrium ordering. At long range, the success rate for the electrostatic component becomes gradually worse, whereas it remains quite high (>90%) for the MP2 and coupled cluster energies. The success rates of the latter two deteriorate rapidly, however, when dCOM drops below −0.2 Å from the equilibrium
Table 1. Kendall tau (τK), Error Rate (Nerr), the Error Magnitudes Δ̅ ref mis and Δ̅ mis, as well as Spearman (ρs) and Pearson (ρp) Correlation Coefficients Calculated Between ΔECBS,ref CCSD(T) (at equilibrium distance req) and Another Theory Level (the aug-ccpVDZ basis set was used in this case) at Three Representative Intermolecular Distances dCOMa Δ̅ ref mis
Δ̅ mis
Nerr
[kcal/mol]
[kcal/mol]
0.861/2e−08 0.870/1e−08 0.766/6e−07
7% 6% 12%
0.99 0.96 1.60
4.39 1.00 0.76
0.954/7e−12 0.963/8e−13 0.879/7e−08
0.976/1e−14 0.972/4e−14 0.960/1e−12
−0.784/3e−07 −0.022/9e−01 0.654/2e−05
11% 49% 83%
1.28 7.19 7.99
6.29 3.08 3.85
−0.914/3e−09 −0.074/7e−01 0.804/7e−06
−0.932/3e−10 0.102/7e−01 0.916/2e−09
−0.143/4e−01 0.455/3e−03 0.671/1e−05
57% 27% 16%
6.72 3.65 2.40
7.36 3.48 1.16
−0.233/3e−01 0.599/3e−03 0.806/6e−06
−0.072/8e−01 0.819/3e−06 0.931/3e−10
0.411/7e−03 0.896/5e−09 0.905/4e−09
29% 5% 5%
7.43 0.46 0.33
3.03 0.79 0.43
0.490/2e−02 0.979/3e−15 0.982/6e−16
0.111/6e−01 0.988/1e−17 0.990/2e−18
0.484/3e−03 0.989/1e−09 0.937/8e−09
26% 1% 3%
5.71 0.20 0.25
1.24 0.05 0.07
0.544/1e−02 0.998/4e−24 0.988/5e−16
0.709/5e−04 1.000/1e−30 0.998/2e−22
τK/p
ρS/p
ρP/p
(1) ΔECBS,ref CCSD(T)·ΔEel
dCOM − req = −0.4 dCOM − req = 0.0 dCOM − req = 0.7 (1) ΔECBS,ref CCSD(T)·ΔEHL dCOM − req = −0.4 dCOM − req = 0.0 dCOM − req = 0.7 ΔECBS,ref CCSD(T)·ΔERHF dCOM − req = −0.4 dCOM − req = 0.0 dCOM−req = 0.7 ΔECBS,ref CCSD(T)·ΔEMP2 dCOM − req = −0.4 dCOM − req = 0.0 dCOM − req = 0.7 CBS ΔECBS,ref CCSD(T)·ΔECCSD(T) dCOM − req = −0.4 dCOM − req = 0.0 dCOM − req = 0.7 a
p in τK/p and other coefficients denotes the statistical significance of the correlation coefficient. 2787
dx.doi.org/10.1021/jz301146v | J. Phys. Chem. Lett. 2012, 3, 2785−2789
The Journal of Physical Chemistry Letters
Letter
ΔECCSD(T) − ΔEMP2. Corrections due to extrapolating to complete wave function basis sets are also often considered, and in fact both of these corrections form the previously published reference data used in our analysis, ΔECBS,ref CCSD(T). All HVPT calculations were performed using an appropriately modified version of GAMESS-US27 and automated with scripts based on cclib.28 The aug-cc-pVDZ basis set was used throughout, and the corresponding triple valence basis set was also tested on a subset of S22 with no observable difference in the reported statistics. Statistical Correlation Based on Ranks. The interpretation of rank-based statistics is straightforward when minimum quantitative information is needed, for example, whether the activity of one compound is larger than another. If this can be decided using energetic criteria, then the main concern becomes how to reproduce efficiently the ascending or descending order of energies, and one may turn toward the analysis of their ranks, which we attempt to outline shortly. Let two sets of energies, {Ai} and {Bi}, both with N elements, be well-ordered. The elements of these sets (energies) correspond to molecular dimers in our case, and they are treated as raw scores and converted into ranks {ai} and {bi}. The rank ai of an element Ai corresponds to its position when the set is sorted in ascending or descending order. The most popular nonparametric measure, the Spearman rank correlation coefficient, is the Pearson correlation coefficient between two sets of ranks. We choose a simpler way to express the strength of a monotonic relationship, by the number of concordant and discordant pairs among sets, NC and ND. These can in turn be used to evaluate the Kendall tau coefficient, which measures the correspondence of rankings
(1) (ΔECBS,ref CCSD(T)·ΔEel ) remains almost unchanged at large distances, whereas the fraction of mistakes Nerr and their magnitude Δ̅ ref mis almost double. This is an example of how an assumed linear relationship could lead to an erroneous conclusion about the quality of a statistical relationship; in this case, the quality would be overestimated. Both Figure 2 and Table 1 show the “autocorrelation” of the CBS-extrapolated CCSD(T) interaction energy: CBS,ref ΔECCSD(T) ·ΔECBS CCSD(T). The percentage of mistakes Nerr at equilibrium is not zero in this case due to differences between methods. (The latter also omits two of the largest dimers in S22.) The Kendall tau in this case drops below 0.9 only for dCOM−req > 3 Å (not shown), where the fraction of misaligned ref pairs exceeds 10% and Δ̅ mis reaches 1.5 kcal/mol. The corresponding statistics when the MP2 interaction energy acts as the prognostic are not much worse, but in both cases correlation degrades rapidly at shortened separations (dCOM− req < −0.2 Å). Because of the statistical nature of our observations, there are a number of subtle but relevant technical points. Details of the data set and statistical analysis are important, as is the manner in which intermolecular distance from equilibrium is defined. Furthermore, a natural extension would be to use a distributed representation or another more efficient way to approximate the full electrostatic term used here.25 Some reservations are in order, above all, the fact that our results are statistical and therefore their meaningfulness is also purely statistical. One would also expect the sustained correlation of the electrostatics interaction term to break down in many cases, for example, for complexes with significant charge transfer (which are absent in S22 and rare in biological settings). We contend, however, that the characteristic trend in Npred and in similar measures will remain unchanged in many biological applications, leading to the somewhat counterintuitive conclusion that the most accurate quantum chemical methods may not always be the best choice. This seems to be the case for the S22 training set when dimer structure is strongly perturbed but surely also for larger biomolecular systems when ranking a number of complexes by their relative stabilities. The new type of statistical analysis presented here offers guidance in such situations on how to improve efficiently empirical scoring procedures used in drug design and related applications, especially when dealing with questionable geometries.
■
τK =
COMPUTATIONAL METHODS
Nerr = 100%·
(1) (R) > ΔE RHF(ΔEel(1) + ΔEex + ΔEdel )
> ΔEel(1)
+
ND 1 N (N − 2
1)
(3)
and the success rate is Npred = 100% − Nmis. Furthermore, the magnitude of errors can be estimated by the average difference between the values of mistaken pairs. There will be two such averages, one for {Ai} and another for {Bi}. The first of these, Δ̅ Amis, will be
(1) (R) ΔEMP2(ΔEel(1) + ΔEex + ΔEdel + ΔEcorr )
>
(2)
Similar to the Pearson and Spearman coefficients, τK assumes values between −1 and 1, corresponding to opposite ideal monotonic relationships. A statistical significance can also be assigned and interpreted as the probability of observing rank sets {ai} and {bi} assuming the null hypothesis. Because one set of ranks is used to reproduce the other, the null hypothesis here is that the sequences of {Ai} and {Bi} are unrelated. The numbers NC and ND already give a good idea of the monotonic relationship. Normalized, they measure the fraction of mistakes or successful predictions when reproducing the order of elements in {Bi} from the order in {Ai}. The error rate therefore reads
Hybrid Variation−Perturbation Theory (HVPT). This approach17,8 combines variational and perturbational aspects of Hartree−Fock theory to provide a decomposition of the interaction energy that is, in essence, a simplified version of symmetry-adapted perturbation theory.26 The resulting components define a hierarchy of gradually simplified models
(1) ΔE HL (ΔEel(1)
NC − ND 1 N (N − 1) 2
A
(1) ΔEex )
Δ̅ mis = (1)
Nerr
1 1 N (N 2 err err
− 1)
∑ (i , j) ∈ M
|A i − A j | (4)
where M is the set of discordant or concordant pairs, depending on the direction of the monotonic relationship. In the context of energies as discussed here, Δ̅ Amis carries a clear practical meaning. It is the average difference between two energies in
This can be extended with electron correlation effects beyond the MP2 level, for example, within coupled-cluster theory, yielding an additional correction δΔECCSD(T) = 2788
dx.doi.org/10.1021/jz301146v | J. Phys. Chem. Lett. 2012, 3, 2785−2789
The Journal of Physical Chemistry Letters
Letter
(13) Balabin, R. M. Communications: Is Quantum Chemical Treatment of Biopolymers Accurate? Intramolecular Basis Set Superposition Error (BSSE). J. Chem. Phys. 2010, 132, 231101. (14) Jurečka, P.; Šponer, J.; Č erný, J.; Hobza, P. Benchmark Database of Accurate (MP2 and CCSD(T) Complete Basis Set Limit) Interaction Energies of Small Model Complexes, DNA Base Pairs, and Amino Acid Pairs. Phys. Chem. Chem. Phys. 2006, 8, 1985. (15) Fusti Molnar, L.; He, X.; Wang, B.; Merz, K. M. J. Further Analysis and Comparative Study of Intermolecular Interactions Using Dimers from the S22 Database. J. Chem. Phys. 2009, 131, 065102. (16) The results reported by Fusti Molnar et al. comprise 20 of the 22 dimers originally published, which means that the data for ΔECBS CCSD(T) in Figure 2 are derived for 190 instead of 231 dimer pairs. There were also convergence issues for some of the larger dimers, and so in this case the plot only goes down to −0.4 Å (although we expect it to follow ΔEMP2). (17) Sokalski, W. A.; Roszak, S.; Pecul, K. An efficient procedure for decomposition of the SCF interaction energy into components with reduced basis set dependence. Chem. Phys. Lett. 1988, 153, 153−159. (18) Góra, R. W.; Sokalski, W. A.; Leszczyński, J.; Pett, V. B. The Nature of Interactions in the Ionic Crystal of 3-Pentenenitrile, 2-Nitro5-Oxo, Ion(−1), Sodium. J. Phys. Chem. B 2005, 109, 2027−2033. (19) Ř ezác,̌ J.; Riley, K. E.; Hobza, P. S66: A Well-Balanced Database of Benchmark Interaction Energies Relevant to Biomolecular Structures. J. Chem. Theory Comput. 2011, 7, 2427−2438. (20) Ř ezác,̌ J.; Riley, K. E.; Hobza, P. Extensions of the S66 Data Set: More Accurate Interaction Energies and Angular-Displaced Nonequilibrium Geometries. J. Chem. Theory Comput. 2011, 7, 3466−3470. (21) Lehmann, E. L.; D’Abrera, H. J. M. Nonparametrics: Statistical Methods Based on Ranks, rev. 1st ed.; Prentice-Hall: Upper Saddle River, NJ, 1998. (22) Warshel, A.; Sharma, P. K.; Kato, M.; Xiang, Y.; Liu, H.; Olsson, M. H. M. Electrostatic Basis for Enzyme Catalysis. Chem. Rev. 2006, 106, 3210−3235. (23) Grzywa, R.; Dyguda-Kazimierowicz, E.; Sieńczyk, M.; Feliks, M.; Sokalski, W. A.; Oleksyszyn, J. The Molecular Basis of Urokinase Inhibition: from the Nonempirical Analysis of Intermolecular Interactions to the Prediction of Binding Affinity. J. Mol. Model. 2007, 13, 677−683. (24) Sokalski, W. A.; Sawaryn, A. Correlated Molecular and Cumulative Atomic Multipole Moments. J. Chem. Phys. 1987, 87, 526−534. (25) For clarity, we note that in HVPT the electrostatic term, ΔE(1) el , is calculated using the standard expression ⟨ϕ0|V|ϕ0⟩, where ϕ0 = ϕA0 ϕB0 is the polarization approximation for the dimer and V is the Coulomb operator. (26) Jeziorski, B.; Moszyński, R.; Szalewicz, K. Perturbation Theory Approach to Intermolecular Potential Energy Surfaces of van der Waals Complexes. Chem. Rev. 1994, 94, 1887−1930. (27) Schmidt, M. W.; et al. General Atomic and Molecular Electronic-Structure System. J. Comput. Chem. 1993, 14, 1347−1363. (28) O’Boyle, N. M.; Tenderholt, A. L.; Langner, K. M. cclib: A Library for Package-Independent Computational Chemistry Algorithms. J. Comput. Chem. 2008, 29, 839.
{Ai} when their order is opposite the majority of concordant or discordant pairs with respect to {Bi}. A similar delta can be defined for {Bi} relative to {Ai} and called Δ̅ Bmis, and the particular choice of symbols depends on which set is used to predict which.
■
ASSOCIATED CONTENT
S Supporting Information *
Coordinates of the dimers studied and all calculated interaction energy components in the form of GAMESS-US input and output files. This material is available free of charge via the Internet at http://pubs.acs.org.
■
AUTHOR INFORMATION
Corresponding Author
*E-mail: sokalski@pwr.wroc.pl. Notes
The authors declare no competing financial interest.
■
ACKNOWLEDGMENTS The authors acknowledge the support of Wrocław Networking and Supercomputing Center. This work was financed by a statutory activity subsidy from the Polish Ministry of Science and Higher Education for the Faculty of Chemistry of Wrocław University of Technology.
■
REFERENCES
(1) Stone, A. J. Intermolecular Potentials. Science 2008, 321, 787− 789. (2) Riley, K. E.; Pitoňaḱ , M.; Jurečka, P.; Hobza, P. Stabilization and Structure Calculations for Noncovalent Interactions in Extended Molecular Systems Based on Wave Function and Density Functional Theories. Chem. Rev. 2010, 110, 5023−5063. (3) Bukowski, R.; Szalewicz, K.; Groenenboom, G. C.; van der Avoird, A. Prediction of the Properties of Water from First Principles. Science 2007, 315, 1249−1252. (4) Paton, R. S.; Goodman, J. M. Hydrogen Bonding and π-Stacking: How Reliable Are Force Fields? A Critical Evaluation of Force Field Descriptions of Nonbonded Interactions. J. Chem. Inf. Model. 2009, 49, 944−955. (5) Gordon, M. S.; Mullin, J. M.; Pruit, S. R.; Roskop, L. B.; Slipchenko, L. V.; Boatz, J. A. Accurate Methods for Large Molecular Systems. J. Phys. Chem. B 2009, 113, 8646−9663. (6) Gordon, M. S.; Fedorov, D. G.; Pruit, S. R.; Slipchenko, L. V. Fragmentation Methods: a Route to Accurate Calculations on Large Systems. Chem. Rev. 2012, 112, 632−672. (7) Simon, S.; Duran, M.; Dannenberg, J. J. How Does Basis Set Superposition Error Change the Potential Surfaces for HydrogenBonded Dimers? J. Chem. Phys. 1996, 105, 11024−11031. (8) Simon, S.; Duran, M.; Dannenberg, J. J. Effect of Basis Set Superposition Error on the Water Dimer Surface Calculated at Hartree-Fock, Møller−Plesset, and Density Functional Theory Levels. J. Phys. Chem. A 1999, 103, 1640−1643. (9) McAllister, L. J.; Bruce, D. W.; Karadakov, P. B. Halogen Bonding Interaction Between Fluorohalides and Isocyanides. J. Phys. Chem. A 2011, 115, 11079−11086. (10) Vijay, D.; Sakurai, H.; Sastry, N. The Impact of Basis Set Superposition Error on the Structure of pi-pi Dimers. Int. J. Quantum Chem. 2011, 111, 1893−1901. (11) Valdés, H.; Klusák, V.; Pitǒnák, M.; Exner, O.; Starý, I.; Hobza, P.; Rulišeḱ , L. Evaluation of the Intramolecular Basis Set Superposition Error in the Calculations of Larger Molecules: [n]Helicenes and PheGly-Phe Tripeptide. J. Comput. Chem. 2008, 29, 861−870. (12) Asturiol, D.; Duran, M.; Salvador, P. Intramolecular Basis Set Superposition Error Effects on the Planarity of DNA and RNA Nucleobases. J. Chem. Theor. Comp. 2009, 5, 2574−2581. 2789
dx.doi.org/10.1021/jz301146v | J. Phys. Chem. Lett. 2012, 3, 2785−2789