MM Simulations

Apr 2, 2009 - Institute of Enzymology, Biological Research Center, Hungarian Academy ... Laboratory, UniVersity of Cambridge, Cambridge, United Kingdo...
2 downloads 0 Views 2MB Size
5728

J. Phys. Chem. B 2009, 113, 5728–5735

Evaluating Boundary Dependent Errors in QM/MM Simulations Iva´n Solt,† Petr Kulha´nek,† Istva´n Simon,† Steven Winfield,‡ Mike C. Payne,‡ Ga´bor Csa´nyi,§ and Monika Fuxreiter*,† Institute of Enzymology, Biological Research Center, Hungarian Academy of Sciences, Budapest, Hungary, and CaVendish Laboratory and Engineering Laboratory, UniVersity of Cambridge, Cambridge, United Kingdom ReceiVed: August 14, 2008; ReVised Manuscript ReceiVed: March 6, 2009

Hybrid quantum mechanics/molecular mechanics (QM/MM) simulations provide a powerful tool for studying chemical reactions, especially in complex biochemical systems. In most works to date, the quantum region is kept fixed throughout the simulation and is defined in an ad hoc way based on chemical intuition and available computational resources. The simulation errors associated with a given choice of the quantum region are, however, rarely assessed in a systematic manner. Here we study the dependence of two relevant quantities on the QM region size: the force error at the center of the QM region and the free energy of a proton transfer reaction. Taking lysozyme as our model system, we find that in an apolar region the average force error rapidly decreases with increasing QM region size. In contrast, the average force error at the polar active site is considerably higher, exhibits large oscillations and decreases more slowly, and may not fall below acceptable limits even for a quantum region radius of 9.0 Å. Although computation of free energies could only be afforded until 6.0 Å, results were found to change considerably within these limits. These errors demonstrate that the results of QM/MM calculations are heavily affected by the definition of the QM region (not only its size), and a convergence test is proposed to be a part of setting up QM/MM simulations. 1. Introduction Although computer power continues to increase rapidly, simulation of complex systems at the atomistic level still remains a challenge. The description of chemical reactions in solution, at interfaces, and especially involving enzymes requires models with many thousands of atoms, which is too costly for high accuracy quantum mechanical calculations. Thus, to achieve sufficient accuracy and allow extensive configurational sampling of these large systems, a multilevel treatment has to be applied.1-9 A wide range of applications have been published, e.g. for condensed-phase reactions,10-12 nanostructured materials,13 brittle fracture, and catalytic systems,14 such as zeolites15 or enzymes.16,17 In these hybrid schemes, the region of primary interest, where chemical changes are likely to occur, is treated quantum mechanically (QM) and the surroundings are described by a simple empirical molecular mechanics model (MM).18 The coupling between the empirical and quantum regions can be modeled in various ways, e.g. via mechanical constraints19 or electrostatic perturbations.20 The most critical point in this methodology is the handling of the boundary between the QM and MM regions. In particular, due to the long-range nature of quantum mechanical models, how to minimize the errors that arise in the quantum mechanical simulations. In covalently bonded systems, the selection of the QM region necessarily results in the breaking of covalent bonds. The ensuing problems introduced by the artificial surfaces created are typically ameliorated either by incorporating so-called link atoms to complete the valences of the atoms in the QM subsystem21 or by defining a boundary zone that will appear in * Corresponding author. Phone: 361-279-3138. Fax: 361-4665465. E-mail: [email protected]. † Hungarian Academy of Sciences. ‡ Cavendish Laboratory, University of Cambridge. § Engineering Laboratory, University of Cambridge.

both QM and MM calculations.22 Alternatively, QM capping is achieved by utilizing localized orbitals to represent bonding between the two subsystems, e.g. in localized self-consistent field (LSCF)23 or generalized hybrid orbital (GHO)24 methods. All these methods, explicitly or implicitly, are relying on some degree of cancellation of the errors between the QM and MM models associated with the artificial surface that is created between the two regions. In general, these errors can be very large due to the nonlocality of quantum mechanics and the failure of empirical potentials to capture these nonlocalities. Furthermore, they depend strongly on the electrostatic perturbations to the Hamiltonian. While it is impossible in general to estimate the residual errors, better results are obtained if the charges on the frontier MM atoms are small.5 In principle, this criterion could be used for selection of the QM region, but in practice, it leads to inconsistencies in the simulations and generates large errors, as demonstrated by a recent comparison of the different frontier treatments in the condensed phase.25 For example, errors in deprotonation energies of simple alcohols and carbonic acids exhibit large variations using different link atom methods and they do not decrease monotonically with increasing size of the QM region. It is difficult to conclude which method performs the best, but there is a clear indication that all these approaches become inadequate if there are strong electrostatic interactions with the environment. The method of last resort in dealing with such boundary errors is to make the QM region larger. Indeed in some sense, this is the only approach which is guaranteed to work: whatever property of the system one is interested in, it should be converged with respect to the size of the QM region. In practice, however, this is rarely done. There can be little doubt that if a particular boundary treatment method results in force errors in the quantum region that are smaller, all thermodynamic properties will approach their limiting values faster as the QM region

10.1021/jp807277r CCC: $40.75  2009 American Chemical Society Published on Web 04/02/2009

Errors in QM/MM Simulations

J. Phys. Chem. B, Vol. 113, No. 17, 2009 5729 errors. In section 3 the results are presented in terms of the dependence of force errors on the QM region size, on QM region definition, and on QM region polarity, and we discuss the dependence of free energies on the QM region size.

Figure 1. Three dimensional structure of lysozyme (PDB code: 4lzt) and the two regions considered for QM force error calculations: (A) the polar region (red) is centered around the OE1 atom of active site residue Glu35; and (B) the apolar region is selected around the CG1 atom of the buried Ile55 (blue).

is enlarged, or conversely, more accurate calculations are possible with any given QM region. As we show in this work, convergence to an acceptable tolerance in the force errors at the center of the QM region may only occur beyond a surprisingly large QM region size. This implies that the force errors on atoms nearer the edge of the QM region are very large, and thus, molecular dynamics carried out using these forces, as is usually done in QM/MM calculations, may lead to significant errors in the calculated thermodynamic properties. In solid state simulations, this problem can be overcome by discarding quantum mechanically computed forces from near the edge of the QM region using a variety of techniques,7 though these approaches have yet to be tested in the context of biochemical systems. The forces on the atoms near the center of the QM region are expected to converge with increasing size of the QM region at a rate that will depend on the locality properties of the particular QM model and to some extent on how the affected covalent bonds are treated. The purpose of the present work is to study this dependence in a prototypical biochemical context. In order to do this, we compute the average error in the force on a single atom at the center of the QM region and the variation of the free energy of a proton transfer reaction with the size of the QM region. Since our long-term goal is to devise heuristics for selection of QM regions, the performance and convergence properties of a number of simple rules are tested, which are detailed below. Focusing on the application of QM/MM schemes to enzymes, hen-egg lysozyme was chosen as the model system.26 Based on previous experience in other systems,27 the strength of the electrostatic coupling between QM and MM zones is expected to strongly influence the force errors in the QM region. To probe this effect, we compared the force errors in a polar and an apolar region of lysozyme. The polar region was centered at the active site residue Glu35, while the apolar region was centered at the buried Ile55 (Figure 1). We demonstrate a marked difference between the force errors in these two regions. As anticipated, errors for the apolar regions are much smaller than those of polar regions and they also depend less strongly on the size of the QM region. Similarly to the force errors, the free energy of a proton transfer reaction in a polar region is also heavily influenced by the size of the QM region and does not appear to converge faster. In section 2 we briefly describe the definitions used for selecting the QM region and the procedure for evaluating force

2. Methods 2.1. Model System. The atomic resolution structure of lysozyme (PDB code 4lzt26) was used as a starting model for all our calculations. The crystal structure was completed with hydrogens using the LEaP module of the Amber 928 program package. All ionizable residues were treated as charged, and the system was neutralized by adding 8 chloride ions. The model obtained was immersed into a 58.32 Å × 67.01 Å × 73.75 Å cell of TIP3P29 water molecules and was then subjected to the following relaxation procedure. First, the solvent molecules were optimized for 5000 steepest descent minimization steps, followed by concomitant minimization of the water and counterions for 5000 steepest descent steps. Then, the 25.0 kcal/mol Å2 harmonic constraints on the protein atoms that tethered them to their crystallographic positions were gradually released in 9000 minimization steps. Finally, 3000 steps of conjugate gradient unconstrained minimization were performed until the maximum force dropped below 0.1 kcal/mol Å. 2.2. Molecular Dynamics Simulations. The relaxed system was heated to 300 K in 140 ps. The water density was then adjusted in a 200 ps simulation at constant pressure. Subsequently, a 5 ns MD simulation in the canonical ensemble was performed with a cell size of 63.96 Å × 62.00 Å × 68.23 Å. A time step of 1 fs was used throughout the simulation. Configurations were collected at intervals of 0.5 ps. Long-range electrostatic interactions were computed using a particle mesh Ewald method30 with an 8 Å cutoff. SHAKE contraints were applied for all bonds that involve hydrogens. All simulations were performed by the Sander module of the Amber 9 program package.28 Root mean square deviations from the initial (relaxed) structure indicate that equilibration has been achieved after 1.5 ns, at which point the fluctuations of the backbone atoms from the average structure were of the order of 0.9 Å. The force error calculations have been carried out in a QM/ MM framework, where the PM3 semiempirical potential31 was applied for QM atoms, and the AMBER parm94 force field32 was applied for MM atoms. As it is implemented in the Sander module, the MM charges nearby to the QM frontier were not scaled. To achieve faster convergence in the self-consistent field (SCF) of the hybrid QM/MM calculation, a gas phase calculation has been carried out for the QM region by the GAMESS program package,33 neglecting the MM environment. The resulting electron density has been used as an initial guess for the QM/MM force calculation. Force errors were also evaluated by using the AM1 semiempirical potential.31 2.3. Definition of the QM Region. QM regions were defined around the central atoms, which were OE1 of Glu35 in the polar region and CG1 of Ile55 in the apolar region (Figure 1). The QM region is constructed as follows. We initially start with a spherical region with a given radius around the central atom. To avoid improper valences and the so-called link-atom overlap caused by severe perturbation of the covalently bonded network, three different heuristics used to arrive at the final QM region were studied. In the residue-based QM region, complete side chains or backbone fragments of the selected QM atoms were included (Figure 2A). Alternatively, in the group-based QM region, atoms that are attached to the selected QM atoms (chemical groups) by double covalent bonds as well as those that belong to the same in-out-in motifs that cross the QM region

5730 J. Phys. Chem. B, Vol. 113, No. 17, 2009

Solt et al. As a third alternative, we also tried a “coValent walk” approach (Figure 2C). Here, we include in the QM region all atoms within a given distance of the central atom that can be reached by walking along the covalent connectivity network starting from the central atom. This definition leads to the exclusion of water molecules from the QM region even for large initial radii. 2.4. Force Errors. The force error on the central atom is defined as

Ferr ) |FR - Fref|

(1)

where FR is the force on the central atom computed using a QM region with a radius R, and Fref is the reference force obtained using the largest possible QM region, which has been constructed according to the residue-based selection approach with an initial radius of 11.0 Å, leading to 700-750 QM atoms for the different configurations. Force errors have been evaluated using 150 snapshot configurations taken from the equilibrium phase of the atomic trajectory. For every snapshot, a QM region was selected according to residue-based and group-based selection rules with the radius R ranging from 2.0 to 9.0 Å with the step of 0.5 Å. Within a QM region of a given radius defined by any of the selection rules, the number of QM atoms varies. Hence, the force error is represented as a function of the number of QM atoms, that in turn may include different sets of QM atoms as configurations vary. Because the covalent walk approach selects a smaller number of atoms due to the exclusion of water, the radius R was extended to 12 Å to obtain better statistics of force errors. As force errors do not fall into natural bins along the number of QM atoms, which is the independent variable, the Gaussian process method has been applied to analyze the data.34 This provides regression analysis using the nonparametric model, leading to smoothed average force errors. The regression is written as a linear combination of Gaussian functions, each centered on a data point, with a common variance:

y(x) )



(

Ri exp

i

Figure 2. Schematic representation of the selected atoms around the OE1 atom of Glu35 in the polar region of lysozyme using different QM region definitions: (A) residue-based selection, where complete side chains or backbone motifs are included; (B) group-based selection, where chemical groups are included; (C) “covalent walk approach”, where atoms along the covalent connectivity network are included. The QM regions corresponding to a given QM radius are shown by dark gray, extensions of the initial QM region that are added according to the different definitions are gray, and the atoms in the resulting QM region are shown in blue. In the case of the “covalent walk approach”, the selected atoms are displayed in blue and waters are excluded.

boundary were included (Figure 2B). Non-QM hydrogens attached to heavy QM atoms are also included, as well as nonQM heavy atoms attached to QM hydrogens. Aromatic rings are also included as whole if any atoms fall within the QM region. Water molecules are considered as chemical groups and are included as whole molecules in both approaches. The groupbased QM region definition results in a more spherical QM region than the residue based approach.

-(x - xi)2 σ2

)

(2)

The vector of coefficients r ) {Ri} are given by C-1y, where y is the vector of data values and C is the regularized covariance matrix:

(

Cij ) γ1 exp

-(xi - xj)2 σ2

)

+ γ2I + γ3

(3)

where I is the identity matrix, and γ1, γ2, and γ3 are parameters to be adjusted together with σ to get the best fit. The Gaussian process regression maximizes a combination of smoothness and fit and is particularly suited to dealing with very noisy data. 2.5. Free Energies. The effect of QM region size was also assessed on the free energy of a proton transfer reaction between a single water molecule and the OE1 of Glu35 carboxylic group in the polar region. The free energy was computed by the adaptive biasing force (ABF) method35 along a prescribed reaction coordinate ζ, defined as the difference between breaking and forming bonds (i.e., the difference between the water hydrogen and carboxyl oxygen OE1 distance d1 and the water

Errors in QM/MM Simulations

J. Phys. Chem. B, Vol. 113, No. 17, 2009 5731

Figure 3. Schematic representation of the effect of increasing the size of the QM region with four different initial radii (A, 1.5 Å; B, 4.0 Å; C, 6.0 Å; D, 8.0) for the residue-based QM definition. Please note, that not all atoms included in the QM zone are represented on the schemes. Extensions of the initial sphere according to the residue based selection are shown in light gray.

oxygen and water hydrogen distance d2; see Figure 7). The reaction coordinate was kept in the interval of [-1.0; 2.0] Å by outward half-harmonic potentials with the force constant of 100 kcal mol-1 Å-1 to prohibit diffusion of water (or hydroxide ion) far away from Glu35. The potential of mean force (PMF) was accumulated in 70 bins along the reaction coordinate and applied to the system with a linear ramp of 200 samples in every bin. A multiple walker approach36,37 was applied to improve sampling and be able to probe larger QM zones in a reasonable computational time. Four independent ABF simulations were started from the same configurations but with different initial velocities, and the accumulated PMF was collected and mixed. Then the accumulated PMF was redistributed back to individual ABF simulations every 100 or 500 time steps via a special purpose application. The final free energies were obtained by the numerical integration of PMF using Simpson’s rule. The reference for the free energy of the proton transfer was calculated using a minimum QM region that contains the side chain of Glu35 and a single nearby water molecule (12 atoms in total), which selection was kept constant over the entire simulation. The effect of the QM region size on the free energy of proton transfer was evaluated by updating the QM region in the course of the simulation on-the-fly according to the residuebased scheme described in section 2.3. The list of selected QM atoms was updated every 100 time steps, which ensures continuous tracking of the changes around the central atom (e.g., water diffusion). To achieve faster SCF convergence in the first MD step of individual short QM/MM simulations, the initial electron density was taken from a GAMESS-US calculation in the same manner as it was described in section 2.2. The free

energy of the proton transfer was calculated for the initial QM radii R set to 3.0, 4.0, 5.0, and 6.0 Å. We could not go beyond this limit due to computational cost (on Dual-Core AMD Opteron processors 92 days for the 70 windows). In all free energy calculations, the integration time step was 0.5 fs. No bonds were constrained by SHAKE to avoid possible problems with the rethermalization of frozen degrees of freedom during entering species to the QM region and vice versa. Due to frequent restart of QM/MM simulations, the temperature of the system was maintained with a Berendsen thermostat with a weak coupling constant of 1 ps. 3. Results In the following we discuss the dependence of the errors in the forces on the central atom as a function of the definition of the QM region (section 3.1), the size of the QM region (section 3.2), and the polarity of the QM region (section 3.3). The dependence of the catalytically more relevant free energies on the QM region size is represented in section 3.4. 3.1. Definition of the QM Region. Schematic representations of residues which are included in the QM region centered on Glu35 according to the three definitions for the QM region are shown in Figure 2, for a particular value of the initial radius. As described previously in section 2.3, QM regions were defined (A) based on residues, where complete side chains or backbone motifs were included; (B) based on chemical groups; (C) or based on the “covalent walk approach”, where atoms along the covalent connectivity network were included. Figure 3 shows how the QM region varies with different choices for the initial radius, using the residue-based approach.

5732 J. Phys. Chem. B, Vol. 113, No. 17, 2009

Solt et al.

Figure 4. Force errors on the OE1 atom of Glu35 in the polar region as a function of the size of the QM region, using three QM region definitions: residue-based (green open triangles); group-based (red open circles); and covalent walk (blue open diamonds) definitions. The means of the distributions are shown by continuous lines, and the raw error data is shown by colored symbols.

The distribution of the force errors on the central atom of the polar region (OE1 of Glu35) and the average values of the forces, which were obtained by the three definitions, are shown in Figure 4. Since none of the definitions result in QM regions that are perfectly spherical, the size of the QM region was represented by the number of the QM atoms. The distributions show the largest force errors are for QM regions defined using the “covalent walk approach”; indeed, the forces do not converge within the range of sizes of QM regions used in this study. This observation illustrates that strong electrostatic perturbations at the QM/MM boundary can substantially increase the errors in hybrid calculations.3,5 These perturbations can be reduced by including neutral or chemically consistent groups, as will be discussed below. The covalent walk approach results in the least spherical QM regions, and our results show that relying on the MM charge distributions for atoms, in particular for water molecules, which are near the QM region spatially but distant from it covalently, is a bad approximation. Given the obvious problem with this approach, we will not discuss the covalent walk approach further. Force errors obtained with the group-based QM definition show a significantly larger range of scatter than those computed with the residue-based definition. With the exception of small QM zones, the magnitude of the force errors obtained using residue-based QM regions is approximately half of that using group-based selection at a similar QM region size (above 250 atoms). A contrasting difference in the trends of the force errors, however, can be observed in QM regions containing 70-150 atoms, where errors obtained with the group-based definition increase, while those computed with the residue-based definition decrease. In this range, QM regions that are defined on the residue basis also include the backbone of Trp108, Val109, and Ala110 residues, as well as the amide group of Gln57 (located, on average, 7.4 Å from Glu35 OE1) (Figure 3), which are either excluded or are only partially present in the group-based QM regions. This reflects that exclusion of polarizable backbone dipoles from the QM treatment may result in large perturbation of the electrostatic interactions. This also points to the importance of inclusion of backbone dipoles and other polar groups in the QM region, which can provide significant electrostatic stabilization, especially for the polar, partially charged atoms. 3.2. Size of the QM Region. The average force errors in the polar region using the residue-based QM selection decrease rapidly with increasing size of the QM region (Figure 4) but

Figure 5. Mean force errors (red) and mean charge errors on the central atom of the polar QM region (black) as a function of the size of the QM region, using two definitions of the QM region: residue-based (A) and group-based (B) in the polar region of the protein. The raw charge error data is shown as blue symbols in both cases.

reach an acceptable limit (