Automated Training of ReaxFF Reactive Force Fields for Energetics

Computational studies of the reaction mechanisms of various enzymes are nowadays based almost exclusively on hybrid QM/MM models. Unfortunately, the s...
2 downloads 12 Views 1MB Size
Subscriber access provided by READING UNIV

Article

Automated Training of ReaxFF Reactive Force Fields for Energetics of Enzymatic Reactions Tomáš Trnka, Igor Tvaroska, and Jaroslav Koca J. Chem. Theory Comput., Just Accepted Manuscript • DOI: 10.1021/acs.jctc.7b00870 • Publication Date (Web): 20 Nov 2017 Downloaded from http://pubs.acs.org on November 21, 2017

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Chemical Theory and Computation is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Automated Training of ReaxFF Reactive Force Fields for Energetics of Enzymatic Reactions Tom´aˇs Trnka,†,‡ Igor Tvaroˇska,†,¶ and Jaroslav Koˇca∗,†,‡ †Central European Institute of Technology (CEITEC), Masaryk University, 625 00 Brno, Czech Republic ‡Faculty of Science – National Centre for Biomolecular Research, Masaryk University, 625 00 Brno, Czech Republic ¶Institute of Chemistry, Slovak Academy of Sciences, 845 38 Bratislava, Slovak Republic E-mail: [email protected] Abstract Computational studies of the reaction mechanisms of various enzymes are nowadays based almost exclusively on hybrid QM/MM models. Unfortunately, the success of this approach strongly depends on the selection of the QM region, and computational cost is a crucial limiting factor. An interesting alternative is offered by empirical reactive molecular force fields, especially the ReaxFF potential developed by van Duin and coworkers. However, even though an initial parameterization of ReaxFF for biomolecules already exists, it does not provide the desired level of accuracy. We have conducted a thorough refitting of the ReaxFF force field to improve the description of reaction energetics. To minimize the human effort required, we propose a fully automated approach to generate an extensive training set comprised of thousands of different geometries and molecular fragments starting from a few model molecules. Electrostatic parameters were optimized with QM electrostatic potentials as the main target quantity, avoiding excessive dependence on the choice of reference atomic charges and improving

1 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

robustness and transferability. The remaining force field parameters were optimized using the VD-CMA-ES variant of the CMA-ES optimization algorithm. This method is able to optimize hundreds of parameters simultaneously with unprecedented speed and reliability. The resulting force field was validated on a real enzymatic system, ppGalNAcT2 glycosyltransferase. The new force field offers excellent qualitative agreement with the reference QM/MM reaction energy profile, matches the relative energies of intermediate and product minima almost exactly, and reduces the overestimation of transition state energies by 27–48% compared with the previous parameterization.

1

Introduction

A wide variety of diseases is tightly connected to disturbances in enzymatic activity. Considerable scientific effort has thus been devoted recently towards understanding the detailed reaction mechanisms of enzymes. 1,2 Computational methods are often used to supplement experimental data and provide particular insight into the catalytic reaction of enzymes on an atomic level. 3–6 To cope with the size and complexity of an enzymatic system, most computational studies nowadays employ a hybrid QM/MM (quantum mechanics/molecular mechanics) approach. 7–9 This scheme combines the accuracy of an ab initio method with the low computational cost of an empirical potential, applying the former only to the core reactive region (QM) and the latter to the rest of the studied system (MM). Unfortunately, the success of this approach strongly depends on the way the system is partitioned into these two regions. These studies are thus very demanding on human experience and consequently also timeconsuming and tedious, as there is usually considerable effort needed to devise a suitable QM/MM partitioning. In some cases, a good choice of a fixed QM/MM decomposition may even be impossible, especially when the reaction is linked to large conformational rearrangements or the diffusion of solvent molecules into and out of the active site. Many adaptive QM/MM schemes have been proposed to eliminate the need for a predefined partitioning, 2 ACS Paragon Plus Environment

Page 2 of 40

Page 3 of 40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

and they definitely help to solve the problems with large structural changes. 10,11 Unfortunately, these methods do not remove the human effort required to set up a computational study; they just transform it into a task of finding the optimal parameters for a particular adaptive scheme. Additionally, even the QM/MM approach does not eliminate the significant computational cost of a quantum chemistry calculation. Even with all the advances in methods and computing power, the QM region is usually restricted to a modest level of theory. This problem is extremely pronounced in ab initio MD, limiting it to tens or maybe hundreds of picoseconds of total simulation time, at the cost of millions of CPU hours spent for a single study. 12–14 A variety of semi-empirical QM or tight-binding methods has been proposed to offer a faster alternative to DFT. 15–17 Unfortunately, even these methods are not fast enough to allow multi-nanosecond MD simulations of complete enzymatic systems comprising many thousands of atoms. Current applications of approximate QM methods to enzymatic reactivity thus still rely on hybrid QM/MM approaches with all of their shortcomings. Furthermore, even though these methods can be optimized to accurately describe reaction energetics, 18 tuning the description of the conformational motions of a protein is not straightforward due to the absence of explicit angle and torsion terms. When we cannot realistically replace a QM/MM description with a QM-only one, it seems natural to approach the problem from the other side. The empirical biomolecular force fields used for the MM region have a proven track record of describing protein dynamics at long timescales with acceptable accuracy and low cost. The only obstacle preventing an MM-only model of enzymatic reactivity is the predefined bonding topology, unable to accommodate a chemical reaction. Fortunately, there is a multitude of empirical reactive potentials, 19–26 with the most promising in this regard being the ReaxFF model by van Duin and coworkers. 27,28 The ReaxFF potential is versatile enough to be useful in many settings, ranging from combustion processes to materials science to catalysis. 21 It also retains the intuitive functional form resembling classical non-reactive force fields, enabling the directed optimiza-

3 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

tion of various conformational terms. Finally, its performance is usually just one or two orders of magnitude below conventional force fields, enabling multi-nanosecond simulations of biomolecular systems with commonly available resources. The first indication that ReaxFF could be applicable to enzymatic systems came from the work of Zhu et al. 29 Although this was just a simple 250-atom cluster model of an enzymatic active site, it showed that ReaxFF can describe it with remarkable qualitative accuracy. After that, several ReaxFF parameterizations relevant for biomolecules were developed, starting with a glycine force field by Rahaman et al. 30 and ending with the ProtReaxFF force field by Monti et al. 31 Even though the latter is parameterized for protein systems, its quantitative accuracy is still lacking in many aspects, as will be shown further in this study. In this work, we aim to refine the ProtReaxFF by refitting most of its parameters to improve reaction energetics. Due to the structure of the ReaxFF potential, the angle and torsion terms depend on the bond terms, but not vice versa. This makes it possible to first focus on the bond potentials, with a careful tuning of torsion terms and testing of protein conformational dynamics being the topic of a follow-up study. The ProtReaxFF force field 31 was created using a relatively small training set and a simple optimization protocol consisting of successive single-parameter parabolic searches. In this study, we introduce an automated approach to generate an extensive training set covering thousands of geometries of small model molecules and complexes. In addition to DFT-based energies, we also fitted the ReaxFF atomic forces to their DFT counterparts to leverage this valuable information about the shape of the reference potential energy surface. An advanced numerical optimization algorithm 32 from the CMA-ES family 33,34 was employed to find the best combination of parameters in a robust, reliable, highly parallel and efficient manner. Parameters for the charge equilibration scheme were optimized separately with electrostatic potentials as the main target quantity, avoiding excessive dependence on the choice of reference atomic charges. Finally, we validated the accuracy and transferability of the resulting force field by comparing the potential energy profile along a Nudged Elastic Band 35,36 reaction path of the

4 ACS Paragon Plus Environment

Page 4 of 40

Page 5 of 40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

ppGalNAcT2 glycosyltransferase with reference data. 12,37

2 2.1

Methods Introduction to ReaxFF

ReaxFF is a general empirical force field that consists of the following main terms: 28

Epot = Ebond + ELP + Eover + Eunder + Eval + Epen + Ecoa + Etors + Econj + EHBond + EvdW + ECoul + Epol

The last three terms represent the nonbonded interactions, where EvdW stands for a Morse repulsion-dispersion potential and the other two terms correspond to the electrostatic interaction energy and polarization self-energy of atomic point charges calculated using the Electronegativity Equalization Method 38 (EEM, sometimes also QEq) with Louwen-Vogt empirical screening 39 of short-range interactions. All the remaining terms represent the bonded part of the potential and depend on the bond orders BOij calculated from distances between neighboring atoms. Because the force field needs to capture the behavior of a given element in many possible chemical situations, the individual energy terms are relatively complicated. Even a simple force field for glycine 30 covering only four elements is then comprised of around 800 empirical parameters, making the process of ReaxFF force field optimization decidedly non-trivial.

2.2

Calculation of Molecular Binding Energies with EEM

Care needs to be taken when calculating molecular binding energies using ReaxFF, due to a well-known artifact of the EEM model. This simple model allows unimpeded charge transfer even over an infinite distance so that the atomic charges do not tend to zero (assuming a non-ionic system) when a molecule is atomized. This in turn causes the polarization self5 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

energy term Epol to retain a non-zero value upon atomization, while all other terms vanish. The ReaxFF potential energy of an atomized molecule (atoms at infinite separation) is thus equal to Epol,0 and depends only on the composition of the molecule.

P −1 η χi µ0 = Pi i −1 i ηi µ0 − χ i qi,0 = 2ηi X  2 Epol,0 = χi qi,0 + ηi qi,0 i

Subtracting Epol,0 from the raw ReaxFF potential energy of a molecule thus yields the true molecular binding energy (amount of energy released when the molecule is formed from its constituent atoms). This is equivalent to the calculation of QM molecular binding energies by subtracting the atomic QM energies from the total energy of a molecule.

2.3

Training Set Generation

A software tool based on the OpenBabel library 40 was used to automate the generation of training sets of small molecules. A set of 31 small molecules was selected to model all natural amino acid side chains, the peptide bond, and other common organic functional groups. The starting geometries of model molecules were optimized using DFT. For every model molecule, a library of conformers was then generated by successively altering every bond, angle, torsion, and out-of-plane degree of freedom. Every such degree of freedom was perturbed separately, without allowing the rest of the molecule to relax. Because our objective function heavily relies on fitting forces, using rigid scans helps distinguish the contributions of individual potential terms by the forces that appear on atoms surrounding the perturbed one. Relaxed scans would eliminate these forces, losing a significant amount of valuable information and making the procedure more prone to overfitting. Additionally, not having to perform thousands of geometry optimizations during the generation of the 6 ACS Paragon Plus Environment

Page 6 of 40

Page 7 of 40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

training set makes the process considerably simpler, computationally cheaper, and more reliable. However, using some combination of rigid and relaxed scans would likely provide even better description of concerted conformational motions and as such deserves further study. Bond stretching was scanned by multiplying the optimized bond length by a factor of 0.5 to 1.5 in steps of 0.1. For non-ring bonds, all other degrees of freedom were kept unchanged. The stretching of bonds participating in a ring was implemented by changing the position of the two atoms directly participating in the bond, thus also implicitly distorting the neighboring ring bonds and valence angles. The bending of valence angles was scanned by multiplying the optimized angle value by a factor of 0.8 to 1.2 in steps of 0.05. Rotation around torsional degrees of freedom was scanned by successively incrementing a selected non-redundant torsion angle by steps of 10 degrees over the whole range of 360 degrees. Planar moieties formed by three atoms each having a bond to the same central atom were distorted by successively displacing the central atom along the plane normal vector by −0.2 to 0.2 ˚ A in steps of 0.05 ˚ A. Distorted structures containing two atoms closer than a preset threshold were discarded. The threshold for eliminating steric clashes was 1.5 ˚ A for atoms not connected by a bond and 0.5 ˚ A for bonded atoms. To correctly parameterize acidobasic behavior, the conformer libraries of molecules containing potentially basic atoms were augmented by all singly protonated states. Potentially basic atoms were defined as atoms bearing at least one lone electron pair. For every potentially basic atom, a starting guess of the position of the proton was derived from the VSEPR model. The position of the proton was then optimized using the MMFF94 force field 41 with the rest of the molecule kept frozen. Finally, stretching of the newly formed bond was scanned analogously to other bonds. Important non-bonded interactions were modeled by a set of non-covalent dimers of neutral molecules based on the S66 benchmark set. 42 Geometries of all 66 complexes in this set were reoptimized using DFT. For every complex, a library of structures along the

7 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 40

dissociation curve was generated by successively altering the distance of the monomers by scaling the optimized value by factors of 0.5 to 1.5 in steps of 0.1; 1.6 to 2 in steps of 0.2; 2.5, 3, and 4. The threshold for eliminating steric clashes of nonbonded atoms was set to 1.0 ˚ A. The distance of monomers was defined as a vector between the weighted geometrical centers of selected atoms on each monomer, according to the following (simplified) algorithm: 1. Initialize the weights of all atoms to zero. 2. Find all pairs of atom a from monomer A with atom b from monomer B and calculate their distance dab 3. Sort the distances in ascending order and set dmin to the lowest dab found. 4. For every dab in the sorted list, calculate wab = max 1 −

dab −dmin ,0 2



. If any of the

atoms a, b still has a weight of zero, set its weight to wab . This procedure assigns a weight of 1.0 to the closest interacting atoms and linearly decreasing weights to atoms interacting over a distance of up to 0.5 ˚ A longer. This makes the selection of centers of interaction robust even in nearly symmetrical cases, such as in the acetic acid dimer or in dimers involving aromatic rings (see Figure S1). In addition to the closed-shell systems, a library of molecular fragments was also generated automatically using another software tool. Starting from the optimized geometries of all the systems mentioned above, pairs of fragments were created by successively splitting every bond, one bond at a time (see Figure S2). The resulting fragments were then subjected to another round of fragmentation. Any fragments consisting of a single atom were discarded. Finally, all the fragments were assembled in a single pool, no matter what source molecule they came from, and duplicates with the same constitution (as described by the InChI identifier) were eliminated. The properties of bulk water were modeled using 37 geometries of a water icosamer in a vacuum, extracted from MD simulations using the TTM3-F polarizable water model. 43 Four 8 ACS Paragon Plus Environment

Page 9 of 40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

independent 100 ps MD simulations were run using a modified version of the LAMMPS MD package 44,45 combined with an implementation of TTM3-F by Babin. 46 These simulations used a time step of 1 fs and the Nos´e-Hoover thermostat with a time constant of 100 fs and a target temperature of 300 K. Trajectory snapshots of the system were saved with a period of 100 fs. The last 80 ps of each simulation were joined and used for further processing. To eliminate icosamers with a non-compact or highly elongated shape, the solvent-accessible surface 47 of a rolling probe of 3 ˚ A radius was calculated for each trajectory frame using the g sas tool from the software package GROMACS 48 version 4.6.6. Trajectory frames with ˚2 were clustered using the Jarvis-Patrick method 49 a solvent-accessible surface below 800 A with parameters P = 6, M = 10 using the g cluster tool from GROMACS. The middle structure from each cluster was included in the training set. Force fields created solely using this pre-generated training set exhibited some spurious reactivity due to concerted multi-atom processes not described by the training set. To alleviate this problem, the training set was augmented by structures extracted from the affected trial MD trajectories, similarly to established adaptive force-matching schemes. 50,51 A Gly-Pro-Ser tripeptide in vacuum was used as a simple model system for the purpose of capturing spurious reactions. A trial force field created by optimization with the pre-generated training set was used to run 100 independent MD trajectories using the software LAMMPSReaxC. 44,45,52 Atomic velocities were initialized from the Maxwell-Boltzmann distribution at 600 K, and an NVT simulation was carried out with a 0.5 fs integration time step. The temperature was kept at 600 K by the Nos´e-Hoover thermostat with a time constant of 100 fs, and all motion of the center of mass was removed every 100 fs. At every MD time step, the current ReaxFF bonding topology was compared with the initial state. Only those bonds with a bond order value over 0.3 were considered to be present. When a difference in the set of present bonds was detected, a “bond change event” was recorded and the simulation was terminated 20 steps later. Only the first bond change event in a given simulation was taken into account. A simulation was ignored if two or more bond change events occurred at the

9 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 40

same timestep. All the change events from all simulations concerning the same bond (atom pair) were assembled and a single representative trajectory was selected. This selection was carried out automatically by minimizing the time needed to transition between the bonded and non-bonded state and maximizing the average bond order in the bonded state. At least 10 fs before and 2.5 fs after the bond change event were taken into account. Finally, every other frame from the selected part of the selected trajectory was inserted into the training set. Total molecular binding energies and forces (gradients) were calculated for all the generated structures using NWChem 53 version 6.5.6. The meta-hybrid density functional M062X 54,55 was applied in combination with the def2-TZVP atomic basis set 56 and the Weigend density fitting set. 57 Dispersion interactions were described by the empirical DFT-D3 scheme. 58,59 The molecular binding energy was calculated as the difference between the total molecular DFT energy and the sum of DFT energies of isolated atoms in their lowest spin state calculated using the same level of theory. The electrostatic properties of all closed-shell systems were calculated with NWChem version 6.3. To calculate Hirshfeld and CM5 60 charges, molecular orbitals were exported from NWChem in WFX format using an interface by V´azquez-Mayagoitia 61 and then processed with the program postg. 62,63 CM5 correction was then added using an in-house Python script.

2.4

EEM Parameter Optimization

A subset of the main training set consisting of amino acid side chain models, alanine dipeptide, non-bonded dimers, and the water icosamer was used to train the parameters for the Electronegativity Equalization Method (EEM) charge model used within ReaxFF. All geometries of a given system were hierarchically clustered using Ward’s method 64 with the Euclidean distance metric based on CM5 charges. The resulting hierarchy was converted to a flat set of clusters so that all the members of a given flat cluster had a cophenetic distance lower than a set threshold. This threshold was initialized to 0.1 and sometimes repeatedly 10 ACS Paragon Plus Environment

Page 11 of 40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

scaled by factors of 0.5 and/or 1.5 to yield at least two and not more than 20 clusters. A single representative geometry was then drawn randomly from each cluster. The resulting EEM training set was comprised of 376 geometries of 85 distinct systems, ranging from 6 to 60 atoms in size. An independent test set comprised of 901 geometries was generated in a similar way, aiming for at least 20 and not more than 50 clusters per system and ignoring any clusters that were already represented in the training set. The following objective function was minimized to find the optimal EEM parameters:

OEEM = Oφ + wq Oq + wie Oie sP 2 i (φi,EEM − φi,ref ) Oφ = Nφ sP 2 j (qj,EEM − qj,ref ) Oq = Nq sP 2 2 k (ηk + χk − IE k,ref ) + (ηk − χk − EAk,ref ) Oie = 2Nelem Here φi,ref is the QM electrostatic potential at grid point i, qj,ref is the CM5 charge of atom j, IE k,ref is the experimental first ionization energy of element k (as published in the NIST Standard Reference Database Number 69), and EAk,ref is its experimental electron affinity. The index i runs through all grid points of all molecules, index j runs through all atoms of all molecules, and k runs through all elements. A fixed cutoff of 10 ˚ A with a polynomial “taper correction” was used for the off-diagonal elements of the EEM matrix, as is customary in ReaxFF force fields. In total 12 parameters were optimized: The atomic electronegativity χ, hardness η and Louwen-Vogt screening 39 parameter γ for each of the four elements covered by the training set (C, H, O, N). A limited-memory Broyden-Fletcher-Goldfarb-Shanno quasi-Newton algorithm was employed using 10 vectors to approximate the Hessian. The gradient of OEEM with respect to EEM parameters was approximated numerically by two-point central finite differences with a fixed absolute step of 10−4 . Optimization was stopped as soon as both the 11 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 12 of 40

norm and the maximum element of the gradient decreased below 10−5 . Optimal weight parameters wq and wie were identified by a systematic grid search over the (0.01 ≤ wq ≤ 0.50) × (0.001 ≤ wie ≤ 0.050) region. To ensure good balance between the accuracy of electrostatic potentials, charges, and ionization energies, the grid point yielding the lowest value of the following objective function was chosen:  Owq ,wie =

Oφ − Oφ,best Oφ,best

2

 +

Oq − Oq,best 4Oq,best

2

 +

Oie − Oie,best 8Oie,best

2

Here, Ox,best stands for the lowest value of Ox among all grid points. The resulting optimal weight parameters were wq = 0.19 and wie = 0.011.

2.5

ReaxFF Parameter Optimization

Once the optimal EEM parameters were determined, most non-EEM parameters of the ReaxFF force field were reoptimized, starting from the values in the ProtReaxFF force field. 31 The “general” (non-atom-specific) parameters were kept at their original values to preserve compatibility with related force fields. The following objective function was minimized to

12 ACS Paragon Plus Environment

Page 13 of 40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

find the optimal ReaxFF parameters: r P

g

O=

X

ws

2 ~ ~ i kFg,i −Fg,i,QM k 3Ns,at

P

ws,g

!

P

g ws,g v uP X u g ws,g (Eg − Eg,QM − (1 − ws,E0 ) ∆Es,0 )2 P + wE ws t g ws,g s sP  2 2 ππ ππ 2 π π σ σ X + BO − BO + BO − BO BO − BO j j j j,ref j,ref j,ref j + wBO ws N s,bond s   2 X pk − pk,ref + wrestr Sk k s P xi,geoopt − ~xi,geoopt,init k2 i k~ + wgeoopt,x + wgeoopt,E (Egeoopt − Egeoopt,init ) 3Nat P g (Eg − Eg,QM ) ∆Es,0 = Ng Eg,QM − ming (Eg,QM ) − Emin ws,g = 1 − (1 − wmin ) Emax − Emin s

Sk = max(|pk,ref | , 0.001)

The index s runs through all systems in the training set, g runs through all geometries of a given system, i runs through all Ns,at atoms for a given system, j runs through all bonds of the optimized geometry of system s, and k runs through all ReaxFF parameters being optimized. The properties being compared are the forces F~g,i on all atoms, molecular binding energy Eg for each geometry (after correction for Epol,0 ) and the corrected ReaxFF bond orders BOj . The per-geometry weight parameter ws,g decreases linearly from 1 to wmin depending on the relative energy of geometry g compared to the minimum energy (optimized) conformation. This emphasizes the errors in low-energy regions and decreases the importance of high-energy conformations. The last two terms in O are based on the result of a geometry optimization run starting from the QM-optimized structure of alanine dipeptide. A conjugate gradient algorithm

13 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

with quadratic line searches 65 implemented in LAMMPS was used for the optimization of geometry. The maximum atomic displacement in any geometry optimization step was constrained to 0.05 ˚ A. The optimized Cartesian atomic positions and ReaxFF potential energy are then compared to their respective initial values to yield the contribution to O. The weight parameter ws,E0 determines the contribution of the average offset ∆Es,0 between the ReaxFF and QM molecular binding energies of a given system. Setting ws,E0 = 0 causes the overall offset to be completely subtracted from the per-geometry energy differences, thus completely ignoring the absolute value of the binding energy of a given system and taking into account only the energetic costs of conformational perturbations. Conversely, ws,E0 = 1 means that the ReaxFF energy of a given geometry is compared directly with its QM counterpart, without any special handling of the global offset. The last term in O introduces a parabolic restraint of each ReaxFF parameter pk to its value in the reference (starting) force field. For this purpose, the parameters are “normalized” using the Sk scaling factors, so that the same relative change in the value of any parameter gives rise to the same change in O. A lower bound is imposed over the scaling factors to avoid problems with parameters whose reference values are (close to) zero. ReaxFF energies, forces and bond orders were evaluated by the ReaxC module 52 of the LAMMPS molecular dynamics package. 44,45 The optimization of O with regard to all the ReaxFF parameters pk was driven by the Covariance Matrix Adaptation Evolutionary Strategy (CMA-ES) algorithm 33,34 implemented in the Shark machine learning library. 66 CMA-ES is a stochastic algorithm for gradient-free numerical optimization of arbitrary functions. In contrast to the widely known family of “Evolutionary Algorithms” (such as Genetic Algorithm, Differential Evolution and many others), “Evolutionary Strategies” do not focus on evolving individual points (“chromosomes”) forming a sampled population, but instead consider only the distribution of these points. In the particular case of CMA-ES, the population is described by a multivariate Gaussian distribution, defined by its center (mean) µ and covariance matrix C. By iteratively updating these two parameters based on the

14 ACS Paragon Plus Environment

Page 14 of 40

Page 15 of 40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

feedback from the objective function, the algorithm finds the best “Evolutionary Strategy” for a given problem, or in other words, the distribution that generates the best individual points overall. Every iteration of CMA-ES consists of the following phases, depicted in Figure S3: 1. Generation of a new trial population by randomly drawing points (force field parameterizations) from the internal Gaussian distribution 2. Evaluation of the objective function value (fitness score) for each point in the population 3. Ranking (ordering) of the points by their fitness, selection of a “good” subset of the population, assigning weights to points according to their rank 4. Updating the distribution mean and covariance matrix based on the weighted average of positions of the selected points The algorithm has several important advantages: • Evaluation of each individual in the population is completely independent on the others, making this costly step trivially parallel. • Because the value of the objective function is only used to order the points by their fitness, CMA-ES is invariant to any strictly monotonic (order-preserving) transformations of the objective function. • The procedure is relatively insensitive to noise or random evaluation failures, as those have only limited effect on the weighted average of selected points • By virtue of the covariance matrix, the algorithm incrementally learns the correlations between individual optimized parameters and also the optimal step size in each parameter. Any knowledge on which combinations of parameters tend to be “good” or “bad” is thus remembered throughout the optimization run. 15 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 40

• The algorithm is almost parameter-free, requiring only an initial guess of the covariance matrix. Standard CMA-ES needs to diagonalize the covariance matrix in each step, as the generation of the trial population is essentially based on taking normally distributed random steps in the direction of each eigenvector of C. This can also be understood as if CMA-ES was based on performing Principal Component Analysis of the sampled population and then taking steps in the principal components. As the diagonalization step is computationally expensive, it is important to use an efficient algorithm to prevent it from becoming a bottleneck. For our tests of standard CMA-ES, we have used the parallel DSYEVD eigensolver routine from the ATLAS linear algebra library. 67 The VD-CMA-ES variant 32 of CMA-ES was used in our force field optimization protocol to enable the simultaneous optimization of several hundreds of ReaxFF parameters. This variant approximates the covariance matrix C in terms of a diagonal matrix D (representing inequal parameter scales) and a vector v (representing major cross-correlations between parameters): C = D 1 + vv T



This has the effect of restricting the resulting distribution to a “cigar-like” shape, with just one preferred direction (eigenvector) v that tends to follow the “natural gradient direction” of O. In all other directions (perpendicular to v), the width of the distribution is forced to be the same (before scaling by D). This makes the number of evaluations of O needed to learn the shape of the objective function landscape scale linearly with the number of parameters N , because the matrix D and vector v contain only 2N values in total. An ordinary CMA-ES using an unrestricted (symmetric) covariance matrix C requires

N 2 +N 2

elements to be learned. This quadratic complexity together with N > 100 usually pushes the number of iterations needed for sufficient learning out of the feasible range. Additionally, the VD-CMA-ES variant does not need to diagonalize C in every iteration, further improving

16 ACS Paragon Plus Environment

Page 17 of 40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

the computational efficiency and parallel scaling. The matrix D was initialized by Dkk = Sk to make the initial steps in all parameters have the same relative magnitude. Thanks to the excellent self-learning capabilities of VDCMA-ES, the value of its parameter σ denoting the initial step size is not crucial for a successful optimization. This value was set to σ = 0.01, as that appeared to be the optimal order of magnitude. Starting from significantly different values of σ caused the algorithm to spend some time locating the same optimal value, reducing the efficiency at the start of an optimization run. A population of 96 trial parameterizations (individuals) was sampled in every iteration of VD-CMA-ES. This value was chosen to enable an even distribution of work across 12, 16, or 24 CPU cores. The evaluation of each individual is independent of the rest of the population. This enables trivial parallelization over the individuals by multithreaded execution of the serial ReaxC code. Due to the stochastic nature of (VD-)CMA-ES, it is useful to run several replicas of the same optimization run to reliably achieve converged results. Twenty replicas differing only in the initial seed values of the random number generator were used for each optimization run. To reduce computational costs, a pruning strategy was employed to avoid wasting time on sub-optimal replicas. For this purpose, the whole optimization run was segmented into stages consisting of approximately 300 iterations each. The current rate of convergence was estimated by the difference in the average objective function values in the first and last 10% iterations in a given stage. If the estimated convergence rate for a given replica was not high enough to “catch up” with the current best replica (one with the lowest value of O) in at most two more stages, the optimization of the current replica was stopped. The results from the last surviving replica were used for further optimization or analysis.

2.6

Testing and Validation

The quality of the EEM parameters was assessed by calculating charges and electrostatic potentials on 901 geometries in the dedicated testing dataset. The following error metrics 17 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 18 of 40

were employed to measure the reproduction of CM5 atomic charges and QM electrostatic potentials: • Root Mean Squared Error (Oq , Oφ ) • Mean Unsigned Error • Maximum Unsigned Error • R2 over all charge/ESP values from all geometries 2 • Ravg – average per-geometry R2

The accuracy of molecular multipole moments was described by the absolute error k~xEEM − ~xQM k and the relative error k~xQM k−1 k~xEEM − ~xQM k. The vector ~x is formed by the three components of the Cartesian dipole moment or eight components of the traceless quadrupole moment. The transferability of the created ReaxFF force field and its suitability for modeling enzymatic reactions were tested using ppGalNAcT2 glycosyltransferase as a model system. This enzyme has recently been subjected to extensive computational studies by multiple groups, 68,69 and its reaction mechanism is thus fairly well understood. We have calculated ReaxFF potential energy profiles along a previously published reaction path 12,37 based on the Nudged Elastic Band method. In the reaction starting from the Michaelis complex (R), the donor glycosidic bond first dissociates (TS1) forming a short-lived oxocarbenium intermediate (I). Subsequently, a new glycosidic bond to the acceptor is formed with a simultaneous proton transfer (TS2), and finally the system relaxes into the product state (P). The NEB path also extends from both ends towards additional states not directly involved in the enzymatic reaction. On the reactant side, the segment of the path from R’ to R describes the formation of a crucial donor-acceptor hydrogen bond. On the product side, there is an additional energy minimum P’ corresponding to the loss of one of the enzymatic ligands of the catalytic metal ion. ReaxFF energies were calculated along the whole path from R’ to P’ 18 ACS Paragon Plus Environment

Page 19 of 40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

to test the usability of the ReaxFF force field for modeling diverse chemical processes. The evaluation of energy was done in free boundary conditions using the previously published geometries of the whole enzymatic system (6051 atoms) and the hydrogen-capped QM zone (275 atoms).

3 3.1

Results and discussion Improvement in EEM Electrostatics

Optimizing the error weight parameters wq and wie resulted in the final setting of wq = 0.19 and wie = 0.011. The exact values of the weights are not critical, because the errors in all monitored metrics vary smoothly with the weight parameters (Figure S4). It is thus possible to find a combination of weights that results in well-balanced performance across all the metrics. Relative to the lowest errors seen for electrostatic potentials, CM5 charges, and ionization energies, the optimal combination of weights yields a 2.97% higher error in ESPs, 15.3% higher error in charges, and 56.7% higher error in ionization energies. This reflects the differences in the reliability of the reference values and the relative importance of the respective quantities for molecular modeling. The electrostatic potential is the most important factor to reproduce accurately, because it can be calculated precisely by DFT and directly determines the electrostatic interactions that are crucial for the recognition of substrates or inhibitors in the active site of enzymes. The importance of reproducing atomic point charges is much lower, because the partial atomic charges are only an approximation used to describe the molecular electrostatic potential, which we are already fitting directly. Additionally, the reference values of charges come from the empirical CM5 charge model, and thus their reliability is limited. In contrast, the reference values of atomic ionization energies are experimentally known with a high level of accuracy. However, the energy cost of charge transfer in a system consisting of many atoms is not directly determined by the atomic ionization energies, but the complex chemical environment definitely plays an important role. 19 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

For this reason, the atomic ionization energies do not need to be reproduced accurately, but they are still useful to guide the optimization of EEM parameters towards physically relevant, transferable values. Table 1: Comparison of various error metrics for the old and new EEM parameter set ProtReaxFF optimized ESP [e˚ A−1 ] RMSE 0.0945 0.0517 MUE 0.0596 0.0381 0.383 MaxErr 0.509 R2 0.908 0.959 2 0.915 0.966 Ravg Ionization Energies RMSE 2.408 1.124

ProtReaxFF optimized Charges [e] 0.0250 0.0256 0.0167 0.0174 0.326 0.365 0.934 0.931 0.622 0.637

ProtReaxFF

Optimized

0.9

0.9

0.6

0.6

EEM charge [e]

EEM charge [e]

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 40

0.3 0.0 -0.3

C H O N

-0.6 -0.9 -0.9 -0.6 -0.3

0.0

0.3

0.6

0.3 0.0 -0.3

C H O N

-0.6 -0.9 0.9

-0.9 -0.6 -0.3

QM charge [e]

0.0

0.3

0.6

0.9

QM charge [e]

Figure 1: Reproduction of CM5 charges using the original and optimized EEM parameters Table 1 shows the performance of the original and new EEM parameters on the test dataset. The new EEM parameterization delivers a 30% lower RMS error on the CM5 atomic charges, a 64% lower RMS error on the atomic ionization energies, and a negligible difference in errors of electrostatic potentials. The maximum unsigned error in CM5 atomic charges also decreased by 25%. As shown in Figure 1, the improvement in the reproduction 20 ACS Paragon Plus Environment

Page 21 of 40

of atomic charges stems primarily from the elimination of spurious positive charges on carbon atoms. ProtReaxFF also tends to consistently overestimate the negative charges of oxygens and nitrogens, but the new parameterization does not exhibit this problem anymore. Dipole

Quadrupole

60

80

ProtReaxFF Optimized

50

Relative frequency [%]

Relative frequency [%]

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

10 8 6 4 2 0

40 30 20

1

2

3

4

5

10 0

ProtReaxFF Optimized

70 60

5 4 3 2 1 0

50 40 30

2

20

4

6

8 10

10 0

0

1

2

3

4

5

0

1

Relative error

2

3

4

5

6

7

8

9 10

Relative error

Figure 2: Distributions of relative errors of molecular multipoles using the original and optimized EEM parameters. The rightmost histogram bin includes all data points exceeding axis range. Semi-transparent bars are used to make regions of overlap visible. As was already mentioned, the practical significance of accurately reproducing CM5 charges may be disputed. Even though this empirical charge model has been specifically parameterized to yield a transferable and reliable description of molecular multipoles (and thus also intermolecular electrostatic interactions), its selection is still somewhat arbitrary. It is thus important to focus on some more objective measures of the reproduction of molecular electrostatic properties. Figure 2 shows histograms of relative errors of total molecular dipoles and quadrupoles across the test set. It is evident that the new EEM parameterization outperforms ProtReaxFF for both quantities, exhibiting a distinct shift in the error distributions towards zero. In particular the region of the major relative errors is strongly suppressed by the new parameter set, with a several-fold reduction in the frequency of relative errors over 100%. As shown in Figure S5, this is not caused by reducing the relative errors 21 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

for molecules with negligible multipole moments, but the worst-case absolute deviations are suppressed as well. Even though the optimized EEM parameters deliver a significant improvement in reliability and reduction in worst-case errors, the residual average errors are still considerable and unlikely to decrease upon further optimization. Apart from the inherent shortcomings of an atom-centered point charge model, this is caused mainly by fundamental limitations of the simple EEM scheme. Fortunately, there is an ongoing trend in the ReaxFF community aiming to replace the EEM method with a better approximation such as the ACKS2 method by Verstraelen et al. 70

3.2

Efficiency of VD-CMA-ES with Pruning

To demonstrate the suitability of VD-CMA-ES for real ReaxFF parameter optimization problems, we chose the first phase of optimization of the bond, hydrogen bond, and nonbonded interaction parameters (denoted as Phase 4 in Supporting Information). This part of our multi-phase optimization protocol is arguably the most demanding for the optimization algorithm, due to the combination of high dimension, strong interplay between the bonded and non-bonded parameters, wildly varying sensitivity of the objective function to different parameters, and a complex training set. Figure 3 compares the performance of VD-CMA-ES with the more established µ/µw CMA-ES algorithm, all parameters being the same for both algorithms. It is evident that the linear scaling nature of VD-CMA-ES gives it a significant advantage. Thanks to its rapid learning capability, VD-CMA-ES is already visibly outperforming CMA-ES after a few hundred iterations and is able to reach the same value of the objective function approximately four times faster. After approximately 6000 iterations, VD-CMA-ES is already well converged, while CMA-ES is progressing slowly in a linear manner with no signs of leveling out. Further discussion of the performance of VD-CMA-ES and the employed “replica prun22 ACS Paragon Plus Environment

Page 22 of 40

Page 23 of 40

28 Objective function value /1000

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

VD-CMA-ES CMA-ES CMA-ES+scaling

26 24 22 20 18 16 14 12

0

1

2

3

4

5

6

7

Iteration /1000

Figure 3: Comparison of VD-CMA-ES and CMA-ES with or without explicit parameter scaling. Each filled surface denotes the range of objective function values from five replicas. ing” scheme can be found in Supporting Information.

3.3

Description of Small Molecules

Table 2: Averages and 95th percentiles of RMS energy errors in kcal/mol for various types of geometry perturbations Average 95th Percentile ProtReaxFF Optimized ProtReaxFF Optimized Bonds 23.5 9.0 71.3 26.9 Angles 3.6 2.2 7.6 6.5 Torsions 9.8 4.2 28.0 9.3 Out-of-plane 2.6 0.8 7.2 2.9 Nonbonded 13.2 3.7 35.9 10.2 Figure 4 summarizes the distribution of root-mean-square energy errors for different kinds of potential energy curves included in the training set. The first histogram shows the errors in the molecular binding energy of the QM-optimized geometry for each system. It is evident that the optimized force field matches the QM data much more closely than ProtReaxFF. ProtReaxFF exhibits systematic overbinding and a considerable amount of errors over 100 23 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

Relative frequency [%]

Molecular binding energy 10

Bonds

Angles

50

ProtReaxFF Optimized

9 8

40

ProtReaxFF Optimized

45 40

30

35

6

30

25

5

25

20

4

20

3

15

15

2

10

1

5

0

10 5

0 0

100

200

ProtReaxFF Optimized

35

7

-200 -100

Relative frequency [%]

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 24 of 40

0 0

20

40

60

80 100 120

0

5 10 15 20 25 30 35 40

Energy error [kcal/mol]

Energy RMSE [kcal/mol]

Energy RMSE [kcal/mol]

Torsions

Out-of-plane motions

Non-bonded interactions

40

80

ProtReaxFF Optimized

35 30

30

ProtReaxFF Optimized

70 60

25

50

20

40

15

30

10

20

5

10

0

20 15 10 5

0 0

5 10 15 20 25 30 35 40 45

Energy RMSE [kcal/mol]

ProtReaxFF Optimized

25

0 0

5

10

15

20

25

Energy RMSE [kcal/mol]

0

5 10 15 20 25 30 35 40 45

Energy RMSE [kcal/mol]

Figure 4: RMS energy errors for various types of geometry perturbations. Semi-transparent bars are used to make regions of overlap visible.

24 ACS Paragon Plus Environment

Page 25 of 40

kcal/mol. After optimization, the distribution of errors is now well centered around zero, and the significant errors are virtually eliminated. The remaining five histograms in Figure 4 show the errors along the energy curves for different types of geometry perturbations. All five types exhibit a significant decrease in the occurrence of large errors, as quantified by the 95th percentile values in Table 2. The average RMS errors exhibit a similar improvement.

3.4

Reaction Energy Profile for ppGalNAcT2 50

R'

R

TS1

I

TS2'

P

P'

40 Relative energy [kcal/mol]

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

30 20 10 0 -10 -20 -30 -40

M06-2X QM/MM ProtReaxFF Optimized 0

10

20

30

40

50

NEB image number

Figure 5: Reproduction of QM/MM NEB reaction energy profile The transferability and reliability of the optimized force field for studies of enzymatic reactions has been tested on the reaction potential energy profile for ppGalNAcT2 glycosyltransferase. This enzyme catalyses the first step in the mucin-type O-linked glycosylation pathway by transferring N-acetylgalactosamine from a uridine diphosphate donor onto a hydroxyl group a serine or threonine residue of an acceptor protein. An overview of the active site is shown in Figure S7. The reaction proceeds across two barriers, first dissociating the

25 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Table 3: Energies of important points on ppGalNAcT2 reaction path (in kcal/mol) QM/MM ProtReaxFF Optimized R’ 1.0 -15.7 -8.8 0.0 0.0 0.0 R TS1 8.8 24.4 20.3 8.4 7.5 8.2 I TS2 14.4 44.8 30.3 -13.1 -1.5 / -30.5 -12.2 P P’ -5.6 5.6 21.2 glycosidic bond between GalNAc and the donor diphosphate moiety (TS1) and forming a short-lived oxocarbenium intermediate (I). The second barrier (TS2) is the rate-determining one and corresponds to the formation of the new glycosidic bond with the acceptor and a simultaneous proton transfer from the acceptor onto the leaving group. The active site including the UDP-GalNAc donor is enclosed in a pocket that allows only the acceptor sidechain to enter, preventing unwanted hydrolysis of the UDP-GalNAc substrate (see Figure S7). The reaction path has been previously optimized using the Nudged Elastic Band at the generalized gradient approximation DFT level. 12,37 We selected the M06-2X meta-hybrid density functional as the source of reference energy data, as it has been previously shown to be in excellent agreement with DLPNO-CCSD(T)/CBS energies. 12 The single-point ReaxFF potential energy profiles for the original and optimized force field are compared with this reference in Figure 5 and Table 3. Although the optimized force field was in no way trained for glycosyltransferases or ppGalNAcT2 in particular, it exhibits a considerable improvement in reproducing the reference energy profile. The SN i reaction itself is qualitatively well reproduced even with the original force field, although the magnitude of the barriers at TS1 and TS2 is severely overestimated. The optimized force field reduces both barriers, especially the second one, while maintaining excellent agreement with the reference in the positions of the transition states and the relative energy of the intermediate I. Furthermore, the optimized force field improves the reproduction of the product energy minimum significantly, in striking

26 ACS Paragon Plus Environment

Page 26 of 40

Page 27 of 40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

contrast to the overly large depth and incorrect position of this minimum exhibited by ProtReaxFF. There are also two spurious local minima present in the ProtReaxFF profile (at images 7 and 32), but the profile from the optimized force field is smooth in both cases. Apart from the main reaction from R to P, the optimized force field also behaves well in the neighboring regions at R’ and P’. The depth of the R’ minimum is lower, putting the ReaxFF results closer to the reference data. The height of the barrier from P to P’ is also reproduced well by the optimized force field, although the P’ state itself is not stable enough in both force fields. We must note that P’ is not expected to participate in the real enzymatic reaction, as this state corresponds to a partial loss of the stable coordination of the metal cofactor. Reproducing P’ is thus neither necessary nor useful for studies of the glycosyl transfer reaction. It is, however, important for the barrier leading to this state to be sufficiently high that P’ does not interfere with modeling the main reaction. In addition to the energy profile for the whole enzymatic system shown in Figure 5, we have also calculated similar energy profiles for the QM part alone (Figure S6). Both kinds of profiles are qualitatively the same, confirming that the features of the whole-enzyme profiles stem from the underlying chemical changes in the QM zone and not from some fortuitous cancellation of errors across the entire protein. Overall, ReaxFF seems to be suitable for the task of modeling the energetics of complete enzymatic systems without the need for any artificial hybrid schemes. The main strength of ReaxFF becomes apparent when comparing the CPU time needed for a single evaluation of energies and forces (using ADF and LAMMPS-ReaxC on Intel Xeon E5-4617 processors). On the QM/MM DFT level, this takes approximately 200 CPU-hours using the reference M06-2X/TZP method and 4 CPU-hours using the more approximate OPBE/TZP method. 37 Evaluating the ReaxFF potential on the entire 6051-atom system takes only 0.0002 CPU-hours (0.7 CPU-seconds). ReaxFF thus delivers qualitatively similar results 20000 times faster than GGA DFT and a million times faster than meta-hybrid DFT.

27 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

4

Conclusions

We have presented a fully automated approach to the parameterization of the ReaxFF potential for complex biomolecular systems. Starting from a few initial structures of small model molecules provided by the user, an extensive training set comprised of thousands of different geometries was generated by perturbing all internal degrees of freedom. Additionally, a library of radical fragments was also generated by the automated splitting of relevant bonds. This protocol enables the creation of training sets with sufficient coverage of all ReaxFF potential energy terms with minimal manual effort. Parameters for the EEM charge equilibration scheme were fitted against a balanced subset of the training set, generated by clustering structures according to their CM5 atomic charges to eliminate redundancy. The fitting was directly driven by QM electrostatic potentials calculated on a grid around each target molecule. This removes the need to select a QM charge assignment scheme as the source of reference values, and avoids introducing errors stemming from the usage of QM atomic charge models as proxies for the electrostatic potential. Atomic charges calculated using the CM5 model still play a part in fitting the EEM parameters, together with atomic ionization energies. Both serve merely as the sources of weak restraints preventing nonphysical values of atomic charges that would otherwise result from overfitting the electrostatic potentials. This approach reduces the RMS errors by 30% for atomic charges and 64% for ionization energies, while retaining the accuracy of the ProtReaxFF force field for electrostatic potentials and significantly reducing the occurrence of large errors in molecular dipoles and quadrupoles. However, the remaining errors in all quantities are still sizable, highlighting the fundamental limitations of EEM. Further improvement in this regard will probably require switching to a more advanced charge equilibration scheme such as the ACKS2 model. 70 The remaining force field parameters were optimized using the VD-CMA-ES variant of the CMA-ES optimization algorithm. This particular algorithm has been shown to be an excellent choice for difficult numerical optimization problems such as ReaxFF fitting. Even 28 ACS Paragon Plus Environment

Page 28 of 40

Page 29 of 40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

without access to the gradients of the objective function and with hundreds of parameters optimized simultaneously, VD-CMA-ES can rapidly learn the shape of the objective function landscape and converge in a few thousand iterations. When the optimization run is conducted in multiple replicas to obtain reliable results, a simple pruning strategy can be employed to reduce the computational cost by a factor of two, without compromising accuracy. Overall, this approach offers unprecedented speed and reliability and should be suitable for many other force field fitting problems. All parameters of the ReaxFF potential were refitted, starting from the ProtReaxFF parameterization. The target data used for the fitting consisted mainly of M06-2X energies and forces and empirical bond orders. A correction for the dissociation behavior of the EEM model was applied to obtain correct molecular binding energies. After optimization, the ReaxFF potential reproduces reference data fairly well, with worst-case errors within 27 kcal/mol for bond stretching and 10 kcal/mol for all the other degrees of freedom. The resulting force field was validated on a real enzymatic system, ppGalNAcT2 glycosyltransferase. When comparing the potential energy profiles along a previously published reaction path, even the ProtReaxFF offers an acceptable qualitative description of the glycosyl transfer reaction. However, even though the force field presented in this study was by no means fitted for glycosyltransferases, it delivers a sizable quantitative improvement. The relative energies of the intermediate and product minima are in excellent agreement with DFT reference data, previously validated against CCSD(T)/CBS. Although the transition state energies are still overestimated, the error is 27–48% lower than for ProtReaxFF. We can thus conclude that ReaxFF is up to the task of modeling the reaction energetics of a complete enzymatic system, with accuracy approaching much more expensive QM/MM DFT studies. The computational cost of ReaxFF is approximately four to six orders of magnitude lower than for DFT-based QM/MM calculations. As shown by the continuing development of nonreactive force fields over several decades, there are many factors that need to be carefully balanced to obtain the correct long-term

29 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

conformational dynamics of biomolecular systems. It would be unrealistic to assume that ReaxFF can quickly catch up with the common biomolecular force fields via the simple refitting presented here. Instead, a follow-up study will be dedicated to the challenges of biomolecular dynamics and further improvements in ReaxFF parameterization. We also expect that the general training set generation and force field fitting approach presented here will work just as well for other systems. However, testing this hypothesis in practice remains a topic for future work.

Acknowledgement This work has been financially supported by the Ministry of Education, Youth and Sports of the Czech Republic under the project CEITEC 2020 (LQ1601) and the programme INTER COST (LTC17076). This financial support is gratefully acknowledged. This work was supported by the Ministry of Education, Youth and Sports from the Large Infrastructures for Research, Experimental Development and Innovations project “IT4Innovations National Supercomputing Center – LM2015070”. Additional computational resources were provided by the CESNET LM2015042 and the CERIT Scientific Cloud LM2015085, provided under the program “Projects of Large Research, Development, and Innovations Infrastructures”. IT would also like to thank the Scientific Grant Agency of the Ministry of Education of the Slovak Republic and the Slovak Academy of Sciences (grant VEGA-02/0064/15 and 02/0024/16) for their support.

Supporting Information Available Figures S1–S8, further discussion of the performance of CMA-ES, detailed description of the optimization protocol, the entire training set, and the final force field parameters.

30 ACS Paragon Plus Environment

Page 30 of 40

Page 31 of 40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

References (1) Liang, D.-M.; Liu, J.-H.; Wu, H.; Wang, B.-B.; Zhu, H.-J.; Qiao, J.-J. Glycosyltransferases: mechanisms and applications in natural product development. Chem. Soc. Rev. 2015, 44, 8350–8374. (2) Hurtado-Guerrero, R. Recent structural and mechanistic insights into protein OGalNAc glycosylation. Biochem. Soc. Trans. 2016, 44, 61–67. (3) Ard`evol, A.; Rovira, C. Reaction Mechanisms in Carbohydrate-Active Enzymes: Glycoside Hydrolases and Glycosyltransferases. Insights from ab Initio Quantum Mechanics/Molecular Mechanics Dynamic Simulations. J. Am. Chem. Soc. 2015, 137, 7528– 7547. (4) Tvaroˇska, I. Atomistic insight into the catalytic mechanism of glycosyltransferases by combined quantum mechanics/molecular mechanics (QM/MM) methods. Carbohydr. Res. 2015, 403, 38–47. (5) Ard`evol, A.; Iglesias-Fern´andez, J.; Rojas-Cervellera, V.; Rovira, C. The reaction mechanism of retaining glycosyltransferases. Biochem. Soc. Trans. 2016, 44, 51–60. (6) G´omez, H.; Mendoza, F.; Lluch, J. M.; Masgrau, L. QM/MM Studies Reveal How Substrate–Substrate and Enzyme–Substrate Interactions Modulate Retaining Glycosyltransferases Catalysis and Mechanism. Adv. Protein Chem. Struct. Biol. 2015, 100, 225–254. (7) Warshel, A. Multiscale Modeling of Biological Functions: From Enzymes to Molecular Machines (Nobel Lecture). Angew. Chem., Int. Ed. 2014, 53, 10020–10031. (8) Senn, H. M.; Thiel, W. QM/MM Methods for Biomolecular Systems. Angew. Chem., Int. Ed. 2009, 48, 1198–1229.

31 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(9) Mulholland, A. J. Modelling enzyme reaction mechanisms, specificity and catalysis. Drug Discovery Today 2005, 10, 1393–1402. (10) Duster, A. W.; Wang, C.-H.; Garza, C. M.; Miller, D. E.; Lin, H. Adaptive quantum/molecular mechanics: what have we learned, where are we, and where do we go from here? Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2017, 7, e1310. (11) Zheng, M.; Waller, M. P. Adaptive quantum mechanics/molecular mechanics methods. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2016, 6, 369–385. (12) Janoˇs, P.; Trnka, T.; Kozmon, S.; Tvaroˇska, I.; Koˇca, J. Different QM/MM Approaches To Elucidate Enzymatic Reactions: Case Study on ppGalNAcT2. J. Chem. Theory Comput. 2016, 12, 6062–6076. (13) van der Kamp, M. W.; Mulholland, A. J. Combined Quantum Mechanics/Molecular Mechanics (QM/MM) Methods in Computational Enzymology. Biochemistry 2013, 52, 2708–2728. (14) Heimdal, J.; Ryde, U. Convergence of QM/MM free-energy perturbations based on molecular-mechanics or semiempirical simulations. Phys. Chem. Chem. Phys. 2012, 14, 12592–12604. (15) Christensen, A. S.; Kubaˇr, T.; Cui, Q.; Elstner, M. Semiempirical Quantum Mechanical Methods for Noncovalent Interactions for Chemical and Biochemical Applications. Chem. Rev. 2016, 116, 5301–5337. (16) Yilmazer, N. D.; Korth, M. Enhanced semiempirical QM methods for biomolecular interactions. Comput. Struct. Biotechnol. J. 2015, 13, 169–175. (17) Thiel, W. Semiempirical quantum–chemical methods. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2014, 4, 145–157.

32 ACS Paragon Plus Environment

Page 32 of 40

Page 33 of 40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(18) Goldman, N.; Fried, L. E.; Koziol, L. Using Force-Matched Potentials To Improve the Accuracy of Density Functional Tight Binding for Reactive Conditions. J. Chem. Theory Comput. 2015, 11, 4530–4535. (19) Shin, Y. K.; Shan, T.-R.; Liang, T.; Noordhoek, M. J.; Sinnott, S. B.; van Duin, A. C. T.; Phillpot, S. R. Variable charge many-body interatomic potentials. MRS Bull. 2012, 37, 504–512. (20) Liang, T.; Shin, Y. K.; Cheng, Y.-T.; Yilmaz, D. E.; Vishnu, K. G.; Verners, O.; Zou, C.; Phillpot, S. R.; Sinnott, S. B.; van Duin, A. C. T. Reactive Potentials for Advanced Atomistic Simulations. Annu. Rev. Mater. Res. 2013, 43, 109–129. (21) Senftle, T. P.; Hong, S.; Islam, M. M.; Kylasa, S. B.; Zheng, Y.; Shin, Y. K.; Junkermeier, C.; Engel-Herbert, R.; Janik, M. J.; Aktulga, H. M.; Verstraelen, T.; Grama, A.; van Duin, A. C. T. The ReaxFF reactive force-field: development, applications and future directions. npj Comput. Mater. 2016, 2, 15011. (22) Li, P.; Merz, K. M. Metal Ion Modeling Using Classical Mechanics. Chem. Rev. 2017, 117, 1564–1686. (23) Baskes, M. I. Modified embedded-atom potentials for cubic materials and impurities. Phys. Rev. B 1992, 46, 2727–2742. (24) Stuart, S. J.; Tutein, A. B.; Harrison, J. A. A reactive potential for hydrocarbons with intermolecular interactions. J. Chem. Phys. 2000, 112, 6472–6486. (25) Phillpot, S. R.; Sinnott, S. B. Simulating Multifunctional Structures. Science 2009, 325, 1634–1635. (26) Shan, T.-R.; Devine, B. D.; Hawkins, J. M.; Asthagiri, A.; Phillpot, S. R.; Sinnott, S. B. Second-generation charge-optimized many-body potential for Si/SiO2 and amorphous silica. Phys. Rev. B 2010, 82, 235302. 33 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(27) van Duin, A. C. T.; Dasgupta, S.; Lorant, F.; Goddard, W. A. ReaxFF: A Reactive Force Field for Hydrocarbons. J. Phys. Chem. A 2001, 105, 9396–9409. (28) Chenoweth, K.; van Duin, A. C. T.; Goddard, W. A. ReaxFF Reactive Force Field for Molecular Dynamics Simulations of Hydrocarbon Oxidation. J. Phys. Chem. A 2008, 112, 1040–1053. (29) Zhu, R.; Janetzko, F.; Zhang, Y.; van Duin, A. C. T.; Goddard, W. A.; Salahub, D. R. Characterization of the active site of yeast RNA polymerase II by DFT and ReaxFF calculations. Theor. Chem. Acc. 2008, 120, 479–489. (30) Rahaman, O.; van Duin, A. C. T.; Goddard, W. A.; Doren, D. J. Development of a ReaxFF Reactive Force Field for Glycine and Application to Solvent Effect and Tautomerization. J. Phys. Chem. B 2011, 115, 249–261. (31) Monti, S.; Corozzi, A.; Fristrup, P.; Joshi, K. L.; Shin, Y. K.; Oelschlaeger, P.; van Duin, A. C. T.; Barone, V. Exploring the conformational and reactive dynamics of biomolecules in solution using an extended version of the glycine reactive force field. Phys. Chem. Chem. Phys. 2013, 15, 15062–15077. (32) Akimoto, Y.; Auger, A.; Hansen, N. Comparison-based Natural Gradient Optimization in High Dimension. Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation. New York, NY, USA, 2014; pp 373–380. (33) Hansen, N.; Kern, S. Evaluating the CMA Evolution Strategy on Multimodal Test Functions. Parallel Problem Solving from Nature - PPSN VIII. 2004; pp 282–291. (34) Hansen, N. In Towards a New Evolutionary Computation. Advances on Estimation of Distribution Algorithms; Lozano, J., Larranaga, P., Inza, I., Bengoetxea, E., Eds.; Studies in Fuzziness and Soft Computing 192; Springer-Verlag Berlin Heidelberg, 2006; pp 75–102.

34 ACS Paragon Plus Environment

Page 34 of 40

Page 35 of 40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(35) J´onsson, H.; Mills, G.; Jacobsen, K. W. Nudged elastic band method for finding minimum energy paths of transitions. Classical and Quantum Dynamics in Condensed Phase Simulations. 1998; pp 385–404. (36) Henkelman, G.; J´onsson, H. Improved tangent estimate in the nudged elastic band method for finding minimum energy paths and saddle points. J. Chem. Phys. 2000, 113, 9978–9985. (37) Trnka, T.; Kozmon, S.; Tvaroˇska, I.; Koˇca, J. Stepwise Catalytic Mechanism via ShortLived Intermediate Inferred from Combined QM/MM MERP and PES Calculations on Retaining Glycosyltransferase ppGalNAcT2. PLoS Comput. Biol. 2015, 11, e1004061. (38) Mortier, W. J.; Ghosh, S. K.; Shankar, S. Electronegativity-equalization method for the calculation of atomic charges in molecules. J. Am. Chem. Soc. 1986, 108, 4315–4320. (39) Louwen, J. N.; Vogt, E. T. C. Semi-empirical atomic charges for use in computational chemistry of molecular sieves. J. Mol. Catal. A: Chem. 1998, 134, 63–77. (40) O’Boyle, N. M.; Banck, M.; James, C. A.; Morley, C.; Vandermeersch, T.; Hutchison, G. R. Open Babel: An open chemical toolbox. J. Cheminf. 2011, 3, 33. (41) Halgren, T. A. Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. J. Comput. Chem. 1996, 17, 490–519. ˇ aˇc, J.; Riley, K. E.; Hobza, P. S66: A Well-balanced Database of Benchmark Inter(42) Rez´ action Energies Relevant to Biomolecular Structures. J. Chem. Theory Comput. 2011, 7, 2427–2438. (43) Fanourgakis, G. S.; Xantheas, S. S. Development of transferable interaction potentials for water. V. Extension of the flexible, polarizable, Thole-type model potential (TTM3F, v. 3.0) to describe the vibrational spectra of water clusters and liquid water. J. Chem. Phys. 2008, 128, 074506. 35 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 36 of 40

(44) Plimpton, S. Fast Parallel Algorithms for Short-Range Molecular Dynamics. J. Comput. Phys. 1995, 117, 1–19. (45) LAMMPS Molecular Dynamics Simulator. http://lammps.sandia.gov/index.html. (46) Babin, V. TTM2.1-F, TTM3-F and TTM4-F water models. 2013;

http://

deleramentum.net/codes/ttm/. (47) Eisenhaber, F.; Lijnzaad, P.; Argos, P.; Sander, C.; Scharf, M. The double cubic lattice method: Efficient approaches to numerical integration of surface area and volume and to dot surface contouring of molecular assemblies. J. Comput. Chem. 1995, 16, 273–284. (48) Hess, B.; Kutzner, C.; van der Spoel, D.; Lindahl, E. GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation. J. Chem. Theory Comput. 2008, 4, 435–447. (49) Jarvis, R. A.; Patrick, E. A. Clustering Using a Similarity Measure Based on Shared Near Neighbors. IEEE Trans. Comput. 1973, C-22, 1025–1034. (50) Akin-Ojo, O.; Song, Y.; Wang, F. Developing ab initio quality force fields from condensed phase quantum-mechanics/molecular-mechanics calculations through the adaptive force matching method. J. Chem. Phys. 2008, 129, 064108. (51) Koziol, L.; Fried, L. E.; Goldman, N. Using Force Matching To Determine Reactive Force Fields for Water under Extreme Thermodynamic Conditions. J. Chem. Theory Comput. 2017, 13, 135–146. (52) Aktulga, H. M.; Fogarty, J. C.; Pandit, S. A.; Grama, A. Y. Parallel reactive molecular dynamics: Numerical methods and algorithmic techniques. Parallel Comput. 2012, 38, 245–259. (53) Valiev, M.; Bylaska, E. J.; Govind, N.; Kowalski, K.; Straatsma, T. P.; Van Dam, H. J. J.; Wang, D.; Nieplocha, J.; Apra, E.; Windus, T. L.; de Jong, W. A. NWChem: A 36 ACS Paragon Plus Environment

Page 37 of 40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

comprehensive and scalable open-source solution for large scale molecular simulations. Comput. Phys. Commun. 2010, 181, 1477–1489. (54) Zhao, Y.; Truhlar, D. G. Density Functionals with Broad Applicability in Chemistry. Acc. Chem. Res. 2008, 41, 157–167. (55) Zhao, Y.; Truhlar, D. G. The M06 suite of density functionals for main group thermochemistry, thermochemical kinetics, noncovalent interactions, excited states, and transition elements: two new functionals and systematic testing of four M06-class functionals and 12 other functionals. Theor. Chem. Acc. 2008, 120, 215–241. (56) Weigend, F.; Ahlrichs, R. Balanced basis sets of split valence, triple zeta valence and quadruple zeta valence quality for H to Rn: Design and assessment of accuracy. Phys. Chem. Chem. Phys. 2005, 7, 3297–3305. (57) Weigend, F. Accurate Coulomb-fitting basis sets for H to Rn. Phys. Chem. Chem. Phys. 2006, 8, 1057–1065. (58) Grimme, S.; Antony, J.; Ehrlich, S.; Krieg, H. A consistent and accurate ab initio parametrization of density functional dispersion correction (DFT-D) for the 94 elements H-Pu. J. Chem. Phys. 2010, 132, 154104. (59) Goerigk, L.; Grimme, S. A thorough benchmark of density functional methods for general main group thermochemistry, kinetics, and noncovalent interactions. Phys. Chem. Chem. Phys. 2011, 13, 6670–6688. (60) Marenich, A. V.; Jerome, S. V.; Cramer, C. J.; Truhlar, D. G. Charge Model 5: An Extension of Hirshfeld Population Analysis for the Accurate Description of Molecular Interactions in Gaseous and Condensed Phases. J. Chem. Theory Comput. 2012, 8, 527–541.

37 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

´ Generator of AIM Wavefunction files NWCHEM [WFN (61) V´azquez-Mayagoitia, A. /

WFX].

https://sites.google.com/site/alvarovazquezmayagoitia/goals/

codes/nwchem-notes/generator-of-aim-wavefunction-files-nwchem. (62) Kannemann, F. O.; Becke, A. D. van der Waals Interactions in Density-Functional Theory: Intermolecular Complexes. J. Chem. Theory Comput. 2010, 6, 1081–1088. (63) Otero-de-la Roza, A.; Johnson, E. R. Non-covalent interactions and thermochemistry using XDM-corrected hybrid and range-separated hybrid density functionals. J. Chem. Phys. 2013, 138, 204109. (64) Ward, J. H. Hierarchical Grouping to Optimize an Objective Function. J. Am. Stat. Assoc. 1963, 58, 236–244. (65) Dennis, J. E., Jr.; Schnabel, R. B. Numerical Methods for Unconstrained Optimization and Nonlinear Equations; Classics in Applied Mathematics 16; Society for Industrial and Applied Mathematics: Philadelphia, PA, 1996. (66) Igel, C.; Heidrich-Meisner, V.; Glasmachers, T. Shark. J. Mach. Learn. Res. 2008, 9, 993–996. (67) Whaley, R. C.; Petitet, A. Minimizing development and maintenance costs in supporting persistently optimized BLAS. Softw: Pract. Exper. 2005, 35, 101–121. (68) G´omez, H.; Rojas, R.; Patel, D.; Tabak, L. A.; Lluch, J. M.; Masgrau, L. A computational and experimental study of O-glycosylation. Catalysis by human UDP-GalNAc polypeptide:GalNAc transferase-T2. Org. Biomol. Chem. 2014, 12, 2645–2655. (69) Lira-Navarrete, E.; Iglesias-Fern´andez, J.; Zandberg, W. F.; Compa˜ n´on, I.; Kong, Y.; Corzana, F.; Pinto, B. M.; Clausen, H.; Peregrina, J. M.; Vocadlo, D. J.; Rovira, C.; Hurtado-Guerrero, R. Substrate-Guided Front-Face Reaction Revealed

38 ACS Paragon Plus Environment

Page 38 of 40

Page 39 of 40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

by Combined Structural Snapshots and Metadynamics for the Polypeptide NAcetylgalactosaminyltransferase 2. Angew. Chem., Int. Ed. 2014, 53, 8206–8210. (70) Verstraelen, T.; Ayers, P. W.; Van Speybroeck, V.; Waroquier, M. ACKS2: Atomcondensed Kohn-Sham DFT approximated to second order. J. Chem. Phys. 2013, 138, 074108.

39 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Graphical TOC Entry

M06-2X QM/MM ReaxFF: before, after

40 ACS Paragon Plus Environment

Page 40 of 40