Energy-Based Molecular Fragmentation Methods - Chemical Reviews

Apr 6, 2015 - Research School of Chemistry, Australian National University, ... Chemistry Division Medal of the Royal Australian Chemical Institute. ...
1 downloads 0 Views 9MB Size
Review pubs.acs.org/CR

Energy-Based Molecular Fragmentation Methods Michael A. Collins*,† and Ryan P. A. Bettens*,‡ †

Research School of Chemistry, Australian National University, Canberra, ACT 0200, Australia Department of Chemistry, National University of Singapore, 3 Science Drive 3, Singapore 117543, Singapore



3.6. Kernel Energy Method (KEM) 3.6.1. Fragmentation Algorithm 3.6.2. Gradients and Other Derivatives 3.6.3. Accuracy 3.7. Many-Overlapping-Body (MOB) Expansion 3.7.1. Fragmentation Algorithm 3.7.2. Gradients and Other Derivatives 3.7.3. Accuracy 3.8. Generalized Many-Body Expansion (GMBE) Method 3.8.1. Fragmentation Algorithm 3.8.2. Gradients and Other Derivatives 3.8.3. Accuracy 3.9. Summary of Energy-Based Fragmentation Methods 3.9.1. Estimates of the Electronic Energy 3.9.2. Energy Derivatives 3.9.3. Accuracy 4. Applications and Examples 4.1. Energies of Large Molecules 4.1.1. Geometry Optimization 4.2. Molecular Clusters 4.2.1. Geometry Optimization 4.2.2. Examples 4.3. Molecular Electrostatic Potential 4.4. Molecular Vibrations 4.5. Nuclear Magnetic Resonance (NMR) 4.6. Crystal Structures, Energetics, and Dynamics 4.7. Chemical Reactions and Molecular Dynamics 4.7.1. Global Potential Energy Surfaces 4.7.2. Reaction Paths 4.7.3. Ab Initio Molecular Dynamics 5. Speculations and Future Developments 5.1. How Accurately Do We Need To Know the Energy? 5.2. Is Energy-Based Molecular Fragmentation Accurate Enough for Equilibrium Properties? 5.3. What about Chemical Barriers and Rates in Large Molecules? 5.4. Reaction Mechanisms in Large Molecules 5.5. Manipulation of Large Molecules 5.6. On-the-Fly Monte Carlo Simulations 5.7. On-the-Fly Chemical Dynamics 5.8. Transition Metals and Organometallics 5.9. Defects in Crystals

CONTENTS 1. Introduction 1.1. Objectives 1.2. The Scaling Problem 1.3. Tackling the Scaling Problem 1.3.1. QM/MM 1.3.2. Quantum Mechanical Methods 1.4. Molecular Energy-Based Fragmentation Methods 2. History 3. Methods and Principles 3.1. Electrostatically Embedded Generalized Molecular Fractionation with Conjugate Caps (EE-GMFCC) 3.1.1. Fragmentation Algorithm 3.1.2. Gradients and Other Derivatives 3.1.3. Accuracy 3.2. Generalized Energy-Based Fragmentation (GEBF) 3.2.1. Fragmentation Algorithm 3.2.2. Gradients and Other Derivatives 3.2.3. Accuracy 3.3. Molecular Tailoring Approach (MTA) 3.3.1. Fragmentation Algorithm 3.3.2. Gradients and Other Derivatives 3.3.3. Accuracy 3.3.4. Associated Schemes 3.4. Systematic Molecular Fragmentation (SMF) 3.4.1. Fragmentation Algorithm 3.4.2. Gradients and Other Derivatives 3.4.3. Accuracy 3.4.4. Crystals and Crystal Surfaces 3.5. Combined Fragmentation Method (CFM) 3.5.1. Fragmentation Algorithm 3.5.2. Gradients and Other Derivatives 3.5.3. Accuracy

© XXXX American Chemical Society

B B B C C D D E G

G G I I I I J J J J K K K K K L L L L L N N

N N N N N O P P P P Q Q S S T T T T U V V W W X Y Z AA AA AB AC AC AC AC AD AD AE AE AE AE AE

Special Issue: Calculations on Large Systems Received: August 19, 2014

A

DOI: 10.1021/cr500455b Chem. Rev. XXXX, XXX, XXX−XXX

Chemical Reviews 5.10. Structure and Reactivity of Surfaces 6. Concluding Remarks Author Information Corresponding Authors Notes Biographies Acknowledgments References

Review

any ab initio quantum chemistry method can be used to evaluate the right-hand-side of eq 1. It is important to remember that ab initio quantum chemistry methods only provide an approximation, , F, to the exact energy, EF. If calculations need only be performed on relatively small molecules, then more reliably accurate ab initio methods can be used. If the size of the molecular fragments, Fi, is independent of the size of the whole molecule, and the number of fragments, Nfrag, increases in linear proportion to the number of atoms in the molecule, then this means that the computational task of evaluating the right-hand-side of eq 1 increases only linearly with the number of atoms in the molecule. In many of the fragmentation methods described later in this Review, this linear scaling can be achieved. Furthermore, because the fragments of the greater system are separate simple molecules, the right-hand-side of eq 1 is trivially subject to massively parallel computing, with each fragment molecule being assigned, if desired, to a separate processing unit. In that way, the calculation time would be independent of the size of the molecule, only dependent on the number of processors available. A number of methods have appeared over the past decade, which provide estimates of the electronic energy (and other properties) for covalent molecules in the form of eq 1. Our intention is that this Review can be used as a working document for those wishing to implement any of the energybased molecular fragment methods for covalent molecules (and clusters or liquids or crystals). We will not consider methods that are limited to molecular clusters. Thus, in section 3 we pay particular attention to the details of the current implementation of each of the methods: how the fragments are constructed and the final energy, EF, is obtained. It is hoped that an interested reader should be able to implement these methods, or utilize the code made available by some of the authors, to good effect after having studied this Review and the literature cited herein. In section 4, we briefly describe some of the applications of the energy-based fragmentation methods, with a few illustrations. In section 5, we shamelessly speculate on possible new directions and applications in this field. This Review finishes with some concluding remarks. However, we begin by setting energy-based molecular fragmentation methods in the context of attempts in computational chemistry to avoid the roadblock of impossibly long computation times for large molecules. In section 2, we then briefly reprise the history of the development of energy-based molecular fragmentation methods, as this provides some context for the current state-of-the-art, as described in section 3.

AF AF AF AF AF AF AG AG

1. INTRODUCTION 1.1. Objectives

A major objective of theoretical and computational chemistry is the calculation of the energy and properties of molecules, so that chemical reactivity and material properties can be understood from first principles. As a practical matter, the aim is to complement the knowledge we gain from experiments, particularly where experimental data may be incomplete or very difficult to obtain. For a chemist, molecules are composed of nuclei and electrons. In the Born−Oppenheimer approximation, we separate the total molecular energy into an electronic component and a nuclear motion component. In this Review, we will mainly be concerned with the total electronic energy of a molecule, which forms the potential energy surface on which the nuclear motion takes place. If we can calculate the total electronic energy of a molecule, then we will have gone a long way (but not all of the way) toward calculating the reactivity and other properties of the molecule. The roadblock on the way to this laudable objective is the fact that the computational difficulty of calculating the molecular electronic energy increases very rapidly with the number of electrons and nuclei in the molecule, so rapidly as to make straightforward application of computational methods, which are reliable for small molecules, impossible for large molecules. Much effort has been expended on overcoming this problem in various ways. One broad approach to this task can be classified as “fragmentation methods”. In 2012, Gordon et al.1 published an excellent review in this journal entitled “Fragmentation Methods: A Route to Accurate Calculations on Large Systems”. We will not try to repeat this very broad review here. Rather, we concentrate on a subclass of fragmentation methods, which we denote as “energy-based molecular fragmentation methods”. The defining characteristic of an energy-based molecular fragmentation method is that the molecule (also cluster of molecules, or liquid or solid) is broken up into a set of relatively small molecular fragments, in such a way that the electronic energy, EF, of the full system F is given by a sum of the energies of these fragment molecules:

1.2. The Scaling Problem

Ultimately, much of chemistry can be reduced to solving the time-dependent Schrödinger equation for the system under study. For many applications, the time-dependency need not be explicitly included, with the more familiar form of the Schrödinger equation:

Nfrag

EF =

∑ ciEi + εF i=1

(1)

Ĥ Ψn = En Ψn

where Ei is the energy of a relatively small molecular fragment, Fi. The ci are simple coefficients (typically integers), and Nfrag is the number of fragment molecules. Some of the methods also require a correction to the energies evaluated from the fragments. However, where necessary, this correction, εF, is easily computed. The power of energy-based methods is based on the following facts. If the molecular fragments are small enough,

(2)

being solved instead. The most accurate methods available today, which directly solve this equation for molecules, can only be applied to a very small subset of chemical problems. The reason for this lies in the fact that the Schrödinger equation explicitly describes every electron−electron and electron− nucleus and nucleus−nucleus interaction present in the system under study, and that these interactions must be dealt with in B

DOI: 10.1021/cr500455b Chem. Rev. XXXX, XXX, XXX−XXX

Chemical Reviews

Review

quantum mechanical methods. Below we highlight various techniques, other than energy-based molecular fragmentation methods, which are designed to circumvent the above scaling problem. Each of the techniques outlined below has associated with it numerous publications and usually numerous significant contributors. Here, we shall only very briefly describe the idea behind each method, and, aside from the review of Gordon et al.,1 we point the reader to additional specialized reviews of the topics.

the context of a wave equation. Even if one separates the nuclear and electronic motions by applying the Born− Oppenheimer approximation, and concentrates on treating the wave-like nature of them separately, the computational task required to solve either situation independently is immense for anything but the smallest systems. The immensity of the task can be appreciated by concentrating on solving the electronic Schrödinger equation, where the nuclei are located at fixed positions and only the electrons are described by the wave equation. Focusing our attention on ab initio methods, the ground-state wave function, Ψ0, is first approximated by placing all of the electrons in the system in molecular orbitals that are constructed from a linear combination of Gaussian-type functions. The more Gaussiantype functions, or the larger the basis set, the better the electrons’ behavior within the system can be described. In the simplest approximation, each electron’s motion occurs in the average electrostatic field of the nuclei and the remaining electrons in the system. After taking account of the fact that electrons are Fermions, the best solution to the Schrödinger equation can be obtained by varying the precise linear combinations of the basis functions that make up each molecular orbital in the system. This represents the crudest ab initio solution to the Schrödinger equation and is known as the Hartree−Fock (HF) method. The method is generally not considered accurate, but does capture the quantum mechanical nature of bonding and does, to some extent, allow for the making and breaking of chemical bonds. The computational expense of the HF method depends on the number of electrons in the system under consideration and the size of the basis set used. If we simply denote the “size” of the system with the canonical integer, N, then the HF method computation time nominally scales as N4. A cunning implementation of the HF method can reduce this scaling to around N3. Nevertheless, a doubling of the system size, that is, double the number of electrons and basis functions, and the calculation will take roughly between 8 and 16 times longer. The HF method is not accurate. To approach reliable accuracy, some amount of “electron correlation” needs to also be included.2 This may be achieved by the use of perturbation theory, in particular the Møller−Plesset (MP) method. Higherorder perturbative treatments are more accurate, but the computation time scales progressively more rapidly with N. At the MP2 level (second order),3 the scaling is around N4−N5. At MP3 (third order), the scaling is N6, and at MP4 (fourth order), the scaling is N7. Alternatively, a hybrid HF and density functional theory (DFT) treatment may be implemented. The accuracy of the results depends on the application, basis set size, and functional used, but often the quality of the results is similar to that of MP2. The scaling for such a treatment is essentially the same as HF, but with a larger constant of proportionality. One alternative treatment of electron correlation is known as coupled-cluster (CC) methods.2 A coupledcluster treatment that includes all singles and double electron excitations (CCSD) scales as N6, while the generally reliable CCSD(T) method scales as N7. CCSD(T) stands for couplecluster with all singles, doubles, and perturbative triple electron excitations, and is fairly widely accepted as the “gold standard” for quantum chemical calculations. Because of the prohibitive scaling of accurate quantum mechanical methods with system size, alternative approximations have been sought that scale much slower with system size, yet attempt to maintain the accuracy of the higher level

1.3. Tackling the Scaling Problem

1.3.1. QM/MM. One approach that deals with the poor scaling of quantum mechanical methods with system size is not to treat the entire system with quantum mechanical methods. Often one is not interested in a detailed and highly accurate quantum mechanical treatment of the entire system, but rather a selected much smaller region of the system, for example, an active site in an enzyme. If the focus of study is a much smaller region, then a hybrid quantum mechanical-classical mechanical (or molecular mechanical) method may be profitably employed. Such methods have come to be known as QM/ MM methods. Generally there are two schemes: (i) an additive scheme, and (ii) a subtractive scheme. The schemes are differentiated by their approaches to computing interactions between the quantum mechanical part of the system and the molecular mechanical part of the system. The additive scheme was first introduced in 1976 by Warshel and Levitt,4 who (along with Martin Karplus) were awarded the 2013 Nobel Prize in Chemistry for this type of approach. In this scheme, the total energy of the system is written as EQM/MM = EQM, S + EMM, F / S + EQM − MM

(3)

where S represents the small region of interest inside the full system, F; EQM,S denotes the energy of region S, evaluated using quantum chemistry methods; EMM,F/S denotes the energy of the full system, excluding S, evaluated using molecular mechanics; and EQM−MM is the interaction energy between S and the rest of the full system. The subtractive scheme appeared 19 years after the additive scheme. It was originally developed by Maseras and Morokuma and then known as IMOMM,5 but soon afterward was developed further by Morokuma and co-workers, and is more commonly known as ONIOM. In its original form, the total energy of the system was written as EONIOM(QM:MM) = EQM, S + EMM, F − EMM, S

(4)

where EMM,F is the energy of the full system, evaluated using molecular mechanics, and EMM,S is the energy of region S, evaluated using molecular mechanics. The method has been extended to a “multiple layer scheme” (where region S1 is contained within region S2, which is contained within F), along with other variations. It is revealing to provide the general expression for the n-layer scheme: n

EONIOM(QM1:QM2:...:QMn − 1:MM) =

n−1

∑ ELi ,S

i

i=1



∑ ELi+ 1,S

i

i=1

(5)

Here, Li represents quantum mechanical level-of-theory i with i = 1 representing the highest (most accurate and costly) level of quantum mechanical theory, and increasing values of i in Li denote progressively more and more approximate levels of theory until at Ln, the energy is obtained from molecular mechanics. Each region Si progressively incorporates a larger C

DOI: 10.1021/cr500455b Chem. Rev. XXXX, XXX, XXX−XXX

Chemical Reviews

Review

and larger region of the system under study with increasing i, until at i = n, Sn = F − the entire system. With this scheme, all interactions between region Si and Si+1 are accounted for at the Li+1 level of theory. An additional point of note is that if level Ln, the lowest level of theory, was not molecular mechanics, but a quantum mechanical method, then this particular n-level ONIOM would represent an all quantum mechanical method. For further details of the additive scheme and other significant developments and workers in the field, see the 2011 review of Kamerlin, Vicatos, Dryga, and Warshel6 and the 2009 review of Senn and Thiel.7 For the subtractive scheme, ONIOM, see the 2012 review by Chung, Hirao, Li, and Morokuma.8 1.3.2. Quantum Mechanical Methods. If describing the entire system quantum mechanically is desired, then the steep scaling of quantum mechanical methods with system size needs to be directly addressed. It has been known for some time that the energy of a molecule, given by its standard molar heat of formation, ΔfHm° (298 K), can sometimes be approximately obtained to within a couple of kcal mol−1 from group additivity methods (see, for example, ref 9 and references therein). Indeed, for the simple homologous series of n-CmH2m+2 hydrocarbons from m = 2−20, the change in the heat of formation with the addition of a (CH2) group is nearly constant: ΔΔfH° (298 K) = −20.6 ± 1.9 kJ mol−1 (2σ error),10 with most of the error deriving from experiment. The approximate additivity and transferability of group enthalpy data implies that the total electronic energy of a molecule can be thought of as being approximately the sum of electronic energies of constituent groups, that is, energy contributions that are physically localized around sets of valence bonded atoms. Pulay11,12 noted in 1983 that a great deal of the steep scaling in correlated quantum mechanical methods was due an unphysical delocalization of virtual orbitals, which distributed the effect of a few important electronic configurations throughout the configuration space. The fact that the correlation energy of pairs of electrons is similar in systems of different sizes suggests that large differences in computational time for such systems are “unphysical”. Indeed, the steep scaling of computational time with system size can be significantly reduced as follows: A HF calculation is performed on the entire system; localized molecular orbitals (LMOs) are formed; and only excitations involving colocated LMOs are used in the correlation treatment. The localization of molecular orbitals obtained from an HF wave function, or some other approximate wave function, including those obtained from linear-scaling techniques, is an intense area of research. Many notable workers have contributed to the field and extended the approach to various electron correlation treatments, including CCSD(T).13,14 The interested reader is directed toward Gordon et al.’s review1 and the 2014 paper by Li and Li for a more specialized review,15 including the method of Cluster-in-Molecule. The divide and conquer method, originally developed by Yang16,17 in 1991 for DFT calculations, also utilizes an electronic localization idea: divide the full system into many fragments, determine the electron density of each fragment separately, and sum the corresponding contributions from all fragments to obtain the total electron density from which the energy of the full system may be determined. A 1998 review of the approach was published by Yang18 that describes the state of the technique until that point. This density-based fragmentation approach has been extended to include electron

correlation methods and has also been combined with the LMO method, which further improves the scaling of this quantum mechanical approach with system size. This approach has seen many significant contributions from just as many workers. The interested reader is directed toward a number of reviews on the topic.19−22 Another approach to dealing with the steep scaling of quantum mechanical methods with system size is the fragment molecular orbital (FMO) method, first introduced by Kitaura et al.23 in 1999. In this method, the target molecule is broken down into small disjoint sets of atoms, for example, −CH2− in an alkyl chain, which we shall designate as “groups” (originally referred to as fragments) in this Review. These groups correspond to the usual functional groups discussed in chemistry and are a very natural choice. However, the definition of a group in FMO can be made substantially larger for polypeptides, and involves two waters for water clusters. Nevertheless, the method involves first fragmenting the target molecule into the constituent groups and solving the Schrödinger equation for each of the individual groups in the presence of the full coulomb field of the remainder of the molecule. Note that this step is performed iteratively, and results in a fully self-consistent set of energies of the individual groups. The second step is to utilize the converged electron density from each of the groups in a calculation that involves solving the Schrödinger equation for all possible pairs of groups, in the presence of the full coulomb field due to the remainder of the molecule. This computation is performed once with no updates in the electron density performed. At this point, the final energy is obtained from N

EF =

N−1

N

∑ Ei +

∑ ∑

i=1

i=1 j=i+1

(Eij − Ei − Ej) (6)

where Ei is the energy of group i, and Eij is the energy of the two-group fragment ij. It is clear from this expression that the final energy is a many-body expression of the total energy of the molecule truncated after the two-body interaction. Very naturally, the above approach is called FMO2. A threebody correction has also been applied, and that more accurate, but more costly, method is designated FMO3. The double sum in FMO2 and triple sum in FMO3 formally imply a quadratic and cubic scaling with system size. However, accurate distancebased approximations can be made within the method that vastly reduces the overall computational effort leaving essentially a linear-scaling method with system size. The above is essentially a HF treatment. The method has been extended to include MP2 correlation and CCSD(T) among others.24−28 1.4. Molecular Energy-Based Fragmentation Methods

Fully quantum methods to solve the Schrödinger equation for the many-electron wave function of a large molecule are truly challenging. The efforts to make this a tractable problem, briefly mentioned above, require not only very clever ideas, but prodigious work in programming how to manipulate all of the components of a many-electron wave function and the associated contributions to the total electronic energy. For those (like us) who tremble even thinking about such a programming task, energy-based molecular fragmentation methods are the answer. No new electronic structure methods need be devised or programmed. One simply uses any of the methods currently available in the established quantum D

DOI: 10.1021/cr500455b Chem. Rev. XXXX, XXX, XXX−XXX

Chemical Reviews

Review

than (say) all atoms within some volume. The electron density in these fragments was evaluated and combined into a molecular density specifically for the evaluation of the electrostatic potential of the molecule (MESP). This approach to evaluation of the MESP was further developed in later years.45,46 In 1999, Kitaura and co-workers23 proposed their fragment molecular orbital (FMO) theory (see section 1.3.2) in which localized molecular orbitals of overlapping fragments were evaluated and used to estimate the molecular electronic energy via a HF calculation. Indeed, there may be many much earlier papers in which molecular properties were estimated from the properties of molecular fragments because, for many, many decades, chemists have viewed molecules as collections of functional groups. We expect the reactivity of a large organic molecule to reflect the reactivity of its functional groups, albeit influenced by the presence of neighboring functional groups. In 2003, Zhang and Zhang30 proposed that the interaction of a small molecule with a protein should be accurately estimated from the interaction of the small molecule with each of the protein residues. To implement this idea, a protein is split into individual residues by breaking all the amide C−N bonds. A protein is essentially a simple chain of residues. The chemical environment of each residue is restored by capping each residue with part of the formerly connected residues on each side, to give fragments of almost three residues in length. Two “caps” on neighboring residues are then reconnected, and each of these smaller composite molecules is subtracted from the sum of capped residues to give a representation of the whole protein as a sum and difference of these fragments. They called this approach “molecular fractionation with conjugate caps” (MFCC). The interaction energies of a “substrate” molecule with the capped residues are evaluated using ab initio quantum chemistry in the usual way, and the results are added. The interaction energies of the “substrate” molecule with the composite pairs of caps are evaluated using ab initio quantum chemistry in the usual way, and the results are subtracted from the sum. This gives an accurate value for the total interaction energy because the net effect is to interact the small molecule with each residue, while each residue is placed in its original chemical environment (at least in terms of its adjoining residues). In 2004, Deev and Collins,47 Li, Li, and Fang48 and Huang, Massa, and Karle31 independently (and almost simultaneously) realized that the total electronic energy of any molecule might be directly estimated from the sum of the energies of overlapping fragments, minus the energy of the “overlaps”, with papers published in early 2005. Results in these early papers were stunning; ab initio energies of modest sized molecules were reproduced from fragment energies to within a few kJ mol−1. These methods for fragmenting molecules into overlapping fragments have some similarities and significant differences, which will be discussed in more detail in the next section. Each method involves breaking covalent bonds, and the valence is restored at the “dangling” atoms by the addition of hydrogen atoms as “caps” (in contrast to the partial residues used by Zhang and Zhang). The use of such hydrogen caps is familiar in mixed quantum molecular mechanics (QM/MM) treatments such as ONOIM,8,49,50 at the boundary where a quantum treatment converts to a molecular mechanics treatment. The “systematic molecular fragmentation” (SMF) method of Collins and co-workers47 forms fragments by

chemistry program packages to calculate the energy of molecular fragments of the large molecule. In qualitative terms, this approach rests on the basic idea that much of chemistry is a local phenomenon, supported by the empirical evidence that the reactivity of molecules can be understood in terms of the functional groups in a molecule, and the proximity of those functional groups to one another. In other language, we start with a “tight-binding” picture,29 in which electrons are strongly associated with the atomic nuclei within a small collection of atoms, a functional group, or a small set of adjacent functional groups. The energy of a molecule is taken to be the total energy of those small sets of atoms, corrected for the interactions between those sets, evaluated using perturbation theory or a many-body expansion. The trick is to translate this qualitative idea into an algorithm or procedure that evaluates molecular electronic energies to chemical accuracy, that is, accurate enough to yield reaction enthalpies and reaction barrier heights that can predict observable properties to within experimental error. Part of this trick is to define functional groups sensibly; for example, no one expects electrons to be usefully localized to individual CH groups in benzene, or to one metal atom in a conducting metal. Part of this trick is to devise an algorithm for fragmenting molecules without introducing spurious effects. The final task is to implement these ideas in a way that evaluates the molecular electronic energy in a computational time that increases only linearly with the size of the molecule, so that reliable ab initio quantum chemistry can be applied to virtually any molecule. We are writing this Review because this trick has been accomplished, in several different approaches, by different research groups.

2. HISTORY From our perspective, a number of the current energy-based molecular fragmentation approaches were largely inspired by the 2003 paper on molecular fractionation with conjugate caps (MFCC) by D. W. Zhang and J. Z. H. Zhang.30 However, there were several related threads running through the literature in previous years, including the “divide and conquer” concept, the “molecular tailoring” idea (see below), “quantum crystallography”,31 the molecular fragments considered by Christoffersen and co-workers,32,33 and no doubt many others. In this field, we often find that such threads appear to be independent, in that little cross citation occurs (at least initially). Perhaps this is because researchers have approached the task of estimating the properties of large molecules from quite different directions. As mentioned above, some of this early work lay within density functional theory (DFT), with Yang’s 1991 “divide and conquer” (D&C) scheme,17 whereby the electron density of a molecule might be estimated from the electron density of small molecular subsystems. These subsystems might be “overlapping”, in the sense that an atom might appear in more than one subsystem. The estimated electron density for the whole molecule was then used in a DFT calculation for the molecular energy. Later, the density matrix was estimated using this D&C scheme and applied to DFT calculations,34 and semiempirical calculations.35−37 The D&C approach has also been extended to other post HF methods and to energy gradients by Nakai and co-workers,38−40 and others.41−43 In 1994, Gadre and co-workers44 proposed a method called the “molecular tailoring approach” (MTA) for breaking a molecule into overlapping fragments. The fragments were chosen as sets of atoms connected by chemical bonds, rather E

DOI: 10.1021/cr500455b Chem. Rev. XXXX, XXX, XXX−XXX

Chemical Reviews

Review

molecular fractionation with conjugated caps (EFA-MFCC) method. There, embedded point charges were used to model only the effect of formally charged groups in large molecules. The 2007 paper by Li and co-workers55 extended the use of embedded charges to general molecules. Interestingly, the use of embedded charges in a many-body expansion for the energy of clusters57 and in the X-POL method58 also appeared in 2007. Also in 2007, Netzloff and Collins59 further developed the SMF method to apply to periodic systems, so that the energy and geometry of nonconducting crystals could be calculated at any level of ab initio theory. This was demonstrated for several SiO2 polymorphs using HF, MP2, and CCSD(T) ab initio methods. Also in 2007, Collins60 showed that potential energy surfaces (PESs) could be constructed from the molecular fragments and combined to give a PES for the whole molecule. This approach was used to describe (using classical trajectories) the simple chemical reactions between a hydrogen atom and npentane, using PESs for H + CH3CH3 and H + CH4. The applications of energy-based fragmentation methods to the chemistry of large molecules also grew rapidly in number in the period from 2003 to 2007. Zhang and co-workers concentrated on evaluating the optimum geometry and binding energy of substrate−protein and biomolecule−ligand systems.61−73 In an echo of the earlier MTA and divide and conquer work, Zhang and co-workers used MFCC to estimate the electron density (and density matrix) for proteins to both estimate the total electronic energy of the protein at HF and DFT levels and evaluate the MESP of the protein.74−77 This electrostatic potential was then used to estimate the solvation energy of the protein and the substrate−protein binding energy.78 Bettens and Lee also applied their fragmentation approach to the calculation of enzyme−substrate interaction energies.79 In a new development, they demonstrated the use of their fragmentation approach to the estimation of NMR chemical shifts in large organic and biological molecules.80 This was one of the first practical uses of the fact that most molecular properties can be expressed in terms of derivatives of the electronic energy (albeit the energy when the molecule is in the presence of some external field). Hence, the fragmentation expression for the total electronic energy leads directly to expressions for most molecular properties. The application of fragmentation to the calculation of NMR spectra has continued to attract attention.81−83 During this period, the Kernel Energy Method was further explored84 and applied to a very large protein85 and to tRNA.86 In contrast to the MFCC, SMF, GEBF, and MTA methods, where molecules were broken into fragments of a few to a few tens of atoms, KEM was initially focused on breaking very large systems into not-quite-so-large fragments (hundreds of atoms). However, this emphasis became directed toward slightly smaller fragment sizes, when (in 2008) the KEM formalism was generalized to include the third- and fourth-order interactions of kernels (rather than just pairs).87 The interaction of proteins in solution with substrate molecules is thought to be dominated by electrostatic interactions (at least until the species are chemically bound). Not surprisingly then, from 2008 on, the recently developed energy-based fragmentation methods were applied to the calculation of the charge distribution in proteins in solution. Zhang and co-workers used a modified MFCC approach75 in an iterative scheme,88 as follows. The electron density of the MFCC fragments is obtained from ab initio calculations in vacuo. These fragment densities are combined to yield the

breaking bonds, guided by the bonded connectivity of the molecule: Atoms will reside in a common fragment because they are “close” in terms of steps along bonds. A hierarchy of approximations is obtained by increasing the “distance” in connectivity between bonds that are broken. The interaction of more distantly separated parts of a molecule was accounted for, either by ab initio calculations or perturbatively in terms of electrostatic interactions. Li and co-workers48 applied the same type of fragmentation introduced by Zhang and co-workers to more general chain-like molecules, using generalized groups in place of peptide residues, and modified to allow direct calculation of the total electronic energy, a scheme they called “energy corrected molecular fractionation with conjugate caps” (EC-MFCC). This paper followed shortly after Li and Li51 first applied the MFCC fragmentation approach to the evaluation of the correlation energy of large molecules. The Kernel Energy Method (KEM) of Huang, Massa, and Karle31 breaks very large molecules into quite large pieces or “kernels” and, in the original paper, forms pairs of connected kernels, so that the molecular electronic energy is given by a sum of the pair energies less the energies of the kernels. Development accelerated in 2006, when Gadre and coworkers52 modified their MTA approach to electron densities to a “cardinality guided” CG-MTA fragmentation scheme to evaluate total electronic energies for molecules. This paper introduced a different approach to fragmentation, based mainly on distance in space rather than on bonded connectivity. Spheres are constructed, centered on about one-half of the groups in the molecule. All of the atoms within a sphere belong to a primary fragment. Some additional atoms might be added because they are strongly connected (say by multiple bonds) to atoms within the sphere. Two spheres may overlap (an intersection of the two sets of atoms), and the overlapping atoms form a new (sub)fragment. One such overlap may intersect with another overlap, giving the intersection of these two intersections as a new subsub-fragment, and so on. Hydrogen caps are used as usual. The total energy is then given by an “inclusion and exclusion principle” from the theory of sets as a sum and difference of the energies of these fragments and sub fragments. Also in 2006, Zhang and co-workers53 extended their MFCC approach to the direct calculation of total electronic energies of proteins. Moreover, a new fragmentation approach was introduced by Bettens and Lee54 for ab initio quantum calculation of the total electronic energy. While based on the SMF method, it introduced an isodesmic approach to the fragmentation process. Methodology development continued apace in 2007. Li and co-workers introduced a “generalized energy-based fragmentation (GEBF) approach, which constructed fragments using both bonded connectivity and a distance based criterion, as in CG-MTA. Moreover, they included the use of “embedded charges”.55 The earlier EC-MFCC method did not account for the electrostatic interaction of molecular fragments, which are well-separated in distance. To include these important electrostatic interactions, and to account for polarization of the fragments and the associated induction energy, Li and coworkers carried out all fragment ab initio calculations in the presence of point charges sited at all of the atoms that are not contained in the fragment. A correction to remove any “double counting” of long-range electrostatic interactions must be included in this approach. This general application of embedded charges followed the work, in 2006, of Jiang, Ma, and Jiang56 who first developed an electrostatic field-adapted F

DOI: 10.1021/cr500455b Chem. Rev. XXXX, XXX, XXX−XXX

Chemical Reviews

Review

mechanically and more distant interactions using perturbation theory (electrostatics, induction, and dispersion). More accurate treatments of these long-range interactions were developed.125−127 Similar perturbative treatments of long-range fragment-fragment interactions had been developed by Bettens and Lee.79 Such long-range interactions were ignored in the original approaches of the Gadre, Li, and Zhang groups, until the most important electrostatic interactions were introduced by the use of embedded charges in the ab initio calculations of the fragment energies.55,128 The formulation of SMF for nonconducting crystals was also extended to account for longrange interactions (including dispersion).129 Optimized geometries were evaluated, and the phonon (vibrational) frequency bands were evaluated. While the earlier methods have undergone a period of further refinement,128,130−132 a number of groups have produced new energy-based molecular fragmentation schemes.133−138 The SMF method was extended to describe the fragmentation of crystal surfaces.139 Bettens and co-workers developed a new algorithm called the “Combined Fragmentation Method” (CFM);133 Collins produced the Systematic Molecular Fragmentation by Annihilation (SMFA) method;130 Zhang and co-workers developed the “electrostatically embedded generalized molecular fractionation with conjugate caps” (EEGMFCC) method; and Li and co-workers produced an improved GEBF scheme.132 Mayhall and Raghavachari developed fragmentation approaches termed the molecules-inmolecules (MIM)135 and many-overlapping-body (MOB) expansion methods.136 Richard and Herbert presented a generalized many body expansion (GMBE).137,138 All of these approaches demonstrate a remarkable ability to accurately estimate the total electronic energy of molecules, clusters, and solids. Most (perhaps with slight modifications) calculate the molecular electronic energy in a time that increases only linearly with the size of the molecule or cluster. Moreover, the computation is trivially parallelizable, as each fragment energy can be evaluated independently on separate processors. Hence, the real time required can be made independent of the size of the molecule, only dependent on the number of processors available. These approaches vary in method, in the generality of their applicability, and to what extent they can be applied in an automated (programmable) fashion. The differences in these approaches result in different compromises between accuracy and computational efficiency.

electron density for the protein. This charge distribution is represented by point charges on the atoms. The protein is surrounded by a polarizable continuum (with a dielectric constant of 80), separated from the protein by the van der Waals radii of the protein’s atoms. The Poisson−Boltzmann equation is solved in the presence of the atomic charges to give the charge distribution of this continuum solvent. This charge distribution is represented by discrete point charges on the solvent surface surrounding the protein (which describes a “reaction field”). The ab initio calculations of the MFCC fragments are then repeated in the presence of these “solvent charges” to give new charges on the protein’s atoms. The Poisson−Boltzmann equation is solved in the presence of these new charges, and so the process is iterated to convergence. This scheme provides both a set of “polarized protein-specific charges” (PPC) for use in molecular dynamics (MD) simulations of these proteins and an estimate of the free energy of solvation for the protein. This approach was applied to a number of studies of binding and molecular dynamics in biological systems.89−106 Bettens and co-workers also developed distributed multipole107,108 descriptions of charge distributions in conjunction with their fragmentation algorithm to evaluate the MESP for large proteins and the influence of this MESP on substrate binding.109,110 A similar method for estimating binding energies using the fragment-derived MESP and substrate multipole moments has also been developed.111 As noted above, the fragmentation expression for the molecular electronic energy can be differentiated with respect to parameters in the molecular Hamiltonian to obtain formulas for molecular properties. The 2005 paper on SMF 47 demonstrated that equilibrium geometries and accurate vibrational frequencies (using the first and second derivatives of the energy with respect to the atomic coordinates) could be obtained using fragmentation in some simple cases. In 2008, Li and co-workers demonstrated that the GEBF method produced accurate equilibrium geometries, optimized using the fragmentation expression for the energy gradients, and accurate vibrational frequencies from the second derivatives, for a range of larger molecules.112 Molecular clusters appear to be natural candidates for accurate treatment by fragmentation, because no covalent bonds are broken during the fragmentation process. Thus, from 2008 to the present, Gadre and co-workers have applied their MTA method to clusters of lithium dimers,113 CO2,114,115 benzene,116 acetylene,117,118 and water.119 Construction of such clusters and geometry optimization are facilitated by the fragmentation approach.120,121 A fragmentation approach to the description of water clusters, in particular, has been employed by a number of authors.122−124 Molecules pose a more difficult test for fragmentation methods because breaking covalent bonds introduces large perturbations that must be accounted for, and creates an interface between fragments and their environment that must also be accounted for. Methodology development has therefore continued on a number of fronts over the past few years to improve the accuracy of the methods and to ensure the methods are reliable over the range of chemical systems. All of the fragmentation methods primarily account for the interactions between atoms, which are relatively “close”, in terms of either bonded connectivity or distance in space. Interactions at greater distances are very much smaller, but not necessarily negligible. The SMF method treats the nearer of these “nonbonded” or “through-space” interactions quantum

3. METHODS AND PRINCIPLES In this section, we summarize the current “state-of-the-art” of each of the major methods that have been developed in energybased molecular fragmentation. The applications of these approaches are presented in the next section. 3.1. Electrostatically Embedded Generalized Molecular Fractionation with Conjugate Caps (EE-GMFCC)

3.1.1. Fragmentation Algorithm. This method is automated, but applies only to proteins and polypeptides. The procedure is as follows:128 (i) Each atom is assigned to a single amino acid residue. Denoting the set of side chains as {Ri}, each amino acid residue, Ai, is given by G

DOI: 10.1021/cr500455b Chem. Rev. XXXX, XXX, XXX−XXX

Chemical Reviews

Review

peptide with the same adjoining peptides as in the original molecule. Aside, where disulfide bonds exist, they are broken and capped with SCH3 groups, in a similar manner.64,67 (v) The energy of this sum of fragments is given by N−1

The molecule is assumed to have the primary structure of a simple chain:

EMFCC =



N−2

E(Capi*− 1 Ai Capi + 1 ) −

∑ E(Capi*Capi+ 1 )

i=2

i=2

(14)

where E denotes the quantum electronic energy of a molecule, and N is the number of peptide residues in the chain. (vi) Equation 14 only accounts for the energy of the individual residues and their interactions with neighboring partial residues. The interaction between more distantly connected residues is accounted for in two ways. If the closest atom−atom distance between a pair of residues, i and j, is less than a specified value, λ, the interaction energy, Eij, is calculated quantum mechanically:

(ii) The fractionation process starts by breaking the C----N bond in all amide groups, to get a sum of individual peptide units:

Eij = E(Ai Aj ) − E(Ai ) − E(Aj )

(iii) Capi is defined as

(15)

where the single peptides of eq 7 are capped with hydrogen atoms. If the distance between a pair of residues is greater than λ, the interaction is described using embedded charges, which also accounts for the mutual polarization of all parts of the protein. All electronic structure calculations are carried out in the presence of embedded charges. The expression for the energy in this Electrostatically Embedded Generalized Molecular Fractionation with Conjugate Caps (EE-GMFCC) approach is128

Conjugate cap, Capi*, is defined as

These caps and conjugate caps are themselves terminated with hydrogen atoms. Now, “cap” each amino acid residue Ai in eq 9 to the left with conjugate cap Capi−1 * , and to the right with cap Capi+1 to get a sum of fragments of the form Cap*i−1AiCapi+1:

N−1

E EE‐GMFCC =

∑ E(̃ Capi*− 1 AiCapi+ 1 ) i=2

N−2



N−3

N

∑ E(̃ Capi*Capi+ 1 ) + ∑ ∑ i=2

i=1

(Eij̃ − Eĩ − Ej̃ )

j=i+3

|R i − R j|≤ λ

⎧ ⎪ N−2 ⎪ − ⎨∑ ⎪ i=2 ⎪ ⎩

(iv) Subtract the joined caps, Capi*Capi+1, from eq 12 to get the final molecular fractionation with conjugate caps representation of the polypeptide:

∑∑ m ∈ k i n ∈ li

N−3



qm(k )qn(l ) i

N

∑∑ ∑ ∑ i = 1 m ′∈ i

i

R m(ki)n(li)

j=i+3 |R i − R j|≤ λ

n ′∈ j

⎫ qm ′ (i)qn ′ (j) ⎪ ⎪ ⎬ R m ′ (i)n ′ (j) ⎪ ⎪ ⎭

(16)

Here, Ẽ denotes the energy for a quantum electronic structure calculation carried out in the presence of point charges {q} at the positions of all atomic nuclei, which are not explicitly included in the quantum calculation. In eq 16, |Ri − Rj| denotes the distance between residues i and j. The third term in eq 16 describes the quantum interaction energy for pairs of residues that have sufficiently close inter-residue atom−atom distances, but which are not included in the MFCC energy of eq 14. When two residues are sufficiently far apart, their electrostatic interaction is accounted for by the fact that the quantum energy of each residue is calculated (a net once) in the presence of the charge distribution (as described by point charges) of the other. Unfortunately, this means that some electrostatic interactions

This sum of fragments preserves all of the atoms in the original molecule. Each of the larger tripeptide fragments contains one H

DOI: 10.1021/cr500455b Chem. Rev. XXXX, XXX, XXX−XXX

Chemical Reviews

Review

are counted twice in the first three terms in eq 16: residue k in the presence of charges for residue l, and residue l in the presence of charges for residue k. This “double counting” is approximately removed by the last terms in eq 16 using point charges. There, fragment ki = (H)Cap*i−1 − NH [where (H) denotes that the added hydrogen atom on the left is excluded]; fragment li = CO − Ai+2Ai+3...AN; qm(ki) represents the point charge of the mth atom in fragment ki; and qm′(i) represents the point charge of the m′th atom in residue i. The use of point charges ensures that some account of the total molecular charge distribution is present in every electronic structure calculation, so that induction is treated as a many-body effect. There are two important details: the value of the distance parameter λ, and the choice of point charges. Tests of several values of λ were carried out for a number of different protein structures. Very large values of λ are inefficient, because the number of pairwise quantum interactions increases rapidly with increasing λ, without significantly increasing the accuracy. It was found that λ = 4.0 Å gave a good balance between accuracy and efficiency. Two sets of point charges have been tested in this approach. Polarized protein-specific charges (PPCs) are obtained via a complicated iterative procedure for each protein studied.88,128 The Amber94140,141 charges were also tested and found to give slightly more accurate energies in eq 16, without the computational cost of PPCs. 3.1.2. Gradients and Other Derivatives. Formulas for derivatives of the EE-GMFCC energy have not been reported. Geometry optimization does not appear to have been carried out using the EE-GMFCC approach. However, energy derivatives and geometry optimization have been reported for the earlier MFCC approach,63 and there does not seem to be any reason why derivatives of eq 16 could not be evaluated using approaches similar to those implemented in GEBF or SMFA (see below). 3.1.3. Accuracy. A brief summary of the accuracy of EEGMFCC is as follows: 18 protein structures containing between 242 and 1142 atoms gave a mean absolute error in the energy of 10.0 kJ mol−1 at HF/6-31G*; six protein structures containing between 218 and 608 atoms gave a mean absolute error in the energy of 22.6 kJ mol−1 at B3LYP/631G*, and nine proteins containing between 89 and 174 atoms gave a mean absolute error in the energy of 16.3 kJ mol−1 at MP2/6-31G*. Significantly, the relative energies of 18 conformers of a protein containing 396 atoms showed a mean absolute error of only 8.8 kJ mol−1 at HF/6-31G*.

Figure 1. A two-dimensional representation of the spherical volume used to define primitive subsystems in the GEBF method.

The idea is to preserve the bonding environment of all groups in the primitive subsystem, including “terminal” groups near the surface of the sphere. (2a) If two groups are separated by less than 2ξ but do not share a common subsystem, then an additional primitive subsystem is created centered at the midpoint between these two groups with radius ξ.132 (3) Some subsystems are identical; all but one copy are discarded. Subsystems that are subsets of other subsystems are also discarded. (4) The authors sum the remaining primitive subsystems with coefficients of +1. However, this sum cannot equal the original molecule or “system” as most groups occur in more than one subsystem. These “overlaps” of the subsystems must be removed. This is achieved by constructing “derivative subsystems”. Let η represent the number of groups in the largest primitive subsystem. Then, find any set of (η − 1) groups that occurs in more than one subsystem; say it occurs k times. Add this set as a “derivative subsystem” to the sum of subsystems with coefficient (1 − k). Repeat this for all sets of (η − 1) groups that occur in more than one primitive subsystem. Now find any set of (η − 2) groups that occurs in more than one subsystem (including previously defined derivative subsystems); say it occurs j times. Add this set as a “derivative subsystem” to the sum of subsystems with coefficient (1 − j). Continue this process down to single groups that are contained more than once in the list of primitive and derivative subsystems. The list of subsystems, {Sm}, and associated coefficients,144 is now complete. Finally, hydrogen atom caps are added to restore the valence of all groups as necessary (using bond lengths fixed at given values for CH, NH, and OH bonds). The whole molecule or system, S, is represented by the sum of subsystems (Sm) with integer coefficients (Cm):

3.2. Generalized Energy-Based Fragmentation (GEBF)

This method applies to any molecule. An automated fragmentation procedure was presented by Hua, Hua, and Li in 2010,142 and modified slightly in 2013132 and 2014.143 3.2.1. Fragmentation Algorithm. (1) The basic components of molecules are ordinary functional groups, which are called “fragments” by Li and co-workers. The molecule or cluster is often referred to as a “system”. (2) “Primitive subsystems” of the molecule or cluster are sets of spatially close groups that are formed as follows. A sphere of radius ξ is constructed around each group in the molecule (as illustrated below). All groups within the sphere belong to the primitive subsystem of the central group. In addition, some groups that are outside the sphere, but bonded to groups within the sphere, are added to this primitive subsystem, according to given “extension rules” (see Figure 1 and the adjacent text of ref 142).

M

S=

∑ CmSm m=1

(17)

and the total electronic energy E (including the internuclear repulsion energy) of the system is approximated by M

E=

∑ CmE(Sm) m=1

(18)

The only parameter in the GEBF approach is the radius, ξ, which can be varied to establish that the energy converges with I

DOI: 10.1021/cr500455b Chem. Rev. XXXX, XXX, XXX−XXX

Chemical Reviews

Review

increasing values of ξ. For large enough values of ξ, eq 18 is quite accurate for systems that do not contain formally charged groups or very many highly polar groups. However, when such polarizing groups are present, they are taken into account using point charges. (5) A point charge for each atom is generated as follows. A HF calculation is carried out for each primitive subsystem generated in point (2) above [but not those in (2a)]. Natural atomic charges145,146 are determined for the atoms in the central group of each subsystem. These charges can be refined by repeating the HF calculations using these “embedded” point charges at the positions of all atoms that are not contained in the subsystem, a process that can be iterated to convergence. Normally, a single iteration is taken to be sufficient.142 (6) The ab initio calculation of each subsystem in point (4) is carried out in the presence of the point charges at the positions of all atoms that are not contained in the subsystem. The electronic energy of the whole system is then approximated by M

E=

M

∑ CmE(̃ Sm) − [ ∑ Cm − 1] ∑ ∑ m=1

m=1

A

with analogous formulas for other properties. 3.2.3. Accuracy. A brief summary of the accuracy of GEBF is as follows. For five miscellaneous organic/biological molecules containing between 106 and 199 atoms, the mean absolute error in the energy [including step (2a) above, with ξ = 3.5 Å] was 7.2 kJ mol−1.132 For four proteins and a segment of DNA (molecules containing between 130 and 638 atoms), the mean absolute error in the energy [excluding step (2a) above, with ξ = 3.5 Å] was 37.6 kJ mol−1. For 10 conformers for each of two proteins, the mean absolute error in the relative energies [excluding step (2a) above, with ξ = 3.5 Å] was 8.2 kJ mol−1. For 10 configurations of a hydrogelator molecule (106 atoms), the rms deviation of the HF/6-31G(d) forces from the GEBF values was 0.0005 au/bohr using eq 20 and 0.0006 au/ bohr using eq 21.143 3.3. Molecular Tailoring Approach (MTA)

This method applies to any molecule. The current fragmentation algorithm is sometimes used in an automated fashion or sometimes “manually”.52,147−149 3.3.1. Fragmentation Algorithm. The molecule is broken into “fragments” based on spherical regions. (1) A sphere of radius Rg (called “R-goodness”) is constructed with its center at each of a selection of “heavy” (non-hydrogen) atoms in the molecule. There is no reported prescription for selecting the fragment centers. All atoms within the sphere belong to a “fragment”. If atom A is in fragment i, construct a sphere centered on A that has as large a radius as possible so long as the sphere only contains atoms in fragment (i) i. Call this radius R(i) g (A). The maximum of Rg (A) over all fragments that contain atom A is taken as the “R-goodness”, Rg(A), for atom A. Rg(A) is a measure of the extent of the molecular environment that is “described” around A. The minimum value of Rg(A) over all atoms is taken as the “Rgoodness” value, Rg, for the fragmentation scheme. The value of Rg is chosen to ensure that the maximum number of atoms in any fragment, Nmax, is less than some chosen value. Changing the centers of the original spheres will lead to different Rg values for a fixed value of Nmax. A fragmentation scheme that has a larger value of Rg, with a comparatively small value of Nmax, is considered to be “better” than alternative schemes. (2) Neighboring fragments are merged if the consequent merged fragment has less than the allowed maximum number of atoms. This results in a set of fragments, {Fi}. These fragments “overlap” like the cloth cut to make a garment, hence the term “molecular tailoring approach”. (3) Hydrogen atom “caps” are used to restore the valence at atoms where bonds have been broken, with standard (fixed) bond distances to the connected “real” atom. (4) The term “cardinality-guided” used in this method52 comes from the cardinality of a set, that is, the number of elements in a set. Use is made of the “Principle of Inclusion and Exclusion” in set theory. If we have a collection of sets {Fi}, and F denotes the union of these sets, then the cardinality of F, |F|, is given by150

Q AQ B RAB

B>A

(19)

Here, Ẽ denotes the electronic energy calculated in the presence of the “embedded charges”, QA and QB denote point charges on atoms A and B, and RAB is the distance separating these charges. Note that the mutual Coulomb interaction of the point charges is included in Ẽ . The second term in eq 19 corrects for the multiple counting of electrostatic interactions between atoms in the electronic structure calculation and the embedded point charges,55 in a manner similar to that later employed in EE-GMFCC. 3.2.2. Gradients and Other Derivatives. The gradient of the energy in eq 19 is given by143 ∂E = ∂Xα(A)

M

⎡ ∂E(̃ S ) m

∑ Cm⎢ m=1

⎢⎣ ∂Xα(A)

⎡ M ⎤ + ⎢ ∑ Cm − 1⎥ ∑ f AB ⎢⎣ m = 1 ⎥⎦ B



− Fm , A Q A −

∑ f AB⎥ B

⎥⎦

(20)

where Xα(A) is a component of the Cartesian coordinates for atom A. The gradient of the subsystem energy, ((∂Ẽ (Sm))/ (∂Xα(A))), is only evaluated for the “real” atoms (not the point charges) that are contained in the subsystem, Sm; Fm,A is the electric field at the position of atom A due to subsystem, Sm; and fAB is the force on atom A due to the point charges on A and on atom B. Neglecting the (arguably small142) forces on the point charges gives143 ∂E ≈ ∂Xα(A)

M

∑ Cm m=1

∂E(̃ Sm) ∂Xα(A)

(21)

Most molecular properties can be expressed as derivatives of the molecular energy, albeit in the presence of an external field. For example, for a molecule in an external electric field, the molecular dipole moment is the derivative of the energy with respect to the field (in the limit of zero field). The molecular dipole is given by GEBF as

|F | =

i

∑ Cmμ(Sm) m=1

ij K

+( −1)

M

μ=

∑ |Fi| − ∑ |Fi ∩ Fj + ... ∑

|Fi ∩ Fj... ∩ Fk| + ...

ij...k K fragments

(22) J

(23) DOI: 10.1021/cr500455b Chem. Rev. XXXX, XXX, XXX−XXX

Chemical Reviews

Review

presents favorable comparisons between MTA and fragment molecular orbital (FMO) theory in terms of both accuracy and computational efficiency. Some examples of the convergence, with Rg, of the Cartesian energy gradients have been presented.52 MTA values for vibrational frequencies on CO2 clusters appear to agree well with experimental values.147 3.3.4. Associated Schemes. The MTA has been used in recent times in conjunction with basis set correction schemes to estimate only the correlation energy of a molecule or cluster.119,147,149 This entails calculation of (say) the MP2 or CCSD(T) energy of the whole molecule or cluster with a small basis set, with only MTA fragment calculations at the MP2 or CCSD(T) level with large basis sets. The MTA has been used since its inception to estimate the charge distribution and electrostatic potential of molecules. In molecular clusters, the electrostatic interactions between the monomers are a significant component of the total binding energy of the cluster. MTA estimates of the electrostatic component of the binding energy have been used to “build” possible low-energy configurations of clusters for use in subsequent ab initio exploration of the cluster geometry.121

The intersections of all pairs of fragments (viewed as sets of atoms), Fi ∩ Fj, are calculated. All intersections of three fragments, Fi ∩ Fj ∩ Fk, are calculated, and so on. The number of atoms (and in fact each individual atom) is the same in the molecule as in the following combination of sets:

∑ Fi − ∑ Fi ∩ Fj + ...+(−1)K i



Fi ∩ Fj... ∩ Fk

ij...k

ij

K fragments

(24)

+ ...

The hydrogen atom caps formally cancel in this finite series. The energy of the molecule or cluster is given by the set inclusion−exclusion principle as52,148 EMTA =

∑ E(Fi) − ∑ E(Fi ∩ Fj) + ... i K

+( −1)

ij



E(Fi ∩ Fj... ∩ Fk) + ...

ij...k K fragments

(25)

Each fragment or intersection of fragments is a molecule (including H atom caps), the energy of which can be calculated using any level of ab initio theory. The MTA does not employ background charges. Most applications of MTA do not account for the interaction between parts of a molecule that do not share a common fragment. In the case of a zwitterion, an additional two-body interaction of the formally charged groups was added to the total energy.52 In other cases, all formal charges were contained in each fragment,147 or all fragments carry the total charge of the molecule.113 The fragments in MTA are generally much larger than in some other fragmentation schemes, so that atoms that never share a common fragment are usually well-separated in space. The fragmentation scheme can be varied, by varying Nmax and Rg, and as might be expected, larger values of Rg and Nmax generally give more accurate energies and other molecular properties. 3.3.2. Gradients and Other Derivatives. The ab initio gradients and other derivatives of the molecular energy are given by the combination of ab initio fragment derivatives corresponding to eq 25. Any residual gradients on hydrogen atom caps are neglected. Geometry optimization can be carried out.52,119 As noted previously, molecular properties, P, that can be expressed as a derivative of the energy, can also be obtained from MTA by PMTA =

3.4. Systematic Molecular Fragmentation (SMF)

The current elaboration of this approach is denoted as systematic molecular fragmentation by annihilation (SMFA). This method applies to any molecule. The automated fragmentation procedure for molecules was presented in refs 130 and 131. Related algorithms have been reported for crystals59,129 and crystal surfaces.139 3.4.1. Fragmentation Algorithm. (1) Bonds and functional groups are defined, consistent with normal chemical concepts; atoms connected by multiple bonds reside in the same group.152 A molecule is viewed as a collection of groups connected by single bonds. (2) Beginning with some arbitrary group (call it group A): (i) Remove group A from the molecule; (ii) leave A untouched, but remove all groups separated from A by more than a specified number of bonds (call it level bonds); and (iii) remove A and all the groups in (ii) from the molecule. (iv) The molecule is represented by the sum of the fragments produced by steps (i) and (ii), minus the fragments produced by step (iii). (v) Taking each fragment in (iv), repeat steps (i) to (iii), and continue with every new fragment produced until there is no fragment that has groups separated by more than level bonds. This produces a set of fragments {Fn} (Nfrag in number) with integer coefficients {f n}. Hydrogen atom caps are appended to restore the valence for atoms with dangling bonds, with bond lengths determined by the covalent radii of the atoms involved.59 Thus, the molecule, M, is represented by a sum of fragments, and the “bonded” energy, Eb, is a sum of the energies of the fragments.

∑ P(Fi) − ∑ P(Fi ∩ Fj) + ... i K

+( −1)

ij



P(Fi ∩ Fj... ∩ Fk) + ...

ij...k K fragments

(26)

Nfrag

M→

3.3.3. Accuracy. For a set of 10 molecules, including peptides and water clusters, ranging in size from 48 to 304 atoms, MTA gives a mean absolute deviation (MAD) from the HF/6-31G* energy of 0.9 kJ mol−1, and a MAD of 1.3 kJ mol−1 from the MP2/6-31G* value.151 For compact molecules in this set, like α-helical peptides and water clusters, the MTA fragments appear to be sufficiently large that the total calculation time for MP2/6-31G* is comparable for MTA and the whole molecule. For more extended structures, MTA is significantly faster than the full calculation. Reference 151

∑ fn Fn n=1

(27)

Nfrag

Eb =

∑ fn E(Fn) n=1

(28)

Each hydrogen cap in eq 27 occurs in fragments with canceling coefficients. The largest fragments in eq 27 have groups that are separated by level bonds. Thus, as the value of level is increased, the fragments increase in size, and eq 28 provides a K

DOI: 10.1021/cr500455b Chem. Rev. XXXX, XXX, XXX−XXX

Chemical Reviews

Review

Eind, and Edisp, and for contributions arising from the fact that the positions of any embedded charges depend on the position of “real” atoms. When dtol is sufficiently large, all of these approximate derivatives appear to be negligible.131 3.4.3. Accuracy. For 96 “typical” neutral organic molecules containing 18−80 atoms, the mean absolute error in the energy (MAE) for a range of ab initio methods and basis sets was 2.1− 4.4 kJ mol−1, for level = 3, and dtol = 1.1.127 For larger molecules,131 containing 45−180 atoms, 24 peptides and sugars (with extensive hydrogen bonding) had an MAE of 14.9, 5.6, and 3.9 kJ mol−1, for levels 3, 4, and 5, respectively; and 14 ions and zwitterions had an MAE of 8.2, 4.5, and 2.9 kJ mol−1, for levels 3, 4, and 5, respectively. Twenty conformers of a protein containing 246 atoms (with three formally charged groups) had a mean absolute error in their relative energies of 4.6 kJ mol−1 at level 3. For these 20 conformers, the root-mean-square error in the energy gradient falls below about 0.0003 au for level = 3 and level = 4, if dtol > 2.131 3.4.4. Crystals and Crystal Surfaces. The structure and energy of a crystal are determined by the positions of the atoms in the crystal unit cell, and the lattice parameters that specify the size and shape of the unit cell. Some modifications of the algorithm for molecules, to ensure lattice periodicity, provide a SMFA representation of a crystal structure. This has been applied to nonconducting covalent crystals, such as diamond, silicon, and polymorphs of SiO2. The crystal geometry can be optimized, and the energy per unit cell can be evaluated at any level of ab initio theory.59 Phonon frequencies and free energies can be estimated.129 The high symmetry of crystal structures ensures that many (often most) of the molecular fragments that constitute the unit cell are simply translations and rotations of equivalent fragments. Hence, the energy (and other properties) of such crystals can be evaluated from ab initio calculations of a very few fragment molecules. The surface of cleaved crystals is important for catalysis. Another modification of the SMFA algorithm provides a fragmentation description for the total bulk volume and surface for a cleaved crystal.139

more reliable estimate of the energy, as the bonding environment of each group is better described. (3) However, eq 28 neglects the interactions of more widely separated groups. These “nonbonded” interactions are accounted for in three ways. First, a level = 1 fragmentation is evaluated (comprising N(1) frag fragments), and the nonbonded interactions, Enb, are given by the interactions of these level = 1 fragments: Enb =

(1) Nfrag −1

(1) Nfrag





f n(1) f n(1) E[Fn(1) ↔ Fn(1) ] 1 2 allowed

n1= 1 n2 = n1+ 1

1

2

(29)

where ↔ F(1) n2 ]allowed denotes the interaction energy of (1) level = 1 fragments, F(1) n1 and Fn2 (with associated integer (1) (1) coefficients f n1 and f n2 ) that do not include group interactions E[F(1) n1

that have already been accounted for in the bonded energy of eq 28. These fragment−fragment interaction energies are evaluated ab initio if the fragment−fragment distance is less than some specified tolerance: The shortest atom−atom distance divided by the corresponding sum of van der Waals radii (denoted ∥Rn1 − Rn2∥ for fragments n1 and n2) must be less than a value denoted by dtol. For longer separations, the interaction energy is evaluated using perturbation theory as the sum of an electrostatic interaction, Eele (evaluated using distributed multipole electrostatic moments), an induction energy, Eind (evaluated using the ab initio static polarizability of each group), and (except for HF molecular energies) a dispersion interaction, Edisp (evaluated using the ab initio imaginary frequency polarizability of each group).127 If formally charged groups are present in the molecule, then all ab initio calculations are carried out in the presence of embedded point charges at the positions of the charged groups. As in GEBF, natural atomic charges are employed. However, the charges are simply evaluated from the electron density for the individual charged groups. Similar to the GEBF method, the use of embedded charges “double counts” the electrostatic interaction between two formally charged groups (represented by embedded charges), which are separated by more than dtol. The electrostatic correction for this “double counting” is included in Eele. Finally, the total energy of the molecule is approximated by Nfrag

E=

(1) Nfrag −1

∑ fn E(̃ Fn) + ∑ n=1

n1= 1

3.5. Combined Fragmentation Method (CFM)

The CFM133 is significantly simpler to apply than its 2006 precursor.54 Very recently, both CFM and SMF were reviewed,152 so the reader is directed there for additional details. CFM is applicable to both valence-bonded systems and clusters or any combination of the two, and can be described in two equivalent ways. Both descriptions begin with dividing the system up into disjoint sets of adjacent connected atoms, denoted as groups. The grouping method of CFM133,152 was developed to more accurately deal with nonbonded interaction energies between fragment molecules, and to better cancel capping hydrogen interactions, than the approach described in the 2006 precursor method. Additionally, the new grouping method eliminates complications that arise in valence-bonded systems due to so-called “capping growth”. This type of complication arises when bonds that are directed toward a common center are severed to create fragment molecules. 3.5.1. Fragmentation Algorithm. Given a Lewis structure for a valence bonded system, or a three-dimensional structure of a cluster, CFM begins by first assigning the connectivity of atoms within the system. A connection is specified with three parameters, the identities of the pair of atoms involved in the connection and a flag indicating whether the connection is a valence bond (and thus requiring a capping hydrogen if

(1) Nfrag

∑ n2 = n1+ 1

f n(1) f n(1) 1

2

|| R n2 − R n1 ||< d tol

{E[̃ Fn(1) Fn(1) ] − E[̃ Fn(1) ] − E[̃ Fn(1) ]}allowed + Eele + E ind 1 2 1 2 + Edisp

(30)

Here, Ẽ denotes that the ab initio quantum energy calculation takes place in the presence of embedded point charges. There are two parameters in SMFA: the value of level and the value of dtol. 3.4.2. Gradients and Other Derivatives. The first and second derivatives of the energy in eq 30 have been evaluated to allow geometry optimization and vibrational frequency calculations.47,59,129,131 Derivatives for the first two terms in eq 30 are obtained from the ab initio quantum chemistry programs employed. These derivatives account for the contributions from the hydrogen atom caps (even though these are small in magnitude).47,59 Approximate derivatives are evaluated for Eele, L

DOI: 10.1021/cr500455b Chem. Rev. XXXX, XXX, XXX−XXX

Chemical Reviews

Review

correct final CFM energy expression, standard graph-theory techniques can be used to identify these cycles, at which point, a many-body analysis can be performed to check that, for each tri- and quadricycle, the associated three- and four-body terms appear once, and once only, in ECFM. If not, then the formula is adjusted to ensure that it is so. Thus, some care must be taken when using approach 1 in evaluating ECFM to ensure accuracy. Alternately, one could either adopt the approach discussed next to obtain ECFM, or simply supply the primary precursory fragments to the GMBE method (reviewed below) and truncate the expansion at second order to obtain the correct energy expression. If approach 1 is followed in generating ECFM, then after correcting for any tricycles and quadricycles in the system of interest, we finally arrive at

severed) or not. For valence bonded systems, the connectivity between atoms is identical to the bonding given in the Lewis structure. For clusters, the user may specify the connectivity or simply assign connections between monomers in the cluster based on distance. Having established the atoms and connections between them, the algorithm then assigns disjoint sets of atoms to groups. These groups are typically twice as large as the groups defined in SMF. The interested reader can consult refs 152 and 133 for details. Having assigned the groups, fragments are then built up from them in one of two alternative, but exactly equivalent, procedures. “Equivalent” here means that the final fragmentation energy expression, ECFM, is the same via either method. In the first approach, after having generated the set of n groups, S1 = {g1, g2, g3, ... gn}, in the system of interest, a set, S2, is constructed by drawing all possible pairs of connected groups, or dimers, from the set S1. The number of times the group gi appears in S2 is di, where di is the number of connections gi makes to other groups. A crude so-called precursory energy expression can now be obtained for the energy of the whole molecule, EI, in terms of the ab initio fragment energies, {Ei}: EI =

ECFM =

where the ci are simple integers that may be positive or negative, and Ei are the energies of fragments that may be anywhere between 1−4 groups in size. In CFM no fragment is larger than four groups in size, regardless of the size of the system being studied. This means that for water clusters, using single water molecules as groups, no fragment molecule will be larger than a tetramer. In the second approach, the exact same expression as eq 35 can be trivially obtained from the many-body expansion. If we recognize each group, gi, as a “body”, then ECFM is given by the sum of all one-body energies, all two-body interactions, all three-body interactions except those three-body interactions involving disconnected groups, and only four-body interactions involving disjoint pairs of connected groups. Thus

(31)

i ∈ S1

Note that the fragments are capped with hydrogen atoms as in the original SMF method.47 This energy expression is identical to the level 1 energy expression of SMF, given the above CFM definition of groups. Equation 31 can be more simply expressed as EI =



c′i Ei (32)

i ∈ S1∪ S2

ECFM =

where c′i is 1 when i ∈ S2 or −(di − 1) when i ∈ S1. However, the precursory energy, EI, is too crude to be used as an estimate of the total energy of the molecule because it ignores many potentially significant interactions. The final energy in CFM is obtained by interacting all of the elements in the set {S1 ∪ S2} with each other. This approach is similar to that in Addicoat and Collins (2009),127 but in CFM larger fragments (up to twice as large) are generated. CFM defines a general form of the interaction energy: εij = Ei ∪ j − Ei − Ej + Ei ∩ j

∑ i ∈ S1∪ S2

c′i Ei +



(i < j) ∈ S1

εij +



εijk

(i < j < k) ∈ S1: {ij , ik , jk} ∩ S2 ≠ 0

+



εijkl

(i < j < k < l) ∈ S1: {ij , kl} ⊂ S2|{ik , jl} ⊂ S2

(36)

Here, S1 is the set of all groups, S2 is the set of all pairs of connected groups, and εij, εijk, and εijkl are the two-, three-, and four-body interaction energies, respectively. Applying this expression to obtain the final ECFM in terms of energies of fragment molecules involves considerable cancellation. For example, the five group molecule, g1-g2-g3-g4-g5 denoted as 12345, produces the following energy expression: ECFM = E1234 + E2345 − E234 + E1245 − E124 − E245 + E24. It should be noted that in this expression all capping hydrogen interactions, that is, cap−fragment and cap−cap interactions, approximately cancel. Charge embedding was first implemented in the precursor to CFM in 2007,79 and was implemented in the same way in CFM.133 Each fragment molecule was embedded in a crude electrostatic field, generated by placing monopoles at the locations of heavy atoms not present in the fragment molecule. The monopoles were obtained from Stone’s distributed multipole analysis,153,154 using his 2005 algorithm108 and “switch = 4.0”. The monopoles were iterated to convergence by performing a “bonded” CFM calculation in the presence of the monopoles, recalculating the monopoles, then rerunning the “bonded” CFM calculation in the presence of the updated monopoles. The interested reader is directed to ref 152 for the details of how to obtain a “bonded” CFM energy expression.

(33)

(i < j) ∈ S1∪ S2

∑ Ei + ∑ i ∈ S1

to account for those cases where the dimer fragments i and j overlap, that is, if i ∩ j ≠ 0. The CFM energy expression is therefore obtained as E′CFM =

(35)

i

∑ Ei − ∑ (di − 1)Ei i ∈ S2

∑ ciEi

c′i c′ j εij (34)

There is only one complication arising from direct use of this expression to estimate , F, and that occurs when the groups form a tricycle or quadricycle. In the case of a single tricycle, the three-body interaction is canceled in eq 34, so it needs to be added back for accuracy and consistency with other interactions included in E′CFM. In the single quadricycle case, because there are two different ways to form the same quadricycle from elements in S2, the four-body interaction is double counted so the additional interaction must be removed from E′CFM. For systems with a complex morphology, for example, clusters, it can occur that many tricycles and quadricycles exist in the system, some of which could be fused. To ensure a rigorously M

DOI: 10.1021/cr500455b Chem. Rev. XXXX, XXX, XXX−XXX

Chemical Reviews

Review

3.6. Kernel Energy Method (KEM)

The process is continued until convergence, which is after only a few iterations. As an aside, test cases were run on ideal α- and 310-helices using distributed dipoles and quadrupoles in addition to distributed monopoles. No significant improvement in the accuracy of ECFM was noted. The scaling of CFM with system size is formally quadratic, because the final number of fragments is generated by interacting precursory fragments with one another. The number of precursory fragments strictly depends linearly on system size. However, if precursory fragments are far enough apart, a perturbative treatment of the precursory fragment− fragment interaction energy reproduces ab initio interaction energies virtually exactly. Absolute errors are less than 200 J mol−1 deviation at 2.0 times van der Waals separation, less than 30 J mol−1 at 2.5 times this separation, and essentially exact further away at the MP2/6-311+G(2d,p) level, as demonstrated in the implementation of this approach in the precursor to CFM.79 Such perturbative calculations take an insignificant amount of computational time and resources as compared to ab initio calculations, thus changing the scaling of CFM with system size from quadratic to linear. 3.5.2. Gradients and Other Derivatives. Any derivatives of the electronic energy of the full system can also be evaluated using CFM through differentiation of eq 35. Care must be taken to ensure that embedded charges are treated correctly. Geometry optimization has not been carried out using the CFM approach. However, properties that are derivatives of the energy have been reported. NMR shielding constants have been evaluated using CFM and its precursor,80,83 as well as molecular multipole moments and the molecular electrostatic potential of a 24 000 atom protein, neuraminidase, at the MP2/6311(+)G(2d,p) level.109 3.5.3. Accuracy. CFM was applied to 15 peptides obtained from the Protein Data Bank155 (PDB) that range in size from 72 to 193 atoms and that occupy various charge states. The RMS error in the total molecular energy for HF/6-311G(d) was 7 kJ mol−1, using iterated embedded charges. One specific case was in error by 21 kJ mol−1. Without embedded charges, this latter error was halved. Other test cases involved polypeptides of (ala)10, (ala)18 in 310 and α-helical and βstrand configurations. The maximum absolute errors in the energy were not more than 11 kJ mol−1, for (ala)18, and less than and around 2.6 kJ mol−1 for (ala)10 forms. SCRF, or implicit solvent calculations (water in this case), were also performed. The use of cavities for each fragment that mimicked the entire peptide, as well as cavities that snugly fit each fragment (default cavities), were investigated for α-, 310-helices and β-sheets for (ala)n, n = 6−18. There was noted to be some improvement in the ability of CFM to reproduce the SCRF energies using the full peptide cavities for each fragment, but the improvement was not commensurate with the additional computational expense, so traditional SCRF calculations on each fragment molecule using the default cavity were recommended. SCRF calculations were performed for 14 of the peptides whose energies in the gas phase were examined earlier. No embedded charges were used, and the RMS error was within chemical accuracy at 3.5 kJ mol−1. Of note, the density matrix for α-helix (ala)20 at the HF/6-311G(d) level was reconstructed using CFM and compared to the density matrix of the full calculation. Only small differences (