8126
J. Phys. Chem. A 2010, 114, 8126–8134
An Efficient Implementation of the Generalized Energy-Based Fragmentation Approach for General Large Molecules Shugui Hua,† Weijie Hua,†,‡ and Shuhua Li*,† School of Chemistry and Chemical Engineering, Key Laboratory of Mesoscopic Chemistry of MOE, Institute of Theoretical and Computational Chemistry, Nanjing UniVersity, Nanjing 210093, People’s Republic of China, and Department of Theoretical Chemistry, School of Biotechnology, Royal Institute of Technology, S-106 91 Stockholm, Sweden ReceiVed: April 6, 2010; ReVised Manuscript ReceiVed: June 30, 2010
An efficient implementation of the generalized energy-based fragmentation (GEBF) approach (Li, W.; Li, S.; Jiang, Y. J Phys. Chem. A 2007, 111, 2193) for treating a wide range of large molecules is presented. In this implementation, the fragmentation process can be automatically done for a general molecule, with only some functional groups defined by users. A new and fast scheme is designed for the generation of various subsystems and the derivation of their coefficients. The newly implemented GEBF approach has been applied to several large molecules including proteins, nucleic acids, and supermolecules with fused aromatic rings. Test calculations within the Hartree-Fock (HF) and density functional theory (DFT) framework demonstrate that the GEBF approach can provide reasonably accurate ground-state energies and optimized structures, which are in good agreement with those from conventional HF or DFT calculations. The GEBF approach implemented in this work can now be employed by nonexpert users to compute energies, optimized structures, and some molecular properties at various ab initio levels for a broad range of large molecules on ordinary PC workstations. 1. Introduction The development of fast and efficient electronic structure algorithms for large molecules has attracted much attention in recent years. Theoretical chemists have proposed a variety of linear scaling approaches for various quantum chemistry methods. For example, linear scaling Hartree-Fock (HF) or density functional theory (DFT) algorithms1-4 are based on the “nearsightedness” of the density matrix, while linear scaling post-HF algorithms5-16 can be achieved with the use of localized molecular orbitals or atomic orbitals. However, as the crossover between these linear scaling approaches and the corresponding conventional approaches occurs at quite large molecules (usually with more than two hundred atoms), they have not been established as a practical tool for electronic structure and property calculations of large systems. On the basis of the chemical intuition, people have also developed various fragment-based approaches for quantum chemical calculations of large molecules.17-61 The main idea of these fragment-based approaches is to obtain the total energy or molecular properties of a large system from conventional quantum chemistry calculations on a series of its subsystems. Various fragment-based approaches employ different strategies to evaluate the energy and properties of the target system. For example, density-matrix based approaches17-26 employ the transferability of density matrix elements from subsystems to the parent system, and energy-based approaches34-61 are based on the transferability of fragment energies. Since energy-based approaches only require energies or energy derivatives of various subsystems, which can be computed with existing quantum chemical programs, they can be easily implemented at various * To whom correspondence should be addressed. E-mail: shuhua@ nju.edu.cn. † Nanjing University. ‡ Royal Institute of Technology.
theoretical levels without much additional programming. Several variants of energy-based approaches exist, among which some approaches34-36 are only applicable to molecular clusters, and some37-61 are applicable for both covalently bonded macromolecules and molecular clusters. With functional groups as fragments, an automatic fragmentation procedure has been developed for two energy-based approaches, the systematic fragmentation method46-51 and the cardinality guided molecular tailoring approach.56-61 These two approaches have been employed to calculate energies and optimized structures for a variety of systems. However, the automatic fragmentation procedure developed previously still has some difficulties in dealing with systems with many fused rings.48,59 To deal with general large molecules with polar or charged groups, which are difficult to be accurately treated by most of energy-based approaches mentioned above, we developed the generalized energy-based fragmentation (GEBF) approach.40 In this approach, the ground-state energy of a large molecule can be derived from energies of various subsystems, each of which is embedded in the background charges generated by all atoms outside this subsystem. The GEBF approach has been demonstrated to provide satisfactory descriptions for energies, some electronic properties, and vibration spectra for a wide range of systems including proteins and water clusters.41-45 However, we should mention that the implementation of the GEBF approach reported earlier did not allow general large molecules to be treated without manual involvement. On the one hand, in our previous work the fragmentation process requires users to provide all fragments by hand. On the other hand, the procedure for constructing coefficients of various subsystems is not very efficient for quite large systems with many fragments.40 In this work, we report an efficient implementation of the GEBF approach for treating general large molecules. In this new implementation, an automatic fragmentation procedure is developed, in which a heavy atom (and its bonded hydrogens)
10.1021/jp103074f 2010 American Chemical Society Published on Web 07/19/2010
GEBF Approach for General Large Molecules
J. Phys. Chem. A, Vol. 114, No. 31, 2010 8127
or a functional group (such as a double or triple bond) is defined as a fragment. With a distance threshold, all subsystems can be automatically constructed by this procedure. Besides this, a quite efficient way to construct all subsystems and derive their coefficients is proposed. The effectiveness and applicability of this implementation have been demonstrated by performing a few test calculations. This paper is organized as follows. In section 2, we will introduce the working principle of the GEBF method first and then describe in detail the automatic fragmentation procedure and the efficient procedure for deriving coefficients of subsystems. Next, the newly implemented GEBF approach is applied to perform single-point energy calculations and geometry optimizations for several large systems with complicated structures. Finally, a brief summary is given in section 4. 2. Methodology 2.1. GEBF Method. Within the GEBF approach, the groundstate energy of a large molecule can be assembled from energies of various subsystems, each of which is embedded in the background point charges at all distant atoms (outside this subsystem). With such “embedded” subsystems, the interaction between two distant fragments in the target system is approximately taken into account. As a result, the GEBF approach is applicable for systems with many polar or charged groups. The total energy of a large system can be evaluated with the expression below40 M
Etot )
∑ m
[( ∑ ) ] ∑ ∑ M
CmE˜m -
Cm - 1
m
A
QAQB RAB B>A
may be iterated many times until net atomic charges on all atoms do not change. We have found that the natural atomic charges from the first iteration are nearly convergent and thus can be employed to compute the GEBF energy (or properties) of the target molecule. A detailed description on this procedure has been discussed in our previous work.40 Within the GEBF approach, it is quite straightforward to compute the energy gradients of the target system. By differentiating eq 1 with respect to a certain Cartesian displacement qIi (i ) x, y, z) of the Ith atom, one can obtain the corresponding energy gradient as below43
∂Etot ) ∂qIi
∂E˜
∑ Cn ∂qIin
∂E˜
∑ Ck ∂qIik -
+
n
k
[(
∑ Cm) - 1] ∑ m
QIQA(qAi - qIi)
A*I
3 RAI
(2)
Here the summation over n in the first term is limited to those subsystems containing atom I as a real atom; the summation over k in the second term includes all subsystems with atom I as a charge center, (∂E˜m)/(∂qIi) denotes the i component of the gradient on the real atom I (in the first term) or the charge center I (in the second term) in the mth “embedded” subsystem. It turn outs that eq 2 can be further approximated as below (the derivation can be found in the Supporting Information)
∂Etot ≈ ∂qIi
∂E˜
∑ Cn ∂qIin
(3)
n
(1)
where E˜m denotes the total energy (including the self-energy of the point charges) of the mth subsystem (M is the total number of subsystems), Cm represents the coefficient of the mth subsystem, which can be automatically derived by following some rules (as discussed in the subsection 2.3), and QA denotes the net atomic charge on atom A. The second term in the equation above is included to avoid the overcounting of the interactions between fragments. Let us describe how to add background point charges for each subsystem. For all atoms outside a given subsystem (but within the target system), they are approximately modeled as point charges. These point charges are assumed to locate at the corresponding nuclear centers. It should be mentioned that for a given subsystem those junction atoms (atoms replaced by capping hydrogen atoms) are also represented as point charges. To get atomic charges for all atoms in the target system, the following procedure is adopted. Although the definition of atomic charges is not unique, our numerical calculations show that natural atomic charges62,63 derived from the HF calculation can give satisfactory results in most cases. In the first step, we perform conventional HF calculations on all primitive subsystems (without background point charges) to get an initial guess for net atomic charges on all atoms. The definition and construction of primitive subsystems will be discussed in the subsequent subsection. Since a primitive subsystem is centered on a given fragment, only net atomic charges on atoms of this fragment will be extracted from calculation on a primitive subsystem. Next, a conventional HF calculation is carried out again for each primitive subsystem, which is placed into the background point charges generated above, and more realistic net atomic charges can be obtained. In principle, this process
Thus, the gradient of the GEBF energy on a given atom can be evaluated with the corresponding gradients on this atom in some subsystems containing this atom as a real atom. The latter information can be obtained with many ab initio packages. It should be pointed out that the gradients of link-atom hydrogens in various subsystems are not required for the geometry optimization of the target system, since the positions of those link-atom hydrogens are only determined by the coordinates and types of their bonded atoms (see the discussion in the next subsection). Our numerical calculations have demonstrated that eq 3 works well for a wide range of systems. For example, for a water cluster (H2O)20, the maximum absolute deviation and root-mean-square deviation (rmsd) between the GEBF-HF/631G** energy gradients (each water molecule is a fragment and ξ ) 3.0 Å) and the HF/6-31G** energy gradients at the first step of geometry optimization is about 5 × 10-4 and 2 × 10-4 au, respectively. The rmsd between the optimized structures obtained with the GEBF-HF and standard HF methods is only 0.014 Å. Thus, eq 3 will be used in this work for geometry optimizations. 2.2. Automatic Fragmentation Procedure. As mentioned above, in GEBF and other energy-based approaches, a large molecule is usually divided into many fragments by cutting single bonds or hydrogen bonds. To keep some functional groups (like multiple bonds, carboxylic acid, carbonyl groups, and so on) intact, one must choose these functional groups as fragments. Besides these functional groups, each heavy atom (and its bonded hydrogen atoms) is chosen as a fragment. By defining fragments in this manner, one can achieve the automatic fragmentation for a general large molecule. After dividing the target molecule into many small fragments, we can construct a primitive subsystem for every fragment according to the
8128
J. Phys. Chem. A, Vol. 114, No. 31, 2010
Figure 1. Schematic illustration of the extension rules in constructing a primitive subsystem: (a) Inclusion of the ring structure; (b) inclusion of the nearest bond-linked fragment for the terminal fragment -CH2-; (c-e) inclusion of two consecutive fragments for the terminal fragments such as -CH3, -OH, and -CH2-. All fragments within a given distance of a central fragment constitute a fragment domain, denoted as a dotted circle.
following rules. First, for each fragment, connect this fragment with its spatially close fragments within a given distance threshold (ξ) to form a fragment domain. For convenience, all fragments in the target molecule are numbered from one to the total number of fragments (this order is arbitrary). The number assigned to a fragment is denoted as its “serial number”. To better reproduce the local chemical surrounding of a given fragment, the fragment domain constructed above will be extended slightly by following some “extension” rules: (1) If one fragment in the fragment domain belongs to one or two rings other fragments in the related one or two rings, which are beyond the distance threshold from the central fragment, should also be added to keep the related rings intact. But the following extension rule does not apply to these extra fragments beyond the distance threshold. This case is schematically represented in Figure 1a, and an example for illustrating this extension is provided in the Supporting Information. (2) If a terminal fragment in the fragment domain contains no more than two non-hydrogen atoms its bond-linked fragments should be included to mimic its local chemical environment. As shown in Figure 1b, for the terminal fragment sCH2s, a formaldehyde (sHCdO) group is a better capping group than a simple hydrogen atom. Furthermore, if the bond-linked fragment contains only one non-hydrogen atom, then the fragments bonded to this fragment should be added to better reproduce the local chemical environment. A few most common cases are shown in parts c-e of Figure 1. For example, in Figure 1d, for the terminal -OH fragment, an ethyl group (rather than a methyl group) should be used to mimic the alkyl chain. In
Hua et al. application of this rule, bond-linked fragments usually mean two fragments covalently bonded to each other. However,if two fragments are connected by two or more hydrogen bonds,64 these two fragments will also be treated as “bond-linked”. In the final step, for the extended fragment domain, we should add hydrogen atoms (if necessary) as capping atoms for satisfying the valences to build a primitive subsystem. So a primitive subsystem consists of a subset of fragments constructed above and some capping hydrogen atoms (if necessary). Now we describe how to obtain the positions of these capping hydrogen atoms. If a covalent X-Y bond is broken, one replaces the Y atom (called junction atom) with a hydrogen atom and places this added hydrogen atom somewhere between atoms X and Y. The X-H bond length is set to be a constant, depending on the nature of the X atom. In the present work, we use 1.07 Å for C-H, 1.00 Å for N-H, and 0.96 Å for O-H. After all primitive subsystems are constructed, it is necessary to determine whether a subsystem is identical to other subsystems or totally embedded in other larger subsystems. If yes, this subsystem is excluded from the primitive subsystem list. 2.3. An Efficient Way for Constructing Derivative Subsystems and Deriving Their Coefficients. It is clear that if one decomposes the total energy of the target system into onefragment or multifragment terms, each specific term will have the coefficient +1. In the GEBF approach, the coefficients of all primitive subsystems obtained above are set to be +1 by construction. For convenience in latter discussions, we define a parameter η to be the maximum number of fragments in all primitive subsystems. By decomposing the total energy of each primitive subsystem into one-fragment or multifragment terms, one can find that some terms may have the coefficients different from +1, since some primitive subsystems are overlapping (they share some fragments). Thus, to eliminate the overcounting of some one-fragment or multifragment terms due to the overlapping of primitive subsystems, a series of smaller subsystems, called derivative subsystems, should be constructed. The guiding rule for constructing derivative subsystems and deriving their coefficients is to make sure that for any specific m-fragment (m e η) interaction term the net number of this term derived from decomposing the energies of all primitive subsystem and derivative subsystems is +1. The first step is to construct m-fragment (m ) η - 1) derivative subsystems. Assume that the net number of a specific (η - 1) body term occurring in all primitive subsystems is counted to be k; then a derivative subsystem containing these (η - 1) fragments will be constructed, with its coefficient being set as (1 - k). Obviously, in the case k ) 1, this derivative subsystem should be excluded. Once this derivative subsystem is added, the list of subsystems (both primitive and derivative) should be updated. Clearly, other (η - 1) fragment derivative subsystems can be constructed in a similar way. In the subsequent steps, derivative subsystems with m fragments (m ) η - 2, ..., 2, 1) and their coefficients can be obtained with the procedure described above. To illustrate this procedure, we will take β-strand acetyl(ala)5NH2 as an example. Assume that each fragment is a carbonyl group or a non-hydrogen atom; this parent system can be divided into 19 fragments, as shown in Figure 2 (all hydrogen atoms are not shown for clarity). By setting the distance threshold ξ to be 3.0 Å, the program can automatically generate 19 primitive subsystems (Table 1). Among 19 subsystems, some subsystems are identical to each other, and some are totally embedded in other larger subsystems, so that 13 subsystems will be deleted from the list. The retained 6 primitive subsystems (they are renumbered) are shown in Table 2. We take the first
GEBF Approach for General Large Molecules
J. Phys. Chem. A, Vol. 114, No. 31, 2010 8129 TABLE 3: Components (in Terms of Fragments) of Four Primitive Subsystems Selected for Illustrating the Generation of Derivative Subsystems
Figure 2. Fragmentation scheme of β-strand acetyl(ala)5NH2 (all hydrogen atoms are not shown for clarity).
TABLE 1: Initial Primitive Subsystems Constructed for β-Strand Acetyl(ala)5NH2a subsystem
type
componentsb
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
primitive primitive primitive primitive primitive primitive primitive primitive primitive primitive primitive primitive primitive primitive primitive primitive primitive primitive primitive
(1, 2, 3, 4, 5, 6) (1, 2, 3, 4, 5, 6) (1, 2, 3, 4, 5, 6, 7, 8) (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) (2, 3, 4, 5, 6, 7, 8, 9, 10) (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) (3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14) (3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14) (6, 7, 8, 9, 10, 11, 12, 13, 14) (6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16) (7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18) (7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18) (10, 11, 12, 13, 14, 15, 16, 17, 18) (10, 11, 12, 13, 14, 15, 16, 17, 18, 19) (11, 12, 13, 14, 15, 16, 17, 18, 19) (11, 12, 13, 14, 15, 16, 17, 18, 19) (14, 15, 16, 17, 18, 19) (14, 15, 16, 17, 18, 19)
a Its various fragments are shown in Figure 2, and the distance threshold ξ is chosen to be 3.0 Å. b Numbers in parentheses stand for the serial numbers of fragments in a given subsystem.
TABLE 2: Primitive and Derivative Subsystems Constructed for β-Strand Acetyl(ala)5NH2 subsystem
type
components
coefficient
1 2 3 4 5 6 7 8 9 10 11
primitive primitive primitive primitive primitive primitive derivative derivative derivative derivative derivative
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) (3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14) (6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16) (7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18) (10, 11, 12, 13, 14, 15, 16, 17, 18, 19) (3, 4, 5, 6, 7, 8, 9, 10, 11, 12) (7, 8, 9, 10, 11, 12, 13, 14, 15, 16) (2, 3, 4, 5, 6, 7, 8, 9, 10) (6, 7, 8, 9, 10, 11, 12, 13, 14) (10, 11, 12, 13, 14, 15, 16, 17, 18)
+1 +1 +1 +1 +1 +1 -1 -1 -1 -1 -1
primitive subsystem (centered on fragment 1) as an example to illustrate the process of constructing a primitive subsystem. With ξ ) 3.0 Å, fragments 2 and 3 should be included in the fragment domain of fragment 1. According to the “extension” rules, fragment 4 should be added to the fragment domain to cap the terminal fragment 3 (which contains one non-hydrogen atom). As fragment 4 consists of only one non-hydrogen atom, we should also include its bond-linked fragments 5 and 6 in the fragment domain of fragment 1. Thus, the primitive subsystem for fragment 1 contains all fragments from 1 to 6 and an extra hydrogen atom (used for capping the fragment 6). By checking 19 primitive subsystems listed in Table 1, we find that subsystem 1 is the same as subsystem 2 and is totally embedded in subsystems 3, 4, and 5, respectively. Thus, subsystems 1 and 2 should be excluded in the final primitive subsystem list. Similarly, subsystems 3 and 4 should also be deleted. Subsystem 5 will be retained, which can be considered as the parent subsystem for fragments from 1 to 5.
primitive subsystem
components
coefficient
a b c d
1, 2, 3, 5, 6 2, 3, 4, 5, 6 1, 2, 5, 6, 7 2, 3, 5, 6, 7
+1 +1 +1 +1
The coefficients of these six primitive subsystems are set to +1 by construction. By checking these six primitive subsystems, one can see that the maximum number of fragments (the parameter η) in all primitive subsystems is 12. According to the rules described above, the next step is to find overlapping 11-fragment terms (only these terms) by comparing the components of any two primitive subsystems. Since there are no overlapping 11-body terms among these primitive subsystems, no derivative subsystems are generated in this step. Then, the program will look for 10-body overlapping terms by comparing the components of any two primitive subsystems. In doing so, two derivative subsystems, (3, 4, 5, 6, 7, 8, 9, 10, 11, 12) and (7, 8, 9, 10, 11, 12, 13, 14, 15, 16), will be generated. Here numbers in the parentheses represent the serial numbers of some fragments or some capped fragments (this information can be derived automatically by the program). As the total coefficients of these two 10-body energy terms from decomposing the energies of six primitive subsystems are +2, the coefficients of these two derivative subsystems should be set to be -1. Next, the program will continue to find overlapping 9-body terms among the updated subsystems (six primitive and two derivative). It turns out that only three 9-fragment derivative subsystems, (2, 3, 4, 5, 6, 7, 8, 9, 10), (6, 7, 8, 9, 10, 11, 12, 13, 14), and (10, 11, 12, 13, 14, 15, 16, 17, 18), should be generated, with their coefficients determined to be -1. In the following steps, the program will check other overlapping n-body terms (n ) 8, 7, ..., 1) to find out other derivative systems. However, in this case, no other derivative subsystems are required. All derivative subsystems and their coefficients are also listed in Table 2. One may notice that, if the number of fragments in subsystems is quite large (e.g., about 30), the cost of searching and counting a given m-fragment energy term occurring in all subsystems (both primitive and derivative) will be relatively expensive especially for treating very large systems with hundreds of fragments. So it is very important to develop an efficient algorithm for comparing overlapping multifragment terms and deriving the coefficients of the corresponding derivative subsystems. Suppose that there are four primitive subsystems (a, b, c, d). The components of these subsystems (in terms of their fragments) and their coefficients are shown in Table 3. As the number of fragments in these subsystems is 5, we will first find out overlapping 4-fragment terms by comparing every two primitive subsystems. For example, the overlapping of subsystems a and b will lead to a term (2, 3, 5, 6); the overlapping of subsystems a and c will lead to a term (1, 2, 5, 6), etc. Obviously, the overlapping of different subsystem pairs may give rise to the same overlapping multifragment term. For instance, the pairs (a, b), (a, d), and (b, d) lead to the same term, (2, 3, 5, 6). Thus, it is necessary to merge all overlapping multifragment terms so that a list of various unique multifragment terms can be obtained. To speed up the comparison, we will define a compound index H(k) for each multifragment term. This index is defined as below
8130
J. Phys. Chem. A, Vol. 114, No. 31, 2010
Hua et al.
N
H(k) )
∑ ki
(4)
i)1
in which ki denotes the serial number of the ith fragment in this multifragment term. For example, the index for the term (2, 3, 5, 6) is 16, and that for the term (1, 2, 5, 6) is 14. In determining whether a multifragment term is identical to another multifragment term, one can compare their compound indices first. If two multifragment terms have different indices, they must be different. Otherwise, the program will continue to compare all components corresponding to these two multifragment terms one by one. The use of this compound index greatly improves the efficiency of comparing various multifragment terms. On the other hand, when an overlapping multifragment term differs from the other overlapping multifragment terms in the list, information on its parent subsystems should also be stored. If an overlapping multifragment term is identical to a certain term in the list generated before, information on the parent subsystems of this term should be updated. For example, as the overlapping of the subsystem pairs (a, b), (a, d), and (b, d) all leads to the term (2, 3, 5, 6), the parent subsystems of this term will include three subsystems (a, b, d). Once overlapping m-fragment terms and their corresponding parent subsystems are generated, the coefficients of the corresponding m-fragment derivative subsystems can be readily derived. For example, the coefficient of the derivative subsystem (2, 3, 5, 6) can be calculated as [1 - (C(a) + C(b) + C(d))], with C(a), C(b), and C(d) being the coefficients of its parent subsystems. With the procedures described above, one can obtain all 4-fragment derivative subsystems and their coefficients for four primitive subsystems (a, b, c, d), as shown in Table 4.
TABLE 4: Overlapping 4-Fragment Terms and the Coefficients of the Corresponding Derivative Subsystems derivative parent subsystem components indexa subsystems e
2, 3, 5, 6
16
a,b,d
f g
1, 2, 5, 6 2, 5, 6, 7
14 20
a, c c, d
coefficientb -2 ) C(b) -1 ) -1 )
1 - [C(a) + + C(d)] 1 - [C(a) + C(c)] 1 - [C(c) + C(d)]
a The definition of this index is discussed in the text. coefficients of four primitive subsystems are from Table 3.
b
The
3. Results and Discussion In this section, the newly implemented GEBF approach will be applied to obtain ground-state energies and optimized geometries for some large molecules with complicated structures. As shown previously,40-45 the GEBF approach is applicable at various theoretical levels, such as HF, DFT, and so on. Previous applications40-45 showed that the accuracy of the GEBF approach seems to be independent of the theoretical levels. In this work, GEBF calculations are carried out with the LSQC program,65 in which the GAUSSIAN 03 package66 is used for ab initio calculations on various subsystems. 3.1. Ground-State Energies. The GEBF-HF approach is applied to compute the ground-state energies for the following molecules (shown in Figure 3): the DNA poly(dA · dT) decamer, protein fibril (2KIB) and three supermolecules. The geometry of the DNA poly(dA · dT) decamer is taken from the nucleic acid database.67 The protein fibril (2KIB) is taken from the protein databank of RCSB.68 The three supermolecules are taken from the database of Cambridge Crystallographic Data Centre.69 The coordinates of these systems are given in the Supporting Information for reference. For these systems (and two protein molecules in the next subsection), the carbonyl group, the carboxylic group, the ammonium group, and the guanidinium group are defined as a fragment, respectively. Besides these, the phosphate group and each base in the DNA poly(dA · dT) decamer are treated as a fragment, respectively. For those molecules with conjugated rings, each double bond in a given Kekule´ structure is selected as a fragment. This strategy was adopted by Bettens and Lee previously in their fragmentation procedure.53 Except these functional groups, each of other non-
Figure 3. Selected systems under study.
hydrogen atoms is implicitly assumed to constitute a fragment. It should be mentioned that the selection of different Kekule´ structures for fragmentation may give slightly different energies for a large molecule with aromatic rings. Nevertheless, this arbitrariness may have little effect on the problems under study (as discussed later in the following subsections), since the energy deviations obtained with different Kekule´ structures (for fragmentation) are generally within the accuracy of the GEBF approach. According to certain rules, our program can automatically generate one Kekule´ structure for the purpose of fragmentation. The 6-31G(d) basis set is used for the DNA poly(dA · dT) decamer, and a 6-31G basis set is used for other systems. For these molecules, we have shown in Table 5 their groundstate energies calculated by the conventional HF and GEBFHF methods for comparison. One can see that the difference between the GEBF-HF energies and conventional HF values
GEBF Approach for General Large Molecules
J. Phys. Chem. A, Vol. 114, No. 31, 2010 8131
TABLE 5: Ground-State Energies Calculated with the Conventional HF and GEBF-HF Methodsa largest subsystemc
energy difference (kcal/mol) molecule
basis setb
conventional HF (au)
GEBF (ξ ) 3.5 Å)
GEBF (ξ ) 3.0 Å)
GEBF (ξ ) 3.5 Å)
GEBF (ξ ) 3.0 Å)
EGAKUT EGEFIG LOKFOH (4-) 2KIB(4+,4-) DNA decamer (18-)
6-31G (2152) 6-31G (1489) 6-31G (2390) 6-31G (2252) 6-31G* (4176)
-8477.227 20 -5541.079 33 -11093.262 84 -9852.529 66 -26322.982 77
-1.08 -5.26 5.34 2.99 -5.05
-0.76 -6.73 -8.17 -8.45 30.01
493 469 610 587 1224
441 332 534 553 1051
a For GEBF-HF, the energy deviations (in kcal/mol) with respect to the conventional HF energies are listed. b The number of basis functions included in the parentheses. c The number of basis functions for the largest subsystem in GEBF-HF calculations.
TABLE 6: Relative Energies of 10 Conformers of 1VTP (4+,7-) and 1OMG (7+,2-) Calculated by Conventional B3LYP and GEBF-B3LYP Methods with the 6-31G Basis Seta 1VTP (4+,7-)
1OMG (7+,2-)
conformers
conventional B3LYP (kcal/mol)
GEBF-B3LYP (kcal/mol)
conventional B3LYP (kcal/mol)
GEBF-B3LYP (kcal/mol)
1 2 3 4 5 6 7 8 9 10
0.0 2.89 6.28 27.74 28.61 32.25 35.52 43.86 49.07 70.09
0.0 -2.76 1.51 23.85 26.23 32.13 34.89 44.05 45.37 67.14
0.0 4.75 10.29 11.48 12.17 18.26 21.52 21.90 29.68 45.75
0.0 8.97 11.99 12.61 16.63 20.08 21.46 22.40 29.99 44.93
a
For both systems, the B3LYP energy of the conformer 1 is taken as zero. In GEBF-B3LYP calculations, the parameter ξ is set to be 3.5 Å. For the conformer 1, the GEBF-B3LYP energy relative to the B3LYP energy is 1.51 kcal/mol for 1VTP and -3.14 kcal/mol for 1OMG.
for each system is less than 5.3 kcal/mol when the parameter ξ is set to be 3.5 Å. The accuracy of the GEBF-HF approach for charged systems (LOKFOH, 2KIB, and DNA decamer) is similar to that for neutral systems (EGAKUT and EGEFIG). It should be mentioned that the GEBF approach implemented previously40-45 can not treat systems like EGAKUT and EGEFIG with many fused rings. The DNA decamer is the largest molecule in our calculations, which contains 4176 basis functions (638 atoms). For this molecule, the largest subsystem in the GEBF calculations also requires 1224 basis functions (189 atoms) to obtain reasonably accurate results. For all systems under study, the size of the largest subsystem is between onefourth to one-third of the size of the whole system. For all molecules (except the DNA decamer), the GEBF-HF energies calculated with ξ ) 3.5 Å yield only slight improvement on those calculated with ξ ) 3.0 Å. Our calculations also show that the GEBF-HF results calculated with ξ ) 4.0 Å do not provide systematic improvement over those calculated with ξ ) 3.5 Å. Thus, the GEBF approach with ξ ) 3.5 Å is expected to provide reasonably accurate results for a wide range of chemical systems. 3.2. Relative Energies of Proteins. In chemistry, the relative energies among different conformers are more important quantities than the absolute energies. Here, for two small proteins, vacuolar targeting peptide (1VTP) and ω-conotoxin mviia (1OMG), we will investigate the relative stabilities of their ten conformers with both conventional B3LYP and GEBF-B3LYP approaches. For each protein, we first perform a MD simulation with the AMBER package70 for 800 ps after 200 ps of equilibration (with a time step of 2 fs) at 298 K. Along the trajectories, 10 conformers with the lowest energies are selected for our study. The fragmentation scheme for these two systems is described in the preceding subsection. As shown in Table 6 and Figure 4, one can notice that the relative energies of ten conformers predicted with the GEBF-
Figure 4. Comparison of GEBF-B3LYP/6-31G (ξ ) 3.5 Å) and conventional B3LYP/6-31G energies of ten conformers of (a) 1VTP (4+,7-) and (b) 1OMG (7+,2-). For 1VTP, the energy of -6320480 kcal/mol is taken as zero. For 1OMG, the energy of -7012390 kcal/ mol is taken as zero.
B3LYP/6-31G method (ξ ) 3.5 Å) are in good agreement with those from conventional B3LYP/6-31G calculations for each
8132
J. Phys. Chem. A, Vol. 114, No. 31, 2010
Hua et al.
Figure 5. The superposition between two optimized structures of protein 1RPB obtained with conventional B3LYP and GEBF-B3LYP calculations.
protein. The mean absolute deviation of relative energies for 9 conformers (with respect to the energy of the conformer 1) is 2.64 kcal/mol for 1VTP and 1.47 kcal/mol for 1OMG, respectively. Correspondingly, the maximum deviation is 5.65 kcal/ mol for 1VTP and 4.46 kcal/mol for 1OMG. On the other hand, the maximum energy difference between the GEBF-B3LYP energies and conventional B3LYP energies for all conformers is 4.14 and 4.03 kcal/mol for 1VTP and 1OMG, respectively. Thus, one can expect that the GEBF-B3LYP method will be very useful for searching for low-energy conformers of many biological molecules. Nevertheless, the GEBF-B3LYP approach may not be able to correctly predict the relative order of different conformers if these conformers are very close in energies (i.e., within 5.0 kcal/mol). 3.3. Optimized Geometries. With an interface between Gaussian03 package and our own LSQC program, we can apply the Berny algorithm in the Gaussian03 package to perform GEBF geometry optimizations at various levels of theory for large molecules containing hundreds of atoms. Here, the structure of a small protein (1RPB) will be optimized with the GEBF-B3LYP method for illustration. This protein contains 280 atoms and is expected to have a flexible three-dimensional structure. The 6-31G basis set is employed. In GEBF-B3LYP calculations, the definitions of various fragments are the same as described in subsection 3.1, and the parameter ξ is set to be 3.5 Å. The convergence criteria for the maximum gradient and displacement are set to be 0.00045 (hartree/Bohr) and 0.0018 (Å) (the default values in Gaussian03 package), respectively. Figure 5 shows the superposition between the two optimized structures obtained with standard B3LYP and GEBF-B3LYP approaches, respectively. The rmsd between the two structures is 0.572 Å. If only the non-hydrogen atoms are considered, the rmsd is reduced to 0.510 Å. Thus, the geometrical parameters obtained with the GEBF-B3LYP approach are very close to the corresponding values from the conventional B3LYP approach. In addition, the conventional B3LYP energy at the GEBFB3LYP optimized structure deviates from that at the B3LYPoptimized structure only by 1.47 kcal/mol. As described earlier, the selection of an arbitrary Kekule´ structure for systems with aromatic rings may have an impact on the GEBF-optimized structure and energy. For aromatic rings, there exist many Kekule´ structures, each of which corresponds to a different fragmentation scheme (double bonds are implicitly assumed as fragments). Here, we select a perylene bisimide derivative (a part of the supermolecule EGAKUT) as an example to explore the influence of selecting different Kekule´ structures.
Figure 6. (a) Two Kekule´ structures for a perylene bisimide derivative. (b) Optimized structures obtained from GEBF-HF calculations with two different fragmentation schemes corresponding to these two Kekule´ structures. Selected geometrical parameters (bond lengths in Ångstroms, and bond angles in degrees) obtained with the Kekule´ structure (II) are listed in parentheses.
For this molecule, two Kekule´ structures of its central part are shown in Figure 6a. The definitions of various fragments are similar to those described before for EGAKUT. With two fragmentation schemes, GEBF-HF/6-31G** (ξ ) 3.0 Å) geometry optimizations starting from the same initial structure lead to two slightly different optimized structures, as shown in Figure 6b. One can see that the optimized bond lengths and bond angles of the two structures are very close. The rmsd between the two structures are 0.001 Å for bond lengths, 0.057° for bond angles, and 0.369° for dihedral angles. This result shows that the selection of different Kekule´ structures for fragmentation has a negligible effect on the GEBF optimized geometry. On the other hand, the conventional HF and GEBFHF energies at the initial structure and two optimized structures are shown in Table 7. GEBF-HF energies calculated with two different fragmentation schemes at the same initial structure differ from each other by 2.45 kcal/mol. Thus, the GEBF energy slightly depends on the selected Kekule´ structure for fragmentation. If the energy difference between two arbitrary structures is concerned, it is better to choose the same Kekule´ structure for fragmentation in GEBF calculations. If only the optimized geometries are concerned, the selection of an arbitrary Kekule´ structure for the fragmentation scheme will give reasonably accurate results.
GEBF Approach for General Large Molecules
J. Phys. Chem. A, Vol. 114, No. 31, 2010 8133
TABLE 7: Ground-State Energies Calculated with the Conventional HF and GEBF-HF Methods for a Perylene Bisimide Derivativea GEBF-HF (a.u.) structure initial opt-1 opt-2
b
conventional HF (au) -3320.731 4363 -3321.513 8563 -3321.513 9679
Kekule´-1
Kekule´-2
-3320.730 6397 -3320.734 5703 -3321.512 1735 -3321.515 1702 -3321.512 4208 -3321.515 4186
a In GEBF-HF (ξ ) 3.0 Å) calculations, two different Kekule´ structures correspond to two different fragmentation schemes. b The opt-1 (or opt-2) structure is obtained from GEBF-HF optimizations starting with the Kekule´-1 (or Kekule´-2) structure.
4. Conclusions In this work, we report an efficient implementation of the GEBF approach for treating a wide range of large molecules. In this implementation, the fragmentation procedure for a general molecule can be automatically done with only some functional groups defined by users. A new and fast scheme is also designed for the generation of various derivative subsystems and the derivation of their coefficients. The newly implemented GEBF approach has been applied to several large molecules including proteins, nucleic acids, and supermolecules with fused aromatic rings. Test calculations within the HF (or DFT) framework demonstrate that the GEBF approach can provide reasonably accurate ground-state energies and optimized structures, which are in good agreement with those from conventional HF (or DFT) calculations. It is worth mentioning that the GEBF approach implemented here can readily treat systems containing planar fused rings, which are difficult for the GEBF approach implemented previously and some other energy-based fragmentation approaches. On the other hand, we should point out some limitations of the GEBF approach so that this approach can be appropriately employed by nonexpert users. This approach is not suitable for those systems with strongly delocalized electrons (like metallic compounds) or systems containing transition metals. Keeping these limitations in mind, nonexpert users now can employ the GEBF approach to obtain energies, optimized structures, and some molecular properties at various ab initio levels for a broad range of large molecules on ordinary PC workstations. Acknowledgment. This work was supported by the National Natural Science Foundation of China (Grant Nos. 20625309 and 20833003). Supporting Information Available: The detailed derivation of eq 3, an illustrative example on the construction of primitive subsystems for a system with fused conjugated rings, and the Cartesian coordinates of all systems in subsection 3.1. This material is available free of charge via the Internet at http:// pubs.acs.org. References and Notes (1) (2) (3) (4) 10891. (5) (6) (7) (8) (9) (10)
Yang, W. Phys. ReV. Lett. 1991, 66, 1438. Scuseria, G. E. J. Phys. Chem. A 1999, 103, 4782. Millam, J. M.; Scuseria, G. E. J. Chem. Phys. 1997, 106, 5569. Li, X. P.; Nunes, R. W.; Vanderbilt, D. Phys. ReV. B 1993, 47, Pulay, P. Chem. Phys. Lett. 1983, 100, 151. Saebø, S.; Pulay, P. Annu. ReV. Phys. Chem. 1993, 44, 213. Hampel, C.; Werner, H.-J. J. Chem. Phys. 1996, 104, 6286. Ayala, P. Y.; Scuseria, G. E. J. Chem. Phys. 1999, 110, 3660. Scuseria, G. E.; Ayala, P. Y. J. Chem. Phys. 1999, 111, 8330. Almlo¨f, J. Chem. Phys. Lett. 1991, 181, 319.
(11) Head-Gordon, M.; Maslen, P. E.; White, C. A. J. Chem. Phys. 1998, 108, 616. (12) Nakao, Y.; Hirao, K. J. Chem. Phys. 2004, 120, 6375. (13) Fo¨rner, W.; Ladik, J.; Otto, P.; Cˇizˇek, J. Chem. Phys. 1985, 97, 251. (14) Li, S.; Ma, J.; Jiang, Y. J. Comput. Chem. 2002, 23, 237. (15) Li, S.; Shen, J.; Li, W.; Jiang, Y. J. Chem. Phys. 2006, 125, 074109. (16) Subotnik, J. E.; Sodt, A.; Head-Gordon, M. J. Chem. Phys. 2006, 125, 074116. (17) Exner, T. E.; Mezey, P. G. J. Phys. Chem. A 2004, 108, 4301. (18) He, X.; Zhang, J. Z. H. J. Chem. Phys. 2005, 122, 031103. (19) Chen, X.; Zhang, Y.; Zhang, J. Z. H. J. Chem. Phys. 2005, 122, 184105. (20) Chen, X.; Zhang, J. Z. H. J. Chem. Phys. 2006, 125, 044903. (21) Li, W.; Li, S. J. Chem. Phys. 2005, 122, 194109. (22) Gu, F. L.; Aoki, Y.; Korchowiec, J.; Imamura, A.; Kirtman, B. J. Chem. Phys. 2004, 121, 10385. (23) Akama, T.; Kobayashi, M.; Nakai, H. J. Comput. Chem. 2007, 28, 2003. (24) Kobayashi, M.; Nakai, H. Int. J. Quantum Chem. 2009, 109, 2227. (25) Kobayashi, M.; Imamura, Y.; Nakai, H. J. Chem. Phys. 2007, 127, 074103. (26) Kobayashi, M.; Nakai, H. J. Chem. Phys. 2009, 127, 074103. (27) Nagata, T.; Fedorov, D. G.; Kitaura, K.; Gordon, M. S. J. Chem. Phys. 2009, 131, 024101. (28) Li, H.; Fedorov, D. G.; Nagata, T.; Kitaura, K.; Jensen, J. H.; Gordon, M. S. J. Comput. Chem. 2010, 31, 778. (29) Fedorov, D. G.; Ishimura, K.; Ishida, T.; Kitaura, K.; Pulay, P.; Nagase, S. J. Comput. Chem. 2007, 28, 1476. (30) Fedorov, D. G.; Kitaura, K. J. Chem. Phys. 2004, 120, 6832. (31) Fedorov, D. G.; Kitaura, K. J. Chem. Phys. 2005, 122, 134103. (32) Hirata, S.; Valiev, M.; Dupuis, M.; Xantheas, S. S.; Sugiki, S.; Sekino, H. Mol. Phys. 2005, 103, 2255. (33) Fedorov, D. G.; Kitaura, K. J. Chem. Phys. 2009, 131, 171106. (34) Morita, S.; Sakai, S. J. Comput. Chem. 2001, 22, 1107. (35) Sakai, S.; Morita, S. J. Phys. Chem. A 2005, 109, 8424. (36) Dahlke, E. E.; Truhlar, D. G. J. Chem. Theory Comput. 2007, 3, 46. (37) Li, W.; Li, S. J. Chem. Phys. 2004, 121, 6649. (38) Li, S.; Li, W.; Fang, T. J. Am. Chem. Soc. 2005, 127, 7215. (39) Jiang, N.; Ma, J.; Jiang, Y. J. Chem. Phys. 2006, 124, 114112. (40) Li, W.; Li, S.; Jiang, Y. J. Phys. Chem. A 2007, 111, 2193. (41) Li, S.; Li, W. Annu. Rep. Prog. Chem., Sect. C 2008, 104, 256. (42) Li, W.; Dong, H.; Li, S. Progress in Theoretical Chemistry and Physics; Springer-Verlag: Berlin and Heidelberg, Germany, 2008; Vol. 18, p 289. (43) Hua, W.; Fang, T.; Li, W.; Yu, J.; Li, S. J. Phys. Chem. A 2008, 112, 10864. (44) Dong, H.; Hua, S.; Li, S. J. Phys. Chem. A 2009, 113, 1335. (45) Li, W.; Hua, W.; Fang, T.; Li, S. In Computational Methods for Large Systems: Electronic Structure Approaches for Biotechnology and Nanotechnology; Reimers, J. R., Ed., 2010, in press. (46) Deev, V.; Collins, M. A. J. Chem. Phys. 2005, 122, 154102. (47) Collins, M. A.; Deev, V. J. Chem. Phys. 2006, 125, 104104. (48) Collins, M. A.; Heather, M. N. J. Chem. Phys. 2007, 127, 134113. (49) Collins, M. A. J. Chem. Phys. 2007, 127, 024104. (50) Mullin, J. M.; Roskop, L. B.; Pruitt, S. R.; Collins, M. A. J. Phys. Chem. A 2009, 113, 10040. (51) Collins, M. A.; Addicoat, M. A. J. Chem. Phys. 2009, 131, 104103. (52) Rˇeza´c´, J.; Salahub, D. J. Chem. Theory Comput. 2010, 6, 91. (53) Bettens, R. P. A.; Lee, A. M. J. Phys. Chem. A 2006, 110, 8777. (54) Lee, A. M.; Bettens, R. P. A. J. Phys. Chem. A 2007, 111, 5111. (55) Le, H. A.; Lee, A. M.; Bettens, R. P. A. J. Phys. Chem. A 2009, 113, 10527. (56) Ganesh, V.; Dongare, R. K.; Balanarayan, P.; Gadre, S. R. J. Chem. Phys. 2006, 125, 104109. (57) Gadre, S. R.; Ganesh, V. J. Theor. Comput. Chem. 2006, 5, 835. (58) Ganesh, V.; Kavathekar, R.; Rahalkar, A.; Gadre, S. R. J. Comput. Chem. 2008, 29, 488. (59) Kavathekar, R.; Khire, S.; Ganesh, V.; Rahalkar, A. P.; Gadre, S. R. J. Comput. Chem. 2009, 30, 1167. (60) Rahalkar, A. P.; Ganesh, V.; Gadre, S. R. J. Chem. Phys. 2008, 129, 234101. (61) Yeole, S. D.; Gadre, S. R. J. Chem. Phys. 2010, 132, 094102. (62) Foster, J. P.; Weinhold, F. J. Am. Chem. Soc. 1980, 102, 7211. (63) Reed, A. E.; Weinstock, R. B.; Weinhold, F. J. Chem. Phys. 1985, 83, 735. (64) The criteria for hydrogen bonds X-H · · · Y in our calculations is rH · · · Y e 2.9 Å, rX · · · Y e 3.5 Å, and X-H · · · Y g 120°. (65) Li, S.; Li, W.; Fang, T.; Ma, J.; Hua, W.; Hua, S.; Jiang, Y. LowScaling Quantum Chemistry (LSQC); Version 2.0.: Nanjing University: Nanjing, 2010.
8134
J. Phys. Chem. A, Vol. 114, No. 31, 2010
(66) Frisch, M. J.; Trucks, G. W.; Schlegel, H. B.; Scuseria, G. E.; Robb, M. A.; Cheeseman, J. R.; Montgomery, J. A., Jr.; Vreven, T.; Kudin, K. N.; Burant, J. C.; Millam, J. M.; Iyengar, S. S.; Tomasi, J.; Barone, V.; Mennucci, B.; Cossi, M.; Scalmani, G.; Rega, N.; Petersson, G. A.; Nakatsuji, H.; Hada, M.; Ehara, M.; Toyota, K.; Fukuda, R.; Hasegawa, J.; Ishida, M.; Nakajima, T.; Honda, Y.; Kitao, O.; Nakai, H.; Klene, M.; Li, X.; Knox, J. E.; Hratchian, H. P.; Cross, J. B.; Bakken, V.; Adamo, C.; Jaramillo, J.; Gomperts, R.; Stratmann, R. E.; Yazyev, O.; Austin, A. J.; Cammi, R.; Pomelli, C.; Ochterski, J. W.; Ayala, P. Y.; Morokuma, K.; Voth, G. A.; Salvador, P.; Dannenberg, J. J.; Zakrzewski, V. G.; Dapprich, S.; Daniels, A. D.; Strain, M. C.; Farkas, O.; Malick, D. K.; Rabuck, A. D.; Raghavachari, K.; Foresman, J. B.; Ortiz, J. V.; Cui, Q.; Baboul, A. G.; Clifford, S.; Cioslowski, J.; Stefanov, B. B.; Liu, G.; Liashenko, A.; Piskorz, P.; Komaromi, I.; Martin, R. L.; Fox, D. J.; Keith, T.; Al-Laham, M. A.; Peng, C. Y.; Nanayakkara, A.; Challacombe, M.; Gill, P. M. W.; Johnson,
Hua et al. B.; Chen, W.; Wong, M. W.; Gonzalez, C.; Pople, J. A. Gaussian 03, revision D.01; Gaussian, Inc.: Wallingford, CT, 2004. (67) http://ndbserver.rutgers.edu/. (68) http://www.rcsb.org/. (69) http://www.ccdc.cam.ac.uk/. (70) Case, D. A.; Darden, T. A.; Cheatham, T. E.; Simmerling, C. L.; Wang, J.; Duke, R. E.; Luo, R.; Merz, K. M., Jr.; Pearlman, D. A.; Crowley, M.; Walker, R. C.; Zhang, W.; Wang, B.; Hayik, S.; Roitberg, A. E.; Seabra, G.; Wong, K. F.; Paesani, F.; Wu, X.; Brozell, S.; Tsui, V.; Gohlke, H.; Yang, L.; Tan, C.; Mongan, J.; Hornak, V.; Cui, G.; Beroza, P.; Mathews, D. H.; Schafmeister, C. E. A. F.; Ross, W. S.; Kollman, P. A. AMBER 9; University of California: San Francisco, 2006.
JP103074F