libreta: Computerized Optimization and Code Synthesis for

Dec 14, 2017 - United States ... a fast evaluation of the four-index, two-electron repulsion ...... codes: (1) Randomly generate a large number of RR ...
0 downloads 0 Views 2MB Size
Article Cite This: J. Chem. Theory Comput. 2018, 14, 572−587

pubs.acs.org/JCTC

LIBRETA:

Computerized Optimization and Code Synthesis for Electron Repulsion Integral Evaluation

Jun Zhang* Department of Chemistry, University of Illinois at UrbanaChampaign, 600 South Mathews Avenue, Urbana, Illinois 61801, United States S Supporting Information *

ABSTRACT: A new library called LIBRETA for the evaluation of electron repulsion integrals (ERIs) over segmented and contracted Gaussian functions is developed. Our LIBRETA is optimized from three aspects: (1) The Obara−Saika, Dupuis− Rys−King, and McMurchie−Davidson method are all employed. The recurrence relations involved are optimized by tree-search for each combination of angular momenta, and in the best case, 50% of the intermediates can be eliminated to reduce the computational cost. (2) The optimized codes for recurrence relations are combined with different contraction orders, each of which is suitable for ERIs of different angular momenta and contraction patterns. In practice, LIBRETA will determine and use the best scheme to evaluate each ERI. (3) LIBRETA is also optimized at the coding level. For example, with common subexpression elimination and local memory access, the performance can be increased by about 6% and 20%, respectively. The performance was compared with LIBINT2. For both popular segmented and contracted basis sets, LIBRETA can be faster than LIBINT2 by 7.2−912.0%. For basis sets of heavy elements that contain Gaussian basis functions of large contraction degrees, the performance can be increased 20−30 times. We also tested the performance of LIBRETA in direct self-consistent field (SCF) calculations and compared it with NWCHEM. In most cases, the average time for one SCF iteration by LIBRETA is less than NWCHEM by 144.2−495.9%. Finally, we discuss the origin of redundancies occurring in the recurrence relations and derive an upper bound of the least number of intermediates required to be calculated in a McMurchie−Davidson recurrence, which is confirmed by ours as well as previous authors’ results. We expect that LIBRETA can become a useful tool for theoretical and computational chemists to develop their own algorithms rapidly.

1. INTRODUCTION The defining feature of an ab initio electronic structure method is the nonapproximated evaluation of molecular integrals over the exact Hamiltonian and basis functions, which also constitutes a major hurdle for its implementation and execution. Specifically, a fast evaluation of the four-index, two-electron repulsion integrals (ERIs) over Gaussian-type-orbital (GTO) atomic basis functions has been the primary concern for computational chemists even to this day.1−23 Many of the widely used evaluation methods are based on recurrence relations (RRs) of ERIs over GTOs with different angular momenta. The first such RR was derived by Boys1 in 1950 by considering the nuclear-coordinate derivatives of ERIs of the (ss|ss) type. In 1976, Dupuis, Rys, and King2,24 realized that an ERI could be expressed as a product of one-dimensional functions over a kernel function computable by a Gaussian quadrature with its grid points and weights determined by a RR. This led to the method known as the Dupuis−Rys−King (DRK) method, which is particularly efficient for GTOs with higher angular momenta with a short contraction length. In 1978, Pople and Hehre4 proposed the axis-switch method, which minimizes the computational cost by rotating the © 2017 American Chemical Society

coordinates to make many ERIs over primitive GTOs vanish. This is the first of the methods aiming at contracted GTOs and exhibits excellent performance for highly contracted s- and ptype GTOs, although it performs rather poorly elsewhere. Around the same time, McMurchie and Davidson3 found that a GTO could be expanded in Hermite−Gaussians (HGs), allowing an ERI to be calculated as a sum of more easily evaluable integrals over HGs. This method is known as the McMurchie−Davidson (MD) method. In 1986, Obara and Saika6,7 (OS) introduced a series of important RRs derived on the basis of the translational invariance of the ERIs. In 1988, Head-Gordon and Pople8 proposed dividing the OS RRs into Gaussian-exponentdependent (“vertical”) and -independent (“horizontal”) ones, with the latter applicable to contracted GTOs as well as to more numerous primitive GTOs, reducing the computational cost by permitting early contractions. Gill et al.9 extended this idea to the MD recursion and implemented the highly successful PRISM algorithm. For a given group (“shell”) of ERIs, PRISM Received: July 21, 2017 Published: December 14, 2017 572

DOI: 10.1021/acs.jctc.7b00788 J. Chem. Theory Comput. 2018, 14, 572−587

Article

Journal of Chemical Theory and Computation

consistent field (SCF) calculations, in the best case our LIBRETA was faster than NWCHEM55 by 495.9% in terms of the average wall time for one SCF iteration. The technical aspects of LIBRETA will be discussed in detail in this paper, which is organized as follows: first, we give the notations and methods used in the paper; second, we will discuss how to optimize the RRs and contraction paths and implement them efficiently by ACG on computers; finally, we examine the performance of LIBRETA and compare it with LIBINT2 and NWCHEM, and conclude our paper. In the Appendix, the origin of the redundancies in RRs is discussed and some theoretical results are derived.

estimates the evaluation costs of 20 distinct orders or “contraction paths” and selects, at runtime, the least expensive one to be used in actual ERI evaluations. More recent developments include the use of advanced hardware technologies such as the graphics processing units (GPUs),25−30 field programmable gate arrays (FPGAs),31 and specialized hardware.32,33 The performance of a software library for the ERI evaluation depends on at least three key factors: (1) an optimized implementation of the RRs; (2) a selection of the shortest contraction path; (3) a careful design of the program structure. The first two (mathematical optimization) minimize the nominal computational cost regardless of hardware architectures or compiler capabilities. The last (computer-science optimization) includes common subexpression elimination, consecutive memory access, complete loop unrolling, etc., and is critical for reducing actual wall time cost by further minimizing the numbers of floating-point operations (FLOPs) and memory operations (MOPs), although these optimizations tend to be automated by advances in hardware, operating-system, and compiler technologies. In this work, we implement a software library for ERIs with both advanced mathematical and computer-science optimizations by adopting the modern strategy of using domain-specific middleware (by which executable codes (software) are generated and optimized depending on the hardware) that performs automatic formula and code generation as well as optimization. The first example of automatic code generation (ACG) dates back to Jones’ work34−36 for the evaluation of molecular integrals over Slater-type orbitals, and since then, it saw applications in ab initio electron-correlation methods,37−44 exchange−correlation density functionals,45−47 optimizations of the RRs,11,14,20 a detection of their numerical instability,48 and the synthesis of an entire ERI library.49−54 One of the most notable examples is the LIBINT2 library54 of Valeev, which consists of efficient computerized implementations of a few of the contraction paths of the OS RRs. We aim to develop a code library by ACG that is efficient for the evaluation of all types of ERIs occurring in a typical quantum chemical calculation. This library, called LIBRETA, has been built according to the three factors we proposed above: (1) All the three classical methods, i.e., OS, DRK, and MD, were employed in LIBRETA, which are suitable for ERIs of different angular momenta. The RRs involved in the three methods were explored and highly optimized using tree-search for each angular momentum type. (2) For ERIs over contracted GTOs, an efficient contraction path is determined. Thus, in a quantum chemical calculation, e.g., a direct formation of the Fock matrix, LIBRETA can automatically determine and apply the best scheme for each ERI. (3) The codes were generated using carefully designed program structures. The impact of several technical aspects, including common subexpression elimination or the memory access order, on the performance of LIBRETA, were examined. The performance of LIBRETA was compared with LIBINT2 with popular basis sets: Pople’s split-valence basis sets, segmented Karlsruhe basis sets, and Dunning and Peterson’s general contracted correlation-consistent basis sets. Our LIBRETA turned out to show superior performance to LIBINT2: for example, for triple-ζ segmented and general contracted basis sets of SF6, LIBRETA was faster than LIBINT2 by 36.6% and 246.7%. For Au4, for which highly contracted basis functions were used, LIBRETA was 20−30 times faster. When it was used in direct self-

2. NOTATIONS The μth real primitive Cartesian GTO1 centered at A = (Ax, Ay, Az) with exponent αμ is defined as [aμ] ≡ Naμ(x − A x )ax (y − A y )ay (z − A z )az exp(− αμ|r − A|2 ) (1)

where Naμ is the normalization constant and a ≡ (ax, ay, az) are the Cartesian components of the angular momentum of this basis function. When the τ-component (τ = x, y, z) of a is increased (decreased) by 1, the obtained GTO is represented by [a ± 1τ]. Primitive Cartesian GTOs with the identical center A and angular momentum a are linearly combined into a contracted Cartesian GTO, K

(a) ≡ χa (r) ≡

∑ Dμ[aμ] μ=1 K

≡ (x − A x )ax (y − A y )ay (z − A z )az

∑ DμNaμ exp(−αμ|r − A|2 ) μ=1

(2)

where K and {Dμ} are the contraction degree and contraction coefficients, respectively. Throughout this paper, we associate square brackets [·] with primitive GTOs and parentheses (·) with contracted GTOs. The set of all Cartesian GTOs that differ only in the Cartesian components of the angular momentum ax, ay, az but share the same total angular momentum L = ax + ay + az form a Cartesian L-shell. There are NL = (L + 1)(L + 2)/2 Cartesian GTOs in an L-shell, which can furthermore be transformed to a fewer 2L + 1 spherical GTOs. GTO basis sets can be split into two groups depending on the contraction pattern: segmented or general. In a segmented basis set56 (such as 6-31G57), each primitive GTO appears only once in the definition of a contracted GTO. In a general-contraction basis set58 (such as ANO-RCC59), a primitive GTO is used in several contracted GTOs. We call a set of GTO shells contracted from the same K primitive GTO shells a block; in a segmented basis set, the block size B is equal to 1 for every contracted GTO, but in a general-contraction basis set, B > 1 for at least one GTO. Generally, K ≥ B is required to ensure the linear independence of the basis set. An ERI over contracted GTOs is defined as (ab|cd) ≡

∫ ∫ χa (r1) χb (r1) r1 χc (r2) χd (r2) dr1 dr2 12

(3)

This, in turn, is written as a contraction of ERIs over primitive GTOs: Ka

(ab|cd) =

Kb

Kc

Kd

∑ ∑ ∑ ∑ DμDνDρDτ [aμbν|cρd τ] μ=1 ν=1 ρ=1 τ=1

573

(4) DOI: 10.1021/acs.jctc.7b00788 J. Chem. Theory Comput. 2018, 14, 572−587

Article

Journal of Chemical Theory and Computation

referred to a review16 or monograph.60 The most important feature of these RRs is that there are two distinct sets of RRs: the one (“vertical”) that depends on individual exponents of GTOs and thus can only be applied to primitive ERIs, and the one (“horizontal”) that does not depend on exponents and can be used for both primitive and (partially or fully) contracted ERIs. Because primitive ERIs are far more numerous (when contraction lengths are high) than contracted ERIs, it is often advantageous to contract one or more of the four GTOs involved as early as possible, after which only the horizontal RRs can be used. Below, we succinctly summarize the pertinent aspects of each RR. 3.1. Obara−Saika Method. The RRs underlying this method were derived by Obara and Saika6,7 from the translational invariance of ERIs. Head-Gordon and Pople8 made an essential rearrangement of the RRs into the vertical and horizontal RRs (VRRs and HRRs), which are implemented in LIBRETA: 1. VRR: calculate [m0|n0]0 for Lm = La to La + Lb and Ln = Lc to Lc + Ld by

A batch of ERIs is the set of ERIs over all contracted GTOs that belong to a block. In many ERI libraries (including LIBRETA), an entire batch of contracted ERIs is generated at once to avoid redundant calculations of common intermediates. A batch of (dd|dd) with B = 4 (meaning that four d-type Cartesian GTO shells are contracted from an unspecified number of primitive dtype GTO shells) contains 44 × ((2 + 1)(2 + 2)/2)4 = 331 776 individual contracted Cartesian ERIs. LIBRETA computes all ERIs (and other one-electron integrals) in Cartesian GTO basis. When integrals in spherical GTO basis are required, we will carry out corresponding transformations at suitable steps (see subsection 3.1). The 8-fold permutation symmetry of ERIs over real GTOs is taken into account during the Fock build at a batch level. Let the centers, exponents, and angular momenta of primitive GTOs a, b, c, d (where we have dropped μ, etc. for brevity) be A, B, C, D, α, β, γ, δ, and La, Lb, Lc, Ld, respectively. Let the total angular momentum of the ERI [ab|cd] be L = La + Lb + Lc + Ld. For the bra and ket component, [ab| and |cd], the following quantities are defined: ξ=α+β

P=

η=γ+δ

Q=

α A + βB ξ γC + δD η

i y αβ Kab = expjjjj− RAB2zzzz α + β k { (5)

[m + 1x , 0|n0]N = XPA[m0|n0]N

m i η XPQ [m0|n0]N + 1 + x jjjj[m − 1x , 0|n0]N ξ+η 2ξ k y η − [m − 1x , 0|n0]N + 1 zzzz ξ+η { nx N+1 + [m0|n − 1x , 0] 2(ξ + η) (10)

i y γδ Kcd = expjjjj− R CD 2zzzz k γ+δ { (6)



and θ=

ξη ξ+η

[m0|n + 1x , 0]N = XQC[m0|n0]N

n i ξ XPQ [m0|n0]N + 1 + x jjjj[m0|n − 1x , 0]N 2η k ξ+η y ξ − [m0|n − 1x , 0]N + 1 zzzz ξ+η { mx N+1 + [m − 1x , 0|n0] 2(ξ + η) (11)

(7)

where RAB ≡ |A − B|, and so on. We also define XAB ≡ Ax − Bx, YAB ≡ Ay − By, ZAB ≡ Az − Bz, and so on. With these, quantity [00|00]N (0 ≤ N ≤ L) can be defined: [00|00]N =

2π 5/2 KabKcdFN (θRPQ 2) ξη(ξ + η)1/2

+

(8)

where FN(x) is the so-called Boys function given by FN (x) =

∫0

1

t 2N exp( −xt 2) dt

and their analogues for y and z components. These VRRs start from (8) with N = L. They have to be carried out for each primitive GTO quartet. 2. Bra and ket contraction: form (m0|n0) by contractions of [m0|n0]0. 3. Bra and ket HRR: obtain the target ERIs by

(9)

The evaluation of the Boys function is given in the Supporting Information.

3. FORMALISMS For a given batch of ERIs, LIBRETA will determine the best method from all three of the classical RRs: OS, DRK, and MD. For the derivation and detail analyses of these RRs, the reader is

(a , b + 1x |cd) = (a + 1x , b|cd) + XAB(ab|cd) (12)

(ab|c , d + 1x ) = (ab|c + 1x , d) + XCD(ab|cd) (13)

and their analogues for y and z components. We will use (ab|-HRR (or (LaLb|-HRR) to stand for the subroutine of evaluating (ab|ef) from (a0|ef), ..., (a + b, 0|ef) (e and f are arbitrary GTOs). For |cd)-HRR (or |LcLd)-HRR), a similar interpretation applies. For an ERI over spherical GTOs, it is possible to perform an evaluation over Cartesian GTOs followed by a Cartesianspherical transformation (CST). However, taking the advantage of HRRs being independent of exponents, it is beneficial to separate this CST into two partial ones before and after the last HRR (13):

Figure 1. All MD-CXXXX methods. From [0]N to (ab|cd), there are four contraction paths: CCTTT, CTCTT, CTTCT, and CTTTC. Switching the order of bra and ket contraction gives another four. 574

DOI: 10.1021/acs.jctc.7b00788 J. Chem. Theory Comput. 2018, 14, 572−587

Article

Journal of Chemical Theory and Computation

i y X 2 η ImX+ 1, n(sκ 2) = jjjjXPA − XPQ sκ 2zzzzImn (sκ ) ξ+η k { y η m ijj sκ 2zzzzImX− 1, n(sκ 2) + jj1 − 2ξ k ξ+η {

CST HRR (m0|n0)CRT ⎯⎯⎯⎯⎯⎯→ CRT(ab|n0)CRT ⎯⎯⎯⎯⎯→ sph(ab|n0)CRT CST HRR ⎯⎯⎯⎯⎯⎯→ sph(ab|cd)CRT ⎯⎯⎯⎯⎯→ sph(ab|cd)sph

CRT

(14)

Let the four GTOs in the ERI have the same angular momentum L. We can see that the posterior CST scales as (NL4 + (2L + 1)2NL2)·L ≈ L9 + L7 whereas the two-partial-CST scales as (NL3·L + (2L + 1)2NL2)·L ≈ L8 + L7). More importantly, the cost of the last HRR can be significantly decreased: In the posterior CST scheme, (13) has to be carried out for NL2 ≈ L4 components of CRT(ab|; in the two-partial-CST scheme, this number reduces to (2L + 1)2 ≈ L2 components of sph(ab|, making this scheme preferred in LIBRETA. 3.2. Dupuis−Rys−King Method. We have seen that [ab| cd] is calculated by RRs from a one-dimensional integral (8), revealing that [ab|cd] =

∫0

1

2

+

+

1

[m0|n0] =

∑ wκg(sκ 2) κ=1

∫0

R mλ(x) R nλ(x) exp( −λx 2) dx = δmn

×

x x

y y

z z

(21)

Then, (ab|cd) is calculated by applying contractions and HRRs on [m0|n0]. When the OS method is compared with the DRK method, the memory requirement for the former becomes more demanding as L increases. For example, for (gg|gg), OS-VRR and DRK-VRR requires storing 136 125 and 21 286 intermediates, respectively, by LIBRETA (For the memory requirement of other types of ERI, please refer to the Supporting Information). For ERIs of low and high L, it is better to use the OS and DRK method, respectively. 3.3. McMurchie−Davidson Method. In the MD method, an ERI is transformed to integrals over Hermite−Gaussians (HGs), which are easier to calculate than the original one.3 HGs are defined as

(17)

(18)

There are two ways of implementing the DRK method:62 (1) Obtain [ab|cd] directly by a five-term RR on IXaxbxcxdx, etc., followed by contractions to (ab|cd). (2) Calculate [m0|n0] using Gaussian quadrature; thus in (18), IXaxbxcxdx becomes IXmx0nx0 ≡ IXmn, etc., which can be obtained by a three-term RR. Then perform contractions and HRRs (which were referred to as “transfer equations” in ref 62) to obtain (ab|cd). Consider the most expensive step of the two ways, i.e., forming [ab|cd] or [m0|n0]. Let the four GTOs in the ERI have the same angular momentum L and contraction degree K. The cost of this step in the former way scales as G·NL4·K4 ≈ L9K4 whereas in the latter way scales as G·L2·NL2·K4 ≈ L7K4. Therefore, the latter way shifts a large amount of work from the primitive GTO quartet loop.10,21 In LIBRETA, we will use the second way. The evaluation of sκ and wκ is given in the Supporting Information. The DRK method calculates [m0|n0] by the following steps: 1.

∑ wκImX n (sκ 2) ImY n (sκ 2) ImZ n (sκ 2) κ=1

The integral kernel can be proved to be separated into x, y, and z components: g (sκ 2) = IaXxbxcxdx(sκ 2) IaYybycydy(sκ 2) IaZzbzczdz(sκ 2)

2π 5/2 KabKcd ξη(ξ + η)1/2 G

(16)

In (16), G = L/2 + 1; sκ and wκ are the quadrature roots and weights of an R-type Rys polynomial2 RλG(x), respectively. Rλm(x) is a 2m-degree polynomial, forming a set of orthogonal polynomials: 1

(20)

start from IX00 = IY00 = IZ00 = 1. 2. Gaussian quadrature: calculate [m0|n0] by

G

g (t 2) exp( −λt 2) dt =

msκ 2 ImX− 1, n(sκ 2) 2(ξ + η)

and their analogues for y and z components. These RRs

(15)

where λ = θRPQ2 and the integral kernel g(t2) is a 2L-degree polynomial in t. Thus, this integral can be calculated by standard Gaussian quadrature theory:61

∫0

(19)

i y X 2 ξ ImX, n + 1(sκ 2) = jjjjXQC + XPQ sκ 2zzzzImn (sκ ) ξ η + k { y n ijj ξ sκ 2zzzzImX, n − 1(sκ 2) + jj1 − 2η k ξ+η {

2

g (t ) exp( −λt ) dt

nsκ 2 ImX, n − 1(sκ 2) 2(ξ + η)

ij ∂ yz z w ≡ jjj j ∂X zzz k P{

wx

ij ∂ yz jj z jj ∂Y zzz k P{

wy

ij ∂ yz z 2 jj z jj ∂Z zzz exp( −θRPQ ) k P{ w

(22)

We denote the product of two GTOs and one HG as abp. In this paper, all HGs are represented by w, p, and q. A straightforward way of implementing the MD method is first to calculate the integrals over HGs [00p|00q] by [w + 1x ]N = XPQ [w]N + 1 + wx[w − 1x ]N + 1

(23)

[00p|00q] = ( −1)qx + qy + qz [p + q]0

(24)

and then transform [00p|00q] to [ab0|cd0] ≡ [ab|cd], which can be contracted to the target ERIs. Here, (24) converts the derivatives with respective to P along in [p + q]0 to those with respective to P and Q. For example, the x-component of (24) is

VRR: calculate IXmn, IYmn, and IZmn for m = La to La + Lb, n = Lc to Lc + Ld, and κ = 1 to G by 575

DOI: 10.1021/acs.jctc.7b00788 J. Chem. Theory Comput. 2018, 14, 572−587

Article

Journal of Chemical Theory and Computation p ij ∂ yz x jij ∂ zyz x zz jj z exp(−θXPQ 2) pxq x ≡ jjj j ∂X zz jj ∂X zzz k P{ k Q { q

i ∂ yz x z = ( −1) jj j ∂X zzz P k { qx j j

and their analogues for y and z components. After applying similar RRs to the ket component, obtain the target ERIs: (ab|cd) ≡ 000(ab0|cd0)000. The MD-CTTTC method in LIBRETA contains the following steps: 1. Ket contraction: calculate the fundamental quantities for N = 0 to L:

p + qx

exp( −θXPQ 2)

≡ ( −1)qx (p + q)x

(25)

The RR (23) starts with (0)cN′ d ′ q′ ≡

2π 5/2 [0] = K K ( −2θ )N FN (θRPQ 2) 1/2 ab cd ξη(ξ + η) N

Kd

a′

∑ ∑ ∑ ∑ DμDνDρDτ

c′

b′

(2ζμν) p′

[00p|00q)c ′ d ′ q′ = ( −1)qx + qy + qz [p + q)c0′ d ′ q′

(2ηρτ )q′

+

= −XAB 011(000|000)000 + = 0 001(1)000

(29)

and their analogues for y and z components. This is the exponent-independent variant of (23). 3. Bra and ket HRRs: for the bra component, the HGs are transformed to 000(ab0| by

p + 1x |

b + 1x , p| = px a ′ b ′ p′(ab , p − 1x | a ′ b ′ (p ′+ 1)(ab ,

−XAB 011(0)0000

+

001(001|000)000

0 001(1)000

(37)

= +XAB102(0)1000 − XCD 001(0)1101 + XBD 001(0)1000 (38)

+ 1x , bp| = px a ′ b ′ p′(ab , p − 1x |

+ XAB(a ′+ 1)b ′ (p ′+ 1)(abp| +

(36)

Thus, the fundamental quantities required in (27) are 011(0)0000, 1 1 1 102(0)000, 001(0)101, and 001(0)000. For ERIs over high angular momenta, the number of additional indices becomes very large, making the early contraction strategy less favorable, and the OS or DRK method are preferred. In LIBRETA the MD-CXXXX methods are applied for ERIs over highly contracted s and p GTOs. 3.4. Integral Screening. For molecules of extended geometry, a large number of ERIs may be very small in magnitude, thus screening them in advance can significantly reduce the computational cost. The classical Schwarz inequality63 can efficiently screen the ERIs with small GTO overlaps, but not for the ones with distant bra and ket pair. To account for

(30) a ′ b ′ p′(a ,

1 [ab , p + 1x | 2ζ

(ps|ss) ≡000 (100|000)000

(28)

a ′ b ′ (p ′+ 1)(ab ,

(35)

4. Ket HRR: transform |00q)c′d′q′ to |cd0)000 by (30) and (31). 5. Bra contraction: the target ERIs are obtained by the contraction of [ab|cd) ≡ [ab|cd0)000. The additional indices a′, b′, p′, c′, d′, q′ in these equations indicate the powers of exponents and can be determined backforwardly from the RRs. For example, to calculate (ps|ss) by MD-CCTTT, according to (28) and (31):

= ( −1)qx + qy + qz a ′ b ′ p′(p + q)c0′ d ′ q′

− XAB a ′ (b ′+ 1)(p ′+ 1)(abp| +

1 [ab , p + 1x | 2ζ

[0]N

+1 N+1 − XCD a ′ b ′ p′(w)(Nc ′+ 1)d ′ (q ′+ 1) + XBD a ′ b ′ p′(w)c ′ d ′ q′

a ′ b ′ p′(a

(34)

[ab + 1x , p| = px [a + 1x , b , p − 1x | + XPB[abp|

+ 1x )cN′ d ′ q′ = XAB(a ′+ 1)b ′ (p ′+ 1)(w)cN′ d+′1q′

a ′ b ′ p′(00p|00q)c ′ d ′ q′

(33)

[a + 1x , bp| = px [a + 1x , b , p − 1x | + XPA[abp|

The a′, b′, p′, c′, d′, q′ are integers that will be discussed later. 2. VRRs: calculate the integrals over HGs a′b′p′(00p|00q)c′d′q′ for Lp = 0 to La + Lb and Lq = 0 to Lc + Ld, respectively, by

+ wx a ′ b ′ p′(w − 1x )cN′ d+′1q′

(32)

3. Bra HRR: transform [00p| to [ab0| by

(27)

a ′ b ′ p′(w

(2ηρτ )

[0]N

+ wx[w − 1x )cN +d 1q ′ ′′

d′

(2αμ) (2βν ) (2γρ) (2δτ)

μ=1 ν=1 ρ=1 τ=1

q′

+1 [w + 1x )cN′ d ′ q′ = XPD[w)cN′ d+′1q′ − XCD[w)(Nc ′+ 1)d ′ (q ′+ 1)

+ Kc

(2γρ)c′(2δτ )d′

2. VRR: calculate [00p|00q) for Lp = 0 to La + Lb and Lq = 0 to Lc + Ld, respectively, by

N a ′ b ′ p′(0)c ′ d ′ q′



∑ DρDτ

ρ=1 τ=1

1. Bra and ket contractions: calculate the fundamental quantities for N = 0 to L: Kb



Kd

(26)

Using the terminology from the PRISM algorithm,12 the bra and ket contractions are referred to as two C’s; the transformation from [0]N to [00p|00q], [00p| to [ab0|, and |00q] to | cd0] are referred to as three T’s. With suitable modifications of (22) to (24), the target ERI can be calculated in arbitrary orders of C’s and T’s, resulting in 20 contraction paths, like MDTCCTT, MD-CTCTT, etc.12 We have only implemented the paths starting with “C”, because other paths do not give any significant advantages over the OS and DRK methods in LIBRETA. There are eight different MD-CXXXX methods, shown in Figure 1. We will describe the method using MD-CCTTT and MDCTTTC as examples. For a complete description please refer to ref 12. The MD-CCTTT method in LIBRETA is given below:

Ka

Kc

p + 1x | (31) 576

DOI: 10.1021/acs.jctc.7b00788 J. Chem. Theory Comput. 2018, 14, 572−587

Article

Journal of Chemical Theory and Computation

Figure 2. Development of LIBRETA.

both, we implemented the so-called QQR screening developed by Maurer et al.64 in our program. An ERI is estimated as ˆ (ab|cd) ≈ (ab|cd) l Q abQ cd o o o o o R − ext′ − ext′ R − ext′ab − ext′cd > 0 =m ab cd o o o o oQ abQ cd R − ext′ab − ext′cd ≤ 0 n (39)

where Q ab ≡ (ab|ab) and Q cd ≡ (cd|cd) . ext′ is the wellseparatedness extent of a GTO pair,65 the definition of which is given in the Supporting Information. Thus, during the evaluation of ERIs, given a threshold τ, an ERI that is estimated ˆ to satisfy (ab|cd) < τ will not be calculated.

4. DEVELOPMENT OF LIBRETA The development of LIBRETA using the formulas in the last section is shown in Figure 2. It will be discussed from three

Figure 3. One possible DRK-VRR diagram for (dp|ps). For example, “10 and 20 connecting 21” means that, according to (20), I21 = (·)I20 + (·)I10. For all such diagrams in this paper, this interpretation applies.

aspects, which correspond to our previously proposed three factors of building an efficient ERI library: (1) optimization of the RR codes; (2) optimization of contraction paths; (3) design of the program structure. 4.1. Optimization of Recurrence Relations. Manually written codes for the implementation of RRs usually contain loops over the indices like N or a. Using ACG, loops can be unrolled; i.e., each step of the RRs can be written down explicitly. The efficiency of the ACG codes may be increased by exploring these RRs. First, we examine the DRK-VRRs (19) and (20). For (dp|ps), the DRK-VRR step has to calculate [fs|ps] and [ds|ps]; thus the range of m and n in (19) and (20) is 0−3 and 0−1, respectively.

Figure 4. One possible MD-VRR diagram for (dp|ps). The redundant intermediates are rendered gray. In this figure, [0]N’s (and also [00| 00]N’s in Figure 5, see below) are calculated independently from the Boys function (9). But to make the diagrams compact (one root node), they are formally connected according to the downward recursion of the Boys function: FN(x) = (2xFN+1(x) − exp(−x))/(2N + 1).

577

DOI: 10.1021/acs.jctc.7b00788 J. Chem. Theory Comput. 2018, 14, 572−587

Article

Journal of Chemical Theory and Computation

Figure 5. One possible OS-VRR diagram for (ps|ps). The redundant intermediates are rendered gray.

Table 1. Optimization of MD-VRR Diagramsa type

nonopt

opt

s p d f g h i k l

1 5 15 35 70 126 210 330 495

1 5 15 33 63 108 173 261 381

a

The numbers are the number of intermediates that have to be calculated.

Table 2. Optimization of OS HRR Diagramsa Figure 6. Number of remaining intermediates for 3000 randomly generated OS-VRR diagrams of (dd|dd) after eliminating redundancies.

One possible DRK-VRR diagram for (dp|ps) is shown in Figure 3. The DRK-VRR step can be applied in various ways, but the total number of the intermediates (Imn) is always 8. LIBRETA will use the RRs of the least FLOPs. For example, between I21 = (·)I20 + (·)I10 or I21 = (·)I11 + (·)I01 + (·)I10, the former expression will be chosen. Generally, for each type of the angular momenta, ACG will generate codes through a DRK-VRR diagram containing the least FLOPs. Next, we examine the MD-VRRs (23), (28), and (33). Because the additional indices a′, b′, p′, c′, d′, q′ do not affect the recurrence structure, we only consider (23) here. For (dp|ps), the MD-VRR step has to calculate [g], [f ], [d], and [p]; thus the range of Lr is 0−4. A possible MD-VRR diagram is shown in Figure 4. However, unlike the case of the DRK-VRR step, seven intermediates that will never be used again are calculated; i.e., they are redundant for the ERI evaluation. This redundancy problem has been realized for a long time in the context of the MD method.11,14,16

type

nonopt

opt

ss ps pp ds dp dd fs fp fd ff gs gp gd gf gg

0 0 9 0 18 84 0 30 135 388 0 45 198 558 1269

0 0 9 0 18 79 0 30 129 334 0 45 191 488 1019

a

The numbers are the number of intermediates that have to be calculated.

For the OS-VRRs (10) and (11), the redundancy problem is more complex, because they depend on two angular momenta, m and n. A possible OS-VRR diagram of (ps|ps) is shown in

Figure 7. Optimization of a RR digram in LIBRETA by ACG. 578

DOI: 10.1021/acs.jctc.7b00788 J. Chem. Theory Comput. 2018, 14, 572−587

Article

Journal of Chemical Theory and Computation Table 3. Optimization of OS-VRR Diagramsa type

nonopt

opt

type

nonopt

opt

type

nonopt

opt

(ss|ss) (ps|ss) (ps|ps) (pp|ss) (pp|ps) (pp|pp) (ds|ss) (ds|ps) (ds|pp) (ds|ds) (dp|ss) (dp|ps) (dp|pp) (dp|ds) (dp|dp) (dd|ss) (dd|ps) (dd|pp) (dd|ds) (dd|dp) (dd|dd) (fs|ss) (fs|ps) (fs|pp) (fs|ds) (fs|dp) (fs|dd) (fs|fs) (f p|ss) (f p|ps) (f p|pp) (f p|ds) (f p|dp) (f p|dd) (f p|fs) (f p|fp) (fd|ss) (fd|ps) (fd|pp) (fd|ds)

1 5 24 15 70 200 15 70 200 200 35 160 450 450 1000 70 315 875 875 1925 3675 35 160 450 450 1000 1925 1000 70 315 875 875 1925 3675 1925 3675 126 560 1540 1540

1 5 18 15 52 146 15 43 116 116 33 101 280 258 611 63 191 514 529 1188 2220 31 81 209 209 469 929 455 61 171 449 449 1018 2080 887 1910 106 300 829 837

(fd|dp) (fd|dd) (fd|fs) (fd|f p) (fd|fd) (f f|ss) (f f|ps) (f f|pp) (f f|ds) (f f|dp) (f f|dd) (f f|fs) (f f|fp) (f f|fd) (f f|ff) (gs|ss) (gs|ps) (gs|pp) (gs|ds) (gs|dp) (gs|dd) (gs|fs) (gs|f p) (gs|fd) (gs|f f) (gs|gs) (gp|ss) (gp|ps) (gp|pp) (gp|ds) (gp|dp) (gp|dd) (gp|fs) (gp|fp) (gp|fd) (gp|ff) (gp|gs) (gp|gp) (gd|ss) (gd|ps)

3360 6370 3360 6370 10976 210 924 2520 2520 5460 10290 5460 10290 17640 28224 70 315 875 875 1925 3675 1925 3675 6370 10290 3675 126 560 1540 1540 3360 6370 3360 6370 10976 17640 6370 10976 210 924

1916 3695 1565 3377 5894 171 477 1298 1286 2981 5833 2860 5552 9689 15153 57 136 336 336 769 1622 768 1575 2936 5122 1378 102 265 702 675 1664 3313 1605 3113 5835 9121 2451 4938 167 442

(gd|pp) (gd|ds) (gd|dp) (gd|dd) (gd|fs) (gd|fp) (gd|fd) (gd|ff) (gd|gs) (gd|gp) (gd|gd) (gf|ss) (gf|ps) (gf|pp) (gf|ds) (gf|dp) (gf|dd) (gf|fs) (gf|fp) (gf|fd) (gf|ff) (gf|gs) (gf|gp) (gf|gd) (gf|gf) (gg|ss) (gg|ps) (gg|pp) (gg|ds) (gg|dp) (gg|dd) (gg|fs) (gg|fp) (gg|fd) (gg|ff) (gg|gs) (gg|gp) (gg|gd) (gg|gf) (gg|gg)

2520 2520 5460 10290 5460 10290 17640 28224 10290 17640 28224 330 1440 3900 3900 8400 15750 8400 15750 26880 42840 15750 26880 42840 64800 495 2145 5775 5775 12375 23100 12375 23100 39270 62370 23100 39270 62370 94050 136125

1132 1209 2633 5755 2699 5186 9493 15048 4046 8153 13571 258 677 1794 1821 4022 8875 3990 8385 14876 23457 6311 12719 21162 31903 382 981 2588 2880 6462 13084 6159 11633 22171 34977 9413 18977 31567 47570 67374

a

The numbers are the number of intermediates that have to be calculated.

computational ability. Each step of a diagram is carried out using the index that can minimize the FLOPs. For example, to calculate [(2,1,0)0|(0,0,0)0]0, LIBRETA will apply the RR with my rather than with mx because the former leads to only two nonzero terms in (10). One is randomly selected if multiple indices satisfy this condition. (2) For each diagram, it is treated as a tree (Figure 7): a vertex stands for an intermediate, and its incoming edges emit from the intermediates that produce it by RRs. Now, every vertex is examined and eliminated if it has zero outdegree and does not correspond to the target quantity in this ERI evaluation step. This process is repeated until no vertices are eliminated. Now all the diagrams are redundancy-free. (3) The diagram with the least remaining intermediates is selected and used to generate the codes. For all the RRs used in LIBRETA, the optimal diagram has been searched and optimized, and the results are shown in Tables 1, 2, and 3, respectively. The optimization is quite effective. For MDVRRs, our optimized results are similar to those of Johnson et

Figure 5. Of the 24 intermediates, 6 are redundant. Note that in a manually written loop, it is not trivial to identify whether an intermediate is redundant or not because this depends on the entire RR diagram. This is demonstrated in Figure 6: for 3000 randomly generated OS-VRR diagrams of (dd|dd), each one contains 3675 intermediates. After elimination of redundancies, the number of remaining intermediates ranges rather densely from 2273 to 3037. For HRRs, the redundancy problem is similar. In LIBRETA, ACG is used to find the RR diagram that has the largest number of redundancies. Thus, after elimination of them, this diagram becomes the most efficient one. In the Appendix, we discuss the origin of the redundancies these RRs. In LIBRETA, given a RR (say OS-VRR) and an ERI type (say (f f|f f)), ACG uses a tree-search algorithm to optimize the RR codes: (1) Randomly generate a large number of RR diagrams. Note that because the number of all possible diagrams increases exponentially with L, a complete search is beyond our 579

DOI: 10.1021/acs.jctc.7b00788 J. Chem. Theory Comput. 2018, 14, 572−587

Article

Journal of Chemical Theory and Computation

(gs|dd), (gs|f p), and (gs|gs). With these optimized digrams, we can straightforwardly generate the evaluation codes for the OS, DRK, and MD RRs by interpreting the diagrams (see the caption in Figure 3). 4.2. Selection of Contraction Paths. An optimal implementation of contractions can significantly improve the evaluation efficiency. The contraction problem has been discussed in the famous PRISM algorithm12 for segmented contraction. In LIBRETA, we have generalized it to both segmented and general contracted basis sets. For an ERI (ab| cd), we assume at the moment that four GTOs have the same contraction degree K and general contracted block size B. The cost estimation expression has the form

Table 4. Program Structure for the OS/DRK Method For each bra exponent-pair For each ket exponent-pair Calculate [m0|n0] For each ket block Contract [m0|n0] to [m0|n0) Next ket block Next ket exponent-pair For each ket block // If any operations here For each bra block Contract [m0|n0) to (m0|n0) Next bra block Next ket block Next bra exponent-pair For each bra block For each ket block Transform (m0|n0) to (ab|n0) Transform (ab|n0) to (ab|cd) Next ket block Next bra block

// K4 // K4B2

// K2B2 // K2B4

xK 4 + yK 4B2 + zK 2B4 + uK 2B2 + vB4

To understand this expression, one can refer to the program structure in Table 4: K4 term is contributed from the calculations for all primitive GTO quartets; K4B2 and K2B4 from the first and second contraction, respectively; K2B2 and B4 from the transformations before and after the second contraction, respectively. The prefactors x, y, z, u, v of (pp|pp) are given in Table 5. As described in section 3, for the OS and DRK method, we perform contractions immediately after VRRs. Assume the angular momenta of GTOs in (ab|cd) are all L, the cost of contractions immediately after VRRs (forming (L0|L0] ∼ (2L, 0|2L, 0] to (L0|L0) ∼ (2L, 0|2L, 0)) scales as L2·NL2(K4B2 + K2B4) ≈ L6(K4B2 + K2B4). In contrast, the cost of contractions at last (forming [LL|LL] to (LL|LL)) is estimated to be NL4(K4B2 + K2B4) ≈ L8(K4B2 + K2B4), which confirms that performing contractions at the last step is almost always more expensive. This is also revealed in the PRISM algorithm12 where MDXXXXC paths are often used for ERIs where Kbra or Kket is 1 (in this case there is no contraction at all). In the MD-CXXXX methods, we can even carry out contractions at the very beginning of the evaluation subroutine. Although this reduces the amount of work in contractions, the increasing z, u, v in Table 5 suggest that the additional indices in (28) or (32) impose an extra burden on the following transformations. Therefore, when the ERI contains low contracted parts, late contractions are preferred. For ERIs over s and p GTOs, LIBRETA will calculate the cost of all methods using x, y, z, u, v and L, K, B, then selects the most efficient one for its evaluation. For ERIs of higher total angular momentum L, the choice of the OS or DRK method is more subtle. In Table 6, the computational cost (in terms of x in (40)) and memory requirement of the two methods for some types of ERIs are given. For (pp|pp) or (gs|ss), the OS method has better

// B4 // B4

Table 5. Prefactors in the Cost Estimation Expressions for (pp|pp) scheme

x

y

z

u

v

OS/DRK MD-CCTTT MD-CTCTT MD-CTTCT MD-CTTTC

630/1008 122 0 0 0

162 305 106 106 106

162 305 262 630 81

0 0 732 1698 2130

324 3370 1104 432 0

(40)

al.11 and satisfy the theoretical upper bound we derive (Appendix). For HRRs, 3.5−19.7% of the intermediates can be removed. This is rather considerable savings because each HRR step has to be repeated several times during the evaluation of an ERI. For example, to calculate (gg|f f) from (gg|fs), (gg|gs), (gg|hs), and (gg|is), we have to perform the |f f)-HRR for 225 components of (gg|; therefore, according to Table 2, we can save the computational cost of (388 − 334) × 225 = 12 150 intermediates. For OS-VRRs, the optimization is more effective. For ERIs of high L like ( fd|dd) or (gd|ds), more than 50% of the intermediates can be eliminated. We can see that the ERI types with the same La + Lb and Lc + Ld, like (dd|dd), (f p|f p), (gs| dd), (gs|f p), and (gs|gs), have the same number of intermediates in their nonoptimized diagrams, and the larger La − Lb or Lc − Ld is, the more redundancies there are; thus the number of intermediates in their optimized diagrams decreases in the order

Table 6. Computational Cost and Resource Required for the OS and DRK Methods in LIBRETA memorya

x (pp|pp) (dd|dd) (f p|f p) (gs|ss) (gs|fd) (gd|gd) (gg|gg)

wall timeb (unit: s)

OS

DRK

OS

DRK

OS

DRK

Ntest

630 11750 10222 235 17920 80569 425588

1008 18815 12860 252 14865 120169 769347

200 3675 3675 70 6370 28224 136125

114 1046 710 36 790 4257 21286

21.4 21.8 20.3 8.6 15.5 16.2 33.6

29.7 13.6 10.5 13.5 12.9 16.3 20.0

1000000 50000 60000 1000000 15000 5000 1000

a

Number of intermediates that have to be stored. bThe time was measured by Ntest calculations. 580

DOI: 10.1021/acs.jctc.7b00788 J. Chem. Theory Comput. 2018, 14, 572−587

Article

Journal of Chemical Theory and Computation Table 7. Preferred Method between the OS and DRK Method for Each Type of ERI (ss|ss) (ps|ss) (ps|ps) (pp|ss) (pp|ps) (pp|pp) (ds|ss) (ds|ps) (ds|pp) (ds|ds) (dp|ss) (dp|ps) (dp|pp) (dp|ds) (dp|dp) (dd|ss) (dd|ps) (dd|pp) (dd|ds) (dd|dp) (dd|dd) ( fs|ss) ( fs|ps) ( fs|pp)

OS OS OS OS OS OS OS OS OS OS OS OS OS OS OS OS OS OS OS DRK DRK OS OS OS

(fs|ds) (fs|dp) (fs|dd) (fs|fs) (f p|ss) (f p|ps) (f p|pp) (f p|ds) (f p|dp) (f p|dd) (f p|fs) (f p|fp) (fd|ss) (fd|ps) (fd|pp) (fd|ds) (fd|dp) (fd|dd) (fd|fs) (fd|fp) (fd|fd) (f f|ss) (f f|ps) (f f|pp)

OS OS DRK OS OS OS OS OS DRK DRK DRK DRK OS OS OS OS DRK DRK DRK DRK OS OS OS DRK

(f f |ds) (f f |dp) (f f |dd) (f f |fs) (f f |fp) (f f |fd) (f f |ff) (gs|ss) (gs|ps) (gs|pp) (gs|ds) (gs|dp) (gs|dd) (gs|fs) (gs|f p) (gs|fd) (gs|f f) (gs|gs) (gp|ss) (gp|ps) (gp|pp) (gp|ds) (gp|dp) (gp|dd)

DRK DRK OS DRK DRK OS OS OS OS DRK DRK DRK DRK DRK DRK DRK DRK DRK OS OS DRK DRK DRK DRK

Table 8. Impact of Two Bra HRR Memory Access Patterns on the Performance of ERI Evaluation Codes wall timeb (unit: s)

type

pattern A

pattern B

pattern A

pattern B

(f f |f f) (gf |f f) (gf |gf) (gg|f f) (gg|gf) (gg|gg)

3.3 3.8 4.0 5.4 5.4 5.7

3.4 3.4 3.7 3.8 4.0 3.8

4.3 6.1 8.8 11.7 15.6 27.1

3.8 5.3 7.7 9.1 12.2 21.1

DRK DRK DRK OS DRK DRK OS OS DRK DRK DRK OS DRK DRK OS DRK DRK OS OS OS OS DRK DRK DRK

(gf|dd) (gf|fs) (gf|fp) (gf|fd) (gf|ff) (gf|gs) (gf|gp) (gf|gd) (gf|gf) (gg|ss) (gg|ps) (gg|pp) (gg|ds) (gg|dp) (gg|dd) (gg|fs) (gg|fp) (gg|fd) (gg|ff) (gg|gs) (gg|gp) (gg|gd) (gg|gf) (gg|gg)

OS DRK OS OS DRK DRK DRK DRK DRK OS DRK OS DRK OS OS DRK DRK DRK DRK DRK DRK DRK DRK DRK

much more demanding. Such memory traffic becomes a serious burden, decreasing the performance significantly. Indeed, for (dd|dd), (f p|f p), and (gg|gg), although the OS method has smaller x, it is much less efficient than the DRK method. The latter requires a memory of only 16−28% of the former. Actually, several studies reveal that in some cases minimizing the memory implementation instead of FLOPs can lead to better performance improvement.66,67 Thus, in LIBRETA, calibrations similar to those in Table 6 have been carried out to determine whether the OS or DRK method is superior for each type of ERI. This guarantees that LIBRETA can select the most suitable method for all ERIs in practice. The preferred method between the OS and DRK method for each type of ERI is given in Table 7. 4.3. Design of the Program Structure. Optimization of RRs and contraction paths can reduce the theoretical computational cost significantly. However, the practical performance depends on not only the formulas but also specific computer architecture and how it is implemented. To achieve peak performance in practice, many technical details in the program structure have to be considered, which we will mention some here. The code generator is written in C++ and generates C++ codes of LIBRETA. Common Subexpression Elimination. The last three terms (which are center-independent) in OS-VRRs (10) and (11) can be shared by several intermediates, as shown in the calculation of (fd|dp) by the singly and doubly underlined terms of (41)− (43):

Figure 8. Two memory access patterns. The numbers in paratheses are xab or xcd.

% cache missesa

(gp|fs) (gp|fp) (gp|fd) (gp|ff) (gp|gs) (gp|gp) (gd|ss) (gd|ps) (gd|pp) (gd|ds) (gd|dp) (gd|dd) (gd|fs) (gd|fp) (gd|fd) (gd|ff) (gd|gs) (gd|gp) (gd|gd) (gf|ss) (gf|ps) (gf|pp) (gf|ds) (gf|dp)

[(1,0,5)0|(1,0,5)0]0 = XQC[(1,0,5)0|(0,0,5)0]0

a

The cache miss were measured with Valgrind (http://valgrind.org/). b For each ERI type, the time was measured by 1000 calculations.

+

performance, in agreement with their relative costs. However, as L increases, the memory requirement of the OS method gets

ξ XPQ [(1,0,5)0|(0,0,5)0]1 ξ+η

+ [(0,0,5)0|(0,0,5)0]1 /2(ξ + η) 581

(41)

DOI: 10.1021/acs.jctc.7b00788 J. Chem. Theory Comput. 2018, 14, 572−587

Article

Journal of Chemical Theory and Computation Table 9. Performance (Unit s) of LIBRETAa basis set

NBFb

LIBINT2

NBF

LIBRETA

LIBINT2

C6H6 STO-6G 6-31G(d) 6-311G(2d,2p) def2-SVP def2-TZVP def2-QZVP cc-pVDZ cc-pVTZ cc-pVQZ aug-cc-pVDZ aug-cc-pVTZ aug-cc-pVQZ

36 102 228 120 252 642 120 300 630 204 480 960

40.4 16.3 74.6 11.5 75.8 1101.4 52.5 186.6 1176.1 119.2 525.3 4477.1

ECP-VDZc ECP-VTZc ECP-VQZc

70 150 293

13.1 49.1 216.6

LIBRETA

SF6 19.8 (104.0%) 9.2 (77.2%) 49.7 (50.1%) 7.0 (64.3%) 52.1 (45.5%) 1027.4 (7.2%) 11.3 (364.6%) 67.1 (78.1%) 788.0 (49.3%) 32.1 (271.3%) 279.2 (88.1%) 3959.8 (13.1%)

39 137 211 109 258 518 109 249 494 179 389 739

38.9 23.6 52.2 7.1 69.4 598.9 93.1 228.4 885.5 158.2 471.5 2571.2

2.5 (424.0%) 10.3 (376.7%) 74.7 (190.0%)

204 316 540

687.6 7732.3 71348.0

XeO3

17.5 (122.3%) 16.8 (40.5%) 32.3 (61.6%) 5.6 (26.8%) 50.8 (36.6%) 505.7 (18.4%) 9.2 (912.0%) 40.6 (462.6%) 357.8 (147.5%) 20.2 (683.2%) 136.0 (246.7%) 1565.7 (64.2%) Au4 32.1 (2042.1%) 301.2 (2467.2%) 2056.5 (3369.4%)

a

All the calculations were performed with AMD Opteron(tm) Processor 6134. The number in parentheses is the efficiency improvement. bNBF = number of basis functions. cFor XeO3 and Au4, only the valence basis part was considered.

“wanders” in discontinuous regions of the memory. In contrast, Pattern B shows more locality in memory access; thus it is considered to be more efficient than Pattern A. In fact, a benchmark in Table 8 suggests that as the HRR codes become larger, Pattern A has an increasing cache miss whereas that of Pattern B is relatively stable and smaller. Its impact on the performance is obvious: Pattern B is more efficient than Pattern A in the sense of the wall time by about 10−20%. For every part of LIBRETA, the cache optimization is taken into account.

[(0,0,6)0|(2,0,4)0]0 = XQC[(0,0,6)0|(1,0,4)0]0 +

ξ XPQ [(0,0,6)0|(1,0,4)0]1 ξ+η

+ ([(0,0,6)0|(0,0,4)0]0 − [(0,0,6)0|(0,0,4)0]1 )/2η (42)

[(0,0,6)0|(0,0,6)0]0 = ZQC[(0,0,6)0|(0,0,5)0]0 +

ξ ZPQ [(0,0,6)0|(0,0,5)0]1 ξ+η

5. PERFORMANCE OF LIBRETA In the following, all the calculations were performed using Cartesian GTOs with one CPU core on the same machine.

+ 5([(0,0,6)0|(0,0,4)0]0 − [(0,0,6)0|(0,0,4)0]1 )/2η + 6[(0,0,5)0|(0,0,5)0]1 /2(ξ + η)

(43)

Table 10. Performance (Unit s) When One Method Is Removeda

In this case, the common subexpression elimination (CSE) technique68 can be applied. These common subexpressions may not be recognizable by the compilers due to their complexity; thus CSE is explicitly done by backward search during the code generation. CSE gives some performance improvement, for example, reducing the wall time of calculating (fd|dp) by 4−6%. The improvement is limited because each common term appears at most three times. Memory Access. Memory access is another important issue. The essential part of memory access optimization is to increase the cache hit rate, which can significantly improve the performance of numerical computing. To achieve this, data reference patterns in the codes should exhibit locality feature.69 For the evaluation of ERIs, the number of intermediates is large thus reasonable memory access is critical. This program design can be demonstrated through the calculation of (gp|gs) by (12) from (gs|gs) and (hs|gs). Assume the ERIs are “row-major” ordered (all indices/addresses start at 0 in line with the C array convention): that is, if the index of a in La-shell is xa, then the index of ab and (ab|cd) is xab = xaNLb + xb and xabNLcNLd + xcd, respectively. There are two computing orders, or memory access patterns, to carry out this HRR (Figure 8): computing the entire (gp| shell for every component of |gs) (Pattern A) or computing every component of (gp| for the entire |gs) shell (Pattern B). Although Pattern A looks more natural, it

basis set

NBFb

LIBRETA

without OSc

without DRKd

without MD-CXXXXe

STO-6G 6-31g(d) 6-311g(2d,2p) def2-SVP def2-TZVP def2-QZVP cc-pVDZ cc-pVTZ cc-pVQZ aug-cc-pVDZ aug-cc-pVTZ aug-cc-pVQZ

39 137 211 109 258 518 109 249 494 179 389 739

17.5 16.8 32.3 5.6 50.8 505.7 9.2 40.6 357.8 20.2 136.0 1565.7

17.6 17.9 34.1 5.7 53.3 510.9 9.0 41.5 361.2 21.1 138.4 1582.1

17.7 17.6 32.7 5.6 53.7 567.8 9.3 43.1 389.3 20.1 156.3 1789.0

19.3 18.0 32.4 5.9 50.6 514.7 9.8 41.4 360.8 20.6 137.4 1600.4

a

All the calculations were performed for SF6 with AMD Opteron(tm) Processor 6134. bNBF = number of basis functions. cOS is replaced by the DRK algorithm. dDRK is replaced by the OS algorithm. eMDCXXXX is replaced by the OS algorithm.

5.1. Comparison with LIBINT2. We examined the performance of LIBRETA by calculating all ERIs (8-fold permutation symmetry was used) of some molecules with popular basis sets and compared it with LIBINT2. The basis sets tested included the highly contracted Pople’s STO-6G,70,71 6-31G(d) and 6582

DOI: 10.1021/acs.jctc.7b00788 J. Chem. Theory Comput. 2018, 14, 572−587

Article

Journal of Chemical Theory and Computation Table 11. Average Time for One SCF Iteration (Unit s) of NWCHEM and LIBRETAa basis set def2-TZVP def2-QZVP cc-pVTZ cc-pVQZ aug-cc-pVTZ aug-cc-pVQZ

NBFb

NWCHEM CH3CHO 7.3 250.0 13.8 240.8 86.3 1261.3

132 356 165 350 265 535

NBFb

LIBRETA

NWCHEM

LIBRETA

S4 7.4 (−1.4%) 209.7 (19.2%) 14.1 (−2.1%) 181.6 (32.6%) 75.2 (14.7%) 1136.0 (11.0%)

168 344 156 296 236 436

23.5 328.1 131.7 549.4 225.7 1416.1

20.5 (14.6%) 262.7 (24.9%) 22.1 (495.9%) 148.2 (270.7%) 66.1 (241.5%) 579.8 (144.2%)

a

All calculations converged to the same energy. All the calculations were carried out with AMD Opteron(tm) Processor 6134. The number in paratheses is the efficiency improvement. bNBF = number of basis functions.

311G(2d, 2p);72,73 segmented Karlsruhe basis set def2-XVP,74 Dunning75,76 and Peterson’s77−79 general contracted correlation consistent basis sets (aug)-cc-pVXZ (including ECP-VXZ). We selected four molecules: organic molecule C6H6; inorganic molecules containing heavy elements SF6 and XeO3; metal cluster Au4. Note that for the last three molecules, the basis functions usually have considerably larger contraction degrees than those for C, H, O that often appear in small organic molecules. The results are given in Table 9. Our LIBRETA outperformed LIBINT2 for most cases, being faster than LIBINT2 by 7.2−912.0% for C6H6 and SF6. For ECP-VXZ, which may contain general contracted d GTOs with K = 10, LIBRETA exhibited excellent performance, being 20−30 times faster. The performance improvement comes from two aspects: (1) LIBINT2 implements the OS method; LIBRETA can use the DRK method for ERIs of high L which is much more efficient. (2) LIBINT2 is designed for segmented contracted basis sets; thus for ERIs over a set of general contracted GTOs, VRRs for the primitive GTO quartets have to be repeated for about B4 times, and the following contractions scale as K4B4. Using LIBRETA, the VRRs needs to be carried out only once, and the contractions scale as K4B2 + K2B4 (see the program structure in Table 4). Thus for general contracted basis sets like aug-cc-pVXZ or ECPVXZ, our library exhibited much better performance. This can also be demonstrated from some calculations where each method was disabled in LIBRETA, and the results are given in Table 10. It is observed that without any one method, all the calculations will take a long wall time. The DRK method is particularly important for basis sets containing GTOs of high angular momenta. An example is that without the DRK method, LIBRETA requires additional computation time of more than 200 s for aug-cc-pVQZ. Without the OS or MD-CXXXX method, the calculations of ERIs over GTOs of low angular momenta or high contraction degrees are also less efficient. Therefore, we confirm that LIBRETA can select the best method during the evaluation of ERIs. 5.2. Comparison with NWCHEM. Finally, the performance of LIBRETA is examined in real quantum chemical calculations. Thus, we wrote a HF program using simple direct SCF strategy,

i.e., recomputing the ERIs when required without disk usage. By “simple” we mean that we did not use point group symmetry or incremental Fockian formation, etc. They were also turned off in NWCHEM, because these techniques are out of the scope of LIBRETA. We calibrated the HF performance of our program and NWCHEM55 by comparing the average time of one SCF iteration. Because the evaluation of ERIs in an SCF calculation dominates, the calibration should reflect the performance of the ERI evaluation ability to a good extent. Results are given in Table 11. Again, LIBRETA exhibits excellent performance, especially for S4 with general contracted basis sets, LIBRETA was faster than NWCHEM by 144.2−495.9%. This feature is important for the highly accurate electron correlation calculations of large molecules. To test the effect of integral screening, we performed HF calculations on the global minimum of the silicon cluster80 Si16, which is a heavy system with a large number of highly contracted basis functions. The performances with screening parameter τ = 10−8 and without screening are shown in Table 12. We can see that both NWCHEM and LIBRETA exhibited performance improvement. In the cc-pVQZ calculation, the QQR screening (39) can reduce about 5.9 × 1010 small ERIs from 2.5 × 1011 ones. In practice, τ can be variable during the SCF procedure to achieve the best accuracy-cost ratio. We conclude that LIBRETA can become a useful and efficient library for theoretical and computational chemists.

6. CONCLUSION In this paper, we have developed a new library for the evaluation of ERIs using computerized optimization, LIBRETA. It contains codes of OS, DRK, and MD RRs that are highly optimized by tree-search. Given an ERI, LIBRETA will determine and use the most efficient scheme to calculate it. Calibrations revealed that for popular basis sets, LIBRETA showed performance superior to LIBINT2 and NWCHEM. Especially for general contracted basis sets, LIBRETA could be faster than LIBINT2 by 912.0%, and for Au4 was even 20−30 times faster; the direct SCF program supported by LIBRETA could be faster than NWCHEM by 495.9%. LIBRETA is a black box library, containing codes for the evaluation of not only

Table 12. Average Time for One SCF Iteration (Unit s) of NWCHEM and LIBRETAa screening: τ = 10−8 b

basis set

NBF

cc-pVDZ cc-pVTZ cc-pVQZ

304 624 1184

no screening

NWCHEM

LIBRETA

NWCHEM

LIBRETA

1222.6 5334.0 55307.0

1063.2 (15.0%) 4805.3 (11.0%) 37135.9 (48.9%)

2724.9 11317.0 59276.8

1227.3 (122.0%) 5860.0 (93.1%) 43252.1 (37.0%)

a

All calculations converged to the same energy. All the calculations were carried out with AMD Opteron(tm) Processor 6134. The number in paratheses is the efficiency improvement (relative to NWCHEM). bNBF = number of basis functions. 583

DOI: 10.1021/acs.jctc.7b00788 J. Chem. Theory Comput. 2018, 14, 572−587

Article

Journal of Chemical Theory and Computation

Therefore, we conclude that the redundancies originate from the fact that for the RRs in constrained indices w or m, n, the next set of quantities is able to be constructed from only a subset of the current one. Note that the “constraint” like wx + wy + wz = Lw (thus the points form a triangle) is essential: without constraints, there will be no redundancies. Now we derive (44); i.e., at most, ML components of the Lshell are sufficient to construct the entire (L + 1)-shell. See Figure 9, now the green and blue points represent the components of the (L + 1)- and L-shells, respectively. We can see that two lines of green points have to be constructed by L + 1 blue ones. The remaining green points constitute a “shifted-(L − 1)-shell”, which can be constructed from ML−2 blue ones; thus we have a RR:

ERIs but also important one-electron integrals like overlap, multipole, and electron−nuclear attraction integrals. For other kinds of molecular integrals, like the ones over anti-Coulomb operator,81 London orbitals,82 or Gaussian geminals,83 one can try to optimize the recurrence relations and contraction paths by approaches in this paper and apply the automatic code generation technique. LIBRETA is expected to be a useful tool for theoretical and computational chemists to efficiently develop and carry out quantum chemical algorithms. In the future, we hope to generalize this computerized optimization methodology to their first and second derivatives and more kinds of molecular integrals.



APPENDIX: ORIGIN OF REDUNDANCIES Now we want to figure out why there are redundancies in MD/ OS-VRRs and HRRs but not in DRK-VRRs. The redundancies occur only for angular momentum indices like w but not for onedimensional indices like m or N; thus their origin lies in the process of constructing (L + 1)-shell by adding 1 to the components of L-shell. It is trivial to find out that p- and d-shells have to be constructed from the entire s- and p-shells, respectively. However, only a part of the six components of the d-shell is needed to construct the entire f-shell. This can be seen, for example, from gray nodes in Figure 4: [(1, 0, 1)]2 and [(0, 1, 1)]2 of [2]2 are not used to calculate [3]1. An elegant way to visualize this scenario is shown in Figure 9: the angular

ML = (L + 1) + ML − 2

(45)

with M0 = 1 and M1 = 3. Then we have ML = (L + 1) + ML − 2 = (L + 1) + (L − 1) + ML − 4 l L + 1) + (L ≠−ÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖ 1) + μ + 3ÖÆ + M 0 L is even o o ´(ÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖ o o o L /2terms o =o m o (L + 1) + (L − 1) + μ + 4 + M1 L is odd o o o ´ÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖ≠ÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÖÆ o o (L − 1)/2terms n l1 o o (L + 4)L + 1 L is even o o o o4 =m o o 1 o o (L + 5)(L − 1) + 3 L is odd o o o n4 1 3 = L(L + 4) + 1 + (L mod 2) 4 4

(46)

which proves (44). Now it is possible to give an estimation of the upper bound of the least number of intermediates required in MD-VRRs. Given an ERI of total angular momentum L, the total number of intermediates is L

X total =

L−n

∑ ∑ Nl = n=0 l=0

1 (L + 1)(L + 2)(L + 3)(L + 4) 24 (47)

In generated codes, redundancies are removed, only Xredundancy‑free intermediates remaining to be calculated. Thus, we define

Figure 9. Origin of redundancies. The d- and f-shell are represented by blue and green points in 3. The tetrahedrons illustrate the construction procedure. For example, the (1, 1, 0) component of the d-shell can be used to derive (2, 1, 0), (1, 2, 0), and (1, 1, 1). In this way, 4 blue points are sufficient to construct 10 green ones, making (1, 0, 1) and (0, 1, 1) redundant (see also Figure 4).

f MD‐VRR ≡

X redundancy‐free X total

(48)

To estimate Xredundancy‑free, we note that for n < L, only Ml rather than Nl components of an l-shell are needed as proved above; thus we get

momenta are represented by lattice points in 3, and each component of the d-shell (blue points) can be used to obtain three components of the f-shell (green points) by adding one (yellow tetrahedrons) to its x, y, and z components. One can see that four components of the d-shell are already sufficient to construct the entire f-shell. Generally, ML components of L-shell are sufficient to construct the entire (L + 1)-shell, where 1 3 ML = L(L + 4) + 1 + (L mod 2) (44) 4 4

L−1 L−n

X redundancy‐free ≤

L

∑ ∑ Ml + ∑ Nl

∑ ∑ jijj 1 l(l + 4) + n=0 l=0 L−1 L−n

l=0

7 zy zz + 4{

L

k4 l=0 2 1 3 2 = (L + 1)(L + 15L + 78L + 48) 48 ≤

n=0 l=0

∑ 1 (l + 1)(l + 2) (49)

This is an upper bound because the redundancies due to the ones of higher l are not taken into account in (49). Finally, we obtain

which will be proved later. We can see that ML/NL ≈ 1/2; thus for large L only about 50% of the L-shell are needed in the RRs. 584

DOI: 10.1021/acs.jctc.7b00788 J. Chem. Theory Comput. 2018, 14, 572−587

Journal of Chemical Theory and Computation f MD‐VRR ≤

(L3 + 15L2 + 78L + 48) 2(L + 2)(L + 3)(L + 4)



In Table 13, the estimation (50) is compared with our optimized MD-VRR (Table 1) as well as the work of Johnson et al.11 We can see that the optimized RRs from both works satisfy the upper bound. Table 13. Examination of the Upper Bound in (50) f MD‑VRR by this worka

by Johnson et al.b

upper bound by eq 50

s p d f g h i k l m n L = 11 L = 12 L = 13 L = 14 L = 15

1 1 1 0.943 0.886 0.857 0.824 0.791 0.770

1 1 0.953 0.905 0.889 0.859 0.824 0.794 0.766 0.740 0.718 0.702 0.679 0.658 0.639 0.622

1 1.183 1.133 1.057 0.988 0.930 0.883 0.844 0.812 0.784 0.761 0.742 0.725 0.710 0.697 0.685

a

The data are calculated from results in Table 1. bThe data are estimated from results in Table 1 in ref 11 by “Present/Full”, because in that table they gave flop-costs rather than the numbers of intermediates. The true f MD‑VRR should be a little larger.



ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jctc.7b00788. Evaluation of the Boys function and roots and weights of the R-type Rys polynomial, memory requirement of LIBRETA for the OS-VRR and DRK-VRR, definition of the well-separatedness extent, and geometries of C6H6, SF6, XeO3, Au4, CH3CHO, and S4 (PDF)



REFERENCES

(1) Boys, S. F. Electronic Wave Functions. I. A General Method of Calculation for the Stationary States of Any Molecular System. Proc. R. Soc. London, Ser. A 1950, 200, 542−554. (2) Dupuis, M.; Rys, J.; King, H. F. Evaluation of Molecular Integrals over Gaussian Basis Functions. J. Chem. Phys. 1976, 65, 111−116. (3) McMurchie, L. E.; Davidson, E. R. One- and Two-electron Integrals Over Cartesian Gaussian Functions. J. Comput. Phys. 1978, 26, 218−231. (4) Pople, J. A.; Hehre, W. J. Computation of Electron Repulsion Integrals Involving Contracted Gaussian Basis Functions. J. Comput. Phys. 1978, 27, 161−168. (5) Hamilton, T. P.; Schaefer, H. F., III New Variations in Twoelectron Integral Evaluation in the Context of Direct SCF Procedures. Chem. Phys. 1991, 150, 163−171. (6) Obara, S.; Saika, A. Efficient Recursive Computation of Molecular Integrals over Cartesian Gaussian Functions. J. Chem. Phys. 1986, 84, 3963−3974. (7) Obara, S.; Saika, A. General Recurrence Formulas for Molecular Integrals over Cartesian Gaussian Functions. J. Chem. Phys. 1988, 89, 1540−1559. (8) Head-Gordon, M.; Pople, J. A. A Method for Two-electron Gaussian Integral and Integral Derivative Evaluation Using Recurrence Relations. J. Chem. Phys. 1988, 89, 5777−5786. (9) Gill, P. M. W.; Head-Gordon, M.; Pople, J. A. Efficient Computation of Two-electron-repulsion Integrals and Their nthorder Derivatives Using Contracted Gaussian Basis Sets. J. Phys. Chem. 1990, 94, 5564−5572. (10) Lindh, R.; Ryu, U.; Liu, B. The Reduced Multiplication Scheme of the Rys Quadrature and New Recurrence Relations for Auxiliary Function Based Two-electron Integral Evaluation. J. Chem. Phys. 1991, 95, 5889−5897. (11) Johnson, B. G.; Gill, P. M. W.; Pople, J. A. Exact and Approximate Solutions to the One-center McMurchie−Davidson Tree-search Problem. Int. J. Quantum Chem. 1991, 40, 809−827. (12) Gill, P. M. W.; Pople, J. A. The Prism Algorithm for Two-electron Integrals. Int. J. Quantum Chem. 1991, 40, 753−772. (13) Ishida, K. General Formula Evaluation of the Electron-repulsion Integrals and the First and Second Integral Derivatives over Gaussiantype Orbitals. J. Chem. Phys. 1991, 95, 5198−5205. (14) Ryu, U.; Lee, Y. S.; Lindh, R. An Efficient Method of Implementing the Horizontal Recurrence Relation in the Evaluation of Electron Repulsion Integrals Using Cartesian Gaussian Functios. Chem. Phys. Lett. 1991, 185, 562−568. (15) Lindh, R. The Reduced Multiplication Scheme of the Rys-Gauss Quadrature for 1st Order Integral Derivatives. Theor. Chem. Acc. 1993, 85, 423−440. (16) Gill, P. M. W. Molecular Integrals Over Gaussian Basis Functions. Adv. Quantum Chem. 1994, 25, 141−205. (17) Ishida, K. ACE Algorithm for the Rapid Evaluation of the Electron-repulsion Integral Over Gaussian-type Orbitals. Int. J. Quantum Chem. 1996, 59, 209−218. (18) Yanai, T.; Ishida, K.; Nakano, H.; Hirao, K. New Algorithm for Electron Repulsion Integrals Oriented to the General Contraction Scheme. Int. J. Quantum Chem. 2000, 76, 396−406. (19) Nakai, H.; Kobayashi, M. New Algorithm for the Rapid Evaluation of Electron Repulsion Integrals: Elementary Basis Algorithm. Chem. Phys. Lett. 2004, 388, 50−54. (20) Makowski, M. Simple Yet Powerful Techniques for Optimization of Horizontal Recursion Steps in Gaussian-type Two-electron Integral Evaluation Algorithms. Int. J. Quantum Chem. 2007, 107, 30−36. (21) Flocke, N.; Lotrich, V. Efficient Electronic Integrals and Their Generalized Derivatives for Object Oriented Implementations of Electronic Structure Calculations. J. Comput. Chem. 2008, 29, 2722− 2736. (22) Sandberg, J. A. R.; Rinkevicius, Z. An Algorithm for the Efficient Evaluation of Two-electron Repulsion Integrals over Contracted Gaussian-type Basis Functions. J. Chem. Phys. 2012, 137, 234105.

(50)

type

Article

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. ORCID

Jun Zhang: 0000-0001-9093-9430 Notes

The author declares no competing financial interest.



ACKNOWLEDGMENTS This material is based upon work supported by a Molecular Software Sciences Institute (MolSSI) Fellowship funded by the National Science Foundation under Grant No. 1547580. It has also been supported by the National Science Foundation (award CHE-1361586) and by CREST, Japan Science and Technology Agency. 585

DOI: 10.1021/acs.jctc.7b00788 J. Chem. Theory Comput. 2018, 14, 572−587

Article

Journal of Chemical Theory and Computation (23) Pritchard, B. P.; Chow, E. Horizontal Vectorization of Electron Repulsion Integrals. J. Comput. Chem. 2016, 37, 2537−2546. (24) King, H. F.; Dupuis, M. Numerical Integration Using Rys Polynomials. J. Comput. Phys. 1976, 21, 144−165. (25) Yasuda, K. Two-electron Integral Evaluation on the Graphics Processor Unit. J. Comput. Chem. 2008, 29, 334−342. (26) Ufimtsev, I. S.; Martínez, T. J. Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-electron Integral Evaluation. J. Chem. Theory Comput. 2008, 4, 222−231. (27) Asadchev, A.; Gordon, M. S. New Multithreaded Hybrid CPU/ GPU Approach to Hartree−Fock. J. Chem. Theory Comput. 2012, 8, 4166−4176. (28) Miao, Y.; Merz, K. M. Acceleration of Electron Repulsion Integral Evaluation on Graphics Processing Units via Use of Recurrence Relations. J. Chem. Theory Comput. 2013, 9, 965−976. (29) Yasuda, K.; Maruoka, H. Efficient Calculation of Two-electron Integrals for High Angular Basis Functions. Int. J. Quantum Chem. 2014, 114, 543−552. (30) Miao, Y.; Merz, K. M. Acceleration of High Angular Momentum Electron Repulsion Integrals and Integral Derivatives on Graphics Processing Units. J. Chem. Theory Comput. 2015, 11, 1449−1462. (31) Kindratenko, V.; Ufimtsev, I. S.; Martínez, T. J. Evaluation of Two-electron Repulsion Integrals over Gaussian Basis Functions on SRC-6 Reconfigurable Computer. Proceedings of the Reconfigurable Systems Summer Institute; NCSA: Urbana, IL, 2008. (32) Nakamura, K.; Hatae, H.; Harada, M.; Kuwayama, Y.; Uehara, M.; Sato, H.; Obara, S.; Honda, H.; Nagashima, U.; Inadomi, Y.; Murakami, K. Eric: A Special-Purpose Processor for ERI Calculations in Quantum Chemistry Applications. Proceedings of the 5th International Conference on High Performance Computing and Grid in Asia Pacific Region; HPC, 2002. (33) Ramdas, T.; Egan, G. K.; Abramson, D.; Baldridge, K. Converting Massive TLP to DLP: a Special-purpose Processor for Molecular Orbital Computations. Proceedings of the 4th International Conference on Computing Frontiers; Springer: Berlin, 2007. (34) Jones, H. W. Computer-generated Formulas for Overlap Integrals of Slater-type Orbitals. Int. J. Quantum Chem. 1980, 18, 709−713. (35) Jones, H. W. Computer-generated Formulas for Two-center Coulomb Integrals over Slater-type Orbitals. Int. J. Quantum Chem. 1981, 20, 1217−1224. (36) Jones, H. W. Computer-generated Formulas for Four-center Integrals over Slater-type Orbitals. Int. J. Quantum Chem. 1986, 29, 177−183. (37) Janssen, C. L.; Schaefer, H. F. The Automated Solution of Second Quantization Equations with Applications to the Coupled Cluster Approach. Theor. Chim. Acta 1991, 79, 1−42. (38) Li, X.; Paldus, J. Automation of the Implementation of Spinadapted Open-shell Coupled-cluster Theories Relying on the Unitary Group Formalism. J. Chem. Phys. 1994, 101, 8812−8826. (39) Nooijen, M.; Lotrich, V. Towards a General Multireference Coupled Cluster Method: Automated Implementation of Open-shell CCSD Method for Doublet States. J. Mol. Struct.: THEOCHEM 2001, 547, 253−267. (40) Hirata, S. Tensor Contraction Engine: Abstraction and Automated Parallel Implementation of Configuration-interaction, Coupled-cluster, and Many-body Perturbation Theories. J. Phys. Chem. A 2003, 107, 9887−9897. (41) Hirata, S. Higher-order Equation-of-motion Coupled-cluster Methods. J. Chem. Phys. 2004, 121, 51−59. (42) Hirata, S. Automated Symbolic Algebra for Quantum Chemistry. J. Phys.: Conf. Ser. 2006, 46, 249. (43) Auer, A. A.; Baumgartner, G.; Bernholdt, D. E.; Bibireata, A.; Choppella, V.; Cociorva, D.; Gao, X.; Harrison, R.; Krishnamoorthy, S.; Krishnan, S.; Lam, C.-C.; Lu, Q.; Nooijen, M.; Pitzer, R.; Ramanujam, J.; Sadayappan, P.; Sibiryakov, A. Automatic Code Generation for Many-body Electronic Structure Methods: The Tensor Contraction Engine. Mol. Phys. 2006, 104, 211−228.

(44) Hartono, A.; Lu, Q.; Henretty, T.; Krishnamoorthy, S.; Zhang, H.; Baumgartner, G.; Bernholdt, D. E.; Nooijen, M.; Pitzer, R.; Ramanujam, J.; Sadayappan, P. Performance Optimization of Tensor Contraction Expressions for Many-Body Methods in Quantum Chemistry. J. Phys. Chem. A 2009, 113, 12715−12723. (45) Strange, R.; Manby, F.; Knowles, P. Automatic Code Generation in Density Functional Theory. Comput. Phys. Commun. 2001, 136, 310−318. (46) Sałek, P.; Hesselmann, A. A Self-contained and Portable Density Functional Theory Library for Use in Ab Initio Quantum Chemistry Programs. J. Comput. Chem. 2007, 28, 2569−2575. (47) Ekström, U.; Visscher, L.; Bast, R.; Thorvaldsen, A. J.; Ruud, K. Arbitrary-order Density Functional Response Theory from Automatic Differentiation. J. Chem. Theory Comput. 2010, 6, 1971−1980. (48) Knizia, G.; Li, W.; Simon, S.; Werner, H.-J. Determining the Numerical Stability of Quantum Chemistry Algorithms. J. Chem. Theory Comput. 2011, 7, 2387−2398. (49) Bracken, P.; Bartlett, R. J. Calculation of Gaussian Integrals Using Symbolic Manipulation. Int. J. Quantum Chem. 1997, 62, 557−570. (50) Gomez, C.; Scott, T. Maple Programs for Generating Efficient FORTRAN Code for Serial and Vectorised Machines. Comput. Phys. Commun. 1998, 115, 548−562. (51) Titov, A. V.; Ufimtsev, I. S.; Luehr, N.; Martínez, T. J. Generating Efficient Quantum Chemistry Codes for Novel Architectures. J. Chem. Theory Comput. 2013, 9, 213−221. (52) Song, C.; Wang, L.-P.; Martínez, T. J. Automated Code Engine for Graphical Processing Units: Application to the Effective Core Potential Integrals and Gradients. J. Chem. Theory Comput. 2016, 12, 92−106. (53) Mazur, G.; Makowski, M.; Łazarski, R.; Włodarczyk, R.; Czajkowska, E.; Glanowski, M. Automatic Code Generation for Quantum Chemistry Applications. Int. J. Quantum Chem. 2016, 116, 1370−1381. (54) Valeev, E. F. Libint library. See “https://github.com/evaleev/ libint”. Access date: 01.07.2016. (55) Valiev, M.; Bylaska, E.; Govind, N.; Kowalski, K.; Straatsma, T.; Dam, H. V.; Wang, D.; Nieplocha, J.; Apra, E.; Windus, T.; de Jong, W. NWChem: A Comprehensive and Scalable Open-source Solution for Large Scale Molecular Simulations. Comput. Phys. Commun. 2010, 181, 1477−1489. (56) Dunning, T. H. Gaussian Basis Functions for Use in Molecular Calculations. I. Contraction of (9s5p) Atomic Basis Sets for the First Row Atoms. J. Chem. Phys. 1970, 53, 2823−2833. (57) Hehre, W. J.; Ditchfield, R.; Pople, J. A. Self-Consistent Molecular Orbital Methods. XII. Further Extensions of Gaussian-Type Basis Sets for Use in Molecular Orbital Studies of Organic Molecules. J. Chem. Phys. 1972, 56, 2257−2261. (58) Raffenetti, R. C. General Contraction of Gaussian Atomic Rrbitals: Core, valence, Polarization, and Diffuse Basis Sets; Molecular Integral Evaluation. J. Chem. Phys. 1973, 58, 4452−4458. (59) Widmark, P.-O.; Malmqvist, P.-Å.; Roos, B. O. Density Matrix Averaged Atomic Natural Orbital (ANO) Basis Sets for Correlated Molecular Wave Functions. Theor. Chim. Acta 1990, 77, 291−306. (60) Helgaker, T.; Jøprgensen, P.; Olsen, J. Molecular Electronicstructure Theory; John Wiley & Sons Ltd., 2000. (61) Press, W. H.; Teukolsky, S. A.; Vetterling, W. T.; Flannery, B. P. Numerical Recipes: The Art of Scientific Computing, third ed.; Cambridge University Press, 2007. (62) Rys, J.; Dupuis, M.; King, H. F. Computation of Electron Repulsion Integrals Using the Rys Quadrature Method. J. Comput. Chem. 1983, 4, 154−157. (63) Whitten, J. L. Coulombic Potential Energy Integrals and Approximations. J. Chem. Phys. 1973, 58, 4496−4501. (64) Maurer, S. A.; Lambrecht, D. S.; Flaig, D.; Ochsenfeld, C. Distance-dependent Schwarz-based Integral Estimates for Twoelectron Integrals: Reliable Tightness vs. Rigorous Upper Bounds. J. Chem. Phys. 2012, 136, 144107. (65) White, C. A.; Johnson, B. G.; Gill, P. M.; Head-Gordon, M. The Continuous Fast Multipole Method. Chem. Phys. Lett. 1994, 230, 8−16. 586

DOI: 10.1021/acs.jctc.7b00788 J. Chem. Theory Comput. 2018, 14, 572−587

Article

Journal of Chemical Theory and Computation (66) Frisch, M. J.; Johnson, B. G.; Gill, P. M.; Fox, D. J.; Nobes, R. H. An Improved Criterion for Evaluating the Efficiency of Two-electron Integral Algorithms. Chem. Phys. Lett. 1993, 206, 225−228. (67) Johnson, B. G.; Gill, P. M.; Pople, J. The Efficient Transformation of (m0|n0) to (ab|cd) Two-Electron Repulsion Integrals. Chem. Phys. Lett. 1993, 206, 229−238. (68) Potkonjak, M.; Srivastava, M. B.; Chandrakasan, A. P. Multiple Constant Multiplications: Efficient and Versatile Framework and Algorithms for Exploring Common Subexpression Elimination. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 1996, 15, 151−165. (69) Kowarschik, M.; Weiß, C. An Overview of Cache Optimization Techniques and Cache-Aware Numerical Algorithms; Algorithms for Memory Hierarchies: Advanced Lectures, Vol. 2625; Springer: Berlin, 2003. (70) Hehre, W. J.; Stewart, R. F.; Pople, J. A. Self-Consistent Molecular-Orbital Methods. I. Use of Gaussian Expansions of SlaterType Atomic Orbitals. J. Chem. Phys. 1969, 51, 2657−2664. (71) Hehre, W. J.; Ditchfield, R.; Stewart, R. F.; Pople, J. A. SelfConsistent Molecular Orbital Methods. IV. Use of Gaussian Expansions of Slater-Type Orbitals. Extension to Second-Row Molecules. J. Chem. Phys. 1970, 52, 2769−2773. (72) McLean, A. D.; Chandler, G. S. Contracted Gaussian Basis Sets for Molecular Calculations. I. Second Row Atoms, Z = 11−18. J. Chem. Phys. 1980, 72, 5639−5648. (73) Francl, M. M.; Pietro, W. J.; Hehre, W. J.; Binkley, J. S.; Gordon, M. S.; DeFrees, D. J.; Pople, J. A. Self-consistent Molecular Orbital Methods. XXIII. A Polarization-type Basis Set for Second-row Elements. J. Chem. Phys. 1982, 77, 3654−3665. (74) Weigend, F.; Ahlrichs, R. Balanced Basis Sets of Split Valence, Triple Zeta Valence and Quadruple Zeta Valence Quality for H to Rn: Design and Assessment of Accuracy. Phys. Chem. Chem. Phys. 2005, 7, 3297−3305. (75) Dunning, T. H. Gaussian Basis Sets for Use in Correlated Molecular Calculations. I. The Atoms Boron through Neon and Hydrogen. J. Chem. Phys. 1989, 90, 1007−1023. (76) Woon, D. E.; Dunning, T. H. Gaussian Basis Sets for Use in Correlated Molecular Calculations. III. The Atoms Aluminum through Argon. J. Chem. Phys. 1993, 98, 1358−1371. (77) Peterson, K. A.; Figgen, D.; Goll, E.; Stoll, H.; Dolg, M. Systematically Convergent Basis Sets with Relativistic Pseudopotentials. II. Small-core Pseudopotentials and Correlation Consistent Basis Sets for the Post-d Group 16−18 Elements. J. Chem. Phys. 2003, 119, 11113−11123. (78) Peterson, K. A.; Puzzarini, C. Systematically Convergent Basis Sets for Transition Metals. II. Pseudopotential-based Correlation Consistent Basis Sets for the Group 11 (Cu, Ag, Au) and 12 (Zn, Cd, Hg) Elements. Theor. Chem. Acc. 2005, 114, 283−296. (79) Figgen, D.; Rauhut, G.; Dolg, M.; Stoll, H. Energy-consistent Pseudopotentials for Group 11 and 12 Atoms: Adjustment to Multiconfiguration Dirac−Hartree−Fock Data. Chem. Phys. 2005, 311, 227− 244. (80) Yoo, S.; Zeng, X. C. Motif Transition in Growth Patterns of Small to Medium-sized Silicon Clusters. Angew. Chem., Int. Ed. 2005, 44, 1491−1494. (81) Klopper, W.; Röhse, R. Computation of Some New Twoelectron Gaussian Integrals. Theor. Chim. Acta 1992, 83, 441−453. (82) Tellgren, E. I.; Soncini, A.; Helgaker, T. Nonperturbative Ab Initio Calculations in Strong Magnetic Fields Using London Orbitals. J. Chem. Phys. 2008, 129, 154114. (83) Barca, G. M. J.; Gill, P. M. W. Two-Electron Integrals over Gaussian Geminals. J. Chem. Theory Comput. 2016, 12, 4915−4924.

587

DOI: 10.1021/acs.jctc.7b00788 J. Chem. Theory Comput. 2018, 14, 572−587