libreta: Computerized Optimization and Code ... - ACS Publications

Subscriber access provided by UNIVERSITY OF ADELAIDE LIBRARIES

Article

LIBRETA: Computerized Optimization and Code Synthesis for Electron Repulsion Integral Evaluation Jun Zhang J. Chem. Theory Comput., Just Accepted Manuscript • DOI: 10.1021/acs.jctc.7b00788 • Publication Date (Web): 14 Dec 2017 Downloaded from http://pubs.acs.org on December 14, 2017

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Chemical Theory and Computation is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

LIBRETA :

Computerized Optimization and Code

Synthesis for Electron Repulsion Integral Evaluation Jun Zhang∗ Department of Chemistry, University of Illinois at Urbana-Champaign, 600 South Mathews Avenue, Urbana, Illinois 61801, USA E-mail: [email protected]

Abstract A new library called LIBRETA for the evaluation of electron repulsion integrals (ERIs) over segmented and contracted Gaussian functions is developed. Our

LIBRETA

is optimized from

three aspects: (1) The Obara–Saika, Dupuis–Rys–King, and McMurchie–Davidson method are all employed. The recurrence relations involved are optimized by tree-search for each combination of angular momenta, and in the best case, 50% of the intermediates can be eliminated to reduce the computational cost. (2) The optimized codes for recurrence relations are combined with different contraction orders, each of which is suitable for ERIs of different angular momenta and contraction patterns. In practice, LIBRETA will determine and use the best scheme to evaluate each ERI. (3) LIBRETA is also optimized at the coding level. For example, with common subexpression elimination and local memory access, the performance can be increased by about 6% and 20%, respectively. The performance was compared with LIBINT 2. For both popular segmented and contracted basis sets,

LIBRETA

can be faster than

LIBINT 2

by 7.2∼912.0%. For basis sets of heavy elements which contain Gaussian basis functions of large contraction degrees, the performance can be increased by 20 to 30 times. We also tested ∗ To

whom correspondence should be addressed

1

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

the performance of LIBRETA in direct self-consistent field (SCF) calculations and compared it with NWC HEM. In most cases, the average time for one SCF iteration by LIBRETA is less than NWC HEM by 144.2∼495.9%. Finally, we discuss the origin of redundancies occurred in the recurrence relations and derive an upper bound of the least number of intermediates required to be calculated in a McMurchie–Davidson recurrence, which is confirmed by ours as well as previous authors’ results. We expect that LIBRETA can become a useful tool for theoretical and computational chemists to develop their own algorithms rapidly.

1 Introduction The defining feature of an ab initio electronic structure method is the nonapproximated evaluation of molecular integrals over the exact Hamiltonian and basis functions, which also constitutes a major hurdle for its implementation and execution. Specifically, a fast evaluation of the four-index, two-electron repulsion integrals (ERIs) over Gaussian-type-orbital (GTO) atomic basis functions has been the primary concern for computational chemists even to this day. 1–23 Many of the widely used evaluation methods are based on recurrence relations (RRs) of ERIs over GTOs with different angular momenta. The first such RR was derived by Boys 1 in 1950 by considering the nuclear-coordinate derivatives of ERIs of the (ss|ss) type. In 1976, Dupuis, Rys, and King 2,24 realized that an ERI could be expressed as a product of one-dimensional functions over a kernel function computable by a Gaussian quadrature with its grid points and weights determined by a RR. This led to the method known as the Dupuis–Rys–King (DRK) method, which is particularly efficient for GTOs with higher angular momenta with a short contraction length. In 1978, Pople and Hehre 4 proposed the axis-switch method, which minimizes the computational cost by rotating the coordinates to make many ERIs over primitive GTOs vanish. This is the first of the methods aiming at contracted GTOs and exhibits excellent performance for highly contracted s- and p-type GTOs, although it performs rather poorly elsewhere. Around the same time, McMurchie and Davidson 3 found that a GTO could be expanded in Hermite-Gaussians (HGs), allowing an ERI to be calculated as a sum of more easily evaluable integrals over HGs. This method 2


Page 2 of 55

Page 3 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


is known as the McMurchie–Davidson (MD) method. In 1986, Obara and Saika 6,7 (OS) introduced a series of important RRs derived on the basis of the translational invariance of the ERIs. In 1988, Head-Gordon and Pople 8 proposed dividing the OS RRs into Gaussian-exponent-dependent (“vertical”) and -independent (“horizontal”) ones, with the latter applicable to contracted GTOs as well as to more numerous primitive GTOs, reducing the computational cost by permitting early contractions. Gill et al 9 extended this idea to the MD recursion and implemented the highly successful PRISM algorithm. For a given group (“shell”) of ERIs, PRISM estimates the evaluation costs of 20 distinct orders or “contraction paths” and selects, at runtime, the least expensive one to be used in actual ERI evaluations. More recent developments include the use of advanced hardware technologies such as the graphics processing units (GPUs), 25–30 field programmable gate arrays (FPGAs), 31 and specialized hardware. 32,33 The performance of a software library for the ERI evaluation depends on at least three key factors: (1) An optimized implementation of the RRs; (2) A selection of the shortest contraction path; (3) A careful design of the program structure. The first two (mathematical optimization) minimize the nominal computational cost regardless of hardware architectures or compiler capabilities. The last (computer-science optimization) includes common subexpression elimination, consecutive memory access, complete loop unrolling, etc., and is critical for reducing actual wall time cost by further minimizing the numbers of floating-point operations (FLOPs) and memory operations (MOPs), although these optimizations tend to be automated by advances in hardware, operating-system, and compiler technologies. In this work, we implement a software library for ERIs with both advanced mathematical and computer-science optimizations by adopting the modern strategy of using domain-specific middleware (by which executable codes (software) are generated and optimized depending on the hardware) that performs automatic formula and code generation as well as optimization. The first example of automatic code generation (ACG) dates back to Jones’ work 34–36 for the evaluation of molecular integrals over Slater-type orbitals, and since then it saw applications in ab

3



Page 4 of 55

initio electron-correlation methods, 37–44 exchange-correlation density functionals, 45–47 optimizations of the RRs, 11,14,20 a detection of their numerical instability, 48 and the synthesis of an entire ERI library. 49–54 One of the most notable examples is the

LIBINT 2

library 54 of Valeev, which

consists of efficient computerized implementations of a few of the contraction paths of the OS RRs. We aim to develop a code library by ACG that is efficient for the evaluation of all types of ERIs occurring in a typical quantum chemical calculation. This library, called LIBRETA, has been built according to the three factors we proposed above: (1) All the three classical methods, i.e. OS, DRK, and MD, were employed in

LIBRETA ,

which are suitable for ERIs of different angular

momenta. The RRs involved in the three methods were explored and highly optimized using tree-search for each angular momentum type. (2) For ERIs over contracted GTOs, an efficient contraction path is determined. Thus, in a quantum chemical calculation, say a direct formation of the Fock matrix,

LIBRETA

can automatically determine and apply the best scheme for each ERI.

(3) The codes were generated using carefully designed program structures. The impact of several technical aspects, including common subexpression elimination or the memory access order, on the performance of LIBINT 2

LIBRETA ,

were examined. The performance of

LIBRETA

was compared with

with popular basis sets: Pople’s split-valence basis sets, segmented Karlsruhe basis sets,

and Dunning and Peterson’s general contracted correlation-consistent basis sets. Our turned out to show superior performance to

LIBINT 2:

LIBRETA

for example, for triple zeta segmented and

general contracted basis sets of SF6 , LIBRETA was faster than LIBINT 2 by 36.6% and 246.7%. For Au4 , for which highly contracted basis functions were used,

LIBRETA

was 20 to 30 times faster.

When it was used in direct self-consistent field (SCF) calculations, in the best case our

LIBRETA

was faster than NWC HEM 55 by 495.9% in terms of the average wall time for one SCF iteration. The technical aspects of LIBRETA will be discussed in detail in this paper, which is organized as follows: first, we give the notations and methods used in the paper; second, we will discuss how to optimize the RRs and contraction paths and implement them efficiently by ACG on computers; finally, we examine the performance of

LIBRETA

4

and compare it with


LIBINT 2

and NWC HEM,

Page 5 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


and conclude our paper. In Appendix A, the origin of the redundancies in RRs is discussed and some theoretical results are derived.

2 Notations The µth real primitive Cartesian GTO 1 centered at A = (Ax , Ay , Az ) with exponent αµ is defined as aµ ≡ Naµ (x − Ax )ax (y − Ay )ay (z − Az )az exp −αµ |r − A|2

(1)

where Naµ is the normalization constant and a ≡ (ax , ay , az ) are the Cartesian components of the angular momentum of this basis function. When the τ-component (τ = x, y, z) of a is increased (decreased) by 1, the obtained GTO is represented by [a ± 1τ ]. Primitive Cartesian GTOs with the identical center A and angular momentum a are linearly combined into a contracted Cartesian GTO, K

(a) ≡ χa (r) ≡

∑ Dµ

aµ

µ=1 ax

ay

az

≡ (x − Ax ) (y − Ay ) (z − Az )

K

∑

(2) Dµ Naµ exp −αµ |r − A|2

µ=1

where K and {Dµ } are the contraction degree and contraction coefficients, respectively. Throughout this paper, we associate square brackets [·] with primitive GTOs and parentheses (·) with contracted GTOs. The set of all Cartesian GTOs that differ only in the Cartesian components of the angular momentum ax , ay , az , but share the same total angular momentum L = ax + ay + az form a Cartesian L-shell. There are NL = (L + 1)(L + 2)/2 Cartesian GTOs in an L-shell, which can furthermore be transformed to a fewer 2L + 1 spherical GTOs. GTO basis sets can be split into two groups depending on the contraction pattern: segmented or general. In a segmented basis set 56 (such as 6-31G 57 ), each primitive GTO appears only once in the definition of a contracted GTO. In a general-contraction basis set 58 (such as ANO-RCC 59 ),

5



Page 6 of 55

a primitive GTO is used in several contracted GTOs. We call a set of GTO shells contracted from the same K primitive GTO shells as a block; in a segmented basis set, the block size B is equal to one for every contracted GTO, but in a general-contraction basis set, B > 1 for at least one GTO. Generally, K ≥ B is required to ensure the linear independence of the basis set. An ERI over contracted GTOs is defined as

(ab|cd) ≡

ZZ

χa (r1 ) χb (r1 )

1 χc (r2 ) χd (r2 ) dr1 dr2 r12

(3)

This, in turn, is written as a contraction of ERIs over primitive GTOs: Ka

(ab|cd) =

Kb

Kc Kd

∑ ∑ ∑ ∑ Dµ Dν Dρ Dτ

aµ bν |cρ dτ

(4)

µ=1 ν=1 ρ=1 τ=1

A batch of ERIs is the set of ERIs over all contracted GTOs that belong to a block. In many ERI libraries (including LIBRETA), an entire batch of contracted ERIs is generated at once to avoid redundant calculations of common intermediates. A batch of (dd|dd) with B = 4 (meaning that 4 d-type Cartesian GTO shells are contracted from an unspecified number of primitive d-type GTO shells) contains 44 × ((2 + 1)(2 + 2)/2)4 = 331776 individual contracted Cartesian ERIs.

LIBRETA

computes all ERIs (and other one-electron integrals) in Cartesian GTO basis. When integrals in spherical GTO basis are required, we will carry out corresponding transformations at suitable steps (see Subsection 3.1). The 8-fold permutation symmetry of ERIs over real GTOs is taken into account during the Fock build at a batch level. Let the centers, exponents, and angular momenta of primitive GTOs a, b, c, d (where we have dropped µ, etc. for brevity) be A, B, C, D, α, β , γ, δ , and La , Lb , Lc , Ld , respectively. Let the total angular momentum of the ERI [ab|cd] be L = La + Lb + Lc + Ld . For the bra and ket component, [ab| and |cd], the following quantities are defined:

ξ = α +β;

αA + β B ; P= ξ

6

αβ 2 Kab = exp − R α + β AB


(5)

Page 7 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


η = γ +δ;

γδ 2 Kcd = exp − R γ + δ CD

γC + δ D Q= ; η

(6)

and θ=

ξη ξ +η

(7)

where RAB ≡ |A − B|, and so on. We also define XAB ≡ Ax − Bx , YAB ≡ Ay − By , ZAB ≡ Az − Bz , and so on. With these, quantity [00|00]N (0 ≤ N ≤ L) can be defined: [00|00]N =

2π 5/2 1/2

ξ η (ξ + η)

Kab Kcd FN θ R2PQ

(8)

where FN (x) is the so-called Boys function given by Z 1

FN (x) =

t 2N exp(−xt 2 )dt

(9)

0

The evaluation of the Boys function is given in the Supporting Information.

3

Formalisms

For a given batch of ERIs, LIBRETA will determine the best method from all three of the classical RRs: OS, DRK, and MD. For the derivation and detail analyses of these RRs, the reader is referred to a review 16 or monograph. 60 The most important feature of these RRs is that there are two distinct sets of RRs: the one (“vertical”) that depends on individual exponents of GTOs and thus can only be applied to primitive ERIs, and the one (“horizontal”) that does not depend on exponents and can be used for both primitive and (partially or fully) contracted ERIs. Since primitive ERIs are far more numerous (when contraction lengths are high) than contracted ERIs, it is often advantageous to contract one or more of the four GTOs involved as early as possible, after which only the horizontal RRs can be used. Below, we succinctly summarize the pertinent aspects of each RR.

7



3.1

Page 8 of 55

The Obara–Saika Method

The RRs underlying this method were derived by Obara and Saika 6,7 from the translational invariance of ERIs. Head-Gordon and Pople 8 made an essential rearrangement of the RRs into the vertical and horizontal RRs (VRRs and HRRs), which are implemented in LIBRETA: 1. VRR: calculate [m0|n0]0 for Lm = La to La + Lb and Ln = Lc to Lc + Ld by η XPQ [m0|n0]N+1 [m + 1x , 0|n0]N =XPA [m0|n0]N − ξ +η mx η nx N N+1 + [m − 1x , 0|n0] − [m − 1x , 0|n0] + [m0|n − 1x , 0]N+1 2ξ ξ +η 2(ξ + η) (10) ξ XPQ [m0|n0]N+1 [m0|n + 1x , 0]N =XQC [m0|n0]N + ξ +η nx ξ mx N N+1 + [m0|n − 1x , 0] − [m0|n − 1x , 0] + [m − 1x , 0|n0]N+1 2η ξ +η 2(ξ + η) (11) and their analogues for y and z components. These VRRs start from (8) with N = L. They have to be carried out for each primitive GTO quartet. 2. Bra and ket contraction: form (m0|n0) by contractions of [m0|n0]0 . 3. Bra and ket HRR: obtain the target ERIs by

(a, b + 1x |cd) = (a + 1x , b|cd) + XAB (ab|cd)

(12)

(ab|c, d + 1x ) = (ab|c + 1x , d) + XCD (ab|cd)

(13)

and their analogues for y and z components. We will use (ab|-HRR (or (La Lb |-HRR) to stand for the subroutine of evaluating (ab|ef) from (a0|ef), · · · , (a + b, 0|ef) (e and f are arbitrary GTOs). For |cd)-HRR (or |Lc Ld )-HRR), similar interpretation applies.

8


Page 9 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


For an ERI over spherical GTOs, it is possible to perform an evaluation over Cartesian GTOs followed by a Cartesian-spherical transformation (CST). However, taking the advantage of HRRs being independent of exponents, it is beneficial to separate this CST into two partial ones before and after the last HRR (13): CRT

HRR

CST

(m0|n0)CRT −→ CRT (ab|n0)CRT −→ sph (ab|n0)CRT HRR sph

−→

CRT CST sph

(ab|cd)

−→

(14) sph

(ab|cd)

Let the four GTOs in the ERI have the same angular momentum L, we can see that the posterior CST scales as (NL4 + (2L + 1)2 NL2 ) · L ≈ L9 + L7 while the two-partial-CST scales as (NL3 · L + (2L + 1)2 NL2 ) · L ≈ L8 + L7 ). More importantly, the cost of the last HRR can be significantly decreased: In the posterior CST scheme, (13) has to be carried out for NL2 ≈ L4 components of CRT (ab|; in the two-partial-CST scheme, this number reduces to (2L + 1)2 ≈ L2 components of

sph (ab|,

making

this scheme preferred in LIBRETA.

3.2 The Dupuis–Rys–King Method We have seen that [ab|cd] is calculated by RRs from a one-dimensional integral (8), revealing that Z 1

[ab|cd] =

g(t 2 ) exp(−λt 2 )dt

(15)

0

where λ = θ R2PQ and the integral kernel g(t 2 ) is a 2L-degree polynomial in t. Thus, this integral can be calculated by standard Gaussian quadrature theory: 61 Z 1

G

g(t 2 ) exp(−λt 2 )dt =

0

∑ wκ g(s2κ )

(16)

κ=1

In (16), G = bL/2c + 1; sκ and wκ are the quadrature roots and weights of an R-type Rys polynomial 2 RλG (x), respectively. Rλm (x) is a 2m-degree polynomial, forming a set of orthogonal polyno-

9



Page 10 of 55

mials: Z 1 0

Rλm (x)Rλn (x) exp(−λ x2 )dx = δmn

(17)

The integral kernel can be proved to be separated into x, y, and z components:

g(s2κ ) = IaXx bx cx dx (s2κ )IaYy by cy dy (s2κ )IaZz bz cz dz (s2κ )

(18)

There are two ways of implementing the DRK method: 62 (1) Obtain [ab|cd] directly by a 5-term RR on IaXx bx cx dx , etc., followed by contractions to (ab|cd); (2) Calculate [m0|n0] using X , etc., which can be obtained by a Gaussian quadrature, thus in (18), IaXx bx cx dx becomes ImXx 0nx 0 ≡ Imn

3-term RR. Then perform contractions and HRRs (which were referred to as “transfer equations” in Ref. 62) to obtain (ab|cd). Consider the most expensive step of the two ways, i.e., forming [ab|cd] or [m0|n0]. Let the four GTOs in the ERI have the same angular momentum L and contraction degree K, the cost of this step in the former way scales as G · NL4 · K 4 ≈ L9 K 4 while in the latter way scales as G · L2 · NL2 · K 4 ≈ L7 K 4 . Therefore, the latter way shifts a large amount of work from the primitive GTO quartet loop. 10,21 In

LIBRETA ,

we will use the second way. The evaluation of

sκ and wκ is given in the Supporting Information. The DRK method calculates [m0|n0] by the following steps: X , IY and I Z for m = L to L + L , n = L to L + L , and κ = 1 to G by 1. VRR: calculate Imn a a c c b d mn mn

m ns2κ η 2 X η 2 X 2 XPQ sκ Imn (sκ )+ 1− sκ Im−1,n (s2κ )+ IX (s2 ) ξ +η 2ξ ξ +η 2(ξ + η) m,n−1 κ (19) 2 ξ n ξ 2 X msκ X X X (s2κ ) = XQC + (s2κ )+ Im,n+1 XPQ s2κ Imn 1− sκ Im,n−1 (s2κ )+ Im−1,n (s2κ ) ξ +η 2η ξ +η 2(ξ + η) (20)

X Im+1,n (s2κ ) =

XPA −

X = IY = I Z = 1. and their analogues for y and z components. These RRs start from I00 00 00

2. Gaussian quadrature: calculate [m0|n0] by G 2π 5/2 Y [m0|n0] = Kab Kcd ∑ wκ ImXx nx (s2κ )Im (s2 )I Z (s2 ) y ny κ mz nz κ 1/2 ξ η(ξ + η) κ=1

10


(21)

Page 11 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


Then, (ab|cd) is calculated by applying contractions and HRRs on [m0|n0]. Comparing the OS with DRK method, the memory requirement for the former becomes more demanding as L increases. For example, for (gg|gg), the OS-VRR and DRK-VRR requires storing 136125 and 21286 intermediates, respectively, by LIBRETA (For the memory requirement of other types of ERI, please refer to the Supporting Information). For ERIs of low and high L, it is better to use the OS and DRK method, respectively.

3.3 The McMurchie–Davidson Method In the MD method, an ERI is transformed to integrals over Hermite-Gaussians (HGs), which are easier to calculate than the original one. 3 HGs are defined as w≡

∂ ∂ XP

wx

∂ ∂YP

wy

∂ ∂ ZP

wz

exp −θ R2PQ

(22)

We denote the product of two GTOs and one HG as abp. In this paper, all HGs are represented by w, p, and q. A straightforward way of implementing the MD method is first to calculate the integrals over HGs [00p|00q] by: [w + 1x ]N = XPQ [w]N+1 + wx [w − 1x ]N+1

(23)

[00p|00q] = (−1)qx +qy +qz [p + q]0

(24)

and then transform [00p|00q] to [ab0|cd0] ≡ [ab|cd], which can be contracted to the target ERIs. Here, (24) converts the derivatives with respective to P along in [p + q]0 to those with respective to P and Q. For example, the x-component of (24) is: qx ∂ 2 exp −θ XPQ px qx ≡ ∂ XQ px +qx ∂ 2 qx = (−1) exp −θ XPQ ≡ (−1)qx (p + q)x ∂ XP

∂ ∂ XP

px

11


(25)


Page 12 of 55

The RR (23) starts with [0]N =

2π 5/2 N 2 K K (−2θ ) F θ R N ab cd PQ ξ η(ξ + η)1/2

(26)

Using the terminology from the PRISM algorithm, 12 the bra and ket contraction are referred to as two C’s; the transformation from [0]N to [00p|00q], [00p| to [ab0|, and |00q] to |cd0] are referred to as three T’s. With suitable modifications of (22) to (24), the target ERI can be calculated in arbitrary orders of C’s and T’s, resulting in 20 contraction paths, like MD-TCCTT, MD-CTCTT, etc. 12 We have only implemented the paths starting with “C”, since other paths do not give any significant advantages over the OS and DRK methods in

LIBRETA .

There are 8 different MD-

CXXXX methods, shown in Figure 1. We will describe the method using MD-CCTTT and MDCTTTC as examples. For a complete description please refer to Ref. 12.

Figure 1: All MD-CXXXX methods. From [0]N to (ab|cd), there are 4 contraction paths: CCTTT, CTCTT, CTTCT, and CTTTC. Switching the order of bra and ket contraction gives another 4. The MD-CCTTT method in LIBRETA is given below: 1. Bra and ket contractions: calculate the fundamental quantities for N = 0 to L: N a0 b0 p0 (0)c0 d 0 q0

Ka

Kb

0

Kc Kd

0

0

0

(2αµ )a (2βν )b (2γρ )c (2δτ )d ≡ ∑ ∑ ∑ ∑ Dµ Dν Dρ Dτ [0]N p0 q0 (2ζ ) (2η ) µν ρτ µ=1 ν=1 ρ=1 τ=1 12


(27)

Page 13 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


The a0 , b0 , p0 , c0 , d 0 , q0 are integers that will be discussed later. 2. VRRs: calculate the integrals over HGs a0 b0 p0 (00p|00q)c0 d 0 q0 for L p = 0 to La + Lb and Lq = 0 to Lc + Ld , respectively, by N a0 b0 p0 (w + 1x )c0 d 0 q0

N+1 =XAB (a0 +1)b0 (p0 +1) (w)cN+1 0 d 0 q0 − XCD a0 b0 p0 (w)(c0 +1)d 0 (q0 +1)

(28)

N+1 + XBD a0 b0 p0 (w)cN+1 0 d 0 q0 + wx a0 b0 p0 (w − 1x )c0 d 0 q0

a0 b0 p0 (00p|00q)c0 d 0 q0

= (−1)qx +qy +qz a0 b0 p0 (p + q)0c0 d 0 q0

(29)

and their analogues for y and z components. This is the exponent-independent variant of (23). 3. Bra and ket HRRs: for the bra component, the HGs are transformed to 000(ab0| by

a0 b0 p0(a + 1x , bp| =

px a0 b0 p0(ab, p − 1x | − XAB a0 (b0 +1)(p0 +1)(abp| + a0 b0 (p0 +1)(ab, p + 1x | (30)

a0 b0 p0(a, b + 1x , p| =

px a0 b0 p0(ab, p − 1x | + XAB (a0 +1)b0 (p0 +1)(abp| + a0 b0 (p0 +1)(ab, p + 1x | (31)

and their analogues for y and z components. After applying similar RRs to the ket component, the target ERIs are obtained: (ab|cd) ≡ 000 (ab0|cd0)000 . The MD-CTTTC method in LIBRETA contains the following steps: 1. Ket contraction: calculate the fundamental quantities for N = 0 to L:

[0)N c0 d 0 q0

0

Kc Kd

0

(2γρ )c (2δτ )d ≡ ∑ ∑ Dρ Dτ [0]N q0 (2η ) ρτ ρ=1 τ=1

(32)

2. VRR: calculate [00p|00q)c0 d 0 q0 for L p = 0 to La + Lb and Lq = 0 to Lc + Ld , respectively, by

N+1 N+1 N+1 [w + 1x )N c0 d 0 q0 =XPD [w)c0 d 0 q0 − XCD [w)(c0 +1)d 0 (q0 +1) + wx [w − 1x )c0 d 0 q0

13


(33)


Page 14 of 55

[00p|00q)c0 d 0 q0 = (−1)qx +qy +qz [p + q)0c0 d 0 q0

(34)

3. Bra HRR: transform [00p| to [ab0| by

[a + 1x , bp| = px [a + 1x , b, p − 1x | + XPA [abp| +

1 [ab, p + 1x | 2ζ

(35)

[ab + 1x , p| = px [a + 1x , b, p − 1x | + XPB [abp| +

1 [ab, p + 1x | 2ζ

(36)

4. Ket HRR: transform |00q)c0 d 0 q0 to |cd0)000 by (30) and (31). 5. Bra contraction: the target ERIs are obtained by the contraction of [ab|cd) ≡ [ab|cd0)000 . The additional indices a0 , b0 , p0 , c0 , d 0 , q0 in these equations indicate the powers of exponents, and can be determined backforwardly from the RRs. For example, to calculate (ps|ss) by MDCCTTT, according to (28) and (31): (ps|ss) ≡000 (100|000)000 = −XAB 011 (000|000)000 + 001 (001|000)000

(37)

= −XAB 011 (0)0000 + 001 (1)0000 0 001 (1)000

= +XAB 102 (0)1000 − XCD 001 (0)1101 + XBD 001 (0)1000

(38)

Thus, the fundamental quantities required in (27) are 011 (0)0000 , 102 (0)1000 , 001 (0)1101 and 001 (0)1000 . For ERIs over high angular momenta, the number of additional indices becomes very large, making the early contraction strategy less favorable, and the OS or DRK method are preferred. In LIBRETA

the MD-CXXXX methods are applied for ERIs over highly contracted s and p GTOs.

3.4 Integral Screening For molecules of extended geometry, a large number of ERIs may be very small in magnitude, thus screening them in advance can significantly reduce the computational cost. The classical 14


Page 15 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


Schwarz inequality 63 can efficiently screen the ERIs with small GTO overlaps, but not for the ones with distant bra and ket pair. To account for both, we implemented the so-called QQR screening developed by Maurer et al 64 in our program. An ERI is estimated as

^ = (ab|cd) ≈ (ab|cd)

   

Qab Qcd R − ext0 ab − ext0 cd

p

(ab|ab) and Qcd ≡

p

(39) R − ext0 ab − ext0 cd ≤ 0

  Qab Qcd where Qab ≡

R − ext0 ab − ext0 cd > 0

(cd|cd). ext0 is the well-separatedness extent of a GTO

pair, 65 the definition of which is given in the Supporting Information. Thus, during the evaluation ^ < τ will not be calculated. of ERIs, given a threshold τ, an ERI that is estimated to satisfy (ab|cd)

4 Development of LIBRETA The development of LIBRETA using the formulas in the last Section is shown in Figure 2. It will be discussed from three aspects, which correspond to our previously proposed three factors of building an efficient ERI library: (1) Optimization of the RR codes; (2) Optimization of contraction paths; (3) Design of the program structure.

Figure 2: The development of LIBRETA.

15



4.1

Page 16 of 55

Optimization of Recurrence Relations

Manually written codes for the implementation of RRs usually contain loops over the indices like N or a. Using ACG, loops can be unrolled, i.e. each step of the RRs can be written down explicitly. The efficiency of the ACG codes may be increased by exploring these RRs. First, we examine the DRK-VRRs (19) and (20). For (d p|ps), the DRK-VRR step has to calculate [ f s|ps] and [ds|ps], thus the range of m and n in (19) and (20) is 0∼3 and 0∼1, respectively. One possible DRK-VRR diagram for (d p|ps) is shown in Figure 3. The DRK-VRR step can be applied in various ways, but the total number of the intermediates (Imn ) is always 8.

LIBRETA

will use the RRs of the least FLOPs. For example, between I21 = (·)I20 + (·)I10 or

I21 = (·)I11 + (·)I01 + (·)I10 , the former expression will be chosen. Generally, for each type of the angular momenta, ACG will generate codes through a DRK-VRR diagram containing the least FLOPs.

Figure 3: One possible DRK-VRR diagram for (d p|ps). For example, “10 and 20 connecting 21” means that according to (20), I21 = (·)I20 + (·)I10 . For all such diagrams in this paper, this interpretation applies. Next, we examine the MD-VRRs (23), (28), and (33). Since the additional indices a0 , b0 , p0 , c0 , d 0 , q0 does not affect the recurrence structure, we only consider (23) here. For (d p|ps), the MD-VRR step has to calculate [g], [ f ], [d], and [p], thus the range of Lr is 0 ∼ 4. A possible MD-VRR diagram is shown in Figure 4. However, unlike the case of the DRK-VRR step, 7 intermediates that will never be used again are calculated, i.e. they are redundant for the ERI evaluation. This redundancy problem has been realized for a long time in the context of the MD method. 11,14,16 For the OS-VRRs (10) and (11), the redundancy problem is more complex, since they depend

16


Page 17 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


Figure 4: One possible MD-VRR diagram for (d p|ps). The redundant intermediates are rendered grey. In this figure, [0]N ’s (and also [00|00]N ’s in Figure 5, see below) are calculated independently from Boys function (9). But to make the diagrams compact (one root node), they are formally connected according to the downward recursion of Boys function: FN (x) = (2xFN+1 (x) − exp(−x))/(2N + 1).

17



on two angular momenta, m and n. A possible OS-VRR diagram of (ps|ps) is shown in Figure 5. Of the 24 intermediates, 6 are redundant. Note that in a manually written loop, it is not trivial to identify whether an intermediate is redundant or not because this depends on the entire RR diagram. This is demonstrated in Figure 6: for 3000 randomly generated OS-VRR diagrams of (dd|dd), each one contains 3675 intermediates. After elimination of redundancies, the number of remaining intermediates ranges rather densely from 2273 to 3037.

Figure 5: One possible OS-VRR diagram for (ps|ps). The redundant intermediates are rendered grey.

Figure 6: Number of remaining intermediates for 3000 randomly generated OS-VRR diagrams of (dd|dd) after eliminating redundancies. For HRRs, the redundancy problem is similar. In LIBRETA, ACG is used to find the RR diagram 18


Page 18 of 55

Page 19 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


that has the largest number of redundancies. Thus after elimination of them, this diagram becomes the most efficient one. In Appendix A, we discuss the origin of the redundancies these RRs. In LIBRETA, given a RR (say OS-VRR) and an ERI type (say ( f f | f f )), ACG uses a tree-search algorithm to optimize the RR codes: (1) Randomly generate a large number of RR diagrams. Note that since the number of all possible diagrams increases exponentially with L, a complete search is beyond our computational ability. Each step of a diagram is carried out using the index that can minimize the FLOPs. For example, to calculate [(2, 1, 0)0|(0, 0, 0)0]0 , LIBRETA will apply the RR with my rather than with mx since the former leads to only two nonzero terms in (10). One is randomly selected if multiple indices satisfy this condition. (2) For each diagram, it is treated as a tree (see Figure 7): a vertex stands for an intermediate, and its incoming edges emit from the intermediates that produce it by RRs. Now, every vertex is examined, and eliminated if it has zero outdegree and does not correspond to the target quantity in this ERI evaluation step. This process is repeated until no vertices are eliminated. Now all the diagrams are redundancy-free. (3) The diagram with the least remaining intermediates is selected and used to generate the codes.

Figure 7: The optimization of a RR digram in LIBRETA by ACG. For all the RRs used in LIBRETA, the optimal diagram has been searched and optimized, and the results are shown in Table 1, 2, and 3, respectively. The optimization is quite effective. For MDVRRs, our optimized results are similar to those of Johnson et al 11 and satisfy the theoretical upper bound we derive (see Appendix A). For HRRs, 3.5∼19.7% of the intermediates can be removed. This is rather considerable saving because each HRR step has to be repeated for several times during the evaluation of an ERI. For example, to calculate (gg| f f ) from (gg| f s), (gg|gs), (gg|hs), and (gg|is), we have to perform the | f f )-HRR for 225 components of (gg|, therefore according to Table 2, we can save the computational cost of (388 − 334) × 225 = 12150 intermediates. 19



Page 20 of 55

For OS-VRRs, the optimization is more effective. For ERIs of high L like ( f d|dd) or (gd|ds), more than 50% of the intermediates can be eliminated. We can see that the ERI types with the same La + Lb and Lc + Ld , like (dd|dd), ( f p| f p), (gs|dd), (gs| f p) and (gs|gs), have the same number of intermediates in their non-optimized diagrams, and the larger La − Lb or Lc − Ld is, the more redundancies there are, thus the number of intermediates in their optimized diagrams decreases in the order of (gs|dd), (gs| f p) and (gs|gs). With these optimized digrams, we can straightforwardly generate the evaluation codes for the OS, DRK, and MD RRs by interpreting the diagrams (see the caption in Figure 3). Table 1: The optimization of MD-VRR diagrams. The numbers are the intermediates that have to be calculated Type s p d f g h i k l

4.2

Non-opt 1 5 15 35 70 126 210 330 495

Opt 1 5 15 33 63 108 173 261 381

Selection of Contraction Paths

An optimal implementation of contractions can significantly improve the evaluation efficiency. The contraction problem has been discussed in the famous PRISM algorithm 12 for segmented contraction. In

LIBRETA ,

we have generalized it to both segmented and general contracted basis

sets. For an ERI (ab|cd), we assume at the moment that 4 GTOs have the same contraction degree K and general contracted block size B. The cost estimation expression has the form

xK 4 + yK 4 B2 + zK 2 B4 + uK 2 B2 + vB4

20


(40)

Page 21 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


Table 2: The optimization of OS HRR diagrams. The numbers are the intermediates that have to be calculated Type ss ps pp ds dp dd fs fp fd ff gs gp gd gf gg

Non-opt 0 0 9 0 18 84 0 30 135 388 0 45 198 558 1269

Opt 0 0 9 0 18 79 0 30 129 334 0 45 191 488 1019

To understand this expression, one can refer to the program structure in Table 4: K 4 term is contributed from the calculations for all primitive GTO quartets; K 4 B2 and K 2 B4 from the first and second contraction, respectively; K 2 B2 and B4 from the transformations before and after the second contraction, respectively. The prefactors x, y, z, u, v of (pp|pp) are given in Table 5. As described in Section 3, for the OS and DRK method, we perform contractions immediately after VRRs. Assume the angular momenta of GTOs in (ab|cd) are all L, the cost of contractions immediately after VRRs (forming (L0|L0] ∼ (2L, 0|2L, 0] to (L0|L0) ∼ (2L, 0|2L, 0)) scales as L2 · NL2 (K 4 B2 + K 2 B4 ) ≈ L6 (K 4 B2 + K 2 B4 ). In contrast, the cost of contractions at last (forming [LL|LL] to (LL|LL)) is estimated to be NL4 (K 4 B2 + K 2 B4 ) ≈ L8 (K 4 B2 + K 2 B4 ), which confirms that performing contractions at the last step is almost always more expensive. This is also revealed in the PRISM algorithm 12 where MD-XXXXC paths are often used for ERIs where Kbra or Kket is 1 (in this case there is no contraction at all). In the MD-CXXXX methods, we can even carry out contractions at the very beginning of the evaluation subroutine. While this reduces the amount of work in contractions, the increasing z, u, v in Table 5 suggest that the additional indices in (28) or (32) impose extra burden on the following

21



Page 22 of 55

Table 3: The optimization of OS-VRR diagrams. The numbers are the intermediates that have to be calculated Type (ss|ss) (ps|ss) (ps|ps) (pp|ss) (pp|ps) (pp|pp) (ds|ss) (ds|ps) (ds|pp) (ds|ds) (d p|ss) (d p|ps) (d p|pp) (d p|ds) (d p|d p) (dd|ss) (dd|ps) (dd|pp) (dd|ds) (dd|d p) (dd|dd) ( f s|ss) ( f s|ps) ( f s|pp) ( f s|ds) ( f s|d p) ( f s|dd) ( f s| f s) ( f p|ss) ( f p|ps) ( f p|pp) ( f p|ds) ( f p|d p) ( f p|dd) ( f p| f s) ( f p| f p) ( f d|ss) ( f d|ps) ( f d|pp) ( f d|ds)

Non-opt 1 5 24 15 70 200 15 70 200 200 35 160 450 450 1000 70 315 875 875 1925 3675 35 160 450 450 1000 1925 1000 70 315 875 875 1925 3675 1925 3675 126 560 1540 1540

Opt 1 5 18 15 52 146 15 43 116 116 33 101 280 258 611 63 191 514 529 1188 2220 31 81 209 209 469 929 455 61 171 449 449 1018 2080 887 1910 106 300 829 837

Type ( f d|d p) ( f d|dd) ( f d| f s) ( f d| f p) ( f d| f d) ( f f |ss) ( f f |ps) ( f f |pp) ( f f |ds) ( f f |d p) ( f f |dd) ( f f | f s) ( f f | f p) ( f f | f d) (f f|f f) (gs|ss) (gs|ps) (gs|pp) (gs|ds) (gs|d p) (gs|dd) (gs| f s) (gs| f p) (gs| f d) (gs| f f ) (gs|gs) (gp|ss) (gp|ps) (gp|pp) (gp|ds) (gp|d p) (gp|dd) (gp| f s) (gp| f p) (gp| f d) (gp| f f ) (gp|gs) (gp|gp) (gd|ss) (gd|ps)

Non-opt 3360 6370 3360 6370 10976 210 924 2520 2520 5460 10290 5460 10290 17640 28224 70 315 875 875 1925 3675 1925 3675 6370 10290 3675 126 560 1540 1540 3360 6370 3360 6370 10976 17640 6370 10976 210 924

22

Opt 1916 3695 1565 3377 5894 171 477 1298 1286 2981 5833 2860 5552 9689 15153 57 136 336 336 769 1622 768 1575 2936 5122 1378 102 265 702 675 1664 3313 1605 3113 5835 9121 2451 4938 167 442


Type (gd|pp) (gd|ds) (gd|d p) (gd|dd) (gd| f s) (gd| f p) (gd| f d) (gd| f f ) (gd|gs) (gd|gp) (gd|gd) (g f |ss) (g f |ps) (g f |pp) (g f |ds) (g f |d p) (g f |dd) (g f | f s) (g f | f p) (g f | f d) (g f | f f ) (g f |gs) (g f |gp) (g f |gd) (g f |g f ) (gg|ss) (gg|ps) (gg|pp) (gg|ds) (gg|d p) (gg|dd) (gg| f s) (gg| f p) (gg| f d) (gg| f f ) (gg|gs) (gg|gp) (gg|gd) (gg|g f ) (gg|gg)

Non-opt 2520 2520 5460 10290 5460 10290 17640 28224 10290 17640 28224 330 1440 3900 3900 8400 15750 8400 15750 26880 42840 15750 26880 42840 64800 495 2145 5775 5775 12375 23100 12375 23100 39270 62370 23100 39270 62370 94050 136125

Opt 1132 1209 2633 5755 2699 5186 9493 15048 4046 8153 13571 258 677 1794 1821 4022 8875 3990 8385 14876 23457 6311 12719 21162 31903 382 981 2588 2880 6462 13084 6159 11633 22171 34977 9413 18977 31567 47570 67374

Page 23 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


Table 4: The program structure for the OS/DRK method For each bra exponent-pair For each ket exponent-pair Calculate [m0|n0] For each ket block Contract [m0|n0] to [m0|n0) Next ket block Next ket exponent-pair For each ket block // If any operations here For each bra block Contract [m0|n0) to (m0|n0) Next bra block Next ket block Next bra exponent-pair For each bra block For each ket block Transform (m0|n0) to (ab|n0) Transform (ab|n0) to (ab|cd) Next ket block Next bra block

// K 4 // K 4 B2

// K 2 B2 // K 2 B4

// B4 // B4

transformations. Therefore, when the ERI contains low contracted parts, late contractions are preferred. For ERIs over s and p GTOs, LIBRETA will calculate the cost of all methods using x, y, z, u, v and L, K, B, then selects the most efficient one for its evaluation. Table 5: The prefactors in the cost estimation expressions for (pp|pp) Scheme OS/DRK MD-CCTTT MD-CTCTT MD-CTTCT MD-CTTTC

x 630/1008 122 0 0 0

y 162 305 106 106 106

z 162 305 262 630 81

u 0 0 732 1698 2130

v 324 3370 1104 432 0

For ERIs of higher total angular momentum L, the choice of the OS or DRK method is more subtle. In Table 6, the computational cost (in terms of x in (40)) and memory requirement of the two methods for some types of ERIs are given. For (pp|pp) or (gs|ss), the OS method has better performance, in agreement with their relative costs. However, as L increases, the memory requirement of the OS method gets much more demanding. Such memory traffic becomes a serious

23



burden, decreasing the performance significantly. Indeed, for (dd|dd), ( f p| f p), and (gg|gg), although the OS method has smaller x, it is much less efficient than the DRK method. The latter requires a memory of only 16∼28% of the former. Actually, several studies reveal that in some cases minimizing the memory implementation instead of FLOPs can lead to better performance improvement. 66,67 Thus, in LIBRETA, calibrations similar to those in Table 6 have been carried out to determine whether the OS or DRK method is superior for each type of ERI. This guarantees that

LIBRETA

can select the most suitable method for all ERIs in practice. The preferred method

between the OS and DRK method for each type of ERI is given in 7. Table 6: The computational cost and resource required for the OS and DRK methods in LIBRETA Memorya Wall timeb (unit: s) OS DRK OS DRK OS DRK Ntest 630 1008 200 114 21.4 29.7 1000000 11750 18815 3675 1046 21.8 13.6 50000 10222 12860 3675 710 20.3 10.5 60000 235 252 70 36 8.6 13.5 1000000 17920 14865 6370 790 15.5 12.9 15000 80569 120169 28224 4257 16.2 16.3 5000 425588 769347 136125 21286 33.6 20.0 1000 a Number of intermediates that have to be stored. b The time was measured by N test calculations. x

(pp|pp) (dd|dd) ( f p| f p) (gs|ss) (gs| f d) (gd|gd) (gg|gg)

4.3

Design of the Program Structure

Optimization of RRs and contraction paths can reduce the theoretical computational cost significantly. However, the practical performance depends on not only the formulas but also specific computer architecture and how it is implemented. To achieve peak performance in practice, many technical details in the program structure have to be considered, which we will mention some here. The code generator is written in C++ and generates C++ codes of LIBRETA. Common subexpression elimination. The last three terms (which are center-independent) in OS-VRRs (10) and (11) can be shared by several intermediates, as shown in the calculation of

24


Page 24 of 55

Page 25 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


Table 7: Preferred method between the OS and DRK method for each type of ERI (ss|ss) (ps|ss) (ps|ps) (pp|ss) (pp|ps) (pp|pp) (ds|ss) (ds|ps) (ds|pp) (ds|ds) (d p|ss) (d p|ps) (d p|pp) (d p|ds) (d p|d p) (dd|ss) (dd|ps) (dd|pp) (dd|ds) (dd|d p) (dd|dd) ( f s|ss) ( f s|ps) ( f s|pp)

OS OS OS OS OS OS OS OS OS OS OS OS OS OS OS OS OS OS OS DRK DRK OS OS OS

( f s|ds) ( f s|d p) ( f s|dd) ( f s| f s) ( f p|ss) ( f p|ps) ( f p|pp) ( f p|ds) ( f p|d p) ( f p|dd) ( f p| f s) ( f p| f p) ( f d|ss) ( f d|ps) ( f d|pp) ( f d|ds) ( f d|d p) ( f d|dd) ( f d| f s) ( f d| f p) ( f d| f d) ( f f |ss) ( f f |ps) ( f f |pp)

OS OS DRK OS OS OS OS OS DRK DRK DRK DRK OS OS OS OS DRK DRK DRK DRK OS OS OS DRK

( f f |ds) ( f f |d p) ( f f |dd) ( f f | f s) ( f f | f p) ( f f | f d) (f f|f f) (gs|ss) (gs|ps) (gs|pp) (gs|ds) (gs|d p) (gs|dd) (gs| f s) (gs| f p) (gs| f d) (gs| f f ) (gs|gs) (gp|ss) (gp|ps) (gp|pp) (gp|ds) (gp|d p) (gp|dd)

DRK DRK OS DRK DRK OS OS OS OS DRK DRK DRK DRK DRK DRK DRK DRK DRK OS OS DRK DRK DRK DRK

(gp| f s) (gp| f p) (gp| f d) (gp| f f ) (gp|gs) (gp|gp) (gd|ss) (gd|ps) (gd|pp) (gd|ds) (gd|d p) (gd|dd) (gd| f s) (gd| f p) (gd| f d) (gd| f f ) (gd|gs) (gd|gp) (gd|gd) (g f |ss) (g f |ps) (g f |pp) (g f |ds) (g f |d p)

DRK DRK DRK OS DRK DRK OS OS DRK DRK DRK OS DRK DRK OS DRK DRK OS OS OS OS DRK DRK DRK

(g f |dd) (g f | f s) (g f | f p) (g f | f d) (g f | f f ) (g f |gs) (g f |gp) (g f |gd) (g f |g f ) (gg|ss) (gg|ps) (gg|pp) (gg|ds) (gg|d p) (gg|dd) (gg| f s) (gg| f p) (gg| f d) (gg| f f ) (gg|gs) (gg|gp) (gg|gd) (gg|g f ) (gg|gg)

OS DRK OS OS DRK DRK DRK DRK DRK OS DRK OS DRK OS OS DRK DRK DRK DRK DRK DRK DRK DRK DRK

( f d|d p) by the singly and doubly underlined terms of (41) to (43): [(1, 0, 5)0|(1, 0, 5)0]0 =XQC [(1, 0, 5)0|(0, 0, 5)0]0 +

ξ XPQ [(1, 0, 5)0|(0, 0, 5)0]1 ξ +η

(41)

1

+ [(0, 0, 5)0|(0, 0, 5)0] /2(ξ + η) ξ XPQ [(0, 0, 6)0|(1, 0, 4)0]1 ξ +η + [(0, 0, 6)0|(0, 0, 4)0]0 − [(0, 0, 6)0|(0, 0, 4)0]1 /2η

[(0, 0, 6)0|(2, 0, 4)0]0 =XQC [(0, 0, 6)0|(1, 0, 4)0]0 +

ξ ZPQ [(0, 0, 6)0|(0, 0, 5)0]1 ξ +η + 5 [(0, 0, 6)0|(0, 0, 4)0]0 − [(0, 0, 6)0|(0, 0, 4)0]1 /2η

(42)

[(0, 0, 6)0|(0, 0, 6)0]0 =ZQC [(0, 0, 6)0|(0, 0, 5)0]0 +

+ 6[(0, 0, 5)0|(0, 0, 5)0]1 /2(ξ + η)

25


(43)


Page 26 of 55

In this case, common subexpression elimination (CSE) technique 68 can be applied. These common subexpressions may not be recognizable by the compilers due to their complexity, thus CSE is explicitly done by backward search during the code generation. CSE gives some performance improvement, for example, reducing the wall time of calculating ( f d|d p) by 4∼6%. The improvement is limited since that each common term appears at most 3 times. Memory access. Memory access is another important issue. The essential part of memory access optimization is to increase the cache hit rate, which can significantly improve the performance of numerical computing. To achieve this, data reference patterns in the codes should exhibit locality feature. 69 For the evaluation of ERIs, the number of intermediates is large thus reasonable memory access is critical. This program design can be demonstrated through the calculation of (gp|gs) by (12) from (gs|gs) and (hs|gs). Assume the ERIs are “row-major” ordered (all indices/addresses start at 0 in line with the C array convention): that is, if the index of a in La -shell is xa , then the index of ab and (ab|cd) is xab = xa NLb + xb and xab NLc NLd + xcd , respectively. There are two computing orders, or memory access patterns, to carry out this HRR (see Figure 8): computing the entire (gp| shell for every component of |gs) (Pattern A) or computing every component of (gp| for the entire |gs) shell (Pattern B). Although Pattern A looks more natural, however, it “wanders” in discontinuous regions of the memory; in contrast, Pattern B shows more locality in memory access, thus it is considered to be more efficient than Pattern A. In fact, a benchmark in Table 8 suggests that as the HRR codes become larger, Pattern A has an increasing cache miss while that of Pattern B is relatively stable and smaller. Its impact on the performance is obvious: Pattern B is more efficient than Pattern A in the sense of the wall time by about 10∼20%. For every part of cache optimization is taken into account.

26


LIBRETA ,

the

Page 27 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


Figure 8: Two memory access patterns. The numbers in paratheses are xab or xcd . Table 8: The impact of two Bra HRR memory access patterns on the performance of ERI evaluation codes Cache missesa Wall timeb (unit: s) Type Pattern A Pattern B Pattern A Pattern B (f f|f f) 3.3% 3.4% 4.3 3.8 (g f | f f ) 3.8% 3.4% 6.1 5.3 (g f |g f ) 4.0% 3.7% 8.8 7.7 (gg| f f ) 5.4% 3.8% 11.7 9.1 (gg|g f ) 5.4% 4.0% 15.6 12.2 (gg|gg) 5.7% 3.8% 27.1 21.1 a The cache miss were measured with Valgrind (http://valgrind.org/). b For each ERI type, the time were measured by 1000 calculations.

5 Performance of LIBRETA In the following, all the calculations were performed using Cartesian GTOs with one CPU core on the same machine.

27



5.1

Comparison with LIBINT 2

We examined the performance of LIBRETA by calculating all ERIs (8-fold permutation symmetry was used) of some molecules with popular basis sets and compared it with LIBINT 2. The basis sets tested included: the highly-contracted Pople’s STO-6G, 70,71 6-31G(d) and 6-311G(2d, 2p); 72,73 segmented Karlsruhe basis set def2-XVP, 74 Dunning 75,76 and Peterson’s 77–79 general contracted correlation consistent basis sets (aug)-cc-pVXZ (including ECP-VXZ). We selected 4 molecules: organic molecule C6 H6 ; inorganic molecules containing heavy elements SF6 and XeO3 ; metal cluster Au4 . Note that for the last 3 molecules, the basis functions usually have considerably larger contraction degrees than those for C, H, O that often appear in small organic molecules. The results are given in Table 9. Our LIBRETA outperformed LIBINT 2 for most cases, being faster than LIBINT 2

by 7.2∼912.0% for C6 H6 and SF6 . For ECP-VXZ which may contain general contracted

d GTOs with K = 10, LIBRETA exhibited excellent performance, being 20 to 30 times faster. Table 9: Performance (unit: s) of LIBRETAa NBFb LIBINT 2 LIBRETA NBF LIBINT 2 LIBRETA C6 H6 SF6 STO-6G 36 40.4 19.8(104.0%) 39 38.9 17.5(122.3%) 6-31G(d) 102 16.3 9.2(77.2%) 137 23.6 16.8(40.5%) 6-311G(2d,2p) 228 74.6 49.7(50.1%) 211 52.2 32.3(61.6%) def2-SVP 120 11.5 7.0(64.3%) 109 7.1 5.6(26.8%) def2-TZVP 252 75.8 52.1(45.5%) 258 69.4 50.8(36.6%) def2-QZVP 642 1101.4 1027.4(7.2%) 518 598.9 505.7(18.4%) cc-pVDZ 120 52.5 11.3(364.6%) 109 93.1 9.2(912.0%) cc-pVTZ 300 186.6 67.1(78.1%) 249 228.4 40.6(462.6%) cc-pVQZ 630 1176.1 788.0(49.3%) 494 885.5 357.8(147.5%) aug-cc-pVDZ 204 119.2 32.1(271.3%) 179 158.2 20.2(683.2%) aug-cc-pVTZ 480 525.3 279.2(88.1%) 389 471.5 136.0(246.7%) aug-cc-pVQZ 960 4477.1 3959.8(13.1%) 739 2571.2 1565.7(64.2%) XeO3 Au4 ECP-VDZc 70 13.1 2.5(424.0%) 204 687.6 32.1(2042.1%) c ECP-VTZ 150 49.1 10.3(376.7%) 316 7732.3 301.2(2467.2%) ECP-VQZc 293 216.6 74.7(190.0%) 540 71348.0 2056.5(3369.4%) a All the calculations were performed with AMD Opteron(tm) Processor 6134. The number in parentheses are the efficiency improvement. b NBF = number of basis functions. c For XeO and Au , only the valence basis part was considered. 3 4 Basis set

28


Page 28 of 55

Page 29 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


The performance improvement comes from two aspects: (1) method; (2)

LIBRETA

LIBINT 2

LIBINT 2

implements the OS

can use the DRK method for ERIs of high L which is much more efficient.

is designed for segmented contracted basis sets, thus for ERIs over a set of general

contracted GTOs, VRRs for the primitive GTO quartets have to be repeated for about B4 times, and the following contractions scale as K 4 B4 ; using

LIBRETA ,

the VRRs needs to be carried out

only once, and the contractions scale as K 4 B2 + K 2 B4 (see the program structure in Table 4). Thus for general contracted basis sets like aug-cc-pVXZ or ECP-VXZ, our library exhibited much better performance. This can also be demonstrated from some calculations where each method was disabled in LIBRETA ,

and the results are given in Table 10. It is observed that without any one method, all

the calculations will take long wall time. The DRK method is particularly important for basis sets containing GTOs of high angular momenta. An example is that without the DRK method, LIBRETA requires additional computation time of more than 200 seconds for aug-cc-pVQZ. Without the OS or MD-CXXXX method, the calculations of ERIs over GTOs of low angular momenta or high contraction degrees are also less efficient. Therefore, we confirm that LIBRETA can select the best method during the evaluation of ERIs.

5.2

Comparison with NWC HEM

Finally, the performance of

LIBRETA

is examined in real quantum chemical calculations. Thus,

we wrote a HF program using simple direct SCF strategy, i.e. recomputing the ERIs when required without disk usage. By “simple” we mean that we did not use point group symmetry or incremental Fockian formation, etc. They were also turned off in NWC HEM, since these techniques are out of the scope of

LIBRETA .

We calibrated the HF performance of our program and

NWC HEM 55 by comparing the average time of one SCF iteration. Since the evaluation of ERIs in an SCF calculation dominates, the calibration should reflect the performance of the ERI evaluation ability to a good extent. Results are given in Table 11. Again,

LIBRETA

exhibits excellent perfor-

mance, especially for S4 with general contracted basis sets, LIBRETA was faster than NWC HEM by 29



Table 10: Performance (unit: s) when one method is removeda Basis set NBFb LIBRETA without OSc without DRKd without MD-CXXXXe STO-6G 39 17.5 17.6 17.7 19.3 6-31g(d) 137 16.8 17.9 17.6 18.0 6-311g(2d,2p) 211 32.3 34.1 32.7 32.4 def2-SVP 109 5.6 5.7 5.6 5.9 def2-TZVP 258 50.8 53.3 53.7 50.6 def2-QZVP 518 505.7 510.9 567.8 514.7 cc-pVDZ 109 9.2 9.0 9.3 9.8 cc-pVTZ 249 40.6 41.5 43.1 41.4 cc-pVQZ 494 357.8 361.2 389.3 360.8 aug-cc-pVDZ 179 20.2 21.1 20.1 20.6 aug-cc-pVTZ 389 136.0 138.4 156.3 137.4 aug-cc-pVQZ 739 1565.7 1582.1 1789.0 1600.4 a All the calculations were performed for SF with AMD Opteron(tm) Processor 6134. 6 b NBF = number of basis functions. c OS is replaced by the DRK algorithm. d DRK is replaced by the OS algorithm. e MD-CXXXX is replaced by the OS algorithm.

144.2∼495.9%. This feature is important for the highly accurate electron correlation calculations of large molecules. Table 11: Average Time for One SCF Iteration (unit: s) of NWC HEM and LIBRETAa . All calculations converged to the same energy NBFb NWC HEM LIBRETA NBFb NWC HEM LIBRETA CH3 CHO S4 def2-TZVP 132 7.3 7.4(−1.4%) 168 23.5 20.5(14.6%) def2-QZVP 356 250.0 209.7(19.2%) 344 328.1 262.7(24.9%) cc-pVTZ 165 13.8 14.1(−2.1%) 156 131.7 22.1(495.9%) cc-pVQZ 350 240.8 181.6(32.6%) 296 549.4 148.2(270.7%) aug-cc-pVTZ 265 86.3 75.2(14.7%) 236 225.7 66.1(241.5%) 535 1261.3 1136.0(11.0%) 436 1416.1 579.8(144.2%) aug-cc-pVQZ a All the calculations were carried out with AMD Opteron(tm) Processor 6134. The number in paratheses are the efficiency improvement. b NBF = number of basis functions. Basis set

To test the effect of integral screening, we performed HF calculations on the global minimum of the silicon cluster 80 Si16 , which is a heavy system with a large number of highly contracted basis functions. The performances with screening parameter τ = 10−8 and without screening are shown in Table 12. We can see that both NWC HEM and 30

LIBRETA

exhibited performance improvement.


Page 30 of 55

Page 31 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


In the cc-pVQZ calculation, the QQR screening (39) can reduce about 5.9 × 1010 small ERIs from 2.5 × 1011 ones. In practice, τ can be variable during the SCF procedure to achieve the best accuracy-cost ratio. We conclude that

LIBRETA

can become a useful and efficient library for

theoretical and computational chemists. Table 12: Average Time for One SCF Iteration (unit: s) of NWC HEM and LIBRETAa . All calculations converged to the same energy Screening: τ = 10−8 No screening b Basis set NBF NWC HEM LIBRETA NWC HEM LIBRETA cc-pVDZ 304 1222.6 1063.2(15.0%) 2724.9 1227.3(122.0%) cc-pVTZ 624 5334.0 4805.3(11.0%) 11317.0 5860.0(93.1%) cc-pVQZ 1184 55307.0 37135.9(48.9%) 59276.8 43252.1(37.0%) a All the calculations were carried out with AMD Opteron(tm) Processor 6134. The number in paratheses are the efficiency improvement (relative to NWC HEM). b NBF = number of basis functions.

6 Conclusion In this paper, we have developed a new library for the evaluation of ERIs using computerized optimization, LIBRETA. It contains codes of OS, DRK, and MD RRs that are highly optimized by tree-search. Given an ERI, LIBRETA will determine and use the most efficient scheme to calculate it. Calibrations revealed that for popular basis sets, LIBINT 2

than

LIBRETA

showed performance superior to

and NWC HEM. Especially for general contracted basis sets,

LIBINT 2

LIBRETA

could be faster

by 912.0%, and for Au4 was even 20 to 30 times faster; the direct SCF program

supported by LIBRETA could be faster than NWC HEM by 495.9%.

LIBRETA

is a black box library,

containing codes for the evaluation of not only ERIs, but also important one-electron integrals like overlap, multipole, and electron-nuclear attraction integrals. For other kinds of molecular integrals, like the ones over anti-Coulomb operator, 81 London orbitals 82 or Gaussian geminals, 83 one can try to optimize the recurrence relations and contraction paths by approaches in this paper and apply the automatic code generation technique.

LIBRETA

is expected to be a useful tool for theoretical

and computational chemists to efficiently develop and carry out quantum chemical algorithms. In

31



Page 32 of 55

future, we hope to generalize this computerized optimization methodology to their first and second derivatives and more kinds of molecular integrals.

Appendix A: Origin of Redundancies Now we want to figure out why there are redundancies in MD/OS-VRRs and HRRs but not in DRK-VRRs. The redundancies occur only for angular momentum indices like w but not for onedimensional indices like m or N, thus their origin lies in the process of constructing (L + 1)-shell by adding 1 to the components of L-shell. It is trivial to find out that p- and d-shell have to be constructed from the entire s- and p-shell, respectively. However, only a part of 6 components of d-shell is needed to construct the entire f -shell. This can be seen, for example, from grey nodes in Figure 4: [(1, 0, 1)]2 and [(0, 1, 1)]2 of [2]2 are not used to calculate [3]1 . An elegant way to visualize this scenario is shown in Figure 9: the angular momenta are represented by lattice points in R3 , and each component of d-shell (blue points) can be used to obtain 3 components of f -shell (green points) by adding 1 (yellow tetrahedrons) to its x, y and z component. One can see that 4 components of d-shell are already sufficient to construct the entire f -shell. Generally, ML components of L-shell are sufficient to construct the entire (L + 1)-shell, where 3 1 ML = L(L + 4) + 1 + (L mod 2) 4 4

(44)

which will be proved later. We can see that ML /NL ≈ 1/2, thus for large L only about 50% of the L-shell are needed in the RRs. Therefore, we conclude that the redundancies originate from the fact that for the RRs in constrained indices w or m, n, the next set of quantities is able to be constructed from only a subset of the current one. Note that the “constraint” like wx + wy + wz = Lw (thus the points form a triangle) is essential: without constraints, there will be no redundancies. Now we derive (44), i.e. at most ML components of L-shell are sufficient to construct the entire (L +1)-shell. See Figure 9, now the green and blue points represent the components of (L +1)- and L-shell, respectively. We can see that two lines of green points have to be constructed by L + 1 blue 32


Page 33 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


Figure 9: The origin of redundancies. In the Figure, the d- and f -shell are represented by blue and green points in R3 . The tetrahedrons illustrate the construction procedure. For example, the (1, 1, 0) component of d-shell can be used to derive (2, 1, 0), (1, 2, 0) and (1, 1, 1). In this way, 4 blue points are sufficient to construct 10 green ones, making (1, 0, 1) and (1, 1, 0) redundant (see also Figure 4). ones. The remaining green points constitute a “shifted-(L − 1)-shell”, which can be constructed from ML−2 blue ones, thus we have a RR:

ML = (L + 1) + ML−2

33


(45)


Page 34 of 55

with M0 = 1 and M1 = 3, then we have ML = (L + 1) + ML−2 = (L + 1) + (L − 1) + ML−4     (L + 1) + (L − 1) + · · · + 3 +M0 L is even  {z }  | L/2 terms =    (L + 1) + (L − 1) + · · · + 4 +M1 L is odd   {z }  | (L−1)/2 terms     1 (L + 4)L + 1 L is even = 4    1 (L + 5)(L − 1) + 3 L is odd 4 1 3 = L(L + 4) + 1 + (L mod 2) 4 4

(46)

which proves (44). Now it is possible to give an estimation of the upper bound of the least number of intermediates required in MD-VRRs. Given an ERI of total angular momentum L, the total number of intermediates is L L−n

X total =

1

∑ ∑ Nl = 24 (L + 1)(L + 2)(L + 3)(L + 4)

(47)

n=0 l=0

In generated codes, redundancies are removed, only X redundancy-free intermediates remaining to be calculated. Thus we define: f MD-VRR ≡

X redundancy-free X total

(48)

To estimate X redundancy-free , we note that for n < L, only Ml rather than Nl components of an l-shell are needed as proved above, thus we get L−1 L−n

X redundancy-free ≤

n=0 l=0 L−1 L−n

≤

∑∑

n=0 l=0

=

L

∑ ∑ Ml + ∑ Nl l=0

L 1 7 1 l(l + 4) + + ∑ (l + 1)(l + 2) 4 4 l=0 2

1 (L + 1)(L3 + 15L2 + 78L + 48) 48

34


(49)

Page 35 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


This is an upper bound because the redundancies due to the ones of higher l are not taken into account in (49). Finally, we obtain

f MD-VRR ≤

(L3 + 15L2 + 78L + 48) 2(L + 2)(L + 3)(L + 4)

(50)

In Table 13, the estimation (50) is compared with our optimized MD-VRR (see Table 1) as well as the work of Johnson et al. 11 We can see that the optimized RRs from both works satisfy the upper bound. Table 13: The examination of the upper bound in (50) f MD-VRR Type by this worka by Johnson et alb Upper bound by Eq. (50) s 1 1 1 p 1 1 1.183 d 1 0.953 1.133 f 0.943 0.905 1.057 g 0.886 0.889 0.988 h 0.857 0.859 0.930 i 0.824 0.824 0.883 k 0.791 0.794 0.844 l 0.770 0.766 0.812 m 0.740 0.784 n 0.718 0.761 L = 11 0.702 0.742 L = 12 0.679 0.725 L = 13 0.658 0.710 L = 14 0.639 0.697 L = 15 0.622 0.685 a The data are calculated from results in Table 1. b The data are estimated from results in Table I in Ref. 11 by “Present/Full”, since in that table they gave flop-costs rather than the numbers of intermediates. The true f MD-VRR should be a little larger.

Acknowledgement This material is based upon work supported by a Molecular Software Sciences Institute (MolSSI) Fellowship funded by the National Science Foundation under Grant No. 1547580. It has also been supported by the National Science Foundation (award CHE-1361586) and by CREST, Japan

35



Science and Technology Agency.

Supporting Information Available The evaluation of the Boys function and roots and weights of the R-type Rys polynomial. The memory requirement of

LIBRETA

for the OS-VRR and DRK-VRR. The definition of the well-

separatedness extent. The geometries of C6 H6 , SF6 , XeO3 , Au4 , CH3 CHO, and S4 . This material is available free of charge via the Internet at http://pubs.acs.org/.

References (1) Boys, S. F. Electronic Wave Functions. I. A General Method of Calculation for the Stationary States of Any Molecular System. Proc. R. Soc. Am. 1950, 200, 542–554. (2) Dupuis, M.; Rys, J.; King, H. F. Evaluation of Molecular Integrals over Gaussian Basis Functions. J. Chem. Phys. 1976, 65, 111–116. (3) McMurchie, L. E.; Davidson, E. R. One- and Two-electron Integrals Over Cartesian Gaussian Functions. J. Comput. Phys. 1978, 26, 218–231. (4) Pople, J. A.; Hehre, W. J. Computation of Electron Repulsion Integrals Involving Contracted Gaussian Basis Functions. J. Comput. Phys. 1978, 27, 161–168. (5) Hamilton, T. P.; Schaefer III, H. F. New Variations in Two-electron Integral Evaluation in the Context of Direct SCF Procedures. Chem. Phys. 1991, 150, 163–171. (6) Obara, S.; Saika, A. Efficient Recursive Computation of Molecular Integrals over Cartesian Gaussian Functions. J. Chem. Phys. 1986, 84, 3963–3974. (7) Obara, S.; Saika, A. General Recurrence Formulas for Molecular Integrals over Cartesian Gaussian Functions. J. Chem. Phys. 1988, 89, 1540–1559.

36


Page 36 of 55

Page 37 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


(8) Head-Gordon, M.; Pople, J. A. A Method for Two-electron Gaussian Integral and Integral Derivative Evaluation Using Recurrence Relations. J. Chem. Phys. 1988, 89, 5777–5786. (9) Gill, P. M. W.; Head-Gordon, M.; Pople, J. A. Efficient Computation of Two-electronrepulsion Integrals and Their nth-order Derivatives Using Contracted Gaussian Basis Sets. J. Phys. Chem. 1990, 94, 5564–5572. (10) Lindh, R.; Ryu, U.; Liu, B. The Reduced Multiplication Scheme of the Rys Quadrature and New Recurrence Relations for Auxiliary Function Based Two-electron Integral Evaluation. J. Chem. Phys. 1991, 95, 5889–5897. (11) Johnson, B. G.; Gill, P. M. W.; Pople, J. A. Exact and Approximate Solutions to the Onecenter McMurchie–Davidson Tree-search Problem. Int. J. Quantum Chem. 1991, 40, 809– 827. (12) Gill, P. M. W.; Pople, J. A. The Prism Algorithm for Two-electron Integrals. Int.J. Quantum Chem. 1991, 40, 753–772. (13) Ishida, K. General Formula Evaluation of the Electron-repulsion Integrals and the First and Second Integral Derivatives over Gaussian-type Orbitals. J. Chem. Phys. 1991, 95, 5198– 5205. (14) Ryu, U.; Lee, Y. S.; Lindh, R. An Efficient Method of Implementing the Horizontal Recurrence Relation in the Evaluation of Electron Repulsion Integrals Using Cartesian Gaussian Functios. Chem. Phys. Lett. 1991, 185, 562–568. (15) Lindh, R. The Reduced Multiplication Scheme of the Rys-Gauss Quadrature for 1st Order Integral Derivatives. Theor. Chem. Acc. 1993, 85, 423–440. (16) Gill, P. M. W. Molecular Integrals Over Gaussian Basis Functions. Adv. Quantum Chem. 1994, 25, 141–205.

37



(17) Ishida, K. ACE Algorithm for the Rapid Evaluation of the Electron-repulsion Integral Over Gaussian-type Orbitals. Int. J. Quantum Chem. 1996, 59, 209–218. (18) Yanai, T.; Ishida, K.; Nakano, H.; Hirao, K. New Algorithm for Electron Repulsion Integrals Oriented to the General Contraction Scheme. Int. J. Quantum Chem. 2000, 76, 396–406. (19) Nakai, H.; Kobayashi, M. New Algorithm for the Rapid Evaluation of Electron Repulsion Integrals: Elementary Basis Algorithm. Chem. Phys. Lett. 2004, 388, 50–54. (20) Makowski, M. Simple Yet Powerful Techniques for Optimization of Horizontal Recursion Steps in Gaussian-type Two-electron Integral Evaluation Algorithms. Int. J. Quantum Chem. 2007, 107, 30–36. (21) Flocke, N.; Lotrich, V. Efficient Electronic Integrals and Their Generalized Derivatives for Object Oriented Implementations of Electronic Structure Calculations. J. Comput. Chem. 2008, 29, 2722–2736. (22) Sandberg, J. A. R.; Rinkevicius, Z. An Algorithm for the Efficient Evaluation of Two-electron Repulsion Integrals over Contracted Gaussian-type Basis Functions. J. Chem. Phys. 2012, 137, 234105. (23) Pritchard, B. P.; Chow, E. Horizontal Vectorization of Electron Repulsion Integrals. J. Comput. Chem. 2016, 37, 2537–2546. (24) King, H. F.; Dupuis, M. Numerical Integration Using Rys Polynomials. J. Comput. Phys. 1976, 21, 144–165. (25) Yasuda, K. Two-electron Integral Evaluation on the Graphics Processor Unit. J. Comput. Chem. 2008, 29, 334–342. (26) Ufimtsev, I. S.; Martínez, T. J. Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-electron Integral Evaluation. J. Chem. Theory Comput. 2008, 4, 222–231.

38


Page 38 of 55

Page 39 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


(27) Asadchev, A.; Gordon, M. S. New Multithreaded Hybrid CPU/GPU Approach to Hartree– Fock. J. Chem. Theory Comput. 2012, 8, 4166–4176. (28) Miao, Y.; Merz, K. M. Acceleration of Electron Repulsion Integral Evaluation on Graphics Processing Units via Use of Recurrence Relations. J. Chem. Theory Comput. 2013, 9, 965– 976. (29) Yasuda, K.; Maruoka, H. Efficient Calculation of Two-electron Integrals for High Angular Basis Functions. Int. J. Quantum Chem. 2014, 114, 543–552. (30) Miao, Y.; Merz, K. M. Acceleration of High Angular Momentum Electron Repulsion Integrals and Integral Derivatives on Graphics Processing Units. J. Chem. Theory Comput. 2015, 11, 1449–1462. (31) Kindratenko, V.; Ufimtsev, I. S.; Martínez, T. J. Evaluation of Two-electron Repulsion Integrals over Gaussian Basis Functions on SRC-6 Reconfigurable Computer; Proceedings of the Reconfigurable Systems Summer Institute, 2008. (32) Nakamura, K.; Hatae, H.; Harada, M.; Kuwayama, Y.; Uehara, M.; Sato, H.; Obara, S.; Honda, H.; Nagashima, U.; Inadomi, Y.; Murakami, K. Eric: A Special-Purpose Processor for ERI Calculations in Quantum Chemistry Applications; Proceedings of the 5th International Conference on High Performance Computing and Grid in Asia Pacific Region, 2002. (33) Ramdas, T.; Egan, G. K.; Abramson, D.; Baldridge, K. Converting Massive TLP to DLP: a Special-purpose Processor for Molecular Orbital Computations; Proceedings of the 4th International Conference on Computing Frontiers, 2007. (34) Jones, H. W. Computer-generated Formulas for Overlap Integrals of Slater-type Orbitals. Int. J. Quantum Chem. 1980, 18, 709–713. (35) Jones, H. W. Computer-generated Formulas for Two-center Coulomb Integrals over Slatertype Orbitals. Int. J. Quantum Chem. 1981, 20, 1217–1224. 39



(36) Jones, H. W. Computer-generated Formulas for Four-center Integrals over Slater-type Orbitals. Int. J. Quantum Chem. 1986, 29, 177–183. (37) Janssen, C. L.; Schaefer, H. F. The Automated Solution of Second Quantization Equations with Applications to the Coupled Cluster Approach. Theor. Chim. Acta 1991, 79, 1–42. (38) Li, X.; Paldus, J. Automation of the Implementation of Spin-adapted Open-shell Coupledcluster Theories Relying on the Unitary Group Formalism. J. Chem. Phys. 1994, 101, 8812– 8826. (39) Nooijen, M.; Lotrich, V. Towards a General Multireference Coupled Cluster Method: Automated Implementation of Open-shell CCSD Method for Doublet States. J. Mol. Struct. (TheoChem) 2001, 547, 253–267. (40) Hirata, S. Tensor Contraction Engine: Abstraction and Automated Parallel Implementation of Configuration-interaction, Coupled-cluster, and Many-body Perturbation Theories. J. Phys. Chem. A 2003, 107, 9887–9897. (41) Hirata, S. Higher-order Equation-of-motion Coupled-cluster Methods. J. Chem. Phys. 2004, 121, 51–59. (42) Hirata, S. Automated Symbolic Algebra for Quantum Chemistry. J. Phys. 2006, 46, 249. (43) Auer, A. A.; Baumgartner, G.; Bernholdt, D. E.; Bibireata, A.; Choppella, V.; Cociorva, D.; Gao, X.; Harrison, R.; Krishnamoorthy, S.; Krishnan, S.; Lam, C.-C.; Lu, Q.; Nooijen, M.; Pitzer, R.; Ramanujam, J.; Sadayappan, P.; Sibiryakov, A. Automatic Code Generation for Many-body Electronic Structure Methods: The Tensor Contraction Engine. Mol. Phys. 2006, 104, 211–228. (44) Hartono, A.; Lu, Q.; Henretty, T.; Krishnamoorthy, S.; Zhang, H.; Baumgartner, G.; Bernholdt, D. E.; Nooijen, M.; Pitzer, R.; Ramanujam, J.; Sadayappan, P. Performance Optimiza-

40


Page 40 of 55

Page 41 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


tion of Tensor Contraction Expressions for Many-Body Methods in Quantum Chemistry. J. Phys. Chem. A 2009, 113, 12715–12723. (45) Strange, R.; Manby, F.; Knowles, P. Automatic Code Generation in Density Functional Theory. Comput. Phys. Commun. 2001, 136, 310–318. (46) Sałek, P.; Hesselmann, A. A Self-contained and Portable Density Functional Theory Library for Use in Ab Initio Quantum Chemistry Programs. J. Comput. Chem. 2007, 28, 2569–2575. (47) Ekström, U.; Visscher, L.; Bast, R.; Thorvaldsen, A. J.; Ruud, K. Arbitrary-order Density Functional Response Theory from Automatic Differentiation. J. Chem. Theory Comput. 2010, 6, 1971–1980. (48) Knizia, G.; Li, W.; Simon, S.; Werner, H.-J. Determining the Numerical Stability of Quantum Chemistry Algorithms. J. Chem. Theory Comput. 2011, 7, 2387–2398. (49) Bracken, P.; Bartlett, R. J. Calculation of Gaussian Integrals Using Symbolic Manipulation. Int. J. Quantum Chem. 1997, 62, 557–570. (50) Gomez, C.; Scott, T. Maple Programs for Generating Efficient FORTRAN Code for Serial and Vectorised Machines. Comput. Phys. Commun. 1998, 115, 548 – 562. (51) Titov, A. V.; Ufimtsev, I. S.; Luehr, N.; Martínez, T. J. Generating Efficient Quantum Chemistry Codes for Novel Architectures. J. Chem. Theory Comput. 2013, 9, 213–221. (52) Song, C.; Wang, L.-P.; Martínez, T. J. Automated Code Engine for Graphical Processing Units: Application to the Effective Core Potential Integrals and Gradients. J. Chem. Theory Comput. 2016, 12, 92–106. (53) Mazur, G.; Makowski, M.; Łazarski, R.; Włodarczyk, R.; Czajkowska, E.; Glanowski, M. Automatic Code Generation for Quantum Chemistry Applications. Int. J. Quantum Chem. 2016, 116, 1370–1381.

41



(54) Valeev, E. F. Libint library. see “https://github.com/evaleev/libint”. Access date: 01.07.2016. (55) Valiev, M.; Bylaska, E.; Govind, N.; Kowalski, K.; Straatsma, T.; Dam, H. V.; Wang, D.; Nieplocha, J.; Apra, E.; Windus, T.; de Jong, W. NWC HEM: A Comprehensive and Scalable Open-source Solution for Large Scale Molecular Simulations. Comput. Phys. Commun. 2010, 181, 1477–1489. (56) Dunning, T. H. Gaussian Basis Functions for Use in Molecular Calculations. I. Contraction of (9s5p) Atomic Basis Sets for the First - Row Atoms. J. Chem. Phys. 1970, 53, 2823–2833. (57) Hehre, W. J.; Ditchfield, R.; Pople, J. A. Self-Consistent Molecular Orbital Methods. XII. Further Extensions of Gaussian-Type Basis Sets for Use in Molecular Orbital Studies of Organic Molecules. J. Chem. Phys. 1972, 56, 2257–2261. (58) Raffenetti, R. C. General Contraction of Gaussian Atomic Rrbitals: Core, valence, Polarization, and Diffuse Basis Sets; Molecular Integral Evaluation. J. Chem. Phys. 1973, 58, 4452–4458. (59) Widmark, P.-O.; Malmqvist, P.-Å.; Roos, B. O. Density Matrix Averaged Atomic Natural Orbital (ANO) Basis Sets for Correlated Molecular Wave Functions. Theor. Chim. Acta 1990, 77, 291–306. (60) Helgaker, T.; Jøprgensen, P.; Olsen, J. Molecular Electronic-structure Theory; John Wiley & Sons Ltd, 2000. (61) Press, W. H.; Teukolsky, S. A.; Vetterling, W. T.; Flannery, B. P. Numerical Recipes: The Art of Scientific Computing, Third Edition; Cambridge University Press, 2007. (62) Rys, J.; Dupuis, M.; King, H. F. Computation of Electron Repulsion Integrals Using the Rys Quadrature Method. J. Comput. Chem. 1983, 4, 154–157. (63) Whitten, J. L. Coulombic Potential Energy Integrals and Approximations. J. Chem. Phys. 1973, 58, 4496–4501. 42


Page 42 of 55

Page 43 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


(64) Maurer, S. A.; Lambrecht, D. S.; Flaig, D.; Ochsenfeld, C. Distance-dependent Schwarzbased Integral Estimates for Two-electron Integrals: Reliable Tightness vs. Rigorous Upper Bounds. J. Chem. Phys. 2012, 136, 144107. (65) White, C. A.; Johnson, B. G.; Gill, P. M.; Head-Gordon, M. The Continuous Fast Multipole Method. Chem. Phys. Lett. 1994, 230, 8–16. (66) Frisch, M. J.; Johnson, B. G.; Gill, P. M.; Fox, D. J.; Nobes, R. H. An Improved Criterion for Evaluating the Efficiency of Two-electron Integral Algorithms. Chem. Phys. Lett. 1993, 206, 225–228. (67) Johnson, B. G.; Gill, P. M.; Pople, J. The Efficient Transformation of (m0|n0) to (ab|cd) Two-Electron Repulsion Integrals. Chem. Phys. Lett. 1993, 206, 229–238. (68) Potkonjak, M.; Srivastava, M. B.; Chandrakasan, A. P. Multiple Constant Multiplications: Efficient and Versatile Framework and Algorithms for Exploring Common Subexpression Elimination. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 1996, 15, 151–165. (69) Kowarschik, M.; Weiß, C. An Overview of Cache Optimization Techniques and Cache-Aware Numerical Algorithms; Algorithms for Memory Hierarchies: Advanced Lectures, 2003. (70) Hehre, W. J.; Stewart, R. F.; Pople, J. A. Self-Consistent Molecular-Orbital Methods. I. Use of Gaussian Expansions of Slater-Type Atomic Orbitals. J. Chem. Phys. 1969, 51, 2657–2664. (71) Hehre, W. J.; Ditchfield, R.; Stewart, R. F.; Pople, J. A. Self-Consistent Molecular Orbital Methods. IV. Use of Gaussian Expansions of Slater-Type Orbitals. Extension to Second-Row Molecules. J. Chem. Phys. 1970, 52, 2769–2773. (72) McLean, A. D.; Chandler, G. S. Contracted Gaussian Basis Sets for Molecular Calculations. I. Second Row Atoms, Z=11-18. J. Chem. Phys. 1980, 72, 5639–5648. (73) Francl, M. M.; Pietro, W. J.; Hehre, W. J.; Binkley, J. S.; Gordon, M. S.; DeFrees, D. J.;

43



Pople, J. A. Self-consistent Molecular Orbital Methods. XXIII. A Polarization-type Basis Set for Second-row Elements. J. Chem. Phys. 1982, 77, 3654–3665. (74) Weigend, F.; Ahlrichs, R. Balanced Basis Sets of Split Valence, Triple Zeta Valence and Quadruple Zeta Valence Quality for H to Rn: Design and Assessment of Accuracy. Phys. Chem. Chem. Phys. 2005, 7, 3297–3305. (75) Dunning, T. H. Gaussian Basis Sets for Use in Correlated Molecular Calculations. I. The Atoms Boron through Neon and Hydrogen. J. Chem. Phys. 1989, 90, 1007–1023. (76) Woon, D. E.; Dunning, T. H. Gaussian Basis Sets for Use in Correlated Molecular Calculations. III. The Atoms Aluminum through Argon. J. Chem. Phys. 1993, 98, 1358–1371. (77) Peterson, K. A.; Figgen, D.; Goll, E.; Stoll, H.; Dolg, M. Systematically Convergent Basis Sets with Relativistic Pseudopotentials. II. Small-core Pseudopotentials and Correlation Consistent Basis Sets for the Post-d Group 16-18 Elements. J. Chem. Phys. 2003, 119, 11113– 11123. (78) Peterson, K. A.; Puzzarini, C. Systematically Convergent Basis Sets for Transition Metals. II. Pseudopotential-based Correlation Consistent Basis Sets for the Group 11 (Cu, Ag, Au) and 12 (Zn, Cd, Hg) Elements. Thero. Chem. Acc. 2005, 114, 283–296. (79) Figgen, D.; Rauhut, G.; Dolg, M.; Stoll, H. Energy-consistent Pseudopotentials for Group 11 and 12 Atoms: Adjustment to Multi-configuration Dirac–Hartree–Fock Data. Chem. Phys. 2005, 311, 227–244. (80) Yoo, S.; Zeng, X. C. Motif Transition in Growth Patterns of Small to Medium-sized Silicon Clusters. Angew. Chem. Int. Ed. 2005, 44, 1491–1494. (81) Klopper, W.; Röhse, R. Computation of Some New Two-electron Gaussian Integrals. Theor. Chim. Acta 1992, 83, 441–453.

44


Page 44 of 55

Page 45 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


(82) Tellgren, E. I.; Soncini, A.; Helgaker, T. Nonperturbative Ab Initio Calculations in Strong Magnetic Fields Using London Orbitals. J. Chem. Phys. 2008, 129, 154114. (83) Barca, G. M. J.; Gill, P. M. W. Two-Electron Integrals over Gaussian Geminals. J. Chem. Theory Comput. 2016, 12, 4915–4924.

45



Table of Content 42x21mm (300 x 300 DPI)


Page 46 of 55

Page 47 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


Figure 1 71x30mm (300 x 300 DPI)





Page 48 of 55

Page 49 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60





Figure 4 179x380mm (300 x 300 DPI)


Page 50 of 55

Page 51 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60







Page 52 of 55

Page 53 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60





Figure 8 131x101mm (300 x 300 DPI)


Page 54 of 55

Page 55 of 55 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60




libreta: Computerized Optimization and Code ... - ACS Publications

Recommend Documents