“Balancing” the Block Davidson–Liu Algorithm - ACS Publications

Jun 2, 2016 - “Balancing” the Block Davidson−Liu Algorithm. Robert M. Parrish,. †,‡. Edward G. Hohenstein,*,§,∥ and Todd J. Martínez. â€...
0 downloads 0 Views 2MB Size
Subscriber access provided by UNIV OF NEBRASKA - LINCOLN

Letter

A “balanced” block Davidson-Liu algorithm Robert M. Parrish, Edward G. Hohenstein, and Todd J. Martínez J. Chem. Theory Comput., Just Accepted Manuscript • DOI: 10.1021/acs.jctc.6b00459 • Publication Date (Web): 02 Jun 2016 Downloaded from http://pubs.acs.org on June 5, 2016

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Chemical Theory and Computation is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 9

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

A “balanced” block Davidson-Liu algorithm Robert M. Parrish,†,‡ Edward G. Hohenstein,∗,¶,§ and Todd J. Mart´ınez†,‡ † Department of Chemistry and the PULSE Institute, Stanford University, Stanford, CA 94305 ‡ SLAC National Accelerator Laboratory, Menlo Park, CA 94025 ¶ Department of Chemistry and Biochemistry, The City College of New York, New York, NY 10031 § Ph.D. Program in Chemistry, The Graduate Center of the City University of New York, New York, NY 10016 Received June 1, 2016; E-mail: [email protected]

Abstract: We describe a simple modification (“balancing”) of the block Davidson-Liu eigenvalue algorithm which allows the norms of the Krylov search directions to decrease naturally as convergence is approached. In the context of integraldirect configuration interaction singles and time-dependent density functional theory, this provides for efficient utilization of density-based screening. Tests within the TeraChem GPGPU code exhibit speedups of ∼ 2× on systems with up to 1500 atoms, with negligible loss in accuracy. Introduction: Iterative subspace methods for the determination of a few selected eigenvalues/eigenvectors of large matrices are ubiquitous in electronic structure theory. These methods have proven invaluable as they work entirely in terms of a handful of matrix-vector products per root – the matrix need never be explicitly computed or stored. One of the most elegant of these schemes for use with Hermitian indefinite matrices is the block Davidson-Liu (DL) algorithm. 1,2 DL allows for the simultaneous determination of several eigenvectors, and naturally targets the lowest eigenvalues within the eigenspectrum. DL has become the de facto standard for configuration interaction (CI) problems, in methods ranging from CI singles 3 (CIS) to full CI (FCI), including complete active space (CAS) approaches and timedependent density functional theory 4,5 (TD-DFT) variants. As with all generalized Krylov subspace algorithms, DL works by building up an iterative subspace representation of the full matrix. In each iteration, the Ritz estimates of the eigenvalue/eigenvector pairs are built by diagonalizing the subspace matrix. Orthogonal search directions are then produced from preconditioned residuals, and the subspace is expanded to include these new search vectors. DL normally uses orthonormal search directions, such that the subspace metric matrix is the identity. This implies that all search directions are accorded the same importance, regardless of how close the iterative procedure is to convergence. In this work, we show that a slight “balancing” modification of DL leads to significant performance gains in practical problems. This work was motivated by recent implementations of integral-direct Hartree-Fock 6–8 (HF) and CIS methods 9 (and their DFT/TD-DFT extensions) on general-purpose graphical processing unit (GPGPU) hardware, including density-based screening. 10 In practice, we have found that CIS exhibits significantly worse per-root computational performance than HF, even though both de-

scribe wavefunctions of the same quality (e.g., uncorrelated). This performance dichotomy stems from the ability of HF to take advantage of incremental Fock builds, 10 wherein only the difference Fock matrix is constructed in each iteration. As the difference Fock matrix becomes increasingly sparse near convergence, the incremental Hartree-Fock iterations accelerate significantly during the convergence process. In CIS, by contrast, the DL iterations decelerate during the convergence process, as the orthonormal DL correction vectors become increasingly dense as convergence is approached. Here we present the solution to this problem, which works with DL correction vectors which are orthogonal but not normalized, naturally providing for efficient density-based screening. In exact precision, the iterative histories of both the canonical and “balanced” DL variants will be identical, and in practice we find that the two procedures produce numerically identical results. However, the balanced DL procedure is found to be ∼ 2× faster than the canonical DL procedure, for a series of test calculations in CIS/TD-DFT on systems with up to 1500 atoms and 13000 basis functions. After submission of this manuscript, we became aware of a recently published paper by Furche and co-workers 11 that describes a similar procedure. The principle difference is that Furche and co-workers explicitly solve a (balanced) non-orthogonal generalized eigenvalue problem, whereas our approach solves a (balanced) orthogonal eigenvalue problem. The motivation and principal benefit of the two approaches are highly similar. We encourage the interested reader to consult the paper by Furche and co-workers, also. Canonical Davidson-Liu: One common variant of the canonical DL algorithm 1,2 is presented in Algorithm 1, largely following the notation of Leininger et al. 12 In this description, we ignore subtleties such as the order in which vectors are orthogonalized and subspace collapse procedures, as these do not bear on our main point. We have formally modified the algorithm to include consideration of the diagonal overlaps of the subspace basis vectors S˜II ≡ hbI |bI i, i.e., to allow for an orthogonal but non-normalized subspace basis. In the canonical DL algorithm, S˜II = 1 due to the orthonormalization steps 10 and 12. The formation of matrix-vector ˆ I i is the computational bottleneck. products |σI i ≡ H|b Balanced Davidson-Liu: To obtain our “balanced” DL (B-DL) procedure, Algorithm 1 is modified as follows: (1) Remove the normalization steps 9 and 12. (2) Explictly solve the generalized eigenvalue problem in step 3 by transforming to a unit-metric eigenvalue problem: ∗ ˜ ˜ K ˜JK = U ˜IK h ˜KI H˙ IJ U : U UKJ = δIJ −1/2 −1/2 ˙ IJ ≡S ˜ H II

ACS Paragon Plus Environment 1

˜ IJ S ˜ H JJ

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Since the subspace metric is diagonal, it is trivially inverted. The full eigenvector matrix is given as: ˜IJ ≡ C

ˆ {|bI i} : hbI |bJ i = procedure Davidson-Liu(NI , H, δIJ ) Form the subspace Hamiltonian: ˜ IJ ≡ hbI |H|b ˆ J i = hbI |σJ i H ˆ |σJ i≡H|bJ i

3:

Solve the subspace eigenproblem: ˜ K ˜ IJ C ˜JK = S˜II C ˜IK h H ˜

SII ≡hbI |bI i

4:

Obtain the NI Ritz approximations to the lowest ˜ I , and eigenvectors: eigenvalues hI ≈ h ∗ ˜ ˜ Ii ≡ C ˜JI |bJ i : C ˜KI ˜LJ = δIJ |ΨI i ≈ |Ψ SKL C

5:

|ri ≡

Form the NI residual vectors:

S ˆ ia,jb H cIjb = cIia ∆EI : cIia cJia = δIJ

?

7: 8:

Check norm(|rI i) ≤ ǫ ∼ 10−3 − 10−6 for the convergence of each root. for I in the non-converged roots do Form the preconditioned(a) correction vector: ˜ I )−1 |rI i ¯ −h |δI i ≡ −(H

9: 10:

Normalize |δI i. Orthogonalize to the current trial subspace: X |δI′ i ≡ |δI i − |bJ ihbJ |δI i/hbJ |bJ i J

11: 12: 13: 14: 15: 16: 17: (a)

if norm(|δI′ i)/norm(|δI i) > τ ∼ 10−3 then Normalize |δI′ i. Add |δI′ i to the trial subspace {|bI i}. end if end for Repeat from 2 until convergence is achieved in all NI roots. end procedure

¯ is defined to be either the exact or The preconditioner H ˆ II in DL. 1,12,14 approximate diagonal H

A rough justification for B-DL is as follows. Consider DL ˜ ≡ P ci |˜bi i which is expanded for a single eigenstate |Ψi i in terms of the Ni orthonormalized subspace trial vectors {|˜bi i}. Two key target quantites are the eigenvalue, XX ∗ ˜≡ ˆ ˜bj i h ci cj h˜bi |H| (1) i

j

(2)

(3)

where the singlet block of the CIS Hamiltonian is, S ˆ ia,jb H ≡ δij δab (ǫa − ǫi ) + 2(ia|jb) − (ij|ab)

(4)

Here i, j label occupied spatial orbitals, a, b label virtual spatial orbitals, ǫp is the orbital eigenvalue for the p-th molecular orbital, and (pq|rs) are the molecular-orbital ERIs in chemist’s notation. In practice, the rate-limiting ERI contributions are written first in terms of AO-basis integrals (µν|λσ), and thence in terms of generalized J and K Fock builds Jµν (Dλσ ) ≡ (µν|λσ)Dλσ and Kµν (Dλσ ) ≡ (µλ|νσ)Dλσ , viz., J σia ≡ 2(ia|jb)bjb = 2Cµi [(µν|λσ)Cλj bjb Cσb ] Cνa

≡ 2Cµi [Jµν (Dλσ ≡ Cλj bjb Cσb )] Cνa

(5)

and, K σia ≡ −(ij|ab)bjb = −Cµi [(µλ|νσ)Cλj bjb Cσb ] Cνa

≡ −Cµi [Kµν (Dλσ ≡ Cλj bjb Cσb )] Cνa

(6)

A heavily-optimized GPGPU-based integral-direct J and K build implementation lies at the heart of the TeraChem program. 6–9 One particularly important consideration in the present work is the density-based screening of Fock contributions based on the Cauchy-Schwarz bound. E.g., for J, we discard contributions below a tolerance θ according to: 10 p (7) |(µν|λσ)Dλσ | ≤ (µν|µν)(λσ|λσ)|Dλσ | ≤ θ If Dµν is provided with a magnitude of the same order of that seen in the physical system, then the error in the physical observables derived from the Fock contribution will be on the order of θ. It is at this point that B-DL enters: with B-DL, the elements of Dλσ will decrease as conver-

ACS Paragon Plus Environment 2

˜ Ψi ˆ ˜bi i − h| ˜ ci H|

Note that the target quantities depend quadratically and linearly, respectively, on ci . The ci tend to decrease geometrically with i as the DL procedure converges. In canonical ˆ ˜bi i is evaluated with the same DL, the quantity |˜ σi i ≡ H| absolute accuracy O(ǫ) for each i. Thus, as |ci | decreases with increasing i, the individual DL contributions become more accurate than O(ǫ), which represents wasted effort. In ˆ i |˜bi i B-DL, we endeavor to compute the quantity |σi i ≡ Hc with accuracy O(ǫ), which can be shown to yield errors of O(ǫ) for each i in the target quantities above. In principle, the exact ci cannot be known until after the formation of |σi i (via diagonalization of the new subspace Hamiltonian). In ′ B-DL, the main assumption is that |ci | ∼ k|δi−1 ik = k|bi ik, i.e., that the non-normalized DL search direction will enter the final eigenvector with weight |c′i | ∼ 1. This assumption is partially justified by virtue of the fact that DL can be viewed 1,12,14 as a quasi-Newton-Raphson method which converges quickly for the problem at hand, implying that the non-normalized search directions are “good” steps. A more detailed analysis is presented in the Supporting Information. Configuration Interaction Singles: A key opportunity for the deployment of B-DL is in single-excitation-type methods for excited states, e.g., configuration interaction singles (CIS) and time-dependent density functional theory (TD-DFT). For singlet excited states relative to a canonical restricted Hartree-Fock determinant in real orbitals, we have the symmetric CIS eigenproblem, 3

˜Ψ ˆ Ψ ˜ I i − h| ˜ Ii |rI i ≡ H| 6:

X i

Algorithm 1 Canonical Davidson-Liu algorithm for ˆ I i = hI |ΨI i : hΨI |ΨJ i = δIJ . In the input arguments, H|Ψ the desired number of roots is NI , the matrix to diagonalize ˆ and the ≥ NI orthonormal guess vectors are {|bI i}. is H,

2:

and the residual,

−1/2 ˜ SII U IJ

(3) For numerical stability and to enforce Hermitivity, we take the off-diagonal subspace Hamiltonian elements as ∗ ˜ IJ = H ˜ JI H from the matrix element hbI |σJ i or hbJ |σI i∗ where the σ-vector is constructed from the b-vector with the larger norm. This reduces sensitivity to presumably higher relative errors in σ-vectors built from b-vectors with small norms (e.g., due to more aggressive relative approximations in the σ-vectors built from small-norm b-vectors). 13

1:

Page 2 of 9

Page 3 of 9

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Page 4 of 9

Page 5 of 9

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

ACS Paragon Plus Environment

Page 6 of 9

Page 7 of 9

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Journal of Chemical Theory and Computation

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

ACS Paragon Plus Environment

Page 8 of 9

-2 -3 -4 -5

Idea:

H˜ IJC˜ JK = S˜ IJC˜ JK˜hK -6 ˜ SIJ = δIJ ⇒ Canonical-DL -7 ˜ SIJ = δIJS˜ II ⇒ Balanced-DL -8 0

5

10 15 CIS Iteration

20

25

25

Runtime Savings

Log10 Residual Norm [a.u.]

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

30

Journal of Chemical Theory and Computation

Identical Convergence/Properties

Wall Time per Sigma-Vector [s]

Page 9 of 9

20 15 10 5 0 0

ACS Paragon Plus Environment

Canonical Balanced 5

10 15 CIS Iteration

20

25