Recursive Factorization of the Inverse Overlap Matrix in Linear-Scaling

Jun 6, 2016 - We present a reduced complexity algorithm to compute the inverse overlap factors required to solve the generalized eigenvalue problem in...
2 downloads 6 Views 2MB Size
Article pubs.acs.org/JCTC

Recursive Factorization of the Inverse Overlap Matrix in LinearScaling Quantum Molecular Dynamics Simulations Christian F. A. Negre,*,† Susan M. Mniszewski,‡ Marc J. Cawkwell,† Nicolas Bock,† Michael E. Wall,‡ and Anders M. N. Niklasson*,† †

Theoretical Division and ‡Computer, Computational, and Statistical Sciences Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States S Supporting Information *

ABSTRACT: We present a reduced complexity algorithm to compute the inverse overlap factors required to solve the generalized eigenvalue problem in a quantum-based molecular dynamics (MD) simulation. Our method is based on the recursive, iterative refinement of an initial guess of Z (inverse square root of the overlap matrix S). The initial guess of Z is obtained beforehand by using either an approximate divide-and-conquer technique or dynamical methods, propagated within an extended Lagrangian dynamics from previous MD time steps. With this formulation, we achieve long-term stability and energy conservation even under the incomplete, approximate, iterative refinement of Z. Linear-scaling performance is obtained using numerically thresholded sparse matrix algebra based on the ELLPACK-R sparse matrix data format, which also enables efficient shared-memory parallelization. As we show in this article using self-consistent density-functional-based tight-binding MD, our approach is faster than conventional methods based on the diagonalization of overlap matrix S for systems as small as a few hundred atoms, substantially accelerating quantum-based simulations even for molecular structures of intermediate size. For a 4158-atom water-solvated polyalanine system, we find an average speedup factor of 122 for the computation of Z in each MD step.



INTRODUCTION Molecular dynamics (MD) simulations are widely used in computational materials science, chemistry, and biophysics.1 The predictive capabilities of MD simulations are strongly related to the accuracy of the method used to compute the interatomic forces. One of the most reliable, yet still computationally feasible, methods uses forces acting on classical nuclear degrees of freedom that are calculated from an underlying quantum mechanical description of the electronic degrees of freedom within the Born−Oppenheimer approximation. Such quantum-based MD (QMD) simulations provide a general and transferable approach that avoids problems of classical force-field methods2 and are applicable to a broad range of materials capturing charge transfer and bond formations in molecular reactions. However, simulating large systems with QMD is challenging because of a high computational cost and poor parallel scaling. Often, we are limited either to very small system sizes or to very short simulation times. QMD is still very far from simulating systems of the same complexity, as is possible with classical methods. However, thanks to recent advances in state-of-the-art linear scaling electronic structure theory, it is now possible to perform semiempirical self-consistent Born−Oppenheimer QMD simulations of systems with ∼104 atoms on a single shared-memory node over longer simulation times and with ab initio QMD simulations not too far behind.3−5 Systems of this size are sufficiently large for the study of completely new problems that previously were inaccessible to QMD. © XXXX American Chemical Society

Quantum-based Born−Oppenheimer MD requires the selfconsistent solution of a quantum mechanical eigenvalue problem at every time step. Solving the eigenvalue problem with traditional methods has a computational cost that scales as N3, where N is the total number of orbitals of the system (which is, in general, proportional to the number of atoms). If a self-consistent field (SCF) calculation is performed, then the eigenvalue problem has to be solved several times at each time step until the convergence of the electron density operator is reached. As the size of the system is increased, the calculation of the MD trajectory therefore rapidly becomes computationally intractable. Several reduced complexity methods to overcome the 6(N3) bottleneck have therefore been proposed in recent years.6,7 However, before some of the most efficient of these linear-scaling electronic structure methods can be applied, the quantum mechanical eigenvalue problem has to be transformed from a general nonorthogonal basis-set representation to an orthogonal “normal” form.8 This orthogonalization transform is the main focus of this article.6−8 The general representation of the quantum mechanical eigenvalue problem is given by [H ]β B = SBϵ

(1)

where [H]β is the matrix representation of the Hamiltonian in a nonorthogonal basis-set representation, B is the matrix Received: February 11, 2016

A

DOI: 10.1021/acs.jctc.6b00154 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation containing the coefficients for every solution Bi with eigenvalue ϵi, and S is the overlap matrix defined as Sij = ⟨ψβi |ψβj ⟩ in terms of the nonorthogonal set of basis functions, {ψβj }. To solve eq 1, the generalized eigenvalue problem is transformed to the orthogonalized normal form

[H ]α C = C ϵ

overlap factor from an ad hoc divide-and-conquer-like estimate (DC_EST) given from a graph-based approach for the evaluations of sparse matrix polynomials.25 Although the iterative refinement of the inverse overlap is a well-known method, the dynamical propagation of the congruence transformation matrix is a technique that, as far as we know, has never been applied before. However, it requires the solution of an additional and quite unexpected problem. To maintain matrix sparsity and linear scaling performance, we have found that an additional symmetrization condition needs to be enforced (at least partially). Apart from a detailed analysis, testing, and evaluation, the article is thus governed by three novel solutions and implementations: (1) a dynamical evolution of the congruence transformation that allows for a geometric integration, (2) a modified iterative refinement scheme that maintains matrix sparsity and linear scaling complexity over long simulation times, and (3) a novel approximate but fast calculation of the initial guess to the overlap matrix prior to the iterative refinement in the first MD time step. The article is organized as follows: We first explain each individual method and algorithms used to compose the full scheme developed in this work. Thereafter, we present the full combined framework and evaluate its performance and accuracy using liquid methane and water as model systems [see Supporting Information (SI) for a representation of the initial configuration of this system]. Comparison with traditional methodology where Z is calculated from a diagonalization (ZDIAG) is used as a reference. As a demonstration of this technique’s capability, we also perform MD simulations of a water-solvated polyalanine system containing 4158 atoms. In the Appendix section, we offer a brief background explanation of the congruence transform, that is, how the generalized eigenvalue problem is orthogonalized to a normal form.

(2)

by a congruence transformation, [H]α = ZT[H]βZ. Matrix Z is given by a factorization of inverse overlap matrix S−1 such that S−1 = ZZT. A commonly used approach is to apply the Löwdin decomposition9 where Z = S−1/2, which is computed from the diagonalization of overlap matrix S and the inverse square root of eigenvalue si of S, that is

Z = Us−1/2UT

(3)

where U contains the eigenvectors that diagonalize S such that S = UsUT. The diagonalization of S has an 6(N3) computational cost, which represents a major obstacle for large systems. Inverse factor Z is not unique. In fact, in general, there is an infinite number of Z that fulfill the requirement that ZTSZ = I, where I is the identity matrix. Another common method besides the Löwdin factorization is the inverse Cholesky factorization, although with the same 6(N3) scaling.8 In this article, we present a general scheme that computes the factors, Z, of the inverse overlap matrix at every step of an MD simulation using an iterative, thresholded sparse matrix algebra with a computational cost that scales only linearly, 6(N ), with the system size for sufficiently large problems. The converged factors, Z, are, in general, neither Löwdin nor inverse Cholesky factors, but some intermediate solution between the symmetric Löwdin factor and the highly asymmetric inverse Cholesky factor. The algorithm is based on Niklasson’s iterative refinement of an approximate inverse factorization of the overlap matrix.10 Although alternative 6(N ) algorithms exist,11−16 our approach here is tailored to QMD simulations and attempts to minimize overhead and systematic errors. In modern formulations of Born−Oppenheimer QMD,4,17−23 the number of SCF cycles is reduced to a minimum, and the inverse overlap factorization becomes a significant part of each force evaluation. The computational overhead of the congruence transformation therefore needs to be kept as small as possible. We also need to avoid breaking the time-reversal symmetry and preserve the long-term stability of the molecular trajectories. Here, we achieve this by introducing an approximate inverse factorization matrix, Z(t), as an auxiliary dynamical variable within a timereversible extended Lagrangian (XL) framework that allows a time-reversible geometrical integration (GEOM_INT) of Z(t). As we demonstrate later in this article, this approach provides an accurate initial estimate to the iterative refinement algorithm and avoids systematic error accumulation even for approximate convergence using only a single iterative refinement recursion in each time step. The computational cost scales linearly with the system size, and the wall-clock time can be kept low by using an efficient parallel sparse matrix algebra based on the ELLPACK-R format for matrix−matrix multiplications,3,24 which is a modified and more easily parallelizable form of the classical (serial) Gustavson algorithm using the compressed sparse row format. In the very first MD time step, we do not have access to an approximate inverse factorization from a previous time step. In this case, we construct the initial guess to the iterative refinement scheme from an approximate inverse



METHODOLOGY Our reduced complexity scheme for the inverse factorization of the overlap matrix in QMD simulations is composed of three essential subalgorithms: (1) the iterative refinement method (IT_REF), (2) the divide-and-conquer estimate (DC_EST) for the initial estimation of Z, and (3) the XL framework enabling a time-reversible geometric integration (GEOM_INT) that provides the long-term stability of the total energy. A global numerical threshold, τ, was used for all of the sparse matrix operations to achieve efficient 6(N ) scaling of the computational cost. The thresholding operations are performed after each sparse matrix addition, multiplication, and dense-to-sparse transformation. A numerical threshold allows tunable error control that can easily be adjusted depending on the characteristics of the electronic structure of the system and/ or the precision that the calculation requires.26−28 We have chosen the ELLPACK-R format for storing sparse matrices. A brief explanation of this format can be found in the Appendix section. To address the performance and scaling of our new algorithm in large MD simulations, we use a nonorthogonal self-consistent-charge tight-binding (SCC-TB) model of the underlying electronic structure (see the Electronic Structure section in the Appendix). However, our algorithms should be directly applicable to other methods using nonorthogonal local basis-set representations. To solve the electronic structure of the system, we have used both a direct 6(N3) diagonalization of the Hamiltonian (H) and the 6(N ) second-order spectral B

DOI: 10.1021/acs.jctc.6b00154 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

scaling with small overhead. Also, thanks to the XL framework, we avoid any systematic energy drift. DC_EST for the Initial Guess of Z0. In an MD simulation, we may use Z0 from the previous time step as an initial guess to the iterative refinement. This is possible if the overlap matrix changes only slightly (as demonstrated below with the δ error values) from one time step to the next. Figure 2 shows the

projection (SP2) algorithm, which is a fast, canonical, recursive Fermi-operator expansion method29,30 that has a computational cost that scales linearly with system size provided the density matrix is sparse. The SP2 method, together with the extended Lagrangian Born−Oppenheimer MD (XL-BOMD) method,23,31 is implemented in the LATTE electronic structure code,32 which is an implementation of density-functional-based tight-binding (DFTB) theory33−36 (see the Electronic Structure section in Appendix). This code was used as a framework to implement our proposed scheme using the general purpose (mio) parameter sets for DFTB calculations given at www.dftb.org. In all of the MD simulations performed in this work, we have used the XL-BOMD in the zero self-consistency field (0-SCF) limit, with only one density matrix evaluation per MD step.23,31 Iterative Refinement. An iterative refinement algorithm for the inverse overlap matrix factorization was introduced in refs 10 and 37. This algorithm applies consecutive transformations Xi = ZiTSZi

(4)

and m

Zi + 1 = Zi( ∑ ak Xik) k=0

Figure 2. Variation in δ by using Z from the previous time step [i.e., Z0(t) = Z(t − Δt)] for varying values of Δt. In this case, the simulation is stopped after 10 time steps, and Z0 is estimated from the previous step. A 50 CH4 molecule system was used for this calculation.

(5)

starting from initial estimate Z0, which has to satisfy ∥ZT0 SZ0 − I∥2 < 1. Coefficients {ak}mk=1 are determined to minimize the error, δ = ∥ZTi SZi − I∥2, such that 6(δ) → 6(δ n) after each iteration for the highest possible integer value of n. These sets of coefficients are given in ref 10 up to m = 5. For the present method, we have chosen m = 2, for which the error decays as δ3 after each iteration. The corresponding coefficients are a0 = 1.875, a1 = −1.25, and a3 = 0.375. In Figure 1, we show the

initial error, δ = ∥ZT0 SZ0 − I∥2 for Z0 = Z(t − Δt), as a function of integration time step Δt. Because the convergence requirement for the iterative refinement is δ < 1, we can see that the initial estimate from the previous MD step can be used for integration time steps Δt ≲ 2.0 fs. For these time steps, error δ is stable and increases linearly with the length of the step. The upper bound for Δt is sufficiently high because ensuring a good energy conservation and stability requires much smaller values of Δt. A reasonable integration time step, Δt, with the Verlet scheme used here is about 0.5 fs for a system containing H atoms at room temperature. For the very first MD step, however, we need to initialize the iterative refinement with a different guess. For the MD simulations considered in this article, we could, in principle, calculate Z0(t = 0) through a diagonalization of S because it is needed only in the very first time step. However, for large systems, this would represent a major computational cost and has to be avoided. An alternative was presented by Jansı ́k et al.40 who proposed a scaled identity matrix can be used as an initial estimate of Z0 for the iterative refinement algorithm. They showed that a value of λ can always be found such that ∥λS − I∥2 < 1; hence, it is appropriate to choose Z0 = λ I as an initial guess. However, in this work, we have chosen a different method where the initial Z0(t = 0) is estimated based on an ad hoc divide-and-conquer construction. This scheme does not provide a very accurate calculation of the inverse overlap factor. However, here it is a useful technique because the only requirement is that the approximate inverse overlap fulfills the convergence criterion that ∥ZT0 SZ0 − I∥2 < 1. Our divide-and-conquer technique can be seen as a coarse approximation of a graph-based operator expansion scheme25 or as an application of the hierarchical technique proposed by Rubensson et al.41 Even if we do not have a formal proof of guaranteed convergence, we have found that it works

Figure 1. Convergence of Zn in the iterative refinement scheme for m = 2 for sparse overlap matrix S (67% sparse) of a molecular system with 50 CH4 molecules.

convergence of δ for a sparse overlap matrix corresponding to a system of liquid methane with 50 CH4 molecules. Pseudocode detailing the implementation of this algorithm is presented as the subroutine IT_REF in Algorithm 1. Although there are many techniques to compute or to approximate the Z factor,13,38,39 our method allows us to employ a close initial guess from a previous time step of an MD simulation providing a fast convergence and an overall linear C

DOI: 10.1021/acs.jctc.6b00154 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

small, dense matrices Ž i such that the ith row of Z is given by ZiK(h) = Ž gi ih. By applying this method, we can compute reasonable approximation Z0 (such that ∥ZT0 SZ0 − I∥2 < 1) with 6(N ) performance. Moreover, this procedure is straightforward to parallelize. The scheme can be viewed as a special case of a graph-based operator expansion method,25 which is valid for general thresholded sparse matrix polynomials, but where we use a very coarse approximation of an unknown data dependency graph for a sparse operator expansion of S−1/2. Figure 3 gives a schematic representation of this procedure for the first row. This is then repeated for all of the seven rows. (See SI for a more detailed description of the algorithm.) In Table 1, the quality of our 6(N )Z0(t = 0) divide-andconquer estimate is quantified for several test systems. For each Z0(t = 0), the corresponding value of δ depends strongly on the type of system but not on its sparsity or system size. Instead, we can expect a strong dependence on the condition number of S. For large (highly accurate) Gaussian basis sets, this method may be less efficient compared to that for the localized minimal basis sets used here. The error of the first computed Z(t = 0) will be determined by the number of iterations performed with the iterative refinement method applied immediately after the DC-EST approach. The accuracy of the graph-based approach (DC-EST) does not need to be high, but the error needs to be lower than 1.0 only for the stability of the refinement algorithm. Extended Lagrangian Time-Reversible Geometrical Integration Scheme. A straightforward application to QMD simulation using previous optimized inverse factor Z(t − Δt) as an initial guess to the iterative refinement procedure may lead to systematic errors in Z unless a very high convergence is reached. In particular, we may create a significant systematic drift in the total energy if we minimize the computational cost using only a single step of our 6(N ) iterative refinement scheme with m = 2. The problem is equivalent to the SCF optimization procedure using the electron density or the wave functions from the previous time step as an initial guess.17,18,42

surprisingly well for all of the examples tested here (see Table 1). Table 1. δ = ∥ZT0 SZ0 − I∥2 Values for Different Systems Validating the Approximation for the Calculation of Z0(t = 0) Using a DC_ESTa system

N

sparsity

δ

50 CH4 100 CH4 (−CH2−)64 (−CH2−)192 100 H2O 200 H2O 7 ALA 119 ALA

400 800 768 2304 600 1200 175 2975

0.675 0.836 0.942 0.981 0.656 0.828 0.106 0.944

0.0767 0.0763 0.134 0.136 0.214 0.214 0.253 0.264

In these calculations, a threshold value of τ = 10−6 was used for overlap matrix S. a

Our divide-and-conquer approach to construct an estimate of inverse overlap factorization matrix Z0 is based on the extraction of principal submatrices corresponding to the nonzero elements of each row (or column) of thresholded overlap matrix S (or if necessary S2). Starting with a local node of orbital i that is connected to m other orbital nodes if Sij > 0 for m different values of j, we can construct a small, dense m × m principal submatrix Ši composed of the subset of Skl, for which all k and l are connected to i, that is, where all rows j and columns j of Sij are removed except for the ones where Sij > 0. Hence, we construct Šigh = Sk=K(g),l=K(h), where K is a function mapping indices to the m “neighbors” with nonvanishing overlap to i. Note that in this case g and h run from 1 to m because i has m neighbors and naturally there is also a gi that has i itself as an image, that is, where K(gi) = i. We then apply the Löwdin inverse square root decomposition for submatrix Sǐ and obtain a small m × m matrix Ž i = (Ši)−1/2. We then collect approximate full inverse square root matrix Z row by row from D

DOI: 10.1021/acs.jctc.6b00154 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

Figure 3. Schematic representation of the divide-and-conquer approach for the estimation of the initial Z0(t = 0) matrix. Here, ϕ1 ··· ϕ7 are the “nodes” (orbitals) of the corresponding associated graph. In this scheme, matrices s and s−1/2 correspond to Š1 and Ž 1, respectively.

The drift is caused by a broken time-reversal symmetry in an unphysical extrapolation of Z(t). The errors affect the forces and give rise to a systematic drift in the total energy. This phenomenon is illustrated by the red curve of Figure 4, where

L = L0 +

2 1 1 μTr[Z̃̇ ] − μω 2 Tr[(Z(t − Δt ) − Z̃)2 ] 2 2

(6)

where L0 is the Lagrangian of regular or extended Lagrangian Born−Oppenheimer MD. This gives us the Euler−Lagrange equations of motion for Z̃ . In an adiabatic limit and when μ → 0,23 we find that Z̃̈ = ω 2(Z(t − Δt ) − Z(̃ t ))

(7)

where the dots denote time derivatives. This equation can be integrated following a modified Verlet algorithm20,43,44 Z(̃ t + Δt ) = 2Z(̃ t ) − Z(̃ t − Δt ) + Δt 2Z̃̈ K

+ α ∑ CkZ(̃ t − k Δt )

(8)

k=0

Z(̃ t + Δt ) = 2Z(̃ t ) − Z(̃ t − Δt ) + Δt 2ω 2 K

(Z(t − Δt ) − Z(̃ t )) + α ∑ CkZ(̃ t − k Δt )

Figure 4. Time dependence of the total energy for QMD using our algorithm with (green) and without (red) the geometrical integration scheme. The result with typical diagonalization was added for comparison (black). The combination with the SP2 method was also added (blue). The system analyzed here consists of liquid methane with 50 CH4 molecular units. Here, the plots were shifted in the y axis for comparison.

k=0

(9)

which includes an additional damping term at the end that removes the accumulation of numerical noise.20,43,44 This dissipative term is tuned by small coupling constant α that removes numerical error fluctuations but without any significant modification of the microcanonical trajectories and the free energy.20,43,44 For this application, we have used K = 5 with the optimized coefficients: α = 0.0180, κ = Δt2ω2 = 1.82, C0 = −6.0, C1 = 14.0, C2 = −8.0, C3 = −3.0, C4 = 4.0, and C5 = −1.0. The propagated inverse overlap, Z̃ (t + Δt), is then used as the initial guess in the iterative refinement algorithm used to calculate an updated Z(t) (see full algorithm in Algorithm 1). In this way, the systematic bias from the broken time-reversal symmetry of an ad hoc extrapolation procedure is avoided, and the convergence requirement in the iterative refinement can be kept much lower, keeping the cost down. This is particularly important in modern formulations of Born−Oppenheimer

Z(t) is calculated in an MD simulation using only a single iterative refinement step (for m = 2) starting with Z(t − Δt) as an initial guess. Here, we address this systematic drift through a time-reversible geometric integration approach20,23,31,43,44 where we include approximate inverse overlap matrix Z̃ (t) as an auxiliary dynamical variable through a harmonic oscillator centered around the “exact” (iteratively optimized) inverse overlap, Z(t), within an XL scheme. The oscillator has frequency ω, and Z̃ (t) has fictitious mass μ. Thus, Z̃ (t) is included in the same way as the electronic degrees of freedom in XL-BOMD.23,31 The additional XL terms for Z̃ are given by E

DOI: 10.1021/acs.jctc.6b00154 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation MD4,17−23 where the number of required diagonalizations or density matrix constructions per MD step is kept to a minimum. The green curve of Figure 4 (ZSP) demonstrates how energy conservation is restored with our time-reversible geometrical integration scheme. Is Inverse Overlap Factor Z Really Sparse? The iterative refinement algorithm does not necessarily preserve the symmetry of the initial guess of Z0. In general, the iterative refinement scheme converges neither to the inverse square root nor to the inverse Cholesky factor of overlap matrix S. In principle, this should not be a problem as long as ∥ZTSZ − I∥2 → 0. However, even when a numerically thresholded iterative refinement scheme is used in combination with Z(t) propagated through the geometric XL formulation during an MD simulation, a systematic fill-in occurs of matrix elements above numerical threshold τ in Z(t). This leads to a drastic loss in the computational efficiency of the iterative refinement scheme because the computational cost of a sparse matrix− matrix multiplication scales quadratically with the average number of nonzero elements per row. This somewhat surprising phenomenon is shown for an MD simulation of liquid methane in Figure 5 (red curve), where any significant

1 ̃ T (Z(t + Δt ) + Z̃ (t + Δt )) (10) 2 By applying the symmetrization in eq 9, the approximate symmetry of refined optimized matrix Z(t) is maintained such that Z(t) ≈ ZT(t), and the loss of matrix sparsity is avoided during the MD trajectory as illustrated by the black curve in Figure 5. Furthermore, maintaining an approximate symmetry of Z(t) also gives rise to smaller values of δ (see SI). Full Algorithm Description. A pseudocode for our 6(N ) algorithm for the generation of the inverse overlap factor is given in Algorithm 1. An initial Z0 is estimated at the first MD step (t0) using the divide-and-conquer estimate (DC_EST). Subsequent Z0 matrices are generated from the previous time steps when i > 1 by applying the geometrical integration scheme (GEOM_INT) followed by the symmetrization step. The iterative refinement algorithm (IT_REF) is applied to Z0 to generate the final Z(ti) matrix. We have found that lmax = 1 is enough to ensure accuracy and energy stability; however, this value can be increased in the first MD steps if improvements are desired. Except for the initial MD time step, where Z0 is calculated from the divide-and-conquer estimate, the most time-consuming part of the algorithm during each MD time step is the iterative refinement because four matrix−matrix multiplications per iteration are required to update Z starting with Z̃ (ti+1). Including the orthogonalization of the Hamiltonian, a total of six sparse matrix−matrix multiplications are needed in each MD time step. This cost should be compared to that of the recursive (SP2) Fermi-operator expansion scheme that typically requires between 25 and 35 sparse matrix−matrix multiplications per MD time step. Z0 =



DISCUSSION MD Simulation on a Benchmark System. We have observed that by applying Algorithm 1 (ZSP) for the calculation of Z using linear-scaling thresholded sparse matrix algebra (ZSP) the total energy in the microcanonical MD simulation remains well preserved. In the particular case of Figure 4, we performed an MD simulation for a 50 CH4 molecule system and compared the energy stability and fluctuations using Z(t) calculated from a regular 6(N3) diagonalization approach (ZDIAG) with those of our linearscaling ZSP method. We can see that there is almost no difference in the energy drift and both methods have excellent long-term conservation of the total energy. Even using the linear-scaling SP2 method to solve the electronic structure, we still have a very well conserved energy throughout the MD simulation (blue curve in Figure 4). We can observe that the application of the XL geometrical integration algorithm (GEOM_INT) reduced the energy drift from 465 to 2.20 μeV/atom/ps. The latter value is of the same order of magnitude as that of the estimated drift obtained with direct diagonalization (ZDIAG) of 1.23 μeV/atom/ps (see SI). These drift estimates were computed using a straightforward linear regression over the energy versus time data. Figure 6 shows that the overall performance of the algorithm scales linearly with the number of atoms in the system, whereas the traditional matrix diagonalization gives rise to 6(N3) scaling. Unless specified, all timings were performed on the “Moonlight” cluster at LANL that comprises two 8-core Intel Xeon E5-2670 CPUs running at 2.6 GHz. We used the Intel Fortran compiler version 13.1.3 and the Math Kernel Library

Figure 5. In this plot, we show the loss of sparsity and the effect of symmetrization on the average number of nonzero elements per row above a numerical threshold, τ = 10−6, as a function of time during an MD simulation of liquid methane. A loss of sparsity is evidenced in the case where no symmetrization is used (red dotted line) compared to that in the symmetrized case (black full line). The dimension of the full dense matrix is 400 × 400 for this particular case.

matrix sparsity of thresholded inverse factor Z(t) is lost after a few hundred MD steps. This increase in the number of matrix elements above numerical threshold τ also means that the reduced linear scaling complexity of the thresholded iterative refinement scheme is lost. How can we understand this? Is inverse overlap factor Z simply not sparse or can the sparsity loss be avoided? What the systematic sparsity loss in the MD simulations indicates is that not all inverse factors, Z(t), that fulfill ZTSZ = I are sparse (see the Appendix for an explanation with a 2 × 2 general matrix). To avoid sparsity loss, we can therefore impose some additional constraints on the initial guess of Z0 to enforce or improve the sparsity in the iterative refinement. To overcome the loss of sparsity, we propose a simple symmetrization constraint on the initial guess of Z0 such that F

DOI: 10.1021/acs.jctc.6b00154 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

Figure 7. Trajectory deviations for different values of τ (10−8 to 10−4) with respect to the one obtained by diagonalizing S (ZDIAG).

Figure 6. Average time vs size of the system for the typical diagonalization (ZDIAG) and for our 6(N ) algorithm (ZSP) with one refinement iteration for the liquid methane with 50 CH4 molecular units. This was performed on a single core.

same, reaching an asymptotic limit. At this limit, the fluctuations produced by precision errors start to dominate. In this system when τ = 10−6 is used, the energy fluctuations stay almost identical to the ones obtained by diagonalizing S for more than 2 ps (see Figure 8).

version 11.2 with its OpenMP implementation. Unless specified, values of τ = 10−6 was used for all the matrix operations together with a time-step of Δt = 0.5 fs for all QMD simulations. The crossover point between the two approaches depends sensitively on the level of matrix sparsity and thus also on the numerical threshold. In the case of liquid CH4 systems, we find that the 6(N ) algorithm is faster when systems are larger than 250 atoms (50 CH4 units). In the case of water molecules, this crossover happens later at 300 atoms (100 H2O units, see SI), and this is mainly because S matrices are denser (less sparse) in the case of water systems (see Table 1). Apart from achieving linear scaling, the ZSP algorithm has a good parallelization performance owing to the efficient parallelism afforded by the use of ELLPACK-R format threaded matrix operations using the Open Multi-processing (OpenMP) application programming interface45 (see SI). The total wall-clock time for an MD time step is substantially reduced when using a linear scaling method such as SP2 to compute the electronic structure. When the SP2 method is used, the time spent to compute Z for large systems stays constant with respect to the time spent to compute the density matrix when using the ZSP algorithm. With regular 6(N3) diagonalization (ZDIAG), the computation of Z rapidly becomes the bottleneck as the system size gets larger (see SI). To quantify the effect in the trajectories arising from the use of approximated Z matrices, MD simulations with different values of τ are compared with respect to typical diagonalization. We compare two trajectories by computing the root-meansquare deviation as a function of the simulation time step (RMSD(t)). Na

RMSD(t ) =

1 ∑ R kZsp(t , τ) − R kZdg(t ) Na k

Figure 8. Energy fluctuations for ZSP (τ = 10−6) compared to those for ZDIAG for a 50 CH4 system.

By using ZSP with τ = 10−6, we can see (Figure 8) that the energy fluctuations remain unaltered for almost 2 ps when compared to the ones obtained when using ZDIAG. The radial distribution function shown in Figure 9a for g(r)C−C shows that the two methods (ZSP and ZDIAG) give the same steady-state properties after the system is equilibrated. Figure 9b shows the vibrational density of states of liquid methane computed with the Fourier transform of the velocity autocorrelation function.46 An alternative to our iterative refinement method could be a direct minimization of the error as performed by, for example, Millam and Scuseria for the density matrix.16 The objective function to be minimized could be written as follows: Ω(Z) = ∥ZTSZ − I∥ + λ∥ZT − Z∥, where the last term would ensure symmetrization, hence keeping a well-controlled matrix sparsity. Application to a Biophysical Model System. As a demonstration of both the scaling and the energy stability of the present algorithm, we performed an MD simulation of a 4158-atom system consisting of a polyalanine peptide (63

2

(11)

where RZsp k (t,τ) is the trajectory computed with the 6(N ) algorithm and RZdg k (t) is the reference trajectory computed with the diagonalization of S. Figure 7 shows that for values of τ < 10−6 the trajectories start to diverge after 0.4 ps, reaching a limit where, no matter which value of τ is used, the trajectory divergences are the G

DOI: 10.1021/acs.jctc.6b00154 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

Figure 9. C−C radial distribution function (a) and vibrational density of states (DOS) (b) calculated from the atomic coordinates and velocity autocorrelation function, respectively. Comparison between the exact MD trajectory (using ZDIAG) (black curve) and the approximated (using ZSP with τ = 10−6) trajectory (red dots) with one iterative refinement and one density matrix construction per MD step. For these calculations, a system with 50 CH4 molecules was used.

Figure 10. (a) Molecular representation of a polyalanine molecule solvated in water (a 4158-atom system). H and O atoms are white and red colorcoded, respectively. A cartoon representation of the polyalanine chain is shown in cyan. (b) QMD total energy plots (during 10.0 ps of the trajectory) for polyalanine solvated in water.



CONCLUSIONS We have demonstrated the computational efficiency and accuracy of an 6(N ) implementation of Niklasson’s iterative refinement scheme to obtain the inverse overlap factors for QMD simulations (ZSP). With this procedure, we performed QMD with a nonorthogonal basis set with 6(N ) scaling and small well-controlled errors. By propagating the initial guess for the iterative refinement using a time-reversible integration scheme, we were able to preserve a long-term stability also under approximate convergence of Z and avoid a systematic energy drift in microcanonical MD simulations. By symmetrizing Z at every step of the MD trajectory, we maintain high levels of matrix sparsity and hence good low prefactor 6(N ) performance. In terms of efficiency, the ZSP method shows the 6(N ) scaling with low prefactor that outperforms the traditional 6(N3) diagonalization already for fairly small systems (>102 atoms). With linear scaling XL-BOMD in the 0-SCF limit, with only one density matrix calculation per MD time step, the inverse overlap factorization becomes a significant part of each force evaluation. The computational overhead of the orthogonalization transform therefore needs to be kept as small as possible. This is the purpose of our method, and the additional cost is only about 20−30% compared to that

alanine residues) solvated in 1175 water molecules, giving a total of 4158 atoms (8631 orbitals). This system is a good prototype for globular proteins encountered in many biophysical problems and gives an ideal substrate to benchmark our 6(N ) algorithm. Algorithm 1 (ZSP) was also used to perform a preliminary geometry optimization of the system. After the geometry optimization, a canonical simulation was performed to stabilize the system at 300 K, also employing Algorithm 1. Finally, to test for energy conservation, a microcanonical MD simulation was performed. In this case, we have used a time step of Δt = 0.25 fs. Figure 10 shows the total energy as a function of the simulation time. The trajectories were computed using an 6(N ) implementation of the SP2 method for the calculation of the density matrix and within the framework of XL-BOMD. Figure 10 shows a good energy conservation during 10.0 ps of the trajectory. We achieve a simulation rate of about 2.64 ps/day (8 s/step) on a single node of a 20-core 2.60 GHz Intel Xeon E5-2660 v3 architecture. Given the size of the system, using traditional 6(N3) method to solve Z renders both the geometry optimization and MD simulations prohibitively expensive. H

DOI: 10.1021/acs.jctc.6b00154 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

istics of DFT with the possibility of having full atomistic detail in the simulations.33,34 These methods have been applied with success to study both equilibrium and time-dependent properties of nanoscopic systems.48,49 In general, DFTB Hamiltonians are written as follows:

of a corresponding orthogonal formulation, though avoiding any long-term energy drift. Finally, we have applied the ZSP method to a large system consisting of a polyalanine peptide solvated in water, which shows remarkable conservation of the total energy. Geometry optimization also can benefit from Algorithm 1 as the orthogonal transformation needs to be performed at every relaxation step.



Na

0 Hμν = Hμν +

APPENDIX

Assume that R is a transformation such that (12)

where β = {ψi} is the nonorthogonal basis set as determined by slater orbitals, for example, and α = {ϕi} is an orthogonal basis set. We can then have matrix Z that orthogonalizes H. Hence, eq 12 becomes [H ]α = ZT[H ]β Z

(13)

where Z is the inverse transformation of R (ZR = RZ = I, see eq 3). Provided that we can construct matrix Z, the Hamiltonian can be rewritten in an orthogonal form (i.e., [H]β → [H]α). Starting from eq 1 and using transformation R, we can derive eq 2 as follows: [H ]β B = SBϵ T

ELLPACK-R Format

(14)

T

(15)

T

T

(16)

T

T

(17)

R [H ]α RB = R RBϵ R [H ]α C = R C ϵ R [H ]α C = R C ϵ ⇒ [H ]α C = C ϵ

The ELLPACK-R storage format for sparse matrices24 has recently been shown to have great advantages with respect to the traditional compressed sparse row (CSR) format in which the matrix is fully vectorized.51 Algebraic operations can be parallelized over shared-memory cores despite the fact that it is less compact than CSR. The ELLPACK-R format has been shown to have a good parallel performance for matrix−vector multiplications in GPU devices. 24 To understand the ELLPACK-R data structure, we show the translation from a dense N × N matrix to the ELLPACK-R sparse data structure in Figure 11. There are three arrays for the ELLPACK-R

Here, we have used RTR = S. Hence, to obtain [H]α, we need to apply transformation Z from left and right as in eq 13. Once the orthogonal eigenvectors are obtained, they can be rewritten in the nonorthogonal basis set as follows: (18)

B = ZC

(22)

where i and j are atom indexes, μ and ν are orbital indexes, and γii is the Hubbard U for species i. At large distances, parameter γij decays as 1/Rij, which is simply the Coulombic interaction between different sites, whereas at short range, it is a screened 0 Coulomb potential. Off-diagonal elements, H μ,ν , are a combination of special radial-dependent bond integrals, which are parametrized based on electronic properties of the system as computed with ab initio methods or from experimental properties.50 Partial charges, Δqk = Ak − qk , are obtained from the Mulliken analysis, which in the case of a general nonorthogonal 1 basis set is qk = 2 ∑μ ∈ k (Sρ + ρS)μμ, Ak being the core effective charge of each atom k. In this work, we have used the mio-0-1 set of DFTB parameters33 to compute both Hamiltonian H and overlap matrix S.

Inverse Overlap Factorization

[H ]β = RT[H ]α R

1 Sμν∑ (γ + γkj)Δqk 2 k = 1 ik

Transformation Z is not unique. A popular approach is the Löwdin factorization where we use the fact that S is Hermitian and it admits a unitary diagonalization S = UsUT T

(19) 1/2

T

T

with UU = I. Let R = Us U so that R R = S. Using ZR = I, we have

Z = Us−1/2UT

(20)

Figure 11. Translation of an N × N dense matrix to a sparse data structure. The ELLPACK-R data structure at the right is described by three arrays: two N × M matrices containing values and indices, respectively, and a vector containing the number of nonzero entries per row. N is the number of rows, and M is the maximum value of nonzeros per row.

and

I = ZTSZ

(21)

Equation 20 gives a way to compute Z by diagonalizing S. Equation 21, in turn, provides a check of any approximation of Z; that is, if Z0 is an approximation of Z, then ∥I − ZT0 SZ0∥ = δ, where δ is the error in the approximation, and ∥X∥ is the two norm of matrix X.

representation: an N × M array containing the values of the entries of the nonzero elements of the N × N dense matrix, an N × M array containing the column indices of the nonzero elements for every row of the dense matrix, and an array containing the maximum number of nonzero elements for every row of the dense matrix. M is the maximum number of nonzero in any row. The row-wise data storage enables a straightforward partition of the work over shared- and

Electronic Structure

SCC-TB is a fast semiempirical method in which the Hamiltonian can be constructed significantly faster than that in ab initio methods. It allows for the rapid calculation of the electronic structure of several thousands of atoms.47 In particular, the SCC-TB method based on density functional theory (SCC-DFTB) naturally includes many of the characterI

DOI: 10.1021/acs.jctc.6b00154 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

Nuclear Security Administration. This research has been supported at Los Alamos National Laboratory under the Department of Energy contract DE-AC52-06NA25396. Special thanks to Dr. Vanessa V. Galassi for helpful discussions.

distributed-memory cores. Details of the ELLPACK-R matrix algebra and its parallel implementation can be found in ref 3. Relationship between Sparsity and Symmetry

Let L be the Cholesky decomposition such that S−1 = LLT, where L is a lower triangular matrix, and let X be the symmetric Löwdin factor, that is, X = S−1/2 and S−1 = XXT. For any general factor Z between L and X that fulfills ZTSZ = I, we also have that S−1 = ZZT. The question why the general case Z is less sparse can possibly be understood by considering the following 2 × 2 model example: ⎛ s −1 s −1 ⎞ 11 12 ⎟ S −1 = ⎜⎜ −1 −1 ⎟ s s ⎝ 12 22 ⎠



(1) Adcock, S. A.; McCammon, J. A. Molecular dynamics: survey of methods for simulating the activity of proteins. Chem. Rev. 2006, 106, 1589−1615. (2) Freddolino, P. L.; Park, S.; Roux, B.; Schulten, K. Force field bias in protein folding simulations. Biophys. J. 2009, 96, 3772−3780. (3) Mniszewski, S. M.; Cawkwell, M. J.; Wall, M. E.; Mohd-Yusof, J.; Bock, N.; Germann, T. C.; Niklasson, A. M. N. Efficient parallel linear scaling construction of the density matrix for Born−Oppenheimer molecular dynamics. J. Chem. Theory Comput. 2015, 11, 4644−4654. (4) Arita, M.; Bowler, D. R.; Miyazaki, T. Stable and efficient linear scaling first-principles molecular dynamics for 10 000+ atoms. J. Chem. Theory Comput. 2014, 10, 5419−5425. (5) VandeVondele, J.; Borstnik, U.; Hutter, J. Linear scaling selfconsistent field calculations with millions of atoms in the condensed phase. J. Chem. Theory Comput. 2012, 8, 3565−3573. (6) Goedecker, S. Decay properties of the finite-temperature density matrix in metals. Phys. Rev. B 1998, 58, 3501. (7) Bowler, D. R.; Miyazaki, T. O(N) methods in electronic structure calculations. Rep. Prog. Phys. 2012, 75, 036503−036546. (8) Golub, G.; van Loan, C. F. Matrix Computations; Johns Hopkins University Press: Baltimore, 1996. (9) Löwdin, P. O. Quantum theory of cohesive properties of solids. Adv. Phys. 1956, 5, 1−171. (10) Niklasson, A. M. N. Iterative refinement method for the approximate factorization of a matrix inverse. Phys. Rev. B 2004, 70, 193102. (11) Press, W. H.; Flannery, B. P.; Teukolsky, S. A.; Vetterling, W. T. Numerical Recipes in FORTRAN: The Art of Scientific Computing, 2nd ed.; Cambridge University Press: Cambridge, U.K., 1992. (12) Challacombe, M. A simplified density matrix minimization for linear scaling self-consistent field theory. J. Chem. Phys. 1999, 110, 2332−2342. (13) Benzi, M.; Meyer, C. D.; Tåuma, M. A sparse approximate inverse preconditioner for the conjugate gradient method. SIAM J. Sci. Comput. 1996, 17, 1135−1149. (14) Ozaki, T. Efficient recursion method for inverting an overlap matrix. Phys. Rev. B 2001, 64, 195110. (15) Mauri, F.; Galli, G.; Car, R. Orbital formulation for electronicstructure calculations with linear system-size scaling. Phys. Rev. B 1993, 47, 9973−9976. (16) Millam, J. M.; Scuseria, G. E. Linear scaling conjugate gradient density matrix search as an alternative to diagonalization for first principles electronic structure calculations. J. Chem. Phys. 1997, 106, 5569−5577. (17) Pulay, P.; Fogarasi, G. Fock matrix dynamics. Chem. Phys. Lett. 2004, 386, 272. (18) Niklasson, A. M. N.; Tymczak, C. J.; Challacombe, M. Timereversible Born-Oppenheimer molecular dynamics. Phys. Rev. Lett. 2006, 97, 123001. (19) Kühne, T. D.; Krack, M.; Mohamed, F. R.; Parrinello, M. Efficient and accurate Car-Parrinello-like approach to Born-Oppenheimer molecular dynamics. Phys. Rev. Lett. 2007, 98, 066401. (20) Zheng, G.; Niklasson, A. M. N.; Karplus, M. Lagrangian formulation with dissipation of Born-Oppenheimer molecular dynamics using the density-functional tight-binding method. J. Chem. Phys. 2011, 135, 044122. (21) Hutter, J. Car Parrinello molecular dynamics. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2012, 2, 604. (22) Lin, L.; Lu, J.; Shao, S. Analysis of time reversible BornOppenheimer molecular dynamics. Entropy 2014, 16, 110.

(23) −1

where in the most general case (S

T

= ZZ ) reads

⎛ z11 z12 ⎞⎛ z11 z 21 ⎞ S −1 = ⎜ ⎟⎜ ⎟ ⎝ z 21 z 22 ⎠⎝ z12 z 22 ⎠ ⎛z 2 + z 2 z11z 21 + z12z 22 ⎞ 11 12 ⎟ = ⎜⎜ ⎟ 2 2 ⎝ z11z 21 + z12z 22 z 22 + z 21 ⎠ −1

s−1 12

(24)

s−1 21

Now let S be sparse such that = = τ with τ → 0. In the general factorization case, z21 = (τ−z12z22)/z11 ≃ −z12z22/ −1 z11. By expressing z11, z22 as a function of z12, s−1 11 and s22 , we get ⎛ −1 2 ⎜ s11 − z12 ⎜ ⎛ s −1 Z=⎜ 22 z − ⎜⎜ 12⎜⎜ −1 s ⎝ 11 ⎝

⎞ ⎟ ⎟ −1 −1 2 s22 (s11 − z12 )⎟ ⎟⎟ −1 s11 ⎠

z12 ⎞ ⎟ ⎟ ⎠

(25)

Thus, even with τ = 0, that is, for a simple diagonal inverse overlap matrix, for which both the Löwding and Cholesky factors have vanishing off-diagonal elements, the general case, Z, may have off-diagonal elements of any size (as z12 becomes a parameter that can take any value). Without symmetrization of the solution, we may thus have a drift of general inverse overlap factor Z such that sparsity is lost.



ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jctc.6b00154. Details of the implementation of the DC_EST algorithm; additional figures and tables and coordinates of the systems (PDF).



REFERENCES

AUTHOR INFORMATION

Corresponding Authors

*E-mail: [email protected] (C.F.A.N.). *E-mail: [email protected] (A.M.N.N.). Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS We acknowledge support from Office of Basic Energy Sciences (LANL2014E8AN) and the Laboratory Directed Research and Development Program of Los Alamos National Laboratory. This research used resources provided by the Los Alamos National Laboratory Institutional Computing Program, which is supported by the U.S. Department of Energy National J

DOI: 10.1021/acs.jctc.6b00154 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation (23) Niklasson, A. M. N.; Cawkwell, M. J. Generalized extended Lagrangian Born-Oppenheimer molecular dynamics. J. Chem. Phys. 2014, 141, 164123. (24) Vazquez, F.; Ortega, G.; Fernandez, J. J.; Garzon, E. Improving the performance of the sparse matrix vector product with GPUs. 10th IEEE International Conference on Computer and Information Technology, 2010; pp 1146−1151. (25) Niklasson, A. M. N.; Mniszewski, S. M.; Negre, C. F. A.; Cawkwell, M. J.; Swart, P. J.; Mohd-Yusof, J.; Germann, T. C.; Wall, M. E.; Bock, N.; Rubensson, E. H.; Djidjev, H. Graph-based linear scaling electronic structure theory. 2016. arXiv:1603.00937 [physics.comp-ph]; LA-UR-16-21317. (26) Rubensson, E. H.; Salek, P. Systematic sparse matrix error control for linear scaling electronic structure calculations. J. Comput. Chem. 2005, 26, 1628−1637. (27) Rubensson, E. H.; Rudberg, E.; Salek, P. Density matrix purification with rigorous error control. J. Chem. Phys. 2008, 128, 074106. (28) Rubensson, E. H.; Rudberg, E.; Salek, P. Truncation of small matrix elements based on the Euclidean norm for blocked data structures. J. Comput. Chem. 2009, 30, 974. (29) Niklasson, A. M. N. Expansion algorithm for the density matrix. Phys. Rev. B 2002, 66, 155115. (30) Rubensson, E. H.; Niklasson, A. M. N. Interior eigenvalues from density matrix expansions in quantum mechanical molecular dynamics. SIAM J. Sci. Comput. 2014, 36, B147−B170. (31) Niklasson, A. M. N. Extended Born-Oppenheimer molecular dynamics. Phys. Rev. Lett. 2008, 100, 123004. (32) Sanville, E. J.; Bock, N.; Coe, J. D.; Martinez, E.; Mniszewski, S. M.; Negre, C. F. A.; Niklasson, A. M. N.; Cawkwell, M. J. LATTE. 2010. Los Alamos National Laboratory, LA-CC 10-004. (33) Elstner, M.; Porezag, D.; Jungnickel, G.; Elsner, J.; Haugk, M.; Frauenheim, Th.; Suhai, S.; Seifert, G. Self-consistent-charge densityfunctional tight-binding method for simulations of complex materials properties. Phys. Rev. B 1998, 58, 7260−7268. (34) Finnis, M. W.; Paxton, A. T.; Methfessel, M.; van Schilfgarde, M. Crystal structures of zirconia from first principles and self-consistent tight binding. Phys. Rev. Lett. 1998, 81, 5149. (35) Frauenheim, T.; Seifert, G.; Elstner, M.; Hajnal, Z.; Jungnickel, G.; Poresag, D.; Suhai, S.; Scholz, R. A self-consistent charge densityfunctional based tight-binding method for predictive materials simulations in physics, chemistry and biology. Phys. Status Solidi 2000, 217, 41. (36) Aradi, B.; Niklasson, A. M. N.; Frauenheim, T. Extended Lagrangian density functional tight-binding molecular dynamics for molecules and solids. J. Chem. Theory Comput. 2015, 11, 3357. (37) Rubensson, E. H.; Bock, N.; Holmtrömm, E.; Niklasson, A. M. N. Recursive inverse factorization. J. Chem. Phys. 2008, 128, 104105. (38) Brandhorst, K.; Head-Gordon, M. Fast sparse Cholesky decomposition and inversion using nested dissection matrix reordering. J. Chem. Theor. Comput. 2011, 7, 351−368. (39) Kim, K.; Rajamanickam, S.; Stelle, G.; Edwards, H. C.; Olivier, S. L. Task parallel incomplete Cholesky factorization using 2D partitioned-block layout. 2016. arXiv:1601.05871. (40) Jansı ́k, B.; Høst, S.; Jørgensen, P.; Olsen, J.; Helgaker, T. Linearscaling symmetric square-root decomposition of the overlap matrix. J. Chem. Phys. 2007, 126, 124104. (41) Rubensson, E. H.; Rudberg, E.; Salek, P. A hierarchic sparse matrix data structure for large-scale Hartree-Fock/Kohn-Sham calculations. J. Comput. Chem. 2007, 28, 2531. (42) Remler, D. K.; Madden, P. A. Molecular dynamics without effective potentials via the Car-Parrinello approach. Mol. Phys. 1990, 70, 921. (43) Niklasson, A. M. N.; Steneteg, P.; Odell, A.; Bock, N.; Challacombe, M.; Tymczak, C. J.; Holmstrom, E.; Zheng, G.; Weber, V. Extended Lagrangian Born-Oppenheimer molecular dynamics with dissipation. J. Chem. Phys. 2009, 130, 214109.

(44) Steneteg, P.; Abrikosov, I. A.; Weber, V.; Niklasson, A. M. N. Wave function extended Lagrangian Born-Oppenheimer molecular dynamics. Phys. Rev. B 2010, 82, 075110. (45) OpenMP. Architecture Review Board, 2014. http://openmp.org (accessed March 28, 2016). (46) Dickey, J. M.; Paskin, A. Computer simulation of the lattice dynamics of solids. Phys. Rev. 1969, 188, 1407−1418. (47) Negre, C. F. A.; Sánchez, C. G. Atomistic structure dependence of the collective excitation in metal nanoparticles. J. Chem. Phys. 2008, 129, 034710. (48) Fuertes, V. C.; Negre, C. F. A.; Oviedo, M. B.; Bonafé, F. P.; Oliva, F. Y.; Sánchez, C. G. A theoretical study of the optical properties of nanostructured TiO2. J. Phys.: Condens. Matter 2013, 25, 115304. (49) Oviedo, M. B.; Negre, C. F.; Sánchez, C. G. Dynamical simulation of the optical response of photosynthetic pigments. Phys. Chem. Chem. Phys. 2010, 12, 6706−6711. (50) Slater, J. C.; Koster, G. F. Simplified LCAO method for the periodic potential problem. Phys. Rev. 1954, 94, 1498−1524. (51) Challacombe, M. A general parallel sparse-blocked matrix multiply for linear scaling SCF theory. Comput. Phys. Commun. 2000, 128, 93.

K

DOI: 10.1021/acs.jctc.6b00154 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX