
Hybrid CPU/GPU Integral Engine for Strong-Scaling Ab Initio Methods

Jörg Kussmann†,‡ and Christian Ochsenfeld*,†,‡

†Department of Chemistry, University of Munich (LMU), Butenandtstrasse 7, D-81377 München, Germany
‡Center for Integrated Protein Science (CIPSM) at the Department of Chemistry, University of Munich (LMU), Butenandtstrasse 5-13, D-81377 München, Germany




ABSTRACT: We present a parallel integral algorithm for the two-electron contributions occurring in Hartree−Fock and hybrid density functional theory that allows for a strong-scaling parallelization on inhomogeneous compute clusters. With a particular focus on graphics processing units (GPUs), we show that our approach allows an efficient simultaneous use of CPUs and GPUs, although the different architectures demand conflicting strategies in order to ensure efficient program execution. Furthermore, we present a general strategy for using large basis sets like quadruple-ζ split valence on GPUs and investigate the balance between CPUs and GPUs depending on the l-quantum numbers of the corresponding basis functions. Finally, we present first illustrative calculations using a hybrid CPU/GPU environment and demonstrate the strong-scaling performance of our parallelization strategy also for pure CPU-based calculations.

I. INTRODUCTION

In recent years the use of graphics processing units (GPUs) has gained increasing interest in high-performance computing due to their inherent computing power and relatively low power consumption. Their use in the field of ab initio quantum chemistry has been explored by several research groups, reporting significant performance enhancements as compared to CPU-based calculations.1−16 Regarding fully GPU-based integral evaluation, the proposed algorithms have so far been restricted to smaller basis sets with at most d-11 or f-shells,12 while the CPU cores idle during the GPU computation. The restriction to smaller basis sets results from the scarce local memory resources on the streaming multiprocessors (64 kB), which prevent the storage of larger integral batches in shared memory.17 Regarding the so far neglected combined use of GPUs and CPUs, one particular reason is their different requirements on the workload distribution. Considering the vast computational resources on a single GPU and the dataflow model of kernel execution,4 GPUs need far larger batch sizes to ensure an efficient execution, whereas CPUs need significantly smaller batch sizes in order to ensure a balanced workload distribution and to avoid memory access latency. To resolve these conflicting requirements, we developed the hybrid integral engine, which allows the combined use of CPUs and CUDA-12,14 as well as OpenCL-driven17 GPUs in a serial or parallel compute environment. The hybrid integral engine is described in section II. In section III we present first illustrative calculations using both CPUs and GPUs on a single compute node as well as on a cluster of fast interconnected GPU nodes. Furthermore, we show that our integral engine also provides strong-scaling parallelization on CPU cores only.


II. HYBRID INTEGRAL ENGINE

The central concept of our integral evaluation algorithm is to leave the workload distribution to the individual worker threads instead of statically assigning the work batches to them. The static model usually results in worker starvation, i.e., an imbalance in the work distribution, which is avoided here by having the worker threads actively request work batches from a global work pool (a minimal sketch of this pull-based scheme is given below). Each worker thread is allocated to a physical CPU core as depicted in Figure 1. Each GPU device is also associated with a specific worker (blue/green in Figure 1), while in an MPI-based parallel environment a single thread on the master node is dedicated to bookkeeping and managing the work requests of the workers (red in Figure 1). Note that during the integral evaluation only single batch IDs are communicated between pool and worker nodes, i.e., there is virtually no communication overhead.

Since there are usually more CPU cores than GPU devices per node, the worker population is strongly inhomogeneous regarding computing power and, as mentioned before, regarding the required work batch sizes.
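To illustrate the pull-based distribution, the following minimal C++11 sketch shows a shared work pool from which worker threads fetch batch IDs until the pool is exhausted; the class and function names (IntegralWorkPool, process_batch) are illustrative placeholders and not the actual FERMIONS++ classes.

#include <atomic>
#include <thread>
#include <vector>

// Minimal pull-based work pool: workers fetch batch IDs until the pool is empty.
class IntegralWorkPool {
public:
    explicit IntegralWorkPool(int n_batches) : next_(0), n_batches_(n_batches) {}
    // Returns the next batch ID, or -1 once all batches have been handed out.
    int fetch() {
        const int id = next_.fetch_add(1);
        return id < n_batches_ ? id : -1;
    }
private:
    std::atomic<int> next_;
    int n_batches_;
};

// Hypothetical per-batch integral evaluation (CPU code or a GPU driver routine).
void process_batch(int /*batch_id*/) { /* evaluate all shell quartets of the batch */ }

void run_workers(IntegralWorkPool& pool, int n_threads) {
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t)
        workers.emplace_back([&pool] {
            // Each worker pulls new work as soon as it becomes idle, so no static
            // assignment is needed and starvation is avoided.
            for (int id = pool.fetch(); id >= 0; id = pool.fetch())
                process_batch(id);
        });
    for (auto& w : workers) w.join();
}

In the actual engine, one such worker would additionally be created per GPU device, so that CPU cores and GPUs draw from the same pool.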


Figure 1. Hybrid integral engine: CPU core utilization within parallel setup. See the text for further explanations.

Figure 2. Hybrid integral engine: Doubly redundant queue for GPU and CPU worker threads. See the text for further explanations.

The latter point is addressed by employing a doubly redundant work queue, i.e., the complete workload is divided twice, once with respect to GPUs and once with respect to CPUs, as depicted in Figure 2. The labels Lx (x = 1, 2, ...) represent l-quantum number (l-qn) combinations of the corresponding integrals, e.g., in the case of exchange integrals: L1 = (ss|ss), L2 = (ps|ss), L3 = (pp|ss), etc. For GPUs, the batches are significantly larger in order to maximize utilization and minimize host/device communication latency (blue/upper part of Figure 2). The CPU batches instead are small in order to minimize starvation of compute cores. Since GPUs are most efficient for shells with small l-qn, the queue pointer for GPU workers starts at L1, while the queue pointer for CPU workers starts at the opposite end with the largest l-qn combinations. In order to prevent redundant double calculations, a specific type of l-qn combination is locked for CPU workers if its first batch is handled by GPU workers and vice versa (a rough sketch of this locking scheme follows below). It should be stressed that the labels Lx represent nonredundant l-qn combinations with respect to the 8-fold integral symmetry as it is usually exploited in conventional CPU-based implementations. For GPUs, however, some symmetries are not exploited in order to ensure an efficient execution.5 In exchange-type integrals, for example, the only symmetry exploited is Kμν = Kνμ (in the case of (skew-)symmetric density matrices only). Therefore, if Lx represents, e.g., (sp|df), the same label represents the set of [(sp|df), (sp|fd), (ps|df), (ps|fd)] l-qn combinations.

A common strategy for parallelizing two-electron integral evaluation is to parallelize only over integrals of the same l-qn combination, in particular when using OpenMP. This, however, may cause the single threads to wait due to contention of global memory accesses when scattering the integrals into the global result matrix. Furthermore, it creates an overhead due to the repeated creation/destruction of the parallel environment. These drawbacks are avoided or at least ameliorated in our approach: the thread pool is set up only once, and since all l-qn combinations are processed at once and a separate synchronization primitive (mutex) is employed for each resulting l-qn block of, e.g., Kμν, which is written to a different part of the result matrix, the latency due to global write processes is reduced.

It is also necessary to briefly discuss the batching strategy for GPU and CPU threads. For the latter we simply define a target batch size for the number of contracted shell quartets. Since we try to maximize the batch sizes for GPU threads, we instead set a target number of batches. Using a single GPU node and smaller molecular systems, for example, we use a target of one batch per l-qn combination. Within a parallel environment with a larger number of GPUs, more batches per l-qn combination are necessary in order to balance the distribution of the workload over all available GPUs (see section III). Note that for larger systems the maximization of the GPU batches may be limited by the device's memory resources. In this case the hybrid integral engine generates adequately sized batches automatically.
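As a rough illustration of the doubly redundant queue, the following C++ sketch lets GPU workers claim l-qn combinations from the low-l end and CPU workers from the high-l end, locking a combination for the other side once its first batch has been taken. The class and member names (LqnQueue, Owner) are assumptions made for this example, and for brevity each combination is treated as a single batch, whereas the real engine tracks many batches per combination.

#include <mutex>
#include <vector>

// Ownership state of an l-qn combination in the doubly redundant queue.
enum class Owner { None, Gpu, Cpu };

class LqnQueue {
public:
    explicit LqnQueue(int n_combinations)
        : owner_(n_combinations, Owner::None), done_(n_combinations, false) {}

    // GPU workers scan from the smallest l-qn combination (L1) upwards.
    int next_for_gpu() { return claim(/*from_front=*/true, Owner::Gpu); }
    // CPU workers scan from the largest l-qn combination downwards.
    int next_for_cpu() { return claim(/*from_front=*/false, Owner::Cpu); }

private:
    int claim(bool from_front, Owner who) {
        std::lock_guard<std::mutex> lock(m_);
        const int n = static_cast<int>(owner_.size());
        for (int k = 0; k < n; ++k) {
            const int i = from_front ? k : n - 1 - k;
            // A combination is available if it has not been processed yet and is
            // either unclaimed or already owned by this side of the queue.
            if (!done_[i] && (owner_[i] == Owner::None || owner_[i] == who)) {
                owner_[i] = who;   // lock this combination for the other side
                done_[i] = true;   // here: one batch per combination (simplification)
                return i;          // index of the l-qn combination to process
            }
        }
        return -1;                 // nothing left to do from this side
    }

    std::mutex m_;
    std::vector<Owner> owner_;
    std::vector<bool> done_;
};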

The hybrid integral engine is implemented in C++, using separate classes to handle the work pool and the workers. The internode communication is handled via MPI, while the different CPU cores on a single node are parallelized using the std::thread class of the C++ standard template library (STL). For calculations on a single node, the work pool operates passively, i.e., the worker threads directly fetch the next work batch from the pool. Within an MPI-based setup, the work pool objects request the next batch ID from the master node, where a single thread is dedicated to managing all remote requests in order to avoid any communication latency. By default, independent potential matrices are kept on each node, and the final result is obtained via a final reduction step.
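The following sketch indicates how such an MPI setup can be organized under simplifying assumptions: a dedicated thread on rank 0 answers batch-ID requests, every rank accumulates its own potential matrix, and the partial results are summed with MPI_Reduce at the end. The message tags, the request protocol, and the function names are illustrative choices for this example and not the actual FERMIONS++ protocol.

#include <mpi.h>
#include <vector>

constexpr int TAG_REQUEST = 1;  // worker -> master: "send me the next batch ID"
constexpr int TAG_BATCH   = 2;  // master -> worker: batch ID, or -1 when the pool is empty

// Runs in a dedicated thread on rank 0 (requires MPI_THREAD_MULTIPLE): hands out
// batch IDs until all are consumed, then answers every further request with -1.
void serve_batches(int n_batches, int n_worker_ranks) {
    int next = 0, finished = 0, dummy = 0;
    MPI_Status st;
    while (finished < n_worker_ranks) {
        MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST, MPI_COMM_WORLD, &st);
        const int id = (next < n_batches) ? next++ : -1;
        if (id < 0) ++finished;
        MPI_Send(&id, 1, MPI_INT, st.MPI_SOURCE, TAG_BATCH, MPI_COMM_WORLD);
    }
}

// Worker side: request batch IDs until -1 is received, accumulating into a local matrix.
void work(std::vector<double>& K_local) {
    int dummy = 0, id = 0;
    while (true) {
        MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
        MPI_Recv(&id, 1, MPI_INT, 0, TAG_BATCH, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (id < 0) break;
        // evaluate all integrals of batch `id` and scatter them into K_local (omitted)
        (void)K_local;
    }
}

// Final reduction step: sum the per-node potential matrices into K_total on rank 0.
void reduce_result(const std::vector<double>& K_local, std::vector<double>& K_total) {
    MPI_Reduce(K_local.data(), K_total.data(), static_cast<int>(K_local.size()),
               MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
}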

II.A. Hybrid Scheme for Linear-Scaling Exchange Evaluation. A combined use of both CPUs and GPUs is most attractive for the evaluation of exchange-like potentials, which is the bottleneck in Hartree−Fock (HF) and hybrid density functional theory (DFT) calculations. In order to obtain an asymptotically linear-scaling behavior with respect to system size, different strategies have to be employed for CPUs and GPUs. For the CPU-based algorithm we employ the LinK scheme by Ochsenfeld et al.,18,19 while the GPU-based algorithm uses the PreLinK scheme by Kussmann and Ochsenfeld.12,14 However, a straightforward independent use of both schemes can lead to the accumulation of numerical noise within the exchange potential, which can hamper or even prevent convergence of the self-consistent field (SCF) calculation. This numerical noise results from the different screening objectives of the LinK and PreLinK schemes. The latter determines significant elements Kμν of the resulting exchange matrix, while the LinK scheme employs Schwarz screening20 to determine significant 4-center 2-electron integrals (μλ|νσ). Consider for example an element Kμν, where μ represents a p-function (l-quantum number l-qn = 1) and ν represents an s-function (l-qn = 0):

K_{\mu_p \nu_s} = \sum_{\lambda\sigma} P_{\lambda\sigma} (\mu_p \lambda | \nu_s \sigma)    (1)

where the functions λ and σ can be of any l-qn type. Considering the case where the l-qn combination (pp|ss) is evaluated with PreLinK on GPUs and (pp|ps) with LinK on CPUs, one obtains contributions to Kμpνs from both schemes.


Figure 3. Wall times in seconds for exchange-potential calculations for a series of DNA fragments using PBE0/def2-SVP for different CPU/GPU ratios. A single GPU node with a total of 12 CPU cores was used for calculations with up to 4 GPUs, while the calculation with 24 GPUs was executed on 6 GPU nodes with a total of 71 CPU cores using MPI. ’Mixed’ denotes the use of both GPUs and CPUs, while ’Pure’ denotes calculations using GPUs only. The corresponding CPU/GPU ratios are 11/1, 10/2, 9/3, 8/4, and 47/24.

Figure 4. Wall times in seconds for exchange-potential calculations for a single adenine−thymine base pair employing a def2-SV, def2-SVP, and def2-TZVP basis set for different CPU/GPU ratios. See the caption of Figure 3 for computational details.

If the PreLinK scheme considers the element Kμpνs as insignificant, no contributions to this matrix element are computed on the GPU. In contrast, the LinK screening on the CPU can still determine a contributing integral (μλ|νσ) as significant due to the Schwarz criterion. The final element Kμpνs would thus be constructed from only a part of the sum given in eq 1 and can therefore be larger than predicted by the PreLinK estimate. In order to avoid this potential source of numerical noise in the final exchange potential matrix, the PreLinK screening has to be considered additionally within the CPU-based evaluation using LinK, so that only contributions to significant elements Kμν are added to the final exchange potential matrix. Note that this additional preselection does not improve the performance of the pure LinK scheme, but it is essential for an overall stable SCF convergence.
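A schematic of this combined criterion on the CPU side is sketched below: a contribution is only accumulated if the target element passes the PreLinK significance test and the integral batch passes a Schwarz-type bound. The helper names (prelink_significant, ShellPair) and the simple double loop are assumptions made for illustration; the actual LinK algorithm organizes the loops over significant shell pairs in a more elaborate way.

#include <vector>

// Shell pair with its Schwarz factor Q_pq = sqrt((pq|pq)); p is the "target" index
// (mu for the bra, nu for the ket), q the summation index (lambda or sigma).
struct ShellPair { int p, q; double schwarz; };

// Hypothetical PreLinK significance map for exchange elements K_{mu,nu}.
inline bool prelink_significant(int mu, int nu,
                                const std::vector<std::vector<char>>& sig) {
    return sig[mu][nu] != 0;
}

// CPU-side exchange accumulation with Schwarz screening plus the additional
// PreLinK preselection on the target element (schematic only).
void accumulate_exchange(const std::vector<ShellPair>& bra,                  // (mu lambda|
                         const std::vector<ShellPair>& ket,                  // |nu sigma)
                         const std::vector<std::vector<double>>& P_max,      // max |P_{lambda,sigma}|
                         const std::vector<std::vector<char>>& prelink_sig,
                         double thresh,
                         std::vector<std::vector<double>>& K) {
    for (const auto& b : bra) {
        for (const auto& k : ket) {
            // LinK-type estimate: Schwarz bound of the integral times the density weight.
            const double estimate = b.schwarz * k.schwarz * P_max[b.q][k.q];
            if (estimate < thresh) continue;                             // integral screening
            if (!prelink_significant(b.p, k.p, prelink_sig)) continue;   // PreLinK preselection
            // evaluate the batch (mu lambda|nu sigma), contract with P, and add to K
            K[b.p][k.p] += 0.0;  // placeholder for the contracted contribution
        }
    }
}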

II.B. High l-Quantum Number Basis Functions on GPUs. As mentioned before, the limited local memory on the streaming multiprocessors prohibits the use of larger shared memory arrays, which, however, are essential to store the intermediate resulting integrals. This not only negatively affects the performance of the integral evaluation for basis functions with higher l-qn but even prevents the use of large basis sets containing f- or g-functions when employing the standard thread blocks of dimension 8 × 8. Our solution to this problem is to divide the resulting integrals into batches of basis functions. For Coulomb-type integrals, we follow the algorithm outlined in ref 5: we precontract the density with the ket Hermite factors and only form the intermediate integrals Jp on the GPU, while the final transformation to Jμν is done on the CPU. Considering a batch (μν|λσ), for example, the number of Jp values is (lbra+1)(lbra+2)(lbra+3)/6 with lbra = lμ+lν. Thus, at most lbra = 5 (28 kB) will fit into the 32 kB of shared memory provided by an AMD FirePro 3D W8100 GPU when 8 × 8 thread blocks are employed, while two f-functions (lbra = 6) already demand 42 kB. Therefore, a threshold for the maximum batch size for Jp is passed to the code generator, which automatically generates compute kernels for the single batches, where the batches are also organized so as to minimize the recomputation of integrals.
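The shared-memory bookkeeping behind these numbers can be written down directly. The small helper below (illustrative only, not part of the actual code generator) computes the number of Hermite intermediates Jp and the resulting per-block shared memory for an 8 × 8 thread block in double precision, reproducing the 28 kB (lbra = 5) and 42 kB (lbra = 6) figures quoted above.

#include <cstdio>

// Number of Hermite intermediates J_p for a bra with total angular momentum l_bra.
constexpr int num_jp(int l_bra) {
    return (l_bra + 1) * (l_bra + 2) * (l_bra + 3) / 6;
}

// Shared memory (bytes) needed to hold one J_p set per thread of an 8x8 block,
// assuming double precision (8 bytes per value).
constexpr int shared_bytes(int l_bra, int block_threads = 8 * 8) {
    return num_jp(l_bra) * block_threads * 8;
}

int main() {
    for (int l = 0; l <= 6; ++l)
        std::printf("l_bra = %d : %3d J_p values, %6d bytes (%.0f kB) shared memory\n",
                    l, num_jp(l), shared_bytes(l), shared_bytes(l) / 1024.0);
    // l_bra = 5 -> 56 values, 28672 bytes (28 kB): fits the 32 kB of an AMD W8100.
    // l_bra = 6 -> 84 values, 43008 bytes (42 kB): exceeds it, so the batch is split.
    return 0;
}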


Table I. Wall Times in Seconds for the Evaluation of the Coulomb and Exchange Potentials of β-Carotene Using Different Basis Sets (def2-SVP, def2-TZVP, def2-QZVP) on a Single Node with up to 4 GPUs and 12 CPU Cores(a)

                    def2-SVP                def2-TZVP                  def2-QZVP
                 OpenCL      CUDA        OpenCL       CUDA         OpenCL            CUDA
no. of GPUs      J    K     J    K       J     K     J     K       J       K         J       K
1               0.9  4.7   1.0  6.2     8.3  45.4   9.2  69.8    175.9  2173.6     156.9  1719.4
2               0.4  3.2   0.5  4.3     4.1  33.7   4.7  43.2     85.7  1327.7      79.3  1035.2
3               0.4  2.2   0.3  3.3     2.8  27.1   3.2  33.6     57.2  1009.2      53.6   781.7
4               0.2  1.7   0.3  2.5     2.1  23.3   2.4  26.8     43.4   861.7      40.2   639.7

                 def2-QZVP (OpenCL/MPI)
no. of GPUs      J       K
8               22.0   452.6
12              14.8   303.8
16              11.2   230.3
20               9.2   185.4
24               7.9   155.5

(a) OpenCL-based calculations were executed on AMD FirePro 3D W8100 GPUs with Intel Xeon E5-2620 v3 (2.40 GHz) CPUs; CUDA-based calculations were executed on NVIDIA GTX Titan Black GPUs with Intel Xeon E5-2620 v2 (2.10 GHz) CPUs.

The evaluation of exchange-type integrals requires far more shared memory than Coulomb integrals for the same l-qn combination. Here, the number of intermediate integrals Kμλ for the batch (μν|λσ) is (lμ+1)(lμ+2)(lλ+1)(lλ+2)/4. Thus, if μ and λ both represent f-functions, 50 kB of shared memory are needed for 8 × 8 thread blocks. The subdivision into batches during the code generation is again determined via a threshold; in this case, we first subdivide shell λ and only batch shell μ if necessary. In both the Coulomb- and exchange-type integral algorithms, we pass the number of integral batches as a third dimension to the execution grid and evaluate all batches during a single GPU execution call, similar to the algorithm outlined in ref 13. Thus, the shared memory variables are defined in the entering GPU kernel function and passed down to the corresponding GPU subroutines.
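The corresponding memory estimate for the exchange case, together with the stated strategy of first subdividing the λ shell and only then the μ shell, can be sketched as follows; the function names and the simple subdivision model are assumptions for illustration and not the actual code-generator logic.

#include <utility>

// Number of Cartesian components of a shell with angular momentum l: (l+1)(l+2)/2.
constexpr int ncart(int l) { return (l + 1) * (l + 2) / 2; }

// Shared memory (bytes) for the intermediate K_{mu,lambda} block of a (mu nu|lambda sigma)
// batch: ncart(l_mu)*ncart(l_lambda) doubles per thread of an 8x8 block.
constexpr int k_shared_bytes(int l_mu, int l_lambda, int block_threads = 8 * 8) {
    return ncart(l_mu) * ncart(l_lambda) * block_threads * 8;
}

// Illustrative batching decision: first split the lambda shell into pieces; only if
// even a fully split lambda shell does not fit, split the mu shell as well.
std::pair<int, int> choose_subdivision(int l_mu, int l_lambda, int shared_limit_bytes) {
    const int total = k_shared_bytes(l_mu, l_lambda);
    for (int split_lambda = 1; split_lambda <= ncart(l_lambda); ++split_lambda)
        if (total / split_lambda <= shared_limit_bytes)
            return {1, split_lambda};                       // {mu splits, lambda splits}
    for (int split_mu = 2; split_mu <= ncart(l_mu); ++split_mu)
        if (total / (split_mu * ncart(l_lambda)) <= shared_limit_bytes)
            return {split_mu, ncart(l_lambda)};
    return {ncart(l_mu), ncart(l_lambda)};                  // fully batched fallback
}

// Example: l_mu = l_lambda = 3 (two f-shells) gives 100 intermediates and 51200 bytes
// (50 kB) per 8x8 block, so a 32 kB limit forces a subdivision of the lambda shell.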

III. ILLUSTRATIVE CALCULATIONS

We implemented the hybrid integral engine in our GPU-accelerated ab initio program FERMIONS++.12,14 For all calculations, tight thresholds for preselection12 and integral screening (ϑpre = 10^−4, ϑint = 10^−10) are employed, and the DFT grid is constructed from 99 radial points (LOG3 grid21) and 590 angular points (Lebedev−Laikov22). Note that the evaluation of the Coulomb integrals is done exclusively on GPUs, whereas the exchange-correlation potential is evaluated entirely on CPUs; only the generation of the molecular grid, using the scheme proposed by Stratmann et al.,23 is done on GPUs.

III.A. Hybrid GPU/CPU Calculations. As a first example, we analyzed the performance of our hybrid scheme for a series of DNA fragments of up to 16 adenine−thymine base pairs with 1052 atoms for the largest system. In Figure 3 the wall times for the exchange-potential evaluation using different CPU/GPU ratios as well as the times for the pure GPU-based evaluation are given. Note that each GPU node offers 12 CPU cores, i.e., the CPU/GPU ratios are 11/1, 10/2, 9/3, 8/4, and 47/24 (one master thread is used for bookkeeping within the MPI setup). As can be seen, the use of both CPUs and GPUs is always advantageous, and, as can be expected, the speedup as compared to the pure GPU-based calculation is largest for the highest CPU/GPU ratio. In contrast, the speedup is virtually negligible for an abundance of GPU devices (24 GPUs). For the single adenine−thymine base pair we also analyzed the impact of the basis set size. In Figure 4 the wall times for the evaluation of the exchange potential using a def2-SV, def2-SVP, and def2-TZVP basis are shown. In addition to the trend regarding the CPU/GPU ratios, one can also observe that the benefit of our hybrid CPU/GPU scheme over the pure GPU-based evaluation becomes more significant with increasing basis set size.

Note that the different systems as well as the different CPU/GPU setups also require different batching strategies for optimal performance. As mentioned in the previous section, it is most efficient for GPUs to maximize the batch sizes for smaller systems or if only a small number of GPUs is available. Thus, all calculations with up to 4 GPUs perform best with a target of one batch per l-qn combination, except for the system containing 16 base pairs using 4 GPUs, where a target of 4 batches per l-qn combination is more efficient. In the case of the MPI-based calculations with 24 GPUs, the smaller systems with up to 8 base pairs perform best with a target of 4 batches, while the largest system excels with a target of 8 batches per l-qn combination.

As a second example, we timed the Fock build for PBE0 calculations24,25 of the β-carotene molecule (96 atoms) using def2-SVP, def2-TZVP, and def2-QZVP basis sets.26 Here, the results using two types of GPU servers, one with 4 AMD FirePro 3D W8100 (OpenCL) and one with 4 NVIDIA GTX Titan Black (CUDA), are shown in Table I.

Figure 5. Speedup of Coulomb- and exchange-type calculations for β-carotene using PBE0/def2-QZVP with up to 6 GPU nodes/24 GPUs. The exchange speedup marked with an asterisk uses the single-node calculation with 4 GPUs as reference (renormed to 4). See the text for further explanations.


Table II. Wall Times in Seconds for the Evaluation of the Coulomb, Exchange, and Exchange-Correlation Potentials for Guanine, C20H42, and C80H162 Using PBE0/def2-SVP or PBE0/def2-TZVP on a Single CPU Node with 4 Intel Xeon E7-4820 (2.00 GHz) Processors and in Total 32 CPU Cores

                 guanine [SVP]        C20H42 [SVP]               C20H42 [TZVP]              C80H162 [SVP]
no. of cores     JK      XC        J       K       XC        J        K       XC        J        K       XC
 1             13.57   18.86     41.31   83.74   69.89     327.75   833.65  177.52    191.53   747.59  434.89
 2              6.82    9.49     21.00   41.70   35.12     165.79   419.02   88.69     97.96   373.35  226.52
 4              3.44    5.08     10.83   21.15   17.67      85.86   208.81   44.41     50.35   189.45  111.52
 8              1.85    2.60      5.93   10.88    9.42      43.46   105.14   23.60     26.54    99.48   59.13
16              1.04    1.42      3.33    5.67    5.30      23.09    56.82   12.88     15.48    55.20   32.11
32              0.78    0.78      2.16    3.67    2.94      12.49    28.93    6.92      9.93    32.45   17.20

Figure 6. Speedup of Coulomb, exchange, and exchange-correlation type calculations for guanine, C20H42, and C80H162 using PBE0/def2-SVP or PBE0/def2-TZVP on a single server with 32 CPU cores (Intel Xeon E7-4820 [2.00 GHz]).

Table III. Wall Times in Seconds for the Evaluation of the Coulomb, Exchange, and Exchange-Correlation Potentials for Guanine, C20H42, and C80H162 Using PBE0/def2-SVP or PBE0/def2-TZVP on up to 6 CPU Nodes with 2 Intel Xeon E5-2620 v3 (2.40 GHz) and 12 CPU Cores Per Node

                 guanine [SVP]      C20H42 [SVP]      C20H42 [TZVP]      C80H162 [SVP]
no. of cores     JK      XC        JK      XC         JK      XC         JK       XC
12              1.14    0.74      7.99    3.66      62.81    8.40      162.98   21.47
24              0.75    0.41      4.37    1.88      35.84    4.43       88.40   11.43
36              0.56    0.27      3.05    1.30      23.29    3.04       58.59    8.09
48              0.51    0.23      2.37    0.97      18.64    2.31       44.08    6.74
60              0.59    0.19      2.09    0.82      15.74    1.91       34.75    5.77
72              0.96    0.16      2.14    0.68      13.72    1.67       29.54    4.94

As already stressed in previous publications,12,14 the performance of the integral evaluation routines on the GPU strongly decreases with increasing l-qn in the underlying basis set. Considering further the O(N^4) scaling with respect to the size of the basis set, a strong increase in the wall times for the quadruple-ζ basis is obtained. Note that even for the def2-QZVP basis the Coulomb evaluation is most efficient if executed only on GPUs, while for the exchange evaluation the use of CPUs is crucial. For the latter, the GPU kernel for the highest l-qn combination (gg|gg) is actually unstable when executed on the available GPU devices. For the largest basis set we also executed OpenCL/MPI-based calculations with up to 6 GPU nodes, totalling 24 AMD GPUs and 71 CPU cores. Here, we did not adapt the target number of GPU batches, i.e., we always used the maximized batches that are ideal for high CPU/GPU ratios.

The speedups for the Coulomb- and exchange-type calculations are shown in Figure 5. Note that the speedup for the Coulomb part is almost ideal, while the speedup for the exchange integrals is less favorable if the single-GPU calculation is used as reference. Considering the speedup of the OpenCL/MPI-parallelized calculations only, i.e., using the single-node calculation employing 4 GPUs as reference and renorming the speedup to 4, we again obtain an almost ideal speedup. Finally, it should be stressed that the performance on AMD GPUs is superior to that on NVIDIA GPUs for the split-valence and triple-ζ basis sets, while the calculation with the quadruple-ζ basis is, in contrast, significantly faster on the chosen NVIDIA GPUs. A first analysis of the single l-qn combination batches has shown that some kernels are more or less efficient on AMD or NVIDIA GPUs, although no clear pattern could be identified.



Here, a more detailed analysis as well as hardware-specific optimizations are clearly necessary in the future.

III.B. Strong-Scaling CPU Calculations. As a final example, we present pure CPU-based timings of small- and medium-sized systems in order to analyze the strong-scaling parallelization provided by the hybrid integral engine. The wall times for single-node calculations with up to 32 CPU cores are shown in Table II for guanine, C20H42, and C80H162 using PBE0/def2-SVP or PBE0/def2-TZVP. For the linear alkanes we also employed the linear-scaling CFMM27 and LinK18,19 methods for the Coulomb and exchange integrals, respectively, while for the smaller guanine molecule a quadratic-scaling combined evaluation of the Coulomb and exchange terms was employed. As can be seen from Figure 6, we obtain almost ideal speedups for the evaluation of the exchange-correlation term (left) and still very good speedups for the J/K evaluations. Note that the least effective speedup, obtained for guanine, refers to wall times of less than 1 s. Additionally, we also performed MPI-parallelized calculations with up to 6 InfiniBand-connected CPU nodes and a total of 71 CPU cores (Table III). Note that in this case all calculations were executed with the quadratic-scaling algorithm, which evaluates both the Coulomb and exchange terms together, since so far no MPI-based parallelization of the far-field contribution to the Coulomb term has been implemented. For the larger systems/basis sets we still obtain good strong-scaling behavior, while for the smaller guanine system the communication overhead starts to hamper the overall performance.




IV. CONCLUSION

We presented an integral evaluation engine that efficiently combines CPUs and GPUs to accelerate the evaluation of Coulomb- and exchange-type potential matrices in ab initio calculations. We showed that the performance is significantly enhanced, in particular for increasing CPU/GPU ratios in the computational setup. The additional use of CPU cores also allows, for the first time, the use of large basis sets in the context of exchange-potential calculations, as shown for the example of β-carotene using a quadruple-ζ basis set. Furthermore, it was shown that our hybrid integral engine ensures a strong-scaling parallelization on single or multiple compute nodes using C++11 threads and fast-interconnect MPI, respectively. This favorable scaling behavior is also obtained in pure CPU-based calculations. Note that this strong-scaling parallelization can also provide an efficient pathway to molecular dynamics simulations for small- to medium-sized molecular systems.



ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jctc.6b01166. Configuration setup and Cartesian coordinates (PDF)



REFERENCES

(1) Yasuda, K. Two-electron integral evaluation on the graphics processor unit. J. Comput. Chem. 2008, 29, 334−342.
(2) Yasuda, K. Accelerating Density Functional Calculations with Graphics Processing Units. J. Chem. Theory Comput. 2008, 4, 1230−1236.
(3) Ufimtsev, I. S.; Martínez, T. J. Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation. J. Chem. Theory Comput. 2008, 4, 222−231.
(4) Ufimtsev, I. S.; Martínez, T. J. Graphical Processing Units for Quantum Chemistry. Comput. Sci. Eng. 2008, 10, 26−34.
(5) Ufimtsev, I. S.; Martínez, T. J. Quantum Chemistry on Graphical Processing Units. 2. Direct Self-Consistent-Field Implementation. J. Chem. Theory Comput. 2009, 5, 1004−1015.
(6) Luehr, N.; Ufimtsev, I. S.; Martínez, T. J. Dynamic Precision for Electron Repulsion Integral Evaluation on Graphical Processing Units (GPUs). J. Chem. Theory Comput. 2011, 7, 949−954.
(7) Isborn, C. M.; Luehr, N.; Ufimtsev, I. S.; Martínez, T. J. Excited-State Electronic Structure with Configuration Interaction Singles and Tamm-Dancoff Time-Dependent Density Functional Theory on Graphical Processing Units. J. Chem. Theory Comput. 2011, 7, 1814−1823.
(8) Wu, X.; Koslowski, A.; Thiel, W. Semiempirical Quantum Chemical Calculations Accelerated on a Hybrid Multicore CPU-GPU Computing Platform. J. Chem. Theory Comput. 2012, 8, 2272−2281.
(9) Asadchev, A.; Allada, V.; Felder, J.; Bode, B.; Windus, T. L.; Gordon, M. S. Uncontracted Rys Quadrature Implementation of up to g Functions on Graphical Processing Units. J. Chem. Theory Comput. 2010, 6, 696−704.
(10) Asadchev, A.; Gordon, M. S. New Multithreaded Hybrid CPU/GPU Approach to Hartree−Fock. J. Chem. Theory Comput. 2012, 8, 4166−4176.
(11) Titov, A. V.; Ufimtsev, I. S.; Luehr, N.; Martinez, T. J. Generating Efficient Quantum Chemistry Codes for Novel Architectures. J. Chem. Theory Comput. 2013, 9, 213−221.
(12) Kussmann, J.; Ochsenfeld, C. Pre-selective screening for matrix elements in linear-scaling exact exchange calculations. J. Chem. Phys. 2013, 138, 134114.
(13) Maurer, S.; Kussmann, J.; Ochsenfeld, C. A Reduced Scaling J-Engine Based Reformulation of SOS-MP2 using Graphics Processing Units. J. Chem. Phys. 2014, 141, 051106.
(14) Kussmann, J.; Ochsenfeld, C. Pre-Selective Screening for Linear-Scaling Exact Exchange Gradient Calculations for Graphics Processing Units and General Strong-Scaling Massively-Parallel Calculations. J. Chem. Theory Comput. 2015, 11, 918−922.
(15) Miao, Y.; Merz, K. M. Acceleration of Electron Repulsion Integral Evaluation on Graphics Processing Units via Use of Recurrence Relations. J. Chem. Theory Comput. 2013, 9, 965−976.
(16) Miao, Y.; Merz, K. M. Acceleration of High Angular Momentum Electron Repulsion Integrals and Integral Derivatives on Graphics Processing Units. J. Chem. Theory Comput. 2015, 11, 1449−1462.
(17) Kussmann, J.; Ochsenfeld, C. Employing OpenCL to Accelerate Ab Initio Calculations on Graphics Processing Units. J. Chem. Theory Comput. 2017, 13, 2712−2716.
(18) Ochsenfeld, C.; White, C. A.; Head-Gordon, M. Linear and sublinear scaling formation of Hartree-Fock-type exchange matrices. J. Chem. Phys. 1998, 109, 1663−1669.
(19) Ochsenfeld, C. Linear scaling exchange gradients for Hartree-Fock and hybrid density functional theory. Chem. Phys. Lett. 2000, 327, 216−223.
(20) Häser, M.; Ahlrichs, R. Improvements on the direct SCF method. J. Comput. Chem. 1989, 10, 104−111.
(21) Mura, M. E.; Knowles, P. J. Improved radial grids for quadrature in molecular density-functional calculations. J. Chem. Phys. 1996, 104, 9848−9858.
(22) Lebedev, V. I.; Laikov, D. N. A quadrature formula for the sphere of the 131st algebraic order of accuracy. Doklady Mathematics 1999, 59, 477−481.
(23) Stratmann, R. E.; Scuseria, G.; Frisch, M. J. Achieving linear scaling in exchange-correlation density functional quadratures. Chem. Phys. Lett. 1996, 257, 213−223.
(24) Perdew, J. P.; Burke, K.; Ernzerhof, M. Generalized Gradient Approximation Made Simple. Phys. Rev. Lett. 1996, 77, 3865−3868.
(25) Perdew, J. P.; Burke, K.; Ernzerhof, M. Generalized Gradient Approximation Made Simple [Phys. Rev. Lett. 77, 3865 (1996)]. Phys. Rev. Lett. 1997, 78, 1396−1396.
(26) Weigend, F.; Ahlrichs, R. Balanced basis sets of split valence, triple zeta valence and quadruple zeta valence quality for H to Rn: Design and assessment of accuracy. Phys. Chem. Chem. Phys. 2005, 7, 3297−3305.
(27) White, C. A.; Johnson, B. G.; Gill, P. M. W.; Head-Gordon, M. The continuous fast multipole method. Chem. Phys. Lett. 1994, 230, 8−16.

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. ORCID

Christian Ochsenfeld: 0000-0002-4189-6558 F


