
Arbitrary Angular Momentum Electron Repulsion Integrals with Graphical Processing Units: Application to the Resolution of Identity Hartree−Fock Method

Jaroslaw Kalinowski, Frank Wennmohs, and Frank Neese*
Molecular Theory and Spectroscopy, Max-Planck Institute for Chemical Energy Conversion, Muelheim an der Ruhr D-45470, Germany
Received: January 11, 2017. Published: June 12, 2017. DOI: 10.1021/acs.jctc.7b00030

ABSTRACT: A resolution of identity based implementation of the Hartree−Fock method on graphical processing units (GPUs) is presented that is capable of handling basis functions with arbitrary angular momentum. For practical reasons, only functions up to (ff|f) angular momentum are presently calculated on the GPU, thus leaving the calculation of higher angular momentum integrals on the CPU of the hybrid CPU-GPU environment. Speedups of up to a factor of 30 are demonstrated relative to state-of-the-art serial and parallel CPU implementations. Benchmark calculations with over 3500 contracted basis functions (def2-SVP or def2-TZVP basis sets) are reported. The presented implementation supports all devices with OpenCL support and is capable of utilizing multiple GPU cards over either MPI or OpenCL itself.



INTRODUCTION

Over the past 20 years, computational chemistry has become one of the most rapidly growing areas of chemistry. Calculations are nowadays used routinely to interpret experiments or to hypothesize the outcome of future experiments. While the theoretical methodology has been known for a long time, the lack of computing power proved prohibitive until more recent years. Nowadays, the capacity of modern computing devices allows for a very accurate treatment of even large systems. However, there is always a need for higher accuracy and shorter turnaround times. In striving for a more favorable cost/performance ratio, there are two main approaches. The first is through new methodological developments, where new approximations are developed that offer better scaling with the size of the studied system or smaller prefactors. The second is more computer science centered, where new algorithms and computer architectures are used to speed up already known theoretical chemistry methods. While methodological developments are critical and offer unparalleled advantages when reaching for larger and more realistic systems, it is important to stay in touch with current state-of-the-art computer science.

One of the largest advancements in high performance computing over the last years is the usage of Graphical Processing Units (GPUs) for general purpose computing. Traditional Central Processing Units (CPUs) and GPUs are based on very different philosophies. The CPU is optimized to work on a single task at a time; GPUs, on the other hand, are made for problems that offer very high data parallelism, in which simple operations are performed on multiple data sets. Fortunately, quantum chemical calculations frequently offer a high degree of data parallelism. For example, operations like tensor contractions and linear algebra operations are at the heart of most quantum chemical calculations and can be parallelized with close to ideal scaling.

Until 2006, the usage of GPUs for general purpose computing was very difficult, as program developers had to use the equivalent of a graphics application programming interface (API) to access the computing potential of GPUs. Essentially, the data to be processed needed to be expressed as an image, such that calculations like matrix−matrix multiplication would be expressed as operations on that image. Even with higher level programming environments being developed, applications for such calculations were very limited. Everything changed in 2007 when NVIDIA released the Compute Unified Device Architecture (CUDA). NVIDIA devoted chip area to facilitate the ease of parallel programming. This was a change in hardware that enabled quantum chemical applications (among many others) to utilize GPUs. To this day, CUDA is probably the most widespread way of exploiting GPU architectures for general purpose computing. Another pathway to GPU computing is the OpenCL framework, which was released in 2009. OpenCL is designed for hybrid computing across different platforms, including CPU-GPU hybrid computing. The advantage of OpenCL over CUDA is the fact that it is supported by a wide range of hardware, while CUDA is a closed project supported only by NVIDIA's hardware. Even though new frameworks like CUDA or OpenCL have made it much easier to adapt existing codes to GPU architectures, the efficient use of GPUs still requires a great deal of careful attention and low level programming.


Although initially general purpose GPU calculations were limited in precision (single precision), the promised computational potential of the new architecture already stimulated a host of successful development work.1 Codes for computational chemistry applications from electronic structure theory,1d−g,2 through ab initio molecular dynamics,3 to empirical force fields1b,f,4 quickly emerged and were put into use. While methods based on force fields are quite straightforward to implement, electronic structure methods are in desperate need of speedups. By now, there are GPU codes available for ab initio methods ranging from elementary Hartree−Fock (HF)2e,5 calculations to advanced electron correlation methods.6

When developing software, including scientific software like ab initio electronic structure packages, it is not only performance that needs to be considered. One of the most important factors to justify the cost of code development is how well it may be executed by a large user population. The world of GPU computing is somewhat unique in this matter, as general purpose GPU computing, the way we know it today, was introduced and supported by only one hardware vendor, NVIDIA, for a significant length of time. Therefore, to this date most applications of GPUs for scientific computing, and high-performance computing in general, are based on the CUDA framework. The problem is not only that users are forced to use hardware from only one vendor but also that the codes developed are bound to a single closed framework which is supported by only one commercial company. For this reason, it is in the best interest of the scientific community to look for other ways to utilize the computing potential of GPUs with any hardware a user prefers.

One of the most elementary computational problems in quantum chemistry is constituted by the first step of any ab initio calculation, namely the Hartree−Fock (HF) method. The HF approach requires the calculation of a large number of two-electron repulsion integrals (ERIs). While most high level ab initio methods can be accelerated by the development and judicious application of Basic Linear Algebra Subprograms (BLAS), code for ERIs needs to be mostly developed from scratch. Therefore, the development of ERI codes for GPU architectures was one of the first problems approached by the community.1g,2d However, available solutions are still limited to rather small angular momentum or low efficiency compared to state-of-the-art CPU codes. The main goal of this work is to present an efficient code for Hartree−Fock calculations on heterogeneous CPU-GPU architectures that offers a high speedup in comparison to state-of-the-art CPU codes without binding the user to one specific hardware vendor. The code is designed, from the very beginning, to be easily utilized by as many users as possible.

TWO-ELECTRON REPULSION INTEGRALS

The essential part of any ab initio calculation is the calculation of a very large number of two-electron repulsion integrals (ERIs)

$(ab|cd) = \int\!\!\int \varphi_a(\mathbf{r}_1)\,\varphi_b(\mathbf{r}_1)\, r_{12}^{-1}\, \varphi_c(\mathbf{r}_2)\,\varphi_d(\mathbf{r}_2)\, d\mathbf{r}_1\, d\mathbf{r}_2 \qquad (1)$

where φ are one-electron basis functions, and r refers to electronic coordinates. In most cases, atom-centered Gaussian functions are used as basis functions

$\varphi_a(\mathbf{r}) = (x - A_x)^{l_a}\,(y - A_y)^{m_a}\,(z - A_z)^{n_a}\, e^{-\alpha_a(\mathbf{r}-\mathbf{A})^2} \qquad (2)$

where $A_n$ represents the nth coordinate of the Ath nucleus the basis function is centered on. For better accuracy, contracted basis functions are used, which means that any actual basis function is a linear combination of a set of primitive Gaussian basis functions. The contracted ERIs are then sums over primitive ERIs:

$(ab|cd) = \sum_{k_1=1}^{K_a}\sum_{k_2=1}^{K_b}\sum_{k_3=1}^{K_c}\sum_{k_4=1}^{K_d} c_{ak_1}\, c_{bk_2}\, c_{ck_3}\, c_{dk_4}\, [\varphi_{ak_1}\varphi_{bk_2}|\varphi_{ck_3}\varphi_{dk_4}] \qquad (3)$

Square brackets denote the so-called primitive ERIs, while round brackets denote the contracted ERIs that are used for actual calculations. The main properties of a basis function and an ERI are the total angular momenta $L_a = l_a + m_a + n_a$ and $L = L_a + L_b + L_c + L_d$, respectively. The code for the evaluation of ERIs is very important since the cost of these calculations scales as O(N^4). Although this is less than most post-Hartree−Fock methods, the base (N) of ERIs can be extremely large, thus easily dominating the computation time when poor algorithms are involved. The primitive integrals can be evaluated analytically, and numerous algorithms for this have been developed since the pioneering work of Boys.7 Most algorithms rely on recurrence relations between integrals with different angular momenta. Algorithms such as Rys quadrature,8 McMurchie−Davidson,9 Obara−Saika,10 Head-Gordon−Pople,11 and PRISM12 are designed to maximize the use of these recurrence relations and factorize expressions for contracted ERIs of the same total angular momenta, thus trading floating point operations (FLOPs) for higher memory usage.

While the choice of algorithm for calculations on a CPU is quite straightforward, this is not so clear for GPU architectures. When using CPUs, every factorization is desirable if one assumes that there is enough memory, and, in the case of the serial calculation of ERIs on a CPU, even ERIs with the highest angular momentum require comparatively little memory. In a GPU framework, however, the strategy is less clear. On one side, the evaluation of ERIs on a GPU is very intuitive, as they offer incredible data parallelism connected with a very high FLOP cost. However, when calculating millions of ERIs at the same time, the memory requirements for every single integral batch quickly become non-negligible. Huge memory requirements for the parallel evaluation of all ERIs are problematic on GPUs, as the memory is very limited in quantity and has high latency. In other words, when working with GPU architectures, trading a lower FLOP count for a higher memory requirement is not always desirable. Therefore, the choice of algorithm and its implementation is a nontrivial task for ERI calculations. For this work, we chose the McMurchie−Davidson (MD) algorithm9 as it offers a good compromise, with a reasonable FLOP count for high angular momentum integrals and relatively modest memory requirements. At the same time, its mathematical elegance offers a high degree of flexibility in the implementation.

The MD scheme introduces new six-index integrals in the calculation of a primitive four-index integral [ab|cd]

$[ab|cd] = [ab\mathbf{0}|cd\mathbf{0}] \qquad (4)$

which are used to evaluate the bra and ket parts of the integral separately using the following recurrence relations

$[(a+1_i)bp| = p_i\,[ab(p-1_i)| + (P_i - A_i)\,[abp| + \tfrac{1}{2\alpha_p}\,[ab(p+1_i)| \qquad (5)$


$|(c+1_i)dq] = q_i\,|cd(q-1_i)] + (Q_i - C_i)\,|cdq] + \tfrac{1}{2\alpha_q}\,|cd(q+1_i)] \qquad (6)$

where the subscript i denotes a Cartesian direction, and $1_i$ is the unit vector of the i-direction. All the required [00p|00q] integrals can be simplified into a set of one-index integrals

$[\mathbf{0}\mathbf{0}p\,|\,\mathbf{0}\mathbf{0}q] = [p|q] = (-1)^{q}\,[p+q] \qquad (7)$

With the assumptions

$[r] = [r]^0 \qquad (8)$

$[0]^m = (ss|ss)^m \qquad (9)$

and the recurrence relation

$[r]^m = (Q_i - P_i)\,[r-1_i]^{m+1} - (r_i - 1)\,[r-2_i]^{m+1} \qquad (10)$

these one-index integrals can be easily calculated.
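As an illustration, the following minimal C routine fills the one-index integrals for a single Cartesian direction from the Boys-function-derived starting values of eq 9, following the recurrence of eq 10 as reconstructed above. It is a sketch for exposition only; the function name, the fixed table size, and the one-dimensional restriction are assumptions and do not correspond to the production kernels.

```c
/* Sketch (one Cartesian direction only): builds the one-index
   integrals [t]^0 of eq 8 from the values F[m] = [0]^m of eq 9,
   using the recurrence of eq 10 with X_QP = Q_i - P_i.        */
void hermite_one_index_1d(const double *F, double X_QP, int tmax, double *R)
{
    double r[16][16] = {{0.0}};            /* work table r[t][m]; assumes tmax < 16 */
    for (int m = 0; m <= tmax; ++m)
        r[0][m] = F[m];                    /* eq 9: [0]^m = (ss|ss)^m               */
    for (int t = 1; t <= tmax; ++t)
        for (int m = 0; m <= tmax - t; ++m) {
            r[t][m] = X_QP * r[t - 1][m + 1];           /* (Q_i - P_i)[t-1]^(m+1)  */
            if (t > 1)
                r[t][m] -= (t - 1) * r[t - 2][m + 1];   /* -(t-1)[t-2]^(m+1)       */
        }
    for (int t = 0; t <= tmax; ++t)
        R[t] = r[t][0];                    /* eq 8: [t] = [t]^0                     */
}
```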



RESOLUTION OF IDENTITY

Considering how computationally demanding the calculation of all four-center ERIs is, it was natural to develop approximate methods for this step. This was achieved by introducing an auxiliary basis set in addition to the regular atom-centered basis functions. The so-called resolution of identity (RI) approximation boils down to the following equation

$(ab|cd) \approx (ab|cd)_{\mathrm{RI}} = \sum_{PQ} (ab|P)\, V^{-1}_{PQ}\, (Q|cd) \qquad (11)$

where P and Q label auxiliary basis set functions, and $V^{-1}_{PQ}$ denotes an element of the inverse of the matrix

$V_{PQ} = (P|Q) = \int\!\!\int \frac{P(\mathbf{r}_1)\,Q(\mathbf{r}_2)}{r_{12}}\, d\mathbf{r}_1\, d\mathbf{r}_2 \qquad (12)$

In practice, the explicit calculation of the inverse of the V matrix is avoided in favor of a Cholesky decomposition, but this is a relatively minor technical detail for the purpose of this paper. The advantage of the RI approximation is that it decomposes four-index ERIs into two- and three-index integrals, thus allowing for the additional factorization of many equations. In the case of HF calculations, the main advantage is that in the Coulomb terms one can apply the following factorization (for closed-shell systems)13

$J_{ab} \approx J^{\mathrm{RI}}_{ab} = \sum_{PQ} (ab|P)\, V^{-1}_{PQ} \sum_{cd} (Q|cd)\, D_{cd} \qquad (13)$

$D_{cd} = 2\sum_i c_{ci}\, c_{di} \qquad (14)$

where i denotes occupied orbitals, and c denotes molecular orbital expansion coefficients. This factorization reduces the formal scaling from O(N^4) to O(N^3).
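To make the data flow of eqs 13 and 14 concrete, the following is a hedged sketch that assumes the three-index integrals (ab|P) are held as a dense row-major matrix B over a compound pair index, and that the density is a vector D over the same pairs. The buffer names, the layout, and the use of host-side CBLAS calls are purely illustrative; in the actual implementation these contractions run on the device.

```c
#include <cblas.h>

/* Sketch of the RI-J contraction of eqs 13-14.
   B:    Npair x Naux matrix of (ab|P), row-major (assumed layout)
   Vinv: Naux x Naux matrix V^-1
   D:    density vector over pairs, J: Coulomb matrix as a pair vector
   gamma, h: scratch vectors of length Naux                          */
void ri_coulomb(const double *B, const double *Vinv, const double *D,
                double *J, int Npair, int Naux, double *gamma, double *h)
{
    /* gamma_Q = sum_cd (Q|cd) D_cd      ->  gamma = B^T * D */
    cblas_dgemv(CblasRowMajor, CblasTrans, Npair, Naux,
                1.0, B, Naux, D, 1, 0.0, gamma, 1);
    /* h_P = sum_Q V^-1_PQ gamma_Q */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, Naux, Naux,
                1.0, Vinv, Naux, gamma, 1, 0.0, h, 1);
    /* J_ab = sum_P (ab|P) h_P           ->  J = B * h */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, Npair, Naux,
                1.0, B, Naux, h, 1, 0.0, J, 1);
}
```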



Unfortunately, the benefits for the exchange part of the Fock matrix are not as significant, as this equation does not factorize but is only reduced to the following equations:14

$X^i_{Pa} = \sum_b c_{bi}\, (ab|P) \qquad (15)$

$Y^i = V^{-1/2}\, X^i \qquad (16)$

$K^{\mathrm{RI}} = \sum_i (Y^i)^T\, Y^i \qquad (17)$

These expressions for the exchange allow for some high-level optimizations of the implementation but leave the O(N^4) scaling unchanged. In addition, the practical scaling of the Coulomb part is reduced by integral prescreening to O(N^2). Although the RI approximation does not reduce the actual scaling of the HF calculation, it significantly reduces the prefactor and simplifies the whole calculation without introducing notable additional problems. Because three-index integrals are easier to calculate than four-index integrals, they can also be implemented more efficiently on GPU architectures. In this work, we exclusively focus on RI integrals and RI-JK HF calculations because of the simplicity of the individual integrals and the high demand for RI integrals in highly correlated ab initio methods.
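For orientation, a minimal sketch of one pass through eqs 15−17 is given below, again with host-side CBLAS calls only for clarity; the tensor layout (B stored as B[(a*N+b)*Naux+P]), the MO coefficient layout, and all names are assumptions, and in the actual implementation these multiplications are performed on the device with the dgemm kernels described later.

```c
#include <cblas.h>
#include <string.h>

/* Sketch of the exchange build of eqs 15-17 for one SCF iteration.
   B: dense (ab|P) integrals, C: MO coefficients C[b*nocc+i],
   Vm12: V^(-1/2), K: exchange matrix (N x N), X/Y: scratch (N x Naux). */
void ri_exchange(const double *B, const double *C, const double *Vm12,
                 double *K, int N, int Naux, int nocc,
                 double *X, double *Y)
{
    memset(K, 0, (size_t)N * N * sizeof(double));
    for (int i = 0; i < nocc; ++i) {
        /* eq 15: X[a][P] = sum_b C[b][i] (ab|P) */
        for (int a = 0; a < N; ++a)
            for (int P = 0; P < Naux; ++P) {
                double s = 0.0;
                for (int b = 0; b < N; ++b)
                    s += C[(size_t)b * nocc + i] * B[((size_t)a * N + b) * Naux + P];
                X[(size_t)a * Naux + P] = s;
            }
        /* eq 16: Y = X * V^(-1/2)  (V^(-1/2) is symmetric)      */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, Naux, Naux, 1.0, X, Naux, Vm12, Naux, 0.0, Y, Naux);
        /* eq 17: K += Y * Y^T  (contraction over the auxiliary index) */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                    N, N, Naux, 1.0, Y, Naux, Y, Naux, 1.0, K, N);
    }
}
```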

OPENCL

Next to CUDA, OpenCL is the main framework for general purpose GPU computing; unlike CUDA, however, it is relatively platform independent, as it is supported by a wide range of devices and vendors. This wide device support is the primary reason for its use in this work. While OpenCL offers high platform independence, this generality does not come without additional problems. In order for the same code to be able to run on a wide variety of devices, it must be written in a relatively high-level language, as the instruction sets of these devices are not standardized. While for most applications the C-based programming language used in OpenCL offers enough flexibility for optimizations, the efficiency of many computationally intensive routines could improve greatly with access to lower levels. In addition to being bound to the use of the C language, the compilation of the code executed on the GPU device needs to happen at runtime, since the instruction set is highly device dependent. The need for additional code compilation at runtime can, in the case of large kernels, easily negate any speedup achieved from executing the code on a GPU device. Therefore, the time of the GPU code compilation needs to be taken into account and optimized to find the best compromise between calculation and compilation times.
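The runtime compilation step can be illustrated with the standard OpenCL host calls. The sketch below builds one generated kernel from a source string and times the build, so that the compilation cost can be weighed against the expected execution time; the function, its arguments, and the absence of error handling are illustrative simplifications, not the actual ORCA interface.

```c
#include <CL/cl.h>
#include <stdio.h>
#include <time.h>

/* Build a runtime-generated kernel and report how long compilation took. */
cl_kernel build_integral_kernel(cl_context ctx, cl_device_id dev,
                                const char *generated_source,
                                const char *kernel_name)
{
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &generated_source,
                                                NULL, &err);
    clock_t t0 = clock();
    err = clBuildProgram(prog, 1, &dev, "", NULL, NULL); /* no extra flags */
    clock_t t1 = clock();
    printf("kernel %s built in %.2f s\n", kernel_name,
           (double)(t1 - t0) / CLOCKS_PER_SEC);
    return clCreateKernel(prog, kernel_name, &err);
}
```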

EVALUATION OF RI INTEGRALS − IMPLEMENTATION

There are two main bottlenecks for the Fock matrix generation in the case of the RI-JK method: integral generation and the tensor contractions (eqs 15, 16, and 17). While the tensor contractions are rather straightforward and can easily be done using standard mathematical library routines, integral generation is probably the most complicated part of any quantum chemistry software.

Parallel evaluation of three-index RI integrals poses certain challenges. One of the biggest problems is how work is distributed between threads to achieve load balance. Some ways of mapping the work to threads have been tested before,1g and the general results are in agreement with an intuitive trend that derives from the design of GPUs: as a rule of thumb, the less work that is assigned to each thread, the better. However, upon dividing work between threads and keeping the amount of work per thread as small as possible, it is easy to fall into one of two traps: (a) the FLOP count increases too much, thus negating the benefits, or (b) the memory demand becomes too high. Possible FLOP count increases derive from the fact that part of the work is effectively repeated on more than one thread.

Figure 1. Overview of the algorithm for calculating two-electron repulsion integrals. Data transfer is described from the GPU device's point of view; e.g., "Recv Integral List" means that the integral list is being received by the device from the host memory.

The memory issues are usually a result of fighting the FLOP count increase: intermediate values are stored in memory, thus increasing the amount of memory needed. Very often, complicated memory access patterns are difficult to optimize and lead to significant drops in efficiency. In this work, the computational load is distributed such that each individual thread calculates a single primitive integral batch at a time. An "integral batch" refers to all integrals that share the same centers, exponents, and total angular momentum per function.

One of the main problems GPU developers always face is the balancing of load between threads. While it is commonplace for threads to do different work in parallel on CPUs, this is not possible on GPUs. GPUs are Single Instruction Multiple Data (SIMD) processors; this means that all threads that work in parallel can execute only one instruction at a time. In the case that GPU threads working in parallel are assigned different instructions, they effectively work in a serial way, waiting for each other's operations to finish. Programming models for GPUs thus behave as if all threads are executed at the same time, but in practice they are executed in small batches, and only tasks within a single batch are required to share the same instructions. This creates many opportunities for advanced optimizations, where very different sets of calculations can be set to execute in parallel as long as, by proper sorting of the work-thread assignments, the work per thread is the same for all threads inside a thread batch.

Accomplishing this poses a major challenge when implementing tasks like integral prescreening, as prescreening may terminate the calculation of just a few integrals within a thread batch, essentially giving no benefit in performance. This problem was extensively covered by previous work in the area.1g,2d There it was solved by proper basis function pair sorting, which ensures that integrals skipped due to prescreening are next to each other, so that whole thread batches can be skipped. In this work, however, the prescreening is done in a completely different way. Integral prescreening by its very nature removes much of the true data parallelism from the integral calculation task and thus should be done outside of the GPU calculation. Therefore, in this work we have decided to perform the prescreening entirely on the CPU side. Consequently, the list of integrals to be calculated is created on the CPU. This list includes only integrals that pass all prescreening tests. Any prescreening criteria may be used with this implementation, as the prescreening is technically done outside of the main algorithm that handles the integral calculation. In the implementation described in this work, the same prescreening conditions as implemented in the parent ORCA15 package are applied. These are essentially standard tests based on the Schwarz inequality, as first described by Häser and Ahlrichs.16 After being created, the list is sent to the GPU, where every single thread calculates an integral batch. This additional step poses a challenge of its own: two major tasks are added, integral list generation and the CPU-to-GPU memory transfer of the integral list.
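The list construction itself is a simple double loop over bra shell pairs and auxiliary shells. The sketch below assumes precomputed Schwarz factors Q_pair = (ab|ab)^(1/2) and Q_aux = (P|P)^(1/2); the function and array names are illustrative and this is not the prescreening code used in ORCA.

```c
/* Host-side sketch: keep a task (ab|P) only if its Schwarz-type
   estimate sqrt((ab|ab)) * sqrt((P|P)) exceeds the threshold.   */
int build_integral_list(const double *Q_pair, int n_pairs,
                        const double *Q_aux,  int n_aux,
                        double thresh,
                        int *task_pair, int *task_aux)
{
    int n_tasks = 0;
    for (int ab = 0; ab < n_pairs; ++ab)
        for (int P = 0; P < n_aux; ++P)
            if (Q_pair[ab] * Q_aux[P] >= thresh) {
                task_pair[n_tasks] = ab;   /* index of the (ab| shell pair */
                task_aux[n_tasks]  = P;    /* index of the auxiliary shell */
                ++n_tasks;
            }
    return n_tasks;   /* the surviving tasks are then sent to the device */
}
```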


The memory transfer effort can easily be "hidden" by performing it in parallel with the GPU calculations. The integral list generation is a computationally heavy task that is serial in nature, and therefore it is perfectly suited to a CPU architecture. With access to lower programming levels on the CPU and low memory latency, the integral list generation can be implemented efficiently enough that it can be done in parallel to the GPU calculation, keeping the GPU calculation as the bottleneck and making the integral evaluation on the GPU as efficient as possible (see Figure 1).

In order to ensure that the GPU is used optimally, in the sense that it constantly operates under maximum load and that the turnaround time is not limited by preparatory or data transfer steps, proper work organization is paramount. There are many additional tasks beyond the integral evaluation itself that need to be accomplished. Integrals are calculated in batches but are needed in matrix form for later operations; thus, integral sorting after evaluation is required. Additionally, Cartesian integrals are calculated even though most later operations use spherical harmonics. Hence, the integrals must be transformed to the spherical harmonics basis. In principle, the sorting of integrals could be done at the calculation level by simply storing them into matrices as they are calculated. However, such on-the-fly sorting results in many nonlinear accesses to the memory on the GPU card (which has a high latency), limiting the overall performance. It is much easier and more efficient to store every single integral batch in a vector, keeping the memory access linear, and then sort the integrals into matrices using optimized memory tiles that can be used in matrix multiplications.

At this point it is clear that an efficient integral calculation cannot be done in a single kernel. In this work, integrals are divided by their angular momenta into sets, where within a single integral set all integrals have the same total angular momentum per center. Hence, a set is fully defined by three angular momenta, e.g., (ss|s), (ss|p), or (sp|s). Integral sets are then calculated on the GPU one set at a time. While one set is being calculated, the next one is being prepared, and the results of the previous one are being received. This leads to no downtime in the GPU computation and guarantees that all threads will be doing exactly the same work. A downside of this approach is that for small systems the integral sets are often not large enough, thus rendering the GPU calculation inefficient. However, such small molecules are hardly the domain of application for the proposed algorithm.

A calculation of a single integral set consists of four separate steps: Cartesian primitive integral calculation, integral contraction, transformation from Cartesians to spherical harmonics, and integral sorting. The first step, the primitive integral calculation, requires the most additional data processing of the whole process. In addition, it involves a seemingly trivial step that is not normally considered problematic in CPU implementations: the access to the actual basis set. When the primitive integral calculation starts, every thread requires the data for all three functions contributing to the integral. The integral list, however, cannot store all the atomic coordinates and expansion coefficients, as this would result in an enormous memory bottleneck. Thus, the actual basis set needs to be stored on the card, and the list needs to be built out of simple pointers to functions within the basis set. If the basis set were stored in regular global memory, this would again result in a major memory bottleneck, as a single function will be accessed by multiple threads simultaneously.
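The following C structures sketch one possible layout of such a compressed basis set and of the integral-list entries built from indices into it. All field names are illustrative assumptions and do not correspond to the actual ORCA data structures; the point is only that shell data are stored once and referenced by index.

```c
/* Sketch of a compressed basis-set layout referenced by index. */
typedef struct {
    double exponent;      /* unique primitive Gaussian exponent         */
    double coefficient;   /* corresponding contraction coefficient      */
} PrimitiveData;

typedef struct {
    int center;           /* index into the xyz coordinate array        */
    int first_prim;       /* index of the first primitive of this shell */
    int n_prim;           /* contraction depth                          */
    int l;                /* total angular momentum of the shell        */
} ShellRef;

typedef struct {
    int bra_a, bra_b;     /* shells a and b of one (ab|P) batch         */
    int ket_P;            /* auxiliary shell P                          */
} IntegralTask;           /* one entry of the prescreened integral list */
```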

Fortunately, there is a memory bank on every card designed for massively parallel access; it is called the constant memory bank, as it is read-only from the GPU code point of view. It is, however, very small, with 64 kB of memory on most modern devices. In order to fit the basis set of even the largest system into constant memory, only unique exponents and coefficients are stored. For example, when calculating an H2 system in which the same basis set is used for both hydrogens, the basis set storage will consist of two sets of xyz coordinates but just one set of Gaussian exponents. Assuming a simple 3-21G basis set, there are two basis functions per hydrogen, and one of them is contracted from two primitive Gaussians. In that case, 12 numbers are stored (6 for the two sets of xyz coordinates and 6 for exponents and contraction coefficients), while in a traditional storage scheme 18 numbers would be stored (the 6 exponents and contraction coefficients would be stored twice). Since only unique functions are stored, there is no practical limit on the system or basis set size. In principle, however, there is a limit: when adding new atoms to the system, the memory needed for the basis set grows mostly due to the atomic coordinates, as every atom has unique coordinates while there are only a few types of atoms, so not every atom introduces new unique basis function exponents. Even when storing the atomic coordinates in double precision, for most system types the theoretical limit will be about 2500 atoms (58.6 kB of memory just for the coordinates).

Another step that requires additional processing is the integral contraction. For the integral contraction, another list is prepared and sent to the GPU. In the contraction step, every thread calculates a single contracted integral out of primitive integrals. The primitive integrals are stored during their calculation, so every thread in the contraction step needs just two values: a pointer to where to start and the number of integrals to sum. This leads to major load imbalance, as the contraction depth varies quite heavily between integral batches. However, this step is so fast that the lack of load balance does not prove to be problematic. In the test calculations reported below, the contraction step always took far less than 0.1% of the overall integral calculation time.

The transformation to spherical harmonics does not require any preprocessed data other than the number of integral batches. Every single thread transforms a single integral batch. For the transformation, integrals are read from the global memory on the device into local memory and then into the register space, with proper tiling to increase the efficiency of the transfer. The transformation is then performed in register space, and the result is subsequently transferred back to global memory in a similar fashion. The same memory transfer mechanisms are applied to the integral sorting, making this memory-heavy step of the calculation one of the fastest steps of the whole process.
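The contraction step described above maps onto a very small device kernel. The sketch below lets each work-item sum a contiguous run of primitive integrals given a start offset and a count; the buffer names and the list layout are illustrative assumptions, not the actual implementation.

```c
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

/* Sketch: one work-item forms one contracted integral. */
__kernel void contract_integrals(__global const double *prim,   /* primitive integrals as stored  */
                                 __global const int    *start,  /* first primitive of each result */
                                 __global const int    *count,  /* number of primitives to sum    */
                                 __global double       *contracted,
                                 const int n_contracted)
{
    int i = get_global_id(0);
    if (i >= n_contracted) return;
    double sum = 0.0;
    int s = start[i];
    for (int k = 0; k < count[i]; ++k)   /* load imbalance is tolerated here:   */
        sum += prim[s + k];              /* this step is a tiny fraction of the */
    contracted[i] = sum;                 /* total integral calculation time     */
}
```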
The Cartesian primitive integral calculation is by far the most complicated and challenging step. Every single integral set has its own separate kernel, which is generated and compiled at runtime. Because the code needs to work on essentially every possible platform with OpenCL support, it relies heavily on the code optimization done during compilation. However, code optimization takes a significant amount of time; since this is done at runtime, this time must be taken into consideration during algorithm development. To ensure that the compilation takes as little time as possible, every loop in the kernel is unrolled, if possible. This, however, leads to a very large code size. As the kernel code is generated and compiled on-the-fly as needed, integrals of arbitrary angular momenta can be treated, in principle. However, the compilation time and the complexity of high angular momentum integrals limit the practical usage of the GPU code to (ff|f) integrals. Thus, in the current implementation, integrals with angular momenta higher than (ff|f) are calculated on the CPU. This calculation is partially done in parallel to the GPU calculation of the lower angular momentum integrals. The fact that this CPU work is partially done in parallel to the GPU work, together with the usually relatively low number of high angular momentum integrals, should lead to no major efficiency drops. Even for integrals with high angular momentum, the integral sorting is done on the GPU, and the integrals are then kept on the card for further processing.

The current implementation supports the use of multiple GPU devices in two ways: using the Message Passing Interface (MPI) with one device per MPI task (usually a single CPU thread), or using multiple GPU devices through the OpenCL interface on just one CPU thread. Parallelization via MPI has the same problems as a regular MPI parallelization of the CPU code. MPI processes do not share memory, which increases the total memory demand, since copies of the data need to be stored. In addition to the higher memory demand, the overhead associated with MPI communication can be problematic. This overhead is negligible in most CPU implementations. However, in the case of a GPU implementation, where everything aside from the thread−thread communication is significantly faster, this communication can account for a relatively high percentage of the calculation time. Parallelization over OpenCL, on the other hand, affects only the GPU side of the calculation, and all calculations on the CPU are essentially serial, which may cause the CPU side to become a bottleneck. However, due to the high level of optimization of the CPU code, the CPU side of the calculation accounts for only about 10% of the time that the GPU requires to finish. This relation allows for the efficient use of at least a few GPU devices per thread.
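A minimal host-side sketch of the second (OpenCL-only) mode is given below: all GPU devices of one platform share a context and receive their own command queue, from which integral sets can then be dispatched. Error handling is omitted, and the function and its fixed device limit are illustrative assumptions.

```c
#include <CL/cl.h>

#define MAX_DEV 8

/* Sketch: one context, one command queue per GPU, all driven by one thread. */
int setup_device_queues(cl_platform_id platform, cl_context *ctx,
                        cl_command_queue queues[MAX_DEV])
{
    cl_device_id devices[MAX_DEV];
    cl_uint n_dev = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, MAX_DEV, devices, &n_dev);
    *ctx = clCreateContext(NULL, n_dev, devices, NULL, NULL, NULL);
    for (cl_uint d = 0; d < n_dev; ++d)
        queues[d] = clCreateCommandQueue(*ctx, devices[d], 0, NULL);
    return (int)n_dev;   /* integral sets are later farmed out per queue */
}
```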

MATRIX MULTIPLICATION IMPLEMENTATION

The matrix−matrix multiplication kernels developed for the purpose of this project are very straightforward and are mostly written according to the general rules for efficient operations of this type. The main problem for matrix multiplications on GPUs is the ratio of memory accesses to FLOPs, and so most of the effort is put into decreasing this ratio. Let us assume we are multiplying two square matrices of size N. In a naive approach, 2N^3 FLOPs are performed while loading 2N^3 elements, which means a 1:1 ratio of memory loads to FLOPs. To decrease this ratio, the matrix is split into tiles. For simplicity, let us assume the tiles are also square. An A × A tile then reduces the number of memory loads to 4N^3/A. On GPUs, memory access is hidden during the calculation by having a large number of tasks performing calculations: while some of the tasks need to access memory, they are set aside to wait while other tasks are being calculated. High occupancy on each compute unit is necessary to achieve peak performance, and while large tiles reduce the pressure on memory bandwidth, they require more register space and put more pressure on memory latency, as it becomes harder and harder for the device to hide the memory accesses. The optimal tile size is very hardware dependent and may vary greatly between devices. In addition, the tiling process should be repeated for every memory layer. First, the initial matrix, which is stored in global memory, is divided into tiles that are loaded into the local memory of work-groups of threads. Then, a local tile is divided into microtiles, which are loaded into single threads where they are processed in register space. Finally, every thread calculates a single microtile of the global result matrix.

One of the biggest challenges here is to achieve close-to-optimal performance on a wide range of devices without any user interaction. This means that, for the most part, a constant tile size is necessary. A constant tile size means that the performance on most devices will be less than optimal. However, the simplicity of the black-box approach is expected to make this still much more useful for the average user than many of the finely tuned BLAS libraries available that use OpenCL. In addition to tiling, the innermost loop may be unrolled to take advantage of vector processing units, essentially enabling calculations on a few microtiles at a time. The level of unrolling is another parameter that could be fine-tuned to a specific device, but in our approach it must be kept constant for the same reasons outlined above. In this work, the tile sizes and the unrolling parameter are adjusted not only to fit reasonably well with the memory parameters of most devices but also to reduce the number of logical operations performed. Therefore, tiles are of size 64 × 64 and microtiles are of size 4 × 4, while the innermost loop is unrolled in sets of 4 instances. These parameters not only were close to optimal on a wide range of tested devices but also reduce the number of logical operations to essentially zero. The work-group size used is 16 × 16, which is the maximum for most devices; with the innermost loop unrolled to 4 instances at a time, a single thread calculates 4 microtiles of size 4 × 4. This means that for a single set of 4 tiles, every thread of a work-group needs to load just a single value from global memory (see Figure 2).

Figure 2. Pseudocode for the matrix−matrix multiplication kernel for GPUs. Matrix indices are derived directly from the thread indices in the thread space on the device. The microtile calculation consists of an unrolled multiplication of two 4 × 4 matrices.
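As a deliberately simplified illustration of the tiling idea sketched in Figure 2, the following OpenCL kernel stages 16 × 16 tiles of A and B in local memory and lets each work-item accumulate a single element of C. It is a sketch for exposition under the assumptions stated in the comments (square row-major matrices, size divisible by 16), not the production kernel, which uses 64 × 64 tiles and 4 × 4 microtiles per thread.

```c
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#define TS 16

/* Simplified tiled C = A * B; one work-item per element of C. */
__kernel void dgemm_tiled(const int n,
                          __global const double *A,
                          __global const double *B,
                          __global double *C)
{
    const int row = get_global_id(1);   /* global row of C          */
    const int col = get_global_id(0);   /* global column of C       */
    const int lr  = get_local_id(1);    /* row within the local tile */
    const int lc  = get_local_id(0);    /* column within the tile    */

    __local double Atile[TS][TS];
    __local double Btile[TS][TS];

    double acc = 0.0;
    for (int t = 0; t < n / TS; ++t) {
        /* each work-item loads one element of the A and B tiles */
        Atile[lr][lc] = A[row * n + (t * TS + lc)];
        Btile[lr][lc] = B[(t * TS + lr) * n + col];
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int k = 0; k < TS; ++k)
            acc += Atile[lr][k] * Btile[k][lc];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[row * n + col] = acc;
}
```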

This approach leads to a stable performance on large matrices (N > 3000) of 800−850 GFLOPS on an NVIDIA K40m device and 1050−1100 GFLOPS on an AMD S9150 device, not including the time of the memory transfer to and from the device. For comparison, the proprietary cuBLAS library achieved about 1050 GFLOPS in our benchmarks in the case of the NVIDIA device, for the same matrix sizes. In the case of the AMD device it is much harder to compare, as there is no available solution that comes even close to being black-box-like. The AMD-supported clBLAS library, together with the standard driver, achieved 1000−1050 GFLOPS in our benchmarks in the case of the AMD device, for the same matrix sizes. This is essentially the same performance as achieved for the exchange contribution calculation documented below.

RESULTS

Methodology of Benchmarking. Comparison of the performance between CPU and GPU code is always a notoriously difficult subject. It is rather obvious that, for the purpose of benchmarking, a CPU setup should be running algorithms and code optimized for the CPU, and a GPU setup should be using algorithms and code optimized for the GPU. Hence, a 1:1 comparison of two such optimized algorithms can be fairly misleading, and it is not always obvious what exactly should be compared to what. Probably the most popular method of benchmarking GPU implementations is to compare their efficiency to a serial CPU implementation. This approach is quite straightforward, as the GPU implementation usually just uses a single CPU thread per device or even per multiple devices. While very intuitive and standardized, results from such benchmarks are not really informative for the user. An average user would never use just a single thread for a large calculation but would rather resort to a parallel CPU implementation. Comparing a GPU implementation to just a single thread is also not really fair to the CPU: with modern CPUs easily having 6 physical cores, this comparison often pits one very large and powerful GPU device against just 1/6th of the CPU device. On the other hand, most implementations that utilize all 6 CPU cores will do so via MPI. Hence, the threads do not share memory, and the whole calculation often needs 6 times more memory than a single thread used in the serial implementation, or even than the GPU implementation for that matter. Given this ambiguity, both comparisons are shown in this work. The GPU-accelerated implementation is judged by comparing to both a single-thread implementation and a parallel implementation utilizing the whole CPU. The issue of different memory needs is solved by limiting the total memory usage for the whole calculation, such that the parallel implementation is forced to use the same total amount of memory as the single-threaded versions.

As a measure of efficiency, the total time of a HF single point energy calculation using the RI-JK approximation in the integral-direct HF scheme is used for the comparison of the CPU and GPU codes. This means that integrals are recalculated in each SCF cycle. In contrast to the molecular integrals, the V^{-1/2} matrix is calculated once at the beginning of the calculation and then stored in the main memory of the host throughout the calculation, being transferred to the device when needed. Full double precision is used for both the CPU and GPU codes throughout the whole calculation. Molecular symmetry is not used in any way during the calculation. No additional optimization flags were used for the GPU code compilation on the device, as they affected the accuracy. The CPU code uses the HGP11 algorithm for integral evaluation.

Hardware Setup. For benchmarking, two different GPU cards are used: an NVIDIA K40m and an AMD S9150. Both cards are used with the same CPU type: an Intel Xeon E5-2620 with 6 physical cores and a 2.4 GHz clock. The system is equipped with 64 GB of RAM. For the CPU code, the 3.0.2 release of the ORCA15 program suite is used. For mathematical libraries, OpenBLAS is used in both cases, with the exception of the matrix−matrix multiplications in the GPU-accelerated code, since separate dgemm kernels were developed for this purpose. In principle, there are external BLAS libraries available to perform the matrix−matrix multiplications that offer a similar level of performance. However, the use of internal DGEMM kernels offers the advantage of a black-box-type product for the end user.

Test Systems. For benchmarking, the following chemical systems are used: peptides containing from 2 to 16 alanine molecules, humulone (C21H30O5), benzylopenicillin (C16H18N2O4S), vancomycin (C66H75Cl2N9O24), the C180 fullerene, a vancomycin dimer (C132H150Cl4N18O48), taxol (C45H49O15), and olestra (C156H278O19). Every molecule is calculated using both the def2-SVP and def2-TZVP basis sets (see Table 1) and in every possible configuration: single-threaded GPU-accelerated code, single-threaded CPU code, and parallel CPU code with 6 threads. The largest test cases, C180, the vancomycin dimer, and olestra, are not calculated with the def2-TZVP basis set due to the excessive time needed for the CPU code to finish these calculations.

Table 1. Number of Contracted Basis Functions (N) and Auxiliary Contracted Basis Functions (Naux) for the Systems Used for Benchmarking

                          def2-SVP          def2-TZVP
  system                  N      Naux       N      Naux
  2× alanine              214    696        413    1016
  3× alanine              309    1006       598    1471
  4× alanine              404    1316       783    1926
  benzylopenicillin       416    1370       827    2055
  5× alanine              499    1626       968    2381
  humulone                514    1668       986    2426
  6× alanine              594    1936       1153   2836
  7× alanine              689    2246       1338   3291
  8× alanine              784    2556       1523   3746
  9× alanine              879    2866       1708   4201
  10× alanine             974    3176       1893   4656
  11× alanine             1069   3486       2078   5111
  taxol                   1099   3614       2185   5371
  12× alanine             1164   3796       2263   5566
  13× alanine             1259   4106       2448   6021
  14× alanine             1354   4416       2633   6476
  15× alanine             1449   4726       2818   6931
  16× alanine             1544   5036       3003   7386
  vancomycin              1797   5926       3593   8875
  C180                    2520   8640       —      —
  vancomycin dimer        3594   11852      —      —
  olestra                 3840   12292      —      —

Benchmark Results. Upon comparing the GPU-accelerated RI-HF code with a single CPU thread using ORCA, it is clear that the GPU code is never slower than its CPU counterpart. As the number of basis functions of the system grows, the speedup increases rapidly. The rather small relative speedup for small systems is caused by the issue mentioned previously: a single integral kernel calculates a very small set of integrals at a time, and in the case of small systems this set is far too small to utilize a large GPU device efficiently. However, for larger systems these sets are large enough, and the speedup in relation to a single CPU thread quickly approaches a factor of 10 and more.

Figure 3. Speedup of the GPU-accelerated code in relation to the CPU code (ORCA), as a function of the number of contracted basis functions: (a) the AMD S9150 graphics card and (b) the NVIDIA K40m graphics card.

For a larger basis set, the speedup grows somewhat more slowly but also easily approaches a factor of 10. The difference between the speedups for different basis sets originates from the fact that the speedup for higher angular momentum integrals is much smaller. It is shown in Figure 3 that the speedup for the def2-SVP basis set grows for very large systems well beyond 10×, reaching values as high as 29× in relation to a single CPU thread. One may notice the drop in performance for the last system; it is caused by the fact that, even though it has more basis functions, the olestra molecule consists of long hydrocarbon chains, which result in a much lower integral count after prescreening than for the bulky vancomycin dimer.

The overall speedup, however, does not primarily arise from the integral code. As shown in Figures 4a and 4c, the acceleration of the integral code is much lower than the total acceleration. Rather, the acceleration above the integral generation speedup may be traced back to the accelerated code for the exchange contribution calculation (see Figures 4b and 4d). As mentioned earlier, the scaling of the computational cost with system size for RI-JK remains O(N^4); however, the bottleneck for large systems is not the integral calculation but simply the equations for the exchange part of the Fock matrix (eqs 15, 16, and 17) that involve matrix multiplications. Assuming access to an infinite amount of memory, the integrals are calculated only once, which makes integral generation not only an O(N^3) step but also a step that is affected by integral prescreening, further reducing the scaling. Unfortunately, the equations treated as matrix multiplications scale as O(N^3), must be computed for every occupied orbital, and are not affected by any prescreening, leading to a hard O(N^4) scaling.
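A rough FLOP estimate makes the O(N^4) character of eqs 15−17 explicit. Assuming, as is usual, that both the auxiliary basis size $N_{\mathrm{aux}}$ and the number of occupied orbitals $n_{\mathrm{occ}}$ grow linearly with the basis size $N$, and counting two FLOPs per multiply−add,

$\underbrace{n_{\mathrm{occ}}\,(2N^2 N_{\mathrm{aux}})}_{\text{eq 15}} \;+\; \underbrace{n_{\mathrm{occ}}\,(2N N_{\mathrm{aux}}^2)}_{\text{eq 16}} \;+\; \underbrace{n_{\mathrm{occ}}\,(2N^2 N_{\mathrm{aux}})}_{\text{eq 17}} \;\sim\; \mathcal{O}(N^4)$

and, unlike the integral generation, none of these dense multiplications benefits from prescreening.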

However, scaling is not everything, as the base effort counts as well. The base effort in the case of the integral calculation is extremely large in comparison to the two FLOPs needed per element in the case of a matrix−matrix multiplication. When looking more carefully at the time needed for the different steps of the calculation, a single SCF iteration consists of the following: integral evaluation, atomic-to-molecular integral transformation, exchange contribution calculation, and Coulomb contribution calculation (which also includes another integral evaluation step). It is shown in Figure 5 that integral generation is a bottleneck only for very small systems. In the case of the largest system tested with the def2-SVP basis set, the vancomycin dimer, the integral transformation and exchange contribution calculation steps account for 90% of the calculation time, with integral evaluation responsible for only 7.4% of the time required for a single SCF iteration. The large speedup of the O(N^4) step is mainly a result of the highly efficient matrix multiplication kernels developed for this project. The data used for these matrix multiplications are neither sent to nor received from the device; they are evaluated on the card based on the integral lists sent to the device in the integral evaluation step and the compressed basis set stored on the device.

Figure 4. Speedup of the GPU-accelerated code in relation to the CPU code (ORCA), for the integral generation step of the Fock matrix generation routine (a and c) and the exchange contribution calculation in the Fock matrix generation routine (b and d). The speedups are for the AMD S9150 graphics card (a and b) and the NVIDIA K40m graphics card (c and d).

Only the final Fock matrix is received from the device, in the form of a symmetric matrix. Even though the exchange contribution calculation is the bottleneck by a large margin for the calculation of most systems (see Figure 5), the speedup of the integral calculation is still crucial for obtaining optimal efficiency. Considering how large a bottleneck the exchange contribution calculation is, one could argue that the speedup of the integral calculation is insignificant in the view of the whole calculation. However, the exchange contribution calculation consists mostly of matrix−matrix multiplications, for which the speedups can be very large. Even for the largest system tested, where 90% of the time is spent on the exchange contribution calculation and only 7.4% on integral evaluation, a speedup of about 13× for the exchange is enough to leave the integral calculation as the new bottleneck. In Figure 4 it is shown that the speedups for the exchange part grow very quickly beyond 13×, demonstrating that a speedup of the integral code is very important even though it accounts for