
GPU acceleration and parallelization of GENESIS for large-scale molecular dynamics simulations

Jaewoon Jung1,2, Akira Naruse3, Chigusa Kobayashi2, and Yuji Sugita1,2,4,5*

1RIKEN Theoretical Molecular Science Laboratory, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan; 2RIKEN Advanced Institute for Computational Science, 7-1-26 Minatojima-minamimachi, Chuo-ku, Kobe, Hyogo 650-0047, Japan; 3NVIDIA, 2-11-7 Akasaka, Minato-ku, Tokyo 107-0052, Japan; 4RIKEN iTHES, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan; 5Laboratory for Biomolecular Function Simulation, RIKEN Quantitative Biology Center (QBiC), 6-7-1 Minatojima-minamimachi, Chuo-ku, Kobe, Hyogo 650-0047, Japan

KEYWORDS: Molecular Dynamics, GPU, RESPA, Parallelization, Midpoint cell method, FFT


ABSTRACT

Graphics processing units (GPUs) have become a popular computational platform for molecular dynamics (MD) simulations of biomolecules. Significant speedups have been reported for simulations of small- or medium-size systems using only a few computer nodes with single or multiple GPUs. Due to GPU memory limitations and slow communication between GPUs on different computer nodes, it is not straightforward to accelerate MD simulations of large biological systems that contain a few million or more atoms on massively parallel supercomputers with GPUs. In this study, we develop a new scheme in our MD software, GENESIS, to reduce the total computational time on such computers. In this scheme, computationally intensive real-space non-bonded interactions are computed mainly on GPUs, while less intensive bonded interactions and communication-intensive reciprocal-space interactions are performed on CPUs. Based on the midpoint cell method as a domain decomposition scheme, we introduce the single particle interaction list to reduce GPU memory usage. Since the total computational time is limited by the reciprocal-space computation, we utilize the RESPA multiple time-step integration and reduce CPU idle time by assigning a subset of the non-bonded interactions to CPUs as well as GPUs when the reciprocal-space computation is skipped. We validated our GPU implementation in GENESIS by MD simulations of BPTI and a membrane protein, porin, and by REMD simulations of an alanine tripeptide. Benchmark calculations on the TSUBAME supercomputer showed that an MD simulation of a million-atom system scaled up to 256 computer nodes with GPUs.


I. Introduction

Molecular dynamics (MD) simulation was first applied to a small protein, BPTI, by Martin Karplus and his colleagues in 19771. Since then, it has become a popular computational method to investigate relationships between structure and function in various biological systems2. Nowadays, the target systems of MD simulations are not only small soluble proteins3, 4 but also membrane proteins in explicit lipid bilayers5-7, protein/nucleic acid complexes8-10, viruses11, 12, and so on. One of the recent trends is to include the effect of realistic cellular environments in biological MD simulations, such as macromolecular crowding in the cytoplasm13, 14, heterogeneous biological membranes15, and dynamic chromatin structures in a cellular nucleus16. In spite of continuous developments in computer hardware and MD software, there is still a large time-scale gap between MD simulations and experiments, in particular for large biological systems. Many algorithms in various MD programs have been developed to reduce this gap17-21. A specially designed computer like Anton has greatly extended the simulation time to a millisecond if the target system size is less than several hundred thousand atoms22.

Graphics processing units (GPUs) were originally designed to handle computations only for graphics, but nowadays they perform various computations traditionally handled on CPUs. GPUs have often been utilized for all-atom MD simulations of biomolecules with explicit solvent and membranes23-28. In such simulations, the most time-consuming tasks are the calculations of non-bonded interactions, namely, electrostatic and Lennard-Jones interactions. The standard scheme for long-range electrostatic interactions is the particle mesh Ewald (PME) method29, 30, which consists of real-space non-bonded calculations and reciprocal-space calculations including the 3D fast Fourier transform (FFT). There are two different ways to use GPUs in all-atom MD simulations. In one way, which is employed in AMBER31, ACEMD32, OpenMM33, or MOIL34,


all the data required for MD simulations are sent to the GPU memory and the computations are carried out only on GPUs. In the other way, NAMD or GROMACS utilizes both CPUs and GPUs by assigning computation-intensive tasks (i.e., the real-space non-bonded interactions) to GPUs and communication-intensive tasks (i.e., the reciprocal-space non-bonded interactions) to CPUs35, 36. The former scheme is suitable for small- or medium-size biological systems. Using AMBER 14 on a single GPU card (NVIDIA Tesla K20X, for instance) with cutoff = 8.0 Å, typical benchmark systems, DHFR (23,558 atoms) and Factor IX (90,906 atoms), are simulated at 89.41 ns/day and 25.45 ns/day, respectively31. This scheme may not be suitable for large biological systems due to GPU memory limitations31. The latter scheme does not suffer from these limitations, although its performance is poorer than that of the former when a single computer node with GPUs is used. Recently, NAMD has extended the available target system size up to around two hundred million atoms with the latter scheme37. We have recently developed a highly parallel MD program, GENESIS (GENeralized-Ensemble SImulation System)38. This program shows significant parallel scalability for large biological systems containing more than a million atoms on the K computer and other massively parallel CPU-based supercomputers. There are three key developments in GENESIS, namely, (i) the midpoint cell method39, (ii) the hybrid (MPI+OpenMP) parallelization39, and (iii) the volumetric decomposition of the 3D FFT calculation in PME40. The midpoint cell method is a domain decomposition method that extends the original midpoint method41 developed by D. E. Shaw Research. In GENESIS, the real-space non-bonded interactions are computed by the midpoint cell method, which is parallelized with a hybrid (MPI+OpenMP) parallelization scheme. Furthermore, the volumetric decomposition of the 3D FFT calculation40 in the reciprocal-space non-bonded interactions allows us to share the same decomposition between real and


reciprocal spaces. In this study, we take advantage of the original GENESIS code to extend the accessible simulation time for large biological systems on massively parallel supercomputers with GPUs. Once scalable all-atom MD simulation is possible on hybrid (CPU+GPU) systems, we could use it for the simulation of large biomolecular complexes like ribosomes or viruses, and of biomolecules in more realistic cellular environments13, 14. We basically follow the same idea of the GPU computations implemented in NAMD and GROMACS. We do not use the conventional non-bonded pairlist but instead use a single particle interaction list to reduce memory usage on GPUs. Since the communication-intensive reciprocal-space calculation is the bottleneck in MD simulations on computers with GPUs, we utilize the RESPA multiple time-step integration42 and optimize the computational schedule for hybrid (CPU+GPU) systems. This paper is organized as follows. In Methods, we first review the energy/force expression in PME, the RESPA multiple time-step integration, and the midpoint cell method. Then, the single particle interaction list and the RESPA multiple time-step integration optimized for hybrid (CPU+GPU) systems are described in detail. In Results and Discussion, the current computational schemes are validated by simulations of several biological systems. The performance of all-atom MD simulations of a million-atom system on the TSUBAME supercomputer43 is also shown and compared with that on the K computer44.

II. Methods

1. The energy/force expressions in PME

The potential energy function that is used for all-atom MD simulations of biomolecules consists of bonded as well as non-bonded interaction terms17, 45-47. The bonded interactions usually include the bond, angle, and dihedral terms, whereas the non-bonded terms consist of


van der Waals and electrostatic interactions. For instance, the potential energy function in the CHARMM force field is expressed as

E(\mathbf{r}) = \sum_{\mathrm{bonds}} K_b (b-b_0)^2 + \sum_{\mathrm{angles}} K_\theta (\theta-\theta_0)^2 + \sum_{\mathrm{Urey\text{-}Bradley}} K_{\mathrm{UB}} (S-S_0)^2 + \sum_{\mathrm{dihedrals}} K_\phi \left[1+\cos(n\phi-\delta)\right] + \sum_{\mathrm{impropers}} K_\omega (\omega-\omega_0)^2 + \sum_{\mathrm{residues}} U_{\mathrm{CMAP}}(\varphi,\psi) + \sum_{\mathrm{non\text{-}bonds}} \varepsilon_{ij} \left[\left(\frac{R_{ij}}{r_{ij}}\right)^{12} - 2\left(\frac{R_{ij}}{r_{ij}}\right)^{6}\right] + \sum_{\mathrm{non\text{-}bonds}} \frac{q_i q_j}{r_{ij}}.    (1)

In Eq. (1), the bonded interactions are expressed in the first two lines and the non-bonded interactions are written in the third line. Under the periodic boundary condition with box size (B1, B2, B3), the non-bonded interactions without any approximation are expressed as

E_{\mathrm{nonb}}(\mathbf{r}) = \frac{1}{2} {\sum_{\mathbf{n}}}' \sum_{i,j=1}^{N} \left\{ \varepsilon_{ij}\left[ \left(\frac{R_{ij}}{|\mathbf{r}_i-\mathbf{r}_j+\mathbf{n}|}\right)^{12} - 2\left(\frac{R_{ij}}{|\mathbf{r}_i-\mathbf{r}_j+\mathbf{n}|}\right)^{6} \right] + \frac{q_i q_j}{|\mathbf{r}_i-\mathbf{r}_j+\mathbf{n}|} \right\},    (2)

where N is the number of particles in the system and \mathbf{n} = (n_1 B_1, n_2 B_2, n_3 B_3) enumerates the periodic images. In Eq. (2), the prime indicates that the terms with i = j and \mathbf{n} = 0 (self-interactions in the primary cell) are excluded. In PME30, the non-bonded interactions are given by

E_{\mathrm{nonb}}(\mathbf{r}) = \frac{1}{2} {\sum_{\mathbf{n}}}' \sum_{\substack{i,j=1 \\ |\mathbf{r}_i-\mathbf{r}_j+\mathbf{n}|<r_c}}^{N} \left\{ \varepsilon_{ij}\left[ \left(\frac{R_{ij}}{|\mathbf{r}_i-\mathbf{r}_j+\mathbf{n}|}\right)^{12} - 2\left(\frac{R_{ij}}{|\mathbf{r}_i-\mathbf{r}_j+\mathbf{n}|}\right)^{6} \right] + \frac{q_i q_j\,\mathrm{erfc}(\alpha|\mathbf{r}_i-\mathbf{r}_j+\mathbf{n}|)}{|\mathbf{r}_i-\mathbf{r}_j+\mathbf{n}|} \right\} + \frac{1}{2\pi V} \sum_{\mathbf{k}\neq 0} \frac{\exp(-\pi^2 k^2/\alpha^2)}{k^2}\, |S(\mathbf{k})|^2 - \frac{\alpha}{\sqrt{\pi}} \sum_{i=1}^{N} q_i^2,    (3)

where the real-space non-bonded interactions, the reciprocal-space electrostatic energy, and the self-energy term are represented by the first, second, and third terms, respectively. Here, r_c is the


cutoff value within which the real-space non-bonded interactions are evaluated. Computational costs of the first (real-space) and second (reciprocal-space) terms in Eq. (3) are proportional to N and N log N, respectively.
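To make the real-space term of Eq. (3) concrete, the sketch below evaluates the contribution of a single particle pair (Lennard-Jones plus erfc-damped Coulomb) in the way it could appear inside a GPU kernel. This is a minimal illustration rather than the GENESIS implementation: the function and parameter names are ours, and GENESIS actually replaces the direct erfc evaluation with an inverse lookup table53 and accumulates the results in double precision.

    // Minimal sketch (not GENESIS code) of the real-space pair term in Eq. (3).
    // eps_ij and R_ij are the combined Lennard-Jones parameters, qiqj is the
    // charge product, alpha is the Ewald screening parameter, and rc2 = rc*rc.
    __device__ float pair_energy_real(float dx, float dy, float dz,
                                      float eps_ij, float R_ij,
                                      float qiqj, float alpha, float rc2)
    {
        float r2 = dx*dx + dy*dy + dz*dz;
        if (r2 >= rc2) return 0.0f;          // outside the real-space cutoff rc
        float rinv = rsqrtf(r2);             // 1/r in single precision
        float r    = r2 * rinv;              // r = r2 * (1/r)
        float s2   = R_ij * R_ij * rinv * rinv;
        float s6   = s2 * s2 * s2;           // (R_ij / r)^6
        float lj   = eps_ij * (s6 * s6 - 2.0f * s6);
        float coul = qiqj * erfcf(alpha * r) * rinv;
        return lj + coul;                    // accumulated in double precision elsewhere
    }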

2. The RESPA multiple time-step integration

In the reversible reference system propagator algorithm (RESPA) proposed by Tuckerman et al.42, the Liouville operator is decomposed into fast and slow motions:

iL = iL_{\mathrm{fast}} + iL_{\mathrm{slow}}, \quad iL_{\mathrm{fast}} = \frac{\mathbf{p}}{m}\cdot\frac{\partial}{\partial \mathbf{r}} + \mathbf{F}_{\mathrm{fast}}\cdot\frac{\partial}{\partial \mathbf{p}}, \quad iL_{\mathrm{slow}} = \mathbf{F}_{\mathrm{slow}}\cdot\frac{\partial}{\partial \mathbf{p}}.    (4)

Using the Trotter theorem48, the Liouville propagator becomes

\exp(iL\Delta t) = \exp(iL_{\mathrm{slow}}\,\Delta t/2)\, \exp(iL_{\mathrm{fast}}\,\Delta t)\, \exp(iL_{\mathrm{slow}}\,\Delta t/2).    (5)

The propagator related to the fast motion is again decomposed into

\exp(iL_{\mathrm{fast}}\,\Delta t) = \left[ \exp\!\left(\frac{\delta t}{2}\,\mathbf{F}_{\mathrm{fast}}\cdot\frac{\partial}{\partial \mathbf{p}}\right) \exp\!\left(\delta t\,\frac{\mathbf{p}}{m}\cdot\frac{\partial}{\partial \mathbf{r}}\right) \exp\!\left(\frac{\delta t}{2}\,\mathbf{F}_{\mathrm{fast}}\cdot\frac{\partial}{\partial \mathbf{p}}\right) \right]^n, \quad \delta t = \frac{\Delta t}{n}.    (6)

Therefore, the final Liouville propagator for RESPA is

\exp\!\left(\frac{\Delta t}{2}\,\mathbf{F}_{\mathrm{slow}}\cdot\frac{\partial}{\partial \mathbf{p}}\right) \left[ \exp\!\left(\frac{\delta t}{2}\,\mathbf{F}_{\mathrm{fast}}\cdot\frac{\partial}{\partial \mathbf{p}}\right) \exp\!\left(\delta t\,\frac{\mathbf{p}}{m}\cdot\frac{\partial}{\partial \mathbf{r}}\right) \exp\!\left(\frac{\delta t}{2}\,\mathbf{F}_{\mathrm{fast}}\cdot\frac{\partial}{\partial \mathbf{p}}\right) \right]^n \exp\!\left(\frac{\Delta t}{2}\,\mathbf{F}_{\mathrm{slow}}\cdot\frac{\partial}{\partial \mathbf{p}}\right).    (7)

\mathbf{F}_{\mathrm{slow}}, the slowly varying components of the forces, are obtained from the long-range contributions of the non-bonded interactions. In PME, the long-range electrostatic interactions are calculated in reciprocal space using the 3D FFT, so the slowly varying components are obtained from the reciprocal-space electrostatic interactions. According to Barth et al., there is a limitation of the slow-motion


time step at around 5 fs, which is closely related to resonance and can be overcome in combination with stochastic dynamics for realizing the NVT ensemble49. We therefore borrow this idea for our multiple time-step (MTS) integration by combining RESPA with a Langevin thermostat. Whenever the velocities are updated with the slowly varying force components, we apply the Langevin thermostat to maintain the target temperature50. In this case, the Liouville propagator is modified to

\exp\!\left(iL_{\mathrm{LT}}\,\frac{\Delta t}{2}\right) \exp\!\left(iL_{\mathrm{slow}}\,\frac{\Delta t}{2}\right) \left[\exp(iL_{\mathrm{fast}}\,\delta t)\right]^n \exp\!\left(iL_{\mathrm{slow}}\,\frac{\Delta t}{2}\right) \exp\!\left(iL_{\mathrm{LT}}\,\frac{\Delta t}{2}\right),    (8)

where the Langevin thermostat operator is defined as

iL_{\mathrm{LT}} = -\gamma\,\mathbf{v}\cdot\frac{\partial}{\partial \mathbf{v}} - \frac{\gamma}{\beta m}\,\frac{\partial^2}{\partial \mathbf{v}^2}.    (9)
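For clarity, the propagation in Eq. (8) can be summarized by a host-side driver like the one sketched below. All routine names (thermostat_half_step, slow_kick, fast_kick, drift, and the force routines) are hypothetical placeholders introduced only to show where the slow (reciprocal-space) force, the fast (real-space and bonded) force, and the Langevin thermostat enter the loop; the actual GENESIS integrator is organized differently.

    // Sketch of one Langevin RESPA step following Eq. (8); names are hypothetical.
    struct System;                                // coordinates, velocities, forces
    void thermostat_half_step(System&, double);   // exp(iL_LT dt/2): friction + noise
    void slow_kick(System&, double);              // velocity kick with F_slow
    void fast_kick(System&, double);              // velocity kick with F_fast
    void drift(System&, double);                  // coordinate update with p/m
    void compute_fast_forces(System&);            // real-space non-bonded + bonded
    void compute_slow_forces(System&);            // reciprocal-space (PME) on CPU

    void langevin_respa_step(System &s, double dt, int n_fast)
    {
        double dt_fast = dt / n_fast;
        thermostat_half_step(s, 0.5 * dt);
        slow_kick(s, 0.5 * dt);
        for (int k = 0; k < n_fast; ++k) {        // [exp(iL_fast * dt_fast)]^n
            fast_kick(s, 0.5 * dt_fast);
            drift(s, dt_fast);
            compute_fast_forces(s);               // real-space part runs on the GPU
            fast_kick(s, 0.5 * dt_fast);
        }
        compute_slow_forces(s);                   // evaluated only once per outer step
        slow_kick(s, 0.5 * dt);
        thermostat_half_step(s, 0.5 * dt);
    }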

3. The midpoint cell method in combination with the volumetric decomposition of 3D FFT

In parallelized MD programs, the simulation space is geometrically divided into smaller subdomains, and the particle information in each subdomain is assigned to an MPI processor. The midpoint cell method39 is a variant of such spatial decomposition schemes, in which a subdomain is further divided into small cells (Figure 1). The size of the cells in each dimension should be greater than half of the pairlist cutoff when a pairlist is used for the non-bonded calculations. We define the interaction space of two given particles as the midpoint cell of the two cells in which the particles reside. The midpoint cell is usually defined uniquely (e.g., the midpoint cell between cells a and b in Figure 1). If it is not defined uniquely, we select one of the candidate midpoint cells to obtain better load balance. In this scheme, each subdomain contains the particle information of the subdomain itself and of its adjacent cells (colored orange in Figure 1(b)), which is obtained by MPI


communication with only its adjacent cells. Cell pairs are distributed over the OpenMP threads to allow more efficient shared-memory parallelization. In GENESIS, we combine the midpoint cell method39 with the volumetric decomposition of the 3D FFT40 to optimize parallel efficiency. In the volumetric decomposition of the 3D FFT, data are partitioned along three dimensions over the MPI processors. It requires more communications than slab or pencil decompositions of the 3D FFT, but the number of processors involved in each communication is small. In this scheme, at least three all-to-all communications (two all-to-all communications in one dimension and one all-to-all communication in two dimensions) are required. In PME, charge information inside the subdomain and its adjacent cells is necessary to obtain the charge grid data of the same subdomain in reciprocal space. In GENESIS, we do not need any communication to collect charge grid data before the 3D FFT, because we assign the same spatial decomposition in real and reciprocal spaces. On the other hand, such communication is required in many other methods where the domain decomposition in reciprocal space differs from that in real space. For example, in Figure 1(c), MPI rank 10 needs communication with MPI ranks 3, 7, 11, and 15 for the 3D FFT calculation.
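As an illustration of how an interacting pair is mapped onto a midpoint cell, the sketch below computes a tentative midpoint cell coordinate from the two home-cell indices of a particle pair on a periodic cell grid. The function names and data layout are our own simplifications; in particular, the tie-breaking for pairs whose midpoint is not uniquely defined is left to the load-balancing rule described above.

    // Sketch: midpoint cell of a cell pair on a periodic grid with ncell cells
    // per dimension (illustrative only; not the GENESIS data layout).
    struct Cell3 { int x, y, z; };

    __host__ __device__ int midpoint_coord(int a, int b, int ncell)
    {
        int d = b - a;
        if (d >  ncell / 2) d -= ncell;          // minimum-image convention for cells
        if (d < -ncell / 2) d += ncell;
        int m = a + d / 2;                       // halfway between the two cells
        return (m % ncell + ncell) % ncell;      // wrap back into [0, ncell)
    }

    __host__ __device__ Cell3 midpoint_cell(Cell3 ca, Cell3 cb, Cell3 ncell)
    {
        Cell3 m;
        m.x = midpoint_coord(ca.x, cb.x, ncell.x);
        m.y = midpoint_coord(ca.y, cb.y, ncell.y);
        m.z = midpoint_coord(ca.z, cb.z, ncell.z);
        return m;  // non-unique (odd-separation) cases are resolved by load balancing
    }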

4. Hybrid (CPU+GPU) implementation of MD simulations

1) Non-bonded interaction scheme on GPU – the single particle interaction list

In the midpoint cell scheme, we generate a pairlist for each cell pair, the pairwise Verlet list51. In spite of efficient OpenMP parallelization, this scheme itself is not suitable for GPUs because of their small cache memory. For example, the most recent Intel Haswell CPU with CPUID code 0306C3h has 64KB L1 cache (per core), 256KB L2 cache (per core), 2-20MB L3 cache (shared among cores), and 128MB eDRAM L4 cache (Iris Pro models only)52. The


NVIDIA Tesla K40 GPU has 64KB of L1 cache plus shared memory and 1.5MB of L2 cache. The L1 cache of the NVIDIA K40 is just one fifth of the L1+L2 cache of Intel Haswell, and the L2 cache of the K40 is around one tenth of the L3 cache of the Haswell CPU. To use such small GPU memory efficiently, we devised a new scheme, which we call the single particle interaction list (Figure 2). The interaction between the i-th and j-th cells is expressed by two particle interaction lists, one for the i-th and one for the j-th cell. In the list for the i-th cell, the atoms in the i-th cell that do not interact with any atom in the j-th cell are excluded. The interaction particle list for the j-th cell is treated in the same way. These lists are generated on the GPU, while the exclusion list for the bonded interactions is made on the CPUs and transferred to the GPU. In the original midpoint cell scheme, the size of the non-bonded pairlist is O(n²) for each cell pair, where n is the number of particles in each cell. With the single particle interaction list scheme, the list size is O(2n). In Figure 2(a), the pairlist size is 22 (the number of squares colored yellow in the upper left figure). Using the single particle interaction scheme, the list size is 12 (6 in the i-th cell and 6 in the j-th cell, colored yellow in the upper right figure). After generating the single particle interaction lists, we assign the non-bonded interactions. In Figure 2(a), the number of interactions becomes 36 (lower right figure). The exclusion masks (colored green in the lower left figure of Figure 2(a)) are stored as one-byte integers, and we skip a non-bonded interaction if its mask value is set. Migration of atoms between cells and between subdomains is taken into account whenever the interaction list is updated. In K20X or K40 GPUs, all the threads in a warp execute concurrently, and the warp size is 32. Thus, we make blocks of 32 interaction pairs in each cell pair. This is done by taking a block of 8 particles from the i-th cell and a block of 4 particles from the j-th cell for the interaction between the i-th and j-th cells (Figure 2(b)). GPUs are optimized for computations using


single-precision floating-point arithmetic. However, the accuracy of the MD simulations may be compromised by this. Therefore, we employ hybrid-precision floating point, i.e., we combine single-precision energy and force calculations with double-precision accumulation and integration.
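A minimal sketch of how one warp can be mapped onto an 8 × 4 tile of the single particle interaction lists is given below. The array layout is our own assumption (list_i and list_j hold the surviving atom indices of the two cells, charges are stored in the w component of the coordinates, and excl is the one-byte exclusion mask); the cutoff test, the Lennard-Jones term, and the double-precision force accumulation of the real implementation are omitted for brevity.

    // Sketch only: one warp processes 8 x 4 blocks of the single particle
    // interaction lists of cells i and j (array names are hypothetical).
    __global__ void pair_block_kernel(const float4 *crd, const int *list_i,
                                      const int *list_j, const unsigned char *excl,
                                      int ni, int nj, float alpha, float *energy)
    {
        int lane = threadIdx.x & 31;              // lane 0..31 within the warp
        int a = lane & 7;                         // 8 particles taken from cell i
        int b = lane >> 3;                        // 4 particles taken from cell j
        float e = 0.0f;
        for (int ia = a; ia < ni; ia += 8) {
            float4 ri = crd[list_i[ia]];
            for (int jb = b; jb < nj; jb += 4) {
                if (excl[ia * nj + jb]) continue; // bonded/excluded pair is masked
                float4 rj = crd[list_j[jb]];
                float dx = ri.x - rj.x, dy = ri.y - rj.y, dz = ri.z - rj.z;
                float rinv = rsqrtf(dx*dx + dy*dy + dz*dz);
                e += ri.w * rj.w * erfcf(alpha / rinv) * rinv;  // Coulomb part only
            }
        }
        atomicAdd(energy, e);   // the real code accumulates in double precision
    }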

2) Scheduling for the hybrid (CPU+GPU) calculation

When we use both CPUs and GPUs in an all-atom MD simulation, it is important to assign each task to the CPU or the GPU properly. Because GPUs are not suitable for communication-intensive calculations, our strategy is to assign the computation-intensive real-space non-bonded interactions to GPUs and the communication-intensive reciprocal-space interactions to CPUs (Figure 3). Bonded interactions could also be calculated on GPUs, but the data transfer between CPU and GPU may be more time consuming than the bonded calculations themselves. We allow multiple MPI processors to access one GPU card if multiple MPI processors run in one computer node. If there are multiple GPU cards in a node, we set the number of MPI processors to be the same as, or a multiple of, the number of GPU cards.
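The division of labor in Figure 3 can be expressed with asynchronous CUDA launches: the real-space kernel and the device-to-host copy are queued on a stream, the CPU then evaluates the bonded and reciprocal-space (PME) terms while the GPU is busy, and the contributions are combined after synchronization. The sketch below uses hypothetical routine and type names and omits the MPI layer; it only illustrates the overlap, not the actual GENESIS scheduler.

    // Sketch of one force evaluation with CPU/GPU overlap; only the CUDA runtime
    // calls are real, all other names are hypothetical placeholders.
    #include <cstddef>
    #include <cuda_runtime.h>

    struct DeviceData { float *force; };
    struct System     { DeviceData dev; float *host_force_gpu; size_t nbytes; };

    void launch_real_space_kernel(DeviceData&, cudaStream_t); // GPU: real-space non-bonded
    void compute_bonded_forces(System&);                      // CPU: bonds, angles, dihedrals
    void compute_reciprocal_pme(System&);                     // CPU: 3D FFT + communication
    void accumulate_forces(System&);                          // combine CPU and GPU results

    void compute_forces_hybrid(System &s, cudaStream_t stream)
    {
        launch_real_space_kernel(s.dev, stream);              // enqueue and return at once
        cudaMemcpyAsync(s.host_force_gpu, s.dev.force, s.nbytes,
                        cudaMemcpyDeviceToHost, stream);      // async copy of GPU forces
        compute_bonded_forces(s);                             // CPU works while GPU is busy
        compute_reciprocal_pme(s);
        cudaStreamSynchronize(stream);                        // wait for the GPU result
        accumulate_forces(s);                                 // hybrid-precision accumulation
    }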

3) Scheduling for the hybrid (CPU+GPU) calculation with the RESPA multiple time-step integration

If we use the RESPA multiple time-step integration in all-atom MD simulations, the reciprocal-space calculation is skipped regularly. Consequently, in some MD steps, the CPU computes only the bonded interactions and waits for the end of the non-bonded computations on the GPU. To use all computer resources efficiently, we assign the real-space non-bonded interactions to both GPU and CPU when the reciprocal-space calculation is skipped in RESPA. We optimize the


load ratio between CPU and GPU by measuring the computational time of the real-space non-bonded interactions, the bonded interactions, and the accumulation of the energies/forces. We assign the non-bonded interactions between atoms in the same cell (intra-cell interactions) to the CPU. The interactions between atoms in different cells (inter-cell interactions) are divided between GPU and CPU to reduce the total simulation time. For example, an MD simulation of 9,250 water molecules on a desktop PC (two Intel Xeon E5-2650 CPUs (16 cores), 64GB memory, and two K40 GPU cards) shows the minimum computational time if we assign 33 times more inter-cell interactions to the GPU than to the CPU (Figure 4). The overall procedure for MD simulations with the RESPA integration is depicted in Figure 5. It should be noted that the ratio between the GPU and CPU computations for the inter-cell interactions is determined automatically in GENESIS by testing different ratios during the setup process.
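One simple way to realize such a split is sketched below: given the GPU:CPU load ratio (for example the factor of 33 found in Figure 4), a list of inter-cell cell pairs is partitioned so that the leading fraction goes to the GPU and the remainder, together with all intra-cell interactions, stays on the CPU. The function name is illustrative and not a GENESIS routine.

    // Sketch: partition inter-cell pairs by a GPU:CPU load ratio (illustrative only).
    #include <cstddef>

    // Returns how many of the n_pairs inter-cell pairs are sent to the GPU;
    // e.g. ratio = 33.0 assigns 33/34 of them to the GPU and 1/34 to the CPU.
    std::size_t gpu_share(std::size_t n_pairs, double gpu_cpu_ratio)
    {
        return static_cast<std::size_t>(n_pairs * gpu_cpu_ratio / (gpu_cpu_ratio + 1.0));
    }

A GENESIS-like autotuning step would then simply time a few candidate ratios with such a split during setup and keep the fastest one.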

III. Results and Discussion

1. Validation of the current MD scheme on hybrid (CPU+GPU) systems

1) Energy conservation in NVE MD

To validate the implementation, we prepared a soluble protein, bovine pancreatic trypsin inhibitor (BPTI, 27,712 atoms), with the AMBER force field54, and a main porin from Mycobacterium smegmatis (MSPA, 216,726 atoms), a membrane protein in an explicit lipid bilayer, with the CHARMM force field55, 56. For BPTI, we assigned the following parameters: cutoff = 8.0 Å, cutoff for the interaction list = 10.0 Å, and PME grid size = (64,64,64). In the case of MSPA, cutoff = 12.0 Å, cutoff for the interaction list = 13.5 Å, and PME grid size = (144,144,144). In both systems, we used the inverse lookup table for the real-space non-bonded energy/forces, PME interpolation


order = 4, and SHAKE/SETTLE constraints (in the case of the 2 fs time step). All the calculations were performed using both CPU (Intel Xeon E5-2680 v3) and GPU (NVIDIA Tesla K20X), except for the CPU-only energy drift calculations. We investigated the energy drift for BPTI and MSPA by performing NVE MD with two time steps, 0.5 fs and 2 fs. We did not assign any constraint in the case of 0.5 fs, whereas the SHAKE/RATTLE constraints59, 60 were employed when the 2 fs time step was used. Water molecules were constrained by SETTLE61. In TABLE I, we list the energy drifts per degree of freedom (kT/ns/degree of freedom) for BPTI (10 ns) and MSPA (2 ns). We also plot the total energy during the MD simulations in Figure 6. The absolute values of the energy drifts are less than 1.0×10⁻⁴ in all cases. These values are comparable to the energy drifts of other MD programs, for example, the AMBER package with double and hybrid precision31. Thus, the single particle interaction list scheme on GPU with hybrid precision (single-precision energy and force calculation with double-precision accumulation and integration) conserves energy sufficiently well under the NVE condition. We also tested the reliability of the RESPA integration. The reciprocal-space interactions were assigned as slow-motion forces in the RESPA integrator under the NVE condition42 or in RESPA with a Langevin thermostat for the NVT condition. The former is referred to as RESPA, while the latter is named Langevin RESPA in this paper. The energy drift produced by RESPA is almost as negligible as that of velocity Verlet when the reciprocal-space interactions are assigned every other step (4 fs slow-force time step). If the slow-force time step is greater than 4 fs, energy conservation is not well maintained. The large energy drift in RESPA is mainly due to the increase of kinetic energy under the NVE condition. A thermostat under the NVT condition, however, can control the kinetic energy and may compensate for these kinetic energy errors62.


Therefore, there is a possibility of using Langevin RESPA with a slow-force time step greater than 4 fs. In the following sections, we examine the quality of canonical ensembles, the structural and energetic properties of proteins in solution, and the free-energy landscapes obtained by REMD58, for simulations based on Langevin RESPA with the slow-motion forces computed every fourth step (8 fs slow-force time step).

2) Ensemble tests for Langevin RESPA

Proper ensemble generation under the NVT condition can be checked through the quality of the canonical ensemble. Recently, Shirts et al. presented a simple statistical analysis procedure to judge whether the desired thermodynamic ensemble is properly sampled or not57. In this procedure, the relative probability distribution at two different temperatures T1 and T2 is expressed as

\ln\!\left[\frac{P(E \mid \beta_1)}{P(E \mid \beta_2)}\right] = \beta_2 A_2 - \beta_1 A_1 - (\beta_2-\beta_1) E,    (10)

where P(a \mid b) is the probability of a state determined by a given parameter b, and A is the Helmholtz free energy at a given temperature. By fitting the left-hand side of Eq. (10), we can compare the slope of the fitted line to (\beta_2-\beta_1). In our case, the fitting is performed using maximum likelihood estimation because it is not affected by the choice of the histogram binning. The two target temperatures of the two 10 ns MD simulations of 9,250 TIP3P water molecules are 298 K and 300 K. We used cutoff = 12.0 Å, interaction cutoff distance = 13.0 Å, and hybrid floating point for real numbers. The ensemble tests for the total, kinetic, and potential energies were carried out using Eq. (10). TABLE II shows that the estimated slope (the slope of the fitted line in Eq. (10)) is within 2σ of the true slope (\beta_2-\beta_1). Langevin RESPA is reliable up to 8 fs of


slow-force time step as far as proper generation of the canonical ensemble is concerned. Even with a slow-force time step of 10 fs, a reliable ensemble is obtained, as seen in TABLE II, although the structural and dynamical properties of biomolecules also need to be checked carefully (see the next section).

3) Structural fluctuations and local energy potentials of BPTI in NVT MD simulations

We performed three MD simulations of BPTI in solution (36,931 atoms): with the velocity Verlet integrator with a Langevin thermostat (Langevin VVER) with ∆t = 2 fs, with Langevin RESPA with ∆t = 2 fs and ∆tslow = 8 fs, and with another Langevin RESPA with ∆t = 2.5 fs and ∆tslow = 12.5 fs. In all cases, cutoff = 12.0 Å, interaction cutoff distance = 13.5 Å, hybrid floating point for real numbers, and the CHARMM force field55, 56 were used. The root mean square fluctuations (RMSFs) of the backbone Cα atoms of BPTI in Figure 7 show no distinguishable differences between the three simulations. In Figure 8, the potential energy distributions of the angle and dihedral angle terms obtained by Langevin RESPA with ∆t = 2 fs and ∆tslow = 8 fs are almost identical to those obtained by Langevin VVER. In contrast, the distributions obtained by Langevin RESPA with ∆t = 2.5 fs and ∆tslow = 12.5 fs show distinct differences from the other two simulations. These results suggest the validity of Langevin RESPA with ∆t = 2 fs and ∆tslow = 8 fs in terms of the dynamical properties of protein systems.

4) Free energy landscapes of an alanine tripeptide in solution

We carried out replica-exchange MD (REMD)58 of an alanine tripeptide, (Ala)3, solvated in 3,939 water molecules using Langevin VVER with ∆t = 2 fs as well as Langevin RESPA with ∆t = 2 fs and ∆tslow = 8 fs. In both simulations, 18 replicas were used to cover the temperature range from 300 K to 345.59 K. Replica exchanges were attempted every 2000 steps (every 4 ps).


Here, we used the CHARMM force field55, 56 for the peptide and the TIP3P63 model for water. Figure 9 shows the free energy profiles of the ϕ and ψ angles of the 2nd residue of the peptide at 300 K obtained from the two REMD simulations. The minimum points in the profiles correspond to favored secondary structures: α-helix, 3-10 helix, left-handed helix, β-sheet, and polyproline II. The result shows that there are no distinct differences in the free energy profiles, suggesting that Langevin VVER and Langevin RESPA reproduce almost identical conformational stabilities of the peptide in solution. Judging from the results of the ensemble tests for water, the comparisons of the structural/dynamical properties of a protein, and the free-energy landscapes of a peptide in solution, NVT MD simulations based on Langevin RESPA with ∆t = 2 fs and ∆tslow = 8 fs are reliable, and the benchmark performance using these parameters on hybrid (CPU+GPU) systems should be meaningful.

2. Speed-up of non-bonded interactions on GPU

After the validation tests of the GPU implementation as well as the RESPA scheme, we examined the computational time of the real-space non-bonded interactions on GPU compared with that on CPU. A system consisting of 9,250 TIP3P water molecules was tested on a single-node computer equipped with two Intel Xeon E5-2650 CPUs (16 cores) and two NVIDIA Tesla K40 GPUs. Here, we compared performances by including the computational times of the energy/force evaluations of the real-space interactions as well as the data transfer between CPUs and GPUs. The comparison is depicted in Figure 10. According to our test, the use of two GPUs accelerates the total real-space interaction calculation 5.39-fold compared with CPU-only calculations on 16 cores. If we use both the two GPUs and the two CPUs (16 cores) for the real-space interactions, a 7.7-fold speedup is


possible compared with the CPU-only calculations. Therefore, assigning a subset of the real-space non-bonded interactions to both CPU and GPU can reduce the computational time of the real-space interactions by 30%. If we consider the calculations in the multiple time-step integration, where the reciprocal-space interactions are calculated every other step, it increases the total performance by 15% compared with assigning all the real-space non-bonded interactions to the GPU only.

3. Benchmark results of the GPU implementation on the TSUBAME supercomputer

1) The real-space versus reciprocal-space interactions on hybrid (CPU+GPU) systems

The benchmark performance tests were carried out on the TSUBAME supercomputer, which consists of 1408 computer nodes. Each node has an Intel Xeon X5670 CPU, 54GB of memory, and three NVIDIA Tesla K20X GPUs, connected by a QDR InfiniBand network. GENESIS was compiled using the PGI compilers (version 15.1 with CUDA 6.5) with Open MPI (version 1.8.2). We used STMV (a virus; 299,855 water molecules, 773 ions, and a total of 1,066,628 atoms) for the following benchmark tests and assigned the following parameters: cutoff = 12.0 Å, pairlist cutoff = 13.5 Å, the inverse lookup table for the real-space non-bonded energy/forces53, PME interpolation order = 4, and the SHAKE/SETTLE constraints. The PME grid size is (192,192,192), and the corresponding grid spacing is 1.125 Å. The reciprocal-space interactions were assigned as slow-motion forces and calculated every other step in RESPA under the NVE condition42 and every fourth step in Langevin RESPA under the NVT condition. The single particle interaction lists were updated every 10 steps for the velocity Verlet and RESPA integrations. In the case of Langevin RESPA, the interaction lists were updated every 8 steps because the update period should be a multiple of the period of the slow-force computation. Computational times in the benchmark tests were estimated from the last 1000 integration steps out of a total of 2000 integration steps.


The above sections have shown that the use of GPUs greatly reduces the computational time of everything except the reciprocal-space interactions. In all-atom MD simulations, however, the 3D FFT computation in the reciprocal-space interactions is not negligible, in particular when using a large number of CPU nodes. In Figure 11(a), we compare the computational time of the real-space calculation on CPU+GPU (including CPU/GPU data transfer and force accumulation) with that of the reciprocal-space calculation on CPU for STMV. In all cases, the computational time of the real-space interactions is much less than that of the reciprocal-space interactions. In other words, the main computational bottleneck is the reciprocal-space interaction if we calculate the real-space interactions on GPUs. The midpoint cell scheme with the volumetric decomposition of the 3D FFT could prove to be a good solution for multiple (CPU+GPU) processors because it does not require communication to prepare the charge grid data before the 3D FFT.

2) MD performance on TSUBAME

To understand the effect of GPUs on accelerating MD simulations, we also compared the performance on TSUBAME with that on the K computer for STMV using the velocity Verlet integrator. Each node of the K computer contains a single CPU (8 cores, 2.0 GHz) and no GPU. Each node is connected through the 6D mesh/torus interconnect named Tofu, which allows better scalability than conventional PC clusters. The total and reciprocal-space computational times for one integration step are shown in Figure 11(b). K shows better performance for the reciprocal-space calculation than TSUBAME if the number of cores is larger than 1536. The slightly better performance of TSUBAME over K for a small number of cores is due to the higher clock speed of the CPU on TSUBAME. The benchmark tests on K show a larger difference between the total simulation time and the reciprocal-space computation time. TSUBAME shows a


slightly worse performance for the reciprocal-space interaction but a better overall performance, because the gap between the total and reciprocal-space times is not as large as on K. This means that the reciprocal-space interactions are not the only computational bottleneck in the CPU-only case, while they are the main bottleneck when using CPUs+GPUs. The parallel performance for STMV with each integrator is depicted in Figure 12. It is impressive that GENESIS was scalable up to 256 nodes for all integrator types for STMV. Even with the velocity Verlet integration, where the time-consuming reciprocal-space calculations are performed at every integration step, good scalability is retained. This good parallel efficiency originates from the highly parallelized nature of both the real-space and reciprocal-space calculations. The maximum performance for STMV is 3.85 ms per integration step with Langevin RESPA with ∆t = 2 fs and ∆tslow = 8 fs, showing a significant speedup with 256 computer nodes with GPUs compared with a few nodes.

IV. Conclusion

We have developed a parallelization scheme for all-atom MD simulations suitable for hybrid (CPU+GPU) processors on multiple nodes. The most time-consuming real-space non-bonded interactions are calculated on GPUs, while the other parts are done on CPUs. To avoid large memory usage for the pairlists of the non-bonded interactions, the single particle interaction list is introduced for each cell pair. The performance with the RESPA multiple time-step integrator is optimized by making both GPUs and CPUs work together on the real-space interactions when the reciprocal-space interactions can be skipped. The acceleration of the GPU calculations means that the total simulation time is dominated by the reciprocal-space calculations on CPUs. In spite of


performance limitations due to the reciprocal-space interactions, the GPU accelerates the overall speed of the MD simulations while keeping good parallel efficiency. The efficient parallelization results from the midpoint cell method with the volumetric decomposition of the 3D FFT, which avoids communicating charge data between real and reciprocal space. One MD time step is completed within 4 ms even for a million-atom system using 256 nodes on TSUBAME, whereas more than 2,000 nodes are needed on the K computer to obtain the same performance. The best performance of TSUBAME for the million-atom system is 3.85 ms/step. Our development could be helpful for long MD simulations of large systems on massively parallel computers equipped with GPUs.


Figure 1. Overview of the midpoint cell scheme. In (a), the simulation space is decomposed into subdomains assigned to MPI processors, and each subdomain is further decomposed into cells. For each cell pair, we assign the midpoint cell as the interaction cell. With this scheme, we only communicate data of adjacent cells. In (b), we depict the case of the midpoint cell method with the volumetric decomposition of the FFT. In this case, we do not need any communication between real and reciprocal space because the domain decompositions are the same. If we follow neither the midpoint cell method nor the volumetric decomposition of the FFT, the domain decompositions in real and reciprocal space differ, as depicted in (c). In that case, global communications are required, which could become the main bottleneck of parallelization on massively parallel supercomputers.


Figure 2. (a) Non-bonded interaction scheme on hybrid (CPU+GPU) systems. For each pair of the i-th and j-th cells, we first calculate the pairwise distances between the atoms in the i-th cell and those in the j-th cell. Second, we identify the particles in each cell that do not interact with any atom in the other cell; such particles are excluded, and we generate an atom list for each cell (the single particle interaction list). Finally, we compute the non-bonded interactions between the particles in the interaction lists. Because these might include interactions that are already covered by the bonded interaction list, we compute the non-bonded interactions after masking with the bonded interaction list. (b) Blocks of 32 interaction pairs for each pair of the i-th and j-th cells.


Figure 3. The overall flowchart of hybrid (CPU+GPU) calculation. Basically, GPU calculates real-space non-bonded interactions and CPU deals with the others. When GPU generates single particle interaction lists, CPU generates masks from the bonded interaction lists.


Figure 4. The real-space interaction time as a function of GPU/CPU computational ratio. In this example, the minimal time is obtained at the ratio value 33, which means it is optimal to make the GPU load 33 times larger than that of the CPU.


Figure 5. The overall flowchart of the hybrid (CPU+GPU) calculation with the multiple time-step integrator. When the slow force from the long-range interactions is required, the GPU calculates the real-space non-bonded interactions and the CPU deals with the others. If the slow force calculation can be skipped, the GPU calculates one subset (set 2) of the real-space non-bonded interactions, and the CPU calculates the other subset (set 1) of the real-space non-bonded interactions together with the bonded interactions.


Figure 6. Total energy along the NVE MD simulations of (a) BPTI and (b) MSPA for different integrators, time steps, and computational platforms.


Figure 7. RMSFs of the Cα atoms of BPTI with respect to the crystal structure for NVT trajectories with Langevin thermostats.


Figure 8. Distributions of (a) angle energy and (b) dihedral angle energy from 100 ns NVT MD simulations of BPTI.


Figure 9. Free energy profiles of the (Ala)3 peptide in solvent obtained using the (a) Langevin VVER and (b) Langevin RESPA integrators in REMD (energy unit: kcal/mol).


Figure 10. Comparison of the performance using CPU only, GPU only, and CPU+GPU for a model system on a single node. Here, we assigned 4 MPI processes with 4 OpenMP threads each to fully use the 16 cores of the node.


Figure 11. (a) Computational time of real-space and reciprocal-space interaction of STMV on TSUBAME and (b) Computational time of the total and reciprocal-space interaction of STMV on TSUBAME and K.


Figure 12. The computational time of one integration step for STMV with (a) velocity Verlet, (b) RESPA (NVE with reciprocal-space interaction at every other step), and (c) Langevin RESPA (NVT with reciprocal-space interaction at every fourth step).


Table I. Energy drifts per degree of freedom from 10 ns (BPTI) and 2 ns (MSPA) NVE MD simulations (unit: kT/ns/degree of freedom)

System   Platform   Time step   Integrator   Energy drift
BPTI     CPU+GPU    0.5 fs      VVER         -1.11×10⁻⁵
BPTI     CPU+GPU    2 fs        VVER          7.50×10⁻⁵
BPTI     CPU+GPU    2 fs        RESPA         5.82×10⁻⁵
BPTI     CPU+GPU    2 fs        RESPA*        5.11×10⁻²
BPTI     CPU        2 fs        VVER          6.13×10⁻⁵
BPTI     CPU        2 fs        RESPA         6.84×10⁻⁵
MSPA     CPU+GPU    0.5 fs      VVER         -2.37×10⁻⁵
MSPA     CPU+GPU    2 fs        VVER         -2.60×10⁻⁵
MSPA     CPU+GPU    2 fs        RESPA        -2.31×10⁻⁵
MSPA     CPU+GPU    2 fs        RESPA*        1.45×10⁻²
MSPA     CPU        2 fs        VVER         -1.55×10⁻⁵
MSPA     CPU        2 fs        RESPA        -1.92×10⁻⁵

* Reciprocal-space interaction is calculated every fourth integration step (8 fs)


Table II. Ensemble test from 10 ns simulations of TIP3P water molecules (true slope = 0.002691)

Slow force time step   Energy type   Estimated slope       Deviation (×σ*)
4 fs                   total         0.002679±0.000067     0.18
4 fs                   kinetic       0.002763±0.000047     1.52
4 fs                   potential     0.002693±0.000059     0.04
8 fs                   total         0.002562±0.000092     1.40
8 fs                   kinetic       0.002642±0.000066     0.74
8 fs                   potential     0.002551±0.000080     1.75

* σ is the standard deviation of the distribution of the estimated slope.
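As a quick consistency check of the reference value in Table II, the true slope follows directly from the two simulation temperatures, assuming (our assumption; the unit is not stated in the table) that the energies are expressed in kJ/mol:

|\beta_2 - \beta_1| = \frac{1}{k_B}\left(\frac{1}{298\,\mathrm{K}} - \frac{1}{300\,\mathrm{K}}\right) = \frac{2.237\times 10^{-5}\,\mathrm{K}^{-1}}{0.0083145\,\mathrm{kJ\,mol^{-1}\,K^{-1}}} \approx 0.002691\,\mathrm{mol\,kJ^{-1}},

which reproduces the value 0.002691 quoted in the table caption.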

AUTHOR INFORMATION

Corresponding Author
* Yuji Sugita, RIKEN iTHES and Theoretical Molecular Science Laboratory, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan. Phone: +81-48-462-1407; Fax: +81-48-467-4532; E-mail: [email protected]

ACKNOWLEDGMENT

This research used the computational resources of TSUBAME provided by Tokyo Institute of Technology through the grand challenge project and of the K computer provided by the RIKEN Advanced Institute for Computational Science through the HPCI System Research project (project ID: ra000009, hp140169, hp150108). Part of this research has been funded by the strategic programs for innovation research “Computational life science and application in drug discovery and medical development” and by MEXT, “Novel measurement techniques for visualizing live protein molecules at work” (Grant No. 26119006).

ABBREVIATIONS

MD, molecular dynamics; FFT, fast Fourier transform; GPU, graphics processing unit; MTS, multiple time step; PME, particle mesh Ewald; RESPA, reversible reference system propagator algorithm

REFERENCES
1. McCammon, J. A.; Gelin, B. R.; Karplus, M., Dynamics of Folded Proteins. Nature 1977, 267 (5612), 585-590.
2. Karplus, M.; McCammon, J. A., Molecular dynamics simulations of biomolecules. Nat. Struct. Biol. 2002, 9 (9), 646-52.


3. Komuro, Y.; Miyashita, N.; Mori, T.; Muneyuki, E.; Saitoh, T.; Kohda, D.; Sugita, Y., Energetics of the Presequence-Binding Poses in Mitochondrial Protein Import Through Tom20. J. Phys. Chem. B 2013, 117 (10), 2864-2871.
4. Imai, T.; Sugita, Y., Dynamic Correlation between Pressure-Induced Protein Structural Transition and Water Penetration. J. Phys. Chem. B 2010, 114 (6), 2281-2286.
5. Berneche, S.; Roux, B., Energetics of ion conduction through the K+ channel. Nature 2001, 414 (6859), 73-77.
6. de Groot, B. L.; Grubmuller, H., Water permeation across biological membranes: Mechanism and dynamics of aquaporin-1 and GlpF. Science 2001, 294 (5550), 2353-2357.
7. Tajkhorshid, E.; Nollert, P.; Jensen, M. O.; Miercke, L. J. W.; O'Connell, J.; Stroud, R. M.; Schulten, K., Control of the selectivity of the aquaporin water channel family by global orientational tuning. Science 2002, 296 (5567), 525-530.
8. Gumbart, J.; Trabuco, L. G.; Schreiner, E.; Villa, E.; Schulten, K., Regulation of the Protein-Conducting Channel by a Bound Ribosome. Structure 2009, 17 (11), 1453-1464.
9. Feig, M.; Burton, Z. F., RNA Polymerase II with Open and Closed Trigger Loops: Active Site Dynamics and Nucleic Acid Translocation. Biophys. J. 2010, 99 (8), 2577-2586.
10. Sanbonmatsu, K. Y., Computational studies of molecular machines: the ribosome. Curr. Opin. Struct. Biol. 2012, 22 (2), 168-174.
11. Zhao, G. P.; Perilla, J. R.; Yufenyuy, E. L.; Meng, X.; Chen, B.; Ning, J. Y.; Ahn, J.; Gronenborn, A. M.; Schulten, K.; Aiken, C.; Zhang, P. J., Mature HIV-1 capsid structure by cryo-electron microscopy and all-atom molecular dynamics. Nature 2013, 497 (7451), 643-646.
12. Andoh, Y.; Yoshii, N.; Yamada, A.; Fujimoto, K.; Kojima, H.; Mizutani, K.; Nakagawa, A.; Nomoto, A.; Okazaki, S., All-atom molecular dynamics calculation study of entire poliovirus empty capsids in solution. J. Chem. Phys. 2014, 141 (16), 165101.
13. Harada, R.; Tochio, N.; Kigawa, T.; Sugita, Y.; Feig, M., Reduced Native State Stability in Crowded Cellular Environment Due to Protein-Protein Interactions. J. Am. Chem. Soc. 2013, 135 (9), 3696-3701.
14. Feig, M.; Harada, R.; Mori, T.; Yu, I.; Takahashi, K.; Sugita, Y., Complete atomistic model of a bacterial cytoplasm for integrating physics, biochemistry, and systems biology. J. Mol. Graph. Model. 2015, 58, 1-9.
15. Wu, E. L.; Fleming, P. J.; Yeom, M. S.; Widmalm, G.; Klauda, J. B.; Fleming, K. G.; Im, W., E. coli Outer Membrane and Interactions with OmpLA. Biophys. J. 2014, 106 (11), 2493-2502.
16. Kono, H.; Shirayama, K.; Arimura, Y.; Tachiwana, H.; Kurumizaka, H., Two Arginine Residues Suppress the Flexibility of Nucleosomal DNA in the Canonical Nucleosome Core. PLoS One 2015, 10 (3), e0120635.
17. Brooks, B. R.; Brooks, C. L., 3rd; Mackerell, A. D., Jr.; Nilsson, L.; Petrella, R. J.; Roux, B.; Won, Y.; Archontis, G.; Bartels, C.; Boresch, S.; Caflisch, A.; Caves, L.; Cui, Q.; Dinner, A. R.; Feig, M.; Fischer, S.; Gao, J.; Hodoscek, M.; Im, W.; Kuczera, K.; Lazaridis, T.; Ma, J.; Ovchinnikov, V.; Paci, E.; Pastor, R. W.; Post, C. B.; Pu, J. Z.; Schaefer, M.; Tidor, B.; Venable, R. M.; Woodcock, H. L.; Wu, X.; Yang, W.; York, D. M.; Karplus, M., CHARMM: the biomolecular simulation program. J. Comput. Chem. 2009, 30 (10), 1545-614.


18. Salomon-Ferrer, R.; Case, D. A.; Walker, R. C., An overview of the Amber biomolecular simulation package. Wires Comput Mol Sci 2013, 3 (2), 198-210.
19. Phillips, J. C.; Braun, R.; Wang, W.; Gumbart, J.; Tajkhorshid, E.; Villa, E.; Chipot, C.; Skeel, R. D.; Kale, L.; Schulten, K., Scalable molecular dynamics with NAMD. J. Comput. Chem. 2005, 26 (16), 1781-802.
20. Hess, B.; Kutzner, C.; van der Spoel, D.; Lindahl, E., GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable molecular simulation. J. Chem. Theory Comput. 2008, 4 (3), 435-447.
21. Andoh, Y.; Yoshii, N.; Fujimoto, K.; Mizutani, K.; Kojima, H.; Yamada, A.; Okazaki, S.; Kawaguchi, K.; Nagao, H.; Iwahashi, K.; Mizutani, F.; Minami, K.; Ichikawa, S.; Komatsu, H.; Ishizuki, S.; Takeda, Y.; Fukushima, M., MODYLAS: A Highly Parallelized General-Purpose Molecular Dynamics Simulation Program for Large-Scale Systems with Long-Range Forces Calculated by Fast Multipole Method (FMM) and Highly Scalable Fine-Grained New Parallel Processing Algorithms. J. Chem. Theory Comput. 2013, 9 (7), 3201-3209.
22. Shaw, D. E.; Maragakis, P.; Lindorff-Larsen, K.; Piana, S.; Dror, R. O.; Eastwood, M. P.; Bank, J. A.; Jumper, J. M.; Salmon, J. K.; Shan, Y. B.; Wriggers, W., Atomic-Level Characterization of the Structural Dynamics of Proteins. Science 2010, 330 (6002), 341-346.
23. Stone, J. E.; Phillips, J. C.; Freddolino, P. L.; Hardy, D. J.; Trabuco, L. G.; Schulten, K., Accelerating molecular modeling applications with graphics processors. J. Comput. Chem. 2007, 28 (16), 2618-2640.
24. Anderson, J. A.; Lorenz, C. D.; Travesset, A., General purpose molecular dynamics simulations fully implemented on graphics processing units. J. Comput. Phys. 2008, 227 (10), 5342-5359.
25. Jha, P. K.; Sknepnek, R.; Guerrero-Garcia, G. I.; de la Cruz, M. O., A Graphics Processing Unit Implementation of Coulomb Interaction in Molecular Dynamics. J. Chem. Theory Comput. 2010, 6 (10), 3058-3065.
26. Nguyen, T. D.; Carrillo, J. M. Y.; Dobrynin, A. V.; Brown, W. M., A Case Study of Truncated Electrostatics for Simulation of Polyelectrolyte Brushes on GPU Accelerators. J. Chem. Theory Comput. 2013, 9 (1), 73-83.
27. Mashimo, T.; Fukunishi, Y.; Kamiya, N.; Takano, Y.; Fukuda, I.; Nakamura, H., Molecular Dynamics Simulations Accelerated by GPU for Biological Macromolecules with a Non-Ewald Scheme for Electrostatic Interactions. J. Chem. Theory Comput. 2013, 9 (12), 5599-5609.
28. Brown, W. M.; Kohlmeyer, A.; Plimpton, S. J.; Tharrington, A. N., Implementing molecular dynamics on hybrid high performance computers - Particle-particle particle-mesh. Comput. Phys. Commun. 2012, 183 (3), 449-459.
29. Darden, T.; York, D.; Pedersen, L., Particle Mesh Ewald - an N.Log(N) Method for Ewald Sums in Large Systems. J. Chem. Phys. 1993, 98 (12), 10089-10092.
30. Essmann, U.; Perera, L.; Berkowitz, M. L.; Darden, T.; Lee, H.; Pedersen, L. G., A Smooth Particle Mesh Ewald Method. J. Chem. Phys. 1995, 103 (19), 8577-8593.
31. Salomon-Ferrer, R.; Gotz, A. W.; Poole, D.; Le Grand, S.; Walker, R. C., Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs. 2. Explicit Solvent Particle Mesh Ewald. J. Chem. Theory Comput. 2013, 9 (9), 3878-3888.

32. Harvey, M. J.; Giupponi, G.; De Fabritiis, G., ACEMD: Accelerating Biomolecular Dynamics in the Microsecond Time Scale. J. Chem. Theory Comput. 2009, 5 (6), 1632-1639.
33. Eastman, P.; Friedrichs, M. S.; Chodera, J. D.; Radmer, R. J.; Bruns, C. M.; Ku, J. P.; Beauchamp, K. A.; Lane, T. J.; Wang, L. P.; Shukla, D.; Tye, T.; Houston, M.; Stich, T.; Klein, C.; Shirts, M. R.; Pande, V. S., OpenMM 4: A Reusable, Extensible, Hardware Independent Library for High Performance Molecular Simulation. J. Chem. Theory Comput. 2013, 9 (1), 461-469.
34. Ruymgaart, A. P.; Cardenas, A. E.; Elber, R., MOIL-opt: Energy-Conserving Molecular Dynamics on a GPU/CPU System. J. Chem. Theory Comput. 2011, 7 (10), 3072-3082.
35. Phillips, J. C.; Stone, J. E.; Schulten, K., Adapting a message-driven parallel application to GPU-accelerated clusters. In Proceedings of the 2008 International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2008); IEEE: 2008; pp 1-9.
36. Abraham, M. J.; Murtola, T.; Schulz, R.; Páll, S.; Smith, J. C.; Hess, B.; Lindahl, E., GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 2015, 1-2, 19-25.
37. Phillips, J. C.; Sun, Y.; Jain, N.; Bohm, E. J.; Kalé, L. V., Mapping to irregular torus topologies and other techniques for petascale biomolecular simulation. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis; IEEE Press: 2014; pp 81-91.
38. Jung, J.; Mori, T.; Kobayashi, C.; Matsunaga, Y.; Yoda, T.; Feig, M.; Sugita, Y., GENESIS: a hybrid-parallel and multi-scale molecular dynamics simulator with enhanced sampling algorithms for biomolecular and cellular simulations. WIREs Comput. Mol. Sci. 2015, 5 (4), 310-323.
39. Jung, J.; Mori, T.; Sugita, Y., Midpoint Cell Method for Hybrid (MPI+OpenMP) Parallelization of Molecular Dynamics Simulations. J. Comput. Chem. 2014, 35 (14), 1064-1072.
40. Jung, J.; Kobayashi, C.; Imamura, T.; Sugita, Y., Parallel implementation of 3D FFT with volumetric decomposition schemes for efficient molecular dynamics simulations. Comput. Phys. Commun. 2016, 200, 57-65.
41. Bowers, K. J.; Dror, R. O.; Shaw, D. E., The midpoint method for parallelization of particle simulations. J. Chem. Phys. 2006, 124 (18), 184109.
42. Tuckerman, M.; Berne, B. J.; Martyna, G. J., Reversible Multiple Time Scale Molecular-Dynamics. J. Chem. Phys. 1992, 97 (3), 1990-2001.
43. http://www.gsic.titech.ac.jp/en/tsubame.
44. http://www.aics.riken.jp/en/k-computer/about/.
45. Case, D. A.; Cheatham, T. E., 3rd; Darden, T.; Gohlke, H.; Luo, R.; Merz, K. M., Jr.; Onufriev, A.; Simmerling, C.; Wang, B.; Woods, R. J., The Amber biomolecular simulation programs. J. Comput. Chem. 2005, 26 (16), 1668-1688.
46. Jorgensen, W. L.; Maxwell, D. S.; Tirado-Rives, J., Development and testing of the OPLS all-atom force field on conformational energetics and properties of organic liquids. J. Am. Chem. Soc. 1996, 118 (45), 11225-11236.
47. Oostenbrink, C.; Villa, A.; Mark, A. E.; Van Gunsteren, W. F., A biomolecular force field based on the free enthalpy of hydration and solvation: The GROMOS force-field parameter sets 53A5 and 53A6. J. Comput. Chem. 2004, 25 (13), 1656-1676.

48. Trotter, H. F., On the product of semi-groups of operators. Proc. Amer. Math. Soc. 1959, 10 (4), 545-551.
49. Barth, E.; Schlick, T., Extrapolation versus impulse in multiple-timestepping schemes. II. Linear analysis and applications to Newtonian and Langevin dynamics. J. Chem. Phys. 1998, 109 (5), 1633-1642.
50. Grest, G. S.; Kremer, K., Molecular-Dynamics Simulation for Polymers in the Presence of a Heat Bath. Phys. Rev. A 1986, 33 (5), 3628-3631.
51. Gonnet, P., Pairwise verlet lists: combining cell lists and verlet lists to improve memory locality and parallelism. J. Comput. Chem. 2012, 33 (1), 76-81.
52. https://en.wikipedia.org/wiki/Haswell_(microarchitecture) (accessed November 16).
53. Jung, J.; Mori, T.; Sugita, Y., Efficient Lookup Table Using a Linear Function of Inverse Distance Squared. J. Comput. Chem. 2013, 34 (28), 2412-2420.
54. Duan, Y.; Wu, C.; Chowdhury, S.; Lee, M. C.; Xiong, G. M.; Zhang, W.; Yang, R.; Cieplak, P.; Luo, R.; Lee, T.; Caldwell, J.; Wang, J. M.; Kollman, P., A point-charge force field for molecular mechanics simulations of proteins based on condensed-phase quantum mechanical calculations. J. Comput. Chem. 2003, 24 (16), 1999-2012.
55. MacKerell, A. D.; Bashford, D.; Bellott, M.; Dunbrack, R. L.; Evanseck, J. D.; Field, M. J.; Fischer, S.; Gao, J.; Guo, H.; Ha, S.; Joseph-McCarthy, D.; Kuchnir, L.; Kuczera, K.; Lau, F. T. K.; Mattos, C.; Michnick, S.; Ngo, T.; Nguyen, D. T.; Prodhom, B.; Reiher, W. E.; Roux, B.; Schlenkrich, M.; Smith, J. C.; Stote, R.; Straub, J.; Watanabe, M.; Wiorkiewicz-Kuczera, J.; Yin, D.; Karplus, M., All-atom empirical potential for molecular modeling and dynamics studies of proteins. J. Phys. Chem. B 1998, 102 (18), 3586-3616.
56. Huang, J.; MacKerell, A. D., CHARMM36 all-atom additive protein force field: Validation based on comparison to NMR data. J. Comput. Chem. 2013, 34 (25), 2135-2145.
57. Shirts, M. R., Simple Quantitative Tests to Validate Sampling from Thermodynamic Ensembles. J. Chem. Theory Comput. 2013, 9 (2), 909-926.
58. Sugita, Y.; Okamoto, Y., Replica-exchange molecular dynamics method for protein folding. Chem. Phys. Lett. 1999, 314 (1-2), 141-151.
59. Ryckaert, J. P.; Ciccotti, G.; Berendsen, H. J. C., Numerical-Integration of Cartesian Equations of Motion of a System with Constraints - Molecular-Dynamics of N-Alkanes. J. Comput. Phys. 1977, 23 (3), 327-341.
60. Andersen, H. C., Rattle - a Velocity Version of the Shake Algorithm for Molecular-Dynamics Calculations. J. Comput. Phys. 1983, 52 (1), 24-34.
61. Miyamoto, S.; Kollman, P. A., Settle - an Analytical Version of the Shake and Rattle Algorithm for Rigid Water Models. J. Comput. Chem. 1992, 13 (8), 952-962.
62. Hess, B., P-LINCS: A parallel linear constraint solver for molecular simulation. J. Chem. Theory Comput. 2008, 4 (1), 116-122.
63. Jorgensen, W. L.; Chandrasekhar, J.; Madura, J. D.; Impey, R. W.; Klein, M. L., Comparison of Simple Potential Functions for Simulating Liquid Water. J. Chem. Phys. 1983, 79 (2), 926-935.
64. http://folding.bmc.uu.se/remd/ (accessed September ).

TOC graphic
