
Application Note

J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.6b00151 • Publication Date (Web): May 27, 2016



Hybrid MPI/OpenMP Implementation of the ORAC Molecular Dynamics Program for Generalized Ensemble and Fast Switching Alchemical Simulations

Piero Procacci∗
Department of Chemistry, University of Florence, Sesto Fiorentino, 50019, Italy
E-mail: [email protected]

Abstract

We present a new release (6.0β) of the ORAC program [Marsili, S.; Signorini, G. F.; Chelli, R.; Marchi, M.; Procacci, P. ORAC: A Molecular Dynamics Simulation Program to Explore Free Energy Surfaces in Biomolecular Systems at the Atomistic Level. J. Comput. Chem. 2010, 31, 1106-1116] with a hybrid OpenMP/MPI multilevel parallelism tailored for generalized ensemble (GE) and fast switching double annihilation (FS-DAM) nonequilibrium technology, aimed at evaluating the binding free energy in drug-receptor systems on High Performance Computing platforms. The production of the GE or FS-DAM trajectories is handled using a weak scaling parallel approach at the MPI level only, while a strong scaling force decomposition scheme is implemented for intranode computations with shared memory access at the OpenMP level. The efficiency, simplicity and inherently parallel nature of the ORAC implementation of the FS-DAM algorithm make the code a potentially effective tool for a second generation High Throughput Virtual Screening in drug discovery and design. The code, along with documentation, testing and ancillary tools, is distributed under the provisions of the General Public License (GPL) and can be freely downloaded at the Internet address www.chim.unifi.it/orac.

∗To whom correspondence should be addressed

Introduction

Parallel implementations are today widespread in molecular dynamics (MD) programs for complex systems. Most of these applications concern both the parallelization of the non bonded force loop, a typical strong scaling problem, and the exploitation of inherently parallel sampling methods based on weakly scaling multicanonical approaches (Hamiltonian Replica Exchange (H-REM), 1 Serial Generalized Ensemble (SGE) 2). The parallelization of the strong scaling non bonded force computations in modern MD codes 3–6 is usually handled in a distributed memory environment at the MPI level via the so-called neutral territory domain decomposition (NT-DD) approach. 7,8 Combining MPI with OpenMP in NT-DD implementations in fact incurs a considerable overhead. 4,9 As in any strong scaling computation, the scaling behavior of the DD scheme with MPI is limited by the mean number of particles per domain/processor. When the latter decreases, the mean workload per processor also decreases and the communication overhead inevitably increases. In typically sized drug-receptor systems (from 10000 to 30000 atoms), the saturation level of the parallel performance of DD-based MPI-only MD codes is reached within a few hundred or even a few tens of cores, depending on network speed. It follows that investing a large share of the available computational resources in the strong scaling parallelization of the force loops, while at the same time reducing the number of parallel instances devoted to the weak scaling algorithms based on generalized ensemble or nonequilibrium (NE) technologies, can be highly counterproductive, eventually limiting the overall sampling power of the simulation in systems characterized by rugged free energy surfaces.

The present release of the hybrid OpenMP/MPI ORAC code 10,11 for molecular dynamics of biomolecular systems is designed to optimally exploit the recent trend toward non uniform memory access (NUMA) multi-core High Performance Computing platforms. The previous parallel release of the code featured a weakly scaling MPI-only implementation of advanced techniques for enhanced sampling in atomistic systems, including replica exchange with solute tempering, metadynamics and NE steered molecular dynamics. Forces on particles were computed serially using a very efficient Multiple Time Step (MTS) integration scheme. 10,12 In this new release, a freshly implemented OpenMP layer is applied to the parallelization of the bonded and non bonded forces within a single shared memory access compute node. This is achieved by setting up a particle decomposition method for intramolecular and intermolecular force computations with tunable cache line size handling and dynamic threading. In the ORAC6.0 version, we added full support for the so-called fast switching double annihilation method (FS-DAM) 13,14 for efficiently computing binding free energies in drug-receptor systems using the OpenMP/MPI multi-level parallel approach. FS-DAM is based on the production of many independent NE processes in which an externally driven alchemical parameter, regulating the interaction of the ligand with the environment, is varied continuously during the multi-step integration of the equations of motion according to a prescribed common time schedule. 15,16 The drug-receptor absolute binding free energy can be recovered from the annihilation work histograms of the bound and unbound states by applying an unbiased unidirectional free energy estimate. We refer to Refs. 13,14 for a detailed description of the FS-DAM algorithms in drug design.
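To make the force-level OpenMP layer concrete, the following is a minimal, self-contained Fortran sketch of a particle decomposition pair force loop with dynamic scheduling. It is illustrative only (toy Lennard-Jones interaction in reduced units, random coordinates, all-pairs looping, no neighbor lists or MTS machinery) and is not taken from the ORAC source; the chunk parameter merely plays the role of the tunable workload/cache line unit mentioned above.

! Illustrative sketch only (not the ORAC source): particle decomposition of a
! direct-space pair force loop parallelized with OpenMP dynamic scheduling.
program omp_force_sketch
  use omp_lib
  implicit none
  integer, parameter :: n = 2000, chunk = 16   ! chunk ~ tunable workload unit
  real(8) :: x(n,3), f(n,3), dx(3), epot, r2, r6i, ff
  integer :: i, j
  call random_number(x)
  x = 30.d0*x                                  ! random coordinates in a 30 Angstrom box
  f = 0.d0; epot = 0.d0
  !$omp parallel do schedule(dynamic,chunk) default(shared) &
  !$omp private(i,j,dx,r2,r6i,ff) reduction(+:f,epot)
  do i = 1, n-1
     do j = i+1, n                             ! all pairs; a real code uses neighbor lists
        dx = x(i,:) - x(j,:)
        r2 = sum(dx*dx)                        ! assumed nonzero for random coordinates
        r6i = 1.d0/(r2**3)
        epot = epot + 4.d0*(r6i*r6i - r6i)     ! reduced-unit Lennard-Jones energy
        ff = 24.d0*(2.d0*r6i*r6i - r6i)/r2
        f(i,:) = f(i,:) + ff*dx
        f(j,:) = f(j,:) - ff*dx
     end do
  end do
  !$omp end parallel do
  print '(a,i0,a,es12.4)', 'threads: ', omp_get_max_threads(), '   Epot = ', epot
end program omp_force_sketch

Compiled with, e.g., gfortran -fopenmp, the schedule(dynamic,chunk) clause hands out outer-loop iterations in chunks, which mitigates the load imbalance inherent in a triangular pair loop.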

Code use and performances from desktop computers to HPC systems

The ORAC code, mostly written in Fortran, is highly portable and compiles successfully with the most common compilers (Intel ifort, IBM xlf90, PGI pgfortran and GNU gfortran) on many Unix platforms, from laptop computers to HPC facilities including BlueGene/Q IBM architectures. ORAC can use the efficient FFTW suite 17 for the Fourier transforms in the evaluation of the PME contribution to the reciprocal lattice forces, but it also has its own built-in parallel FFT libraries. The present mixed-thread parallel version of ORAC is designed to maximize the sampling efficiency of a single MD simulation job on a NUMA multi-core platform using advanced highly parallel techniques, such as H-REM or SGE, or even zero-communication, embarrassingly parallel approaches like FS-DAM. We deem "thermodynamic" the parallelism related to such weakly scaling algorithms. Thermodynamic parallelism in ORAC is handled only at the distributed memory MPI level, with a minimal incidence of the inter-replica communication overhead. Hence, the number of MPI instances in ORAC simply equals the number of replicas/walkers in H-REM/SGE simulations or the number of independent NE trajectories in an FS-DAM job.

In order to reduce the loss of parallel efficiency due to node overloading, ORAC can handle the unequal granularity of the various asynchronous force computations in the nested multiple time step levels using an end-user controlled dynamic adjustment of the number of OpenMP threads per MPI instance. Intramolecular forces, the enforcement of constraints, the step-wise advancement of velocities/coordinates and the computation of non bonded forces in the direct and reciprocal lattice can each be computed with a user controlled number of threads by switching on dynamic threading. From the end-user point of view, dynamic threading is specified through an optional heading in the main input file (see Figure S1 of the Supporting Information).
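The division of labor between the two levels can be visualized with the minimal hybrid skeleton below. It is an illustration only (placeholder task bodies, thread counts and step count), not ORAC's actual driver or its input mechanism for dynamic threading.

! Illustrative hybrid MPI/OpenMP skeleton (not ORAC's actual driver): each MPI
! rank advances one independent replica or NE trajectory ("thermodynamic"
! parallelism), while OpenMP threads inside the rank share the force work and
! the per-task thread count can be adjusted to the task granularity.
program hybrid_skeleton
  use mpi
  use omp_lib
  implicit none
  integer :: ierr, rank, nranks, step
  integer :: nth_direct, nth_recip             ! per-task thread counts (placeholders)
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)
  nth_direct = omp_get_max_threads()           ! e.g. all node threads for direct lattice forces
  nth_recip  = max(1, nth_direct/2)            ! fewer threads for the poorly scaling FFT part
  do step = 1, 100                             ! MTS loop of this rank's replica/trajectory
     call omp_set_num_threads(nth_direct)
     !$omp parallel
     ! ... direct lattice non bonded forces of this rank's system ...
     !$omp end parallel
     call omp_set_num_threads(nth_recip)
     !$omp parallel
     ! ... reciprocal lattice (PME/FFT) forces ...
     !$omp end parallel
     ! ... constraints, intramolecular forces, velocity/coordinate advancement ...
  end do
  ! H-REM/SGE exchange attempts or FS-DAM work collection would be the only
  ! inter-rank (MPI) communication in this scheme.
  print '(a,i0,a,i0)', 'replica ', rank, ' done, total replicas = ', nranks
  call MPI_Finalize(ierr)
end program hybrid_skeleton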

Tests on a single multi-core CPU. In Figure 1(a) we show the OpenMP speedups obtained on a single 16-core HPC compute node for a system including a penta-alanine mini-protein solvated in 9300 TIP3P water molecules, for a total of 28000 atoms in a cubic MD simulation box with a side-length of ≃ 64 Å. The size of such a benchmark system, as we shall see further on, is typical for alchemically driven drug-receptor FS-DAM simulations. Further details on the simulations are given in the Methods section of the Supporting Information. Tests were run on a 16-core genuine Intel CPU. With eight processors, the overall parallel efficiency of the ORAC code is about 70%, with a nearly sixfold speedup factor. Performances are only marginally better when using all 16 cores of the node, with a speedup factor of about 7 and a parallel efficiency dropping below 50%. Speedup factors are, however, different for the various types of calculation involved in the MD production. So, for example, whereas the scaling of the computation of the direct lattice non bonded forces is indeed satisfactory, with a parallel efficiency hitting 70% with 16 cores, the scaling of the overall FFTW-PME computation only produces a maximum threefold speedup on eight cores, in line with recently reported benchmarks. 17 The computation labeled "MTS integrator" in Figure 1 includes constraint enforcement, fast bonded force calculations and asynchronous MTS velocity/coordinate advancement. All these combined tasks represent just 3% of the total MD workload in a serial job. However, due to their fine granularity, the MTS integration computations impact the overall parallel efficiency when increasing the number of processors.
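For reference, the speedup and parallel efficiency quoted here and in Figure 1 are to be read in the usual sense (a standard definition, stated here for completeness rather than taken from the text):

\[
S(p) = \frac{t_1}{t_p}, \qquad E(p) = \frac{S(p)}{p},
\]

so that the figures quoted above for the 28000 atom system, $S(8) \simeq 5.6$ and $S(16) \simeq 7$, correspond to $E(8) \simeq 0.70$ and $E(16) \simeq 0.44$.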

Figure 1: (a) Speedup measurements on an ENEA-CRESCO5 18 compute node (16-core Intel(R) Xeon(R) CPU E5-2630 v3, 2.40 GHz) for a 28000 atom system (see text for details). Programs were compiled with Intel ifort and FFTW libraries. (b) OpenMP parallel efficiency obtained with ORAC6.0β for systems of various sizes on a 16-core genuine Intel Xeon(R) CPU E5-2630 (black curves, filled symbols) and on a non Intel 8-core AMD FX(tm)-8150 CPU (red curves, empty symbols). (c) GROMACS 5.1.2 and ORAC6.0β benchmarks on a low-end desktop computer equipped with an 8-core AMD FX(tm)-8150 CPU as a function of system size.

Single node OpenMP performances depend on the compiler and the CPU architecture. In Figure 1(b) we report a series of benchmarks as a function of the system size, using an ifort-compiled ORAC6.0β target on a genuine Intel CPU and on a non Intel CPU. As before, the system is a penta-alanine peptide, solvated in about 29800, 9200, 1700, and 500 TIP3P water molecules. As expected, the parallel efficiency decreases with decreasing system size and the corresponding processor workload. However, while for the Intel node (Xeon E5-2630) even the small system (circa 1500 atoms) still exhibits a nearly 40% parallel efficiency, the efficiency obtained with the non Intel CPU (AMD FX(tm)-8150) is consistently below that obtained with the Intel CPU, dropping, for the smallest system, to about 30% using eight processors.

Finally, in Figure 1(c) we compare the performances (measured in ns/day of simulation) of the OpenMP ORAC6.0β code to those obtained with the latest release of the popular GROMACS program. The latter MD code is admirably efficient, owing to algorithmic and processor-specific professional optimization, and typically runs 3-10 times faster than many simulation programs. 4 The reported tests were done on an 8-core AMD FX(tm)-8150 CPU using solvated penta-alanine samples of various sizes. ORAC and GROMACS jobs were run using an identical setup (constant pressure and temperature, bond constraints on X-H bonds, PME grid spacing of 1 Å, 10 Å cutoff). The log-log plot of Figure 1(c) shows that the performances of the two codes on a single 8-core CPU are comparable. For the largest system (90000 atoms), GROMACS runs at a speed of 2.2 ns/day compared to the 1.3 ns/day of ORAC6.0β. The efficiency gap between the two codes narrows with decreasing system size: for the smallest system (1500 atoms), GROMACS and ORAC produce about 95 and 60 ns of simulation in a day, respectively.

Tests on HPC platforms. An OpenMP/MPI implementation of the ORAC6.0β code has been tested on the Fermi BlueGene/Q (BGQ) system at CINECA. Fermi is made up of about 10,000 compute nodes, each featuring an IBM PowerA2 chip with 16 cores working at a frequency of 1.6 GHz. The basic PowerA2 1.6 GHz core unit is very slow; however, the low speed of the single core can be compensated by exploiting the hyper-threading technology of the BGQ chipset, which allows a maximum of 64 hardware threads (HWT) per node. As a test case, we use the solvated Tumor Necrosis Factor α Converting Enzyme (TACE) in complex with the IK682 tight binding ligand, 19 a system made up of about 23000 atoms. See the Supporting Information for technical details on the simulation setup. An OpenMP/MPI application to the solvated IK682-TACE complex on a HPC facility could be used, for example, for a H-REM simulation with scaling of the binding site region only, 20,21 in order to efficiently sample the fully coupled bound state, or for the production of independent FS-DAM trajectories (vide infra) aimed at determining the binding affinity of the ligand via the associated NE work distribution. 16

In the first instance, pilot tests for this kind of simulation have been done for a fixed slot of 1024 cores (64 compute nodes), using a total number of OpenMP and MPI threads equal to the number of cores (i.e. using a single HWT per core). We varied the number of MPI instances from 64 (1 trajectory/walker per node with 16 OpenMP threads) to 1024 (16 trajectories/walkers per node and no OpenMP layer), hence assigning, in the latter case, just one parallel MPI instance to each physical core. For the sake of clarity, a sample Batch Scheduler LoadLeveler submission script is reported and explained in the Supporting Information. The OpenMP performances on Fermi differ among the ORAC MD-related algorithms, going from 90-80% for the direct lattice force routines, to 60-30% for the FFT reciprocal lattice forces, down to 40-14% for the low granularity fast intramolecular forces and bond constraints. On 1024 cores the program runs at a maximum speed of 0.6 ns/day, producing 64 concurrent trajectories (either of the H-REM or the FS-DAM type). Full details of these OpenMP/MPI performance tests are reported in Table 1 of the Supporting Information.

Table 1: Tests for ORAC6.0β applications done on 64 compute nodes of the Fermi BGQ HPC facility using up to 4 HWT per core. The efficiency of the parallel simulation is measured in the number of processed ligand-receptor pairs per day.

NMPI/NOMP    cores    threads (HWT)    Ntraj./node    Elaps. Time/ps    Ligands/day
    8        1024         2048              2               131              7.2
   16        1024         4096              4               164             10.9
   32        1024         2048              4               186             10.3
   64        1024         4096              8               250             15.4
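The columns of Table 1 are mutually consistent if read with the following relations, inferred here from the stated slot of 64 compute nodes and 1024 physical cores (they reproduce all four rows):

\[
N_{\mathrm{HWT}} = N_{\mathrm{MPI}}\, N_{\mathrm{OMP}}, \qquad
\mathrm{HWT/core} = \frac{N_{\mathrm{HWT}}}{1024}, \qquad
N_{\mathrm{traj}}/\mathrm{node} = \frac{N_{\mathrm{MPI}}}{64}.
\]

For the last row, $N_{\mathrm{MPI}}/N_{\mathrm{OMP}} = 64$ with $N_{\mathrm{OMP}} = 8$ gives $N_{\mathrm{MPI}} = 512$ concurrent trajectories, i.e. 8 trajectories per node and 4 HWT per core.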

Finally, tests have also been done using up to 4 HWT per core, thus filling each BGQ compute node with a total of 64 threads, given by the product of the MPI rank_per_node times the OMP_NUM_THREADS parallel instances. Examples of LoadLeveler batch submission scripts using more than 1 HWT per core are reported and commented in the Supporting Information. The results are reported in Table 1. The overall efficiency of the computation is measured in processed ligand-receptor pairs per day on the so-called bigpar queue (16384 cores), on the basis of a simulation time of 0.3 ns for each of the 512 FS-DAM annihilation trajectories (vide infra) per pair. Scaling is indeed astonishing: de facto, the 64 logical cores on the single BGQ chip module appear to behave as physical cores. The maximum efficiency is reached at the highest ratio NMPI/NOMP = 64 between MPI instances (i.e. FS-DAM trajectories) and OpenMP threads, that is with eight OpenMP threads per MPI instance and four HWT per core, peaking at 15.4 ligand-receptor pairs/day (or about 2.3 µs/day of total simulation time).
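The aggregate figure can be checked directly from the numbers quoted above:

\[
15.4~\mathrm{pairs/day} \times 512~\mathrm{trajectories/pair} \times 0.3~\mathrm{ns/trajectory} \approx 2.4~\mu\mathrm{s/day},
\]

in line with the quoted total simulation throughput of about 2.3 µs/day.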

A test case: dissociation free energy of TACE ligands via FS-DAM simulations on a BlueGene/Q platform

In this section we provide a practical demonstration of how ORAC can be used to efficiently evaluate the binding free energy in drug-receptor systems on a HPC platform. To this end, we present some preliminary results obtained on the Fermi BGQ facility using the recently developed FS-DAM technology for drug design. 16 We stress that these results are presented here for demonstrative purposes; a thorough study on TACE inhibitors using FS-DAM will be reported in a forthcoming paper. The chosen drug-receptor pair is the previously mentioned TACE protein bound to the tight binding inhibitor IK682. 19 A detailed description of the system and of the simulation setup is provided in the Supporting Information. The TACE-IK682 complex presents various methodological challenges for the binding free energy determination via alchemical transformations. The TACE protein contains a doubly charged metal ion interacting directly with the hydroxamate moiety of the ligand, hence requiring an annihilation work of hundreds of kcal mol−1. An error of a few percent on this value may easily be transferred to the annihilation free energy of the zinc binding group (ZBG), possibly producing discrepancies of several kcal mol−1 in the computed dissociation free energy.

In order to collect the starting configurations of the unbound ligand, a 5 ns NPT simulation of IK682 in bulk water, using a cubic MD box of 30 Å side-length, was performed. Once collected, the final 512 starting configurations of the bound and unbound states were transferred to the Fermi/CINECA system. Two parallel OpenMP/MPI jobs on 2048 cores (128 compute nodes) were finally submitted, producing the 512 FS-DAM annihilation 0.3 ns trajectories of the bound (23094 atoms) and unbound (2969 atoms) states, for a total combined 0.3 µs of simulation. Both jobs ended in a single wall clock day.

In Figure 2 we show the time record of the alchemical work in a subset of the FS-DAM annihilation trajectories of the IK682 ligand in the bound state and in bulk solvent, along with the corresponding distribution of the final alchemical work. The nature of the work distributions was checked by computing higher moments of the distributions (skewness and kurtosis) and by using the standard Kolmogorov test, as described in Ref. 16. Both work distributions turn out to be the superposition of more than one component. In particular, the work distribution referring to the bound state exhibits a significant left skewness, while that relative to the annihilation of the ligand in bulk shows a large kurtosis.
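The kind of post-processing involved can be illustrated by the minimal Fortran sketch below, which computes the sample moments used to check the shape of a work distribution and, for a single Gaussian component, the second-order cumulant estimate ΔG = ⟨W⟩ − σ²/(2kT). This is an illustration only, with placeholder work data and an assumed temperature; the actual analysis in this work relies on the Gaussian mixture estimate of Ref. 22 and the Kolmogorov test mentioned above.

! Illustrative post-processing sketch (not part of ORAC): moments of the final
! FS-DAM work values and the single-Gaussian (second cumulant) free energy
! estimate. Real work values would be read from the trajectory output; here a
! placeholder sample is generated instead.
program work_moments
  implicit none
  integer, parameter :: n = 512
  real(8), parameter :: kT = 0.596d0           ! kcal/mol at ~300 K (assumed temperature)
  real(8) :: w(n), mean, var, skew, kurt, dg
  call random_number(w)
  w = 40.d0 + 3.d0*(w - 0.5d0)                 ! placeholder work data, kcal/mol
  mean = sum(w)/n
  var  = sum((w - mean)**2)/(n - 1)
  skew = sum(((w - mean)/sqrt(var))**3)/n
  kurt = sum(((w - mean)/sqrt(var))**4)/n - 3.d0   ! excess kurtosis
  dg   = mean - var/(2.d0*kT)                  ! second order cumulant (Gaussian) estimate
  print '(a,4f10.3)', 'mean, var, skewness, excess kurtosis: ', mean, var, skew, kurt
  print '(a,f10.3,a)', 'single-Gaussian annihilation free energy estimate: ', dg, ' kcal/mol'
end program work_moments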

Figure 2: Time record of the annihilation work for a subset of the 512 FS-DAM trajectories of the bound state (left) and of the unbound state (right) for the IK682-TACE ligand-receptor pair. In the inset, the corresponding work distribution is reported.

By representing the overall FS-DAM work distributions with an appropriate combination of normal components, the bound and unbound annihilation free energies (∆Gb and ∆Gs, respectively) can be straightforwardly computed as described in Ref. 22, finding ∆Gb = 43.3 ± 0.9 kcal mol−1 and ∆Gs = 21.6 ± 0.3 kcal mol−1. The dissociation free energy is given by the difference between these two values plus a correction of −3.5 ± 0.7 kcal mol−1 related to the binding site volume. Further details on the calculations are provided in the Supporting Information. Based on these results, the standard dissociation free energy of the TACE-IK682 complex can be estimated as ∆G0 = 18.2 ± 1.4 kcal mol−1. Experimental measurements of the IK682 binding affinity vs TACE are very difficult due to the extreme strength of the binding. In any case, IK682 was found to exhibit picomolar activity 19 with an apparent Ki on the order of 0.02 nM, corresponding to a standard dissociation free energy of ≃ 15 kcal mol−1 at pH 7.3. As previously stated, the TACE-IK682 complex is indeed challenging from a computational standpoint, with the binding free energy crucially depending on many details (pH modeling, force fields, work distribution resolution). Nonetheless, in spite of the preliminary nature of this demonstrative OpenMP/MPI ORAC application, FS-DAM was able to correctly predict IK682 as a tight binding TACE inhibitor, with a semi-quantitative agreement between the computed (∆G0(FS-DAM) = 18.2 ± 1.4 kcal mol−1) and experimental (∆G0(Exp.) ≃ 15.0 kcal mol−1) standard dissociation free energies.
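For clarity, the bookkeeping behind the computed estimate (labeling the binding site volume correction ∆Gvol for convenience) and the conversion of the experimental inhibition constant (assuming T ≈ 300 K and a 1 M standard state, neither of which is specified above) read:

\[
\Delta G_0 = \Delta G_b - \Delta G_s + \Delta G_{\mathrm{vol}} = 43.3 - 21.6 - 3.5 = 18.2~\mathrm{kcal\,mol^{-1}},
\]
\[
\Delta G_0^{(\mathrm{Exp.})} = -RT \ln\frac{K_i}{1~\mathrm{M}} \approx -0.596 \ln\!\left(2 \times 10^{-11}\right) \approx 14.7~\mathrm{kcal\,mol^{-1}} \simeq 15~\mathrm{kcal\,mol^{-1}}.
\]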

Conclusion and perspective

In this paper we have presented an OpenMP/MPI implementation of the ORAC code 10,11 for multi-core HPC platforms. The hybrid parallelization is designed in such a way that the distributed memory MPI level is reserved for thermodynamic parallelism only (H-REM, GE or FS-DAM low communication computations), while the strong scaling force decomposition algorithm is implemented in the OpenMP layer within the intra-node multi-core shared memory environment. Such a design is specifically tuned to maximize the sampling efficiency in complex heterogeneous systems that are characterized by a manifold of competing free energy attractors; hence the need, in MD codes for biological systems, for an effective exploitation of the so-called non Boltzmann methodologies that artificially enhance the probability of rare fluctuations in auxiliary thermodynamic states by rigorously re-weighting the statistics onto the target thermodynamic state. In some sense, H-REM and GE methodologies, as well as FS-DAM techniques, are all aimed at expanding the statistics (i.e. the number of statistically independent simulated proteins) rather than the time scale of the system. These techniques are especially advantageous in a massively parallel environment, where one can efficiently exploit the limited communication overhead needed for their implementation, easily producing, in a single wall clock day, an overall simulation time on the order of microseconds. The OpenMP/MPI ORAC MD code is based on this simulation paradigm and is tailored for the current architectural trends in super-computing, limiting the strong scaling force computation to the intra-node shared memory environment with only a moderate subtraction of resources from the thermodynamic parallelism for enhanced sampling.

As stated in the introduction, the present release of the code is of a prototypical type. The 6.0β tarball, including source, documentation and OpenMP/MPI testing examples, is made freely available at the Internet address http://www.chim.unifi.it/orac, with all the caveats that are typical of a betaware program. Future developments of ORAC are foreseen to involve GPU porting, a more efficient implementation of the alchemical protocol with the possibility of enforcing the asynchronous annihilation of different parts of the molecule, and the possibility of simultaneously handling the FS-DAM annihilation of multiple topologies for high throughput screening on massively parallel HPC systems. We conclude by recalling that the program, including the source, is distributed under the General Public License, and contributions and improvements from researchers and software developers are welcome.

Acknowledgement

Part of this work was done in the framework of the CINECA ISCRA Project "Hybrid MPI and OpenMP parallel nonequilibrium method for fast calculations of the binding free energy in Molecular Dynamics Simulations of ligand-receptor systems" (ISCRA Grant identifier MPNEDRUG). Part of the computing resources and the related technical support used for this work have been kindly provided by the CRESCO/ENEAGRID High Performance Computing infrastructure and its staff. 18


Supporting Information Available

Handling of dynamic threading in OpenMP implementations. OpenMP/MPI dispatching with ORAC6.0 on a BGQ HPC platform using LoadLeveler, and ORAC performance tests. Calculation of the annihilation free energy estimates in the FS-DAM of the TACE-IK682 complex. Computational details and simulation setup for the TACE-IK682 system. This material is available free of charge via the Internet at http://pubs.acs.org/.


References

(1) Sugita, Y.; Okamoto, Y. Replica-Exchange Molecular Dynamics Method for Protein Folding. Chem. Phys. Lett. 1999, 314, 141–151.
(2) Marinari, E.; Parisi, G. Simulated Tempering: A New Monte Carlo Scheme. Europhys. Lett. 1992, 19, 451–458.
(3) Hess, B.; Kutzner, C.; van der Spoel, D.; Lindahl, E. GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation. J. Chem. Theory Comput. 2008, 4, 435–447.
(4) Abraham, M. J.; Murtola, T.; Schulz, R.; Pall, S.; Smith, J. C.; Hess, B.; Lindahl, E. GROMACS: High Performance Molecular Simulations through Multi-Level Parallelism from Laptops to Supercomputers. SoftwareX 2015, 1-2, 19–25.
(5) Phillips, J. C.; Braun, R.; Wang, W.; Gumbart, J.; Tajkhorshid, E.; Villa, E.; Chipot, C.; Skeel, R. D.; Schulten, K. Scalable Molecular Dynamics with NAMD. J. Comput. Chem. 2005, 26, 1781–1802.
(6) Plimpton, S. Fast Parallel Algorithms for Short-Range Molecular Dynamics. J. Comput. Phys. 1995, 117, 1–19.
(7) Bowers, K. J.; Dror, R. O.; Shaw, D. E. Overview of Neutral Territory Methods for the Parallel Evaluation of Pairwise Particle Interactions. J. Phys.: Conf. Ser. 2005, 16, 300–304.
(8) Shaw, D. E. A Fast, Scalable Method for the Parallel Evaluation of Distance-Limited Pairwise Particle Interactions. J. Comput. Chem. 2005, 26, 1318–1328.
(9) See, for example, the "Acceleration and parallelization" section in the online GROMACS 4 documentation at http://www.gromacs.org/ (accessed 22 January 2016).
(10) Procacci, P.; Paci, E.; Darden, T.; Marchi, M. ORAC: A Molecular Dynamics Program to Simulate Complex Molecular Systems with Realistic Electrostatic Interactions. J. Comput. Chem. 1997, 18, 1848–1862.
(11) Marsili, S.; Signorini, G. F.; Chelli, R.; Marchi, M.; Procacci, P. ORAC: A Molecular Dynamics Simulation Program to Explore Free Energy Surfaces in Biomolecular Systems at the Atomistic Level. J. Comput. Chem. 2010, 31, 1106–1116.
(12) Marchi, M.; Procacci, P. Coordinates Scaling and Multiple Time Step Algorithms for Simulation of Solvated Proteins in the NPT Ensemble. J. Chem. Phys. 1998, 109, 5194–5202.
(13) Procacci, P. I. Dissociation Free Energies in Drug-Receptor Systems via Non Equilibrium Alchemical Simulations: Theoretical Framework. Phys. Chem. Chem. Phys. 2016, DOI: 10.1039/C5CP05519A.
(14) Nerattini, F.; Chelli, R.; Procacci, P. II. Dissociation Free Energies in Drug-Receptor Systems via Non Equilibrium Alchemical Simulations: Application to the FK506-Related Immunophilin Ligands. Phys. Chem. Chem. Phys. 2016, DOI: 10.1039/C5CP05521K.
(15) Procacci, P.; Cardelli, C. Fast Switching Alchemical Transformations in Molecular Dynamics Simulations. J. Chem. Theory Comput. 2014, 10, 2813–2823.
(16) Sandberg, R. B.; Banchelli, M.; Guardiani, C.; Menichetti, S.; Caminati, G.; Procacci, P. Efficient Nonequilibrium Method for Binding Free Energy Calculations in Molecular Dynamics Simulations. J. Chem. Theory Comput. 2015, 11, 423–435.
(17) Frigo, M.; Johnson, S. G. The Design and Implementation of FFTW3. Proc. IEEE 2005, 93, 216–231. Special issue on "Program Generation, Optimization, and Platform Adaptation".
(18) Ponti, G.; Palombi, F.; Abate, D.; Ambrosino, F.; Aprea, G.; Bastianelli, T.; Beone, F.; Bertini, R.; Bracco, G.; Caporicci, M.; Calosso, B.; Chinnici, M.; Colavincenzo, A.; Cucurullo, A.; Dangelo, P.; De Rosa, M.; De Michele, P.; Funel, A.; Furini, G.; Giammattei, D.; Giusepponi, S.; Guadagni, R.; Guarnieri, G.; Italiano, A.; Magagnino, S.; Mariano, A.; Mencuccini, G.; Mercuri, C.; Migliori, S.; Ornelli, P.; Pecoraro, S.; Perozziello, A.; Pierattini, S.; Podda, S.; Poggi, F.; Quintiliani, A.; Rocchi, A.; Scio, C.; Simoni, F.; Vita, A. Proceedings of the International Conference on High Performance Computing & Simulation; Institute of Electrical and Electronics Engineers (IEEE), 2014; pp 1030–1033.
(19) Niu, X.; Umland, S.; Ingram, R.; Beyer, B.; Liu, Y. H.; Sun, J.; Lundell, D.; Orth, P. IK682, A Tight Binding Inhibitor of TACE. Arch. Biochem. Biophys. 2006, 451, 43–50.
(20) Procacci, P.; Bizzarri, M.; Marsili, S. Energy-Driven Undocking (EDU-HREM) in Solute Tempering Replica Exchange Simulations. J. Chem. Theory Comput. 2014, 10, 439–450.
(21) Wang, L.; Wu, Y.; Deng, Y.; Kim, B.; Pierce, L.; Krilov, G.; Lupyan, D.; Robinson, S.; Dahlgren, M. K.; Greenwood, J.; Romero, D. L.; Masse, C.; Knight, J. L.; Steinbrecher, T.; Beuming, T.; Damm, W.; Harder, E.; Sherman, W.; Brewer, M.; Wester, R.; Murcko, M.; Frye, L.; Farid, R.; Lin, T.; Mobley, D. L.; Jorgensen, W. L.; Berne, B. J.; Friesner, R. A.; Abel, R. Accurate and Reliable Prediction of Relative Ligand Binding Potency in Prospective Drug Discovery by Way of a Modern Free-Energy Calculation Protocol and Force Field. J. Am. Chem. Soc. 2015, 137, 2695–2703.
(22) Procacci, P. Unbiased Free Energy Estimates in Fast Nonequilibrium Transformations Using Gaussian Mixtures. J. Chem. Phys. 2015, 142, 154117(1–11).


Conventional simulations of microseconds are a suboptimal way to deal with rare events that occur on the same time scale. ORAC6.0 is designed to achieve maximum sampling efficiency in the simulation of biosystems. Enhanced sampling strategies relying on weak scaling thermodynamic parallelism, such as H-REM, SGE, FS-DAM or metadynamics, boosted by a tunable OpenMP layer, can provide a far more reliable statistical picture of many biologically relevant events.
