
Error-Safe, Portable, and Efficient Evolutionary Algorithms Implementation with High Scalability

Johannes M. Dieterich*,† and Bernd Hartke*,‡

†Institut für Physikalische Chemie, Georg-August-Universität Göttingen, Tammannstrasse 6, 37077 Göttingen, Germany
‡Theoretische Chemie, Institut für Physikalische Chemie, Christian-Albrechts-Universität Kiel, Olshausenstrasse 40, 24098 Kiel, Germany



ABSTRACT: We present an efficient massively parallel implementation of genetic algorithms for chemical and materials science problems, solely based on Java virtual machine (JVM) technologies and standard networking protocols. The lack of complicated dependencies allows for a highly portable solution exploiting strongly heterogeneous components within a single computational context. At runtime, our implementation is almost completely immune to hardware failure, and additional computational resources can be added or subtracted dynamically, if needed. With extensive testing, we show that despite all these benefits, parallel scalability is excellent.

1. INTRODUCTION

Global optimization has been used with great success in chemistry and materials science, for various purposes. One of them is finding lowest-energy structures of atomic or molecular clusters, i.e., finding the structural arrangement in three-dimensional space that globally minimizes the energy, which is given as a predefined function of the coordinates of all particles. In this task, chemical intuition is insufficient to predict the most likely structures already for relatively small cluster sizes, and likewise series of local optimizations from guessed starting structures are sure to fail.1 With increasing cluster size, the search space (and the number of local minima in it) grows exponentially. Hence, nontrivial cluster sizes are simply too large for deterministic global optimization. However, various nondeterministic global optimization algorithms have been shown to work sufficiently effectively in practice.2−5 Nevertheless, even with nondeterministic global optimization, excessive amounts of computer time are needed for the cluster sizes of interest in experiments. Fortunately, many of these algorithms are trivially parallel, which can be turned into good parallel scaling with proper implementations. Accordingly, this has been done before, for example for evolutionary algorithms (EA), departing from the old generational paradigm to eliminate serial bottlenecks6,7 (cf. ref 8 for a whole book on EA parallelization). Similarly, basin-hopping was combined with parallel tempering, executed in parallel, for better search power in difficult landscapes.9 For particle swarm optimization, a speedup by a factor of 215 could be achieved via parallelization on GPGPUs.10

This trivial parallelism makes nondeterministic global optimization a prime candidate for efficient and useful exploitation of very many processor cores. Additionally, there is no need for an even distribution of tasks, for employing identical
hardware nodes, or even for keeping the number of nodes in a run constant. As shown in this contribution, these characteristics can be used to turn global optimizations into computer tasks that are immune to hardware failure and able to dynamically utilize every bit of computing resources left over, in typical queue-based computing centers or in wildly heterogeneous ad hoc networks.

Hardware failure at runtime is the bane of high-performance computing (HPC). Standard HPC installations are now in the petascale realm and require thousands of nodes to provide this computing capacity. As can be expected from a trivial statistical analysis, hardware failure becomes prohibitively likely for long computing jobs using such large numbers of nodes.11 Hence, claims of petascale algorithms without explicit, built-in error-recovery capabilities must be rejected as purely academic. The evolving MPI-3 standard12 provides features addressing this problem. With previous MPI standards, some degree of fault tolerance could also be achieved, but this was neither straightforward nor efficient.13,14

As a simpler solution with fewer dependencies, we present in this contribution a different parallel implementation that in its conception specifically aims at fault tolerance, highest portability, and excellent scalability. This is achieved by relying on limited communication bandwidth requirements and independent tasks in an embarrassingly parallel setting. These are typical features of many nondeterministic global optimization schemes, including Genetic and Evolutionary Algorithms (GAs and EAs), for which we present our examples. However, other computational tasks in the natural sciences also show these characteristics, for example most Monte Carlo algorithms (in chemical kinetics, particle physics, etc.). Therefore, the implementation ideas presented here may also be useful in those areas. As a universal standard with a focus on high-bandwidth requirements, we argue that MPI is less suitable for such situations. After a detailed discussion of the communication protocols and implementation we devised, we present extensive tests on scalability, error-safety, and portability of our parallel GA framework.

2. IMPLEMENTATION

Our implementation of genetic algorithms, OGOLEM, is fully object-oriented and written in the Java (version 7+) and Scala languages. OGOLEM can therefore exploit the portability advantages provided by the Java virtual machine (JVM). Two different parallelization technologies are supported: shared-memory parallelization based on the Java thread model and distributed-memory parallelization based on a Java wrapper of the MPI API (mpiJava and MPJ Express are supported). The threading implementation was shown to scale linearly and with little overhead to 48 cores on a single compute node. The MPI implementation, using a native MPI library through the mpiJava wrapper, was shown to scale linearly to 128 cores.7 We expect neither of these processor numbers to be the actual scalability limit of the respective implementation.

There are nevertheless limitations to both parallelization techniques. Obviously, the thread-based implementation cannot make use of larger installations unless they are of the (cc)NUMA type. The MPI implementation, despite its good support by computing centers, scalability, and efficiency, suffers for our purposes from the drawbacks outlined above: a) no error safety in case of hardware failure at runtime when using the MPI 1 or 2 standard, b) additional dependencies through the MPI wrapper and (if applicable) implementation, c) an API focused on operations with primitive data types, e.g., arrays of doubles, which makes object-oriented programming complicated, and d) an implicit assumption of identical hardware resources.

In light of massively parallel installations with higher probabilities of hardware failure, as well as the rise of cloud-style computing on potentially diverse hardware resources, we introduce an implementation without the disadvantages of these existing approaches. We accomplish this goal by exploiting the Remote Method Invocation (RMI) API available in the Java Standard Edition (Java SE). RMI enables an object-oriented programming model by allowing method invocation on remote objects (hence its name) and packages data via (de)serialization of JVM objects. Typically, the data transfer happens via simple TCP/IP communication, with exception handling for, e.g., transmission errors built into the API. This, in combination with being a standard part of Java SE, allows us to connect any two (or more) JVMs as long as they can communicate via TCP/IP. RMI is the API used underneath the MPJ Express library15 and has hence proven its worth in high-performance computing environments. We note again that, in contrast to MPJ Express, we do not aim to mimic the MPI interface through RMI; instead, we make direct use of RMI for a parallelization implementation customized for genetic algorithms. It is perfectly conceivable to achieve a similar implementation using MPI through MPJ Express. However, such an undertaking would come at the expense of an object-oriented architecture and would add dependencies and additional setup requirements.
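To make the mechanism concrete, the following minimal, self-contained sketch (our own illustration, not OGOLEM code; the Greeter interface and the registry name are invented) shows the basic RMI pattern: define a remote interface, export an implementation on the server JVM, and invoke it from any other JVM that can reach the registry via TCP/IP. For brevity, both sides run in one process here.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Illustrative only: any JVM that can reach the registry's TCP/IP port can call greet().
interface Greeter extends Remote {
    String greet(String who) throws RemoteException;
}

public class RmiBasics {

    static class GreeterImpl implements Greeter {
        @Override
        public String greet(String who) {
            return "hello, " + who;
        }
    }

    public static void main(String[] args) throws Exception {
        // server side: export the implementation and advertise it in an RMI registry
        GreeterImpl impl = new GreeterImpl();
        Greeter stub = (Greeter) UnicastRemoteObject.exportObject(impl, 0);
        Registry reg = LocateRegistry.createRegistry(1099);
        reg.rebind("greeter", stub);

        // client side (normally a different JVM on a different host): look up and invoke
        Registry remote = LocateRegistry.getRegistry("localhost", 1099);
        Greeter proxy = (Greeter) remote.lookup("greeter");
        System.out.println(proxy.greet("remote client")); // method executes in the server JVM

        UnicastRemoteObject.unexportObject(impl, true);   // allow the sketch to terminate
    }
}
```

Every remote call may throw a RemoteException, so transmission problems surface as ordinary Java exceptions that the caller can handle and retry.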

Our established, efficient parallel setup for genetic algorithms is pool-based.6 A constant-sized pool of trial solutions is maintained and updated throughout the optimization. New solutions are accepted based on their fitness and their genetic diversity compared to the existing solutions in the pool. In contrast to classic generation-based schemes, this has three major advantages: a) fewer serialization points and hence better intrinsic scalability, b) built-in elitism, and c) fewer tunables and therefore more stable performance. We have shown16 that it easily solves standard analytical benchmark functions up to thousands of dimensions on a simple PC. This scheme naturally decomposes into a server-client architecture, with the server responsible for maintaining the genetic pool and handing out optimization tasks to the client(s). The clients in turn are responsible for carrying out these tasks, either initializing a possible candidate or obtaining one through genetic operations on one or more parent solutions, and for returning the result to the server. Hence, they are very thin clients by nature.

This communication model is represented exactly in the simplest mode of our parallel algorithm. In this server+clients mode, depicted in Figure 1, a server process maintains the pool and distributes tasks to a number of client processes. We can readily identify a potential drawback of this mode: too much server-client communication. Genetic algorithms have superior scalability because a pool-based algorithm is trivially parallel and because the data exchanged between server and client is small: the solution candidate's genome and its fitness. The latter makes bandwidth requirements negligible, but latency in communication may still matter. If the time required for a client to compute a task is not substantially longer than the time required to receive the task and to send the result back, communication will hamper scalability. This situation can occur in computational chemistry/materials science when computationally cheap model potentials are used to evaluate the fitness function.

Figure 1. Parallelization and communication schematics for the server+clients RMI mode of OGOLEM.

We therefore introduce two other communication modes, server+proxies+clients and server+threading clients, to address this problem (cf. Figures 2 and 3). In both cases, the server process does not hand out individual tasks but blocks of tasks. In return, it receives a pool of solution candidates among which are the results from the task block. The received pool is merged with the pool on the server, and the merged pool is handed back to the proxy or threading client. Quite naturally, these modes give rise to a multipool algorithm in which multiple independent genetic pools exchange genetic information with an authoritative server pool. The algorithm allows these pools to be truly independent, with their own configurations for genetic pool, genetic operations, and diversity criteria. The exchange with the server pool can be limited to a subset of the pool if a stronger decoupling of the populations is beneficial. (Note that parallelization of GAs via partly independent subpopulations with only rare exchange of data was fashionable more than 20 years ago8 but has received less attention recently.) In the case of a threading client, the tasks are taken care of in the same shared-memory context in which the secondary pool is located (vide infra). In the case of a proxy setup, the proxy relays these jobs to a set of simple clients treating the proxy as their server.

Figure 2. In the server+proxies+clients RMI mode of OGOLEM, another branching level is added to the tree structure of the server+clients mode depicted in Figure 1, to mitigate a possible communication bottleneck at the server.

Figure 3. In the server+threading clients RMI mode, the proxies and clients of the server+proxies+clients mode depicted in Figure 2 are merged into threading clients; on compute clusters with many-thread nodes this setup is simpler to handle and a better match to the hardware.

After this schematic view of the parallelization modes, we discuss the communication (and therefore the exposed API) from the server perspective. The server process must be started first; it configures itself and exports the standard-mandated RMI registry and its communication on two TCP/IP ports. All subsequent client/proxy-to-server communication uses these ports. Clients/proxies are required to register with the server first and obtain a unique identification number (ID). All subsequent communications with the server contain this ID. We currently do not protect against intentionally malicious clients falsifying their ID.

In the case of the server+clients model, the server will subsequently receive a request to return its status. This can be answered with continue, for the client to continue with obtaining a task, waiting if a serialization point has been reached, finish if the optimization is done, and unknown to signal an unintended server state. Only if the server's state is continue will subsequent requests for new tasks return a valid task. As noted before, these tasks fall into two categories: initialization tasks to fill the pool to its configured size and global optimization tasks to create offspring from previous solutions. Clients finished with their assigned task hand the result to the server and receive the current state in return. These calls may occur at any point in execution but must follow the order "register", then a sequence of "task get" and "result return". Hence, more clients can be added to an existing computing context at any time, simply by also letting them register with the server first.
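As an illustration of what such an exposed API could look like as an RMI remote interface, consider the following sketch. The method and type names are our own inventions (the actual OGOLEM signatures may differ); the sketch only mirrors the protocol described above: registration, status queries, single-task handling for simple clients, and block/merge calls for proxies and threading clients.

```java
import java.io.Serializable;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.List;

/** Hypothetical sketch of the server-side RMI API described in the text. */
interface GlobOptServer extends Remote {

    /** Possible answers to a status request, as described above. */
    enum State { CONTINUE, WAITING, FINISH, UNKNOWN }

    /** First call of every client/proxy: obtain a unique ID used in all later calls. */
    long register() throws RemoteException;

    /** Current server state; only CONTINUE guarantees that a task request returns a task. */
    State getState(long clientID) throws RemoteException;

    /** Simple clients: fetch one task (pool initialization or global optimization step). */
    Task getTask(long clientID) throws RemoteException;

    /** Simple clients: hand back one finished candidate and receive the current state. */
    State returnResult(long clientID, Candidate result) throws RemoteException;

    /** Proxies/threading clients: request up to maxTasks tasks as one block. */
    List<Task> getTaskBlock(long clientID, int maxTasks) throws RemoteException;

    /** Proxies/threading clients: merge a secondary pool into the server pool and
        receive the best N individuals of the merged pool plus the server state. */
    MergeAnswer mergePools(long clientID, List<Candidate> secondaryPool) throws RemoteException;
}

/** Placeholder payload types; real ones would carry genome, fitness, and task details. */
class Task implements Serializable {}
class Candidate implements Serializable {}
class MergeAnswer implements Serializable {
    GlobOptServer.State state;
    List<Candidate> bestN;
}
```

Because all payloads are plain serializable JVM objects, the same interface works unchanged across operating systems and hardware architectures, which is what enables the heterogeneous setups discussed in section 3.3.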

For both the server+proxies and server+threading clients modes, the initial registration is followed by requesting a block of tasks. The client provides a configurable maximum number of tasks it wants. The server will typically satisfy this request unless fewer tasks remain. As discussed above, once a block of tasks has been finished, a list of N solution candidates is handed to the server and merged with the existing pool, and the best N individuals of the merged pool are returned alongside the state of the server. Typically, N will be the number of individuals in the pool, i.e., full exchange. Subsequently, new requests are satisfied until the optimization job is finished. Again, more proxies/threading clients may be added at runtime as long as the above communication protocol is obeyed.

In addition to the functions discussed above, the server provides features and exposes an API targeted at ensuring graceful recovery from hardware failure at runtime. By graceful recovery we mean that the loss of any client or proxy must neither jeopardize the optimization job nor put it into an undefined state. Only failure of the server process may cause the job to stop, and it must cause all clients/proxies to shut down gracefully. The latter also makes interacting with queuing systems substantially easier. For this purpose, the server spawns off a background thread that keeps track of all known proxies/clients and of the time stamps when they last contacted the server. If any client fails to contact the server for longer than a configurable timeout, it gets purged from the list, and the server's internal state is updated as if that client had returned its assigned task, if applicable. If the outage was temporary, e.g., caused by network instabilities, and this threading client/proxy tries to return its result, the result is silently ignored, and the client re-enters the list of eligible clients/proxies.
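A minimal sketch of such a watchdog is shown below (our own illustration under assumed names, not the OGOLEM source): every RMI call refreshes a per-client time stamp, and a scheduled background task purges clients whose last contact is older than the configured timeout and releases their outstanding work.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical watchdog: purge clients that missed their heartbeat for too long. */
class ClientWatchdog {

    private final Map<Long, Long> lastContact = new ConcurrentHashMap<>(); // clientID -> time (ms)
    private final long timeoutMillis;

    ClientWatchdog(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called from every RMI entry point (register, heartbeat, task get, result return). */
    void touch(long clientID) {
        lastContact.put(clientID, System.currentTimeMillis());
    }

    /** Background thread on the server, checking, e.g., once per minute. */
    void start(TaskBookkeeping bookkeeping) {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
        exec.scheduleAtFixedRate(() -> {
            long now = System.currentTimeMillis();
            lastContact.forEach((id, last) -> {
                if (now - last > timeoutMillis) {
                    lastContact.remove(id);
                    bookkeeping.releaseTasksOf(id); // make the lost chunk assignable again
                }
            });
        }, 1, 1, TimeUnit.MINUTES);
    }

    /** Whatever structure tracks which tasks were handed to which client. */
    interface TaskBookkeeping {
        void releaseTasksOf(long clientID);
    }
}
```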

Moving on to the client/proxy side of the communication: the simple client features two internal threads. One thread, spawned off at the start of execution, is the heartbeat sent to whatever it considers its server; this can be either the main server or a proxy, as detailed above. The heartbeat thread regularly contacts the server, thereby keeping this client in the list of eligible ones. The other thread is the main worker thread, which asks the server it is associated with for new tasks, executes them, and returns the results. The type, content, and details of those tasks are not transparent to the client and are determined by what the server sends. Once the server signals job completion, or if the server is unreachable, the simple client shuts itself down in a controlled fashion.
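Schematically, such a simple client could look as follows; this is a sketch built on the hypothetical GlobOptServer interface above, and the registry name "globopt-server" as well as the heartbeat interval are assumptions, not OGOLEM's actual values.

```java
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical simple client: one heartbeat thread plus one worker loop. */
class SimpleClient {

    public static void main(String[] args) throws Exception {
        Registry reg = LocateRegistry.getRegistry(args[0], Integer.parseInt(args[1]));
        GlobOptServer server = (GlobOptServer) reg.lookup("globopt-server");
        long id = server.register();

        // heartbeat: contact the server regularly to stay in its list of eligible clients
        ScheduledExecutorService heartbeat = Executors.newSingleThreadScheduledExecutor();
        heartbeat.scheduleAtFixedRate(() -> {
            try {
                server.getState(id);
            } catch (Exception e) {
                System.exit(0); // server unreachable: shut down in a controlled fashion
            }
        }, 30, 30, TimeUnit.SECONDS);

        // worker loop: fetch a task, execute it, return the result
        while (true) {
            GlobOptServer.State s = server.getState(id);
            if (s == GlobOptServer.State.FINISH || s == GlobOptServer.State.UNKNOWN) break;
            if (s == GlobOptServer.State.WAITING) { Thread.sleep(1000); continue; }
            Task task = server.getTask(id);   // initialization or genetic-operation task
            Candidate result = compute(task); // task contents are opaque to the client
            server.returnResult(id, result);
        }
        heartbeat.shutdownNow();
    }

    private static Candidate compute(Task task) {
        // placeholder: a real client runs the genetic operator plus a local optimization
        return new Candidate();
    }
}
```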

The proxy reuses the server code paths to mimic a server for its associated simple clients and adds logic to contact a server and acquire blocks of tasks to distribute to the clients. Blocks of tasks are the equivalent of the server process requesting N initializations or N global optimization steps, leaving the implementation details to the proxy to configure. While we see potential use for this communication mode when coupling, e.g., accelerated and regular clients, users will typically want to employ threading clients for best performance and ease of setup.

Threading clients connect to the main server just as proxies do and acquire blocks of tasks. Again, they spawn off a heartbeat thread and configure their optimization setup, potentially independently from the main setup. However, in contrast to proxies, they execute the tasks they acquire in the same shared-memory context/process as their secondary genetic pool. Hence, they are restricted to a single node but then communicate via Java threads instead of TCP/IP. Only when the secondary pool must be merged with the authoritative server pool and new task blocks have to be acquired is TCP/IP communication necessary. This mode is also characterized by all the positive features mentioned before: It naturally gives rise to a multipool setup, where pools may, if wished, operate with a minimum of information exchange and vastly different optimization strategies. It includes the explicit safety features against hardware failures at runtime common to all modes described here. Through a hybrid parallelization strategy involving both shared-memory thread-level parallelization and TCP/IP-distributed parallelization, it was designed for superior scalability and efficiency.

In summary, we envision these modes to serve different purposes. The simplest mode, with a central server and multiple simple clients, will be beneficial in grid-style computing to couple diverse hardware into a single computing context. As noted above, the server+proxies+simple clients mode will be beneficial to overcome communication bottlenecks present in the first mode. For regular use in HPC centers, server+threading clients will provide the best efficiency, scalability, and stability.

These RMI-based parallel genetic algorithm implementations, along with the previous shared-memory and MPI-based ones as well as all our optimization kernels, are available under 4-clause BSD licensing conditions from https://www.ogolem.org. API documentation based on Javadoc is available there as well.

3. PERFORMANCE ASSESSMENT

For the following scaling demonstrations and other test runs we have chosen pure neutral water clusters modeled by the TTM3-F potential17 as a test case. Two simple reasons for this choice are (1) that this is a real-life application, offering real-life computational load, in contrast to artificial benchmarks that tend to be too simple and hence not realistic,16 and (2) that we have used our efficient Scala implementation of this model potential in several prior application projects.18−20 In fact, water clusters are not just a global optimization benchmark;21,22 in chemistry and molecular physics they are highly important and challenging objects of study in themselves,18,23−25 with relevance for areas as diverse as astrophysics and climate change on Earth. Hence, increased possibilities to study such clusters computationally are very important. Nevertheless, in this contribution, we do not aim at new or improved results for water clusters but merely use them to ensure full realism. For this purpose, the TTM3-F potential is a good compromise, since it is more accurate and computationally a factor of 20 more expensive than the simpler water models normally used in biochemistry calculations26 but several orders of magnitude less expensive than ab initio quantum-chemistry methods.27,28

3.1. Scaling with Simple Clients. For the server+simple clients scheme shown in Figure 1, we have tested the scalability within one computing node, compared both to the ideal case and to the traditional thread-based, shared-memory parallelization in OGOLEM.7 We have selected (H2O)25 as a test case, with a pool size of 100, requesting 25,000 global optimization steps. Our test machine had 4 AMD Opteron 6274 CPUs (each with 16 CPU cores) per node, i.e., 64 CPU cores in total. All runs were performed on one dedicated node (but with normal production calculations running on the seven other nodes). Timings were taken with the standard Linux time command and averaged over 10 repeat runs. Figures 4 and 5 show the resulting scaling curves.

Figure 4. Scaling of the standard SHMEM-mode of OGOLEM, for global optimization of TTM3-F (H2O)25, employing different numbers of CPU cores within one computing node.

As Figure 4 documents, the standard threading/shared-memory scaling of OGOLEM is excellent. For 32 and 64 threads, a small degradation of the scaling is visible compared to the theoretical ideal line. This may be due to communication overhead, which becomes relatively more important for fast individual global optimization steps, and to intrinsic hardware scalability limitations. However, this degradation is very small; clearly, 64-thread runs would be well justified in an actual application.

Figure 5. Scaling of the server+simple clients RMI mode, for global optimization of TTM3-F (H2O)25, employing different numbers of CPU cores within one computing node (identical to the number of simple RMI clients).

Comparing Figure 4 to Figure 5 shows that there is no significant difference between thread-based parallelism and RMI-based parallelism in OGOLEM, despite the fact that for N clients, N separate processes communicate each and every result with a server process via standard TCP/IP (i.e., for these simple clients there is no chunking, in contrast to the threading clients case presented in section 3.2). On the one hand, the lack of significant differences to the traditional thread-based case is not very surprising, since essentially the same amount of communication has to be done in thread-based parallelism, too. On the other hand, the very small deviations from the ideal line in both cases show that the RMI case has been implemented just as efficiently as the established Java thread pool.

3.2. Scaling with Threading Clients. Typical HPC centers offer thousands of computing nodes, each of which contains several dozen CPU cores. This naturally maps onto the server+threading clients scheme depicted in Figure 3, with thread-based communication inside each node and RMI communication across nodes. We have tested this scheme on the Cray XC30 supercomputer "Konrad", installed in Berlin as part of the HLRN HPC center.30 Each XC30 node contains 24 CPU cores (in two 12-core Intel Xeon IvyBridge CPUs). On the first node, one thread was set apart for the server process; the remaining 23 were used for the first threading client. On each of the remaining nodes, one threading client with 24 threads was run. Since only the overall scaling was of interest here, all OGOLEM inputs on all nodes were identical. Each input asked for global optimization of (H2O)55 with the TTM3-F model potential in our Scala implementation. Local optimizations were done with the L-BFGS algorithm,29 with a relative convergence threshold of 10−6. To avoid possible delays in I/O due to the absence of node-local I/O devices, almost all intermediate output was turned off (after suitable initial testing to ensure that the global optimization worked as desired in this setup). Scaling tests were run for node numbers between 2 and 256. For the midrange node counts, 4 million global optimization steps were performed. For 144, 192, and 256 nodes, 6 million, 8 million, and 12 million steps were used, respectively, to avoid possible artifacts from total walltimes dropping below 1 h. On the opposite end of the node-count spectrum, runs had to be shortened (all the way down to 500,000 steps for 2 nodes) to be able to complete one run within the maximum allowed walltime of 12 h. For the final performance comparison, run times were scaled to 4 million steps where necessary.

Note that all scaling runs were performed as part of the normal production load on "Konrad", coming from dozens of other users who were running a wild mix of other applications with varying amounts of parallelism, communication, and I/O. Therefore, all of our runs were done three times, and arithmetic means and standard deviations were taken. To avoid possible failures in the initial setup of RMI communications via the ethernet fabric shared with all other jobs, three seconds of waiting time were inserted before starting each of the threading clients. The resulting startup time (until all threading clients are launched) is entirely negligible for the jobs with small node counts and long walltimes. However, for jobs with large node counts and short walltimes, this startup time is between 7 and 13 min, which already is a sizable fraction of the total walltime of approximately 1 h and 30 min. Therefore, both the raw timings and the timings after subtraction of half of the startup time are shown (since the number of threading clients grows linearly during startup, under the assumption of perfect scaling this is equivalent to half of the time with no clients and half of the time with all clients present). Figure 6 shows the resulting speedup/scaling graph.

Figure 6. Scaling of the server+threading clients RMI mode, for global optimization of TTM3-F (H2O)55, on different numbers of Cray XC30 nodes. Each node has 24 CPU cores, hence the count of CPU cores employed ranges from 48 to 6144.

The line of ideal scaling has a slope of 0.5, since it is based on the runs with two nodes (not just one). In addition to this ideal line, the spread due to one and two standard deviations is shown, which increases linearly with increasing node count, since the ideal scaling is linear. Obviously, the real scaling is perfect (within graphical resolution) up to 96 nodes and then starts to drop slightly. However, this is almost entirely due to our imposed startup waiting times: After subtracting half of this startup time, the scaling is again essentially perfect, also for 144, 192, and 256 nodes, despite the fact that just one server thread was used in all of these runs, which was the target of all communications and which had to integrate all results coming from all threading clients into one coherent, progressively improving EA pool.

This is possible since not each and every new individual generated by the threading clients is communicated back to the server. Instead, communication is bundled into larger chunks. Obviously, suitable chunk sizes depend on the walltime needed for all calculations leading to a new individual (of which local optimization takes more than 95%) and, to a lesser extent, on the number of threading clients per server. After two or three initial tests, the chunk size was set to 2000 for all production runs, demonstrating that finding a good chunk size would not be difficult in a real application: The present choice obviously is very close to optimal across the full range of threading-client counts (from 2 to 256) shown in Figure 6, and almost no optimization effort was made to arrive at this choice.
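The following sketch illustrates how such a chunked threading-client loop could be organized: one RMI round trip to fetch a block of tasks, node-local shared-memory parallelism to work it off, and one RMI round trip to merge the secondary pool with the server pool. It reuses the hypothetical GlobOptServer types sketched in section 2 and is not the actual OGOLEM implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical threading-client main loop: fetch a chunk, work it off with node-local
    threads, then merge the secondary pool with the authoritative server pool via RMI. */
class ThreadingClientLoop {

    static void run(GlobOptServer server, long id, int chunkSize, int nThreads,
                    SecondaryPool pool) throws Exception {
        ExecutorService workers = Executors.newFixedThreadPool(nThreads);
        while (server.getState(id) == GlobOptServer.State.CONTINUE) {
            // one TCP/IP round trip per chunk (e.g., chunkSize = 2000 in section 3.2)
            List<Task> chunk = server.getTaskBlock(id, chunkSize);
            if (chunk.isEmpty()) break;

            // shared-memory parallelism inside the node: no RMI involved here
            List<Future<Candidate>> futures = new ArrayList<>();
            for (Task t : chunk) {
                Callable<Candidate> job = () -> pool.doTaskAndUpdate(t);
                futures.add(workers.submit(job));
            }
            for (Future<Candidate> f : futures) f.get(); // wait for the whole chunk

            // second TCP/IP round trip: merge pools, keep the best N returned by the server
            MergeAnswer answer = server.mergePools(id, pool.snapshot());
            pool.replaceWith(answer.bestN);
        }
        workers.shutdown();
    }

    /** Node-local genetic pool; details (diversity checks, genetic operators) omitted. */
    interface SecondaryPool {
        Candidate doTaskAndUpdate(Task t);
        List<Candidate> snapshot();
        void replaceWith(List<Candidate> best);
    }
}
```

With this structure, the per-individual cost (dominated by the local optimization) determines how large the chunk must be for the two RMI round trips per chunk to become negligible.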

3.3. Heterogeneous Environments. Since OGOLEM is JVM-based, it inherits the platform independence of the JVM. Hence, by construction and in practice, any computer system or electronic device that can host a JVM and communicate via TCP/IP is capable of running OGOLEM and of taking part in a distributed, RMI-parallelized OGOLEM run. We have verified this claim by several RMI test runs in our lab, in which diverse sets of ethernet-linked workstations and computing nodes participated. This does not require any additional work beyond starting an RMI server task on one machine, followed by starts of simple and/or threaded clients on any number and type of other machines. The most heavily mixed run of this kind employed the following setup:
• an RMI server job on a workstation with a 6-core Intel Xeon E5 1650 v2 processor, running Linux 3.12.57-44 and a JDK 1.8.0_65,
• a threaded RMI client job with 2 threads on a laptop with an Intel Core2 Duo P8700 CPU, running Windows 7 and a JRE 1.8.0_92-b14,
• a threaded RMI client job with 64 threads on a Sun SPARC Enterprise T5240 server with two 8-core UltraSPARC T2 Plus CPUs, running SunOS Solaris 10 and a JRE 1.8.0_05.
As in the other cases reported here, this was not some kind of artificial test case but a real-life global optimization task that performed 25,000 global optimization steps for (H2O)25 with the TTM3-F model potential. The very same jar file from the previous subsection 3.2 was used again for this run, the input file and input syntax also remained the same, and no problems were encountered in setting up and performing this run.

3.4. Failure Tolerance and Flexible Number of Clients. To also verify the claim that our RMI-parallel setup tolerates arbitrary loss and addition of (simple or threaded) clients and proxies during runtime and still maintains adequate scalability, even in distributed environments, we have performed scaling tests of another kind: Again using (H2O)25 with the TTM3-F model potential as a real-life test case, now for 10,000 global steps, and again using the very same jar file also used for the scaling runs described in the previous sections, we have employed
• an RMI server job on the same workstation used as a server in the previous section 3.3,
• varying numbers of threaded RMI clients with 2 threads each, on a computing node with two 6-core Intel Xeon X5675 processors and 192 GB RAM, running Linux 3.12.53-40 and a JDK 1.8.0_65.
Both machines were in the same building and in the same subnetwork but in different rooms, and every packet of communication had to go across three switches and many meters of ethernet cable.

In a first stage, a normal series of scaling runs was performed, for N = 1, 2, 3, 4, 5, 6 threaded RMI clients, repeating each run three times. The average walltimes from these runs established the expected walltime for each N. In a second stage, all of these runs were started once more but with a different procedure, simulating sudden loss and addition of clients:
• For one-quarter of the expected walltime, each run proceeded normally, with N clients.
• Then one of the clients was stopped by sending it a SIGKILL signal, i.e., a forced kill without the option of completing any ongoing operations.
• For the ensuing one-half of the expected walltime, the run continued but now with N−1 clients.
• Then a new client was added, so that for the remaining time again N clients were participating.
Thus, effectively, half of the expected walltime was allowed for N clients and the other half for N−1 clients, resulting in a fictitious number of N−0.5 clients for the whole time. Note that this is not exactly true, since actual walltimes for several repetitions of the "same" run differ by several percent, due to the fact that many EA operations employ (pseudo)random numbers, and the effects of this get amplified by the iterative local optimizations. The resulting scaling for integer and half-integer numbers of clients is displayed in Figure 7, as average values over three runs each.

Figure 7. Scaling of the server+threading clients RMI mode, for global optimization of TTM3-F (H2O)25, for different numbers of 2-thread clients, with the server running on a different machine. Half-integer numbers of clients (blue curve) result from killing and adding a client during runtime.

Note that, in contrast to section 3.2, both the overall number of steps and the chunk size were changed: In section 3.2, the chunk size was 2000, and between 0.5 and 12 million steps were performed. Here, the chunk size was set to 200 (one order of magnitude less), and only 10,000 steps were performed (between 1.5 and 3 orders of magnitude less). This disparity is part of the explanation why the scaling for integer numbers of clients is worse than in section 3.2. In addition, despite the use of up to 256 nodes, all runs in section 3.2 were done within one (big) contiguous block of hardware, with a specialized internal network optimized for low-latency, high-volume data exchange. Here, two totally independent machines were used for server and clients, respectively, and communication was done via commodity network components. The effects of this longer-latency communication can be seen in Figure 7: The real scaling starts to deviate slightly but visibly from the ideal scaling between 5 and 6 clients.

As explained above, for half-integer numbers of clients the protocol involves killing and adding one client during runtime. This necessarily leads to a waste of computing time: The results already achieved by the killed client in its last chunk are not passed back to the server; therefore, another client or the newly added client has to restart work on this chunk from scratch, so that part of the chunk is computed twice. To keep this doubling small, we have chosen a small chunk size for this demonstration. Note that for such an artificial test we could also implement a "soft kill" option, which would allow a client that is scheduled to be stopped to still pass its accumulated results back to the server. However, the intention of this test scenario was to simulate client loss due to hardware failure, in which case continuation of client-server communication beyond the failure incident typically is not an option.

Despite this inevitable loss and recomputation of results, the resulting scaling for half-integer client numbers (involving killing and adding of clients) is excellent. The small deviations between the three curves between 1 and 5 clients are within the normal "noise": all operations are ultimately based on (pseudo)random numbers, resulting in total run times that fluctuate by a few percent. This is more visible here than in the previous subsections because the number of global optimization steps is much smaller and hence the chance for compensation of these fluctuations within one run is much smaller. Nevertheless, the scaling for half-integer client numbers (involving killing and adding clients) faithfully follows all trends of the scaling for integer client numbers (where no clients are killed or added). Therefore, these tests document two remarkable features of our parallelization setup: (1) clients can be added and subtracted at will during runtime, and (2) this does not damage the good parallel scaling.

3.5. Long-Distance Operation. Our client/server parallelization model with built-in error safety also naturally lends itself to grid-style computing. We define grid computing as a set of geographically diverse and heterogeneous computing resources united to complete a common task. Since our JVM-based environment is platform-agnostic, heterogeneous computing environments do not pose a significant issue, as detailed above. Geographical diversity, in turn, simply translates into larger latencies for server-client communication, combined with a most likely increased failure rate for the clients due to network issues. Hence, our implementation is ideally suited for such a setup. In order to assess this, we conducted tests between a server located in Amsterdam (Netherlands) and a set of threading clients on a computing node at the University of Kiel (Germany), across a geographical distance of 411 km, employing standard public Internet connections and IP addresses and again using the same jar file as for the tests reported in the previous subsections. More specifically, (H2O)25 clusters were globally optimized with the TTM3-F potential, for 42,000 steps. Each client employed 4 threads, on a computing node with 4 AMD Opteron 6172 CPUs (i.e., with a total of 48 CPU cores) and 256 GB RAM, running Linux 3.12.53-40 and a JRE 1.8.0_65-b17.

Since these runs were considerably smaller than those reported in section 3.2, the chunk size was set to 200.

Figure 8. Scaling of the server+threading clients RMI mode, for global optimization of TTM3-F (H2O)25, for different numbers of 4-thread clients, with all server-client communication running between Amsterdam (Netherlands) and Kiel (Germany), across >411 km of standard Internet connections.

Figure 8 shows the resulting parallel speedup. There is a visible deviation between the ideal and the real scaling, setting in at 6 clients and growing slightly up to 10 clients. This is to be expected, given the fairly small chunk size (which increases the number of communication events) and the considerable distance (and unknown number of network switches) that has to be traversed by each communication event, resulting in a latency that certainly is much larger than in all the examples above. Nevertheless, the real scaling does not degrade further upon going from 10 to 12 clients, indicating that useful speedups can presumably be obtained with significantly larger numbers of clients.

Together with the mixed-environment tests reported in the previous subsection 3.3, this demonstrates that distributed or grid computing may be an interesting, easily exploitable paradigm for global optimization. In theory, it would allow "leftover" computing time scattered across a broad network to be used for scientific research. Optimization problems could be enqueued on a remote server and solved one by one. During each run, arbitrary and strongly changing numbers of computers in the network could contribute, without the need for control and planning, i.e., without any additional pieces of software. We note that for such communication across publicly accessible networks it may be worthwhile to enable SSL-based encryption, which is straightforward with RMI.
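For such public-network setups, Java SE ships SSL-aware RMI socket factories in javax.rmi.ssl. A minimal sketch of exporting a remote object (and its registry) over TLS could look as follows; keystore/truststore configuration via the usual javax.net.ssl system properties is assumed and not shown, and the binding name is our own placeholder.

```java
import java.rmi.Remote;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;
import javax.rmi.ssl.SslRMIClientSocketFactory;
import javax.rmi.ssl.SslRMIServerSocketFactory;

/** Sketch: export a remote object and its registry over SSL/TLS sockets. */
class SslRmiExport {

    static Remote exportSecurely(Remote impl, int objPort, int registryPort) throws Exception {
        SslRMIClientSocketFactory csf = new SslRMIClientSocketFactory();
        SslRMIServerSocketFactory ssf = new SslRMIServerSocketFactory();

        // all traffic to this remote object now runs through TLS sockets
        Remote stub = UnicastRemoteObject.exportObject(impl, objPort, csf, ssf);

        // the registry itself can use the same socket factories
        Registry reg = LocateRegistry.createRegistry(registryPort, csf, ssf);
        reg.rebind("globopt-server", stub); // placeholder binding name
        return stub;
    }
}
```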

4. SUMMARY AND OUTLOOK

In essentially all of its forms, nondeterministic global search is embarrassingly parallel by construction. In this Article, we have demonstrated that our RMI-based parallelization concepts in OGOLEM enable us to take the still nontrivial step of transforming this theoretical advantage into excellent parallel scalability that can actually be achieved in practice, up to at least hundreds of clients running on thousands of processors. In contrast to the prevailing MPI paradigm, our RMI-based parallelization offers additional advantages that are of prime importance in practice, including (a) tolerance of strongly heterogeneous hardware and operating systems, (b) communication exclusively via standard TCP/IP, enabling us to use long-distance communication across any available Internet connections, and (c) the possibility to subtract and add clients at will while the calculation (i.e., the server job) is running. As shown, none of this affects parallel scalability.

Feature (c) entails further advantages of substantial practical importance: An RMI job of this kind is not affected by a sudden loss of any number of clients. Thus, it can be run for arbitrarily long times, even on HPC installations employing many hundreds or thousands of nodes, where walltimes normally have to be limited to a few hours to "protect" massively parallel jobs from breakdown due to inevitable hardware failures. Of course, feature (c) also makes it possible to shrink and expand the hardware resources actually used in a controlled fashion. This opens up another use case: In queue-based job scheduling, there is no way to attain 100% machine load in practice with traditional parallel jobs that require a fixed subset of the machine resources. The scheduler knows that N nodes will be available for the next M minutes, but all jobs currently in the queue ask for node numbers and times that differ from these values N and M, so that these jobs either cannot be started now or will not use all of the remainder. Even if these leftover nodes amount to just a few percent of the whole machine, on a typical HPC installation this is substantial computer power. With our RMI setup, we can adjust the amount of resources needed at will and very quickly. Hence, a long-running RMI server job, operating separately, could host a frequently varying number of clients that is constantly readapted to exactly fill up all the computing nodes that would otherwise remain idle. Of course, this requires not-yet-standard interactions between such an RMI job and the queue scheduler. In a project with the Computing Center at the University of Kiel, we are currently designing ways of realizing this concept.





REFERENCES

(1) Avaltroni, F.; Corminboeuf, C. J. Comput. Chem. 2011, 32, 1869. (2) Hartke, B. Angew. Chem., Int. Ed. 2002, 41, 1468. (3) Rossi, G.; Ferrando, R. J. Phys.: Condens. Matter 2009, 21, 084208. (4) Hartke, B. WIREs Comput. Mol. Sci. 2011, 1, 879. (5) Heiles, S.; Johnston, R. L. Int. J. Quantum Chem. 2013, 113, 2091. (6) Bandow, B.; Hartke, B. J. Phys. Chem. A 2006, 110, 5809. (7) Dieterich, J. M.; Hartke, B. Mol. Phys. 2010, 108, 279. (8) Cantú-Paz, E. Efficient and accurate parallel genetic algorithms; Kluwer Academic Publishers: Boston, 2001. (9) Strodel, B.; Lee, J. W. L.; Whittleston, C. S.; Wales, D. J. J. Am. Chem. Soc. 2010, 132, 13300. (10) Roberge, V.; Tarbouchi, M. WSEAS Trans. Comput. 2012, 6, 170. (11) El-Sayed, N.; Schroeder, B. Reading between the lines of failure logs: Understanding how HPC systems fail. Proceedings of Dependable Systems and Networks, 2013, 1. (12) MPI: A Message-Passing Interface Standard, Version 3.0, September 21, 2012. http://www.mpi-forum.org/docs/mpi-3.0/ mpi30-report.pdf (accessed September 20, 2016). (13) Gropp, W.; Lusk, E. Int. J. HPC Appl. 2004, 18, 363. (14) Dinan, J.; Krishnamoorthy, S.; Balaji, P.; Hammond, J. R.; Krishnan, M.; Tipparaju, V.; Vishnu, A. Noncollective Communicator Creation in MPI. In Proceedings of the 18th European MPI Users Group Conference on Recent Advances in the Message Passing Interface, Springer: 2011. (15) Javed, A.; Qamar, B.; Jameel, M.; Shafir, A.; Carpenter, B. Towards Scalable Java HPC with Hybrid and Native Communication Devices in MPJ Express. Int. J. Parall. Prog. (IJPP) 2016, 44, 1142. http://www.mpj-express.org (accessed September 20, 2016). (16) Dieterich, J. M.; Hartke, B. Appl. Math. 2012, 03, 1552. (17) Fanourgakis, G. S.; Xantheas, S. S. J. Chem. Phys. 2008, 128, 074506. (18) Buck, U.; Pradzynski, C. C.; Zeuch, T.; Dieterich, J. M.; Hartke, B. Phys. Chem. Chem. Phys. 2014, 16, 6859. (19) Dieterich, J. M.; Hartke, B. J. Comput. Chem. 2014, 35, 1618. (20) Dieterich, J. M.; Hartke, B. Phys. Chem. Chem. Phys. 2015, 17, 11958. (21) Wales, D. J.; Hodges, M. P. Chem. Phys. Lett. 1998, 286, 65. (22) Takeuchi, H. J. Chem. Inf. Model. 2008, 48, 2226. (23) Tokmachev, A. M.; Tchougréeff, A. L.; Dronskowski, R. Theor. Chem. Acc. 2015, 134, 115. (24) Wang, Y.; Bowman, J. M. Phys. Chem. Chem. Phys. 2016, 18, 24057. (25) Schwan, R.; Kaufmann, M.; Leicht, D.; Schwaab, G.; Havenith, M. Phys. Chem. Chem. Phys. 2016, 18, 24063. (26) Hartke, B. Phys. Chem. Chem. Phys. 2003, 5, 275. (27) Lagutschenkov, A.; Fanourgakis, G. S.; Niedner-Schatteburg, G.; Xantheas, S. S. J. Chem. Phys. 2005, 122, 194310. (28) Singh, G.; Nandi, A.; Gadre, S. R. J. Chem. Phys. 2016, 144, 104102. (29) Liu, D. C.; Nocedal, J. Math. Program. 1989, 45, 503. (30) http://www.hlrn.de/ (accessed September 20, 2016).

AUTHOR INFORMATION

Corresponding Authors

*E-mail: [email protected].
*E-mail: [email protected].

Present Address
Mechanical & Aerospace Engineering, Princeton University, Engineering Quadrangle, Olden Street, Princeton, NJ 08544-5263, USA.

Notes
The authors declare no competing financial interest.



ACKNOWLEDGMENTS

The authors acknowledge the North-German Supercomputing Alliance (HLRN) for providing the HPC resources that made the threading-client scaling tests possible. In particular, they thank Stephen Sachs for his detailed and patient help that enabled us to use non-MPI parallelism in a queueing environment heavily geared towards MPI-parallelized applications. B.H. thanks the German Research Foundation DFG for financial support of this work through grant Ha2498/16-1. J.M.D. is grateful to Scientific Computing & Modelling (SCM), Amsterdam, The Netherlands, for allowing him to work on this project in his free time in 2015. He is also grateful for the support received from Prof. Ricardo Mata at the Georg-August-University Göttingen, where he started to work on a first prototype implementation in 2012. Additionally, he would like to express his gratitude to Dean Emily A. Carter at Princeton University for her ongoing, magnanimous support. J.M.D. and B.H. wish to acknowledge Sascha Frick for maintaining local computing, network infrastructure, and clean network separation in the Theoretical Chemistry group, Christian-Albrechts-University Kiel.