
A Special Purpose Parallel Computer for Molecular Dynamics: Motivation, Design, Implementation, and Application

D. J. Auerbach,† W. Paul,† A. F. Bakker,† C. Lutz,† W. E. Rudge,† and Farid F. Abraham*

SPARK Special Purpose Computer Project and Statistical Physics Project, IBM Research, Almaden Research Center, San Jose, California 95120-6099 (Received: November 21, 1986)

We describe the special purpose parallel computer being built at the IBM Almaden Research Center for simulating classical many-body systems by molecular dynamics. The motivation, design, and implementation are emphasized. A new application to the dynamics of growth and form has been studied on a single node of our multinode computer, and the results are presented.

We dedicate this study to Anees Rahman, in recognition of his pioneering contributions to molecular dynamics simulation for over two decades.

†SPARK Special Purpose Computer Project. *Statistical Physics Project.

1. Motivation for Considering a Special Purpose Computer

In Table I, we summarize computational requirements for certain supercomputing problems as estimated by the outside world. Note that the requirements are in terms of sustained performance, which should not be confused with the maximum performance quoted by computer vendors. For example, a maximum performance of 160 MFLOPS may be quoted for a CRAY 1 computer, but a sustained performance of 16 MFLOPS is realistic for an optimized molecular dynamics simulation. That is a factor of ten difference in measured performance! Therefore, a 10 GFLOPS requirement for sustained performance demands a computer approximately 500 times faster than a CRAY 1. Certainly, such statements set the tenor for the justification of future generations of supercomputers.

TABLE I: Computational Requirements for Certain Supercomputing Problems As Estimated by the Outside World^a

    QCD lattice gauge studies              10 GFLOPS
    3-D time-dependent fluid dynamics      10 GFLOPS
    biological membrane transport          10 GFLOPS
    ocean basin dynamics                   1-10 GFLOPS
    aerodynamics                           1-10 GFLOPS
    3-D MHD for fusion                     1-10 GFLOPS

^a Note that the requirements are in terms of sustained performance.

Let us first do a simple calculation. Assuming that the computational burden scales linearly with each linear dimension in space and with the simulated time, we readily see that a computer simulation can easily go beyond any foreseeable computer capability without giving us a tremendous jump in system size or total time. For example, if total clock time is a bottleneck for a particular simulation problem (whether it is hours, days, or weeks), a factor of ten in computer power gives us only a factor of two in length scale for a three-dimensional system expanded in all dimensions, or a factor of ten in simulated time for a system fixed in size! Similarly, expanding the length scale of a 3-D system by a factor of ten demands a thousandfold increase in computational burden. This immediately points out the importance of having a dedicated computer, in contrast to having access to the same computer shared equally with many users; i.e., factors of tens, hundreds, or even thousands in computer power may be achieved by going from the shared mode to the dedicated mode. Currently, it is being questioned whether the von Neumann or serial architecture employed in most current computers represents a fundamental design bottleneck for achieving significantly greater speeds. New architectures are being implemented, and, in the next few years, specialized computers will yield execution speeds that are two to three orders of magnitude greater than the fastest presently available supercomputers.
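To spell out the simple calculation above (our own restatement, using only the stated assumption that the computational burden is proportional to each linear dimension and to the simulated time):

\[
\text{burden} \;\propto\; L_x L_y L_z \, t_{\mathrm{sim}} .
\]

With the simulated time held fixed, a tenfold increase in computer power therefore lets each edge of a three-dimensional cell grow by only a factor of 10^{1/3} \approx 2.15, the "factor of two in length scale" quoted above; with the system size held fixed, it buys a tenfold increase in t_{\mathrm{sim}}. Conversely, scaling all three edges by ten at fixed simulated time multiplies the burden by 10^3 = 1000.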



Figure 1. Coarse-grained mesh for the coarse-grain table method superimposed on the computational cell.
We will focus on the array computer designed for a particular computational need, i.e., the Special Purpose Computer. Such an approach is being undertaken at our laboratory for molecular dynamics simulations. For a more balanced and complete discussion, the reader is referred to a special issue of Physics Today entitled "Advances in Computers for Physics". The topics treated are parallel architectures for computer systems,1 engineering limits on computer performance,2 algorithms for parallel processors,3 and software considerations. The book by Hockney and Jesshope4 is an excellent text on computer architecture. It traces the development of parallelism in large-scale computers, explains the main principles of parallel or concurrent computers, and analyzes several current commercial designs. The reader is also referred to Burke and Delves,5 where the entire issue of Computer Physics Communications is devoted to vector and parallel processors in computational science. Since our computational interest is molecular dynamics and the design of our computer is partly driven by this interest, we first review those molecular dynamics techniques relevant to our application of fractal dynamics described in the last section of this paper. For a very complete review of a large variety of simulation techniques, the reader is referred to Abraham.6

2. Two Dynamical Methods

2.1. Conventional Molecular Dynamics. The molecular dynamics simulation technique yields the motion of a given number of atoms governed by their mutual interatomic interactions. Our interest relates to the dynamics of atomic systems with continuous interatomic interactions, requiring the numerical integration of Hamilton's equations of motion. Rahman7 made the first dynamics simulations for fluids with continuous potentials. For the sake of simplicity, we assume that the particles are identical, have three degrees of freedom each, and obey classical mechanics defined by a Hamiltonian \mathcal{H}. The particles have generalized coordinates \vec{r} = (\vec{r}_1, ..., \vec{r}_N) and conjugate momenta \vec{p} = (\vec{p}_1, ..., \vec{p}_N). Writing the Hamiltonian \mathcal{H} of the N-body system as

\mathcal{H} = \mathcal{K}(\vec{p}) + U(\vec{r})   (2.1.1)

where \mathcal{K} is the kinetic energy, U is the potential energy, and

\mathcal{K}(\vec{p}) = \sum_i \vec{p}_i \cdot \vec{p}_i / 2m   (2.1.2)

yields the dynamical equations

d\vec{r}_i/dt = \partial\mathcal{H}/\partial\vec{p}_i = \vec{p}_i/m   (2.1.3)

d\vec{p}_i/dt = -\partial\mathcal{H}/\partial\vec{r}_i = -\partial U/\partial\vec{r}_i   (2.1.4)

(1) Browne, J. C. Phys. Today 1984, 37, No. 5, 28.
(2) Seitz, C.; Matisoo, J. Phys. Today 1984, 37, No. 5, 38.
(3) Fox, G. C.; Otto, S. W. Phys. Today 1984, 37, No. 5, 53.
(4) Hockney, R. W.; Jesshope, C. R. Parallel Computers; Adam Hilger: Bristol, 1983.
(5) Burke, P. G.; Delves, L. M. Comput. Phys. Commun. 1982, 26, No. 3 and 4. The entire issue is devoted to vector and parallel processors in computational science.
(6) Abraham, F. F. Adv. Phys. 1986, 35, 1.
(7) Rahman, A. Phys. Rev. 1964, 136, A405.
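The paper does not state which finite-difference scheme is used to integrate eq 2.1.3 and 2.1.4; as an illustration only, the sketch below uses the common velocity-Verlet form, written in modern Python (the original codes were Fortran and SPARK microcode), with an assumed force callback returning -dU/dr.

```python
import numpy as np

def velocity_verlet_step(r, p, force, m, dt):
    """One integration step for dr_i/dt = p_i/m, dp_i/dt = -dU/dr_i (eq 2.1.3-2.1.4).

    r, p  : (N, 3) arrays of positions and momenta
    force : callable returning the (N, 3) force array F(r) = -dU/dr
    """
    p_half = p + 0.5 * dt * force(r)          # half kick
    r_new = r + dt * p_half / m               # drift
    p_new = p_half + 0.5 * dt * force(r_new)  # second half kick
    return r_new, p_new
```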

Figure 2. An example of the procedure for addition of two floating point vectors using the serial, pipeline, and array procedures.4 (Panel information: the addition of two floating point vectors, z_i = x_i + y_i; serial: 4 clocks per result; pipeline: overlap suboperations; array: replicate units, N processors, 4 clocks per N results.)

In the traditional molecular dynamics experiment, the total energy E for a fixed number of atoms N in a fixed volume V is conserved as the dynamics of the system evolves in time, and the time average of any property is an approximate measure of the microcanonical ensemble average of that property for a thermodynamic state of N, V, E. For certain applications, it may be desirable to perform dynamical simulations at constant temperature and/or pressure. However, we are forced to invent artificial processes in order to make this practical. Such schemes may be classified as non-Newtonian molecular dynamics (NNMD), since they are not based on Hamiltonians of motion for real systems; they are invented by using scaling tricks, resulting in fictitious forces, in order to satisfy some adopted definition for constant T and P. However, it is possible to show rigorously that certain NNMD methods give proper equilibrium ensemble averages. Of great practical importance is the following observation: when numerical experiments on equilibrium phases are compared, the various NNMD procedures, whether rigorous or not, give time averages for thermodynamically interesting variables that agree to within the statistical accuracy of most numerical experiments. However, unlike the coarse-grain averaging fundamental to thermodynamic properties, certain properties of interest may focus down to the level of individual atomic processes and local temporal atomic structures which may be sensitive to perturbations associated with a hybrid molecular dynamics scheme. In this case, the consequences for the outcome of a chosen computer observation must be individually evaluated. In our application, we are interested in maintaining the system at constant temperature. The simplest and earliest procedure for simulating constant temperature takes the instantaneous temperature T_I defined by

(3/2) N k T_I = \sum_i p_i^2 / 2m   (2.1.5)

to equal the desired temperature T; the atomic velocities are renormalized at every time interval τ so that the instantaneous mean kinetic energy corresponds to the chosen temperature T.8
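A minimal sketch of this velocity renormalization (eq 2.1.5), in modern Python purely for illustration; the variable names are ours, not the original code's.

```python
import numpy as np

def rescale_velocities(p, m, k_B, T_target):
    """Rescale momenta so the instantaneous temperature of eq 2.1.5 equals T_target."""
    N = len(p)
    kinetic = np.sum(p**2) / (2.0 * m)          # sum_i p_i^2 / 2m
    T_inst = 2.0 * kinetic / (3.0 * N * k_B)    # from (3/2) N k T_I = kinetic energy
    return p * np.sqrt(T_target / T_inst)
```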

Figure 3. The important features associated with going from the physical problem to the array computer. (Panel information: the "coarse-grain cells" of a solid with nearest-neighbor interactions; each processor models a cell with nearest-neighbor connections; a 2-D array of computers; the geometry of the connections matches the problem.)

Figure 4. The parallel computation algorithm for molecular dynamics with velocity renormalization. (Panel information: (1) coordinate communication phase: for each communication path P_i, i = 1-8, each computer transmits the coordinates of its atoms along P_i and receives coordinates along the opposite path; (2) force accumulation phase: for each particle in the computer's own space B_0, compute the force from neighbors in B_0 and B_i; neighbor tables can be used to limit the number of force calculations required; (3) integration phase: for each particle in the computer's own space B_0, update the positions and momenta; (4) global computation phase: collect and sum the partial kinetic energy terms, pass the total kinetic energy to each computer, and renormalize the velocities; (5) final phase: once every 10-100 time steps, update the neighbor tables and move particles to a new computer if they have crossed the coarse-grain cell boundaries.)

Figure 5. The recursive construction for an elementary binary (2 x 2) switch is presented, and its application is illustrated by setting up the Benes permutation network for eight processors (the N = 4 network is a subset).

Figure 6. The switch setting for a particular communication flow of eight processors (see Figures 3 and 4).

d@/dt = -I’F mean kinetic energy corresponds to the chosen temperature T.8 While naive, this ad hoc scaling procedure is favored by many because of its simplicity, low computational overhead, numerical stability, and overall proven success. More sophisticated schemes are reviewed by Abraham.6 2.2. Brownian Dynamics. A physically intuitive and appealing approach is to generalize the Brownian dynamics of an individual particle which is coupled to a heat bath or reservoir (namely a liquid). Considering a single particle, the heat bath causes two effects: it decelerated the particle’s motion by viscous damping, I?@, and it accelerates the particle’s motion by statistical impacts with the reservoir particles, Tj(t). The particle obeys the equations of motion:9-’’ d?/dt = @/m

(2.2.1)

+ Tj(t)

(2.2.2)

where the following physically reasonable assumptions are made (W)) = 0

(2.2.3)

the averaged force on the Brownian particle is zero, and (7“(t)7a(t?) = Y W - t ?

(2.2.4)

the a components of the forced ;ri are rapidly varying, being a constant (Le., successive collisions are uncorrelated). The average is taken over an ensemble of many systems, each consisting of a particle in a surrounding (fluid) heat reservoir. The solution to eq 2.2.2 is pa(t) = pa(0)e-rt

+ e-rt s,’eI-*’7 (t’) dt’

(2.2.5)

a

After squaring (2.2.5) and ensemble averaging (8) Woodcock, L. V. Chem. Phys. Lett. 1971, 10, 257. (9) Machlup, S.; Onsager, L. Phys. Rev. 1953, 91, 1512. (10) Haken, H. Synergetics-An Introduction; Springer-Verlag: Berlin, 1977; Chapter 6. (11) van Kampen, N. G. (1981), Stochastic Processes in Physics and Chemistry; North-Holland: New York, 1981; Chapter 8, Section 8.

(tp”(t)12) = (pyo))ze-2rt + e-2rt Jtdt’ L ‘ d t ” er(”+t’’)(f(t’)~a(t’’))= Y + 217

(pa(0))2e-2Ft -( 1 - e-2rr) (2.2.6)


Figure 7. The speedup for array processing based on the very simple model of Ware.18,19 (Panel information: the key issue in array processing of a single application is the speedup S = execution time for one processor / execution time for p processors; in Ware's two-state model either all p processors are operating or only one is; for a binary machine, the speedup is plotted against the fraction of work processed in parallel and the number of processors, with S = 1 when none of the work is parallel. Very similar considerations apply to vectorization in a single processor.)

Figure 8. The design of a custom SPARK processor used as the node for the IBM array machine.

Hence, for equilibrium, or t >> 1/\Gamma,

\langle (p^\alpha(\infty))^2 \rangle = \gamma/2\Gamma = mkT   (2.2.6.a)

where we make use of the equipartition theorem, \langle (p^\alpha)^2/2m \rangle = kT/2. For the Brownian dynamics of our N-atom system of interest, we make the following ansatz for the dynamical equations:12

d\vec{r}_i/dt = \vec{p}_i/m   (2.2.7)

d\vec{p}_i/dt = -\Gamma\vec{p}_i - \partial U/\partial\vec{r}_i + \vec{\eta}_i(t)   (2.2.8)

where the stochastic force \vec{\eta}_i(t) is a Gaussian distributed random vector with the property for its \alpha-component

\langle \eta_i^\alpha(t)\,\eta_j^\alpha(t + \tau) \rangle = 2mkT\Gamma\,\delta_{ij}\,\delta(\tau)   (2.2.9)

Equations 2.2.7-2.2.9 will generate dynamics where

\sum_i p_i^2/2m = (3/2) NkT   (2.2.10)

However, this constant temperature dynamics will also reflect the fact that our system of atoms is coupled to a heat bath so as to generate many-particle Brownian dynamics. Molecular dynamics describes ballistic motion, and Brownian dynamics describes stochastic motion.

2.3. Other Important Ingredients for Doing a Simulation. Equations of motion are needed, but a simulation study is defined by a model created to incorporate the important relevant features of the physical system of interest. For example, to simulate a bulk liquid of argon at constant energy and density, we adopt Newton's equations of motion, an accurate interatomic potential function, and simulate the dynamics of a sufficiently large number of atoms. Present-day practical standards are 10^3-10^4 atoms, which means that the walls of a container holding 10^3 atoms would significantly distort the bulk nature of the liquid. In order to simulate an infinite bulk with a finite (and small!) number of particles, the technique of employing periodic boundary conditions is used. The basic computational cell of volume V and containing N particles is periodically replicated in the x, y, and z dimensions so that the whole space is filled by periodic images of this basic computational unit. In our study, we have arbitrarily adopted the Lennard-Jones 12:6 potential function u(r) to represent how individual atoms interact in our model universe:

u(r) = 4\epsilon[(\sigma/r)^{12} - (\sigma/r)^6]   (2.3.1)

(12) Hiwatari, Y.; Stoll, E.; Schneider, T. J. Chem. Phys. 1978, 68, 3401.

where r is the interatomic separation. The range of interaction is normally cut off at some distance to lessen the computational burden of summing over all atoms, the cutoff typically being at 2.5σ. When using the Lennard-Jones potential, we normally express quantities in terms of reduced units. Lengths are scaled by the parameter σ, the value of the interatomic separation for which the LJ potential is zero, and energies are scaled by the parameter ε, the depth of the minimum of the LJ potential. Reduced temperature is therefore kT/ε. Our choice of a simple interatomic force law is dictated generally by our desire to investigate the qualitative features of a particular many-body problem common to a large class of real physical systems and not governed by the particular complexities of a unique molecular interaction. The key to an efficient algorithm is an efficient technique for handling the sums over particles which appear in the energy and force calculations. If we had to carry out these summations over all N interacting particles, we would have a problem with the execution time scaling as N^2. By using the fact that we have short-range interactions, this may be reduced to a problem with execution time proportional only to N. Considering the truncated Lennard-Jones potential, the summation over interactions on a given atom need be taken only over those atomic pairs whose interatomic separation is less than the cutoff range. Testing whether a pair of atoms satisfies this criterion requires the computation of the distance between them. If we had to do this for N^2 pairs, even this calculation would dominate the computation for large N. Two methods are used to limit the number of pairs to be tested: the use of coarse-grain tables13 and the use of neighbor tables with infrequent update.14 We discuss these two methods in turn. The idea of the coarse-grain table method (also called the chain-link method) is illustrated in Figure 1 for a two-dimensional system. The computational box is divided into coarse-grain cells which are chosen so that, in searching for the neighbors of a given atom, only the coarse-grain cell in which it is located and the eight nearest and next-nearest neighbor cells need to be considered. This is illustrated in the figure, where the circle represents the range of interaction of the interatomic potential while the boxes represent the coarse-grain cells. For a three-dimensional problem, the circle is replaced by a sphere, and 27 coarse-grain cubes would have to be searched.

(13) Hockney, R. W.; Eastwood, J. W. Computer Simulation Using Particles; McGraw-Hill: New York, 1981.
(14) Verlet, L. Phys. Rev. 1967, 159, 98.
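A minimal sketch of the truncated Lennard-Jones interaction of eq 2.3.1 in reduced units (lengths in σ, energies in ε), written in modern Python for illustration only; the original codes were Fortran and hand-coded SPARK programs.

```python
import numpy as np

def lj_energy_force(r, r_cut=2.5):
    """Truncated 12:6 Lennard-Jones pair interaction in reduced units (eq 2.3.1).

    r : scalar or array of pair separations (in units of sigma)
    Returns (u, f): potential energy in units of epsilon and the magnitude of
    the pair force in units of epsilon/sigma; both are zero beyond r_cut.
    """
    r = np.asarray(r, dtype=float)
    inv6 = (1.0 / r) ** 6
    u = 4.0 * (inv6**2 - inv6)                 # 4[(1/r)^12 - (1/r)^6]
    f = 24.0 * (2.0 * inv6**2 - inv6) / r      # -du/dr
    inside = r < r_cut
    return np.where(inside, u, 0.0), np.where(inside, f, 0.0)
```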


Figure 9. The overall architecture of the IBM SPARK array computer. (Panel information: SPARK node processors connected through a switching network under a common control; the network allows for any permutation; simple switch elements; redundant elements for reliability; a System/370 host for software development, I/O, and data storage.)

Figure 10. Phase diagram of the two-dimensional Lennard-Jones system in the density (ρσ^2)-temperature plane. The dashed curve indicates the spinodal separating unstable and metastable regions.

The coarse-grain cells greatly reduce the number of pairs which need to be considered, but we can do better. The smallest coarse-grain cell length that can be taken, while still being assured that all interacting pairs are accounted for, is the cutoff range of the interatomic interaction. However, in two dimensions the ratio of the area of the circle of interaction to the area of the nine cells is π/9 ≈ 0.35. Thus, we can gain another factor of about three in execution speed by creating tables of neighbors for a given atom.14 For three dimensions, the factor is 6.5. Of course, the atoms may be mobile (e.g., a liquid), so these neighbor tables need to be frequently updated. The trick is to save computational time by infrequent updating. This is achieved by adding a skin of about 10% to the cutoff radius and then updating the neighbor tables only every ten to twenty time steps. By implementing these two schemes, the execution time goes linearly with the total number of particles of the system, provided the system is sufficiently large.
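The two bookkeeping devices just described, coarse-grain (chain-link) tables and Verlet neighbor tables with a roughly 10% skin, can be sketched as follows. This is an illustrative modern-Python rendering under our own naming, not the authors' code.

```python
import numpy as np

def build_cells(pos, box, cell_len):
    """Coarse-grain (chain-link) table: assign each atom to a cell of edge >= cell_len."""
    ncell = np.maximum((box / cell_len).astype(int), 1)   # cells per dimension
    edge = box / ncell
    idx = (pos // edge).astype(int) % ncell               # integer cell coordinates
    cells = {}
    for i, c in enumerate(map(tuple, idx)):
        cells.setdefault(c, []).append(i)
    return cells, ncell

def build_neighbor_table(pos, box, r_cut, skin=0.25):
    """Verlet neighbor table: all pairs closer than r_cut + skin, found by searching
    each atom's own cell plus its nearest and next-nearest neighbor cells.  In the
    spirit of the text, the table is rebuilt only every ten to twenty time steps."""
    r_list = r_cut + skin
    cells, ncell = build_cells(pos, box, r_list)
    dim = len(ncell)
    pairs = set()
    for c, atoms in cells.items():
        for dc in np.ndindex(*([3] * dim)):               # offsets -1, 0, +1 per axis
            nc = tuple((np.asarray(c) + np.asarray(dc) - 1) % ncell)
            for i in atoms:
                for j in cells.get(nc, []):
                    if j <= i:
                        continue
                    d = pos[i] - pos[j]
                    d -= box * np.round(d / box)           # minimum-image convention
                    if d @ d < r_list * r_list:
                        pairs.add((i, j))
    return sorted(pairs)
```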

3. A Special Purpose Computer for Molecular Dynamics

The architecture of the first generation of computers is described as the von Neumann organization, which featured the stored program computer. It is defined by (1) an input/output device; (2) a single memory for storage of data and instructions; (3) a single control unit for the interpretation of instructions; and (4) a single arithmetic/logic unit for the processing of data. Parts 3 and 4 are called the central processing unit or CPU. The important feature of this von Neumann computer is that each operation is performed sequentially, be it a memory fetch/store, an arithmetic/logic operation, or an input/output operation. In order to improve computer performance, three types of parallelism are being introduced into computer architecture. The first is pipelining, where assembly-line techniques are applied to the control unit and the arithmetic/logic unit. The second is to have several functional units, which are independent units performing different functions and operating simultaneously on different data, e.g., logic and arithmetic operations. The third, and in many ways the most powerful, parallelism concept is to have an array of processing units performing operations simultaneously. We may have an array of identical processors under a common control unit performing identical operations simultaneously on different data stored in their private memories. (This is known as an SIMD or single instruction multiple data computer.) Even more generally, we may have processors executing separate programs on different data (in which case we have an MIMD or multiple instruction multiple data computer). Figure 2 is taken from Hockney and Jesshope4 and tells much of the story! The example is based on the addition of two floating point numbers but is relevant to any general operation. One can see immediately the power of the array procedure, its speedup going linearly with the number of processors.

Figure 11. Cluster growth by molecular dynamics (top) and by Brownian dynamics (bottom). The individual unattached dots represent fluid atoms at a reduced temperature of unity.

Of course, this clearly depends on the optimum balance between the vector length and the number of processors, which occurs when they are equal. Otherwise, either the number of processors is too small and the speedup is diminished, or the vector length is too small and computer power is wasted; this is the issue of load balancing. For the pipeline procedure, the speedup is linear in the number of suboperations; in the example it is four. The interested reader should study this figure carefully, since it provides a succinct summary of parallelism in computer architecture. The general issue of computer performance on vectors becomes very complex and ill-defined when one goes beyond a fixed vector length, a single operation, and a very simple computer architecture (see ref 4, pp 51-67). Typically, one is forced to use empiricism and try a particular computer program on a variety of computers, or a variety of programs on a particular computer, depending on the immediate interest. Many other issues can be discussed, such as memory bandwidth, scatter-gather operations, etc., but we will not do so.


Figure 12. Cluster growth by molecular dynamics for various densities, temperatures, and sticking coefficients (ρ, T, α). Reading left to right and down the page, the ρ, T, α are (a) 0.4, 1.0, 1.0; (b) 0.4, 1.0, 0.5; (c) 0.2, 1.0, 1.0; (d) 0.2, 1.0, 0.05; (e) 0.1, 1.0, 1.0; (f) 0.325, 0.45, 1.0, respectively. All pictures are scaled so the box length is the same. The various colored bands denote growth rings in subintervals equal to one-fifth of the total growth times of 6000, 100 000, 50 000, 200 000, 150 000, and 120 000 time steps for the respective clusters.

Instead, we will just state that, without massive parallelism, present single-processor supercomputers appear to be close to a performance maximum of 4 gigaflops, not to be confused with a much lower sustained performance. We offer a threefold solution for overcoming this computational bottleneck: special purpose parallel computers, dedicated computer resources, and development of better algorithms. In many instances, the mathematical model of a physical system has a visible model of parallel computation, leading to a computer architecture which embodies the precise model of computation. An example is the Ising-model processor at the University of California at Santa Barbara.15 Such a special purpose computer depends on the model of computation being constant and massive repetitive computations being needed. Physical parallelism and VLSI (very large-scale integration) in the microelectronics industry are leading to specialized supercomputers where inexpensive microprocessors combine to form a single parallel processor.

(15) Hirsch, J. E.; Scalapino, D. J. Phys. Today 1983, 36, No. 5, 44.

Possible architectures for a molecular dynamics computer are a pipeline vector uniprocessor,16 an array computer with fixed connections, and a computer array with switched connections. We now discuss the special purpose parallel computer being built at the IBM Almaden Research Center at San Jose. It appeals to the fact that molecular dynamics problems satisfy the criteria for parallel efficiency: equal divisibility (load balancing) and locality (minimal communication overhead). Figure 3 shows the important features of going from the physical problem to the array computer. Equal divisibility implies that on the average each coarse-grain cell has the same number of atoms. Locality implies that the range of interaction of the atoms spans no more than a coarse-grain dimension. These are of course the same requirements for obtaining computational efficiency in using the coarse-grain table procedure and neighbor tables to obtain linear scaling of the computational burden on a serial computer. We briefly review this procedure.

(16) Bakker, A. F.; Bruin, C.; Hilhorst, H. J. Phys. Rev. Lett. 1984, 52, 449.

The key to the efficient algorithm is an efficient technique for handling the sums over particles which appear in the dynamical equations. To do so, we restrict the summation to those particles with an interatomic separation that is less than the range of interaction. Two techniques are used to efficiently find the interacting pairs of atoms: the use of coarse-grain tables and the use of neighbor tables, both implemented with infrequent update. The computational space is divided up into coarse-grain cells which are chosen so that, in searching for the neighbors of a given atom, only the coarse-grain cell in which it is located and the nearest-neighbor and next-nearest-neighbor cells need to be considered. Since placing the particles in the proper cells can be done in linear time, we have reduced the problem from one which scales as N^2 to one which scales with N. By building explicit tables of the neighbors of a given atom (and updating infrequently) we can gain another factor of about six in speed. So, with an array computer in which the number of processors increases with the number of coarse-grain cells (the number of atoms per cell remaining constant), the computational burden per processor remains constant!

Figure 4 describes the parallel computation algorithm for molecular dynamics with velocity renormalization. The task of computing the forces and updating the positions of the particles in a given cell is given to a processor of the array computer. To perform this task, the processor will first have to learn the positions of neighboring particles. Thus we are led to divide the parallel algorithm into a communication phase followed by a computation phase. Efficiency demands that the communication bandwidth grow linearly with the number of processors. This can be achieved by connecting the processors with a network that allows the communication to be done in parallel also. Each processor sends coordinates of its particles to its neighbor, say to the north, and simultaneously receives coordinates from its neighbor to the south. For a two-dimensional problem, eight such permutations are required, while for three dimensions, twenty-six are required. Having switched connections for the array computer allows for a general purpose architecture. For the SPARK array computer, the Benes17 full permutation network has been chosen. In Figure 5, the recursive construction for an elementary binary switch is presented, and its application is illustrated by setting up the permutation network for eight processors. In Figure 6, the switch setting for a particular communication flow of eight processors is given. Replacing the binary switch with a higher-order n x n switch "shortens" the number of switches to go through for the same number N of microprocessors. This is called the depth and is given by the relation depth = 2 log_n N - 1. For a binary switch the depth equals five, and in Figure 6 we see that there are five columns of switches. For the SPARK array, a switch with n = 16 has been chosen. Communication in an array structure demands that a major design choice be considered: a fixed connection network has no computational overhead but does not allow for generality in architectural configuration or redundancy for replacing a failed processor by switching to a standby processor; a dynamic connection network requires additional computational overhead, as well as making the design and building of the computer more complex and expensive.
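Two of the counts quoted above, the number of coordinate-exchange paths (eight in two dimensions, twenty-six in three) and the network depth 2 log_n N - 1 (five switch columns for binary switches and N = 8), can be checked with a few lines of Python; the helper names are ours and purely illustrative.

```python
from itertools import product

def exchange_directions(d):
    """Neighbor-exchange paths for a d-dimensional domain decomposition:
    every nonzero offset in {-1, 0, +1}^d, covering nearest and
    next-nearest neighbor cells."""
    return [off for off in product((-1, 0, 1), repeat=d) if any(off)]

def benes_depth(n_processors, radix=2):
    """Depth of a Benes-style permutation network built from radix x radix
    switches: 2 * ceil(log_radix(N)) - 1 switch stages."""
    stages, capacity = 0, 1
    while capacity < n_processors:
        capacity *= radix
        stages += 1
    return 2 * stages - 1

assert len(exchange_directions(2)) == 8    # two dimensions: eight paths
assert len(exchange_directions(3)) == 26   # three dimensions: twenty-six paths
assert benes_depth(8) == 5                 # binary switches, N = 8: five columns
```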
The perfectly efficient parallel computer would exhibit linear speedup: the computation rate of N processors would be N times that of a single processor. However, communication and other global demands may dominate if a sufficiently fine-grained array structure is defined. It is obvious that a natural point exists where more parallelism becomes ineffective, but, for a nontrivial application on any nontrivial architecture, we do not know what determines this point. The speedup for array processing is presented in Figure 7, the analysis being based on the very simple model of Ware.18,19 If we identify the single-processor operation as an overhead step, we see that the speedup depends in a sensitive way on the fraction of work done in parallel; i.e., if a large proportion of the effort goes into overhead, the efficiency of the parallel array computer drops dramatically.

(17) Benes, V. Mathematical Theory of Connecting Networks and Telephone Traffic; Academic: New York, 1965.
(18) Ware, W. IEEE Spectrum 1972, 9 (March), 84.
(19) Buzbee, B. L.; Sharp, D. H. Science 1985, 227, 591.
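Ware's two-state model is essentially what is now called Amdahl's law. Writing α for the fraction of the work processed in parallel and p for the number of processors (our notation), the speedup curves of Figure 7 follow from

\[
S(p) \;=\; \frac{T_1}{T_p} \;=\; \frac{1}{(1-\alpha) + \alpha/p},
\qquad S \to 1 \ \text{as}\ \alpha \to 0, \qquad S \to p \ \text{as}\ \alpha \to 1 .
\]

For example, α = 0.99 caps the speedup at S < 100 no matter how many processors are used, which is why the overhead contributions are scrutinized below.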

Figure 13. Cluster growth by Brownian dynamics for a reduced temperature of 1.0. Going from top to bottom, the sticking coefficient is (a) 1.0, (b) 0.05, and (c) 0.02, respectively. The various colored bands denote growth rings in subintervals equal to one-fifth of the total growth times of 120 000, 1 000 000, and 1 500 000 time steps for the respective clusters.

The overhead degradation can occur for several reasons: (1) Heterogeneous Load Balancing. The work must be shared equally among all the processors. For a uniform distribution of particles, this is achieved. (2) Large Communication Overhead. The time spent communicating must be a small fraction of the computation time. The large amount of computation which must be done on each data item and the large bandwidth of the permutation network together meet this requirement. (3) Large Fraction of Serial Code. If the problem under consideration has any portion which must be done serially, this will ultimately limit the number of processors which can be efficiently used. To gain a more quantitative understanding of these factors, we have analyzed one of the Fortran molecular dynamics codes used in our laboratory.


The analysis was done by counting (approximately) the number of "simple" assembler language instructions, floating point additions, and floating point multiplications required for table set up, force accumulation, integration, etc. Formulae for the run time were developed and checked by comparison with measured run times on an IBM 370/3081.20 Agreement to better than 10% was achieved. A corresponding analysis for the parallel algorithm was also performed, and formulae were developed for run time as a function of communication rate and number of processors. The code studied performed a "temperature renormalization" which required computation of the total energy of the system at each time step. Algorithmically this amounts to computing a global sum of, say, n items; this can be done (partially) in parallel in O(log n) steps, and then the sum can be broadcast to all processors. Analysis indicates that an array of order 1000 processors can be used with 70-90% efficiency (depending on the communication speed/computation speed assumed). The same analysis teaches us a very important lesson: the maximum number of processors which can be effectively used for a given size problem is eventually limited. For applications with about 10 000 particles we cannot use more than 1000 processors efficiently. Thus, to achieve 10-1000 times the performance of current mainframes requires a very fast processor at each node of the computer array. We have therefore designed a high-performance floating point processor to serve as the node in our special purpose parallel computer.

We now briefly describe the main features of this custom node processor designed and built at our laboratory. The processor, called SPARK (Scientific Processor ARray Kernel), is a compact, high-performance floating point processor. The design is based on VLSI building blocks and on the extensive use of pipelining to achieve both high performance and a favorable cost/performance ratio. A sketch of SPARK is shown in Figure 8. The machine consists of four tightly coupled execution units: the floating point unit FAU, the fixed point unit XAU, the branch control unit BAU, and the main memory MM. Concurrent operation of these units allows for overlapping of address calculation and floating point operations, pipelined vector processing, and hardware support for index-table-driven algorithms (scatter-gather). The scientific computing power comes from the floating point unit FAU on the right of Figure 8. In the initial design, 32-bit IEEE floating point is supported, but this will be upgraded to 64-bit precision. The operands and results of calculations are serviced by a three-port register file implemented with VLSI static RAM (random access memory) to allow a large (4K word) register space and a flexible indexing scheme. This scheme, described further below, allows the registers to be viewed in various ways: as 16 vector registers of length 256, as 8 vector registers of length 512, as 1 vector register of length 4096, or as sets of 16 scalar registers. The fixed point unit XAU to the left of the FAU is mainly devoted to the bookkeeping chores required in floating point computations, including the preparation of subscript addresses, generation of memory addresses, and handling of tables of indices (neighbor tables in molecular dynamics). It is also equipped with a 4K word register file which can be accessed as a vector register.
Leftmost in Figure 8 is the branch unit BAU, responsible for branch and loop handling based on condition codes generated by the FAU and XAU. The BAU is equipped with an instruction cache (I-cache) of 4K 64-bit words. Above these units is the main memory unit MM, responsible for holding all code and data. The MM is 64 bits wide with a 16M word segmented address space. A block move operation allows data and code to be moved to and from the MM and various other memories and registers at a rate of 48 Mbytes/s. Keeping the floating point processor running at full speed requires supplying three addresses to the FAU in one machine cycle.

(20) Auerbach, D. J.; Abraham, F. F.; Paul, W. IBM Research Report RJ-4289; San Jose, CA, 1983.

If a completely general calculation were required for each of these three addresses, the hardware burden for the address calculation would be very heavy. An analysis of the inner loops of molecular dynamics and other scientific codes indicated that in most cases at most one nontrivial address calculation was required. In trivial address calculations we include incrementing and selecting from a small number of vector registers. Based on this analysis, a flexible address indexing scheme was implemented which sustains the full pipeline rate in most cases. The address of a floating point operand or result is derived from three sources: (1) an index register which optionally can be automatically incremented under control of an instruction bit; (2) 4 bits of the instruction word which are logically combined with the 4 high-order register bits (denoted A, B, C in Figure 8); and (3) an optional overwrite of one or more of the registers by the fixed point unit. Vector operations are performed by initializing the index registers to the base of the vectors, enabling the optional autoincrement, and maintaining a count in the BAU. Indexing of a table (scatter-gather) can be done at the full pipeline rate by placing the table of indices in XM, using the XAU to add a base address, and shipping the result to an FAU index register. By setting appropriate bits of A, B, or C to zero, the floating registers can be "reshaped" as 16 vector registers of length 256, 8 vector registers of length 512, and so on down to 1 vector register of length 4096.

The first version of SPARK has an instruction cycle time of 168 ns, giving a pipeline rate of six million operations per second. The peak floating point performance is thus 6 MFLOPS. For chained floating point operations, the peak floating point performance is 12 MFLOPS, but this feature is of little use in molecular dynamics simulations and has not been implemented on the prototype SPARK computer. Since one fixed point, one branch, one floating point, and three indexing operations can be programmed in each machine cycle, the peak instruction rate is thirty-six million instructions per second (MIPS). Sustained performance is of course of more interest and also much harder to quantify. Benchmarks based on kernels from Lawrence Livermore Laboratory indicate performance ranging from 1.1 to 6.0 MFLOPS for a SPARK without using chained floating point operations. These results are based on an analysis of hand-coded programs where, however, the hand coding was done according to the algorithms of a Fortran compiler under development. On the same benchmarks, the IBM 370/3081 ranged from 0.6 to 3.5 MFLOPS and the CRAY 1 from 2.8 to 82 MFLOPS. We have coded the molecular dynamics program to simulate the growth and form of fractals. Performance benchmarks have shown that the molecular dynamics program running on the SPARK has a sustained rate of approximately 3 MFLOPS, a factor of 1.73 greater than the IBM 370/3081. By fall of 1987, we hope to have an array computer composed of ten SPARK nodes and anticipate a sustained performance of approximately 30 MFLOPS. Figure 9 illustrates the overall architecture of this special purpose parallel computer. The SPARK processors are interconnected by a network which allows for communication following any permutation. The use of a full permutation network adds reliability and generality to the design and permits an efficient simulation of other parallel machine architectures as shown by Galil and Paul.21 Control of the array is done by an enhanced SPARK which controls the network, synchronizes communication and computation phases, and communicates with a host computer for I/O.

4. An Application: Fractal Growth and Form by Atomistic Dynamics

In the past few years, many physicists have taken an active interest in understanding the growth and form of natural and mathematical objects.22-24 Two of the most obvious questions, namely how we can best characterize the form of an object and what the governing growth processes are for realizing a given form, remain largely unanswered.

(21) Galil, Z.; Paul, W. J. J. ACM 1983, 30, 360.
(22) Stanley, H. E.; Ostrowsky, N. On Growth and Form: Fractal and Non-Fractal Patterns in Physics; Martinus Nijhoff: Boston, 1986.


Growth in nature, in particular in biological systems, is too complex, varied, and ill-understood to lead to simple answers. So, physicists have focused on very simple models where the binding energy is much greater than the thermal energy, time is discrete, space is a lattice, and the growth rules are local. The models are representative of cellular automata, which are very easy to simulate on the computer and which lead to beautifully intricate forms. The Eden model, diffusion-limited aggregation (DLA), the epidemics model, Mole's labyrinth, clustering of clusters, and the kinetic gelation model are some of the more popular models.24 There is concern that these models may be too simplistic, suggesting a need to investigate the consequences of using more sophisticated (and more realistic) models. At the same time, the generalizations should be well-defined, so that comparison with cellular automata may be made. We have recently begun to simulate growth and form in two dimensions by relaxing the approximations of discrete time and space and by adopting continuous particle dynamics defined by the many-body equations of motion. The physical parameters which we have to specify for our experiments are the temperature, the area, and the total number of atoms; i.e., we are always dealing with constant "mean" density. Information regarding these parameters is obtained from the phase diagram of a two-dimensional Lennard-Jones system, which was recently calculated by using Monte Carlo techniques by Barker, Henderson, and Abraham.25 A cut through this phase diagram in the temperature-density plane is shown in Figure 10. Our computer experiment consists in solving the equations of motion, a system of atoms being enclosed in a parallelogram of various mean densities and temperatures. The equations of motion are taken to be either Newtonian ballistic dynamics given by eq 2.1.3 and 2.1.4 or Brownian stochastic dynamics given by eq 2.2.1-2.2.4. Simulations were performed on either the IBM 370/3081 or the SPARK processor. The standard periodic boundary conditions are imposed in order to simulate an infinite system. A straightforward generalization of the lattice simulations leads to the following additional rules for the simulation of the fractal clusters: (1) a seed atom is fixed at the center of the computational box; (2) the positions of the fluid atoms advance in time according to the dynamical equations of motion; (3) if the interatomic pair energy between a fluid atom and the seed atom becomes repulsive, the fluid atom freezes at that position and also becomes a seed atom (a sticking coefficient equal to one); the dynamics of the remaining atoms continue until all of the atoms become frozen, or seed, atoms. We also performed experiments where a seed atom has to register more than one independent repulsive hit before the next atomic hit results in a stick (a sticking coefficient less than one). In Figure 11, the cluster forms resulting from two different simulations are presented. For both experiments, the reduced density and temperature are chosen to equal 0.20 and 1.0, respectively. However, the atomistic motions in experiment 1 were given by molecular dynamics and in experiment 2 by Brownian dynamics. The top cluster is from experiment 1 after 50 000 time steps and the bottom cluster is from experiment 2 after 120 000 time steps.
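A minimal sketch of growth rules (1)-(3) above, in modern Python for illustration; the dynamics step is left abstract (it would be the molecular dynamics or Brownian dynamics update of section 2), and the pair-energy test simply checks whether a fluid atom has entered the repulsive core of a seed atom (u(r) > 0 for r < σ = 1 in reduced units). The names and the hit-counting convention for sticking coefficients below are ours, not the authors'.

```python
import numpy as np

def growth_step(pos, frozen, hits, advance_dynamics, sticking_hits=1, r_core=1.0):
    """One cycle of the fractal-growth experiment: advance the mobile atoms,
    then freeze any fluid atom whose pair interaction with a seed atom has
    become repulsive (separation inside the core, r < sigma in reduced units).

    pos    : (N, 2) positions; frozen : boolean array marking seed atoms
    hits   : integer array counting independent repulsive hits per fluid atom
    sticking_hits = 1 corresponds to a sticking coefficient of one; larger
    values model a sticking coefficient less than one.
    """
    mobile = ~frozen
    pos[mobile] = advance_dynamics(pos[mobile])      # MD or Brownian update
    for i in np.where(mobile)[0]:
        d = np.linalg.norm(pos[frozen] - pos[i], axis=1)
        if d.size and d.min() < r_core:              # repulsive pair energy
            hits[i] += 1
            if hits[i] >= sticking_hits:
                frozen[i] = True                     # the atom becomes a seed
    return pos, frozen, hits
```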
While both clusters have similar linear dimensions, the molecular dynamics cluster is much more densely compact, its fluid environment (individual dots) being almost completely depleted. In sharp contrast, the Brownian dynamics cluster is quite open, its fluid environment being very dense outside the "boundaries" of the cluster. The fractal dimension is a popular measure for characterizing various forms.26 Experiments 1 and 2 yield fractal clusters of dimension 1.87 and 1.71, respectively.

(23) Family, F.; Landau, D. P. Kinetics of Aggregation and Gelation; North-Holland: Amsterdam, 1984.
(24) Herrmann, H. Phys. Rep. 1986, 136, 153.
(25) Barker, J. A.; Henderson, D.; Abraham, F. F. Physica A 1981, 106, 226.
(26) Mandelbrot, B. B. The Fractal Geometry of Nature; Freeman: New York, 1982.
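The fractal dimensions quoted here follow from the mass-radius scaling M(R) ~ R^D; a minimal sketch of such an estimate (our own illustrative fitting choices, not the authors' analysis code):

```python
import numpy as np

def fractal_dimension(cluster_pos, n_radii=20):
    """Estimate D from log M(R) = D log R + const for a 2-D cluster.

    cluster_pos : (M, 2) coordinates of the frozen (seed) atoms
    """
    center = cluster_pos.mean(axis=0)
    r = np.linalg.norm(cluster_pos - center, axis=1)
    radii = np.linspace(r.max() / n_radii, r.max(), n_radii)
    mass = np.array([(r <= R).sum() for R in radii])     # atoms within radius R
    valid = mass > 0
    D, _ = np.polyfit(np.log(radii[valid]), np.log(mass[valid]), 1)  # slope = D
    return D
```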

The Brownian cluster is consistent with the diffusion-limited aggregation model of Witten and Sander.27 In Figure 12, the cluster forms resulting from different molecular dynamics experiments are presented for various densities, temperatures, and sticking coefficients (ρ, T, α). Reading left to right and down the page, the ρ, T, α are (a) 0.4, 1.0, 1.0; (b) 0.4, 1.0, 0.05; (c) 0.2, 1.0, 1.0; (d) 0.2, 1.0, 0.05; (e) 0.1, 1.0, 1.0; (f) 0.325, 0.45, 1.0, respectively. The atomic positions are scaled so that the box length of each picture is the same. The fractal dimension for all clusters is approximately two, consistent with the dynamics being ballistic. It is known from lattice simulations that ballistic aggregation results in a deposit that is not a tenuous object but achieves a constant density.28 The various colored bands denote growth rings in subintervals equal to one-fifth of the total growth times of 6000, 100 000, 50 000, 20 000, 150 000, and 120 000 time steps for the respective clusters. Hence, the left-hand column is for a constant reduced temperature of unity and mean reduced densities of 0.4, 0.2, and 0.1, respectively. While the rapidity of growth is strongly dependent on the density of the system, the cluster structures are very similar. By lowering the sticking coefficient to 0.05, more compact structures are obtained, as demonstrated by the right-hand neighbors in the first two rows, while the growth dynamics becomes significantly slower. In the last example of Figure 12, we performed a molecular dynamics simulation of cluster growth in a phase-separating, two-dimensional, single-component fluid. The fluid is equilibrated in a single-phase state above the critical point and then is rapidly cooled within the two-phase region of the phase diagram.29 Without a seed atom to cause "freezing", one can directly observe the initiation of the phase separation process by spinodal decomposition. The mechanism of spinodal decomposition is believed to be an instability of a homogeneous fluid to density waves which initially have small amplitude density variations but which are sufficiently large in extent that the surface free energy contribution is always smaller than the volume free energy contribution. Spontaneous growth then occurs. A careful examination of the time-developing interatomic morphologies for these different growth stages suggests that they may be characterized as wave creation and growth until local maxima in density approach the condensed liquid density, followed by wave necking or break-up leading to the creation and subsequent growth of atomic clusters. In the present fractal simulation, the quenched fluid has a mean reduced density of 0.325 (the critical density for liquid-vapor coexistence) and a Boltzmann velocity distribution corresponding to a temperature kT/ε = 0.45, which is slightly greater than the triple-point temperature. The overall morphology has the appearance of the clustering of small clusters, consistent with the freezing of the early-time structure of the phase-separating fluid. In Figure 13, we present cluster growth by Brownian dynamics for reduced temperature and density 1.0 and 0.2, respectively, and for various sticking coefficients. Going from top to bottom, the sticking coefficient is (a) 1.0, (b) 0.05, and (c) 0.02, respectively.

The various colored bands denote growth rings in subintervals equal to one-fifth of the total growth times of 120 000, 1 000 000, and 1 500 000 time steps for the respective clusters. The fractal dimension for all clusters is approximately 1.7, consistent with the dynamics being diffusion-limited aggregation; i.e., in contrast to ballistic dynamics, diffusion-limited growth yields an open deposit whose average density decreases with size. It is difficult for a random-walking particle to penetrate a significant fraction of the radius of a growing cluster before aggregating, and long upper branches tend to screen lower, shorter branches. The lower sticking coefficient gives rise to thicker branches, but the overall cluster morphologies look very similar. However, a visual inspection of the structures grown by Brownian dynamics suggests that the clusters may be approximated as constructs of straight-line segments of a given line thickness. Since the fractal dimension

(27) Witten, T. A.; Sander, L. M. Phys. Rev. Lett. 1981, 47, 1400.
(28) Meakin, P. Phys. Rev. B 1983, 28, 5221.
(29) Abraham, F. F.; Koch, S. W.; Rudge, W. E. Phys. Rev. Lett. 1982, 49, 1830.

of a straight line is unity (Table II), this suggests that the branching multiplicity goes as the radial distance to the 0.7 power. We close by noting that the presently popular parameter, the fractal dimension, tells us little about the true visual richness of form in even the most simplistic of cases. We further illustrate this conclusion by the following example. We assume that a simple curve having some given constant width defines the elementary topology of a two-dimensional cluster, and we calculate its intrinsic dimension defined by the relationship of cluster mass M to radial extent R; i.e., log M = D log R (Table II). We note that this intrinsic dimension would not distinguish between radially diverging straight lines and the equiangular spiral (intrinsic dimension equal to unity) or between an Archimedean spiral and a solid circle (intrinsic dimension equal to two). Also, it seems strange to presuppose that examples having the same fractal dimension would have much in common concerning their form (as well as their growth). And yet, such practice is adopted in the current literature. A similar conclusion has been reached by Hoover.30 He states: "It is clear that the fractal dimensionality is a far-from-complete description of geometry. Both a smooth two-dimensional surface and a Brownian trajectory in three-dimensional space have the same fractal dimensionality, but they are very different from the geometric point of view."

TABLE II: Intrinsic Dimension D of Some Simple Two-Dimensional Clusters Defined by the Relationship of Cluster Mass M to Radial Extent R; i.e., log M = D log R^a

    radial line            y = ax                 D = 1
    equiangular spiral     log r = aθ             D = 1
    Archimedes spiral      r = aθ                 D = 2
    power-law spiral       log r = a log θ        D = 1 + a

^a In each case, we have assumed that a simple curve having some given constant width defines the elementary topology of a two-dimensional cluster.

(30) Hoover, W. G. Molecular Dynamics; Springer-Verlag: Berlin, 1986; Lecture Notes in Physics No. 258, p 108.
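As a check on the intrinsic-dimension bookkeeping of Table II (our own worked example, not from the original), consider the Archimedes spiral r = aθ traced by a curve of constant width w: the cluster mass is proportional to the arc length,

\[
M \;\propto\; w \int_0^{\Theta} \sqrt{r^2 + (dr/d\theta)^2}\, d\theta
\;\approx\; w a \int_0^{\Theta} \theta \, d\theta \;\propto\; \Theta^2 \;\propto\; R^2 ,
\]

since the radial extent is R = aΘ; hence log M = 2 log R + const and D = 2, as listed. The same construction gives D = 1 for a radial line or an equiangular spiral.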
