
Computational Chemistry and Computer Science†

Neil S. Ostlund,* Robert A. Whiteside, and Peter G. Hibbard

Department of Computer Science, Carnegie-Mellon University, Pittsburgh, Pennsylvania 15213 (Received: August 3, 1981)

Two aspects of frontier computer science research and their implications for computational chemistry are described. The first area is that of very large-scale integration (VLSI), a code word for the design and fabrication of integrated circuits with very large numbers of logic devices. The direction of this area of research is such that computer-automated design (CAD) tools will eventually allow chemists to design custom chips for their own special purpose applications. The second area is that of parallel computation. Having a large number of inexpensive processors cooperate in the solution of a single task will allow chemical computation to become more cost-effective by a few orders of magnitude in the next few years. Such parallel computations are illustrated by some sample Monte Carlo and molecular dynamics calculations on the Cm* multiprocessor, which consists of 50 LSI-11 processors.

1. Introduction

This paper will very briefly review some current and expected developments in computer science and their possible impact on computational chemistry. Computational chemistry is a semimature field with at least some well-defined methodologies. In these more well-defined areas, progress seems likely to result from developments in computer science as much as from traditional areas of chemistry. New computer architectures, new methods of parallel computation, new very large-scale integration (VLSI) technologies, and other areas of computer science will all have great impact on the computational chemist. A major theme of this paper is that certain areas of computational chemistry might place more emphasis on "computational" and less on "chemistry" than they are currently doing.

Because of recent developments in computer science, the possible modes of computation, the range of architectures, and the types of devices available to an application area, such as computational chemistry, are expanding rapidly. Two major areas of development leading to this increased flexibility are VLSI and parallel computation. In the area of VLSI, it is now even possible for an individual to design and have fabricated his own silicon chip that performs an application-specific function. Future developments in computer-automated design will make possible the compilation onto silicon of, for example, a Gaussian integral package.

Present serial computers cannot be expected to become much faster. Speed and throughput for "number-crunching" applications must therefore come from vastly increasing the degree of parallelism involved in a computation. The space of possible parallel architectures is very much larger than that of serial architectures, and the "proper" parallel machine is likely to be very dependent on the particular application.

In section 2, we present for the chemist some of the elementary ideas involved in digital design and integrated circuit manufacturing. We then briefly describe the techniques of Mead and Conway.1 As a result of these new structured design methods, it is now possible for students in a one-semester course to design and eventually have fabricated their own custom chips. The implications for computational chemistry are evident. In section 3, we describe some basic ideas of parallel computation and describe the Cm* multiprocessor, which consists of 50 processors all having access to a common memory. Speedup results for a Monte Carlo and a molecular dynamics algorithm on Cm* are reported. There is hope that a multiprocessor will be able to speed up such calculations by a few orders of magnitude. Finally, in conclusion, we indicate a philosophy consistent with the directions of the present research.

2. Very Large-Scale Integration

In 1961 the first integrated circuit (IC) had only four transistors; it was an example of small-scale integration (SSI). Since then, advances in the field have been truly spectacular and are often the topic of articles in the popular press. Integration of circuitry has continued through medium-scale integration (MSI, a few hundred transistors) and large-scale integration (LSI, a few thousand transistors) to very large-scale integration (VLSI). The exact density of circuitry represented by VLSI is somewhat arbitrary, and the acronym VLSI serves perhaps as a code word for the current state-of-the-art. The state-of-the-art is such that some (1981) commercial ICs (chips) contain on the order of 100,000 transistors, comparable in complexity to the CPU of the VAX-11/780. This advancing technology is illustrated by the microprocessors shown in Figure 1. The smallest of these, the Intel 8080, represents the sort of minimal integration required for general purpose computation. Although it was not the first microprocessor, the 8080 and its introduction perhaps symbolize the origin of the personal computer movement.

† Presented at the 4th American Conference on Theoretical Chemistry, June 22-26, 1981, Boulder, CO.

(1) C. A. Mead and L. A. Conway, "Introduction to VLSI Systems", Addison-Wesley, New York, 1980.





Figure 1. The increasing sophistication of microprocessors: Intel 8080 (8-bit), 3500 transistors; Motorola 68000 (16-bit), 40,000 transistors; Intel 432 (32-bit), 100,000 transistors; Hewlett-Packard experimental chip (32-bit), 500,000 transistors.



Figure 2. Elementary logic gates.

Figure 3. The field effect transistor: (a) circuit symbol, with gate, source, and drain; (b) plan view, with source and drain diffusions; (c) cross section, with the channel between the two diffusions.

The Motorola 68000, on the other hand, is an example of a new microprocessor; it rivals minicomputers of a few years ago. The recently introduced Intel 432 is intended to compete with small mainframe computers like the VAX-11/780 and IBM 370/158. The Hewlett-Packard experimental processor is likely to translate into a commercial product a few years hence. Furthermore, the end of these spectacular advances is not yet in sight, and it might be expected that by the end of the current decade up to 10^7 transistors could be incorporated onto a single IC. We attempt here to give only sufficient details so that the nature of this revolution can be appreciated. We particularly want to indicate that the design of custom integrated circuits for problems specific to theoretical chemistry is becoming possible.

2.1. Elements of Digital Design. The general purpose digital computer is constructed from elementary logic devices termed gates. The three gates considered here, from which any digital computer could be constructed, are the inverter, the two-input nand gate, and the two-input nor gate. The reason for considering "nand" (not and) and "nor" (not or) rather than "and" and "or" gates will become clear subsequently. It is convenient to think of two logic levels: true (= binary 1 = 5 V = VDD) and false (= binary 0 = 0 V = GND). The logic function performed by each of the three gates is shown in Figure 2.

These gates can be combined to form combinational logic for which, after a propagation delay, a set of outputs is some definite function of a set of inputs. For example, one-bit addition of A and B is performed by the function "((not A) and B) or (A and (not B))", which is written as (¬A ∧ B) ∨ (A ∧ ¬B). If feedback of outputs to inputs is added to combinational logic, storage of information (i.e., memory) becomes possible. By adding a clock (a regularly repeating pulse switching from one logic level to the other) one obtains sequential logic, which has outputs that are a function of the current state (memory) as well as of the inputs. The usual digital computer consists of such clocked combinational and sequential logic. It remains to show how elementary gates are implemented.
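Before turning to implementation, the universality claim above (that any digital computer could be built from these three gates; in fact the nand gate alone suffices) can be made concrete in a few lines. The following Python sketch is our illustration, not part of the original paper; logic levels are modeled as booleans rather than voltages.

    # Illustrative sketch: Boolean logic built from the two-input nand gate alone.
    def nand(a, b):
        return not (a and b)

    def inv(a):            # inverter: a nand gate with its inputs tied together
        return nand(a, a)

    def and_(a, b):        # and = not(nand)
        return inv(nand(a, b))

    def or_(a, b):         # De Morgan: a or b = nand(not a, not b)
        return nand(inv(a), inv(b))

    def sum_bit(a, b):     # one-bit sum: (not-A and B) or (A and not-B)
        return or_(and_(inv(a), b), and_(a, inv(b)))

    for a in (False, True):            # truth table of the one-bit sum
        for b in (False, True):
            print(int(a), int(b), int(sum_bit(a, b)))

Real gates differ from this sketch in one essential respect: each nand contributes the propagation delay mentioned above, so the depth of such a composition sets the speed of the combinational logic.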

2.2. Integrated Circuits. The integrated circuit consists of logic circuitry of the above type arranged in two dimensions on the surface of some (generally crystalline silicon) substrate. Integrated circuits are described by a "technology", which specifies how a particular gate is built. The fastest silicon technology is emitter-coupled logic (ECL), used only in very high-speed, power-consuming applications. The common technology for random logic is transistor-transistor logic (TTL), which is next in speed. Slower yet, but approaching the speed of TTL in some areas, is metal-oxide-semiconductor (MOS) technology. The density of circuitry available for these three technologies varies inversely with their speed, and it is the MOS technology that gives the high circuit densities described above. The processors in Figure 1 and most memory chips use the N-channel silicon gate MOS (NMOS) process, the most prevalent version of MOS technology.

2.2.1. The NMOS Field Effect Transistor. A simple NMOS field effect transistor is shown in Figure 3a. The source and drain are symmetric, and the device acts as a simple switch depending on the voltage that is applied to the gate. When the gate is at ground potential (0 V, relative to the source), the switch is open and no current can flow between the source and the drain. If a positive voltage near 5 V is applied to the gate, however, the switch is closed and the source and drain are connected as by a conducting wire.

2.2.2. NMOS Logic Gates. Given the binary switch of Figure 3a, the basic logic device (gate) is an inverter, as shown in Figure 4. With the input at 5 V, the transistor is turned on (switch closed), current flows through the resistor, and the output is "pulled down" to 0 V. With the input at 0 V, the transistor is turned off (switch open), no current flows through the resistor, and the output is "pulled up" to 5 V. Thus a true input gives a false output and a false input gives a true output, and the device truly acts as an inverter. The nand gate and the nor gate are also shown in Figure 4. In the first case, the output is pulled down to 0 V if inputs A and B are at 5 V. In the second case, the output is pulled down to 0 V if inputs A or B are at 5 V.

2.2.3. The Photolithography Process. NMOS integrated circuits are constructed as a series of patterned layers on the surface of a silicon wafer. Each wafer consists of a number of (usually identical) chips. The patterning of a particular layer occurs as follows. The wafer is covered completely with a thin layer of (positive) resist, usually a polymeric material. A photographic mask is then placed over the wafer and exposed to ultraviolet light. Where the mask is transparent, the resist is broken down by the light and subsequently dissolved away.

Figure 4. NMOS logic gates: (a) the basic inverter; (b) a two-input NAND gate; (c) a two-input NOR gate.

Where the mask is opaque, the resist remains and is subsequently baked to hardness. Thus, the wafer is exposed for chemical attack at some areas and protected by the resist at other areas, as defined by the mask. After the desired chemical process occurs at the exposed areas, the resist can be removed and the whole process repeated for the next layer. Most present processes use optical lithography, but line widths are approaching wavelength limitations and future processes will use electron beam or X-ray lithography.

2.2.4. The Mead and Conway Approach. A relatively common misconception has been that it takes many man-years of effort and a tremendous investment in capital equipment to be able to produce an integrated circuit. This is still true to some extent but, beginning with efforts centered at the California Institute of Technology and the Xerox Palo Alto Research Center, design methods have evolved such that it has become possible for an individual to design his own custom NMOS chip and see it fabricated. These developments in simplifying and structuring the design of integrated circuits are outlined in the textbook "Introduction to VLSI Systems" by Mead and Conway.1 The methods described in this book have had a revolutionary impact on chip design, and the book is the fundamental reference in this exciting and expanding field. The Mead and Conway approach requires a designer to specify a description (the Caltech intermediate form) of five masks. Standard software and fabrication facilities can then turn this description into a functioning integrated circuit. For commercial ventures, approximately $10,000 is required for a minimal run of 1600 chips. For educational and research purposes, a facility (available to ARPA contractors) has been established at the Information Sciences Institute (University of Southern California) for fast-turnaround, small-quantity designs.2

To illustrate the design process, we consider the patterning of an NMOS nand gate. The basic NMOS transistor is shown in Figure 3. Two layers (diffusion and polysilicon) are patterned according to their respective masks. The intersection of the two produces a transistor.

(2) L. Conway, A. Bell, and M. E. Newell, Lambda, Vol. I, No. 2.

Figure 5. The five masks for a nand gate: (a) the diffusion layer mask; (b) the polysilicon layer mask; (c) polysilicon and diffusion layers; (d) the ion implant mask; (e) the contact cut mask; (f) the metal layer mask.

If 5 V is applied to the polysilicon layer (the gate), electrons are attracted to the channel region under the gate and current can flow between the source and drain on the diffusion level (the switch is closed). On the other hand, if the gate is at ground potential (0 V), no current can flow between source and drain, and the switch is open. To a first approximation, both diffusion and polysilicon layers can be considered to be conducting.

Figure 5 shows all five masks for the nand gate. Parts a and b of Figure 5 show the masks for the diffusion and polysilicon layers, and Figure 5c shows their intersection. Three transistors are formed. Since resistors cannot be directly formed on silicon in a simple way, the resistor of Figure 4 is formed by "spoiling" the larger transistor of Figure 5c so that it is always partially "on". This is accomplished by an implant layer and an implant mask (Figure 5d). This layer actually comes prior to the polysilicon layer; ion implantation in the channel region of the transistor results in permanent but weak conduction across the channel, turning the transistor into a resistor. Such a "spoiled" transistor is termed a depletion mode (as opposed to the normal enhancement mode) transistor and is shown in Figure 6b.

It remains to form a metal layer for conducting power (VDD) and ground and to form any necessary connections between the metal layer and either the polysilicon or diffusion layers. There is normally no electrical connection between the metal and polysilicon or diffusion layers. The contact cut mask (Figure 5e) results in a "hole" being burrowed through all layers; thus, where the metal mask of Figure 5f, the contact cut mask of Figure 5e, the diffusion mask of Figure 5a, and the polysilicon mask of Figure 5b overlap, a contact is formed between metal, diffusion, and polysilicon. Two of these contacts connect power and ground to the nand gate, and the other contact joins the gate and source of the "spoiled" transistor to keep the gate from floating. The five different masks of Figure 5 thus result in a nand gate analogous to that shown in Figure 4 or, equivalently, Figure 6. To make a chip, it is sufficient to specify a list of rectangles defining each of the five masks and send this information to an appropriate fabrication facility.
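The geometric flavor of such a description can be suggested in a few lines: if each mask is simply a list of rectangles, transistor channels appear wherever a polysilicon rectangle crosses a diffusion rectangle. The sketch below is ours, with hypothetical coordinates; a real layout would also satisfy the Mead and Conway design rules and be expressed in the Caltech intermediate form.

    # Illustrative sketch: masks as rectangle lists (x0, y0, x1, y1);
    # a transistor channel forms where polysilicon crosses diffusion.
    def overlap(r, s):
        # intersection of two rectangles, or None if they do not cross
        x0, y0 = max(r[0], s[0]), max(r[1], s[1])
        x1, y1 = min(r[2], s[2]), min(r[3], s[3])
        return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None

    diffusion = [(0, 0, 2, 10)]                    # one vertical diffusion run
    polysilicon = [(-2, 3, 4, 4), (-2, 6, 4, 7)]   # two gates crossing it

    channels = [c for p in polysilicon for d in diffusion
                if (c := overlap(p, d)) is not None]
    print(channels)   # two channel regions: transistors in series, as in
                      # the pull-down path of the nand gate of Figure 5c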

Figure 6. Alternative representations of a nand gate: (a) sticks diagram for a NAND gate, with the diffusion, polysilicon, metal, contact cut, and ion implant levels indicated; (b) the resistor as a depletion mode transistor.

2.3. Prospects for Computational Chemistry Chips. Given present technology (circa 1981), it is feasible for members of the computational chemistry community to begin to consider designing, and having fabricated, special purpose chips related to their individual special purpose calculations. Obviously, many obstacles remain to implementing a special purpose chip in day-to-day computation, but it is difficult to be pessimistic in this exciting area. An indication of the steps that are presently necessary to produce a functioning chip follows.

A simple problem should be chosen such that it constitutes the bottleneck in an often-repeated, computationally intensive calculation. An example might be the F0 function required for 1s Gaussian electron-electron repulsion integrals in floating spherical Gaussian orbital (FSGO) calculations. Later, complete systems will be incorporated onto silicon; the first prototype computational chemistry chip should be simple. Next, an algorithm for evaluating the F0 function is needed that is appropriate to silicon technology. There is commonly a trade-off here between speed and the surface area required. An appropriate algorithm would appear to be the pipelined evaluation of a polynomial. Into the pipeline would flow a stream of arguments, and out of the pipeline would flow values of the F0 function. Polynomial coefficients would be internal to the chip. Each stage of the pipeline would require an adder and a multiplier. The designs for these might come from standard libraries.

The next step would be to convert the basic circuits (registers, adders, etc.) to designs at the stick level. This level of abstraction uses lines (sticks) of different colors (or dotted lines, etc.) to represent the metal wires and the diffusion and polysilicon layers. The intersections of diffusion and polysilicon lines define the transistors. A stick-level diagram of our nand gate is shown in Figure 6a. The stick level gives the basic topology of the chip. The next step, termed layout, requires the use of design rules and artwork to convert the stick diagram of Figure 6a to the rectangular areas of Figure 5. After the layout is complete, it is converted to a list of rectangles and sent to a mask maker. The masks are then sent to a fabrication lab. Finally, wires are bonded to the chip and it is installed in a package.
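The pipelined polynomial evaluation mentioned above is easy to model in software. In the sketch below (ours, not the paper's), each stage performs one Horner multiply-add, so an N-stage pipeline evaluates a degree-N polynomial with a latency of N cycles but a throughput of one F0 value per cycle once full. For concreteness we use the Taylor coefficients of F0(t) = ∫₀¹ exp(-tu²) du, which are adequate only for small arguments; a production chip would use coefficients fitted over argument ranges.

    # Illustrative sketch: an N-stage Horner pipeline for F0(t).
    # Each stage latches (t, partial_sum) and applies acc = acc*t + c.
    from math import factorial

    N = 12                                # pipeline depth = polynomial degree
    coeff = [(-1) ** k / (factorial(k) * (2 * k + 1)) for k in range(N + 1)]

    def pipeline(args):
        stages = [None] * (N + 1)         # one latch per stage
        for t in list(args) + [None] * (N + 1):   # trailing bubbles flush the pipe
            done = stages[-1]             # value leaving the last stage
            for k in range(N, 0, -1):     # shift: stage k applies coeff[N-k]
                prev = stages[k - 1]
                stages[k] = None if prev is None else \
                    (prev[0], prev[1] * prev[0] + coeff[N - k])
            stages[0] = None if t is None else (t, coeff[N])
            if done is not None:
                yield done                # one (t, F0(t)) pair per cycle

    for t, f0 in pipeline([0.1, 0.5, 1.0]):
        print(t, f0)                      # F0(1.0) is about 0.7468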

Figure 7. The abstract multiprocessor.

The most difficult step in the process may very well be the determination of whether or not the chip is functioning properly. For a practical application, the chip must then be embedded into an existing system by some interface procedure.

The above scenario represents a process that is presently feasible. This field is expanding rapidly, however, and in the future many of the low-level details that one presently has to worry about will be replaced by appropriate software. The time is clearly coming when it will be necessary only to specify the algorithm to be implemented; software and standard fabrication facilities will be available to compile a problem specification into a functioning piece of hardware specific to the problem.

3. Parallel Computation

Another area of computer science likely to have impact on computational chemistry is parallel computation. Of course, some degree of parallelism is present in almost all computers. In most of these, however, the parallelism is on a rather small scale: simultaneous execution of one instruction while fetching the subsequent one, for instance. We are interested in the possibility of high-performance machines in which the degree of parallelism is much greater (hundreds or even thousands of overlapping operations). There are many approaches3-5 to take in trying to accomplish this goal, but we can present here only an informal overview of a few of these.

One way to construct a parallel machine is to permit several independent computers to communicate via message transfers. Such a machine is called a network machine. Message passing is usually rather slow, so a network machine is appropriate to problems in which the amount of interprocessor communication is small. A multiprocessor is a more tightly coupled collection of independent computers which communicate through shared memory. Thus, communication in a multiprocessor is faster than in a network machine. A still more tightly coupled form of parallelism occurs in the pipeline machine. In this architecture, operations are divided into a series of steps, and separate processing hardware is allocated to each step or stage of the "pipeline". While this does not decrease the time necessary to perform any single operation, it does increase the rate at which new operations may be initiated. Commercially available array processors6 and several high-performance computers (the Cray-1 and CDC Star-100, for instance) utilize parallelism of this type.

However, it is with multiprocessors that we are most concerned in this paper. Multiprocessors offer the possibility of obtaining high performance in a very cost-effective way by connecting together a very large number of inexpensive microprocessors such as those of Figure 1.

(3) N. S. Ostlund, Int. J. Quantum Chem., 13, 15 (1979).
(4) N. S. Ostlund, P. G. Hibbard, and R. A. Whiteside in "Parallel Computations", B. Alder, S. Fernbach, and M. Rotenberg, Ed., Academic Press, New York, 1982.
(5) D. J. Kuck, D. H. Lawrie, and A. H. Sameh, "High Speed Computer and Algorithm Organization", Academic Press, New York, 1977.
(6) N. S. Ostlund, "Attached Scientific Processors for Chemical Computations", Technical Report LBL-10409 UC-32, Lawrence Berkeley Laboratory, Jan 1980.

Figure 8. A crosspoint switch.

A multiprocessor consists of a collection of processors, a collection of memory modules, and a switch which controls the access of the processors to the memories, as shown in Figure 7. The organization of the switch is central to the behavior and performance of a multiprocessor. One possibility is to connect the processors and memories with a crosspoint switch, illustrated in Figure 8. The C.mmp multiprocessor7 constructed at Carnegie-Mellon University employed this type of architecture. This arrangement has the advantage that each processor has direct access to each memory. However, it is apparent that for n processors and memories the complexity of the switch is proportional to n^2. For suitably large n the cost of the switch will dominate the cost of the machine. An additional disadvantage is that the switch becomes a central point of failure. Nonetheless, for machines composed of a small number of high-performance processors, the cost of the switch may still be only a small fraction of the total cost of the machine, and this architecture is very attractive. For instance, the S-1 machine,8 composed of 16 processors (each comparable in speed to a Cray-1), will employ this architecture.

3.1. Cm*. The experimental Cm* machine9 (also constructed at Carnegie-Mellon University) represents an effort to build a high-performance machine as a collection of inexpensive microprocessors, while avoiding the cost and reliability problems associated with a crosspoint switch. The switching structure of this machine is such that any processor may reference any memory location, but the cost of the switch is nearly linear in the number of processors. Thus, though the current configuration of Cm* contains only 50 processors, one can envision a machine of this architecture with hundreds or even thousands of processors. Another aspect of the switching structure, however, is that memory references are hierarchical; accesses to "distant" memory take more time than those to "local" memory. The performance of an algorithm on such a machine will therefore depend upon the locality of its memory references (to both program and data). A schematic diagram of the Cm* machine is given in Figure 9.

(7) W. A. Wulf and C. G. Bell, AFIPS Conf. Proc., 41, 765 (1972).

(8) T. M. McWilliams, L. C. Widdoes, Jr., and L. L. Wood, "Advanced Digital Processor Technology Base Development for Navy Applications: The S-1 Project", Technical Report 14-77-0023, Office of Naval Research, 1977.
(9) S. H. Fuller, J. K. Ousterhout, L. Raskin, P. I. Rubinfeld, P. J. Sindhu, and R. J. Swan, Proc. IEEE, 66, 216 (1978).

Figure 9. Schematic illustration of the Cm* multiprocessor, with detail of a computer module (Cm).

The lowest level of the hierarchy is a computer module. This consists of an LSI-11 processor, 128K bytes of memory, and a custom-designed switch called the Slocal. The processor, memory, and Slocal communicate via the local LSI-11 bus. These computer modules (Cm's) are organized into clusters. Each cluster is presided over by a fast, microcoded processor, called the Kmap, which can communicate with the Slocal of each Cm via the Map bus. Finally, the Cm* machine is composed of several clusters, the Kmaps of which communicate over an intercluster bus. The current configuration of Cm* has five clusters, each containing ten Cm's.

Each word of memory in Cm* has a unique address, specified by the cluster number, the Cm within the cluster, and the word within the local memory. To access a word of memory, the LSI-11 processor places a virtual address on the LSI-11 bus. The Slocal inspects each address to determine whether it corresponds to a word of the local memory. If so, the corresponding location in the local memory is accessed. If the virtual address corresponds to a memory location in another Cm, the processor is suspended, and a request packet containing the addressing information is passed to the Kmap for processing. If the Kmap finds that the virtual address corresponds to a physical address in a Cm within the same cluster, the location is accessed by the Kmap through the Slocal of that Cm. A return packet is generated by the Kmap and passed back to the originating Slocal, and the LSI-11 processor is continued. Such an intracluster memory reference takes three times as long as a local access. If the Kmap finds that the virtual address corresponds to a physical location in another cluster, the packet is forwarded via the intercluster bus to the Kmap for that cluster. That Kmap then handles the memory access and passes back a return packet. This type of memory reference takes an additional factor of three longer.
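These relative costs make the importance of locality easy to quantify. The sketch below is ours: the cost ratios 1:3:9 are the rough factors quoted above, the address layout is schematic, and the figure it prints is only indicative. It computes the average cost of memory references spread uniformly over the current five-cluster, fifty-Cm configuration.

    # Illustrative sketch of Cm*'s hierarchical memory reference cost.
    LOCAL, INTRA, INTER = 1, 3, 9      # relative access times from the text

    def access_cost(my_cluster, my_cm, ref_cluster, ref_cm):
        if ref_cluster != my_cluster:
            return INTER               # forwarded between Kmaps over the intercluster bus
        if ref_cm != my_cm:
            return INTRA               # handled by this cluster's Kmap
        return LOCAL                   # satisfied by the local Slocal

    n_clusters, n_cms = 5, 10          # the current Cm* configuration
    costs = [access_cost(0, 0, c, m)
             for c in range(n_clusters) for m in range(n_cms)]
    print(sum(costs) / len(costs))     # about 7.8: uniformly scattered data is
                                       # nearly 8 times slower than local data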


3.2. Measures of Performance. There are several sources of overhead in a multiprocessor program that are not present in a serial machine. For instance, most memory modules can be accessed by only one processor at a time; if two processors simultaneously try to access the same location, one of these will have to wait for the other. Thus, memory contention is one source of overhead. Synchronization costs can also degrade the performance of a multiprocessor program. It often happens that some step of a calculation executing in one processor should not proceed until one or more other processors have completed a previous step. The program, therefore, must employ some synchronization mechanism to ensure this. Additional overhead is sometimes incurred in maintaining the integrity of shared data. When shared data are being modified, it is often necessary that only one processor at a time have access to them. The mechanisms used to enforce this mutual exclusion can cause processors to sit idle while waiting for access to the protected data.

If t_n is the execution time of a parallel program using n processors, then we gauge the efficiency of a parallel program by its speedup, defined as s_n = t_1/t_n. Ideally, of course, a program using n processors runs n times faster than with one processor, and s_n = n. The overheads mentioned above will cause the speedup to be less than this theoretical limit.

3.3. Simulation of Liquids. In investigating the implementation of parallel programs on Cm*, we have used as an application two different approaches to the simulation of the structure of liquids: molecular dynamics and Monte Carlo. The basic result which both of these procedures attempt to obtain is a description of the macroscopic properties of a liquid given the microscopic interactions of the constituent particles (atoms, molecules). This is accomplished by averaging the property of interest over a large number of configurations of the particles, and the two methods differ in the way these configurations are generated. These procedures are well-known, and we sketch here only brief descriptions of the two.

In the molecular dynamics approach, a sequence of configurations is generated at successive time steps. Given the forces acting on the particles, along with the current position and momentum of each particle, the position and momentum of each particle at the next time step can be computed. The most time-consuming part of a molecular dynamics calculation is the calculation of the net force exerted on each particle due to its interactions with the other particles.

In the Monte Carlo procedure, a new configuration is generated from the current one in the following way. One of the particles is randomly displaced, and the change in the potential energy of the system due to this displacement is computed. The new configuration is then accepted or rejected with a probability based on the Boltzmann distribution. This procedure is applied repeatedly to each particle in the system.

To prevent large surface effects, it is customary in both of these procedures to employ periodic boundary conditions. In our benchmark calculations, however, we have studied instead the simpler case of a small droplet of Cs+ and Cl- ions. We take the total energy of the system to be a sum of two-body interactions:

E = Σ_{i<j} ε(r_ij)   (1)

ε(r_ij) = B_1 (r*/r_ij)^9 + B_2 Z_i Z_j / r_ij   (2)

In the above equations, Z_i = ±1 is the electronic charge on atom i, and the parameters B_1, r*, and B_2 were chosen to approximate Cs+ and Cl- in the gas phase.10 This potential has been used previously in a Monte Carlo study.11

(10) P. S. Ramanathan and H. L. Friedman, J. Chem. Phys., 54, 1086 (1971).
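As a concrete illustration of eq 1 and 2 and of the Metropolis step described above, the following sketch performs Monte Carlo passes over a small droplet. The code and all parameter values are ours and purely illustrative; the actual B_1, r*, and B_2 are the gas-phase Cs+/Cl- values of ref 10.

    # Illustrative sketch: the pair potential of eq 2 and Metropolis
    # Monte Carlo passes over a small ionic droplet (parameters hypothetical).
    import math, random

    B1, R_STAR, B2 = 1.0, 1.0, 1.0      # placeholder potential parameters
    BETA = 1.0                          # 1/kT in the same (hypothetical) units

    def eps(ri, rj, zi, zj):            # eq 2
        r = math.dist(ri, rj)
        return B1 * (R_STAR / r) ** 9 + B2 * zi * zj / r

    def binding_energy(i, pos, z):      # interaction of ion i with all others
        return sum(eps(pos[i], pos[j], z[i], z[j])
                   for j in range(len(pos)) if j != i)

    def metropolis_pass(pos, z, step=0.1):
        for i in range(len(pos)):       # displace each particle in turn
            old, e_old = pos[i], binding_energy(i, pos, z)
            pos[i] = tuple(x + random.uniform(-step, step) for x in old)
            d_e = binding_energy(i, pos, z) - e_old
            if d_e > 0 and random.random() >= math.exp(-BETA * d_e):
                pos[i] = old            # reject; otherwise the move stands

    z = [+1, -1, +1, -1]                # a four-ion droplet
    pos = [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0), (0.0, 1.2, 0.0), (0.0, 0.0, 1.2)]
    for sweep in range(10):
        metropolis_pass(pos, z)
    print(pos)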

Figure 10. Plots of speedup vs. number of processors for molecular dynamics calculations on 25, 50, and 100 atoms, compared with the theoretical speedup.

3.3.1. Molecular Dynamics Calculations. We consider only the most time-consuming part of the molecular dynamics procedure: the generation of new configurations. Using the Verlet algorithm,12 we computed the coordinates of each atom at time step t + 1 from the coordinates at steps t and t - 1, along with the forces on the atoms at time step t, by

x_{t+1} = 2x_t - x_{t-1} + F_i (Δt)^2 / m_i   (3)

F_i is the force exerted (in the x direction) on atom i and is given by a sum of pairwise forces

F_i = Σ_{j≠i} f_ij   (4)

where f_ij is the force exerted on atom i due to its interaction with atom j.

In this first effort at programming molecular dynamics on Cm*, we have decomposed the problem by assigning to each processor the task of computing the total force exerted on one atom (eq 4) and computing its new coordinates (eq 3). Of course, if there are more atoms than processors, a given processor may be assigned more than one of these tasks. Since the matrix of pairwise forces, f_ij, is antisymmetric, this decomposition actually results in more work being done than in a good serial algorithm: the parallel algorithm will compute both f_ij and f_ji. However, this decomposition does seem to result in quite small overheads due to synchronization and memory contention.

(11) P. J. Rossky, J. D. Doll, and H. L. Friedman, J. Chem. Phys., 69, 4628 (1978).
(12) L. Verlet, Phys. Rev., 159, 98 (1967).
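A minimal sketch of this per-atom decomposition follows. The code is ours: a thread pool stands in for the Cm* processors, the pair force is a placeholder stub rather than the gradient of eq 2, and the coordinates are one-dimensional for brevity. Note that, exactly as in the text, both f_ij and f_ji are computed, once by each task.

    # Illustrative sketch of the per-atom decomposition: each task computes
    # the total force on one atom (eq 4) and advances it by the Verlet step (eq 3).
    from concurrent.futures import ThreadPoolExecutor

    def pair_force(xi, xj):
        # placeholder 1-D pair force; a real code would differentiate eq 2
        d = xi - xj
        return d / abs(d) ** 3 if d != 0.0 else 0.0

    def verlet_move(i, x_t, x_prev, m, dt):
        f = sum(pair_force(x_t[i], x_t[j])
                for j in range(len(x_t)) if j != i)           # eq 4
        return 2.0 * x_t[i] - x_prev[i] + f * dt * dt / m[i]  # eq 3

    def md_step(x_t, x_prev, m, dt, n_proc=25):
        with ThreadPoolExecutor(max_workers=n_proc) as pool:
            x_next = list(pool.map(
                lambda i: verlet_move(i, x_t, x_prev, m, dt),
                range(len(x_t))))
        return x_next, x_t     # all tasks finish before the next step begins

    x0 = [0.0, 1.0, 2.5, 4.0]       # positions at step t-1
    x1 = [0.05, 1.0, 2.5, 3.95]     # positions at step t
    m = [1.0] * 4
    print(md_step(x1, x0, m, 0.01)[0])

The barrier implicit in waiting for all tasks is the synchronization point whose cost shows up in the speedup curves discussed next.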

Figure 11. Plots of speedup vs. number of processors for the Monte Carlo program, for 25, 50, and 100 atoms, compared with the theoretical speedup.

number of processors for various numbers of atoms. The most prominent feature is the "stepped" nature of the curves; while some of the points lie very close to the theoretical limit, there are also flat regions in which additional processors provide little or no speedup. These flat regions result from an unequal distribution of the work among the processors. For instance, if some of the processors are assigned two atoms to move and others are assigned three, then the elapsed time will be that necessary to move three atoms. The processors assigned less work wait idly for the others to finish. The points lying very near the theoretical limit are points for which the number of tasks to perform divides evenly among the number of available processors. These steps become less pronounced as the number of tasks increases relative to the number of processors, as can be seen from a comparison of the curves for 25, 50, and 100 atoms. 3.3.2. Monte Carlo Calculations. In the Monte Carlo pr~cedure,'~ the bottleneck computation involves the calculation of the interaction of a moved atom with all the remaining atoms, i.e., the binding energy of the moved atom Ei= Ce(rij) (5) j#i

In the parallel Monte Carlo program that we have implemented, each processor is assigned the task of evaluating one term in the summation of eq 5 and adding it into the shared value of E_i being computed. Once again, of course, if there are more terms to evaluate than processors, some processors may compute more than one contribution. An additional complication arises here that was not present in the molecular dynamics calculations.

(13) N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, J. Chem. Phys., 21, 1087 (1953).

Figure 12. Comparison of speedups for 100-atom Monte Carlo and molecular dynamics calculations.

When a processor is ready to sum its contribution into the binding energy E_i (eq 5), it must be guaranteed exclusive access to this shared variable during the time it performs the addition. Otherwise, the unpredictable interleaving of simultaneous additions by several processors may yield invalid results. Thus, some processors may have to wait for access to this variable, decreasing the overall performance of the program.

The results of our Monte Carlo benchmark calculations are summarized in Figure 11, which presents plots of speedup vs. number of processors for 25, 50, and 100 atoms. Comparison of these three curves indicates that the speedup improves as the number of atoms increases relative to the number of processors. This is because the processors access the protected global variable (the new binding energy) less often: when there are more atoms, each processor computes and sums more pair contributions locally before it needs to gain exclusive access to the global binding energy E_i.

Comparison of the speedups for the two procedures (Figure 12) shows that the performance of the molecular dynamics program is superior to that of the Monte Carlo program, because of the decreased synchronization costs in the molecular dynamics procedure relative to those in the Monte Carlo calculation. Though certainly more careful analyses are needed, these preliminary results suggest that the molecular dynamics procedure will be preferred to Monte Carlo calculations on the multiprocessor machines of the future.

4. Conclusions

Two developments, VLSI and parallel computation, can lead to orders-of-magnitude improvements in cost effectiveness (VLSI) and performance (parallel computation) for chemical computations.


It seems unlikely, however, that near-maximum benefits will be obtained by leaving such developments in the hands of computer scientists, who are generally uninterested in numerical computation and lack, of course, knowledge of specific chemical applications. For the optimum advance of the field, computational chemists should become more actively involved in areas that are traditionally thought of as computer science, computer engineering, etc.


Acknowledgment. The authors gratefully acknowledge financial support for this research from the National Science Foundation under Grant MCS79-20698. This research was also sponsored in part by Control Data Corporation, Grant 79C13, and by the Defense Advanced Research Projects Agency (DARPA), ARPA Order No. 3597, monitored by the Air Force Avionics Laboratory under Contract F33615-78-C-1551.
