10

Computer Simulations of the Melting and Freezing of Simple Systems Using an Array Processor

G. CHESTER, R. GANN, R. GALLAGHER, and A. GRIMISON Laboratory of Solid State Physics and Office of Computer Services, Cornell University, Ithaca, NY 14853

The aim of this work is to study, at a microscopic level, precursor phenomena to freezing and melting. At present the studies are being carried out on single phases, crystal and fluid (1,2,3). In the near future we expect to extend some of our work to two phases in equilibrium. We hope that these studies will ultimately lead to an understanding of how the long range spatial order of a crystal is destroyed by melting and how the same order is built up from the chaotic fluid phase. These questions can best be studied initially in the simplest systems. For this reason we have been studying classical central force systems in both two and three dimensions. The differences between dimensions may lead to important insights about the ordering phenomenon.

The properties we plan to calculate by Metropolis (4) Monte Carlo methods are the structure factors S(k) at densities and temperatures near the freezing line, the root mean square displacement of the particles from their lattice sites near melting, the angular correlations between distant pairs of particles in the crystal phase, and near-neighbour correlations in the crystal phase.
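For orientation, the acceptance step at the heart of such a Metropolis calculation can be sketched as follows. This is a minimal sketch with illustrative names (DP for the energy decrease of a trial move, BETA for 1/kT, and RAN standing in for an assumed uniform (0,1) generator), not the production code.

      LOGICAL FUNCTION MACCPT(DP, BETA, ISEED)
*     Metropolis (4) acceptance test for one trial move.  DP is
*     the energy decrease of the move (old minus new pair
*     energies), so downhill moves (DP .GE. 0) are always
*     accepted; BETA = 1/kT in the units of DP.  RAN is an
*     assumed uniform (0,1) generator, not a library routine.
      DOUBLE PRECISION DP, BETA, ACC
      INTEGER ISEED
      REAL RAN
      ACC = DMIN1(1.D0, DEXP(BETA*DP))
      MACCPT = DBLE(RAN(ISEED)) .LT. ACC
      RETURN
      END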

Several of these properties require very long simulations to achieve good equilibrium values. We have therefore frequently made runs at least one order of magnitude longer than is normal in single phase studies of central potential systems. Determining the size dependence involves a similar amount of computing. In addition, extending these studies, at least in two dimensions, to two phase simulations requires between 1000 and 10,000 particles, and we estimate that runs equivalent to between 3 and 30 hours of CDC 7600 computer time will have to be made. A further consideration is that we plan to complement the Monte Carlo work with molecular dynamics simulations. These again require very long runs in order to simulate phenomena which occur comparatively slowly in real time. For example, the break up of dislocation pairs in two dimensions may only occur on time scales of at least 10^ conventional molecular dynamics time steps.

Previous solutions to the problem of such large-scale computations have mainly involved large powerful shared machines (IBM 370, CDC 7600, CDC Star, CRAY 1) or dedicated minicomputers (Prime 400, Harris/4, PDP 11/70, etc.). Each of these solutions has its dedicated advocates, and there are highly favorable features in each case. The large machine offers large memory, disk storage and powerful peripherals, very extensive software, and a fast computation time. The large machines listed above are capable of operating in the range of 2 to 100 megaflops (millions of floating point operations per second). The greatest disadvantage is often the high cost of using such a system, unless it is highly subsidized and then somewhat restricted in access. On a very heavily loaded central system, job turnaround can also be a problem, but the main concern of users of such a system tends to be "How much will it cost?".

The greatest advantage of the dedicated minicomputer is probably its low cost (5), allied with the convenience of control of the resources. This must be offset by the more limited memory and storage available, the limited precision of the hardware in some cases, and the less extensive software libraries available. However, the principal limitation for very large-scale computations stems from the lower computational speed (around 0.5 megaflops or less), so that the user's concern is often "How long will it take?".

Recently, relatively cheap special purpose "attached processors" have become available, and this paper describes an evaluation being carried out at Cornell of the applicability of one such processor to large-scale scientific calculations such as the simulations described earlier. The particular machine is a Floating Point Systems 190-L Array Processor (AP). The AP features include (6,7):

(1) Independent storage for programs, data and constants.
(2) Independent floating point multiplier and adder units.
(3) Two blocks of 32 floating point registers.
(4) Address indexing and counting by an independent integer arithmetic unit with 16 integer registers.
(5) Precision enhanced by a 38 bit internal floating point format, with hardware rounding.
The very large instruction length and multiple data paths permit parallel operation of all of these elements, so that, for example, a floating point add, a floating point multiply, a branch on condition, an index increment, a data memory fetch, a data memory store, and loads to and from the floating point and integer registers can be executed simultaneously in one 167 nanosecond cycle, although memory operations cannot be executed in every cycle. The result of this architecture is a processor with a maximum computation rate of around 8 megaflops (6,7), or approximately the raw speed of a CDC 7600, but in the price range of a minicomputer ($60,000-$140,000).
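The dot product, discussed with the benchmarks below, is the textbook beneficiary of this overlap: the multiply for one element, the accumulation of the previous product, the index bump, and the operand fetches can share a cycle. The following kernel is illustrative FORTRAN, not AP code; it shows the loop structure that the AP can compress into essentially one instruction.

      DOUBLE PRECISION FUNCTION DOT(X, Y, N)
*     Inner product kernel.  On the AP the multiply X(I)*Y(I),
*     the add of the previous product into DOT, the index
*     increment, and both operand fetches can proceed in the
*     same 167 ns cycle, one element per cycle in steady state.
      INTEGER N, I
      DOUBLE PRECISION X(N), Y(N)
      DOT = 0.D0
      DO 10 I = 1, N
   10 DOT = DOT + X(I)*Y(I)
      RETURN
      END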

Provided that such a processor can be made to operate near its potential, this clearly provides a very attractive alternative to the maxi- or mini-computer solutions outlined earlier.

Since the AP cannot operate independently of a host computer, the first decision that has to be made is the appropriate type of host. Earlier models of the AP (Model 120-B) have been attached exclusively to minicomputers. The main concern with such a system is whether the minicomputer is sufficiently powerful to keep the AP operating near its full potential. The average AP configuration has only 32-64K of data memory and 1-2K of program source memory. For large programs and large amounts of data these limits imply a need to load sections (subroutines) of the program into the AP as necessary and to similarly segment and transfer data to and from the AP. Unless the same total calculation can be performed in less clock time than that required by the mini alone, there is no advantage to using the AP, except that it frees the mini for other tasks. In other words, assuming the mini is a purchased machine, the key question is still "How long will it take?". A number of installations have been very successfully operating such a configuration in a special-purpose environment for a few years.

However, Cornell appears to be the first installation to attach an AP to a large, general purpose host computer and attempt to make it available to a wide range of users. The Cornell configuration consists of an AP 190-L with 4K of Program Source Memory, 2.5K of Table Memory, and 96K of Main Data Memory, attached to an IBM 370/168 running the Virtual Memory Control Program. Jobs to be run in the AP are prepared using the CMS interactive system and are then spooled to a special virtual machine called the Array Processor Execution Manager (APEMAN), to which the AP is normally attached. APEMAN schedules jobs which require the AP on the basis of a scheduling algorithm including requested priority and length of AP time (jobs can also be placed in hold and released by the user). When a job is selected for initiation, the designated user virtual machine is autologged on, the AP is logically attached to that machine as a normal I/O device, and a designated CMS EXEC (command file) begins to execute in the user's machine. At the end of the run, or when the specified time has elapsed, the AP is detached from the user machine and returned to APEMAN, and the user machine is logged off. The intent of the APEMAN scheduling, when completely implemented, is the normal goal of any operating system or control program: to keep the resources of the AP as fully occupied at all times as is practical.
In addition, since the Cornell Array Processor is shared among six distinct user groups, APEMAN is used to monitor equitable distribution of the AP among the groups, and will incorporate a priority scheme to facilitate this.

The problem with a large central computer as host is to restrict as much as possible the time spent computing in the host, not because of the elapsed time involved, but because of the cost differential. This paper presents conclusions on the extent and difficulty of reprogramming for the AP, the results of benchmark comparisons for our Monte Carlo calculations, and preliminary assessments of the host overhead, together with strategies to minimize that overhead.

Program Development

The first consideration is program development for the AP. For the AP to approach its potential speed, the programmer has to attempt to keep as many of the parallel operations proceeding simultaneously as possible. A cross-assembler is available to permit development of AP machine code on the host computer. The optimum strategy appears to be to code the interior loops, which involve the bulk of the computation time, first, as subroutines executing in the AP. While a subroutine is executing in the AP it is feasible, with caution, to have simultaneous execution of code in the host. However, in a virtual memory environment like VM the resident host program will be paged out of memory when inactive and thus incur very little overhead, so we have made no attempt to provide synchronous operation to date. After the crucial sections of the program have been optimized for the AP, outer loops can gradually be transferred from host code to AP code. The tradeoff here is that a large amount of programmer time may be spent converting code which accounts for only 5-10% of the overall execution time.

In the case of our benchmarks, initially five loops (level 0) were identified, which are shown in Table I. These are fairly standard small loops which should be quite similar to those found in most Monte Carlo simulations. The inner loops were then incorporated in an outer loop (level 1) which generates one attempted move of each of the particles (a pass). For 200 particles, the outer loop is executed 200 times, and each inner loop is also executed 200 times for each cycle through the outer loop. Eventually, another outer loop (level 2), which carries out about 1000-5000 passes of the particles, will be transferred to the AP; the nesting of the three levels is sketched below. Previous studies have shown that the level 1 loop encompasses 99% of the total execution time.

Recoding the level 0 and level 1 loops, a total of about 75 FORTRAN statements, involved about 100 hours by a programmer who began as a novice (but clearly finished as an expert!). We estimate that this time could be almost halved now. An important restriction of the AP is the limited amount of program source memory. The "expansion factor" observed for hand coding was of the order of 4-5 APAL statements for each FORTRAN statement. Since the maximum program source memory available for the AP is 4K, this gives a maximum FORTRAN subroutine length of around 800 statements. It is also clearly very desirable not to ask a programmer to convert the 750 FORTRAN statements which make up the rest of the Monte Carlo program used here, which affect only 1% of the execution time.
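The skeleton below is a schematic of the three levels, with illustrative names only; the actual level 0 loop bodies appear in Table I.

      PROGRAM LEVELS
*     Schematic nesting of the three loop levels.
      INTEGER IPASS, M, MM, NPASS, NP
      PARAMETER (NPASS = 1000, NP = 200)
*     Level 2: about 1000-5000 passes; the last level to be
*     transferred to the AP.
      DO 30 IPASS = 1, NPASS
*       Level 1: one pass attempts one move of each particle.
        DO 20 M = 1, NP
*         Level 0: each inner loop runs over the other particles.
          DO 10 MM = 1, NP
            IF (MM .EQ. M) GOTO 10
*           ... Table I loop bodies for particles M and MM ...
   10     CONTINUE
   20   CONTINUE
   30 CONTINUE
      END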

TABLE I. Inner Monte Carlo Loops

Loop 71 calculates the energy change due to the trial move of a particle and stores the new interparticle distances.

      DO 71 MM=1,NP
      IF(MM.EQ.M) GOTO 71
      DX=SIDX2-DABS(DABS(XTRIAL-X(MM))-SIDX2)
      DY=SIDY2-DABS(DABS(YTRIAL-Y(MM))-SIDY2)
      R2=DX*DX+DY*DY
      LSAVE(MM)=R2
      LL=LSAVE(MM)
      C=R2-DFLOAT(LL)
      C1=1.D0-C
      ESAVE(MM)=C1*RM(LL)+C*RM(LL+1)
      IND=INDEX(MINO(MM,M))+MAXO(MM,M)
   71 DP=DP+ES(IND)-ESAVE(MM)

Loop 74 stores the new interparticle energies and saves the interparticle distances for trial moves which lower the energy of the system.

      DO 74 MM=1,NP
      IF(MM.EQ.M) GOTO 74
      GN(LSAVE(MM))=GN(LSAVE(MM))+1.D0
      IND=INDEX(MINO(MM,M))+MAXO(MM,M)
   74 ES(IND)=ESAVE(MM)

Loop 374 stores the new interparticle energies and saves the interparticle distances for trial moves which raise the energy of the system.

      DO 374 MM=1,NP
      IF(MM.EQ.M) GOTO 374
      GN(LSAVE(MM))=GN(LSAVE(MM))+ACC
      DX=SIDX2-DABS(DABS(X(M)-X(MM))-SIDX2)
      DY=SIDY2-DABS(DABS(Y(M)-Y(MM))-SIDY2)
      LL=DX*DX+DY*DY
      GN(LL)=GN(LL)+ACC1
      IND=INDEX(MINO(MM,M))+MAXO(MM,M)
  374 ES(IND)=ESAVE(MM)

Loop 274 stores the interparticle distances for trial moves which are rejected.

      DO 274 MM=1,NP
      IF(M.EQ.MM) GOTO 274
      GN(LSAVE(MM))=GN(LSAVE(MM))+ACC
      DX=SIDX2-DABS(DABS(X(M)-X(MM))-SIDX2)
      DY=SIDY2-DABS(DABS(Y(M)-Y(MM))-SIDY2)
      LL=DX*DX+DY*DY
  274 GN(LL)=GN(LL)+ACC1

Loop 574 stores the interparticle distances for trial moves which are rejected with an overwhelming probability.

      DO 574 MM=1,NP
      IF(M.EQ.MM) GOTO 574
      DX=SIDX2-DABS(DABS(X(M)-X(MM))-SIDX2)
      DY=SIDY2-DABS(DABS(Y(M)-Y(MM))-SIDY2)
      LL=DX*DX+DY*DY
  574 GN(LL)=GN(LL)+1.D0
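One idiom in these loops is worth a note. Assuming SIDX2 and SIDY2 denote half the periodic box sides, the nested DABS expression computes the minimum-image separation along an axis without a branch, which suits a pipelined machine. The function below is a restatement of that identity for one axis, not part of the original program.

      DOUBLE PRECISION FUNCTION FOLD(X1, X2, SIDX2)
*     Branch-free minimum-image separation along one axis, for
*     coordinates assumed to lie within one box length (SIDX2 =
*     half the box side).  Writing D = DABS(X1-X2):
*        D .LE. SIDX2 gives SIDX2-(SIDX2-D) = D
*        D .GT. SIDX2 gives SIDX2-(D-SIDX2) = 2*SIDX2-D
      DOUBLE PRECISION X1, X2, SIDX2
      FOLD = SIDX2 - DABS(DABS(X1 - X2) - SIDX2)
      RETURN
      END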

Accordingly, Cornell has developed an AP FORTRAN cross-compiler (APTRAN), running on the IBM 370/168. This compiler takes statements written in a natural "computational subset" of FORTRAN, and produces APAL code or, optionally, AP object code in the appropriate host format. The AP FORTRAN compiler is only intended to remove the burden of programming large amounts of FORTRAN code which do not involve the major portion of the computation time. Even though the compiler does have several levels of optimization, it is still envisaged that hand coding of inner loops will be beneficial. An added advantage of the compiler is that it can reduce the amount of program transfer to and from the AP by chaining calls to AP subroutines (see later). However, because of its complexity the compiler is inevitably a quite large program (more than 10,000 FORTRAN statements). The current "expansion factor" for the compiler is around 10, so that the limited program source memory is an even more serious limitation for compiler-generated code.

Results of the Benchmarks

Benchmarks on the AP were made by the use of a simulator running on the 370/168. Floating Point Systems provides a simulator for program development and debugging, as well as for timing runs. Since the AP is a completely synchronous processor, such timings should be exact representations of the actual execution times. Simulators are notoriously inefficient, so for long simulations it was necessary to use a locally modified version of the Floating Point Systems simulator. Because of adaptation to the IBM 370 architecture, this simulator executes 3-4 times faster than the original, and in all runs to date it has produced identical results and timings. The longest run made for these benchmarks simulated a 22 millisecond AP run, and consumed 60 seconds of IBM 370/168 time. The results reported for other computers correspond to actual runs, and in all cases the best optimizing FORTRAN compiler available was used, at full optimization level.

Table II shows the results obtained for some standard benchmarks (5). The values for the CDC 7600 and Harris/4 are from the original literature. In all cases the timings are in milliseconds, obtained from the time required for 1000 iterations of the relevant sequence of instructions. In the floating point benchmark, the AP performed only slightly faster than the IBM 370/168. The following benchmark, involving repeated calls to external arithmetic functions (specifically SQRT, ABS, SIN, ALOG10, IFIX, ATAN, EXP), is interesting since it illustrates the extremely low overhead incurred by subroutine calls in the AP. The dot product benchmark shows the AP performing very favorably compared to the CDC 7600.

This is to be expected, since the particular structure of the algorithm for a dot product is ideally suited to the AP architecture, which permits the simultaneous calculation of the products and their addition. In fact, it is possible to write the essential structure of a dot product loop in one AP instruction.

Table III shows the results achieved for the actual Monte Carlo inner loops (level 0) detailed in Table I. The results for these inner loops show the AP 190L executing at 0.6 times the speed of the CDC 7600, or at 2.5 times the speed of the IBM 370/168.

Table IV compares the AP 190L timings for the level 1 outer loop with those for the CDC 7600, IBM 370/168, and PRIME 400 computers. The execution time T of the level 1 loop is a non-linear function of the number of particles N, given by

    T = AN + BN^2

where A and B have been determined for each machine. Timings which were obtained by extrapolation are indicated in parentheses in Table IV. This shows that as the number of particles increases, the speed of the AP 190L for the level 1 loop approaches a limit of 0.43 times the speed of the CDC 7600, 1.5 times the speed of the IBM 370/168, and 67 times the speed of the PRIME 400.

We believe the lower performance of the AP 190L for the level 1 loop to be attributable to a number of factors. Whereas in suitable applications the AP 190L has provided speeds above that of a CDC 7600 (7), in general the limiting factor in loop speed is data memory access time. In this initial version of the Monte Carlo program, a great deal of storing and fetching of variables and base addresses was performed between the level 0 loops. No attempt has yet been made to optimize the portions of the level 1 loop around the inner loops. An additional consideration is that the existing Monte Carlo FORTRAN program which was converted for the AP was not at all optimally structured for the AP. It is true that the results for the AP were obtained from hand coded routines, while those for the CDC 7600 and IBM 370/168 were obtained from the FTN4 and H Extended optimizing compilers respectively; however, in our experience these optimizing compilers have proved extremely efficient for the small loops involved here, while there are extensive opportunities for further hand optimization of the AP code. For these reasons we believe the factor of 0.42 times the CDC 7600 speed to be a lower limit of the capability of the AP for this application.
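The extrapolated entries follow from the fitted constants. As a worked illustration (a sketch, not the original fitting program), even a two-point version of the fit, using the measured 16- and 36-particle AP 190L timings from Table IV, reproduces the tabulated 10,000-particle extrapolation to within about one percent.

      PROGRAM TFIT
*     Fit T = A*N + B*N**2 to two measured points, then
*     extrapolate.  T16 and T36 are the measured AP 190L timings
*     (msec) for 16 and 36 particles from Table IV.
      DOUBLE PRECISION A, B, T16, T36, T
      T16 = 4.9D0
      T36 = 21.9D0
*     Solve 16*A + 256*B = T16 and 36*A + 1296*B = T36:
*     eliminating A (36/16 = 2.25) leaves 720*B on the left.
      B = (T36 - 2.25D0*T16)/720.D0
      A = (T16 - 256.D0*B)/16.D0
*     Extrapolate to N = 10000: about 1.51E6 msec, close to the
*     tabulated (1.52 10^6).
      T = A*1.D4 + B*1.D8
      WRITE (*,*) A, B, T
      END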

TABLE II. General Benchmarks

                   CDC 7600   Harris/4   IBM 370/168   AP 190L
Integer Arith.     1.8        41         4.3           NA
Floating Point     0.21       4.8        0.60          0.55
External Funct.    0.32       4.1        0.97          0.26
Dot Product        0.70       30         1.50          0.84

These benchmarks are taken from reference 5.

TABLE III. Monte Carlo Benchmarks

LOOP    CDC 7600 TIME (MS)   IBM 370 TIME (MS)   AP TIME (MS)
  71    .597                 2.2                 .885
  74    .316                 1.2                 .634
 374    .583                 1.9                 .885
 574    .487                 1.4                 .300
 274    .448                 1.6                 .601
TOTAL   2.431                8.3                 3.305

All times are in milliseconds and measure the time needed to execute the stated loop 100 times.

TABLE IV. Comparative Times for Execution of N-Particle Monte Carlo Program (milliseconds)

N Particles   PRIME 400     IBM 370/168   CDC 7600      AP 190L
16            (416.0)       6.8           (2.0)         4.9
36            1650.0        30.2          (9.2)         21.9
64            4710.0        98.2          28.0          (66.3)
100           (1.09 10^4)   243.0         67.2          (158.5)
1000          (1.00 10^6)   (2.32 10^4)   (6.55 10^3)   (1.53 10^4)
10000         (9.92 10^7)   (2.31 10^6)   (6.53 10^5)   (1.52 10^6)

Times in parentheses were obtained by extrapolation from the fitted form T = AN + BN^2.

Host Overhead

The full cost-recovery rate for the AP 190L at Cornell, based on a three-year amortization schedule, is $40/hour. Since the previous benchmarks have shown that the AP can run at from 1.5-2.5 times the speed of the IBM 370/168, this yields a cost comparison of about 80:1 in the Cornell environment. It is important to note that this ratio uses the prime-time 370 rates, and can be reduced by a factor of about 0.4 for overnight turnaround. The comparison may also be substantially less favorable to the AP for a partially-subsidized host computer. However, this is still a dramatic demonstration of the potential cost advantage of carrying out large-scale scientific calculations on such an attached processor. The magnitude of the cost differential makes the problem of host overhead even more crucial, since the cost of a run with 90% of the computation in the AP and 10% of the computation in the host could still be totally dominated by the host charges. Host overhead is incurred in a number of ways:

(i) Job submission.
(ii) System overhead in making the AP available to the particular user.
(iii) Maintenance of a main program resident in the host (AP programs can only be executed as subroutines called from the host).
(iv) CALL to the AP subroutine from the main program.
(v) Initialization of the AP.
(vi) Transfer of program source microcode to the AP for every subroutine call, including the updating of location tables, etc.
(vii) Transfer of data to and from the AP.
(viii) Release of the AP for a subsequent CALL.
(ix) RETURN from the AP subroutine.
(x) Release of the AP for other users.

Table V details the typical host overhead involved in the execution of an AP program at Cornell. Due to the characteristics of the Cornell virtual machine environment mentioned earlier, there is no appreciable overhead (less than 1 msec) to maintain a main program resident in the host, since the program is paged out of real memory when inactive. It is important to note that the majority of the contributions to the total overhead are fixed quantities. Exceptions which are under the control of the programmer include the number of data transfers (APPUTs and APGETs) and the number of subroutine calls. One major objective of the Cornell APTRAN compiler and its associated linkage editor was to minimize data transfers by transferring data in COMMON blocks in one chained operation (without explicit transfer being required by the programmer, as is otherwise the case). In addition, APTRAN chains AP subroutine calls together, effectively replacing the CALL and RETURN, program transfer, and (in some cases) data transfer by an AP "JUMP SUBROUTINE" instruction. This instruction was shown earlier to incur an extremely low overhead (167 nanoseconds worst case).
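The shape of a host program that respects these overheads can be sketched as follows, using the service routines named in Table V; the argument lists, and the AP-resident subroutine MCPASS, are illustrative assumptions rather than the actual FPS library interface.

      PROGRAM HOST
*     Sketch only: pay the fixed overheads (attach, APINIT,
*     microcode load) once, keep the subroutine resident in the
*     AP, and make many passes per CALL.  Data moves at about
*     1 msec per APPUT or APGET.  MCPASS and the argument lists
*     shown are assumptions for illustration.
      DOUBLE PRECISION X(200), Y(200)
      INTEGER IPASS
*     ... set up the initial configuration in X and Y ...
      CALL APINIT
      CALL APPUT(X, 200)
      CALL APPUT(Y, 200)
      DO 10 IPASS = 1, 5000
         CALL MCPASS
   10 CONTINUE
      CALL APGET(X, 200)
      CALL APGET(Y, 200)
      CALL APRLSE
      END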

TABLE V. Typical Values for Host Overhead (IBM 370/168)

Operation           Approx. Overhead (msec)   Comments
APM SUBMIT          10                        Job submission
AP ATTACH           40                        Attach AP to user machine
Subroutine CALL     0.5
APINIT              58                        Initialize the AP
APPUT               1                         Data transfer to AP (each)
APGET               1                         Data transfer from AP (each)
Transfer AP code    10                        Put routine's microcode in AP (up to 4K instructions)
APRLSE              1                         Release AP for subsequent CALL
Subroutine RETURN   0.5
APDETACH            10                        Relinquish AP from user machine

TOTAL               132

The Table shows that a minimum overhead of about 150 msec of 370/168 cpu time must be involved in any AP computation, no matter how trivial. The cost ratio of 80:1 leads to a rule-of-thumb estimate that the full cost benefits of the AP will not be approached unless a subroutine can be created which:

(i) Contains less than 300-400 FORTRAN statements (because of the limited Program Source Memory) if using a compiler. This number might be doubled for hand coding.
(ii) Operates on less than 64K floating point numbers as data at one time (96K for hand coding).
(iii) Computes for at least 1-2 minutes in the AP (3-5 minutes on an IBM 370/168, or 30-90 seconds on a CDC 7600) between subroutine CALL and RETURN. (This does not imply that intermediate data transfers cannot be made to and from the AP-resident subroutine.)

Point (iii) above may need some clarification; the largest part of the overhead per subroutine call is involved in initializing the AP and transferring the microcode for the subroutine. Once the subroutine is operating in the AP, it can make data transfers to and from the host at an overhead of about 1 msec per transfer (independent of the amount of data transferred, up to the capacity of the AP). Thus careful programming can ensure that the same subroutine remains resident in the AP, but that this subroutine operates on successive batches of data; it is the return from the subroutine to the host, and the corresponding call, which must be minimized.

For applications which fit the above constraints the cost advantages of the AP can be outstanding. For example, in the Monte Carlo calculations described in this paper the overall program strategy is still being refined to minimize subroutine calls, but it appears that a goal of over 99% of the computation time in the AP and less than 1% 370/168 overhead can definitely be realized. This in turn implies an overall cost in the range of $50/AP hour, even using the prime time rates on the IBM 370/168.

Accuracy of the Calculations

One concern for potential users of the AP 190L (or the equivalent AP 120B) for scientific work is the available word length. No hardware double precision capability is available on the AP 190L, so all floating point operations are carried out in the 38 bit internal floating point format, with a 10 bit binary exponent, a 28-bit mantissa, and a hardware convergent rounding algorithm in the Floating Adder and Floating Multiplier. Floating Point Systems (6) state that this provides a precision of 8.1 decimal digits, as compared to 7.2 decimal digits for IBM 370 single precision, 16.8 decimal digits for IBM 370 double precision, and approximately 14 decimal digits for the CDC 7600 in single precision.
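These figures follow from the fraction lengths: a p-bit binary fraction resolves roughly p log10 2 = 0.301p decimal digits. The AP's 28-bit mantissa thus gives at most 28 x 0.301 = 8.4 digits (the quoted 8.1 corresponds to about 27 effective bits once rounding is allowed for), the 56-bit IBM double precision fraction gives 16.9, the CDC 7600's 48-bit coefficient gives 14.4, and the 24-bit IBM single precision fraction gives 7.2 at best, since hexadecimal normalization can cost up to three further bits.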

Table VI shows the results of some accuracy comparisons which we have made. The dot product results correspond to a dot product of two vectors of 1000 elements, with random values between 0 and 1. Here the AP 190L result reproduces the first 7 figures of the IBM 370 double precision calculation. The other entries in Table VI show results from the Monte Carlo calculations.

TABLE VI. Effect of AP Word Length on Precision

                     IBM REAL*8         AP 190L
Dot Product          0.255178129E03     0.255178122E03
Total Energy (16)    -30.7006546        -30.7006483
Total Energy (36)    -61.6051765        -61.6052179

The total energy is the running average energy computed for one pass through the system of 16 or 36 particles. Maintaining the accuracy of the running average energy is crucial in Monte Carlo calculations. For large systems of several hundred particles it may be necessary to recompute the running average in the course of a pass, since this quantity is the most sensitive to accumulated errors. The AP 190L result reproduces the first 7 figures of the IBM 370 double precision result for 16 particles, and the first 6 figures for 36 particles.

After testing the AP random-number generator routine available in the Floating Point Systems Math Library, a random-number generator for a 28-bit mantissa architecture was written in FORTRAN and compiled with the APTRAN compiler.

Conclusions

In our experiments the Floating Point Systems 190L Array Processor proved capable of speeds between 0.43 and 0.66 times the speed of a CDC 7600. The broad conclusion is that the AP 190L appears to be between 10 and 100 times more cost-effective than the other systems we have discussed for the type of Monte Carlo simulations we report. In fact, the cost-effectiveness of the AP 190L is such that there is little economic incentive for extensive program optimization, though it may be esthetically satisfying. While it will be necessary later to include the additional costs involved when actually using the combined IBM 370/AP 190L system, we do not estimate that these costs will materially alter our conclusion. It appears also that the word length of the AP 190L is sufficient to provide adequate precision for our purposes.

Acknowledgements

This work was supported by the National Science Foundation under Grant No. DMR-74-23494, and also through the Materials Science Center, Cornell University, under National Science Foundation Grant No. DMR-72-03029. The Office of Computer Services provided IBM 370/168 computer time. We would also like to thank Mr. J. Tobochnik for help in running the benchmarks on the PRIME computer.

Literature Cited

1. Ceperley, D., Chester, G. V., and Kalos, M. H., Phys. Rev. (1977), 16, 3081.
2. Ceperley, D., Chester, G. V., and Kalos, M. H., Phys. Rev. (1978), 17, 1070.
3. Gann, R., Chakravarty, S., and Chester, G. V., Phys. Rev. (1978), to be published.
4. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E., J. Chem. Phys. (1953), 21, 1087.
5. Schaefer, H. F., and Miller, W. H., Computers and Chemistry (1977), 1, 85.
6. "AP-120B Processor Handbook", 7259-02, Floating Point Systems, Beaverton, Oregon, 1976.
7. Bucy, R. S., and Senne, K. D., "Nonlinear Filtering Algorithms for Parallel and Pipe Line Machines", Proceedings of the Meeting on Parallel Mathematics - Parallel Computers, Munich, March 1976; North Holland Press, Amsterdam.

RECEIVED August 15, 1978.
