A Micro Vector Processor for Molecular Mechanics Calculations - ACS

13

Publication Date: November 6, 1981 | doi: 10.1021/bk-1981-0173.ch013

A Micro Vector Processor for Molecular Mechanics Calculations

DAVID N. J. WHITE
The University, Chemistry Department, Glasgow G12 8QQ, Scotland

Certain chemical computations such as molecular mechanics, molecular orbital and X-ray crystallographic calculations have always required large and powerful mainframe computers. The traditional solution to the problem of performing these calculations within individual laboratories, an end which is desirable for a number of reasons, has been the purchase of a minicomputer and the cramming of quart-sized programs into pint-sized machines. The floating point arithmetic capability of the average minicomputer is strictly limited and the need arose for peripheral devices, often called "array processors", to perform these operations for the minicomputer. This is a satisfactory, though still expensive, method of providing "in-house" number crunching facilities. However, the advent of the microprocessor and single chip floating point processors raises the possibility of providing these facilities at very modest cost indeed.

Peripheral processors which are capable of performing floating point arithmetic operations at high speed are used to enhance the poor performance of popular general purpose minicomputers in this area. These devices are described in various ways but the following nomenclature will be used in this paper. It is assumed that all floating point operations will be performed on arrays of data and so the nomenclature reflects the nature of the hardware.
Most floating point accelerators currently available commercially consist essentially of a single floating point processor, which is pipelined to maximize its throughput, together with a control processor, memory and I/O channel to exchange data with the host minicomputer. These units will be called attached floating point processors or AFPP's and typical devices are the Floating Point Systems AP120B and the CSPI MAP-200. Accelerators which contain a linear array (i.e. a vector) of identical floating point processors are called vector processors or VP's and a typical device is the CSPI MAP-300. Finally there are devices which contain a number of floating point processors arranged in a grid fashion and these devices will be called array processors or AP's of which the only

0097-6156/81/0173-0193$11.00/0 © 1981 American Chemical Society

Lykos and Shavitt; Supercomputers in Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1981.



commercial example is the ICL DAP. The architecture and usage of the commercial "array processors" mentioned in this paper are discussed in "High Speed Computer and Algorithm Organization" (1).

AFPP's have traditionally been a low cost method of providing minicomputer users with the number crunching capability of small mainframe computers. However, the low cost is only low relative to the enormous cost of a mainframe computer and a typical AFPP is still expensive in an absolute sense at £30k-£50k for a system with enough options to perform useful work. In recent years the performance of low cost microcomputer systems has reached that of popular midrange minicomputers which might cost upwards of four times the price of the microcomputer. To take a specific example, the performance of a 4MHz Z80A microcomputer system running Microsoft Fortran slightly exceeds that of a standard DEC PDP-11/34 running Fortran V2.1 under RT-11 V3B for most scientific applications. However, if the PDP-11/34 is configured with a floating point processor then it will run programs which contain a significant amount of floating point arithmetic up to five times faster than the microcomputer. In order to make a microcomputer an attractive proposition for scientific work there must be some means of enhancing its floating point performance. There are multitudinous ways of achieving this and some of the possibilities will be discussed in increasing order of performance.

Scalar Arithmetic Processors

The lowest cost method of improving the arithmetic capability of a microcomputer is to interface a scientific calculator type chip with the data bus of the CPU. The National Semiconductor MM57109 number crunching chip (2) is designed for just this purpose and its instruction set includes basic arithmetic operations, transcendental functions and data manipulation operations. Data are entered into the MM57109 one digit at a time (8 mantissa digits, 2 exponent digits) and a six bit instruction is loaded when the data are in place. A simple floating point instruction such as multiply takes an average of 32 ms exclusive of the time required to load data and retrieve results. The MM57109 is not an attractive option from the speed point of view because a 4MHz Z80A can perform a floating point multiply (32 bit) in much less than 32 ms. The MM57109 does save both the time required to code floating point arithmetic routines and the space required to store them but our prime consideration here is speed.

The second possibility for improving microcomputer floating point performance lies with the North Star Hardware Floating Point Board (3). This device executes floating point add, subtract, multiply and divide with up to twelve decimal digits of precision. One byte of data is reserved for the exponent and the other six for the mantissa of each floating point number.




Each byte of data in the mantissa contains two BCD digits and the exponent byte contains the mantissa sign (MSB) and the exponent in excess 64 binary. Mean execution times for floating point add, subtract, multiply and divide (6 digit) are 6, 11, 194 and 175 µs, again exclusive of data transfer times. The execution times with this device are attractive and certainly shorter than could be achieved by an eight bit microprocessor alone. Unattractive features in this case are the lack of transcendental functions and a rather odd floating point number format. These drawbacks could be tolerated if there were not a more attractive alternative.

The third floating point processor option uses the Advanced Micro Devices Am9511A arithmetic processor chip. This chip is available in several types: the Am9511, Am9511A-DC, Am9511A-1DC and Am9511A-2DC. The Am9511A-DC obsoleted the Am9511 and the DC, 1DC and 2DC variants refer to 2, 3 and 4MHz clock speeds. All further references in text and figures are to the Am9511A-1DC 3MHz chip. A block diagram of the functional units of this chip is shown in Figure 1 and its instruction set is shown in Table I. Mean times for each instruction are shown in Table II and these obviously vary for different patterns in the fixed or floating point arguments. Each floating point number is 32 bits long and the detailed format is shown in Figure 2. Data are entered into the Am9511A by pushing one byte at a time onto the 16 byte operand stack and results retrieved by byte wide pops of the same stack.
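As an illustration of the 32 bit format referred to above (Figure 2), the following Python sketch packs a number into an Am9511A-style word. It is a minimal sketch and not part of the original paper: the function names are hypothetical, and it assumes that the 24 bit mantissa is a binary fraction between 0.5 and 1 (top bit set) and that operand bytes are pushed least significant byte first.

```python
import math


def to_am9511(x):
    """Pack x into a 32-bit Am9511A-style word, returned as an int.

    Assumed layout: bit 31 mantissa sign, bits 30-24 exponent in 7-bit
    two's complement, bits 23-0 mantissa as a fraction with the top bit set.
    """
    if x == 0.0:
        return 0                          # floating point zero is all zero bits
    sign = 0x80 if x < 0 else 0x00
    frac, exp = math.frexp(abs(x))        # abs(x) == frac * 2**exp, 0.5 <= frac < 1
    if not -64 <= exp <= 63:
        raise OverflowError("exponent outside the 7-bit two's complement range")
    mant = int(frac * (1 << 24))          # keep 24 bits, truncating the rest
    return ((sign | (exp & 0x7F)) << 24) | mant


def push_order(word):
    """Bytes of the word in the assumed push order (least significant first)."""
    return [(word >> shift) & 0xFF for shift in (0, 8, 16, 24)]
```

Under these assumptions 1.0 is held as 0.5 x 2^1, so the exponent field is 1 and only the top mantissa bit is set.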
Single byte commands enter the Am9511A via the same data lines (DB0-DB7) as the operands but are routed to the internal command register and execution proceeds using the top one or two operands on the stack. When instruction execution is complete the "consumed" operands are popped off the stack and the result pushed onto the top of the stack. The Am9511A is easily interfaced to most popular 8-bit microcomputers and Figure 3 shows one method of connection to the Zilog Z80A. Thus far the Am9511A appears to have satisfied requirements for range of operations, ease of interfacing and speed. The remaining requirement is for ease of software interfacing to a high level language.

[Figure 1: block diagram of the functional units of the Am9511A; diagram not recoverable from the source.]

Table I. Command Mnemonics in Alphabetical Order

    ACOS    ARCCOSINE
    ASIN    ARCSINE
    ATAN    ARCTANGENT
    CHSD    CHANGE SIGN DOUBLE
    CHSF    CHANGE SIGN FLOATING
    CHSS    CHANGE SIGN SINGLE
    COS     COSINE
    DADD    DOUBLE ADD
    DDIV    DOUBLE DIVIDE
    DMUL    DOUBLE MULTIPLY LOWER
    DMUU    DOUBLE MULTIPLY UPPER
    DSUB    DOUBLE SUBTRACT
    EXP     EXPONENTIATION (e^X)
    FADD    FLOATING ADD
    FDIV    FLOATING DIVIDE
    FIXD    FIX DOUBLE
    FIXS    FIX SINGLE
    FLTD    FLOAT DOUBLE
    FLTS    FLOAT SINGLE
    FMUL    FLOATING MULTIPLY
    FSUB    FLOATING SUBTRACT
    LOG     COMMON LOGARITHM
    LN      NATURAL LOGARITHM
    NOP     NO OPERATION
    POPD    POP STACK DOUBLE
    POPF    POP STACK FLOATING
    POPS    POP STACK SINGLE
    PTOD    PUSH STACK DOUBLE
    PTOF    PUSH STACK FLOATING
    PTOS    PUSH STACK SINGLE
    PUPI    PUSH π
    PWR     POWER (X^Y)
    SADD    SINGLE ADD
    SDIV    SINGLE DIVIDE
    SIN     SINE
    SMUL    SINGLE MULTIPLY LOWER
    SMUU    SINGLE MULTIPLY UPPER
    SQRT    SQUARE ROOT
    SSUB    SINGLE SUBTRACT
    TAN     TANGENT
    XCHD    EXCHANGE OPERANDS DOUBLE
    XCHF    EXCHANGE OPERANDS FLOATING
    XCHS    EXCHANGE OPERANDS SINGLE

Table II. Execution Times in Microseconds

    Mnemonic    Hex Code    Execution Time (µsecs)
    ACOS        06          2101 - 2761
    ASIN        05          2077 - 2646
    ATAN        07          1664 - 2179
    CHSD        34          9
    CHSF        15          5 - 7
    CHSS        74          7 - 8
    COS         03          1280 - 1626
    DADD        2C          7
    DDIV        2F          65 - 70
    DMUL        2E          65 - 70
    DMUU        36          61 - 73
    DSUB        2D          13
    EXP         0A          1265 - 1626
    FADD        10          18 - 123
    FDIV        13          51 - 61
    FIXD        1E          30 - 112
    FIXS        1F          30 - 71
    FLTD        1C          19 - 114
    FLTS        1D          21 - 52
    FMUL        12          49 - 56
    FSUB        11          23 - 123
    LOG         08          1491 - 2377
    LN          09          1433 - 2319
    NOP         00          1
    POPD        38          4
    POPF        18          4
    POPS        78          3
    PTOD        37          7
    PTOF        17          7
    PTOS        77          5
    PUPI        1A          5
    PWR         0B          2763 - 4011
    SADD        6C          5 - 6
    SDIV        6F          28 - 31
    SIN         02          1265 - 1603
    SMUL        6E          28 - 31
    SMUU        76          27 - 33
    SQRT        01          261 - 290
    SSUB        6D          10 - 11
    TAN         04          1631 - 1962
    XCHD        39          9
    XCHF        19          9
    XCHS        79          6

Figure 2. The floating point number representation format used by the Am9511A. [Bit 31: mantissa sign. Bit 30: exponent sign. Bits 30-24: exponent (2's complement). Bits 23-0: mantissa (MSB, byte 2, LSB), with the most significant bit normalized to 1 except for floating point zero.]

Figure 3. Schematic logic diagram of a hardware interface between a Zilog Z80A microprocessor and the Am9511A arithmetic processor unit. [Diagram not reproduced. Signals labelled: Z80 A0-15, IORQ, RD, WR, WAIT, RESET and CLK; Am9511A DB0-7, CS, C/D, RD, WR, PAUSE and RESET.]

The remaining options for improving microcomputer floating point performance which will be considered lead to much higher performance than the Am9511A option but have one or more major drawbacks. The Intel 8086 or 8088 microprocessors could be used in conjunction with the Intel 8087 floating point processor chip (4) which is probably twice as fast as the Am9511A for on-chip operations and includes extended precision arithmetic in its instruction set. Unfortunately the 8087 was only laid down on paper, not silicon, when this work started. The 8087 is now (January 1981) available in sample quantities at a price far in excess of the Am9511A. In addition to the price and availability problem the instruction set of the 8087 is less suited to chemical computations than the Am9511A in that many transcendental functions must be computed off chip using a minimal set of on chip functions. For these operations (e.g. ASIN, ACOS) the 8087 is much slower than the Am9511A. Furthermore the 8086 and 8088 are the end of a series of processors based on the Intel 8080 and are not in any way upwards compatible with Intel's new products. (Contrast this with the Zilog Z8000 whose instruction set is a superset of the Z80's at the assembler level). Enhancement of an 8086/8087 design would therefore be somewhat difficult. The final point which militates against the 8086/8087 is the lack of a supporting high level language at present. If or when such a language is available it is unlikely to be cheap. The plus points for the 8086/8087 are therefore speed and extended precision arithmetic but these are outweighed by the disadvantages discussed above.

Either of the remaining two options to be discussed would provide at least an order of magnitude increase in speed on the Am9511A and are much coveted by the author, but unfortunately they would require financial resources on a very large scale for development into useable floating point arithmetic processors. The first of these options is the American Microsystems Inc. S2811 microprocessor chip which contains 256 bytes each of data ROM and RAM, 256 x 17 instruction ROM, a 300 ns 12 x 12 multiplier and an add/subtract unit. The chip runs at a 20MHz (minimum) clock speed and all instructions (including multiply) execute in 300 ns. Interfacing to a control processor is simple and straightforward.
The whole processor is extensively pipelined and would perform a 32 bit floating point multiply in around 3 microseconds as opposed to 50 microseconds on the Am9511A. There is also sufficient space in the microcode ROM to emulate the entire Am9511A instruction set which makes the S2811 appear to be a very attractive proposition; so what's the drawback? Unfortunately the S2811 is only mask programmable, at the factory, and this involves a fee of £5000 plus a guaranteed minimum order of 500 chips at around £200 each! AMI have no immediate plans to release versions of the S2811 using EPROM or RAM for instruction memory, which would make it economic to use in small quantities. The S2811 is at present therefore restricted to large commercial users except in one specialized instance. The S2814 performs real or complex, forward or reverse fast Fourier transforms using 32 bit floating point arithmetic and a single chip will perform 32 complex, forward transformations in 1.5 ms whilst an array of 32 chips takes 3.35 ms to perform a 1024 point transformation! This chip is available ready programmed at around £400 for single quantities.

The final option considered involves the use of the TRW MPY-24 bit multiplier in conjunction with 2900 series bit-slice microprocessors in order to design a floating point unit from the ground up. This would result in a unit similar to the Floating Point Systems FPS-100 AFPP. The design principles for such an approach are well known, but somewhat time-consuming and





relatively expensive. This option would be the fastest of all those discussed and would yield floating point add, subtract, multiply and divide times of around 100 ns. As a general rule of thumb the breakpoint for academic development of floating point units in non-engineering or computer science environments is currently in the 1-10 µs range; going much below this requires funds and personnel on an increasingly large scale.

Almost all applications programming in chemistry, and structural chemistry in particular, is performed in the FORTRAN language and the molecular mechanics calculations for which this hardware/software design exercise was undertaken are no exception. There are two problems which must be solved in order to build a FORTRAN microcomputer system with good scalar and array computational performance: firstly, the design of an efficient scalar processor with a transparent FORTRAN interface to the hardware and secondly, the design of an efficient AFPP, VP or AP and supporting library of FORTRAN callable subroutines.

Scalar FORTRAN System

The system was designed around a Vector Graphics Inc. Vector MZ microcomputer consisting of 4 MHz CPU, 48k bytes of memory and two 315k byte Micropolis mini floppy disc drives. The system utilises the S-100 bus structure and the interface for an Am9511A is shown in Figure 4. The only suitable FORTRAN compiler which would run under the Digital Research CP/M operating system is Microsoft's F80 package which in the event proved to be a very sound and robust piece of software.
Microsoft also provide sufficiently detailed documentation of the F80 run time library to enable a relatively problem free replacement of the software floating point arithmetic routines by calls to subroutines which invoke the Am9511A. Provided, then, that the Microsoft naming and calling procedures are adhered to, the end result is an F80 system which will run previously prepared programs without any changes in the FORTRAN code being required. These same programs will now, though, run about an order of magnitude faster. Detailed comparisons of the execution times for various F80 functions and library subroutines are given in Table III. It is obvious from Table III that the Am9511A has been used to implement a number of functions not supplied by Microsoft. The times given are for 10,000 function evaluations. The only unfortunate feature of this whole implementation is the fact that the AMD floating point format (Figure 2) and the Microsoft floating point format (Figure 5) are different, requiring a Microsoft to AMD conversion before using any of the Table III functions/subroutines and an AMD to Microsoft conversion before returning to the F80 calling program. The source listing of a Z80 assembly level subroutine to perform this conversion in the forward (Microsoft -> AMD) direction is shown in Table IV. The forward and reverse conversions take slightly




Figure 5. The floating point number representation format used by Microsoft in the F80 Fortran compiler. [Bits 31-24: exponent (excess 80H). Bit 23: mantissa sign bit (the MS bit of the normalized mantissa is implied, not stored). Bits 23-0: mantissa MSB, mantissa byte 2, mantissa LSB.]



Table III. Benchmark Timings (Seconds)

    FUNCTION    MSP-9500    SOFTWARE    DESCRIPTION
    SQRT           5.2        126.2     SQUARE ROOT
    SIN           13.7        109.0     SINE
    COS           14.9        113.7     COSINE
    TAN           17.4          *       TANGENT
    ASIN          21.6          *       ARC SINE
    ACOS          22.4          *       ARC COSINE
    ATAN          18.2        103.2     ARC TANGENT
    ALOG10        17.2        138.7     LOG TO BASE 10
    ALOG          16.8        138.7     LOG TO BASE e
    EXP           16.0        119.9     e TO POWER
    SINH          32.8          *       HYPERBOLIC SINE
    COSH          32.4          *       HYPERBOLIC COSINE
    TANH          20.5        144.4     HYPERBOLIC TANGENT
    FLOAT          2.4          2.8     INTEGER -> REAL
    INT            2.1          6.5     REAL -> INTEGER
    ABS            1.5          1.3     ABSOLUTE VALUE
    AINT           3.8          4.9     REAL -> REAL TRUNCATION
    NINT           3.7          *       NEAREST INTEGER
    AMIN0          9.5         12.0     REAL MIN OF INT LIST
    AMAX0          8.8         11.0     REAL MAX OF INT LIST
    MIN1          14.7         20.4     INT MIN OF REAL LIST
    MAX1          15.4         19.2     INT MAX OF REAL LIST
    AMIN1         16.2         15.2     REAL MIN OF REAL LIST
    AMAX1         13.1         14.4     REAL MAX OF REAL LIST
    AMOD           8.8         28.2     REAL REMAINDER
    DIM            4.5          5.3     REAL POSITIVE DIFF.
    ATAN2         19.8        123.9     ARC TANGENT a1/a2
    MOD            1.3          7.2     INTEGER REMAINDER
    $M9            0.9          2.7     INTEGER*INTEGER
    $D9            1.0          5.7     INTEGER/INTEGER
    $AB            4.5          3.1     REAL+REAL
    $SB            5.3          4.5     REAL-REAL
    $MB            4.7          6.7     REAL*REAL
    $DB            5.3         19.4     REAL/REAL
    $EB           31.1        245.8     REAL**REAL
    $CJ            2.7          2.7     REAL -> LOGICAL
    $EA           30.9         66.2     REAL**INTEGER
    $NB            0.5          0.9     NEGATE
    $E9            1.6          5.3     INTEGER**INTEGER
    RAN            5.7          *       RANDOM NUMBER 0 -> 1.

    * - Operation not available
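Because each Table III figure is the elapsed time for 10,000 evaluations, the per-call cost and the hardware speed-up follow by simple arithmetic. The short Python sketch below works a few rows through; the row values are as read from Table III, and the helper names are hypothetical and not part of the original software.

```python
EVALUATIONS = 10_000  # each Table III figure covers this many function calls

# (function, Am9511A-assisted seconds, software-only seconds), as read from Table III
ROWS = [
    ("SQRT", 5.2, 126.2),
    ("SIN", 13.7, 109.0),
    ("COS", 14.9, 113.7),
    ("$AB", 4.5, 3.1),    # REAL+REAL
    ("$DB", 5.3, 19.4),   # REAL/REAL
]


def per_call_us(total_seconds):
    """Average time per call in microseconds, loop overhead included."""
    return total_seconds / EVALUATIONS * 1e6


def speedup(hardware_seconds, software_seconds):
    """How many times faster the hardware-assisted routine runs."""
    return software_seconds / hardware_seconds


for name, hw, sw in ROWS:
    print(f"{name:5s} {per_call_us(hw):7.0f} us/call  speed-up {speedup(hw, sw):5.1f}x")
```

On these figures SQRT gains a factor of about 24, whereas REAL+REAL is slightly slower with the hardware, which is consistent with the format conversion overhead described in the text.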



Table IV. Microsoft to AMD Floating Point Format Conversion

    ;       VPPUT SUBROUTINE
    ;INVOKED FROM FORTRAN BY CALL VPPUT(A,C,N) WHERE A IS AN ARRAY
    ;IN HOST FORMAT AND C IS THE CONVERTED ARRAY IN VPU FORMAT. N
    ;IS THE LENGTH OF THE ARRAYS. A MAY BE CONVERTED IN PLACE IF
    ;THE SYMBOLIC ADDRESSES A AND C ARE THE SAME
    ;
             .Z80
             EXT     OFLOV,UFLOV,ABASE,CBASE,LD3ARG,NCOMP
    ;
    VPPUT:   PUSH    DE          ;SAVE DE ON STACK
             CALL    LD3ARG      ;(ABASE)<-HL,(CBASE)<-DE,DE=NCOMP
             LD      (NCOMP),DE  ;SAVE NUMBER OF ARRAY ELEMENTS IN MEMORY
             POP     DE          ;DE POINTS TO BOTTOM BYTE OF C ARRAY
             SUB     A           ;CLEAR CARRY FLAG
             SBC     HL,DE       ;ARE A AND C ARRAY BASE ADDRESSES EQUAL?
             JR      Z,NOMOV     ;YES CONVERT A ARRAY IN PLACE
             LD      HL,(NCOMP)  ;GET NO. OF ARRAY ELEMENTS INTO HL
             ADD     HL,HL       ;AND MULTIPLY BY TWO
             ADD     HL,HL       ;AND BY TWO AGAIN TO GET NO. BYTES IN ARRAY
             LD      C,L         ;BC IS BYTE COUNTER IN LDIR SO LOAD
             LD      B,H         ;HL INTO BC
             LD      HL,(ABASE)  ;HL->A ARRAY,DE->C ARRAY,BC CONTAINS LENGTH
             LDIR                ;MOVE (HL)->(DE) BC TIMES
    NOMOV:   LD      HL,(CBASE)  ;GET READY TO CONVERT C ARRAY IN PLACE
             LD      BC,(NCOMP)  ;BY SETTING UP HL AND BC(BC=0 NOW)
    CONLOOP: INC     HL          ;HL->BYTE 2 OF FP WORD
             INC     HL          ;HL->BYTE 3 OF FP WORD
             PUSH    HL          ;HL->MS MANTISSA BYTE-SAVE IT!
             EXX                 ;GET HL',BC',DE'
             POP     HL          ;HL'->MS MANTISSA BYTE
             EXX                 ;BACK TO HL,BC,DE
             INC     HL          ;HL->BYTE 4 (I.E. EXPONENT) OF MICROSOFT FP WORD
             LD      A,(HL)      ;LOAD MICROSOFT FORMAT EXPONENT INTO ACC.
             ADD     A,00H       ;ADD ZERO TO ACC-SETS CONDITION CODES
             JR      Z,NXNUM     ;ZERO SAME IN BOTH FORMATS-DONT CONVERT
             CP      80H         ;IS EXP GE 80(HEX)
             JP      P,PLUS      ;YES-GO CHECK SIZE OF EXPONENT
             CP      40H         ;IS EXPONENT GT 40 (HEX)
             JP      M,UFLOV     ;NO-NUMBER TOO SMALL FOR APU FORMAT
             JR      CNVRT       ;YES-NUMBER WITHIN RANGE-GO CONVERT
    PLUS:    CP      0BFH        ;IS EXPONENT GT 0BF(HEX)?
             JP      P,OFLOV     ;YES-NUMBER TOO LARGE FOR APU FORMAT
    CNVRT:   SUB     80H         ;NO-CNVRT TO 2'S COMPLEMENT
             LD      (HL),A      ;STORE CNVRTD EXPONENT-NO CONV FOR NEG EXP
             EXX                 ;GET HL',BC',DE'
             BIT     7,(HL)      ;TEST SIGN BIT OF MICROSOFT MANTISSA 1=NEG
             JR      NZ,SET1     ;NEG MANTISSA-GO FIX SIGN IN APU FP FORMAT
             SET     7,(HL)      ;MS MANTISSA BIT ALWAYS SET IN APU FP FORMAT
             EXX                 ;GET HL,BC,DE
             RES     7,(HL)      ;POS MANT- RESET MS APU FMT EXPONENT BIT
             JP      NXNUM       ;CONVERSION TO APU FORMAT COMPLETE
    SET1:    EXX                 ;GET HL,BC,DE
             SET     7,(HL)      ;MS MANT BIT ALREADY SET-SET APU MANT SIGN BIT
    NXNUM:   INC     HL          ;HL->LS MANTISSA BYTE OF NEXT FP NUMBER
             DEC     BC          ;DECREMENT NO. OF ARRAY ELEMENTS LEFT TO FIX
             LD      A,C         ;LOAD C INTO ACC. AND OR IT
             OR      B           ;WITH B. RESULT IS ZERO ONLY IF B=C=0
             JP      NZ,CONLOOP  ;RESULT NOT ZERO-GO PROCESS MORE ELEMENTS
             RET                 ;ALL CONVERTED-RETURN
             END
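The per-number logic of the Table IV listing can be paraphrased in a few lines of Python. This is an illustrative sketch rather than the author's code: the function name is hypothetical, the 4-byte word is assumed to be stored least significant mantissa byte first with the exponent byte last, and the range limits (40H and 0BFH) are those tested in the listing.

```python
def ms_to_amd(word):
    """Convert one 4-byte Microsoft F80 real to the Am9511A layout.

    word is a bytes object: LS mantissa byte, mantissa byte 2,
    MS mantissa byte (bit 7 = sign, leading 1 implied), exponent (excess 80H).
    The byte order in memory is left unchanged.
    """
    lsb, mid, msb, exp = word
    if exp == 0:
        return bytes(word)            # zero is the same in both formats
    if exp < 0x40:
        raise OverflowError("number too small for APU format")
    if exp >= 0xBF:
        raise OverflowError("number too large for APU format")
    negative = bool(msb & 0x80)       # Microsoft keeps the sign in the MS mantissa bit
    msb |= 0x80                       # APU format stores the leading mantissa bit
    exp_byte = (exp - 0x80) & 0x7F    # 7-bit two's complement exponent
    if negative:
        exp_byte |= 0x80              # APU keeps the mantissa sign in bit 31
    return bytes([lsb, mid, msb, exp_byte])
```

For example, 1.0 is stored by F80 as the bytes 00 00 00 81 and becomes 00 00 80 01 under this sketch, which corresponds to the 32 bit word 01800000H in the layout of Figure 2.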





longer in total than a floating point addition on the Am9511A and so floating point addition and subtraction do not obtain full benefit from the use of the Am9511A. Fortunately the conversion times are a small proportion of the execution times for all other important Am9511A functions. No conversions are of course required for integer operations.

One further point should be mentioned and this is the fact that the Am9511A is restricted to 32 bit floating point arithmetic. This is quite adequate for all X-ray crystallographic and most molecular mechanics calculations. In general at least 48 bit representations are required for direct, rather than iterative, matrix operations which are widely used in quantum chemistry for example. This problem could be overcome by means of the Am9512, a companion chip to the Am9511A, which has 64 bit add, subtract, multiply and divide in its instruction set. The Am9512 is however atrociously slow in 64 bit operations and has no on-chip 64 bit transcendental functions. Use of the Am9512 was therefore not considered further.

A natural progression from the use of a single Am9511A arithmetic processor chip, which can enhance the performance of F80 functions/subroutines by a factor of up to 25x, is to use a number of such chips in a VP or AP architecture.

Vector Arithmetic Processors

Having previously decided upon the use of the Am9511A the problem then is to find a satisfactory method of interfacing multiple devices to the S-100 bus.
The hardware adopted is illustrated schematically in Figure 6. The Am9511A's are still accessed via I/O ports, one for data (low address) and one for commands (high address) mediated by address line A0, but the I/O address decoder circuitry (IC14, IC16 and SW1) defines a switch selectable block of 16 consecutive I/O ports any one of which is selectable during a CPU input/output operation. The system clock runs at 3MHz asynchronously of the Z80A CPU clock and data are buffered onto and from the S-100 bus by IC18 and IC19. The Am9511A read and write lines are driven by the S-100 bus SINP and PDBIN and SOUT and PWR signals (5). The arithmetic processor handshaking lines (PAUSE) are ANDed together by IC15 and IC1 so that lowering of any of the PAUSE lines will cause the Z80A to temporarily suspend execution, until all PAUSE lines are high again, in order to allow sufficient time for data transfer. Light emitting diodes are used to monitor the Am9511A END (of operation) lines and the S-100 bus PRDY line (useful for checking that everything is running). It is possible to improve the performance of the hardware (hereafter called the MVP-9500) by providing interrupt facilities to signal completion of operation(s) or to request more data and also by moving data via direct memory access rather than under Z80 program control. These, and other refinements, were



Figure 6. Logic diagram of the MVP-9500/8 vector processor card for S-100 bus microcomputers. [Diagram not reproduced. Labels recoverable from the original: LS85 (IC14) with SW1 on A4-7; LS138 (IC16) on A1-3 giving CS0-7; 8x Am9511A; IC5-12; LS244 (IC18, IC19) on DI0-7 and DO0-7; IC1, IC15, IC17, IC20; 3.2768 MHz clock; S-100 signals A0, SOUT, SINP, PDBIN, PRESET and PRDY.]
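The port decoding described in the text (a switch selectable block of 16 consecutive I/O ports, with A0 choosing between the data and command port of each Am9511A) can be sketched as a small address calculation. The Python below is an illustration only: the function name is hypothetical, and the bit assignment (A4-7 compared with the switch setting, A1-3 selecting one of eight chips) is an assumption drawn from the labels on Figure 6.

```python
def apu_port(base_nibble, apu, command):
    """Return the 8-bit I/O port address for one Am9511A on the card.

    base_nibble: switch setting compared with address lines A4-A7 (0-15)
    apu:         which of the eight chips to select via A1-A3 (0-7)
    command:     False for the data port (A0 = 0), True for the command port (A0 = 1)
    """
    if not 0 <= base_nibble <= 0xF:
        raise ValueError("base_nibble must be 0-15")
    if not 0 <= apu <= 7:
        raise ValueError("apu must be 0-7")
    return (base_nibble << 4) | (apu << 1) | int(command)


# Example: with the switches set to 0CH the card occupies ports C0H-CFH.
ports = [hex(apu_port(0xC, n, cmd)) for n in range(8) for cmd in (False, True)]
```

With this assignment each chip owns an adjacent pair of ports, data at the even address and commands at the odd address, which matches the "low address" and "high address" wording of the text.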




deferred to a later date and first priority was given to getting the basic MVP-9500 up and running.

There are two modes in which the MVP-9500 can be used in conjunction with the Z80A. In the first mode one or more arguments are loaded into each Am9511A in turn, and when all are loaded then command bytes are sequentially output from the Z80A to each Am9511A. This procedure has the drawback that the APU's spend a significant time loaded with data which is not being operated upon. An alternative approach involves "hiding" data transfer operations behind APU arithmetic operations. The first Am9511A is loaded with arguments and then a command to start it operating on the arguments. As soon as APU number one is executing, APU number two is loaded with data and set running and so on until all of the APU's are concurrently executing instructions. The APU's are then polled to see if they are busy, in the same order in which they were loaded, and reloaded with more data and a command as they finish the previous operation. Obviously this type of structure will give maximum efficiency if the APU load times are short compared with the instruction execution times. This is in fact the case as it takes around 30 µs to load the APU with data and set it running compared with APU instruction execution times of roughly 50-4000 µs. The MVP-9500 will obviously be more efficient with calculations involving transcendental functions than it will be with calculations involving floating point add and subtract. The remaining obstacle to producing a working vector processor system is the library of FORTRAN callable subroutines.
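The benefit of hiding data transfers behind APU execution can be estimated with a simple timing model. The Python sketch below is an idealised illustration, not a measurement: the function name is hypothetical, result retrieval is assumed to be folded into the load time, and the host is assumed to service the APU's strictly in round-robin order as described above.

```python
def pipelined_time(n_ops, n_apus, load_us, exec_us):
    """Elapsed time (µs) to run n_ops operations on n_apus Am9511A chips.

    The host loads one APU at a time; each APU then executes on its own
    while the host moves on to load the next one.
    """
    host = 0                        # time at which the host is next free
    free_at = [0] * n_apus          # time at which each APU finishes its work
    for i in range(n_ops):
        k = i % n_apus              # round-robin choice of APU
        host = max(host, free_at[k])    # poll until APU k is idle
        host += load_us                 # host is busy transferring data
        free_at[k] = host + exec_us     # APU k now executes unattended
    return max(free_at)


# 30 µs load time from the text; 1000 µs stands in for a transcendental function.
one_chip = pipelined_time(8, 1, 30, 1000)
eight_chips = pipelined_time(8, 8, 30, 1000)
```

In this model eight long operations take 8240 µs on one chip and 1240 µs on eight, whereas for a 50 µs operation the 30 µs load time dominates and the gain is much smaller, in line with the remark about add and subtract.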
Vector FORTRAN System

The MVP-9500 system contained all of the scalar system components with the addition of an MVP-9500/8 printed circuit board and the VPLIB library of F80 callable subroutines; it is shown schematically in Figure 7. The design criteria for the VPLIB library were as follows: it should be possible to add Am9511A chips to the system as funds permitted without having to make any changes to the software; the library should incorporate the best features of those produced by Floating Point Systems for the AP120B and by CSPI for the MAP-200 series of AFPP's; the library should contain special subroutines for use in coordinate geometry and molecular mechanics calculations; a solution should be found to the problem of time consuming floating point format conversions; and finally the library should be as modular as possible without sacrificing efficiency by using large numbers of subroutine calls or the like (i.e. forget about structured programming as it's speed not beauty we're concerned with!). The problem of floating point format differences was the most pressing one and was solved in the following way. All data arrays to be operated upon by the MVP-9500 are converted to the



Figure 7. Block diagram of the MVP-9500/Vector MZ vector processor microcomputer system.


AMD format at the start of a series of vector operations, usually in place as the Z80A and MVP-9500 share memory. In order to permit scalar processing interspersed between vector calls a number of subroutines for performing arithmetic operations without format conversion are provided. The F80 code for the multiplication of two random matrices (Table V) illustrates this point and also the software independence of the number of Am9511A's used. R(1) and R(2) are random number generator seeds and exist in Microsoft floating point format at the start of the program. The call to VINIT initializes VPLIB with the number of Am9511A's actually present, five in this case. The call to VPPUT converts the array R, in place, to AMD format and the calls to VRAND use R(1) and R(2) as seeds to a random number generator which places 1600 random numbers between 0.0 and 1.0 in array A and a similar number in array B, in AMD format. The pause statements are for timing purposes and MMUL multiplies the matrices A and B together and places the result in matrix C. A, B and C are in AMD format at this stage and must be converted to Microsoft format, by calls to VPGET, before they are printed out. The similarity to FPS and CSPI Inc. routines is obvious. The question of special purpose routines for structural chemistry will be discussed later and attention will now be directed to the construction of the library. It would have been possible to write a small number of strategic routines in Z80A/MVP-9500 assembly language and the remainder of the library in FORTRAN code which made use of the assembler level nucleus.
This would certainly have resulted in an operational library in the shortest possible time but at considerable sacrifice in efficiency. All of the VPLIB subroutines were therefore written entirely in Z80A/MVP-9500 assembly language and this produced modules which contained, on average, one third of the assembler instructions produced by F80 for the same operation coded in FORTRAN. In addition to these straightforward savings a considerable amount of hand optimization was possible on the assembler level subroutines. The above mentioned points are illustrated by a discussion of the VPLIB subroutine(s) for evaluating transcendental functions when supplied with a single floating point argument (Table VI). The entry points are the labels with double colons (e.g. VSIN) and the accumulator A is loaded with the Am9511A opcode for the particular operation. The code starting at VECTOR (a local symbol) is then common to all operations. The Am9511A opcode is stored in the alternate accumulator A' (the Z80 has two duplicate sets of registers) and LD3ARG moves pointers to argument addresses into a local area (ABASE for the source vector, CBASE for the destination vector, and the 16 bit register DE contains the number of components in the vector). The 16 bit register HL is loaded with the current address in the A array and a call to VLD1EX loads each APU and sets it executing as previously described. On exit from VLD1EX HL points to the



Table V. Fortran Code for Matrix Multiplication using VPLIB

      DIMENSION R(3),A(40,40),B(40,40),C(40,40)
      R(1)=1.
      R(2)=2.
      R(3)=3.
      CALL VINIT(5)
      CALL VPPUT(R,R,3)
      CALL VRAND(R(1),A,1600)
      CALL VRAND(R(2),B,1600)
      PAUSE START
      CALL MMUL(A,B,C,40,40,40)
      PAUSE STOP
      CALL VPGET(A,A,1600)
      CALL VPGET(B,B,1600)
      CALL VPGET(C,C,1600)
      WRITE(5,1) ((A(I,J),J=1,40),I=1,40)
      WRITE(5,2) ((B(I,J),J=1,40),I=1,40)
      WRITE(5,3) ((C(I,J),J=1,40),I=1,40)
    1 FORMAT(10F8.3)
    2 FORMAT(10F8.3)
    3 FORMAT(10F8.3)
      STOP
      END



Table VI

;VECTOR TRANSCENDENTAL FUNCTIONS ROUTINE
;FORTRAN CALL IS, FOR EXAMPLE, CALL VSQRT(A,C,N), WHERE A IS AN
;ARRAY CONTAINING THE ARGUMENTS AND C IS AN ARRAY WHERE THE RESULTS
;ARE TO BE STORED. N IS THE NUMBER OF COMPONENTS IN EACH ARRAY TO BE OPERATED ON.
        .Z80
        EXT     ABASE,CBASE,ERR,STRES,LD3ARG,NPROC
BASEVP  EQU     0E0H            ;BASE OF VP-9500 I/O ADDRESSES
VSQRT:: LD      A,01H           ;APU SQRT OPCODE->ACC.
        JP      VECTOR          ;COMMON CODE
VSIN::  LD      A,02H           ;APU SIN OPCODE->ACC.
        JP      VECTOR          ;COMMON CODE
VCOS::  LD      A,03H           ;APU COS OPCODE->ACC.
        JP      VECTOR          ;COMMON CODE
VTAN::  LD      A,04H           ;APU TAN OPCODE->ACC.
        JP      VECTOR          ;COMMON CODE
VASIN:: LD      A,05H           ;APU ASIN OPCODE->ACC.
        JP      VECTOR          ;COMMON CODE
VACOS:: LD      A,06H           ;APU ACOS OPCODE->ACC.
        JP      VECTOR          ;COMMON CODE
VATAN:: LD      A,07H           ;APU ATAN OPCODE->ACC.
        JP      VECTOR          ;COMMON CODE
VLOG::  LD      A,08H           ;APU LOG OPCODE->ACC.
        JP      VECTOR          ;COMMON CODE
VLN::   LD      A,09H           ;APU LN OPCODE->ACC.
        JP      VECTOR          ;COMMON CODE
VEXP::  LD      A,0AH           ;APU EXP OPCODE->ACC.
VECTOR: EX      AF,AF'          ;SAVE APU OPCODE IN AF'
        CALL    LD3ARG          ;GET ARRAY ADDRESSES AND NO. OF COMPONENTS
VTRNLP: LD      HL,(ABASE)      ;HL->NEXT COMPONENT OF A
        CALL    VLD1EX          ;LOAD VP-9500 AND EXECUTE
        LD      (ABASE),HL      ;HL->NEXT COMPONENT-SAVE IT!
        CALL    ERR             ;CHECK EACH APU FOR ERRORS
        LD      HL,(CBASE)      ;HL->NEXT COMPONENT OF C
        CALL    STRES           ;UNLOAD VP-9500, RESULTS TO ARRAY C
        LD      (CBASE),HL      ;HL->NEXT COMPONENT OF C-SAVE IT!
        LD      A,D             ;LOAD HI-BYTE OF COMPONENT COUNTER INTO A
        OR      E               ;AND OR WITH LO-BYTE, RESULT ZERO ONLY IF DE=0
        JP      NZ,VTRNLP       ;MORE COMPONENTS TO CALCULATE?
        RET                     ;NO,RETURN TO FORTRAN CALLING PROGRAM

VLD1EX::LD      C,BASEVP        ;LOAD C WITH VP-9500 BASE I/O ADDRESS
        EXX                     ;SWITCH TO ALTERNATE REGISTER SET
        LD      E,00H           ;ZERO E', THE APU COUNTER.
        EXX                     ;SWITCH BACK TO HL,BC,DE
PROCLP: LD      B,04H           ;ONE FP WORD PER APU
        OTIR                    ;LOAD FP WORD ONTO EACH APU STACK
        INC     C               ;C->APU COMMAND PORT
        EX      AF,AF'          ;RETRIEVE OPCODE FROM AF'
        OUT     (C),A           ;AND OUTPUT IT TO APU COMMAND PORT
        EX      AF,AF'          ;RESAVE OPCODE IN AF'
        LD      A,(NPROC)       ;LOAD NUMBER OF APU'S AVAILABLE INTO ACC.
        DEC     DE              ;DECREMENT COMPONENT COUNTER
        EXX                     ;SWITCH TO ALTERNATE REGISTER SET
        INC     E               ;AND INCREMENT THE APU COUNTER E'
        CP      E               ;HAVE NPROC APU'S EXECUTED?
        EXX                     ;BACK TO HL,BC,DE, FLAG REGISTER UNCHANGED
        RET     Z               ;RETURN IF YES
        LD      A,D             ;CHECK TO SEE IF BOTH HI- AND
        OR      E               ;LO-BYTE OF DE ARE ZERO
        RET     Z               ;DE=0 SO RETURN
        INC     C               ;MORE COMPONENTS,C->NEXT DATA PORT,HL->NXT.
        JP      PROCLP          ;GO AND PROCESS NEXT COMPONENT
        END





next A component to be processed and it is saved in memory at ABASE. The ERR subroutine checks each APU for errors such as overflow, attempt to divide by zero, argument out of range etc. and returns if things are going according to plan, or aborts and prints an error message if they aren't! HL is then loaded with the current address in the destination array C and STRES unloads the APU generated results into memory. On exit HL again points to the next component, but of the C array, and is saved in CBASE. The component counter DE, suitably decremented in VLD1EX, is then checked to see if it is zero (two statements are required because Z80 16 bit operations don't always affect the flags register!). If it is zero, so that all components of the A array have been processed and stored in the C array, a return is made to the calling program; otherwise another set of components is processed. The VLD1EX subroutine operates as follows: the 8 bit C register is loaded with the base I/O address of the vector of Am9511A's. The addresses are contiguous and run in the order DATA PORT 1, COMMAND PORT 1, DATA PORT 2, COMMAND PORT 2, etc. EXX switches to the alternate register set and E', the counter for the number of APU's loaded, is zeroed and EXX switches back to the ordinary register set. The OTIR instruction is a block output from memory pointed to by register HL to the I/O port pointed to by the C register. After each output HL is incremented and B decremented and the process repeated until B is zero.
Having loaded one FP word (4 bytes) onto the APU, C is incremented to point to the command port of the same '9511 and the EX instruction switches to the alternate accumulator which contains the APU opcode. (EX AF,AF' switches accumulators independently of EXX which switches between register sets HL, BC, DE and HL', BC', DE'.) The OUT instruction then sets the APU operating on the current argument. The following EX resaves the APU opcode in A' and the ordinary accumulator A is loaded with the number of '9511's available (VINIT loads location NPROC with this value). The component counter DE is decremented by one to reflect the fact that a component of the A array has been processed. EXX gives access to E' which is incremented and compared with the accumulator which contains the number of '9511's; EXX returns to the ordinary register set without affecting the flag, or condition code, register F. If all the Am9511's available are loaded and running then a return is made to the VSQRT::, or whatever, subroutine. If not all '9511's are loaded the component counter is checked for zero and, if it is zero, a return is made to the VSQRT:: etc. subroutine. If neither of the previous conditions is fulfilled a loop is made back to PROCLP in order to set another APU running (note HL automatically points to the next A array element because of the OTIR instruction). The remaining VPLIB routines are variations on this theme and a complete list of the subroutines available and their purpose is given in Table VII.
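The control flow described above can be mirrored in a short simulation. This is a sketch under stated assumptions, not a translation of the Z80 code: the function names `vld1ex` and `vector_op` are borrowed from the labels in the listing for readability, each APU is stood in for by an ordinary Python function call, and the error checking performed by ERR is omitted. Unlike the assembler routine, the sketch also tolerates an empty input vector.

```python
import math


def vld1ex(src, pos, remaining, nproc, op):
    """Load up to nproc components and set one (simulated) APU running on each.

    Returns the per-APU results plus the updated source position and
    component count, which play the roles of HL and DE in the listing.
    """
    results = []
    loaded = 0                          # the APU counter held in E'
    while True:
        results.append(op(src[pos]))    # OTIR loads the word, OUT starts the APU
        pos += 1                        # OTIR leaves HL at the next component
        remaining -= 1                  # DEC DE
        loaded += 1                     # INC E'
        if loaded == nproc or remaining == 0:
            return results, pos, remaining


def vector_op(a, op, nproc):
    """Apply op to every component of a in batches of at most nproc."""
    c = [0.0] * len(a)
    a_pos = 0                           # pointer saved at ABASE
    c_pos = 0                           # pointer saved at CBASE
    remaining = len(a)                  # component counter DE
    while remaining != 0:
        results, a_pos, remaining = vld1ex(a, a_pos, remaining, nproc, op)
        for value in results:           # STRES unloads each APU in turn
            c[c_pos] = value
            c_pos += 1
    return c


if __name__ == "__main__":
    print(vector_op([1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0], math.sqrt, 5))
```

Because the number of processors is read at run time rather than built into the loop, the same calling code works whatever value was passed at initialization, which is the software independence the library was designed for.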



Table VII. Contents of the VPLIB Vector Processing Library

VINIT(N)                INITIALIZE MVP-9500
VPPUT(A,C,N)            PUT DATA INTO MVP-9500
VPGET(A,C,N)            GET DATA FROM MVP-9500

VCLR(A,N)               VECTOR CLEAR
VMOV(A,I,C,K,N)         VECTOR MOVE
VSWAP(A,I,C,K,N)        VECTOR SWAP
VFILL(A,C,N)            VECTOR FILL
VRAMP(A,B,C,N)          VECTOR RAMP
VNEG(A,C,N)             VECTOR NEGATE
VADD(A,B,C,N)           VECTOR ADD
VSUB(A,B,C,N)           VECTOR SUBTRACT
VMUL(A,B,C,N)           VECTOR MULTIPLY
VDIV(A,B,C,N)           VECTOR DIVIDE
VSADD(A,B,C,N)          VECTOR SCALAR ADD
VSSUB(A,B,C,N)          VECTOR SCALAR SUBTRACT
VSMUL(A,B,C,N)          VECTOR SCALAR MULTIPLY
VSDIV(A,B,C,N)          VECTOR SCALAR DIVIDE
VSQ(A,C,N)              VECTOR SQUARE
VSSQ(A,C,N)             VECTOR SIGNED SQUARE
VABS(A,C,N)             VECTOR ABSOLUTE VALUE
VSQRT(A,C,N)            VECTOR SQUARE ROOT
VLOG(A,C,N)             VECTOR LOG(10)
VLN(A,C,N)              VECTOR NATURAL LOGARITHM
VEXP(A,C,N)             VECTOR EXPONENTIAL
VSIN(A,C,N)             VECTOR SINE
VCOS(A,C,N)             VECTOR COSINE
VTAN(A,C,N)             VECTOR TANGENT
VASIN(A,C,N)            VECTOR ARC SINE
VACOS(A,C,N)            VECTOR ARC COSINE
VATAN(A,C,N)            VECTOR ARC TANGENT
VAA(A,B,C,D,N)          VECTOR ADD AND ADD
VAM(A,B,C,D,N)          VECTOR ADD AND MULTIPLY
VADIV(A,B,C,D,N)        VECTOR ADD AND DIVIDE
VSBSB(A,B,C,D,N)        VECTOR SUBTRACT AND SUBTRACT
VSBM(A,B,C,D,N)         VECTOR SUBTRACT AND MULTIPLY
VSBDIV(A,B,C,D,N)       VECTOR SUBTRACT AND DIVIDE
VMA(A,B,C,D,N)          VECTOR MULTIPLY AND ADD
VMSB(A,B,C,D,N)         VECTOR MULTIPLY AND SUBTRACT
VMM(A,B,C,D,N)          VECTOR MULTIPLY AND MULTIPLY
VMDIV(A,B,C,D,N)        VECTOR MULTIPLY AND DIVIDE
VDIVA(A,B,C,D,N)        VECTOR DIVIDE AND ADD
VDIVSB(A,B,C,D,N)       VECTOR DIVIDE AND SUBTRACT
VDIVDV(A,B,C,D,N)       VECTOR DIVIDE AND DIVIDE
VPWR(A,B,C,N)           VECTOR POWER
VATAN2(A,B,C,N)         VECTOR ARC TANGENT OF A/B
VRAND(A,C,N)            VECTOR RANDOM NUMBERS
VMSA(A,B,C,D,N)         VECTOR MULTIPLY AND SCALAR ADD
VSMA(A,B,C,D,N)         VECTOR SCALAR MULTIPLY AND ADD
VSMSB(A,B,C,D,N)        VECTOR SCALAR MULTIPLY AND SUBTRACT
VSBSQ(A,B,C,N)          VECTOR SUBTRACT AND SQUARE
VSMSA(A,B,C,D,N)        VECTOR SCALAR MULTIPLY AND SCALAR ADD
VMMA(A,B,C,D,E,N)       VECTOR MUL MUL AND ADD <(A*B)+(C*D)>
VMMSB(A,B,C,D,E,N)      VECTOR MUL MUL AND SUB <(A*B)-(C*D)>
VAAM(A,B,C,D,E,N)       VECTOR ADD,ADD AND MUL
VSBSBM(A,B,C,D,E,N)     VECTOR SUB,SUB AND MUL <(A-B)*(C-D)>



VCOMPS is an array for holding Δx² etc. in triples corresponding to Δx² + Δy² + Δz² and places the result in the BL array. Taking the square root (VSQRT) of the elements of BL gives the bond lengths which overwrite the original quantities in the BL array. The astute reader will have noticed that it is possible to perform an entire bond length calculation on a four deep stack without the removal of intermediate results. However in order to take full advantage of this fact we require an Am9511A which has its own individual control processor or microsequencer. This is a projected improvement to the MVP-9500. Interatomic distances, angles and torsion angles may be efficiently calculated with the routines discussed above plus the remainder of VPLIB. A number of other special purpose routines have been developed and are under development but these will not be discussed here.
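The bond length calculation can be sketched as three vector steps: an indexed subtract and square, a sum over each triple, and a square root. The Python below is an illustration only. The helper names follow the VPLIB routine names VSUSQI and SVE, but their argument conventions here are assumptions inferred from those names rather than from documentation, and the index tables use 0-based offsets into a flat coordinate list, which is a choice made for this sketch.

```python
import math


def vsusqi(x, idx_i, y, idx_j):
    """Indexed subtract and square (assumed meaning of VSUSQI)."""
    return [(x[i] - y[j]) ** 2 for i, j in zip(idx_i, idx_j)]


def sve(v, start, n):
    """Sum n consecutive elements of v (assumed meaning of SVE)."""
    return sum(v[start:start + n])


def bond_lengths(coords, bonds):
    """coords is a flat list x0,y0,z0,x1,...; bonds is a list of atom pairs."""
    # Index tables pointing at the x, y, z entries of atoms I and J per bond.
    idx_i = [3 * i + k for i, _ in bonds for k in range(3)]
    idx_j = [3 * j + k for _, j in bonds for k in range(3)]
    vcomps = vsusqi(coords, idx_i, coords, idx_j)      # squared differences in triples
    squared = [sve(vcomps, 3 * b, 3) for b in range(len(bonds))]
    return [math.sqrt(s) for s in squared]             # the VSQRT step


if __name__ == "__main__":
    atoms = [0.0, 0.0, 0.0, 3.0, 4.0, 0.0, 3.0, 4.0, 12.0]
    print(bond_lengths(atoms, [(0, 1), (1, 2), (0, 2)]))
```

Precomputing the index tables once means the same three calls serve every energy evaluation, with only the coordinates changing between iterations.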

Vector Processing and Molecular Mechanics Calculations

The primary reason for undertaking this whole exercise was to evaluate vector processors for use in molecular mechanics calculations and as an adjunct to chemical computer graphics systems. Picture transformations (e.g. rotation, translation, scaling, perspective etc.) in interactive computer graphics lend themselves naturally to representation in matrix notation, and implementation of the various algorithms on a vector processor is obviously straightforward and very worthwhile (particularly if moving pictures are required). For this reason graphical applications of the MVP-9500 will not be discussed here and the interested reader is referred to one of the standard texts in this area (7).



Table XV. Calculation of Bond Lengths using VSUSQI

      SUBROUTINE BLEN(NBOND)
      COMMON/BOND/XO(3,60),INDXI(180),INDXJ(180),VCOMPS(180),BL(60)
C *** XO CONTAINS ORTHOGONAL COORDS IN ANGSTROMS,VCOMPS CONTAINS THE **
C *** VECTOR COMPONENTS OF EACH BOND,AND BL CONTAINS THE BOND LENGTH **
C *** INDXI & INDXJ POINT TO X,Y,Z COORDINATES OF ATOMS I & ATOMS J **
C
      CALL VSUSQI(XO,INDXI,XO,INDXJ,VCOMPS,NBOND*3)
      IJ=1
      DO 10 I=1,NBOND
      CALL SVE(VCOMPS(IJ),BL(I),3)
      IJ=IJ+3
   10 CONTINUE
      CALL VSQRT(BL,BL,NBOND)
      RETURN
      END



Molecular mechanics calculations (8) certainly involve a great deal of floating point arithmetic but a casual study of the problem does not reveal the suitability, or otherwise, of the calculations for implementation on a vector processor. The author has implemented his molecular mechanics program (9) on the MVP-9500/Vector MZ in a piecemeal fashion. That is, the obviously vectorizable parts of the program were converted to use VPLIB first and the less obvious or more difficult conversions tackled last. Several points immediately emerged: some parts of the calculation are performed more effectively as scalar rather than vector operations; the price of speed with the MVP-9500 is a larger memory requirement than that for the all-scalar program; efficient use of the MVP-9500 requires substantial reconstruction of the serial, scalar program flow; and finally (and paradoxically) it is sometimes worth doing a little more computation than is absolutely necessary in order to remove the requirement for conditional statements. These facts are discussed at length in the excellent Infotech State of the Art Report on Supercomputers (10) but will be mentioned here to the extent that they impact on the author's molecular mechanics program. The program starts by reading in orthogonal coordinates for a molecule and calculating all of the bonded atom pairs from these and a table of bond radii.
This information is then used to construct a bonding matrix which is operated upon to produce groups of three numerical identifiers (the atom numbers) for valency angles and groups of four identifiers for torsion angles. Although these operations can be expressed in MVP-9500 code or VPLIB calls the result is not satisfactory and they are best left in standard scalar FORTRAN. However, the rest of the program, which embraces 99% of the work, can be efficiently vectorized. This remainder is exclusively concerned with the evaluation of the molecular potential energy consequent upon changes in the orthogonal coordinates, and with matrix inversion and matrix by vector multiplication. The former series of operations takes place during the calculation of numerical first and second derivatives and the latter during the calculation of new orthogonal coordinates for the particular iteration. Evaluation of the potential energy involves, first, calculating the bond lengths, angles, non-bonded distances and torsion angles from the previously constructed bonding matrix, valency angle identifier table and torsion angle identifier table and, second, the use of these geometric quantities, together with a force field, to calculate energies for individual interactions which are then summed to give the total energy. The calculation of bond lengths etc. by means of VPLIB calls has already been discussed. In the case of the energies, the same four algebraic expressions are used over and over again with





different force constants and bond lengths etc. so that this series of computations is also straightforwardly expressed in terms of VPLIB calls. The remaining calculation of inverse matrices and matrix by vector products is handled by calls to MATINV and MMUL (a vector can be considered as an Nx1 matrix). The construction of an MVP-9500 version of the molecular mechanics program which ran faster than the author's PDP-11/40 (FIS) version was an undemanding if tedious process but, as with construction of VPLIB routines, arriving at an efficient rather than brute force coding will take several iterations.

Conclusions

The author has presented details of a cost effective vector processor for use with S-100 microcomputers and produced a library of FORTRAN callable subroutines for general purpose floating point computations. Brief details of the construction of a molecular mechanics program using the vector processor have been given. It is appropriate at this point to include a brief discussion of the possibilities for future, enhanced versions of the MVP-9500. The MVP-9500 was designed with flexibility and ease of enhancement very much in mind and some improvements have already been implemented. Perhaps the most straightforward enhancement involves replacement of the Z80A control processor and the Am9511A-1DC arithmetic processors with the faster Z80B (6MHz as opposed to 4MHz for the Z80A) and Am9511A-2DC (4MHz as opposed to 3MHz). Using faster Am9511A's results in an increase in performance roughly proportionate to the clock speeds for VPLIB operations (i.e.
run times are reduced to ≈0.8 of the original), but using a 6MHz Z80B results in an extra gain: not only is there a reduction in the time required to move data around the system by virtue of the higher clock rate, but parallelism amongst the Am9511A's is also enhanced because load times are a smaller proportion of the total multi-Am9511A operation than previously. Using a 6.55MHz Z80B reduces the time for the benchmark of Table X from 7.00 sec to 4.20 sec. Another straightforward possibility is to increase the number of Am9511A's in the system at eight per card. Diminishing returns set in fairly quickly and the optimum number of Am9511A's for different systems is as follows: 8 for a 4MHz Z80A, 16 for a 6MHz Z80B, 64 for a 6MHz Z8000 and 128 for the projected 10MHz Z8009. This leads on to replacing the 8 bit Z80 with a more powerful 16 bit Z8000 as the central processor. It is here that the advantage of choosing the S-100 bus becomes obvious in that the only hardware change required for this enhancement is replacement of the Z80 CPU card.




The software needs to be changed of course and VPLIB can either be translated to Z8000 code with a commercially available program, or alternatively rewritten to take maximum advantage of the Z8000 architecture. The author has tested this hardware configuration. 16 bit wide data busses also offer the possibility of loading two Am9511A's in parallel with one Z8000 instruction. The main limitation on the MVP-9500 system as described lies in its 64k byte address space. By the time the operating system and the requisite parts of VPLIB and the FORTRAN library are loaded there are only ≈20k bytes available for data arrays on a 48k Vector MZ. A 16 bit control processor such as the Z8000 would help in this respect as it has an 8M byte data address space. However, the Z8000 program counter is only 16 bits wide and access to memory beyond 64k bytes is by the addition of the contents of a segment register to the program counter to give a 24 bit address. This can be simulated with a Z80 by means of bank switching in which an output instruction (analogous to loading the segment register in the Z8000) is used to select (usually) one of eight 64k byte banks of memory. This is done in the author's system, as shown in Figure 7. In this case a modified VPLIB dedicates one memory bank to each of the A, B, C, D and E vectors referenced by the VPLIB subroutine calls. Data are loaded into the banks either by way of a reserved area of permanently enabled memory or by I/O transfers, in which case the memory, vector processor and control processor reside in a "vector coprocessor" attached to the Vector MZ by a high speed parallel link.
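A toy model may make the bank switching scheme easier to picture. It is a sketch only: the class `BankedMemory`, its method names and the placement of the permanently enabled area at the bottom of the address range are all hypothetical choices for illustration, and nothing here reflects the actual memory map of the author's system.

```python
class BankedMemory:
    """Toy model of a 16 bit address space extended by bank switching."""

    def __init__(self, n_banks=8, bank_size=64 * 1024, common_size=0):
        self.bank_size = bank_size
        self.common_size = common_size      # bytes visible whichever bank is selected
        self.common = bytearray(common_size)
        self.banks = [bytearray(bank_size) for _ in range(n_banks)]
        self.current = 0

    def select(self, bank):
        """Stands in for the output instruction that latches the bank number."""
        if not 0 <= bank < len(self.banks):
            raise ValueError("no such bank")
        self.current = bank

    def _cell(self, addr):
        if not 0 <= addr < self.bank_size:
            raise ValueError("address outside the 16 bit range")
        if addr < self.common_size:         # assumed location of the shared area
            return self.common, addr
        return self.banks[self.current], addr

    def read(self, addr):
        store, offset = self._cell(addr)
        return store[offset]

    def write(self, addr, value):
        store, offset = self._cell(addr)
        store[offset] = value & 0xFF


if __name__ == "__main__":
    mem = BankedMemory(common_size=0x100)
    mem.select(1)
    mem.write(0x2000, 0xAA)     # lands in bank 1 only
    mem.write(0x0010, 0x55)     # lands in the shared area
    mem.select(2)
    print(mem.read(0x2000), mem.read(0x0010))
```

The shared area is what lets a program running in one bank hand data to another, which corresponds to the first of the two loading methods described above.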
Finally, most of the matrix routines heavily used by the author such as MMUL and MATINV rely heavily on VSMA (the vector scalar multiply and add routine) and a means of improving the vector by scalar multiply would be most welcome. Fortunately this is easy to achieve with little extra hardware and minimal software changes. The chip select lines of the Am9511A's are conditioned by a latched I/O port so that they are either logically separate, with the Am9511A's individually addressable (as previously described), or logically connected one with the other so that addressing one Am9511A enables all of the other Am9511A's on the board too. A scalar load to an MVP-9500/8 is thus reduced from eight separate loads of the same scalar to each Am9511A to one parallel load of the scalar to all eight Am9511A's simultaneously. In aggregate the enhancements discussed above could lead to versions of the MVP-9500 up to 30 times more powerful than the original at very little extra (direct) cost.
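To see why a matrix multiply leans so heavily on a scalar multiply and add, the product can be written one result column at a time, each column being built from repeated vector updates. The Python below is a sketch under assumptions: the argument order of `vsma` is chosen for this illustration and is not the documented VPLIB calling sequence, and matrices are held as lists of columns to echo FORTRAN's column-major storage.

```python
def vsma(a, s, c):
    """Vector scalar multiply and add: a[i] * s + c[i] (assumed semantics)."""
    return [ai * s + ci for ai, ci in zip(a, c)]


def mmul(a_cols, b_cols):
    """Multiply A by B, both given as lists of columns.

    Column j of the product is the sum over k of column k of A scaled by
    element (k, j) of B, so the inner loop is a single vsma call.
    """
    n_rows = len(a_cols[0])
    product = []
    for b_col in b_cols:
        acc = [0.0] * n_rows
        for k, a_col in enumerate(a_cols):
            acc = vsma(a_col, b_col[k], acc)
        product.append(acc)
    return product


if __name__ == "__main__":
    a = [[1.0, 3.0], [2.0, 4.0]]    # columns of [[1, 2], [3, 4]]
    b = [[5.0, 7.0], [6.0, 8.0]]    # columns of [[5, 6], [7, 8]]
    print(mmul(a, b))               # columns of the product
```

Every call to `vsma` uses one scalar for a whole column, so each scalar load that can be broadcast to all eight processors at once replaces eight individual loads. A matrix by vector product is the special case in which `b_cols` contains a single column.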



Acknowledgements

The author is indebted to the Science Research Council for financial support of this project and to Video Vector Dynamics Ltd. for the provision of printed circuit boards and technical support.


Literature Cited

1. Kuck, D.H.; Lawrie, D.H.; Sameh, A.H. (Eds) "High Speed Computer and Algorithm Organization"; Academic Press: New York, 1977.
2. Gupta, B.K. Computer Design July 1980, 19, 85.
3. "North Star Computers, Product Catalog"; North Star Computers Inc.: Berkeley (U.S.), 1980.
4. Palmer, J. SIGARCH Newsletter May 1980, 8, 174.
5. Elmquist, K.A.; Fullmer, H.; Gustavson, D.B.; Morrow, G. Computer July 1979, 12, 28.
6. Markham, S. "Floating Point Systems, The Array Processor Company"; F.P.S. Inc.: Bracknell (U.K.), 1979.
7. Newman, W.M.; Sproull, R.F. "Principles of Interactive Computer Graphics"; McGraw-Hill: New York, 1979.
8. White, D.N.J. "Molecular Structure by Diffraction Methods Vol. 6"; The Chemical Society: London, 1978; p. 38.
9. White, D.N.J. Computers & Chemistry 1977, 1, 225.
10. "Infotech State of the Art Report, Supercomputers"; Infotech International: Maidenhead (U.K.), 1979.
11. "Algorithmic Details for the Am9511 Arithmetic Processing Unit"; Advanced Micro Devices.

RECEIVED June 18, 1981.

