
Encoding and Decoding WLN. George A. Miller. J. Chem. Doc., 1972, 12 (1), pp 60–67. DOI: 10.1021/c160044a016. Publication Date: February 1972...

Encoding and Decoding WLN

GEORGE A. MILLER*
Heuristics Laboratory, Division of Computer Research and Technology, National Institutes of Health, Department of Health, Education and Welfare, Bethesda, Md. 20014

Received August 25, 1971

This paper deals with the encoding and decoding of Wiswesser Line Notation (WLN). So far this problem has been addressed only from the point of view of a human. This paper discusses encoding and decoding with an exactness suitable for a computer, and is an outgrowth of a computer program now in operation at NIH which automatically encodes and decodes WLN.

The computer system for encoding and decoding WLN¹ can be broken down into four separate programs (see Figure 1). Two of the programs, Rand Tablet Program and Display, are used to facilitate the communication between the chemist and the computer. The Rand Tablet Program allows a chemist to sketch a chemical structure free-hand on a special pad. The program then converts the information into a connection table for further processing by the Encoding Program. Examples of free-hand sketches of two chemical structures are given in Figure 2. Conversely, the Decoding Program converts a connection table into a two-dimensional chemical structure. Examples of the results of this program are given in Figure 3.

Once the input/output considerations are put aside, we can concentrate on the core of the problem, namely, encoding and decoding. There is a further division which naturally suggests itself: chemical structures can be broken into their acyclic and cyclic parts and handled separately.

*Present address: University of Pennsylvania, Moore School of Electrical Engineering, Philadelphia, Pa. 19140.

Figure 1. Over-all view of programming structure for encoding and decoding WLN

As one might expect, the encoding and decoding programs are similar. The decoding algorithm seems to follow the encoding algorithm precisely in reverse. This paper discusses the general aspects of both algorithms, glossing over most of the minor details and exceptions which make an algorithm tedious.

ENCODING ACYCLIC STRUCTURES

First assume that the chemical structure to be encoded is entirely acyclic. We can abstract the problem as follows: consider an acyclic network with each node being a letter, as in Figure 4. The linear notation for such a network or chemical structure will be a particular permutation of the letters of the nodes. The permutation is decided by the following abstracted Wiswesser rules:

Rule 1. Cite all chains of nodes letter-by-letter as connected.
Rule 2. Resolve all otherwise equal alternatives in letter sequence by selecting the sequence that would be biggest.


Journal of Chemical Documentation, Vol. 12, No. 1 , 1972

Figure 2. Sample results from the encoding program

Figure 3. Sample results from the decoding program

Figure 4. Example of an abstracted chemical structure

Rule 3. Cite branched structures along that chain of nodes which includes, first, the largest possible number of branch nodes (a branch node is a node connected to three or more other nodes) and, after this, the largest possible number of nodes; start at the end of this chain required by Rule 2 and then follow Rule 4.
Rule 4. After each branch node, cite first, in the following order of choice, (a) the chain with the fewest branch nodes; and after this (b) the chain with the fewest nodes; and after this (c) the biggest chain.

Note that "bigger" is a relation defined for sequences of letters of the same length. One sequence is said to be bigger if it follows another in alphabetic series. Conversely, one sequence is said to be "smaller" if it precedes another in alphabetic series; e.g., "TWEF" is bigger than "TWEA" and "ABEDL" is smaller than "BBEDL". Thus, the correct sequence for the example in Figure 4 would be:

YVTZITIDCGBPCGBM
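The "bigger" relation used by Rules 2 and 4 can be illustrated with a minimal Python sketch. The function name `bigger` is hypothetical, and the sketch assumes the symbols are plain uppercase letters, so that Python's built-in string comparison coincides with alphabetic order; the paper does not say how other WLN symbols would be ranked.

```python
def bigger(a: str, b: str) -> bool:
    """Return True if sequence a is 'bigger' than sequence b.

    Assumption: both sequences consist of uppercase letters only, so that
    Python's code-point comparison matches alphabetic series.
    """
    if len(a) != len(b):
        # The relation is defined only for sequences of the same length.
        raise ValueError("sequences must have the same length")
    return a > b


print(bigger("TWEF", "TWEA"))    # the text's first example
print(bigger("ABEDL", "BBEDL"))  # the text's second example, reversed sense
```

The length check mirrors the restriction stated in the text; a fuller implementation would need an explicit ranking for digits and punctuation symbols.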


The rules explain what the correct coding should be but fail to indicate how a structure is to be encoded. The algorithm employed by the author can be thought of as a parallel process in which the ends of the structure are "eaten" away, node by node, until there is only one node left; then that too is "eaten." Whenever a node is "eaten," the notation for the structure up to that point is formed, so that when the structure is gone the notation will remain. Since we do not have a parallel processor, the algorithm proceeds in the following sequential manner:

Step 1: Calculate the connectivity of all the nodes. Put a triangle around all the terminal nodes (nodes connected to just one other node) and put a square around all nodes connected to two other nodes. If no triangles were drawn, go to Step 4. (This is a weakness of WLN; in fact a new notation has been proposed in which the connectivity would be inherent and not require calculation.)
Step 2: Find a triangle. If it is not connected to anything, go to Step 4; otherwise eat it. If you can't find any, go to Step 1.
Step 3: If the node just eaten was not connected to a square, go to Step 2; otherwise eat this square and repeat this step.
Step 4: There should be just one node left; devour it, and you are finished.

How is a node eaten? The process of eating a node not only destroys it but also results in four by-products which are passed along to the connecting node (realize that there is only one connecting node, since only terminal nodes are eaten). The four by-products are:

1. The number of branch nodes counted to this point
2. The number of nodes counted to this point
3. The forward sequence of letters following Rules 1-4 to this point
4. The backward sequence of letters following Rules 1-4 to this point

The forward sequence is that sequence of notation going from the node just eaten to its connecting node. Similarly, the backward sequence is the notation going in the other direction. Both the forward and the backward sequence must be carried along, since the correct direction will not be established until the end of the process.

The remaining bit of explanation describes how the four by-products are calculated. When a node is to be eaten, it is either the last node left or not. First we shall consider the latter case. The expiring node might have a number of lists of by-products from nodes which were previously connected to it (a list is simply an enumeration of the four by-products). All of these lists will be combined into a new list and passed on to the connecting node. The first by-product, the number of branch nodes, is the sum of the corresponding numbers in the connecting lists, plus one if the node being eaten has two or more lists attached to it. The second by-product is the largest corresponding number in the connecting lists plus one. The third by-product, the forward sequence, is formed by first finding the sequence S1. This sequence is the one among the forward sequences of the connecting lists that contains the greatest number of branch nodes, then the greatest number of symbols, and then is the biggest. The new forward sequence is formed by concatenating the sequence S1 with the letter of the eaten node and then with the remaining backward sequences in order of least number of branches, least number of nodes, and bigness. The fourth by-product, the backward sequence, is formed by concatenating the letter of the eaten node with the backward sequences of the connecting lists in order of least number of branches, least number of nodes, and bigness.

The process of devouring is exactly like that of eating a node except that the fourth by-product, the backward sequence, is calculated differently. The backward sequence for devouring is formed by first finding the sequence S2. This sequence is the second one among the forward sequences of the connecting lists when they are ranked by the greatest number of branch nodes, then the greatest number of symbols, and then bigness. The new backward sequence is formed by concatenating the sequence S2 with the letter of the devoured node and then with the remaining backward sequences in order of least number of branches, least number of nodes, and bigness.

To alleviate any confusion at this point, the acyclic structural example is encoded step by step as shown in Figure 5, which is based on a node eaten at each step. The algorithm terminates with a list containing a forward sequence and a backward sequence. The resultant notation will be the bigger of the two sequences. In the case of the example, the resultant notation is the forward sequence, or YVTZITIDCGBPCGBM. This algorithm is detailed elsewhere.²
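The eating procedure above can be sketched in Python as follows. This is an illustrative reading of the text, not the author's program: the names `combine` and `encode_acyclic` are hypothetical, the second by-product is used wherever the text ranks by "number of nodes" or "number of symbols", a terminal node with no incoming lists is assumed to contribute counts of zero, and ordinary string comparison stands in for "bigness". The example network is numbered and bonded according to the F row of the paper's decoding example, because the Figure 4 drawing is not legible here.

```python
def combine(letter, received, last=False):
    """Merge by-product lists (branch_nodes, nodes, forward, backward) into one."""
    def cite(lists):
        # Rule 4 order: fewest branch nodes, then fewest nodes, then biggest first.
        ordered = sorted(lists, key=lambda r: r[3], reverse=True)
        ordered.sort(key=lambda r: (r[0], r[1]))  # stable sort keeps bigness in ties
        return "".join(r[3] for r in ordered)

    branches = sum(r[0] for r in received) + (1 if len(received) >= 2 else 0)
    nodes = max((r[1] for r in received), default=0) + 1
    # Rank forward sequences: most branch nodes, most nodes, biggest (S1 first).
    ranked = sorted(received, key=lambda r: (r[0], r[1], r[2]), reverse=True)
    forward = (ranked[0][2] if ranked else "") + letter + cite(ranked[1:])
    if last and len(ranked) >= 2:
        # Devouring: the backward sequence starts from the runner-up, S2.
        backward = ranked[1][2] + letter + cite(ranked[:1] + ranked[2:])
    else:
        backward = letter + cite(ranked)
    return (branches, nodes, forward, backward)


def encode_acyclic(letters, edges):
    """letters: node -> letter; edges: (u, v) pairs forming a tree."""
    adj = {n: set() for n in letters}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    handed = {n: [] for n in letters}  # lists passed to each surviving node

    def eat(node):
        (neighbor,) = adj.pop(node)  # a node being eaten has one connection
        adj[neighbor].discard(node)
        handed[neighbor].append(combine(letters[node], handed.pop(node)))
        return neighbor

    while len(adj) > 1:
        # Step 1: mark terminal nodes (triangles) and two-connected nodes (squares).
        triangles = [n for n in adj if len(adj[n]) == 1]
        squares = {n for n in adj if len(adj[n]) == 2}
        if not triangles:
            raise ValueError("structure is not acyclic")
        for node in triangles:          # Step 2: eat each triangle
            while len(adj) > 1:
                node = eat(node)
                if node not in squares:  # Step 3: keep eating through squares
                    break
                squares.discard(node)
    (last,) = adj                        # Step 4: devour the final node
    _, _, forward, backward = combine(letters[last], handed[last], last=True)
    return max(forward, backward)


notation = "YVTZITIDCGBPCGBM"
letters = {i + 1: ch for i, ch in enumerate(notation)}
F = [None, 1, 2, 3, 4, 4, 6, 4, 8, 9, 9, 11, 8, 13, 13, 15]
edges = [(i + 1, f) for i, f in enumerate(F) if f is not None]
print(encode_acyclic(letters, edges))  # prints YVTZITIDCGBPCGBM
```

The two-pass stable sort in `cite` is one way to sort ascending on the counts while keeping the biggest sequence first among ties. The sketch omits the minor details and exceptions the paper glosses over, and it handles only connected acyclic input.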

DECODING ACYCLIC NOTATION

Here also the problem must be abstracted and grossly simplified in order to present the basic algorithm in a lucid fashion. The problem is to start with the notation of an acyclic chemical structure and to produce the structure. The notation can be thought of as a sequence of N symbols designated by S_i, where i varies from 1 to N. A new sequence of numbers, V, is produced from S by table lookup. V_i represents the valence of the chemical symbol S_i. It will be assumed that the valence is known and constant for each symbol, when in reality many chemical symbols are of variable valence. Next the sequence P is formed by the following formula:

P_i = P_(i-1) + V_i - 2 for 1 ≤ i ≤ N (P_0 is defined to be 2).

Finally the sequence F is created. The algorithm entails incrementing the value of i from 1 to N - 1. If the value of V_i is 1, do nothing; otherwise, set F_(i+1) to i. Now if the value of V_i is 2, do nothing; otherwise, successively decrease the values of V_i and P_i by 1 until V_i is equal to 2. Each time the value is decreased, look ahead along the P sequence until a value of P_j is found equal to P_i. At this point set F_(j+1) equal to i. At the end of it all, set F_2 equal to 1. This algorithm is outlined in Figure 6.

The conclusion of the process results in the creation of the sequence F. This series of numbers describes how the structure is formed:

Step 1: First write the numbers from 1 to N; then
Step 2: go through F and, for every i (1 ≤ i ≤ N), draw a line between the number i and the number F_i; finally,
Step 3: substitute S_i for every number i, and the structure is drawn.

To illustrate, we shall use the notation encoded above:

N = 16

i:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:   Y  V  T  Z  I  T  I  D  C  G  B  P  C  G  B  M
V:   1  2  2  4  1  2  1  3  3  1  2  1  3  1  2  1
P:   1  1  1  3  2  2  1  2  3  2  2  1  2  1  1  0
F:   -  1  2  3  4  4  6  4  8  9  9 11  8 13 13 15

The steps are shown in Figure 7.
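The construction of P and F for the worked example can be checked with a short Python sketch. The function name `decode_links` and the `valence` table are illustrative assumptions; the valences are simply read off the V row of the example, not taken from a real WLN symbol table. Instead of decrementing V_i and P_i in place, the sketch uses a working copy (`level`), which gives the same look-ahead matches because only later, unmodified entries of P are inspected.

```python
def decode_links(notation, valence):
    """Return (P, F) for an acyclic notation; F[i-1] is the node bonded to node i."""
    n = len(notation)
    v = [0] + [valence[s] for s in notation]  # 1-based, v[0] unused
    p = [2] + [0] * n                          # P_0 is defined to be 2
    for i in range(1, n + 1):
        p[i] = p[i - 1] + v[i] - 2
    f = [None] * (n + 1)
    for i in range(1, n):
        if v[i] == 1:
            continue                           # terminal symbol: nothing follows it
        f[i + 1] = i                           # the next symbol attaches to i
        level = p[i]
        for _ in range(v[i] - 2):              # one pass per additional branch
            level -= 1
            # Look ahead for the first P_j equal to the decreased P_i.
            j = next(j for j in range(i + 1, n) if p[j] == level)
            f[j + 1] = i
    if n > 1:
        f[2] = 1                               # final step of the algorithm
    return p[1:], f[1:]


# Valences assumed from the V row of the worked example.
valence = {"Y": 1, "V": 2, "T": 2, "Z": 4, "I": 1, "D": 3,
           "C": 3, "G": 1, "B": 2, "P": 1, "M": 1}
notation = "YVTZITIDCGBPCGBM"
P, F = decode_links(notation, valence)

# Steps 1-3 of the drawing procedure: bond each numbered symbol i to F_i.
bonds = [(notation[fi - 1] + str(fi), notation[i - 1] + str(i))
         for i, fi in enumerate(F, start=1) if fi is not None]
print(P)
print(F)
print(bonds[:4])
```

For a malformed notation the look-ahead finds no matching P_j and `next` raises `StopIteration`; a production decoder would report that as an invalid notation.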

[Figure residue: step-by-step panels labeled A (Step 1), B (Step 2), I (Step 3), J (Step 3), and K (Step 2), annotated with the forward sequence, number of branch nodes, and number of symbols.]