Graph Algorithms in Chemical Computation - ACS Symposium Series


Graph Algorithms in Chemical Computation

ROBERT ENDRE TARJAN*

Downloaded by 190.6.22.186 on October 5, 2015 | http://pubs.acs.org Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0046.ch001

Computer Science Dept., Stanford University, Stanford, CA 94305

1.

Introduction.

The use of computers in science is widespread. Without powerful number-crunching facilities at his** disposal, the modern scientist would be greatly handicapped, unable to perform the thousands or millions of calculations required to analyze his data or explore the implications of his favorite theory. He (or his assistant) thus requires at least some familiarity with computers, the programming of computers, and the methods which might be used by computers to solve his problems. An entire branch of mathematics, numerical analysis, exists to analyze the behavior of numerical algorithms.

However, the typical scientist's appreciation of the computer may be too narrow. Computers are much more than fast adders and multipliers; they are symbol manipulators of a very general kind. A scientist who writes programs in FORTRAN or some similar, scientifically oriented computer language, may be unaware of the potential use of computers to solve computational, but not necessarily numeric, problems which might arise in his research.

This paper discusses the use of computers to solve nonnumeric problems in chemistry. I shall focus on a particular problem, that of identifying chemical structure, and examine computer methods for solving it. The discussion will include

* This research was partially supported by the National Science Foundation, grant MCS75-22870, and by the Office of Naval Research, contract N00014-76-C-0688.

** For the purpose of smooth reading, I have used the masculine gender throughout this paper.

In Algorithms for Chemical Computations; Christoffersen, R.; ACS Symposium Series; American Chemical Society: Washington, DC, 1977.


elements of graph theory, list processing, analysis of algorithms, and computational complexity. I write as a computer scientist, not as a chemist; I shall neglect details of chemistry in order to focus on issues of algorithmic applicability, simplicity, and speed. It is my hope that some readers of this paper will become interested in applying to their own problems in chemistry the methods developed in recent years by computer scientists and mathematicians.

The paper is divided into several sections. Section 2 discusses representation of chemical molecules as graphs. Section 3 covers complexity measures for computer algorithms. Section 4 surveys what is known about the structure identification problem in general. Section 5 solves the problem for molecules without rings. Section 6 gives a method for analyzing a molecule by systematically breaking it into smaller parts. Section 7 discusses the case of "planar" molecules. Section 8 outlines a complete method for structure identification, and mentions some further applications of the ideas contained herein to chemistry.

2.

Molecules and Their Representation.

Consider a hypothetical chemical information system which performs the following tasks. If a chemist asks the system about a certain molecule, the system will respond with the information it has concerning that molecule. If the chemist asks for a listing of all molecules which satisfy certain properties (such as containing certain radicals), the system will respond with all such molecules known to it. If the chemist asks for a listing of possible molecules (known or not), which satisfy certain properties, the system will provide a list.

Such an information system must be able to identify molecules on the basis of their structure. Given a molecule, the system must derive a unique code for the molecule, so that the code can be looked up in a table and the properties of the molecule located. It is this coding or cataloging problem which I want to consider here. A number of codes for molecules have been proposed and used; e.g. see (1,2,3,4). The existence of many different codes with no single standard suggests the importance and the difficulty of the problem. I shall attempt to explain why the problem is difficult, and to suggest some computer approaches to it.

To deal with the problem in a rigorous fashion, we couch it within the branch of mathematics called graph theory. A graph G = (V, E) is a finite collection V of vertices and a finite collection E of edges. Each edge (v,w) consists of an unordered pair of distinct vertices. Each edge and each vertex may in addition have a label specifying certain information


about it. We represent a chemical molecule as a graph by constructing one vertex for each atom and one edge for each chemical bond; a ball-and-stick model of a molecule is really a graph representation of it. We label each vertex with the type of atom it represents. See Figure 1 for an example.

Two vertices v and w of a graph are said to be adjacent if (v,w) is an edge of the graph. If (v,w) is an edge, and v is a vertex contained in it, the edge and vertex are said to be incident. Two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are said to be isomorphic if their vertices can be identified in a one-to-one fashion so that, if $v_1$ and $w_1$ are vertices in $G_1$ and $v_2$ and $w_2$ are the corresponding vertices in $G_2$, then $(v_1, w_1)$ is an edge of $G_1$ if and only if $(v_2, w_2)$ is an edge of $G_2$. Furthermore the pairs $(v_1, w_1)$, $(v_2, w_2)$; $v_1$, $v_2$; and $w_1$, $w_2$ must have the same labels if the graphs are labelled.

The problem we shall consider is this: given two graphs, determine if they are isomorphic. Or: given a graph, construct a code for it such that two graphs have the same code if and only if they are isomorphic.

Notice that this mathematical abstraction of chemical structure identification neglects some details of chemistry. For instance, we allow bonds between only two atoms, thereby precluding the representation of resonance structures, and we ignore issues of stereochemistry (if two bonds of a carbon atom are fixed, our model allows free interchanging of the other two, whereas in the real world such interchanging may produce stereoisomers; see Figure 2). However, these are differences of detail only, which can easily be incorporated into the model; we neglect them only for simplicity. Note also that our model does not allow loops (edges of the form (v,v)), but it does allow multiple edges (which may be used to represent multiple bonds, or for other purposes).

A generalization of the isomorphism problem is the subgraph isomorphism problem. Given two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, we say $G_1$ is a subgraph of $G_2$ if $V_1$ is a subset of $V_2$ and $E_1$ is a subset of $E_2$. The subgraph isomorphism problem is that of determining if a given graph $G_1$ is isomorphic to a subgraph of another given graph $G_2$. This is one of the problems our hypothetical information system must solve to provide a list of molecules containing certain radicals. We shall deal with this problem briefly; it seems to be much harder than the isomorphism problem.

If a computer is to efficiently encode molecules it must first have a way to represent a molecule, or a graph. We consider


Figure 1. Graphic representation of benzene
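The atoms-as-vertices, bonds-as-edges representation can be sketched in Python as follows. The figure itself is not reproduced here, so several choices are assumptions made only for illustration: hydrogen atoms are included as vertices, one Kekulé form is used with each double bond modelled as a pair of parallel edges, and the vertex names (`C0`, `H0`, ...) and the helper names `benzene` and `degree` are hypothetical.

```python
from collections import Counter


def benzene():
    """Return (labels, edges) for one Kekule form of benzene, C6H6.

    labels maps each vertex name to its atom type; edges is a list of
    unordered vertex pairs, where a repeated pair models a multiple bond.
    """
    labels = {}
    edges = []
    for i in range(6):
        labels[f"C{i}"] = "C"
        labels[f"H{i}"] = "H"
        edges.append((f"C{i}", f"H{i}"))      # C-H single bond
    for i in range(6):
        a, b = f"C{i}", f"C{(i + 1) % 6}"
        edges.append((a, b))                  # ring bond
        if i % 2 == 0:
            edges.append((a, b))              # parallel edge: double bond
    return labels, edges


def degree(edges):
    """Count edge endpoints at each vertex (multiple edges count separately)."""
    d = Counter()
    for v, w in edges:
        d[v] += 1
        d[w] += 1
    return d


labels, edges = benzene()
deg = degree(edges)
```

With multiple edges counted separately, the degree of every carbon vertex in this sketch equals its valence of four, which is a convenient sanity check on a hand-built molecule graph.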

Figure 2. Stereoisomers
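The definition of isomorphism, and the way the graph model ignores stereochemistry, can be illustrated with a brute-force test in Python. This is only a sketch under stated assumptions, not one of the methods discussed in this paper: it tries every one-to-one correspondence of vertices, so it is usable only for very small graphs; edges are assumed unlabelled, with multiple bonds expressed as repeated edges; and the names `edge_multiset` and `isomorphic`, as well as the example molecule, are hypothetical.

```python
from collections import Counter
from itertools import permutations


def edge_multiset(edges, rename=None):
    """Count unordered vertex pairs, optionally renaming vertices first."""
    rename = rename or {}
    return Counter(
        frozenset((rename.get(v, v), rename.get(w, w))) for v, w in edges
    )


def isomorphic(labels1, edges1, labels2, edges2):
    """Brute-force isomorphism test over all vertex correspondences."""
    if len(labels1) != len(labels2) or len(edges1) != len(edges2):
        return False
    v1 = sorted(labels1)
    target = edge_multiset(edges2)
    for perm in permutations(sorted(labels2)):
        mapping = dict(zip(v1, perm))
        if any(labels1[v] != labels2[mapping[v]] for v in v1):
            continue                      # vertex labels must agree
        if edge_multiset(edges1, mapping) == target:
            return True                   # edges correspond one-to-one
    return False


# A carbon with four different substituents; the second graph has two
# substituents interchanged, as in a mirror-image stereoisomer.
labels_a = {1: "C", 2: "H", 3: "F", 4: "Cl", 5: "Br"}
labels_b = {1: "C", 2: "H", 3: "F", 4: "Br", 5: "Cl"}
star = [(1, 2), (1, 3), (1, 4), (1, 5)]
same = isomorphic(labels_a, star, labels_b, star)

# Two carbon skeletons with four atoms each: a chain and a branched form.
carbons = {1: "C", 2: "C", 3: "C", 4: "C"}
chain = [(1, 2), (2, 3), (3, 4)]
branched = [(1, 2), (1, 3), (1, 4)]
different = isomorphic(carbons, chain, carbons, branched)
```

Here `same` is true because the model cannot distinguish the two arrangements around the central carbon, while `different` is false because no correspondence of vertices maps the chain onto the branched skeleton.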


two standard ways to represent graphs in a computer. The first is by an adjacency matrix. If G = (V, E) is a graph with n vertices numbered from 1 to n, an adjacency matrix for G is the n by n matrix $M = (m_{ij})$ with elements 0 and 1, such that $m_{ij} = 1$ if $(v_i, v_j)$ is an edge of G and $m_{ij} = 0$ otherwise. See Figure 3(a), (b). Note that M is symmetric and that its main diagonal is zero. The matrix M is not a code for G since it is not unique; it depends upon the vertex numbering.

An adjacency matrix representation of a graph has several nice properties. Many natural graph operations correspond to standard matrix operations (see (5) for some examples). The bits of M can be packed in groups into computer words, so that storage of M requires only $n^2/w$ words, if w is the word length of the machine (or only $n^2/2w$ words, if advantage is taken of the symmetry of M). If M is packed into words in this way, the bits can be processed w at a time, at least in certain kinds of computations.

However, the matrix representation has some serious disadvantages. An important property of graphs representing chemical molecules is that they are sparse; most of the potential edges are missing. Since each atom has a fixed, small valence, the number of edges in a graph representing a molecule is no more than some fixed constant times n, the number of vertices. However, in an arbitrary graph the number of edges can be as large as $(n^2 - n)/2$ (or larger, if there are multiple edges). An adjacency matrix for a sparse graph contains mostly zeros, but there is no good way of exploiting this fact. It has been proved that testing many graph properties, including isomorphism, requires examining some fixed fraction of the elements of the adjacency matrix in the worst case (6). Any algorithm which uses a matrix representation of a graph thus runs in time proportional to at least $n^2$ in the worst case. If we wish to deal with large graphs and hope to get a running time close to linear in the size of the graph, we must use a different representation. The one we choose is an adjacency structure.

An adjacency structure for a graph G = (V, E) is a set of lists, one for each vertex. The list for vertex v contains all vertices adjacent to v. Note that a given edge (v,w) is represented twice; w appears in the adjacency list for v and v appears in the adjacency list for w. See Figure 3(c). An adjacency structure is surprisingly easy to define and manipulate in FORTRAN or any other standard programming language. We use three arrays, which we may call adjacent to, vertex, and next. For any vertex v, the element $e_1$ = adjacent to(v) represents the first element on the adjacency list for vertex v. The corresponding vertex is vertex($e_1$), and the element


1: 2, 4, 6
2: 1, 3, 6
3: 2, 4, 5
4: 1, 3, 5
5: 3, 4, 6
6: 1, 2, 5

(c)

Figure 3. Graphic representations: (a) graph, (b) adjacency matrix, (c) adjacency structure, and (d) array representation of adjacency structure
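The two representations can be sketched in Python for the six-vertex graph of Figure 3(c). Several details are assumptions made for illustration: the array contents of Figure 3(d) are not legible in this copy, so the slot layout below is not the figure's; the null element is taken to be 0, with slot 0 of each array left unused; the entry 4 in the list for vertex 5 is inferred from the symmetry of the lists; and the function names are hypothetical.

```python
# Adjacency lists as read from Figure 3(c); vertices are numbered from 1.
ADJ = {1: [2, 4, 6], 2: [1, 3, 6], 3: [2, 4, 5],
       4: [1, 3, 5], 5: [3, 4, 6], 6: [1, 2, 5]}

NULL = 0  # assumed end-of-list marker


def build_arrays(adj):
    """Build the three arrays: adjacent_to, vertex, and next (here nxt)."""
    n = len(adj)
    adjacent_to = [NULL] * (n + 1)   # adjacent_to[v]: first element for v
    vertex = [NULL]                  # vertex[e]: the neighbor stored at e
    nxt = [NULL]                     # nxt[e]: next element, or NULL
    for v in range(1, n + 1):
        prev = NULL
        for w in adj[v]:
            vertex.append(w)
            nxt.append(NULL)
            e = len(vertex) - 1      # index of the element just created
            if prev == NULL:
                adjacent_to[v] = e   # first element on the list for v
            else:
                nxt[prev] = e        # link from the previous element
            prev = e
    return adjacent_to, vertex, nxt


def neighbors(v, adjacent_to, vertex, nxt):
    """Walk the list for v, following next until the null element."""
    out = []
    e = adjacent_to[v]
    while e != NULL:
        out.append(vertex[e])
        e = nxt[e]
    return out


def adjacency_matrix(adj):
    """Return the n by n 0/1 matrix for the same graph (0-based indices)."""
    n = len(adj)
    m = [[0] * n for _ in range(n)]
    for v, ws in adj.items():
        for w in ws:
            m[v - 1][w - 1] = 1
    return m


adjacent_to, vertex, nxt = build_arrays(ADJ)
M = adjacency_matrix(ADJ)
```

For this graph the adjacency structure needs 18 list elements (two per edge), while the matrix needs 36 entries regardless of how many edges are present; the gap widens as n grows and the graph stays sparse.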


$e_2$ = next($e_1$) represents the next element on the list. A null element indicates the end of the list. See Figure 3(