1
Graph Algorithms in Chemical Computation
ROBERT ENDRE TARJAN*
Downloaded by 190.6.22.186 on October 5, 2015 | http://pubs.acs.org Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0046.ch001
Computer Science Dept., Stanford University, Stanford, CA 94305
1. Introduction.
The use of computers in science is widespread. Without powerful number-crunching facilities at his** disposal, the modern scientist would be greatly handicapped, unable to perform the thousands or millions of calculations required to analyze his data or explore the implications of his favorite theory. He (or his assistant) thus requires at least some familiarity with computers, the programming of computers, and the methods which might be used by computers to solve his problems. An entire branch of mathematics, numerical analysis, exists to analyze the behavior of numerical algorithms.
However, the typical scientist's appreciation of the computer may be too narrow. Computers are much more than fast adders and multipliers; they are symbol manipulators of a very general kind. A scientist who writes programs in FORTRAN or some similar scientifically oriented computer language may be unaware of the potential use of computers to solve computational, but not necessarily numeric, problems which might arise in his research.
This paper discusses the use of computers to solve nonnumeric problems in chemistry. I shall focus on a particular problem, that of identifying chemical structure, and examine computer methods for solving it. The discussion will include
* This research was partially supported by the National Science Foundation, grant MCS75-22870, and by the Office of Naval Research, contract N00014-76-C-0688.
** For the purpose of smooth reading, I have used the masculine gender throughout this paper.
1 In Algorithms for Chemical Computations; Christoffersen, R.; ACS Symposium Series; American Chemical Society: Washington, DC, 1977.
elements of graph theory, list processing, analysis of algorithms, and computational complexity. I write as a computer scientist, not as a chemist; I shall neglect details of chemistry in order to focus on issues of algorithmic applicability, simplicity, and speed. It is my hope that some readers of this paper will become interested in applying to their own problems in chemistry the methods developed in recent years by computer scientists and mathematicians.
The paper is divided into several sections. Section 2 discusses representation of chemical molecules as graphs. Section 3 covers complexity measures for computer algorithms. Section 4 surveys what is known about the structure identification problem in general. Section 5 solves the problem for molecules without rings. Section 6 gives a method for analyzing a molecule by systematically breaking it into smaller parts. Section 7 discusses the case of "planar" molecules. Section 8 outlines a complete method for structure identification, and mentions some further applications of the ideas contained herein to chemistry.
2. Molecules and Their Representation.
Consider a hypothetical chemical information system which performs the following tasks. If a chemist asks the system about a certain molecule, the system will respond with the information it has concerning that molecule. If the chemist asks for a listing of all molecules which satisfy certain properties (such as containing certain radicals), the system will respond with all such molecules known to it. If the chemist asks for a listing of possible molecules (known or not) which satisfy certain properties, the system will provide a list.
Such an information system must be able to identify molecules on the basis of their structure. Given a molecule, the system must derive a unique code for the molecule, so that the code can be looked up in a table and the properties of the molecule located. It is this coding or cataloging problem which I want to consider here. A number of codes for molecules have been proposed and used; e.g. see (1,2,3,4). The existence of many different codes with no single standard suggests the importance and the difficulty of the problem. I shall attempt to explain why the problem is difficult, and to suggest some computer approaches to it.
To deal with the problem in a rigorous fashion, we couch it within the branch of mathematics called graph theory. A graph G = (V, E) is a finite collection V of vertices and a finite collection E of edges. Each edge (v,w) consists of an unordered pair of distinct vertices. Each edge and each vertex may in addition have a label specifying certain information
about it. We represent a chemical molecule as a graph by constructing one vertex for each atom and one edge for each chemical bond; a ball-and-stick model of a molecule is really a graph representation of it. We label each vertex with the type of atom it represents. See Figure 1 for an example.
Two vertices v and w of a graph are said to be adjacent if (v,w) is an edge of the graph. If (v,w) is an edge, and v is a vertex contained in it, the edge and vertex are said to be incident. Two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are said to be isomorphic if their vertices can be identified in a one-to-one fashion so that, if $v_1$ and $w_1$ are vertices in $G_1$ and $v_2$ and $w_2$ are the corresponding vertices in $G_2$, then $(v_1, w_1)$ is an edge of $G_1$ if and only if $(v_2, w_2)$ is an edge of $G_2$. Furthermore the pairs $(v_1, w_1)$, $(v_2, w_2)$; $v_1$, $v_2$; and $w_1$, $w_2$ must have the same labels if the graphs are labelled.
The problem we shall consider is this: given two graphs, determine if they are isomorphic. Or: given a graph, construct a code for it such that two graphs have the same code if and only if they are isomorphic. Notice that this mathematical abstraction of chemical structure identification neglects some details of chemistry. For instance, we allow bonds between only two atoms, thereby precluding the representation of resonance structures, and we ignore issues of stereochemistry (if two bonds of a carbon atom are fixed, our model allows free interchanging of the other two, whereas in the real world such interchanging may produce stereoisomers; see Figure 2). However, these are differences of detail only, which can easily be incorporated into the model; we neglect them only for simplicity. Note also that our model does not allow loops (edges of the form (v,v)), but it does allow multiple edges (which may be used to represent multiple bonds, or for other purposes).
A generalization of the isomorphism problem is the subgraph isomorphism problem. Given two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, we say $G_1$ is a subgraph of $G_2$ if $V_1$ is a subset of $V_2$ and $E_1$ is a subset of $E_2$. The subgraph isomorphism problem is that of determining if a given graph $G_1$ is isomorphic to a subgraph of another given graph $G_2$. This is one of the problems our hypothetical information system must solve to provide a list of molecules containing certain radicals. We shall deal with this problem briefly; it seems to be much harder than the isomorphism problem.
If a computer is to efficiently encode molecules it must first have a way to represent a molecule, or a graph. We consider
Figure 1. Graphic representation of benzene
Figure 2. Stereoisomers
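The definition of isomorphism given earlier in this section can be checked directly, if very slowly, by trying every one-to-one identification of the vertices. The Python sketch below does exactly that for small vertex-labelled graphs. It is an illustration of the definition only, not a method proposed in this paper: the function names, the 0-based vertex numbering, and the decision to ignore edge labels are all assumptions made here, and the running time grows like n! in the number of vertices n.

```python
from collections import Counter
from itertools import permutations


def edge_multiset(edges, mapping):
    """Count each unordered edge after renaming its endpoints through `mapping`."""
    return Counter(frozenset((mapping[v], mapping[w])) for v, w in edges)


def is_isomorphic(labels1, edges1, labels2, edges2):
    """Brute-force isomorphism test for small vertex-labelled graphs.

    labels1[v] is the label (atom type) of vertex v, with vertices numbered
    from 0. Edges are (v, w) pairs of distinct vertices; repeated pairs model
    multiple bonds. Edge labels are ignored in this sketch.
    """
    n = len(labels1)
    if n != len(labels2) or len(edges1) != len(edges2):
        return False
    identity = list(range(n))
    target = edge_multiset(edges2, identity)
    for perm in permutations(range(n)):
        # perm[v] is the vertex of graph 2 identified with vertex v of graph 1.
        if any(labels1[v] != labels2[perm[v]] for v in range(n)):
            continue  # corresponding vertices must carry the same label
        if edge_multiset(edges1, perm) == target:
            return True  # edges correspond one-to-one under this identification
    return False


# Heavy-atom skeletons only (hydrogens omitted for brevity).
ethanol_a = (["C", "C", "O"], [(0, 1), (1, 2)])   # C-C-O
ethanol_b = (["O", "C", "C"], [(0, 1), (1, 2)])   # O-C-C, numbered differently
ether = (["C", "O", "C"], [(0, 1), (1, 2)])       # C-O-C

print(is_isomorphic(*ethanol_a, *ethanol_b))  # same structure, different numbering
print(is_isomorphic(*ethanol_a, *ether))      # different structure
```

The two ethanol skeletons differ only in vertex numbering, which is why a vertex numbering by itself cannot serve as a code for a molecule.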
two standard ways to represent graphs in a computer. The first is by an adjacency matrix. If G = (V, E) is a graph with n vertices numbered from 1 to n, an adjacency matrix for G is the n by n matrix $M = (m_{ij})$ with elements 0 and 1, such that $m_{ij} = 1$ if $(v_i, v_j)$ is an edge of G and $m_{ij} = 0$ otherwise. See Figure 3(a),(b). Note that M is symmetric and that its main diagonal is zero. The matrix M is not a code for G since it is not unique; it depends upon the vertex numbering.
An adjacency matrix representation of a graph has several nice properties. Many natural graph operations correspond to standard matrix operations (see (5) for some examples). The bits of M can be packed in groups into computer words, so that storage of M requires only $n^2/w$ words, if w is the word length of the machine (or only $n^2/(2w)$ words, if advantage is taken of the symmetry of M). If M is packed into words in this way, the bits can be processed w at a time, at least in certain kinds of computations.
However, the matrix representation has some serious disadvantages. An important property of graphs representing chemical molecules is that they are sparse; most of the potential edges are missing. Since each atom has a fixed, small valence, the number of edges in a graph representing a molecule is no more than some fixed constant times n, the number of vertices. However, in an arbitrary graph the number of edges can be as large as $(n^2 - n)/2$ (or larger, if there are multiple edges). An adjacency matrix for a sparse graph contains mostly zeros, but there is no good way of exploiting this fact. It has been proved that testing many graph properties, including isomorphism, requires examining some fixed fraction of the elements of the adjacency matrix in the worst case (6). Any algorithm which uses a matrix representation of a graph thus runs in time proportional to at least $n^2$ in the worst case.
If we wish to deal with large graphs and hope to get a running time close to linear in the size of the graph, we must use a different representation. The one we choose is an adjacency structure. An adjacency structure for a graph G = (V, E) is a set of lists, one for each vertex. The list for vertex v contains all vertices adjacent to v. Note that a given edge (v,w) is represented twice; w appears in the adjacency list for v and v appears in the adjacency list for w. See Figure 3(c).
An adjacency structure is surprisingly easy to define and manipulate in FORTRAN or any other standard programming language. We use three arrays, which we may call adjacent to, vertex, and next. For any vertex v, the element $e_1$ = adjacent to(v) represents the first element on the adjacency list for vertex v. The corresponding vertex is vertex($e_1$), and the element
(c)
1: 2, 4, 6
2: 1, 3, 6
3: 2, 4, 5
4: 1, 3, 5
5: 3, 4, 6
6: 1, 2, 5
Figure 3. Graphic representations: (a) graph, (b) adjacency matrix, (c) adjacency structure, and (d) array representation of adjacency structure
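As a rough illustration of the three-array scheme described in the text (adjacent to, vertex, and next), the Python sketch below builds the arrays for a small undirected graph and walks one adjacency list. Several details are assumptions made for this example rather than anything stated in the paper: Python lists stand in for FORTRAN arrays, the names `adjacent_to`, `vertex`, `nxt`, `build_adjacency_arrays`, `neighbors`, and `packed_matrix_rows` are invented here, index 0 is used as the null element, new elements are pushed on the front of each list, and the six-vertex edge list is an assumed example. For comparison, each row of the adjacency matrix is also packed into a single integer, a stand-in for the word packing mentioned earlier.

```python
NULL = 0  # assumed end-of-list marker; index 0 of every array is left unused


def build_adjacency_arrays(n, edges):
    """Build the three arrays for an undirected graph on vertices 1..n."""
    adjacent_to = [NULL] * (n + 1)  # adjacent_to[v]: first element of v's list
    vertex = [NULL]                 # vertex[e]: the vertex stored in element e
    nxt = [NULL]                    # nxt[e]: element after e on the same list
    for v, w in edges:
        for a, b in ((v, w), (w, v)):     # every edge is represented twice
            vertex.append(b)
            nxt.append(adjacent_to[a])    # new element goes on the front of a's list
            adjacent_to[a] = len(vertex) - 1
    return adjacent_to, vertex, nxt


def neighbors(v, adjacent_to, vertex, nxt):
    """Walk v's list: e1 = adjacent_to[v], e2 = nxt[e1], ... until NULL."""
    e = adjacent_to[v]
    while e != NULL:
        yield vertex[e]
        e = nxt[e]


def packed_matrix_rows(n, edges):
    """Adjacency matrix with row v packed into one integer (bit w holds m[v][w])."""
    rows = [0] * (n + 1)
    for v, w in edges:
        rows[v] |= 1 << w
        rows[w] |= 1 << v
    return rows


# Assumed example: a six-vertex, nine-edge graph.
EDGES = [(1, 2), (1, 4), (1, 6), (2, 3), (2, 6), (3, 4), (3, 5), (4, 5), (5, 6)]
adjacent_to, vertex, nxt = build_adjacency_arrays(6, EDGES)
rows = packed_matrix_rows(6, EDGES)

for v in range(1, 7):
    from_lists = sorted(neighbors(v, adjacent_to, vertex, nxt))
    from_matrix = [w for w in range(1, 7) if (rows[v] >> w) & 1]
    print(v, from_lists, from_matrix)
```

The list arrays hold two elements per edge plus one entry per vertex, so their size grows with the number of vertices and edges, whereas the matrix always needs on the order of n^2 bits regardless of how sparse the graph is.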
$e_2$ = next($e_1$) represents the next element on the list. A null element indicates the end of the list. See Figure 3(