An Intelligent Sketch Pad as Input to Molecular ... - ACS Publications

14

An Intelligent Sketch Pad as Input to Molecular Structure Programs

Carl Trindle

Artificial Intelligence Applications in Chemistry. Downloaded from pubs.acs.org by UNIV LAVAL on 09/23/15. For personal use only.

Chemistry Department, University of Virginia, Charlottesville, VA 22901

The programming and manipulation of chemical graphs is awkward in most familiar programming languages. LISP, the Esperanto of artificial intelligence research, makes possible a representation of chemical structural formulas which is much more nearly analogous to the chemist's view of such graphs. This is a considerable computational advantage as well as a convenience for the user. We have designed a "functional fragment" representation of structural formulas, applicable to any molecule, which will resolve a crude sketch of a chemical structure into a list of fundamental fragments. Exploiting the PROPERTY feature of LISP and the distance geometry algorithms of Crippen we can recover Cartesian coordinates for each atom, suitable for input to molecular mechanics programs, or to ab initio electronic structure packages. Besides local geometries, the intelligent sketchpad can contain any local properties, including bond types and strengths, chromophore optical spectra, and nuclear magnetic resonance and infrared spectra characteristic of a local chemical environment.

Computational chemists have developed several remarkably powerful and reliable computer codes, capable of describing the relative stability of various conformations of macromolecules, and details of the electronic structure of molecules of more modest size (1). The properties of molecules which can be obtained by use of these programs correlate with important features of chemical reactivity and the properties of materials. Molecular design, in pharmaceuticals, photochemistry, and general materials science can be made much more efficient by the routine use of these computational systems. However, their use is at present not widespread; it is limited to a few large chemical companies.

0097-6156/86/0306-0159$06.00/0 © 1986 American Chemical Society


One of the obstacles to wider use of the well-tested and powerful programs such as Allinger's molecular mechanics (2) and Pople's GAUSS80 (3) is that the programs require such elaborate and awkward input. Users must ordinarily prepare a list of Cartesian coordinates of each atom. This is cumbersome for molecules of even moderate size. But more significantly, the chemist's powerful sense of three-dimensional molecular structure is never expressed in Cartesian coordinates. Instead chemists think more naturally of "internal coordinates," that is, bond lengths, primary valence angles, and local dihedral angles. Of course a full set of internal coordinates defines in principle the set of Cartesian coordinates (4). Unfortunately, the usual algorithms for generating Cartesian coordinates from internal coordinates are sensitive to small errors. These errors accumulate and can perpetrate enormities such as leaving rings unclosed, or forcing unrealistically short separations between nonbonded atoms. In the chemist's conceptual picture, realistic bond distances for rings are maintained, even if distortions in normal valence angles are required.

The problem is to transform the pictorial view of molecules which is the daily companion of the chemist, to the numerical form required by programs, WITHOUT FORCING THE USER TO EFFECT THE TRANSLATION. We must not ask the chemist to do much more than identify the atoms, their connectivity, and some gross features of the stereochemistry. The structural formula is the medium by which such simple yet richly evocative information is conveyed. The structural formula does after all suffice for the chemist's work day to day. It should be adequate to convey the essential information to useful computer programs.
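The sensitivity of internal-coordinate conversion to small errors can be illustrated with a minimal Python sketch. It is not the algorithm of reference (4); it assumes a simple chain-building convention in which each new atom is placed from a bond length, a valence angle, and a dihedral angle relative to the three preceding atoms. The function names `place_atom` and `ring_closure_distance`, the bond length of 1.40 Å, and the test angles are all chosen for this illustration.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _unit(a):
    norm = math.sqrt(sum(x * x for x in a))
    return tuple(x / norm for x in a)

def place_atom(a, b, c, r, theta_deg, phi_deg):
    """Place atom d bonded to c so that |cd| = r, angle b-c-d = theta,
    and dihedral a-b-c-d = phi (0 degrees puts d cis to a)."""
    theta = math.radians(theta_deg)
    phi = math.radians(phi_deg)
    bc = _unit(_sub(c, b))                 # direction of the b->c bond
    n = _unit(_cross(_sub(b, a), bc))      # normal to the a-b-c plane
    m = _cross(n, bc)                      # in-plane axis perpendicular to bc
    along = -r * math.cos(theta)
    in_plane = r * math.sin(theta) * math.cos(phi)
    out_of_plane = r * math.sin(theta) * math.sin(phi)
    return tuple(c[i] + along * bc[i] + in_plane * m[i] + out_of_plane * n[i]
                 for i in range(3))

def ring_closure_distance(n_atoms, r, theta_deg):
    """Build a planar chain (all dihedrals 0) atom by atom and return the
    distance between the last atom and the first, i.e. the closing bond."""
    theta = math.radians(theta_deg)
    atoms = [(0.0, 0.0, 0.0),
             (r, 0.0, 0.0),
             (r - r * math.cos(theta), r * math.sin(theta), 0.0)]
    while len(atoms) < n_atoms:
        atoms.append(place_atom(atoms[-3], atoms[-2], atoms[-1],
                                r, theta_deg, 0.0))
    return math.dist(atoms[-1], atoms[0])

# Six-membered planar ring: exact 120 degree angles versus a 2 degree error.
ideal = ring_closure_distance(6, 1.40, 120.0)
skewed = ring_closure_distance(6, 1.40, 118.0)
print(f"closing distance with 120 degree angles: {ideal:.3f}")
print(f"closing distance with 118 degree angles: {skewed:.3f}")
```

With exact angles the sixth atom lands one bond length from the first, so the ring closes. With every valence angle off by two degrees the error compounds along the chain and the closing distance departs from the bond length by well over 0.1 Å, which is the kind of unclosed ring the text describes.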
There will be two major stages to the translation of information from the chemist's pictorial image to the rigidly formatted input file required by molecular mechanics or molecular orbital programs. First the sketch is impressed on a digitizing tablet (perhaps as simple as a Koala Pad (R), or a more accurate digitizing tablet). Then the graph must be interpreted and a trial geometry generated.


Accepting the Sketch. The (computationally) most convenient way to enter a structural diagram is to use a digitizing tablet with a mouse or stylus. Our experience has been with the Houston Instruments HIPAD (R). The software accompanying this (and most ordinary) digitizing tablet accepts and stores local coordinates of particular points, and a set of pointers designating which vertices are to be connected (5). In this way the molecular topology can be specified with no novel analysis or programming. It would be more interesting from the point of view of Artificial Intelligence research to interpret a sketch already on paper, by the analysis of dark and light elements (6). We have made only small progress in this task, but some preliminary remarks can make the difficulties clear. The field of view is resolved into picture elements, and an optical scanner would assign a numerical value corresponding to the darkness of the sketch at that location. Heavy lines would be easy to recognize, by the sequence of adjacent dark spots detected by the scanner. Intersections might be harder to recognize if the grid is coarse, but knowledge of the existence of lines could guide the search, by estimates of the intersections by extrapolation. A planar graph (with no crossing lines) would


seem to present few difficulties. Vertices representing generalized atoms (that is, "Me" in place of a fully detailed methyl group) would have to be more carefully specified. The chemist would use Berzelius-notation capital letters for labels, which would have to be interpreted. This is a hard task, as the post office has learned. It would be essential for the system to realize when identification of a vertex is impossible or ambiguous, and request guidance from the user. Figure 1 shows how vertices are specified. It will be necessary to distinguish the strokes which identify single or multiple bonds from the strokes denoting lone pairs, and it will be required to supply missing hydrogens and lone pairs which are often omitted from casual sketches. This latter problem will also be encountered if the sketch is input directly by the digitizing tablet. We return to that line of approach.

Preliminary Processing of the Sketch. Even at this early stage, before different atoms are distinguished and hydrogens are fully expressed, we have much of the information needed for some kinds of analysis. All of the graph-theoretic analysis of pi systems (7), which may be considered to be based on the Hückel model, uses no more than the connectivity between equivalent centers.
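How little is needed at this stage can be shown with a short Python sketch. It takes a tablet-style record (vertex coordinates plus pairs of connected vertices) and produces neighbor lists and the adjacency matrix on which Hückel-type graph analysis operates. The record layout and the names `neighbor_lists` and `adjacency_matrix` are assumptions made for this example and do not reproduce the format of the HIPAD software.

```python
def neighbor_lists(points, connections):
    """Turn connection pointers into a neighbor list for every vertex."""
    neighbors = {label: [] for label in points}
    for i, j in connections:
        neighbors[i].append(j)
        neighbors[j].append(i)
    return neighbors

def adjacency_matrix(neighbors):
    """Expand neighbor lists into a square 0/1 adjacency matrix.
    Rows and columns follow the sorted vertex labels."""
    labels = sorted(neighbors)
    index = {label: k for k, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for label, bonded in neighbors.items():
        for other in bonded:
            matrix[index[label]][index[other]] = 1
    return matrix

# Hypothetical tablet record for a four-carbon chain (a butadiene skeleton).
# The coordinates are arbitrary tablet units; only the connections matter here.
points = {1: (10, 40), 2: (25, 50), 3: (40, 40), 4: (55, 50)}
connections = [(1, 2), (2, 3), (3, 4)]

neighbors = neighbor_lists(points, connections)
for row in adjacency_matrix(neighbors):
    print(row)
```

The coordinates are carried along but never consulted: connectivity alone fixes the matrix, which is the point made in the text.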
However powerful the graph theory has been, it cannot be denied that it suppresses much of the detail expressed in the structural diagram. Therefore we will not be content to stop at this stage. It will be necessary at minimum to define the type of atom present at each vertex. We reduce the labor necessary for this specification by (a) suppressing hydrogens in the preliminary sketch; and (b) assuming as a default that each vertex represents a carbon atom, requiring an amendment only for heavy atoms. Our software is responsible for filling in hydrogens. This process is frequently ambiguous, given only the skeleton of heavy atoms. Therefore the computer system will sometimes interrogate the user for the number of hydrogen atoms at each vertex. With this information the task of completing a Lewis structure is left to the software, which is at least as capable of this task as the average first-year student. This is the first task that requires anything resembling Artificial Intelligence, so a few remarks on the design may not be out of place.

A Routine to Assign Lewis Structures. The procedure for assigning Lewis structures is familiar (8). Given the set of atoms, one must sum the valence electrons.
In our LISP system, each ATOM can be assigned PROPERTIES which may include the number of valence electrons it contributes to the molecule, and equally important, its set of NEIGHBORS by which the skeleton of the molecule is specified. Each such link is assigned a pair of valence electrons, and a census is kept of electron pairs in the vicinity of each atom. Among the PROPERTIES of each atom is an estimate of its electronegativity, and the program assigns electron pairs to fill octets using the electronegativity to set priority. The last step is most "difficult." For each of those atoms which lack a full octet, the system must look among the NEIGHBORS for atom(s) possessing a lone pair which it might share. Of all those potential donors, one chooses the atom with the most negative formal charge. The multiple bond


A Dialogue Accompanying the Entry of a Molecule of Moderate Complexity

SPECIFY NON-CARBON VERTICES:
NUMBER: 1     TYPE: nitrogen
NUMBER: 8     TYPE: oxygen
NUMBER: 10    TYPE: oxygen
NUMBER: 12    TYPE: oxygen
NUMBER: 0
SPECIFY NET CHARGE: *1
HYDROGENS AT VERTEX ... (eleven prompts; the vertex numbers and responses are only partly legible)

Figure 1. All vertices are first assumed to be CARBON. The system requests that the user specify non-CARBON vertices; it will build a set of user's abbreviations.


is represented by the appearance of the donor several times in the (revised) NEIGHBOR list. When the Lewis structure routine finds an ambiguity which we would represent by a set of resonance structures, it reports that fact and chooses the first legal structure for further processing. Figure 2 shows the procedure in practice.

Representation of the Molecule in LISP. We have used the chemist's sketch, or its Lewis structure equivalent, as the model of a data structure in LISP (9). This language has the flexibility needed to express an essentially non-numerical object, in terms of lists. LISP will permit us to organize molecular structure information in a way that mimics the human expert's knowledge. To accomplish this representation, we must develop a clear idea how the chemist assimilates the information provided directly and explicitly by the sketch, and how the properties of the molecule are recalled to the chemist's awareness.

The structural formula at minimum identifies the atoms and their connectivity. This hardly seems to be adequate in complexity to express much molecular information. This apparent paradox is resolved when we recognize that the chemist brings much of his experience to the task of interpreting the sketch, and much of the information is evoked rather than transmitted by means of the structural formula. The atoms' names, such as carbon or nitrogen, call up a flood of associations which (although they are almost never written explicitly in the chemist's sketch) are nonetheless part of the information it can summon. Among these data are the atomic mass, typical valencies, local geometry, perhaps a van der Waals radius, and a guide to chemical behavior, its "electronegativity."

The connectivity can define some aspects of the geometry in a useful semiquantitative way. The chemist has a very reliable idea of the range of bond lengths; CC(single), 1.54 Å; CC(double), 1.33 Å, etc. By counting connections and recognizing the atoms being connected, one can assign good estimates of the distances between directly bonded atoms. The chemist's knowledge of molecular geometry extends beyond typical values of bond distances. He will also be able to predict many bond angles fairly accurately. This is equivalent to specifying a 1-3 nonbonded interatomic distance. The chemist's sketch portrays cis and trans isomerization, syn and anti, and gauche conformations which specify either torsion angles, or indirectly, a 1-4 nonbonded distance. Besides primary bond distances and angles, and some special cases of torsional and dihedral angles, the chemist knows more global features of molecular geometry. However, such knowledge becomes more and more fragmentary; the longest distances in a molecule are most poorly defined.

A LISP Structural Recognizer. A molecule is represented in our LISP program first as a list of atoms. A numbering scheme assigns a unique label to each atom. Each atom has a collection of PROPERTIES; foremost among them is its generic NAME. The name CARBON carries with it a van der Waals RADIUS and a VALENCE. Other properties can be added as desired. The major feature of a molecular sketch is the topology or


connectivity of the molecule. This is expressed as the property NEIGHBOR for each atom. This is just the set of labels of other atoms connected to a particular atom. The NEIGHBOR property is a compact way to store the adjacency matrix used in graph theory.

The chemist's sketch, processed into the list representation just described, is not yet very valuable; the system at the moment is very ignorant of the structure of the molecule in question. But the chemist knows much of the molecule from little more than the diagram. How does the chemist "see" a complex molecular diagram? In our judgement a chemist knows so much about a molecule because he recognizes recurrent fragments of moderate size. Rings of varying atomic composition, structure, and size ranging from carbonyl groups to steroid systems, are recognized at a glance. Many stereochemically well-defined fragments, such as spiro and norbornyl systems, are part of the chemist's conceptual toolkit. Our programming task is to assure that our system recognizes such fragments, with all the associated information on their structure and properties, with ease. Somehow we must discern the presence of meaningful, familiar fragments in the molecular list.

We mimic this stock of informative portions of molecules in our LISP system by lists called FRAGMENTS. The FRAGMENTS, permanent members of a growing data base, each contain a set of ATOMS and a NEIGHBOR list for each atom identifying the connectivity. Besides this topological information, the fragments contain as PROPERTIES a stock of attributes of the fragments. The first collection of PROPERTIES we gathered were interatomic distances gleaned from crystal structures. All interatomic distances are defined within a fragment.
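A fragment record of this kind can be mimicked in Python with nested dictionaries standing in for LISP property lists. The sketch below is an illustration only: the key names follow the capitalized property names in the text, but the layout, the helper names `one_three_distance` and `fully_defined`, and the 120 degree angle are assumptions. The two bond lengths are the typical values quoted earlier rather than crystal-structure data, and the 1-3 distance is derived from them by the law of cosines, since a bond angle is equivalent to a 1-3 distance.

```python
import math
from itertools import combinations

def one_three_distance(d12, d23, angle_deg):
    """Law of cosines: distance between atoms 1 and 3, both bonded to atom 2,
    given the two bond lengths and the 1-2-3 valence angle."""
    angle = math.radians(angle_deg)
    return math.sqrt(d12 ** 2 + d23 ** 2 - 2.0 * d12 * d23 * math.cos(angle))

# A three-carbon fragment C=C-C. A double bond is recorded by listing the
# partner twice in the NEIGHBOR list, as the Lewis routine does.
PROPENE_SKELETON = {
    "NAME": "C=C-C",
    "ATOMS": {1: "CARBON", 2: "CARBON", 3: "CARBON"},
    "NEIGHBOR": {1: [2, 2], 2: [1, 1, 3], 3: [2]},
    "DISTANCE": {
        (1, 2): 1.33,                                   # CC(double), angstroms
        (2, 3): 1.54,                                   # CC(single), angstroms
        (1, 3): one_three_distance(1.33, 1.54, 120.0),  # assumed 120 degree angle
    },
}

def fully_defined(fragment):
    """True if every pair of atoms in the fragment has a stored distance."""
    pairs = combinations(sorted(fragment["ATOMS"]), 2)
    return all(pair in fragment["DISTANCE"] for pair in pairs)

print(fully_defined(PROPENE_SKELETON))
print(round(PROPENE_SKELETON["DISTANCE"][(1, 3)], 2))
```

Storing distances by atom pair keeps the fragment self-contained: once a fragment is located in a molecule, its whole distance table can be copied across under the atom-to-atom correspondence.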
The system can now assign many (though not all) interatomic distances in an arbitrary molecule if fragments can be discerned within the sketch. We have developed a search technique which will scan the MOLECULE and locate all fragments. Design of this recognition algorithm is difficult. The search routine shares some of the features of the "knapsack problem," a classic difficulty in computer science. We expect that we will be able to speed this step considerably. At present we scan all stored fragments, though that is not the way an expert would proceed. We screen out many fragments by a superficial test that the atoms in the fragment must be a subset of the atoms in the molecule. The fragments are subjected to more and more thorough tests, until recognition is complete. These tests are essentially recursive applications of the requirement that if a fragment is to be identified in a molecule, the environment of each atom in the fragment must be found in the molecule for the corresponding atom. More detail on the search condition may be found in a previous article (10). Figure 3 shows a typical fragment representation.

In this first formulation we have already established that it is most effective to scan the largest candidate fragments first. It is desirable to recognize overlapping fragments; more distances are determined. However, it is inevitably the case that a substantial number of distances will be left undefined, particularly the longest distances which would not be incorporated into a fragment.

Distance Geometry Changes Distances to Cartesian Coordinates. Most experimental measures of molecular geometry provide quantities which may be most directly interpreted as defining interatomic distances.
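The link between fragment recognition and the distance table handed on to distance geometry can be sketched in Python. This is a plain backtracking search, not the search condition of reference (10): it applies the superficial atom-count screen, then requires each bond of the fragment to appear with the same multiplicity in the molecule, scans the largest fragments first, and lets overlapping matches contribute distances. The dictionary layout, the function names `find_embeddings` and `collect_distances`, and the numerical distances are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def find_embeddings(fragment, molecule):
    """Yield every mapping {fragment atom: molecule atom} that preserves atom
    names and the multiplicity of each bond in the fragment."""
    need = Counter(fragment["ATOMS"].values())
    have = Counter(molecule["ATOMS"].values())
    if any(have[name] < count for name, count in need.items()):
        return  # superficial screen: the fragment's atoms are not all present
    order = list(fragment["ATOMS"])

    def extend(mapping):
        if len(mapping) == len(order):
            yield dict(mapping)
            return
        f = order[len(mapping)]
        for m in molecule["ATOMS"]:
            if m in mapping.values():
                continue
            if molecule["ATOMS"][m] != fragment["ATOMS"][f]:
                continue
            placed = [g for g in set(fragment["NEIGHBOR"][f]) if g in mapping]
            if all(molecule["NEIGHBOR"][m].count(mapping[g])
                   == fragment["NEIGHBOR"][f].count(g) for g in placed):
                mapping[f] = m
                yield from extend(mapping)
                del mapping[f]

    yield from extend({})

def collect_distances(molecule, fragments):
    """Copy fragment distances onto the molecule, largest fragments first,
    and report the atom pairs that remain undefined."""
    known = {}
    by_size = sorted(fragments, key=lambda fr: len(fr["ATOMS"]), reverse=True)
    for fragment in by_size:
        for mapping in find_embeddings(fragment, molecule):
            for (i, j), d in fragment["DISTANCE"].items():
                pair = tuple(sorted((mapping[i], mapping[j])))
                known.setdefault(pair, d)  # the first assignment is kept
    all_pairs = combinations(sorted(molecule["ATOMS"]), 2)
    undefined = [pair for pair in all_pairs if pair not in known]
    return known, undefined

# Illustrative data: a C=C-C fragment and the heavy-atom skeleton C=C-C=C.
PROPENE_SKELETON = {
    "ATOMS": {"a": "CARBON", "b": "CARBON", "c": "CARBON"},
    "NEIGHBOR": {"a": ["b", "b"], "b": ["a", "a", "c"], "c": ["b"]},
    "DISTANCE": {("a", "b"): 1.33, ("b", "c"): 1.54, ("a", "c"): 2.49},
}
BUTADIENE_SKELETON = {
    "ATOMS": {1: "CARBON", 2: "CARBON", 3: "CARBON", 4: "CARBON"},
    "NEIGHBOR": {1: [2, 2], 2: [1, 1, 3], 3: [2, 4, 4], 4: [3, 3]},
}

known, undefined = collect_distances(BUTADIENE_SKELETON, [PROPENE_SKELETON])
print("assigned:", known)
print("undefined:", undefined)
```

In this small case the fragment is found twice, overlapping at the central bond, and every distance except the end-to-end one is assigned. The pair left undefined is the longest distance in the skeleton, in line with the remark above that the longest distances are the ones no fragment supplies.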


[Figure 2: program output from the Lewis structure assignment. Only fragments are legible in the source: a FORMULA line, a computed count of valence electron pairs, the pairs ASSIGNED TO LINKS, per-vertex lines of the form "VERTEX n ASSIGNED k PAIR(S)" (vertex 10, 2 pairs; vertex 12, 3 pairs; vertices 1, 2, and 3, 1 pair each), and lines reporting vertices SHARING PAIR(S). The caption is truncated.]
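The Lewis structure procedure described in the text can be followed step by step in a minimal Python sketch. It is an illustration under stated assumptions, not the author's LISP routine: the function name `assign_lewis`, the dictionary layout, the restriction to H, C, N, and O, and the use of Pauling electronegativities are all choices made here. The sketch also assumes an even electron count, a target of four pairs around each heavy atom, and one pair around hydrogen, and it takes the first legal structure without reporting resonance alternatives.

```python
VALENCE_ELECTRONS = {"H": 1, "C": 4, "N": 5, "O": 6}
ELECTRONEGATIVITY = {"H": 2.20, "C": 2.55, "N": 3.04, "O": 3.44}  # Pauling scale

def assign_lewis(elements, neighbors, net_charge=0):
    """Return (revised neighbor lists, lone-pair counts) for a skeleton.
    elements:  {label: element symbol}
    neighbors: {label: [labels of bonded atoms]}, hydrogens included."""
    neighbors = {a: list(ns) for a, ns in neighbors.items()}
    total = sum(VALENCE_ELECTRONS[e] for e in elements.values()) - net_charge
    pairs = total // 2                       # assumes an even electron count
    links = sum(len(ns) for ns in neighbors.values()) // 2
    free = pairs - links                     # one pair is assigned to each link
    lone = {a: 0 for a in elements}

    def target(a):
        return 1 if elements[a] == "H" else 4

    def census(a):                           # pairs in the vicinity of atom a
        return lone[a] + len(neighbors[a])

    def formal_charge(a):
        return VALENCE_ELECTRONS[elements[a]] - 2 * lone[a] - len(neighbors[a])

    # Fill octets with the remaining pairs, most electronegative atoms first.
    by_priority = sorted(elements, key=lambda a: -ELECTRONEGATIVITY[elements[a]])
    for a in by_priority:
        give = max(0, min(free, target(a) - census(a)))
        lone[a] += give
        free -= give

    # Atoms still short of an octet borrow a lone pair from a neighbor,
    # preferring the donor with the most negative formal charge.
    changed = True
    while changed:
        changed = False
        for a in elements:
            if census(a) >= target(a):
                continue
            donors = [d for d in sorted(set(neighbors[a])) if lone[d] > 0]
            if not donors:
                continue
            donor = min(donors, key=formal_charge)
            lone[donor] -= 1
            neighbors[a].append(donor)       # the donor now appears again
            neighbors[donor].append(a)
            changed = True
    return neighbors, lone

# Formaldehyde, H2C=O: vertex 1 is carbon, 2 is oxygen, 3 and 4 are hydrogens.
elements = {1: "C", 2: "O", 3: "H", 4: "H"}
skeleton = {1: [2, 3, 4], 2: [1], 3: [1], 4: [1]}
revised, lone_pairs = assign_lewis(elements, skeleton)
print(revised)
print(lone_pairs)
```

For formaldehyde the oxygen first receives the three free pairs, the carbon is left one pair short, and the oxygen then shares one pair with it. The double bond shows up as oxygen listed twice among carbon's neighbors, which is how the text says a multiple bond is represented in the revised NEIGHBOR list.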