A Self-Organized Knowledge Base for Recall, Design, and Discovery in Organic Chemistry

Craig S. Wilcox¹ and Robert A. Levinson²,³

¹Department of Chemistry, University of Texas at Austin, Austin, TX 78712
²Department of Computer Science, University of Texas at Austin, Austin, TX 78712

Downloaded by UNIV OF CALIFORNIA SAN DIEGO on January 5, 2017 | http://pubs.acs.org Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch018

The design and operation of a system which forms generalizations about organic chemical reactions and structures and uses these generalizations to organize the reactions and structures for efficient retrieval and to generate precursors to a given target molecule is presented. Approaches to computer-based classificatory concept formation and organization are discussed. A new linear notation for organic reactions is described.

The complex professional tasks accomplished by organic chemists are an intriguing example of intelligent human activity. Organic chemists organize and recall a vast amount of information. In ascending order of complexity, the knowledge created and used by the organic chemist consists of individual observations, conceptual schemes and generalizations which organize this factual knowledge base, and, most importantly, procedures which describe how to use these facts and conceptual schemes to solve a given problem. We are interested in the ways in which information is organized and used for problem solving. Our objective is to design machines which will encode reactions and structures, will automatically create generalizations based on these data, and will use these generalizations to organize the data and to solve the problem of precursor generation.

Organic chemists often use structural features to classify reactions. The capacity to conceptualize is an indispensable aspect of intelligence. We wished to determine whether a computer, given a large number of structures or reactions and a small set of rules, can create useful generalizations. In designing such a program, we have faced a number of interesting issues concerning conceptualization in organic chemistry.
Organic chemistry is a unique theater for AI research because over the past 150 years organic chemists have created a powerful graphical knowledge representation scheme. This representation method is the second language of all organic chemists and supports tasks ranging from mundane recall of a specific datum to the generation of highly complex, creative proposals for multi-step syntheses of previously unknown molecules.

Computer science and organic chemistry have been in comfortable collaboration for the past 25 years.(1-7) A number of important programs have been developed in that time. The DENDRAL project influenced AI research in far-reaching ways.(8) Organic chemistry is an enticing arena for AI research because to a limited but important extent, in the microworld of the organic chemist, the problem of how to represent knowledge has been solved. The graphical language shared by all organic chemists for over a century is a remarkably sophisticated knowledge representation scheme which is easily adapted to contemporary techniques in computer science. The organic chemist does use many concepts (electronegativity, insights from quantum theory, and spatial relationships between molecular components) which are absent or are indirectly encoded in his graphical notation. Nevertheless, a substantial amount of knowledge at the factual level, and a useful number of higher level concepts, can be expressed as connected graphs.

Consider the following list: functional groups, the aldol reaction, the Paterno-Buchi reaction, carbon-carbon bond formation, ene reaction, esters, alkenes, elimination, enamines, Claisen rearrangement, allylic alcohols, halogenation. These words describe just a few general categories used by chemists to classify reactions or structures. These categories, some in use for over 100 years, can be described using organic structural formulae and find daily use in classifying chemical facts.

³Current address: Board of Studies in Computer Science, University of California, Santa Cruz, CA 95064
0097-6156/86/0306-0209$06.25/0 © 1986 American Chemical Society
Pierce and Hohne; Artificial Intelligence Applications in Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1986.
Computer systems have used such generalizations (provided by chemists) to guide data organization, recall, and planning. The benefits of original machine calculated generalizations will be realized when capable conceptualizing programs are available. It will be shown here that, given structures and reactions and a simple set of instructions, a computer can indeed discover generalizations, some of which are equal to the categorizations used by chemists. While the fact that some discoveries are very similar to known categories is interesting, it is more important that the computer can also discover patterns previously unknown to chemists.

In this program the generalizations about reactions and structures which are discovered by the system are used very much as man-made generalizations have been used. They organize the data, they are used during the recall procedure, and they are used to generate precursors to target structures. We hypothesize that because only a few chemistry specific heuristics are used in the generalization algorithm, this system will have more creative potential than systems which are more rigidly constructed from many special rules based on detailed chemical knowledge. In the current system the answers provided to the precursor generation problem are naive because we have not yet incorporated a heuristic based module to guide precursor selection. Here, as in a child, however, this naivety is accompanied by the potential to suggest fresh approaches to solving a problem. The answers are not directed to conform to a consensus view of correctness. We seek a system of answering questions, but not a system which provides only expected answers.

The first part of this paper provides an overview of what the program does and how it works. We present an approach to representing both structures and reactions as single connected graphs. We will refer to all such labelled graphs as structural concepts or simply as concepts. Structural concepts range from the very general (carbon-carbon single bonds, carbon-oxygen double bonds) through intermediate size and generality (the aldol reaction, the pyran ring) up to the most complex real-world instances of molecules or reactions. By virtue of the graph representation scheme, reactions and structures, both real and abstract, may be stored in a single data base.

This system differs in several ways from other approaches to organic chemistry data base organization. The data organization of this system is based on machine generated structural concepts rather than pre-determined screens. The rules which guide the generalization process will be detailed. The data is hierarchically self-organized, in a partial ordering proceeding from the smallest, most general structural concepts (primitives) to the largest and most specific structures or reactions. The generalizations that are created aid retrieval and are used in precursor generation. The idea of a hierarchical organization of knowledge has a history far predating computer science.(9) (Consider for example the arbor porphyriana, a "tree of concepts" proposed by Porphyry in the 3rd century A.D.) We recognize that the hierarchical organization and manipulation of graphs is a general approach to knowledge processing and should find application outside of organic chemistry.

In the second part of the paper examples of the system in action will be given. We feel that because our system uses clearly defined rules for creating generalizations, it may offer fresh insights and solutions to problems.
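The partially ordered hierarchy described above can be pictured with a small sketch. It is not the authors' algorithm: concepts are reduced here to multisets of primitive labels, and "more general than" is approximated by sub-multiset inclusion, whereas the system itself compares labelled graphs. The names `Hierarchy` and `sub_multiset` and the toy concepts are assumptions made only for illustration.

```python
from collections import Counter

def sub_multiset(a, b):
    """Stand-in for 'a is a subgraph of b' (an assumption of this sketch):
    every primitive in a occurs in b at least as often."""
    return all(b[p] >= n for p, n in a.items())

class Hierarchy:
    """Concepts kept in a partial order, most general first."""

    def __init__(self, leq):
        self.leq = leq          # leq(a, b): a is at least as general as b
        self.concepts = {}      # name -> concept

    def add(self, name, concept):
        self.concepts[name] = concept

    def _strictly_above(self, a, b):
        """True if concept a is strictly more general than concept b."""
        return self.leq(a, b) and not self.leq(b, a)

    def predecessors(self, name):
        """Most specific stored concepts that are more general than `name`."""
        c = self.concepts[name]
        above = {n: d for n, d in self.concepts.items()
                 if self._strictly_above(d, c)}
        return sorted(n for n, d in above.items()
                      if not any(self._strictly_above(d, e) for e in above.values()))

    def successors(self, name):
        """Most general stored concepts that are more specific than `name`."""
        c = self.concepts[name]
        below = {n: d for n, d in self.concepts.items()
                 if self._strictly_above(c, d)}
        return sorted(n for n, d in below.items()
                      if not any(self._strictly_above(e, d) for e in below.values()))

# Toy data base: two primitives and three molecules (bond counts only).
h = Hierarchy(sub_multiset)
h.add("C-C", Counter({"C-C": 1}))
h.add("C=O", Counter({"C=O": 1}))
h.add("acetone", Counter({"C-C": 2, "C=O": 1}))
h.add("cyclohexanone", Counter({"C-C": 6, "C=O": 1}))
h.add("methyl acetate", Counter({"C-C": 1, "C=O": 1, "C-O": 2}))

print(h.predecessors("cyclohexanone"))   # ['acetone']
print(h.successors("C=O"))               # ['acetone', 'methyl acetate']
```

Only the immediate neighbours of a concept are reported, which is what lets a search descend from the primitives toward specific structures without visiting every stored concept.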
Rules for generalization can be systematically modified. The question of how such modifications affect the problem solving capabilities of the system is unanswered.

An appendix details the new techniques used in this program. An efficient algorithm based on a partial ordering allows the recall of subgraphs, supergraphs and close-matches for any query graph. Some comparisons will be made of this algorithm with previously used screen approaches for graph retrieval.

Overview of the System

Reaction Representation. From the outset, this project was shaped by the graphical form of traditional organic reaction representations:

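[Reaction scheme not recoverable from the source: an aldol-type reaction drawn in the conventional form, starting materials on the left and product on the right.]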

Reactions are invariably written this way, and obviously have a left hand side and a right hand side. To the beginning organic student, this format naturally suggests a "before and after" or "cause and effect" perception of reactions: "If the starting material is treated in this way, then the product will result." This perception has influenced the design of some computer programs. Reactions have been represented either as two related structures or as one structure and a set of changes required to produce the other structure.


To simplify comparisons between reactions, we sought to describe entire reactions as a single labeled graph. Just as cause and effect can be considered either as two separate events or as a unified process, changing with time, so a reaction can be perceived as two structures, as shown above, or as a single assembly of nuclei connected by bonds which change with time. The aldol-type reaction just illustrated can be rewritten as follows:

[Drawing not recoverable from the source: the aldol-type reaction as a single graph with labeled changed bonds.]

Note that bonds which are invariant with time are represented in the usual way. The dotted lines represent bonds which change over the time course of the reaction event. Each changed bond is labeled to indicate its bond order before and after the reaction. Obviously, the unchanging bonds can also be labeled in an identical fashion. (For example, "1:1" would represent an unchanged single bond.) A second example of this representation is illustrated in Figure 1. These formulae are unorthodox only because they contain unusual types of bonds, bonds which change with time. It is this same feature which makes the formulae very useful. The single formulae represent entire reactions and can be stored or manipulated using any of the methods already devised for the storage of static structures.

We have chosen to use a bond-centered approach to encoding these graphs. The smallest structural unit is the atom-bond-atom fragment, and will be referred to as a primitive. Connected networks of these atom-bond-atom fragments define a molecule or a reaction. These networks of primitives are node labeled connected graphs and can be represented as adjacency tables wherein the nodes are labeled with numbers corresponding to primitives. Finally, these adjacency tables are stored in files as LISP lists and reside in core as arrays. Steps followed in the translation of a reaction into a LISP list structure are illustrated in Figure 1.

Reaction Generalizations Based on Specific Observations. Organic chemists have long sought to organize their observations. Reactions represented as connected graphs can be formed into groups on the basis of common substructures (subgraphs) shared by all the members of the group. These substructures (subgraphs) are structural concepts which are more general than the specific reactions from which they are derived. These structural concepts help to organize the large numbers of known reactions. Structural concepts derived from examples of real-world reactions may have the form of a normal reaction but are not necessarily good reactions as formulated. For example, most organic chemists would recognize the following as the generic form of the Diels-Alder reaction but few chemists expect this exact reaction to afford a high yield.

[Drawing of the generic Diels-Alder reaction not recovered from the source.]
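The bond-centered encoding described above (changed-bond labels, atom-bond-atom primitives, and an adjacency table whose nodes are primitives) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' LISP code: the primitive numbers are assigned arbitrarily in order of first appearance, a bond that does not exist on one side of the reaction is given order 0, and the function names are hypothetical. The generic Diels-Alder pattern (butadiene plus ethylene giving cyclohexene) serves as the example.

```python
from collections import defaultdict

def element(atom):
    """'C12' -> 'C': strip the numeric locant from an atom name."""
    return atom.rstrip("0123456789")

def primitive(bond):
    """Atom-bond-atom fragment such as ('C', '2:1', 'C').
    Elements are sorted so that C-O and O-C give the same primitive."""
    a, b, before, after = bond
    first, second = sorted((element(a), element(b)))
    return (first, f"{before}:{after}", second)

def adjacency_table(reaction, numbering):
    """Nodes are bonds, labeled with their primitive number; two nodes
    are adjacent when the underlying bonds share an atom."""
    by_atom = defaultdict(list)
    for node, (a, b, _, _) in enumerate(reaction, start=1):
        by_atom[a].append(node)
        by_atom[b].append(node)
    table = {}
    for node, bond in enumerate(reaction, start=1):
        touching = set(by_atom[bond[0]]) | set(by_atom[bond[1]])
        table[node] = (numbering[primitive(bond)], sorted(touching - {node}))
    return table

def side(reaction, after=False):
    """Project the single graph onto one side of the arrow.
    Bonds of order 0 on that side are dropped."""
    idx = 3 if after else 2
    return [(b[0], b[1], b[idx]) for b in reaction if b[idx] > 0]

# (atom, atom, order before, order after); 0 marks a bond that is formed.
diels_alder = [
    ("C1", "C2", 2, 1),
    ("C2", "C3", 1, 2),
    ("C3", "C4", 2, 1),
    ("C4", "C5", 0, 1),
    ("C5", "C6", 2, 1),
    ("C6", "C1", 0, 1),
]

# Arbitrary primitive numbering (an assumption of this sketch).
numbering = {}
for bond in diels_alder:
    numbering.setdefault(primitive(bond), len(numbering) + 1)

table = adjacency_table(diels_alder, numbering)
for node, (concept, neighbours) in table.items():
    print(node, concept, neighbours)

print(side(diels_alder))               # bonds present in the starting materials
print(side(diels_alder, after=True))   # bonds present in the product
```

Because the whole reaction is one graph, the starting materials and the product are both recovered by projection, and the same table format serves for static structures (every label then has the form "n:n").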


[Figure 1, partially recovered. Linear notation: (c=:-c1-:»c:-c2=:-c3:-),c2-c4-o-c5-c3,c4»o,c5«o,c1-i. Panel (b) tabulated NODE numbers 1 to 13 against CONCEPT (primitive) numbers; the table entries are not recoverable.]

[Figure 4, partially recovered. Session fragments: -> (show 539), -> (show 306), (show 484), and the graph (C1-C2-C3-C4-C5-C6-),C7-C3-C2-C1-O12,C8-C2-C9-C10-C11.. Annotations: "Eventually, graphical output will be possible." and "a supergraph is viewed."]

Figure 4. Structural retrieval. Responses provided by the user are in italics. Annotations are inserted on the right.


What would you like to do?
1 = Change the database.
2 = Ask a question.
3 = Go to lisp level input.
4 = Save changes
5 = Quit
2
Initiating a query...
Please enter the list of classes.
(r)    | this time we are interested in only reactions.
Please enter the legal substitutions: nil    | no substitutions are allowed
Type in the structure please:    | the user wants to see reactions which form c-c bonds at the alpha carbon in cyclohexanone:
(c1-c2-c-c-c-c-),c1-o,c2:-c.
Searching data-base of graphs...
Exact matches: nil    | the exact reaction is not in the data base.
Subgraphs: nil    | no known subgraphs.
Supergraphs: (667 681 826 1224)    | four known reactions are supergraphs of the query.
Close matches: ((508 7) (814 7) (676 7) (136 4) (1063 3) (105 3) (1057 2) (359 2))    | concept 508, for example, has a 7 bond subgraph in common with the query.
Number of concepts searched: 21
Number of complete node-by-node searches required: 19
Would you like to add the structure as a new concept? (y-yes) no    | this is one way in which the system can learn new concepts
3
Going to lisp level input. To return to this menu type (hi)
T
-> (show 826)    | the user now examines some supergraphs of the query.
(C1-C2-C3-C4-C5-C6-),(C12+C13+C14+C15+C16+C17+),O7-C1-C6-C8,C6:-C9=:-C10-C11-C12,C11=O18..
[structure drawing not recoverable]
-> (com 826)    | comments include bibliographic information. Yields are stored in a separate file.
(House, H. O. "Modern Synthetic Methods", pp 595-6)
-> (show 1224)
(C1-C2-C3-C4-C5-C6-),(C2-C3-C10-C11-N12-C13:-),O7=C1-C2:-C13:-N12-C15,C1-C6-C8,C6-C9,C13=:O14,.
[structure drawing not recoverable]
-> (com 1224)
(Corey, E. J., et al. J. Amer. Chem. Soc. 1974, 96, 6516)

Figure 5. Reaction retrieval.
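The first three result categories reported in Figure 5 (exact matches, subgraphs, supergraphs) can be illustrated with a brute-force sketch. It rests on several assumptions: every stored concept is compared with the query, whereas the session above reports far fewer node-by-node searches because the system walks its partial ordering; close matches are omitted; and the concept numbers, the `graph` helper, and the toy data base are hypothetical.

```python
def graph(atoms, bonds):
    """atoms: names such as 'C1 C2 O3'; bonds: (atom, atom, label) triples.
    Returns (atom -> element, bond -> label)."""
    names = atoms.split()
    return ({n: n.rstrip("0123456789") for n in names},
            {frozenset((a, b)): label for a, b, label in bonds})

def is_subgraph(pattern, target):
    """True if pattern embeds in target with matching atom and bond labels
    (plain backtracking; fine for toy graphs only)."""
    p_atoms, p_bonds = pattern
    t_atoms, t_bonds = target
    order = list(p_atoms)

    def consistent(mapping):
        # Every pattern bond whose two atoms are mapped must exist in target.
        for pair, label in p_bonds.items():
            a, b = tuple(pair)
            if a in mapping and b in mapping:
                if t_bonds.get(frozenset((mapping[a], mapping[b]))) != label:
                    return False
        return True

    def extend(mapping):
        if len(mapping) == len(order):
            return True
        atom = order[len(mapping)]
        for cand, elem in t_atoms.items():
            if elem == p_atoms[atom] and cand not in mapping.values():
                trial = {**mapping, atom: cand}
                if consistent(trial) and extend(trial):
                    return True
        return False

    return extend({})

def retrieve(query, db):
    """Sort stored concepts into exact matches, subgraphs, and supergraphs."""
    hits = {"exact": [], "subgraphs": [], "supergraphs": []}
    for key, concept in db.items():
        down = is_subgraph(concept, query)   # concept is contained in the query
        up = is_subgraph(query, concept)     # query is contained in the concept
        if down and up:
            hits["exact"].append(key)
        elif down:
            hits["subgraphs"].append(key)
        elif up:
            hits["supergraphs"].append(key)
    return hits

db = {
    101: graph("C1 C2", [("C1", "C2", "1")]),                       # C-C
    102: graph("C1 O2", [("C1", "O2", "2")]),                       # C=O
    103: graph("C1 C2 C3 O4", [("C1", "C2", "1"), ("C2", "C3", "1"),
                               ("C2", "O4", "2")]),                 # acetone
    104: graph("C1 C2 C3 C4 O5", [("C1", "C2", "1"), ("C2", "C3", "1"),
                                  ("C3", "C4", "1"), ("C2", "O5", "2")]),  # butanone
    105: graph("C1 C2 O3", [("C1", "C2", "1"), ("C2", "O3", "1")]),  # ethanol
}
query = graph("C7 C8 C9 O1", [("C7", "C8", "1"), ("C8", "C9", "1"),
                              ("C8", "O1", "2")])
hits = retrieve(query, db)
print(hits)
```

Two finite graphs that embed in each other have the same numbers of atoms and bonds, so the mutual test is enough to report an exact match here.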


What is the target molecule?
Type in the structure please:    | the target:
(c1-c2-c3-c-c-c-c7-c-),c1=o,c2-c-c-c,c3-c7.
Adding concept...
Searching data-base of graphs...
The concept is 1231.    | the system temporarily adds the target to the data base. In this way known subgraphs of the target are found together with the reactions that will produce them. These reactions are then used to generate precursors.
The following precursors are suggested:
#   reaction   validity      size
1   1216       83            13
2   11         61            13
3   236        56            14
4   193        56            13
5   27         [illegible]   13
    | the table gives five precursors. the concept used to generate the precursor is shown with the transform validity of this application (see text). the last column gives the number of bonds in a precursor.
The precursors are on list 'pre'.    | the user now views the first three
-> (view pre 1)
(C1