8

Progress toward an On-Line Chemical and Biological Information System at the Upjohn Company

W. J. HOWE and T. R. HAGADONE

Downloaded by UNIV LAVAL on April 22, 2018 | https://pubs.acs.org Publication Date: December 14, 1978 | doi: 10.1021/bk-1978-0084.ch008

The Upjohn Company, Kalamazoo, MI 49001

Over the past ten or fifteen years, researchers in the pharmaceutical industry have seen a gradual but important change in the way computers are utilized in research and research support functions. Early applications tended to focus on numerical tasks such as statistical analyses and quantum mechanical calculations or on the archival storage of information related to the chemistry or biology of research substances. In the latter case, information retrieval systems were often unwieldy and required considerable expertise for their use. The laboratory researcher usually had to work through an intermediary in order to retrieve information from such systems. More recently, we have seen a shift of emphasis to where computers are now recognized as indispensable tools in the day-to-day operation of scientific research. On-line interactive methods have placed the information resource much closer to the end user. In addition to their "traditional" applications, computer-based systems are being employed to assist in the design of organic syntheses, in the interpretation of spectroscopic data, in the design and development of new drug candidates, for real-time experiment control, and in a wide variety of related areas (1-6). The retrieval and manipulation of medicinal chemical information is another area in which computer-based systems have made an impact and which will become increasingly important in future years.

This paper will focus on a project which has been under way at the Upjohn Company to develop a comprehensive chemical and biological information system to be used by research scientists and research support personnel. Capabilities of the system will eventually include on-line structure registry, structure and substructure searching, the retrieval and manipulation of pharmacological test data, and the retrieval of spectroscopic, patent, and other types of structure-associated data. There are currently a number of systems in the company which are being used for the storage of biological data associated with compounds that have been synthesized for screening. In most cases, the operation of these systems has in the past been quite independent of "chemically-oriented" information. Chemical structure and substructure searching has been accomplished through the use of a fragment code which was developed in the late 1950's, and which, despite a number of drawbacks that are commonly inherent in fragment-based systems, has met the needs of our scientists for a number of years. However, since one of the goals of the new information system is to provide a means for interactively accessing both the chemical structure data and associated pharmacological data, and for the extraction of subgroups of compounds which could then, for example, act as source data for end-user applications such as pattern recognition analyses, the design of a flexible and efficient chemical structure entry and search system became the initial target of our attention.

The chemical structure system consists of three parts, in different stages of development:

(a) the structure database, a collection of approximately 60,000 chemical structures in connection table format, the construction of which has recently been completed,

(b) the structure entry system, an interactive computer-graphics based system which was developed to create the initial database; portions of this will also be incorporated in the compound registry and search system,

(c) the compound registry and search system, currently under development, which consists of two parts: (1) an on-line registry facility which will allow interactive daily updating of the database, and, (2) the query facility, which will allow on-line interactive structure and substructure searching and eventual searching and manipulation of associated pharmacological information. The system will enable the user to display the retrieved information in a convenient format and to produce high quality hard copy output of both structural and textual data.

0-8412-0465-9/78/47-084-107$06.25 Published 1978 American Chemical Society
Howe et al.; Retrieval of Medicinal Chemical Information ACS Symposium Series; American Chemical Society: Washington, DC, 1978.

1. The Structure Database

A key phase of the project involved the creation of the structure database, a gradually enlarging collection of approximately 60,000 chemical structures which over the years had either been synthesized in-house for testing purposes or obtained from outside organizations. The fragment-coded search system also operated on this collection of compounds; however, since fragment codes represent structural attributes, the codes could not be used to regenerate complete connection tables. A structure entry system was designed which, by using computer graphics as the input medium, would allow direct transcription of the structure diagrams from hard copy format into the computer system. Connection tables would be generated in real-time as the structure drawing operation progressed. The structure entry program was ready for use about 1-1/2 years ago and full scale structure entry began at the start of 1977. Although many error-detection devices were built into the system, there were still certain types of errors which could slip by and enter the database. For that reason it was decided at the outset that each structure would have to be entered twice, by different terminal operators, thereby enabling an identity check to be performed on the host computer. Error checking by manual comparison of each entered structure with a hard copy record would, it was felt, take just as long as it would take to redraw a structure a second time and would still provide no guarantee that all errors had been caught.

The structure entry operation has just recently been completed, having taken considerably less time than originally anticipated. Now that the high volume input of the database "backlog" is done, it is planned that routine daily update of the database with low volume "current" structures will be handled by the on-line registry facility which is nearing completion and which will be discussed later.

2. The Structure Entry System

The structure entry system was designed to accommodate rapid error-free structure entry, with much consideration given to structure diagram cosmetics. It was also designed so that it could be easily incorporated into the compound registry and search system with little or no modification. For that reason, we will present an operational overview of the graphical structure entry system, focusing in particular on its use in the creation of the structural database.

(a) Hardware. The data entry terminal is operated essentially as a stand-alone computer system (Figure 1) which transmits completed structure connection tables to the host machine (370/155) where they are compared against their duplicate structures (double entry). Once a day an error log is printed to enable correction of structural errors (using a similar program on the database management terminal, see Figure 1).

The structure entry system consists of a PDP 11/04 computer with 28K words of memory, a dual floppy-disk drive, keyboard, graphics tablet, and CRT (similar to the DEC GT43 package). The graphics tablet and associated stylus enable a user to interact with the display by moving the stylus on the surface of the tablet, rather than pointing to the face of the scope as would be done with a light pen. Software in the computer tracks the motion of the stylus with a cursor (a small cross) on the scope. Depressing the stylus activates a switch in the stylus tip which in turn allows the user to select options from a "menu" on the display. Such a device has been found to be a very natural medium for interacting with a display and much more convenient than a light pen. Additional details on the use of a graphics tablet for chemical structure drawing can be found in references 7 and 8.

Figure 1. Hardware configuration for structure-entry project. High-volume structure entry was accomplished on the small graphics system (PDP 11/04); data base was formed on 370/155. Data-base maintenance and structure corrections were performed on the large graphics system (PDP 11/40). Information-retrieval system runs on 370/148 with data base transferred from 370/155.

[Diagram labels: current structure database; future structure database; host machines IBM 370/155 (OS/VS1) and IBM 370/148 (VM/CMS); communication controller; program development and structure correction system (PDP 11/40); structure entry system (PDP 11/04); floppy disk (DSD-210); floppy disk; graphics, keyboard, tablet (shown twice); printer/plotter (Versatec).]

At times the host machine is not available, and so rather than transmitting completed structures directly to the database on the 370, the program instead writes them on a floppy disk. These are later incorporated into the database via the floppy disk unit on the second graphics terminal (see Figure 1).

(b) Internal Structure Representation. Structures are represented in the computer in the form of atom-bond connection tables. These are arrays of data which account for such things as:

(i) for each atom; atom type, formal charge, isotope level, presence of unpaired electron, two-dimensional coordinates, number of bonds attached, and additional information required for regeneration of the structure diagram,

(ii) for each bond; identifiers for the atoms at each end of the bond, bond multiplicity, and stereochemical information for cases where the bond is attached to a chiral atom.

The connection table is formed incrementally during the structure drawing operation. Since X-Y coordinate data for each atom are stored in the table, a complete molecular picture can be generated almost instantaneously from the connection table. The table provides an unambiguous representation of a structure; however, at the time the connection table is inserted in the database, a canonicalization step (using a modified Morgan algorithm (9,10)) is performed which results in a unique ordering of the atoms within the table and facilitates a direct comparison of two "duplicate" tables to detect differences (errors).
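The comparison of double-entered structures can be illustrated with a small sketch. It is not the modified Morgan algorithm cited above: it only refines per-atom labels from their neighbours in the Morgan spirit and compares order-independent signatures, so equal signatures are a necessary rather than a sufficient test of identity. The `Atom` and `Bond` record layouts, the function names, and the sample entries are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical record layouts; the tables described above also hold isotope,
# unpaired-electron, and stereochemical fields that are omitted here.
Atom = namedtuple("Atom", "element charge x y")
Bond = namedtuple("Bond", "a b order")  # a and b index into the atom list


def atom_invariants(atoms, bonds, rounds=3):
    """Refine per-atom labels by folding in neighbour labels (Morgan-style)."""
    neighbours = [[] for _ in atoms]
    for bond in bonds:
        neighbours[bond.a].append((bond.order, bond.b))
        neighbours[bond.b].append((bond.order, bond.a))
    # Start from drawing-independent facts: element, charge, bonds attached.
    inv = [(atom.element, atom.charge, len(neighbours[i]))
           for i, atom in enumerate(atoms)]
    for _ in range(rounds):
        inv = [(inv[i], tuple(sorted((order, inv[j]) for order, j in neighbours[i])))
               for i in range(len(atoms))]
    return inv


def signature(atoms, bonds):
    """Order-independent summary of a table; X-Y coordinates are ignored."""
    inv = atom_invariants(atoms, bonds)
    bond_part = sorted((tuple(sorted((inv[b.a], inv[b.b]))), b.order) for b in bonds)
    return sorted(inv), bond_part


def duplicates_agree(first_entry, second_entry):
    """Compare two independently drawn entries of the same compound."""
    return signature(*first_entry) == signature(*second_entry)


# Ethanol skeleton (C-C-O) drawn by two operators with different atom order.
entry_one = ([Atom("C", 0, 0.0, 0.0), Atom("C", 0, 1.0, 0.5), Atom("O", 0, 2.0, 0.0)],
             [Bond(0, 1, 1), Bond(1, 2, 1)])
entry_two = ([Atom("O", 0, 5.0, 5.0), Atom("C", 0, 4.0, 5.5), Atom("C", 0, 3.0, 5.0)],
             [Bond(1, 0, 1), Bond(2, 1, 1)])
# A mis-drawn second entry: C-O-C (dimethyl ether skeleton).
entry_bad = ([Atom("C", 0, 0.0, 0.0), Atom("O", 0, 1.0, 0.5), Atom("C", 0, 2.0, 0.0)],
             [Bond(0, 1, 1), Bond(1, 2, 1)])

print(duplicates_agree(entry_one, entry_two))
print(duplicates_agree(entry_one, entry_bad))
```

Because the signature ignores coordinates and atom numbering, two operators who draw the same compound differently still agree, while a transposed heteroatom is flagged for the daily error log.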
The connection table that is stored contains no higher-level chemical information such as aromaticity, ring information, or stereochemical relationships other than the bond type transcribed from the hard copy record. Such high-level relationships (and others) can be extracted from the basic information contained in the table by appropriate perception routines on the host machine. In fact, the structure record that will be used for high speed substructure searching is not the original master connection table (CT) for each structure, but a specially formatted record derived from the CT which also contains all the higher-level data necessary to provide search results in as close to "interactive time" as possible (see discussion of substructure searching). Thus, the connectivity information will actually be present in more than one form in the completed registry and search system. In the following discussion, however, "connection table" refers to the expanded connectivity array described at the start of this section.

(c) Graphical Structure Input. Structures are transcribed into the system from data sheets which contain molecular formula, chemical name, structure diagram, some physical and biological screening data, and a registry number called a "U-number". In some respects, the graphical entry system is similar to those used by Chemical Abstracts Service (11) for structure input and by computer-assisted synthesis research groups (12, 13) for specification of target molecules. There are, however, a number of differences from the latter systems due to our required focus on error control, speed of entry, and overall structure diagram cosmetics.

A number of drawing options appear on the display, which essentially represent a "menu" of graphics controls. To select from the menu the user moves the stylus on the tablet so as to superimpose the tracking cursor on one of the options, and then depresses the stylus slightly to activate the desired option. As can be seen in Figure 2, at the top of the display a rectangle appears around the TYPE option. This indicates to the operator the option that is currently active.

Some information must be entered via the keyboard. This includes the date and the operator's initials (at the start of a session), and a U-number and molecular formula (MF) before each structure is drawn. The system matches the MF against the structure when the OUTPUT option is selected and only transmits the structure to the database if the MF and structure match.

The large rectangle in the center of the display represents the drawing area inside which the molecular diagram is constructed. Error messages and other textual feedback to the user appear at the bottom of the drawing area.

The options which are arrayed along the top of the display allow the user to change drawing modes.
They operate as follows. DRAW allows the user to perform a freehand drawing operation to enter bonds and implicit carbon atoms (see below for description); RINGS changes the display to a second menu from which pre-drawn ring systems can be selected; MOVE enables the user to adjust the position of atoms and their attached bonds by simply superimposing the cursor on the desired atom and moving the stylus (and thereby, the atom) to its new position; CENTER centers the drawing in the box; DELETE allows the selective erasure of atoms or bonds; TYPE returns control to the keyboard; OUTPUT sends a completed structure to the host machine after the molecule is subjected to a series of error checks (remaining errors are detected on the 370 during the duplicate match); and CLEAN erases the drawing area and initializes the connection table.

The three remaining options at the top of the display are for bond character modification. For example, the broken/zigzag line allows specification of stereochemical information. While the system is in this mode, the user can "point" to the center of a bond and it will become a dashed bond to indicate a projection of the bond back into the plane of the drawing. Pointing to the bond a second time converts it to a "wavy" bond of the type normally used to indicate undefined absolute configuration at a chiral center. So far, this has been sufficient to permit an adequate specification of stereochemistry; however, in the next version of the graphical entry system wedge-shaped bonds will be specifiable, to increase the flexibility of the system and prevent any ambiguity of chiral site definition.

The arrow option above DELETE is used for the specification of strongly polarized bonds where there is a formal charge separation between the ends of the bond. This can be used in N-oxides or phosphates, for example. The registry and search system will recognize the equivalence of R₃N→O and R₃N⁺-O⁻, so the arrow is used mainly for cosmetic purposes without any loss of structural information. And finally, the solid line at the top of the display is used to convert any of the special bond types just described back to a normal single bond.

Along the bottom of the display appear a number of commonly-occurring atom types and functional groups, as well as some control options. The FLIP option changes the bottom menu to reveal an additional set of less commonly-occurring atoms and groups. Functional groups not present in preconstructed form can be drawn simply by selecting the component atoms from the menu and connecting them with the appropriate bonds. This, however, takes longer than it does to insert one of the predrawn groups. Many of the predrawn functional groups can also be converted to structurally similar groups. For example, to draw a trichloromethyl group, the operator would (a) insert a CF3 from the menu, (b) pick up a Cl from the menu and superimpose it on the F3 in the trifluoromethyl group, and (c) depress the stylus. This would immediately convert the CF3 to a CCl3.
Since the CCl3 is represented in the connection table as three distinct chlorine atoms attached to the same carbon, the operator could also enter the same group by drawing the three chlorines separately. Although the appearance would be different, the connectivity data for the two forms would be the same.

At the start of each structure drawing operation, the computer requests the U-number and molecular formula of the compound. After this information is typed in by the operator, the tablet is activated and the picture drawing stage can begin. Although there is no drawing order imposed on the operator, the cyclic nucleus of the molecule is usually drawn first. Rings can be drawn in two ways, freehand or by selecting a predrawn ring system from the second display. To draw a bond "freehand", the user selects the DRAW option and then depresses the stylus with the cursor inside the drawing area. As the stylus is moved, a straight line appears on the scope as if it were "ink" from the stylus. When the stylus is lifted, the line is frozen and the new bond is inserted in the connection table. The terminating atoms are initially assumed to be carbon. Additional bonds can be drawn in this manner until the desired ring is formed. We have found, however, that the entry operation can be speeded up considerably by providing a collection of predrawn ring systems which can be brought into view (Figure 3) by pressing the RINGS option.
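The incremental table construction and the molecular formula check described in this section can be sketched as follows. This is only an illustration under stated assumptions: the `ConnectionTable` class, its method names, and the `VALENCE` table used to infer implicit hydrogens are hypothetical and are not taken from the Upjohn program.

```python
import re
from collections import Counter

# Assumed neutral valences, used only to infer implicit hydrogens in this sketch.
VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1}


class ConnectionTable:
    def __init__(self):
        self.atoms = []  # element symbol per atom
        self.bonds = []  # (atom index, atom index, bond order)

    def add_atom(self, element="C"):
        self.atoms.append(element)
        return len(self.atoms) - 1

    def draw_bond(self, start=None, end=None, order=1):
        """Freehand bond; an endpoint not placed on an existing atom is a new carbon."""
        start = self.add_atom() if start is None else start
        end = self.add_atom() if end is None else end
        self.bonds.append((start, end, order))
        return start, end

    def attach_trihalomethyl(self, anchor, halogen):
        """Menu shorthand (CF3, CCl3): stored as one carbon and three halogen atoms."""
        carbon = self.add_atom("C")
        self.bonds.append((anchor, carbon, 1))
        for _ in range(3):
            self.bonds.append((carbon, self.add_atom(halogen), 1))
        return carbon

    def formula(self):
        """Element counts, with hydrogens inferred from unused valence."""
        used = Counter()
        for a, b, order in self.bonds:
            used[a] += order
            used[b] += order
        counts = Counter(self.atoms)
        hydrogens = sum(VALENCE[el] - used[i] for i, el in enumerate(self.atoms))
        if hydrogens > 0:
            counts["H"] = hydrogens
        return counts


def parse_formula(text):
    """Turn a typed formula such as 'C2H3Cl3' into element counts."""
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", text):
        counts[symbol] += int(number) if number else 1
    return counts


def matches_formula(table, typed_formula):
    """The check applied at OUTPUT time: drawn structure versus typed MF."""
    return table.formula() == parse_formula(typed_formula)


# 1,1,1-Trichloroethane entered with the CCl3 shorthand ...
shorthand = ConnectionTable()
methyl = shorthand.add_atom("C")
shorthand.attach_trihalomethyl(methyl, "Cl")

# ... and again by drawing every bond freehand and relabelling the chlorines.
freehand = ConnectionTable()
central = freehand.draw_bond()[1]
for _ in range(3):
    new_atom = freehand.draw_bond(start=central)[1]
    freehand.atoms[new_atom] = "Cl"

print(shorthand.bonds == freehand.bonds)
print(matches_formula(shorthand, "C2H3Cl3"), matches_formula(freehand, "C2H3F3"))
```

Both entry routes yield the same atoms and bonds, which mirrors the statement that only the appearance differs; a typed formula that disagrees with the drawing would block transmission to the database.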

Figure 2. Structure-entry display showing some of the graphics options available

Figure 3.

Figure 6. Illustration of error detection prior to insertion of structure in data base. Message at bottom of drawing area says "structure doesn't match molecular formula."

Figure 7. General hardware-component configuration of substructure-search system. Front end consists of graphics minicomputers. Back end consists of dedicated minicomputer, "intelligent" disk controller, and dedicated disk.

SS portions are extracted and sent to the SS minicomputer for execution. The SS minicomputer then sends the substructure screen bits to the intelligent disk controller and instructs it to start scanning the disk.

The structure screens and connection tables will be stored on the disk in the format shown in Figure 8. A ones-complement operation is performed on the structure screen before it is written on the disk. Therefore, bits with a value of 1 represent those structural attributes that are absent in the structure. As the screen bits of each structure pass the read head of the disk they are read by the controller and logically AND'ed with the substructure screen bits supplied to the controller by the minicomputer. If the result of this operation is nonzero the structure cannot possibly contain the substructure and is eliminated from further consideration; otherwise, the connection table is read into the main memory of the minicomputer for further processing. The disk is read in a sequential manner and when the end of a track has been reached the disk head is stepped over to the next track. Scanning continues after a one revolution delay.

While the disk is being scanned by the controller the minicomputer is simultaneously executing the candidate selection and atom-by-atom matching portions of the search. The atom and bond candidate selection step is performed by an algorithm that combines bit screen and set reduction techniques.
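The complemented-screen test lends itself to a short sketch. The attribute names, bit positions, eight-bit screen width, and registry numbers below are assumptions chosen for illustration; only the complement-then-AND logic comes from the text.

```python
SCREEN_BITS = 8
MASK = (1 << SCREEN_BITS) - 1

# Hypothetical attribute-to-bit assignment for illustration only.
BIT = {"ring": 0, "carbonyl": 1, "nitrogen": 2, "chlorine": 3}


def make_screen(attributes):
    """Set one bit for every structural attribute that is present."""
    bits = 0
    for name in attributes:
        bits |= 1 << BIT[name]
    return bits


def stored_form(structure_screen):
    """Ones-complement within the screen width: 1 now means 'attribute absent'."""
    return ~structure_screen & MASK


def may_contain(stored_screen, query_screen):
    """A nonzero AND means the query needs an attribute the structure lacks."""
    return (stored_screen & query_screen) == 0


def scan(disk_records, query_screen):
    """Sequential pass; returns registry numbers that survive the screen."""
    return [number for number, stored in disk_records
            if may_contain(stored, query_screen)]


disk = [
    ("U-0001", stored_form(make_screen(["ring", "carbonyl", "nitrogen"]))),
    ("U-0002", stored_form(make_screen(["ring", "chlorine"]))),
]
query = make_screen(["carbonyl", "nitrogen"])
print(scan(disk, query))
```

Storing the complement means the per-record test is a single AND followed by a zero check, which is the kind of operation a disk controller can plausibly apply as the bits stream past the read head.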
The connection table is arranged in a special format, with one table entry for each bond in the structure. Each entry contains the atom types and sequence numbers of the atoms at each end of the bond as well as the bond type. Entries are ordered by increasing frequency of occurrence (based on statistics calculated over the entire database) of the simple pair (atom-bond-atom sequence) containing the bond. In addition, a small number of screen bits, called a pair screen, is associated with each bond. The pair screen, which is a function of atom and bond sequences within a radius of 2 bond lengths of the central bond, describes the structural environment in the immediate neighborhood of the bond in a manner similar to that of a full structure screen. The pair screen bits are calculated at the time the compound is registered and are stored permanently in the database. Although these extra bits increase the size of the database, experiments have shown that they help provide short and relatively consistent search times.

Execution proceeds by selecting, in turn, each entry in the substructure table and screening against it those entries in the structure table that are of the same simple pair type. The complemented screen bits of each qualifying structure entry are logically AND'ed with the screen bits of the substructure entry in the same manner as for the full structure screen described above. A result of zero indicates that the environment of the selected bond in the structure is similar to the environment of the current bond in the substructure.
Candidate information is stored for each structure bond that matches the substructure bond, to be used later in the final atom-by-atom mapping of the substructure into the structure. If any substructure atom or bond fails to have a candidate in the structure, the examination of that structure is halted (a "no match" condition).

Figure 8. Layout of structures on the disk for substructure searching

Figure 9 shows an example of the candidate selection process utilizing a simplified pair screen of four bits per bond (although the optimal number of screen bits has yet to be determined, it will be in the range of eight to sixteen bits per bond), which represent an adjacent single bond, an adjacent double bond, an attached oxygen atom, and an attached carbon atom. In this case, the following occurs: the third structure entry is screened against the first substructure entry (same simple pair type) and passes the screen; the last two structure entries are screened against the second substructure entry and only the fourth structure entry passes the screen; and finally, all of the structure entries, except the third, are screened against the third substructure entry and only the first passes. The arrows in this figure indicate the structure bonds to which each substructure bond has been mapped.

If a structure passes the candidate selection step, an atom-by-atom mapping of the substructure into the structure is performed and the registry numbers of compounds that qualify are returned to the host machine as they are found.
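A minimal sketch of bond-level candidate selection with four-bit pair screens follows. The example does not reproduce Figure 9; the structure (the heavy-atom skeleton of propanoic acid), the query fragment (C-C=O), the `BondEntry` layout, and the bit assignments are assumptions made for illustration. The frequency ordering of entries and the set reduction step mentioned in the text are left out.

```python
from collections import namedtuple

# Four-bit pair screen: adjacent single bond, adjacent double bond,
# attached oxygen atom, attached carbon atom (bit values are assumed).
ADJ_SINGLE, ADJ_DOUBLE, ATT_O, ATT_C = 1, 2, 4, 8
ALL_BITS = 0b1111

# pair_type is the simple pair: (atom type, atom type, bond order).
BondEntry = namedtuple("BondEntry", "bond_id pair_type screen")


def stored_entry(bond_id, pair_type, screen):
    """Structure-side entries keep the complemented pair screen."""
    return BondEntry(bond_id, pair_type, ~screen & ALL_BITS)


def select_candidates(substructure, structure):
    """Map each query bond to the structure bonds that pass its pair screen.

    Returns None as soon as one query bond has no candidate ("no match").
    """
    candidates = {}
    for query in substructure:
        hits = [entry.bond_id for entry in structure
                if entry.pair_type == query.pair_type
                and (entry.screen & query.screen) == 0]
        if not hits:
            return None
        candidates[query.bond_id] = hits
    return candidates


# Propanoic acid skeleton C1-C2-C3(=O)-O, one entry per bond.
propanoic_acid = [
    stored_entry(1, ("C", "C", 1), ADJ_SINGLE | ATT_C),                       # C1-C2
    stored_entry(2, ("C", "C", 1), ADJ_SINGLE | ADJ_DOUBLE | ATT_O | ATT_C),  # C2-C3
    stored_entry(3, ("C", "O", 2), ADJ_SINGLE | ATT_O | ATT_C),               # C3=O
    stored_entry(4, ("C", "O", 1), ADJ_SINGLE | ADJ_DOUBLE | ATT_O | ATT_C),  # C3-O
]

# Query fragment C-C=O; query screens are not complemented.
ketone_like = [
    BondEntry("a", ("C", "C", 1), ADJ_DOUBLE | ATT_O),
    BondEntry("b", ("C", "O", 2), ADJ_SINGLE | ATT_C),
]
amine_like = [BondEntry("n", ("C", "N", 1), ATT_C)]

print(select_candidates(ketone_like, propanoic_acid))
print(select_candidates(amine_like, propanoic_acid))
```

Here the C1-C2 bond shares the simple pair type of query bond "a" but fails the pair screen because it has no adjacent double bond or attached oxygen, so only the C2-C3 bond is carried forward to the atom-by-atom mapping.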
Since the SS system has yet to be implemented in final form, accurate SS performance data are not available; however, time projections, based on current disk technology and an already implemented SS prototype system, indicate that most searches will require about 30 seconds elapsed time for the 60,000 compound database.

4. Integration of Biological Data: Future Goals

Although much work still needs to be done before the compound registry and search system will be operated on a routine basis, most of the difficult problems concerning chemical structure handling have been overcome. In the next major phase of the project the principal effort will focus on "biological data", a term which encompasses a very broad range of information in the field of pharmacological studies. The biological data handling capabilities of the query system will undergo a continuing evolution which will come about not only as new types of pharmacological data become available for incorporation into the system, but also as the need for (and availability of) new techniques for manipulating experimental data evolves.

Initial work on the incorporation of biological information into the compound registry and search system will deal mainly with data that is already being captured on a routine basis for computer input and storage. This includes screening results in which the biological response of compounds to a variety of test screens is indicated by numerical activity values or binary activity assignments (active/inactive). Additional data types to be incorporated will eventually include toxicity information and more detailed activity results that pertain to individual classes of pharmacological agents.

Figure 9. Simplified candidate-selection example (this is the second phase in a substructure search)

STRUCTURE (atom numbers in parentheses)

    C(2)-C(1)-O(4)-C(5)-C(6)
          ||
         O(3)

STRUCTURE TABLE

    Atom #1   Atom #2   Simple pair (atom #1 type, bond type, atom #2 type)   Simplified pair screen
    1         4         C - O                                                 1111
    5         4         C - O                                                 1001
    1         3         C = O                                                 1011
    1         2         C - C                                                 1110
    5         6         C - C                                                 1010

SUBSTRUCTURE (atom numbers in parentheses)

    C(1)-C(2)-?(4)
          ||
         O(3)

SUBSTRUCTURE TABLE

    Atom #1   Atom #2   Simple pair (atom #1 type, bond type, atom #2 type)   Simplified pair screen
    2         3         C = O                                                 [illegible]
    1         2         C - C                                                 1110
    2         4         C - ?                                                 1111

As was mentioned in the section on implementation, searches over the biological portion of the database will be controlled by the relational database management system. The logic constructs of the expert query language will allow the user to specify rather complex chemical and biological search requests in which, for example, the database is searched for all compounds that contain a particular substructure, which also exhibit a desired activity level in a given screen, and which also were submitted after a particular date. Use of the RDBMS promises not only to reduce substantially the effort required for integration of the chemical and biological databases, but also will simplify considerably the evolution of biologically-oriented search capabilities.

In addition to providing a means for interactive searching of chemical and biological data (for display or report generation purposes), an important feature of the system will be its ability to create subsets of the main database. Users will be able to treat the results of their searches as their own private databases which can be accessed by specially tailored application programs.
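The combined request described above (a particular substructure, a desired activity level in a given screen, and a submission date) maps naturally onto a relational join, and the idea of a private subset maps onto a result table. The sketch below illustrates this with Python's built-in `sqlite3` module, which merely stands in for the relational database management system of the paper. The table and column names (`compound`, `screen_result`, `substructure_hit`, `private_subset`), the sample rows, and the function `combined_search` are all hypothetical, and the registry numbers passed in are assumed to come from a separate substructure search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE compound (reg_no INTEGER PRIMARY KEY, submitted TEXT);
    CREATE TABLE screen_result (reg_no INTEGER, screen TEXT, activity REAL);
    CREATE TABLE substructure_hit (reg_no INTEGER PRIMARY KEY);
""")
# Invented sample data; dates are ISO-8601 text so they compare correctly.
conn.executemany("INSERT INTO compound VALUES (?, ?)", [
    (101, "1976-03-01"), (102, "1977-06-15"),
    (103, "1978-01-20"), (104, "1978-02-02"),
])
conn.executemany("INSERT INTO screen_result VALUES (?, ?, ?)", [
    (101, "screen_A", 0.9), (102, "screen_A", 0.8),
    (103, "screen_A", 0.2), (104, "screen_B", 0.95),
])

def combined_search(conn, hits, screen, min_activity, submitted_after):
    """Store the compounds that are substructure hits, reach min_activity in
    the given screen, and were submitted after the given date in the table
    private_subset, and return their registry numbers."""
    # Load the registry numbers returned by the substructure search.
    conn.execute("DELETE FROM substructure_hit")
    conn.executemany("INSERT INTO substructure_hit VALUES (?)",
                     [(reg_no,) for reg_no in hits])
    # Rebuild the user's private subset from the three-way join.
    conn.execute("DROP TABLE IF EXISTS private_subset")
    conn.execute("CREATE TABLE private_subset "
                 "(reg_no INTEGER, submitted TEXT, activity REAL)")
    conn.execute("""
        INSERT INTO private_subset
        SELECT c.reg_no, c.submitted, s.activity
        FROM compound AS c
        JOIN substructure_hit AS h ON h.reg_no = c.reg_no
        JOIN screen_result AS s ON s.reg_no = c.reg_no
        WHERE s.screen = ? AND s.activity >= ? AND c.submitted > ?
    """, (screen, min_activity, submitted_after))
    rows = conn.execute("SELECT reg_no FROM private_subset ORDER BY reg_no")
    return [row[0] for row in rows]

print(combined_search(conn, [102, 103, 104], "screen_A", 0.5, "1977-01-01"))
```

Keeping the result in its own table is one simple way to let later application programs read the subset without repeating the search; other designs, such as saved queries or views, would serve equally well.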
Compounds retrieved by a combined substructure and screening-activity search could, for example, become source data for more detailed analyses using pattern recognition, molecular modelling, or statistical techniques.

Although we expect that the biological information handling capabilities of the system will undergo a continuing evolution, there is a need for the inclusion of other types of data as well. Spectral data, patent status information, CAS registry numbers, chemical names, and physical property data all fall under the umbrella of "medicinal chemical information" and are some of the more important data types that have been planned for eventual inclusion in the system. The projected capabilities of the system, enabling a user to interactively query and manipulate such diverse types of information, should make the system an important asset in the research and research management functions.

Literature Cited

1. Computer-Assisted Organic Synthesis, Wipke, W. T. and Howe, W. J., eds., ACS Symposium Series No. 61, American Chemical Society, Washington, D.C. (1977).
2. Minicomputers and Large Scale Computations, Lykos, P., ed., ACS Symposium Series No. 57 (1977).
3. Computer-Assisted Structure Elucidation, Smith, D. H., ed., ACS Symposium Series No. 54 (1977).
4. Chemometrics: Theory and Application, Kowalski, B. R., ed., ACS Symposium Series No. 52 (1977).
5. Algorithms for Chemical Computations, Christoffersen, R. E., ed., ACS Symposium Series No. 46 (1977).
6. Computer Networking and Chemistry, Lykos, P., ed., ACS Symposium Series No. 19 (1975).
7. Howe, W. J. and Hagadone, T. R., "Substructure Searching", in Proceedings of the Technical Information Retrieval Committee of the Manufacturing Chemists Association, Washington Meeting, 1977, in press.
8. Corey, E. J. and Wipke, W. T., Science, 166, 178 (1969).
9. Morgan, H. L., J. Chem. Doc., 5, 107 (1965).
10. Wipke, W. T. and Dyott, T. M., J. Amer. Chem. Soc., 96, 4834 (1974).
11. Blake, J. E., Farmer, N. A., and Haines, R. C., J. Chem. Inf. and Computer Sci., 17, 223 (1977).
12. Corey, E. J., Wipke, W. T., Cramer, R. D., and Howe, W. J., J. Amer. Chem. Soc., 94, 421 (1972).
13. Wipke, W. T., in Computer Representation and Manipulation of Chemical Information, Wipke, W. T., Heller, S. R., Feldman, R. J., Hyde, E., eds., p. 147, Wiley Publ., New York (1974).
14. Brown, H. D., Castlow, M., Cutler, E. A., Jr., Demott, A. N., Gall, W. B., Jacobus, D. P., and Miller, C. J., J. Chem. Inf. and Computer Sci., 16, 5 (1976).
15. Codd, E. F., "A Relational Model of Data for Large Shared Data Banks", Commun. of the ACM, XIII, 377 (1970).
16. Codd, E. F., "Further Normalization of the Data Base Relational Model", Courant Computer Science Symposia 6, Data Base Systems, Prentice-Hall, New York (1971).
17. Date, C. J., An Introduction to Database Systems, Addison Wesley, Reading, Mass. (1975).
18. Astrahan, M. M., et al., "System R: Relational Approach to Database Management", A.C.M. Transactions on Database Systems, 1, 97 (1976).
19. Feldman, A., Hodes, L., J. Chem. Doc., 15, 147 (1975).
20. Adamson, G. W., Bush, J. A., McLure, A., and Lynch, M. F., J. Chem. Doc., 14, 44 (1974).
21. Meyer, E., "Superimposed Screens for the GREMAS System", in Proc. FID-IFIP Conference, p. 280, Samuelson, K., ed., Rome Meeting, 1967, North Holland Publ. (1968).
22. Sussenguth, E. H., Jr., J. Chem. Doc., 5, 36 (1965).
23. Figueras, J., J. Chem. Doc., 12, 237 (1972).
24. Haines, R. C., "Substructure Search Design Study Status Report", Chemical Abstracts Service Working Paper (unpublished), 1976.
25. Bird, R. M., Tu, J. C., Worthy, R. M., "Associative/Parallel Processors for Searching Very Large Textual Data Bases", SIGIR-SIGARCH-SIGMOD Third Workshop on Computer Architecture for Non-numeric Processing, McGill, M. J., ed., SIGMOD, 9, No. 2, 8 (1977).

RECEIVED August 29, 1978.
