12

An Integrated Chemical and Biological Data Retrieval System for Drug Development

Downloaded by UNIV OF NEW ENGLAND on January 22, 2017 | http://pubs.acs.org Publication Date: December 14, 1978 | doi: 10.1021/bk-1978-0084.ch012

J. A. PAGE¹, R. THIESEN, and F. KUHL

Walter Reed Army Institute of Research, Bethesda, MD 20014

The Division of Experimental Therapeutics, The Walter Reed Army Institute of Research, in conjunction with the Division of Biometrics, has been engaged in the development and implementation of a large-scale integrated chemical-biological data retrieval system for the support of the Army Medical Research and Development Command's drug development activities. The system is being developed on a Control Data Corporation 3500 with one million bytes of memory, 16 disk drives with removable packs which contain 37 million bytes of storage each, six 7-track tape drives, two line printers, and 16 communication lines supporting line speeds of 110, 300, and 1200 baud. This effort represents a total redesign of the original system, which was described earlier [1].

¹ Present address: Uniformed Services University of the Health Sciences, Bethesda, MD 20014.

This chapter not subject to U.S. copyright. Published 1978 American Chemical Society.

Howe et al.; Retrieval of Medicinal Chemical Information ACS Symposium Series; American Chemical Society: Washington, DC, 1978.

File Organization

The WRAIR Chemical Information Retrieval System (CIRS) comprises four subsystems: Biology, Inventory, Chemistry, and the Report Generator. The first three subsystems contain files of information peculiar to each system, and programs for searching these files. The Report Generator is used to combine and sort output from searches of the other subsystems. The subsystems must be searched separately because they are too large to be searched together. The output from any subsystem may be used to control the search of the next, by means of a common key. For example, a chemistry search yields a number of structures, each of which is identified by a unique accession number. These numbers might then be used to garner information from Biology and Inventory relative to samples of the structures whose numbers come from Chemistry. Or, the list of sample numbers from an Inventory search might be used to extract information from Chemistry relating to those samples. In both examples, the Report Generator would combine the information from the various subsystems and sort the report into the desired order. The standard CIRS report may contain information from any of the subsystems alone, or from any two, or from all three.

Chemistry Retrieval Subsystem. The chemistry retrieval subsystem file design and organization is being described in detail for publication elsewhere [2]. Briefly, the system consists of two numeric index cross-reference files, a screen index file, and a master structure file. It contains about 270,000 unique structures and occupies 8 disk packs of 37 million characters each. The three index files provide a cross-indexing scheme that allows for flexibility in sequencing and updating. The system may be accessed by 1) accession number, a unique number similar in concept to the CAS registry number, which is assigned by the chemistry update system to each new structural formula; 2) sample number, which is a unique, sequential number assigned to each physical sample without regard to the chemical structure by the inventory update system; or 3) chemical structures, either whole or sub-structures.

The accession index file contains the accession number for a given structural formula and a table of sample index records for each accession number. This sequence provides quick access to data for all samples of a particular chemical.
In order to provide continuity and allow for the expression of a hierarchical relationship, the accession number is structured so that functionally different files may be maintained and salts may be tied, through their accession number, to the parent compound. The parts and functions of the parts are as follows:

1) A two-character alphabetic prefix designates series. Currently only two series are being used: "WR" for structures for which a physical sample has been received for screening and "XR" for structures proposed or under consideration but not actually received. An additional series for related structures reported in the literature is planned. The series prefix is automatically upgraded to "WR" if the compound is received and processed through the inventory system.

2) A six-digit sequential number which identifies the primary chemical structure.

3) A two-digit numeric "salt suffix" which is assigned by the update system to different salts of compounds having the same primary structure. This allows the user to retrieve a specific compound and all of its salt forms without doing a sub-structure search. It also allows data on a given compound and all of its salt forms to be grouped together on an accession number sequenced report.
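
The three-part accession number lends itself to a small illustration. The sketch below is not the CIRS implementation; it assumes a written form with no separators (prefix, six digits, two digits), and the names `AccessionNumber`, `parse_accession`, and `same_parent` are hypothetical. It also assumes that a compound and its salts share both the series prefix and the six-digit number.

```python
import re
from typing import NamedTuple


class AccessionNumber(NamedTuple):
    series: str     # two-letter series prefix, "WR" or "XR"
    structure: int  # six-digit sequential number of the primary structure
    salt: int       # two-digit salt suffix


# Assumed written form: prefix + six digits + two digits, no separators.
_PATTERN = re.compile(r"^(WR|XR)(\d{6})(\d{2})$")


def parse_accession(text: str) -> AccessionNumber:
    """Split an accession number into series, structure number, and salt suffix."""
    match = _PATTERN.match(text.strip().upper())
    if match is None:
        raise ValueError(f"not a recognizable accession number: {text!r}")
    series, structure, salt = match.groups()
    return AccessionNumber(series, int(structure), int(salt))


def same_parent(a: AccessionNumber, b: AccessionNumber) -> bool:
    """True when two numbers differ at most in the salt suffix."""
    return a.series == b.series and a.structure == b.structure
```

Because the fields are ordered series, structure, salt, sorting a list of parsed numbers places a compound next to its salt forms, which is the grouping the text describes for accession-sequenced reports.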


The sample index file is keyed by the sample or bottle number. This number is cross-referenced to the accession number so that a given sample may be attached to a specific structure. The sequence permits the chemistry subsystem direct access to either the biology or the inventory subsystem and provides for direct access of the structures for reports by the inventory and/or biology subsystems. The sample index file also contains some administrative information about each sample, such as the source; the method by which the sample was obtained (e.g., gift, purchased, etc.); whether this sample is the original submission or a duplicate; and whether it is discreet (i.e., proprietary) or open.

The screen index file is the first file accessed for any structure or sub-structure search. It contains all the information necessary to determine structure matches. When the structure matches have been located, additional information for each structure, such as the structure picture, may be retrieved via the accession number. Because its organization is index-sequential, the screen index file may be accessed either sequentially, or selectively by use of its indexes.

Each structure has its own record on the screen index. The chief items stored for each are the connection table (in a compressed, non-redundant format) and the structure's unique accession number. The key for each record consists of the accession number, preceded by the structure screen and the partitioning factor.

The screen is a 96-bit superimposed code derived algorithmically from the structure. It has been described in detail by Feldman [3]. If two structures have different screens, they must be different structures. Thus only those file structures having the same screen as a given query compound are candidates for matching. Iterative matching will be necessary to confirm the matches, but the amount of iterative matching required is drastically reduced by the screen.

For sub-structure searches, the inclusive property of the screen becomes significant. In the example below, the first structure (discounting hydrogens, which are not considered in the calculation of the screen) is wholly contained by the second and as a sub-structure would be a match.

[Structure diagrams not reproduced.]

The screen for the first structure is also wholly contained in the screen for the second, i.e., for each bit set in the first screen, the corresponding bit is set in the second. To match a sub-structure, then, a candidate's screen must have at least all the bits that are set in the query's screen. The effectiveness of this system in eliminating file compounds from consideration depends on the nature of the sub-structure and varies greatly.
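
The two screen tests described above reduce to simple bit operations. The following is a minimal sketch, assuming each 96-bit screen is held as a Python integer; how the screen is derived from a structure is not shown (see Feldman [3]), and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple

SCREEN_BITS = 96  # width of the superimposed code, per the text


def may_be_identical(query_screen: int, file_screen: int) -> bool:
    """Identity search: only equal screens can belong to the same structure."""
    return query_screen == file_screen


def may_contain(query_screen: int, file_screen: int) -> bool:
    """Sub-structure search: every bit set in the query screen
    must also be set in the candidate's screen."""
    return (query_screen & file_screen) == query_screen


def screen_out(query_screen: int,
               records: Iterable[Tuple[str, int]]) -> List[str]:
    """Return the keys of records that survive the inclusive screen test.

    Survivors are candidates only; atom-by-atom (iterative) matching
    is still needed to confirm each one.
    """
    return [key for key, screen in records if may_contain(query_screen, screen)]
```

A screen with no bits set passes every candidate, which is one way to see why the effectiveness of the screen varies so much with how specific the query is.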


The partitioning factor is conceptually similar to Hodes' bucket index [4]. It is a 12-bit code derived from the screen through a series of AND and OR operations in such a way that the inclusion properties of the screen are preserved. Its function in the screen index file is to allot records to one of 4096 partitions with a theoretically uniform distribution. Therefore, in a file of 250,000 records the expected number of records in a given partition is 61. Since for identity searches the factor and screen of the query must be matched exactly, only this very small portion of the file need be read prior to the search.

The utilization of the partitioning factor in the sub-structure search is more complex. Any compound containing a given sub-structure will have a partitioning factor which contains at least those bits set in the sub-structure's partitioning factor. As sub-structure queries become more specific, more screen bits are usually set, and more bits are set in the partitioning factor. The number of possible inclusive matches to the factor drops exponentially as the number of one bits rises. Because the screen index has the partitioning factor as its major key, it is necessary to read only those records having the right factor. If, however, the sub-structure is so general that more than one third of the partitions must be accessed randomly, it is quicker to scan the screen index file sequentially.

The master file contains the chemical structure, which has been captured at input and saved in a condensed form.
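
The partitioning-factor arithmetic described above can be made concrete with a short sketch. It assumes the 12-bit factor is held as an integer in the range 0 to 4095; the AND/OR derivation of the factor from the screen is not given in the text and is not attempted here. All function names are hypothetical.

```python
from typing import Iterator

FACTOR_BITS = 12
PARTITIONS = 1 << FACTOR_BITS  # 4096 partitions


def expected_records_per_partition(n_records: int) -> float:
    """Average partition size under a uniform distribution."""
    return n_records / PARTITIONS


def count_inclusive(query_factor: int) -> int:
    """Number of factors containing at least the query's one bits.

    Each additional one bit in the query halves this count.
    """
    ones = bin(query_factor).count("1")
    return 1 << (FACTOR_BITS - ones)


def inclusive_factors(query_factor: int) -> Iterator[int]:
    """Yield every 12-bit factor that includes all bits of query_factor."""
    free = ~query_factor & (PARTITIONS - 1)  # bits the query leaves open
    subset = free
    while True:
        yield query_factor | subset
        if subset == 0:
            break
        subset = (subset - 1) & free  # next smaller subset of the free bits


def search_plan(query_factor: int) -> str:
    """Apply the one-third rule from the text."""
    if count_inclusive(query_factor) > PARTITIONS / 3:
        return "sequential"
    return "partitioned"
```

Under these assumptions, 250,000 records spread over 4096 partitions gives about 61 records per partition, and a query factor with fewer than two bits set already touches more than a third of the partitions, so a sequential scan would be chosen.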
The master file is sequenced by accession number, as this number is automatically assigned to a new structure by the file update system. In addition to the structure, the molecular formula and qualifiers are in this file. The molecular formula has been stored in such a way as to permit searching either as an exact match or as an inclusive match. This format also permits the sorting of matches by molecular formula into CAS sequence for reporting.

Because the connection tables are essentially two-dimensional and do not contain special bond types, many chemical properties such as stereochemistry cannot be represented. The solution to this and similar problems was the inclusion of machine-readable qualifier fields for each structure to indicate such things as stereo information, polymers, mixtures, and coordination complexes. Each qualifier also has non-searchable free text stored with it to aid human interpretation of the picture.

Biology Retrieval Subsystem. The biology retrieval subsystem consists of two indexed sequential files containing biological test data relating to the structures of the chemistry subsystem. There are over three million records occupying 6 disk packs of 37 million characters each. Both files are sequenced by sample number (BN) and laboratory identification number (Lab I.D.). The data fields are dependent on the type of experimentation done by a specified laboratory and are predefined in a data name dictionary. From a user's point of view the two files are identical. The division is arbitrary and is designed to allow searching only part of the data base. The primary file contains those lab IDs most frequently accessed, while the other file contains a number of secondary test systems and historical data. New laboratories can be added to the system by simply adding entries to the data name dictionary under a new lab ID number.

Inventory Retrieval Subsystem. The inventory retrieval subsystem is an indexed sequential file containing information pertinent to the physical samples. It currently contains 433 thousand records and occupies 5 disk packs of 37 million characters each. The file is maintained in sample number sequence. When a sample is received it is assigned the next available sample number, and all available data (i.e., date of receipt, source, amount, condition of receipt, shelf location, chemical and physical properties, etc.) are entered into the record. All transactions involving that sample (shipments to testing laboratories, removal from inventory, etc.) and the dates of the transactions are also entered into the record. The data fields for this file are also predefined in a data name dictionary for searching.

Retrieval Criteria

Chemical Subsystem. The heart of the chemical retrieval subsystem is the sub-structure search capability. The general purpose of sub-structure searching is to retrieve compounds having specified structural similarities.
In our system, the similarities are specified in the form of an incomplete structure, which must be included in any file structure that is to be retrieved. While the file structure may contain atoms and interconnections not shown in the query, those in the query must be matched. Thus, a query sub-structure may contain normal structure atoms and bonds, and indefinite atoms or bonds. The former must be matched exactly, and the latter may be substituted according to the rules governing the particular atom or bond.

Structures, either queries or file compounds, are represented by a connection table. The table contains an entry for each non-hydrogen atom, together with information on the numbers and sizes of covalent bonds on each atom, and the other non-hydrogen atoms attached to it (called "neighbors"). Each entry also shows the number of hydrogens attached to the atom, any ionic charges, and a flag that is set if the atom is in a ring. We have deliberately discarded the knowledge of what type of bond attaches which neighbor, not because it is uninteresting but because doing so allows resonating structures, such as phenyl rings, to appear identical regardless of the precise arrangement of double and single bonds. We also make certain adjustments to tautomers to allow them to be identified by either form regardless of which form is used for input. The details of this have been discussed elsewhere [5].

Our connection tables are not capable of distinguishing stereoisomers or polymers, but codes for the presence of such conditions and text explaining them are stored with the original input formula and are retrieved with it. These are the qualifiers mentioned above. They may also be specified for a chemistry search.

Normal structures are coded for input by means of a specially modified teletype [6], which allows the structural formula to be typed as a combination of atoms and bonds to represent chains and rings. It also allows strings of subscripted element symbols and groups enclosed within parentheses, whose connections must be inferred. The extensive logic necessary to interpret these formulas has been described [7]. The result is a fairly simple set of rules for the chemist writing the formula for input that generally corresponds to normal chemical conventions. These structures are complete, and except for certain Markush-type structures, no ambiguity or substitution is allowed at any point. Queries, on the other hand, are incomplete structures, allowing additions and substitutions at specified points.

Most query atoms are normal atoms. That is, they are not special atoms, they have no unspecified bonds, and their valence is not zero. It is required that they match file atoms. The file structure must contain one identical atom for each normal atom in the query. If the query atom is in a ring, a flag is set in the atom descriptor word and the file atom must also be in a ring. However, if the query atom is not in a ring, the file atom need not be in a ring, but it is allowed to be. For example, the query

Z - NH - Z

will allow [matching structure diagrams not reproduced].

If you wish to force a query atom to be matched only by a file atom which is a ring member, the query atom must be in a ring. If the query cannot be written in such a way as to include the particular query atom in a ring, ring members may perhaps be specified with a special atom.

The simplest type of substitution is that of a special atom. Special atoms appear in the query as atoms with special symbols and may be replaced in the file structure by any atom meeting the criteria they impose. In Figure 1, the query structure contains two special atoms, an "X" which allows the substitution of any non-hydrogen atom and a "Q" which allows the substitution of any non-hydrogen, non-carbon atom. These special atoms may appear anywhere within the query structure; that is, they need not be terminal atoms but may be incorporated in a string or in a ring. In order to be a "hit" to a query, a file atom need only have at least the bonds, charges, etc., indicated for a special atom.


Therefore, the file compounds in Figure 1 contain many atoms other than the normal atoms and the special atoms in the query. The special atoms allowed and their restrictions on ring membership are given in Table I.

Table I. Types of Special Atoms

Symbol   Atom Type        Ring Membership
Z        Any              Indifferent
X        Not H            Indifferent
Xr       Not H            Required
Xc       Not H            Excluded
Q        Not H, not C     Indifferent
Qr       Not H, not C     Required
Qc       Not H, not C     Excluded
Ha       F, Cl, Br, I     Indifferent
Mt       Any metal        Indifferent
Rc       Carbon           Required
Cc       Carbon           Excluded
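
The rules in Table I can be written down as a lookup table plus one test. The sketch below is illustrative only: it covers the atom-type and ring-membership criteria, not bonds or charges, and `FileAtom`, `matches_special`, and `categories_for` are hypothetical names. The eight sample substituents follow one reading of the category numbering in Figure 3; the particular elements chosen (Cl, Fe, N) are assumptions made for the example, and hydrogen is modeled as an atom purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

HALOGENS = {"F", "Cl", "Br", "I"}


@dataclass(frozen=True)
class FileAtom:
    """Hypothetical view of one atom in a file structure."""
    element: str
    in_ring: bool
    is_metal: bool = False  # assumed to be supplied by the caller


def _not_h(atom: FileAtom) -> bool:
    return atom.element != "H"


def _not_h_not_c(atom: FileAtom) -> bool:
    return atom.element not in ("H", "C")


# symbol -> (atom-type test, ring rule)
# ring rule: None = indifferent, True = required, False = excluded
SPECIAL_ATOMS: Dict[str, Tuple[Callable[[FileAtom], bool], Optional[bool]]] = {
    "Z": (lambda a: True, None),
    "X": (_not_h, None),
    "Xr": (_not_h, True),
    "Xc": (_not_h, False),
    "Q": (_not_h_not_c, None),
    "Qr": (_not_h_not_c, True),
    "Qc": (_not_h_not_c, False),
    "Ha": (lambda a: a.element in HALOGENS, None),
    "Mt": (lambda a: a.is_metal, None),
    "Rc": (lambda a: a.element == "C", True),
    "Cc": (lambda a: a.element == "C", False),
}


def matches_special(symbol: str, atom: FileAtom) -> bool:
    """True if the file atom may replace the special atom (type and ring only)."""
    type_ok, ring_rule = SPECIAL_ATOMS[symbol]
    if not type_ok(atom):
        return False
    return ring_rule is None or atom.in_ring == ring_rule


# One illustrative substituent per category (numbering as read from Figure 3).
CATEGORIES: Dict[int, FileAtom] = {
    1: FileAtom("H", False),
    2: FileAtom("Cl", False),
    3: FileAtom("Fe", False, is_metal=True),
    4: FileAtom("N", False),
    5: FileAtom("C", False),
    6: FileAtom("Fe", True, is_metal=True),
    7: FileAtom("N", True),
    8: FileAtom("C", True),
}


def categories_for(symbol: str) -> List[int]:
    """Categories whose sample substituent satisfies the special atom."""
    return [n for n, atom in CATEGORIES.items() if matches_special(symbol, atom)]
```

With these sample atoms, `categories_for("Q")` gives categories 2, 3, 4, 6, and 7, and `categories_for("Xc")` gives 2 through 5, which agrees with the assignments listed in Figure 3.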

Any atom in the query, including special atoms, may carry a charge, which must then be matched by the file atom. The inverse, however, is not true. That is, the absence of a charge on a query atom does not preclude a charged file atom as a match. Because the valence of special atoms is indeterminate (except for Ha, Rc, Cc), all specifically required bonds to a special atom must be shown explicitly. The ring and chain carbons (Rc and Cc) may be used to allow one to write part of a ring as a chain, or to exclude rings, especially fused rings, as answers (see Figure 2).

Figures 3, 4, and 5 show the relationships among the special atoms and how they may be used to modify a query so that it becomes more general or more specific depending on the nature and the number of matches desired. Figure 3 divides the universe of possible substitutions into eight categories and lists which of these categories will be retrieved by each special atom. Figure 4 shows representative structures for each category singly substituted on a methyl group and which of these structures would be retrieved by each special atom. Figure 5 shows a few of the possibilities for an ortho di-substituted phenyl ring. These simple examples give, of course, only an indication of the versatility possible. Combinations of these few special atoms provide a very powerful sub-structure search capacity.

The other major query element is the unspecified bond, written as any bond overstruck with a question mark.
In general, the unspecified bond may be used to allow the connections on the connected atoms to vary, as long as the neighbor relation is maintained. Used between two normal atoms, it requires the two atoms to be neighbors, but allows the valence of the two atoms to vary (Figure 6) or the nature of the attachment between the atoms to vary (Figure 7).

[Figure 1. Example of a simple substructure query. Query and matching structure diagrams not reproduced.]

[Figure 2. Substructure query using Rc and Cc. Structure diagrams for Query A, Query B, and their matches not reproduced.]

Figure 3. Areas of substitution allowed by the special atoms

Category   Substituent
1          H
2          Ha
3          Mt, chain
4          not C, chain
5          C, chain
6          Mt, ring
7          not C, ring
8          C, ring

Z = 1-8;  X = 2-8;  Xc = 2-5;  Xr = 6-8;  Q = 2, 3, 4, 6, 7;  Qc = 2-4;  Qr = 6-7;  Ha = 2;  Mt = 3, 6;  Cc = 5;  Rc = 8

[Figure 4. Substructure substitution at a single point. Chart of methyl queries bearing each special atom against representative responses; chart not reproduced.]

A single query may consist of a sub-structure fragment with normal atoms and any number and combination of special atoms and unspecified bonds. It may in fact consist of only special atoms. It may also consist of several sub-structure fragments, each of which may have special atoms and unspecified bonds. These fragments may be related to each other in any combination of three ways. If the fragments are simply disconnected, each fragment must be distinctly and simultaneously present in the file structure. If, however, two fragments are related through the Boolean operator "AND", they are matched independently. Thus if parts of the fragments are identical, those identical parts are redundant and need not be distinctly present. For example:

    X-[benzene ring]-X · X-CH2-Cl · Cl

requires at least two Cl atoms in the response [example response structures not reproduced], while

    X-[benzene ring]-X AND X-CH2-Cl AND Cl

allows two Cl atoms in the response but only requires one, giving further responses [structure diagrams not reproduced]

in addition to the first responses.

Two fragments may also be related through the Boolean operator "AND NOT". In this case the file structure must not match the specified fragment. Each query may contain up to 32 such fragments in any combination that is not directly contradictory.

Biology and Inventory Retrieval Subsystems. Essentially the same programs are used to search the biology and inventory systems. The major difference is in the data name dictionary that is attached. Each subsystem has a dictionary of all data items in its file. This provides the capability of searching on any field or combination of fields in either data base. A field may be defined as numeric, alpha, alphanumeric, or as a repeating group. Additional flexibility is provided by allowing new fields that are not in the data name dictionary to be defined and used in the search. The search is made up of a search command (SUBSET, or SUBSET, USING), the field(s) to be selected as defined


Query: SfC-C

Matches: CH SCH CH 3

2

Ο