New Speed to Structural Searches - C&EN Global Enterprise (ACS

Nov 5, 2010 - Out of this effort came the concept of using high speed computers to conduct the search and a new approach to the coding of chemical...
LITERATURE

ASCHER OPLER and TED R. NORTON*, Research Department, Western Division, The Dow Chemical Co., Pittsburg, Calif.

New Speed to Structural Searches

With high speed computers, 1000 organic compounds can be searched for desired structural features in two to 15 seconds

THE DOW CHEMICAL CO., like the rest of the industry, must keep track of thousands of compounds synthesized and studied by its numerous research groups. As the rate of accession of new compounds has increased, pressure has mounted for a satisfactory way to search long lists of compounds for specified structural groupings. In the field of structure-property correlation, the handling of the structure problem is far more difficult than that of scanning physical property tabulations.

The authors started work in late 1952 to develop a suitable system for searching. Out of this effort came the concept of using high speed computers to conduct the search and a new approach to the coding of chemical structures. This approach has been tested in two pilot computer studies. The computer program is highly specific and versatile and can readily distinguish between such compounds as (A) and (B):

[Structural formulas (A) and (B), drawn with OH, COOH, and Br substituents]

One reel of magnetic tape can store more than 25,000 coded structures of average complexity, and a typical list of 1000 organic chemicals can be searched for desired structural features in two to 15 seconds. Condensation is such that the structures of one million compounds can be stored on tape reels occupying less than thirty inches of shelf space. The system could also be used to code inorganic compounds.

* Present address of Norton: Agricultural Chemical Research Laboratory, The Dow Chemical Co., Midland, Mich.

2812 C&EN JUNE 4, 1956

The Coding System

The Dow coding system, which reduces chemical structures to sequences of digits, is satisfactory for machine searching but impractical for ciphering for purposes of simplified typesetting, condensation of indexes, and shorthand. It was designed to consider both the requirements of chemical researchers and the capabilities of digital computers, general criteria being that:

• The code must be able to supply an unequivocal representation for every chemical compound characterized.
• The code must be simple to learn and to apply.
• The coding system must be expandable, modifiable, and checkable, these tasks to be performed by the machine.

The system used is actually one for encoding networks (for example, electrical, highway, etc.). To code, one regards the compound as a structural network of chemical groups. Using intuition and experience, a judicious selection of 332 structural elements (such as ether, ester, or methoxy) was assigned arbitrary three-digit numbers. This represents a compromise between using single atoms as structural elements, which makes the system too unwieldy, and using too highly specific groups like dichloroacetyl, which would reduce its versatility. When coding, one first codes any one of the 332 structural elements that may

Ascher Opler studied at Brooklyn College and the University of Michigan, receiving a B.S. in chemistry from Wayne University in 1944. He is presently an associate scientist of the research department, Western Division, The Dow Chemical Co., Pittsburg, Calif.

The idea for applying high speed computers for searching chemical compounds for structural groupings got its start because someone suggested to Opler and Ted R. Norton "If all scientists could devote 1% of their time to the problem of handling efficiently the vast and rapidly growing mass of technical information available for research purposes, that problem would be well on its way toward solution."

Ted R. Norton received his A.B. in chemistry from the College of the Pacific in 1940 and his Ph.D. in chemistry from Northwestern University in 1943. He is presently director of the Agricultural Chemical Research Laboratory, The Dow Chemical Co., Midland, Mich.

Late in 1952, Norton and Opler decided to tackle the problem of searching chemical compounds for structural groupings. They then sat down to work out the general details of their solution. They did so in roughly two weeks, and have been dotting the i's ever since.

Certain special problems had to be treated before the system could be considered comprehensive. Among structural features requiring special attention are condensed rings, heterocyclics, bridged compounds, stereoisomers, unsaturates, chelates, and carbohydrates. Some of these special problems are illustrated in the coding of quinine hydrochloride. (See page 2814, "For the specialist.")

A relatively simple compound, 2-chloro-4-isopropylbenzoic acid, is coded as shown. Since one can start with any group and number sequentially in any order, this compound could be coded in many different ways.

[Structural diagram of 2-chloro-4-isopropylbenzoic acid divided into four groups, labeled GRP. 1 through GRP. 4]

Note the groups present and number positions within them in the usual way.

To Code:

1. List the arbitrarily assigned 3-digit numbers of the groups present.

        106        benzene
        061        chloro
        304        acid
        003        propyl

2. Number groups sequentially, starting with any group. In this case, benzene is 1, chloro is 2, etc.

        106    1   benzene
        061    2   chloro
        304    3   acid
        003    4   propyl

3. For each group, list the number of the previous group to which it is attached. Benzene, being group 1 in this case, has no number under this step. The previous group to which chloro is attached is benzene, group 1, etc.

        106    1   benzene
        061  1 2   chloro
        304  1 3   acid
        003  1 4   propyl

4. List the position in each group by which it is attached to the previous group. For instance, propyl is attached by its 2 position to the previous group, benzene. Benzene, again, has no number under this step.

        106    1   benzene
      1 061  1 2   chloro
      1 304  1 3   acid
      2 003  1 4   propyl

5. For each group, list the position in the previous group to which it is attached. For instance, chloro is attached to the 2 position in benzene. Again, benzene has no number under this step.

        106    1   benzene
    2 1 061  1 2   chloro
    1 1 304  1 3   acid
    4 2 003  1 4   propyl

6. Assemble digits into final code.

    /106-1/2106112/1130413/4200314/ = 2-chloro-4-isopropylbenzoic acid
    (benzene / chloro / acid / propyl)
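The six coding steps illustrated for 2-chloro-4-isopropylbenzoic acid can be sketched in a few lines of Python. This is only an illustration of the digit layout, not the authors' procedure, which was carried out by hand. The `Group` class, its field names, and the `encode` function are hypothetical. The four group numbers are the ones shown in the worked example, and the sketch assumes every position and serial number fits in a single digit, since the article does not say how larger values are handled.

```python
from dataclasses import dataclass
from typing import List, Optional

# Group numbers read off the worked example. The article assigns 332 such
# numbers but shows only these four, so this table is deliberately partial.
GROUP_NUMBERS = {"benzene": "106", "chloro": "061", "acid": "304", "propyl": "003"}


@dataclass
class Group:
    """One cited group. Field names are illustrative, not from the article."""
    name: str
    parent: Optional[int] = None           # digit 6: serial of previously cited group
    own_position: Optional[int] = None     # digit 2: position in this group
    parent_position: Optional[int] = None  # digit 1: position in the previous group


def encode(groups: List[Group]) -> str:
    """Assemble the slash-delimited code, one field per group in citation order."""
    fields = []
    for serial, group in enumerate(groups, start=1):  # digit 7: serial order
        number = GROUP_NUMBERS[group.name]            # digits 3 to 5
        if serial == 1:
            # The first group has no attachment digits; the example prints it as 106-1.
            fields.append(f"{number}-{serial}")
            continue
        if group.parent is None or not 1 <= group.parent < serial:
            raise ValueError(f"group {serial} must attach to an earlier group")
        fields.append(
            f"{group.parent_position}{group.own_position}{number}{group.parent}{serial}"
        )
    return "/" + "/".join(fields) + "/"


EXAMPLE = [
    Group("benzene"),
    Group("chloro", parent=1, own_position=1, parent_position=2),
    Group("acid", parent=1, own_position=1, parent_position=1),
    Group("propyl", parent=1, own_position=2, parent_position=4),
]

print(encode(EXAMPLE))  # /106-1/2106112/1130413/4200314/
```

Listing the same four groups in a different order would give a different but equally valid code, which is why no chemical ordering rules are needed.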

be present in the given compound. He next codes an adjacent group and then moves sequentially from group to group until the entire structure is coded. A given compound could thus be coded in many different ways, all of which are equivalent as far as the computer is concerned. For this reason, and because there are no coding rules based on chemical principles, coding is quite simple. Those with moderate technical training can learn to code quite rapidly.

In its simplest form, the code consists of a sequence of seven-digit numbers. In general these indicate:

    Digits               Indicate
    3rd, 4th, and 5th    Chemical nature of the group: one of the 332 arbitrarily assigned 3-digit numbers.
    7th                  The group's serial order of citation (1, 2, 3, 4, etc.).
    6th                  The number of the previously cited group to which the group being cited is attached.
    2nd                  The position in the group being cited by which it is attached to the previously cited group.
    1st                  The position in the previously cited group to which the group being cited is attached.

In certain infrequent cases, digit 3 takes on a special interpretation.

Coding Experience

Experience on 15,000 compounds has proved that coding generally goes accurately and rapidly. Coding compounds of average complexity required less than two minutes per compound. In a preliminary study, the compounds in "Organic Chlorine Compounds," by Huntress (much simpler than our average), were coded by the authors at rates as high as 100 per hour.

In order to code the 15,000 compounds, 50 volunteers undertook to code lots of approximately 300 randomly selected compounds. The coding was accomplished in a few weeks with little basic misunderstanding. The codes were all hand-checked by reconstructing the original compound by decoding. After key-punching onto IBM cards, a number of automatic checks (such as nonassigned groups and impossible attachments) were made and the errors (these either had eluded the checkers or were key-punch mistakes) corrected.

The empirical formulas of the compounds, already available on IBM cards, were added so that further checks could be carried out. For example, compounds with two chlorines in groups and none in the empirical formula (or vice versa) demanded further inspection.
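The automatic checks mentioned above might look roughly like the following sketch. The function names, the partial group table, and the chlorine-count table are all hypothetical. The article does not define "impossible attachments", so the sketch assumes one plausible meaning: a group that claims attachment to a group not yet cited. It also assumes single-digit serial numbers and treats any difference in chlorine count as worth flagging.

```python
import re
from typing import List, Optional, Tuple

# Partial tables for illustration; only these four group numbers appear in the article.
ASSIGNED_GROUPS = {"106": "benzene", "061": "chloro", "304": "acid", "003": "propyl"}
CHLORINES_IN_GROUP = {"061": 1}  # assumed chlorine content per group number


def parse_fields(code: str) -> List[Tuple[str, Optional[int], int]]:
    """Split '/106-1/2106112/' into (group number, parent serial, own serial)."""
    parsed = []
    for field in code.strip("/").split("/"):
        if "-" in field:  # first group, printed as 106-1
            number, serial = field.split("-")
            parsed.append((number, None, int(serial)))
        elif len(field) == 7 and field.isdigit():
            parsed.append((field[2:5], int(field[5]), int(field[6])))
        else:
            raise ValueError(f"malformed field: {field!r}")
    return parsed


def count_atoms(formula: str, element: str) -> int:
    """Count one element in an unbracketed empirical formula such as C10H11ClO2."""
    total = 0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol == element:
            total += int(count) if count else 1
    return total


def check_code(code: str, empirical_formula: str) -> List[str]:
    """Return a list of problems that would send the card back for inspection."""
    problems = []
    chlorines = 0
    for number, parent, serial in parse_fields(code):
        if number not in ASSIGNED_GROUPS:
            problems.append(f"group {serial}: nonassigned group number {number}")
        if parent is not None and not 1 <= parent < serial:
            problems.append(f"group {serial}: impossible attachment to group {parent}")
        chlorines += CHLORINES_IN_GROUP.get(number, 0)
    in_formula = count_atoms(empirical_formula, "Cl")
    if chlorines != in_formula:
        problems.append(f"chlorine mismatch: {chlorines} in groups, {in_formula} in formula")
    return problems


print(check_code("/106-1/2106112/1130413/4200314/", "C10H11ClO2"))  # []
```

An empty list means the card passes; each message otherwise points at the group or the formula that needs a second look.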


Principles of Machine Searching

Once the compounds have been coded (reduced to their digital equivalent) and key-punched on cards, they are ready for storage in and search by a high speed computer. The searching principles are the same, whatever machine is used; only the specific instructions fed to the machine are different.

The search takes place by a series of logical steps which may be likened to a set of hurdles over which every compound meeting the requirements must pass. As soon as the compound fails to leap a hurdle (requirement), it is out of the race (rejected at once). Some simple searches involve only one or two hurdles while more complex questions are like long races with many hurdles.

In correlating physical or biological properties with structure, one seeks information about compounds described only vaguely. Often, this takes the form of a substructure describable in general terms like: All compounds of interest must contain at least the following: a benzene ring attached to a pyridine ring with two bromo-groups on the benzene ring, at least one of which is ortho to the pyridyl group, and the compound must not have any sulfur-containing groups.

Any number of special searching schemes can be devised, but the following basic method seems most universal. We start with the chemical substructure we are seeking, and we instruct the machine to find all the compounds that contain this substructure as a portion of their complete structure. The computer calls the coded compounds from storage one at a time, and for each such compound it will:

• Compare the empirical formula of the coded compound with that of the test substructure. Does the coded compound contain a larger, smaller, or equal number of stated atoms (6 C, 2 Cl, etc.) compared to the test substructure?
• Compare the various groups in the coded compound with those in the test structure, rejecting the coded compounds obviously lacking the minimum number of required groups.
• Determine whether two groups are indirectly connected through a common group. This is done by using certain logical relations implicit in the codes.
• Where such indirect connection is found and positional relationships are


For th