Chemical Coding for Information Retrieval

By W. M. DUFFIN
The Wellcome Research Laboratories, Langley Court, Beckenham, Kent, England

Chemical coding systems nowadays generally are designed for use with punched cards by either manual or mechanical operation. Large numbers (say, more than ten thousand) of edge-punched cards are unsatisfactory. Mechanically sorted cards have been brought to a high efficiency but are not convenient for use away from the machines. Moreover, it is not normally possible to add detailed information to the cards (a) for want of space and (b) because of the difficulty of extracting each separate card as required.

Our objectives were to index all compounds which had been tested by us, all compounds marketed or patented in the pharmaceutical and veterinary fields, and all chemicals which are commercially available. We also wanted to be able to add to the information indexed so that any card showed on inspection all that was known about the compound.

The system here outlined is now dealing satisfactorily with some 70,000 compounds. 5,000 compounds are added each year and some 200 entries per day are made on existing cards. Moreover, by grouping like compounds together, search for one compound automatically reveals others of the same type, and all tests done on them.

The index will not permit the immediate location of a ring system which is not the senior ring present, but experience has shown that this is rarely required, nearly all the enquiries being directed to the senior system. Enquiries directed merely to, say, ether groups do not make much sense without some indication of the other groups present, because there are so many of them. An enquiry for all the thiosemicarbazones was answered in three hours, the answer giving not merely their reference numbers (and there were over a hundred), but all the test results. Again such enquiries are rare, this being the only such occasion in three years. Apart from such routine enquiries as to whether a compound has been examined (about ten a day), an average of three a day are for more detailed information. Nearly all are answered by telephone.

The index is run by one graduate and one non-graduate with research experience, with such clerical assistance as is needed for typing reports.

The compounds are filed on cards, or in loose-leaf binders, in alphabetical order of their coding. Loose-leaf binders are preferred, as cards tend to be misplaced or "borrowed."

Each chemical grouping or ring system is allocated two letters and the coding for a particular compound is obtained by assembling these in reverse alphabetical order. In the majority of cases the coding for a group is abbreviated to the first letter, and the second is added only in special cases, the second letter being lower case, e.g., Jq.

The coding for a compound consists of 3 parts, α/β/γ:

α indicates senior ring present (Table I)
β indicates other rings and groups (Table II)
γ indicates total number of carbons present

In order to bring like compounds together, piperidine, morpholine and pyrrolidine (unsubstituted or substituted by alkyl or halogen) in the presence of other ring systems and acting only as tertiary amines theoretically replaceable by -NMe2 without change of chemical type, are not coded by rings, but as amines, as in PhCH2CH2N(ring) [structural formula not reproduced]. In the absence of other ring systems, or where any other substituent is attached to a carbon of the ring, e.g., [structural formula not reproduced], this rule does not apply, and they qualify to be coded as rings. For a similar reason, methylenedioxy compounds are coded as diethers and not as ring systems.

Each group attached to carbon of a ring is coded individually without reference to other parts of the molecule, e.g., [structural formula not reproduced] is coded as piperidine + 2 carbonyl groups, not as amide. In those cases where an exocyclic group is attached to the hetero-atom of a ring, the hetero-atom is considered as part of the group, e.g., (ring)NCOCH3 is coded as an amide, (ring)NCONH2 as a urea, etc.
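The three-part layout described above can be illustrated with a minimal Python sketch. It assumes the group symbols have already been chosen by a chemist; the function name `assemble_code`, the dictionary input, and the choice to print a subscript count as an inline digit are assumptions made for illustration, not part of the published system. The sample input reproduces the pyrimethamine coding quoted later in the paper.

```python
def assemble_code(senior_ring, groups, carbons):
    """Build an alpha/beta/gamma code string.

    senior_ring: Table I symbol of the senior ring, e.g. "Wl".
    groups: mapping of Table II symbol -> number of occurrences.
    carbons: total number of carbon atoms in the compound.
    """
    parts = []
    # Group symbols are assembled in reverse alphabetical order.
    for symbol in sorted(groups, reverse=True):
        count = groups[symbol]
        # A grouping that occurs more than once carries its count
        # (a subscript in print, an inline digit in this sketch).
        parts.append(symbol if count == 1 else f"{symbol}{count}")
    return f"{senior_ring}/{''.join(parts)}/{carbons}"


# Pyrimethamine: pyrimidine ring, phenyl, two amines, halide, alkyl, 12 carbons.
print(assemble_code("Wl", {"T": 1, "J": 2, "C": 1, "A": 1}, 12))  # Wl/TJ2CA/12
```

The sketch only formats symbols; deciding which ring is senior and which groups are present still follows the rules in the text and tables.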


In those cases where a group may be considered in isomeric forms, the senior coding is used, e.g., uracil, where the oxygen-containing groups are coded as carbonyl, not hydroxyl. If the structure is unknown, the coding is X/ followed by the name.

TABLE I

α, Ring System.--In order of seniority [no distinction made between saturated or unsaturated rings, except Th Cyclohexyl, Ve Piperazine]

Wy   Not otherwise included
Ww   ON rings not otherwise included
Wv   OS rings not otherwise included
Wu   NS rings not otherwise included
Wt   S rings not otherwise included
Ws   O rings not otherwise included
Wr   N rings not otherwise included

These letters are followed by two numbers. The first shows the total number of heterocyclic atoms, the second the total number of rings, e.g., phenothiazine, Wu23; pyrrocoline, Wr12.

Wq   Tetrazine
Wp   Triazine
Wn   Purine
Wm   Pyrazine
Wl   Pyrimidine

Wk, Wg, Wf, We, Wd, Wc, Wb, Wa, Vv: Pyridazine, Acridine, Phenanthridine, Quinoline, Pyridine, Thiopyran, Pyran, Tetrazole [nine codes but only eight names are legible, so individual pairings are not shown]

Vu, Vt, Vr, Vq, Vp, Vn, Vh: 1,2,4-Triazole, 1,2,3-Triazole, Pyrazole, Imidazole, Thiazole, Oxazole, Carbazole, Indole, Pyrrole [two codes are not legible]

Ve   Piperazine
Vd   Thiophen
Vb   Furan
T"   n-Fused carbon rings
Tm   Homocyclic > 7 C atoms
Tl   Cycloheptane
Tk   Cyclopropane
Tj   Cyclobutane
Ti   Cyclopentane
Th   Cyclohexane and Hydrobenzenes
Tg   Tri- and Tetraphenylmethane
Tf   Diphenylmethane
Tn   n-Unfused benzene rings
T    Benzene
0    No ring present
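The two-number suffix used with the "not otherwise included" ring classes can be shown in a short Python sketch. The function name `unlisted_ring_code` is hypothetical, and the sketch assumes each count is written as a plain decimal digit, as in the two examples the table gives; how counts of ten or more would be written is not stated in the source.

```python
def unlisted_ring_code(class_symbol, hetero_atoms, rings):
    """Append the hetero-atom count and ring count to a class symbol.

    class_symbol: one of the "not otherwise included" symbols, e.g. "Wu".
    hetero_atoms: total number of heterocyclic atoms in the ring system.
    rings: total number of rings in the ring system.
    """
    if hetero_atoms < 0 or rings < 1:
        raise ValueError("counts must describe a real ring system")
    return f"{class_symbol}{hetero_atoms}{rings}"


print(unlisted_ring_code("Wu", 2, 3))  # phenothiazine -> Wu23
print(unlisted_ring_code("Wr", 1, 2))  # pyrrocoline  -> Wr12
```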

Referring to Table I, when benzene rings are the senior ring present, and there are 'n' of them, use the symbol Tn in the R position of the code. In all other cases of duplication of the senior ring, use the simple symbol in the α-position, and indicate the other(s) in the β-position.

If it is required to break down groups Wr to Ww further, this may be done either by indicating the "component" nuclei by their separated letters or by indicating the number of heterocyclic atoms in each, e.g., [structural formula not reproduced] Wr32 may be sub-divided as Wl, Vf or as 2,1.

TABLE II

β, Other Rings and Groups.--In order of seniority. Normally only the first letter is used.

W   As Table I.
V   As Table I.
U   Ux Element other than C, H, N, O, S or halogen, where x is the symbol of the element. Alkali or alkaline earth metals occurring as salts of acids are not included.
T   As Table I. Normally abbreviated to T, but for Tf put T2, for Tg put T3 or T4, for Tn put Tn.
S   Sa Thiocyanates; Se isothiocyanates; Sg thiosemicarbazides and S thiosemicarbazones; Sj thiourea; Sk -NHCSSH; Sn sulfamic acids.
R   Ra Sulfonic acids; Rb sulfonyl halides; Rc sulfonamides; Rd sulfonates; Rj sulfones; Rm sulfoxides; Rp -COSH; Rq -CSOH; Rt -CSSH; Ru sulfinic acids; Ru sulfenic acids.
Q   Qa Mercaptans; Q: sulfur chlorides; Qe sulfides; Qj disulfides; Qp thioketones.
P   Pa Ureas; Pe guanidines; Pf semicarbazides; Pg semicarbazones; Pj urethans; Pp amidines; Pq amidoximes.
N   Nd Cyanates; Ne isocyanates; Nf C-nitroso; Ng N-nitroso; Nj hydroxylamines or N-oxides; Nk oximes; Np hydroxamic acids.
M   Ma Nitro; Mz nitramine.
L   La Cyanides; Lm isocyanides; Ls cyanamides.
K   Ka Diazonium compounds; Kb azo compounds; Kc diazoamino; Kd azoxy; Kh hydrazines and hydrazides; Km acid azides; Kn diazoketones; Kp azides.
J   Ja Amines; JI 3-haloamines; Jq amides; Jv proteins and peptides; Jy imino-ethers; Jz imino-chlorides.
H   Ha Esters of carboxylic acids; He acid halides; Hj acid anhydrides.
G   Ga Carboxylic acids.
F   Fa Oxides; Fb ozonides; Fc peroxides; Fe ethers; Ff acetals; Fm ketenes; Fs carbohydrates; Ft glycosides.
E   Ea Carbonyl; Ez quinones.
D   Da Hydroxyl.
C   Ca Halides.
B   Ba Olefinic bonds in chain; Bc acetylenic bonds in chain.
A   Alkyl group (with no codable substituent) attached to C of ring.

If a grouping occurs more than once, a subscript number is added to the symbol (see Examples). "Onium" compounds are indicated by a superscript +, e.g., J+ ammonium; C+ iodonium.

EXAMPLES

[Structural formulas not reproduced. Legible formulas: CH3CH2OH; PhCH(O...)CHMeNHMe. Legible codes: Ve/V2B/16; Wu23/J/18; Ve/J2/26, the last carrying a superscript +.]

It has been found convenient to treat phosphorus compounds and others of the U class in an arbitrary rather than strictly chemical manner. Phosphorus esters P(OR) are coded F (as if they were ethers), PS and P(SR) as Q and P(NH2) as J. P(O), P-O-P and PCl are ignored.

Example: (EtO)2P(S)O[CH2]2SEt    0/UpQ2F3/8
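The phosphorus conventions above amount to a small lookup table, sketched here in Python. The fragment labels used as dictionary keys, the name `PHOSPHORUS_RULES` and the function `code_phosphorus_fragments` are hypothetical; only the fragment-to-symbol assignments come from the text. The worked example reflects one reading of the quoted coding, in which the three P-O-C links give F3 and the P=S link gives one Q.

```python
from collections import Counter

# Fragment -> group symbol, following the conventions stated in the text.
# None marks a fragment that is ignored in the coding.
PHOSPHORUS_RULES = {
    "P(OR)": "F",   # phosphorus ester, coded as if it were an ether
    "PS": "Q",
    "P(SR)": "Q",
    "P(NH2)": "J",
    "P(O)": None,
    "P-O-P": None,
    "PCl": None,
}


def code_phosphorus_fragments(fragments):
    """Count the group symbols contributed by phosphorus fragments."""
    counts = Counter()
    for fragment in fragments:
        symbol = PHOSPHORUS_RULES[fragment]
        if symbol is not None:
            counts[symbol] += 1
    return dict(counts)


# (EtO)2P(S)O[CH2]2SEt: three P(OR) links and one PS link.
print(code_phosphorus_fragments(["P(OR)", "P(OR)", "P(OR)", "PS"]))
# {'F': 3, 'Q': 1}
```

On this reading, the remaining Q in UpQ2F3 comes from the sulfide group of the side chain, and Up marks the element phosphorus itself.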


In use, if details of a particular compound are required, the full coding is worked out and the record is found immediately. Homologs, differing only in the number of carbon atoms, will be found on each side of it. If more general information is required, or it is required to code for a wide group of compounds (e.g., patents), the coding is taken only far enough to cover the general class. For example, phenylpyrimidines are covered by Wl/T and under this heading will be found general references covering this series of compounds. Following it, in order of coding, will be found references where more detailed coding is possible, e.g., Wl/TJ for amino derivatives, and finally the individual compounds, e.g., pyrimethamine at Wl/TJ2CA/12.

Because of the ease with which a particular compound can be located (seconds only) it is a simple matter to add further information to an existing sheet, or to see what analogs have been examined.

A mechanically sorted punched card installation which has been maintained for 14 years is so much slower than the master index that it is to be abandoned. The cards used were standard I.C.T. 65-column cards, with a column allotted to each item of the β-coding, and a group of columns provided for the α-coding. This system was of use for location of "junior" groupings but took a considerable time because all or nearly all the cards had to be searched, and when the search was done the numbers obtained had to be referred to the manual index to get the test results.

The chief use made of the punched cards in recent years has been the production of a complete index each year (in order of coding) which was distributed to the various research laboratories for their information. The cessation of this index will be a considerable loss, but is being overcome by more efficient arrangements for answering questions from the manual index by telephone. Again, experience has shown that many people have preferred to ask the central index: it is less trouble to themselves, it is up-to-date, and it contains all the information.

This coding also is being used for a classified index to reactions. For this, six letters are used. For example, GaJq/Ja includes all those methods by which a carboxylic acid (Ga) is converted into an amide (Jq) by reaction with an amine (Ja). For this purpose A and Z letters as "operators" are used (Table III). Y letters are used for biological activities.
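The class-level lookup described earlier in this section (Wl/T, then Wl/TJ, then the individual compounds) behaves like a prefix search over a sorted list. The Python sketch below is an illustration under assumptions: the function name `find_by_prefix` is hypothetical, and the list is ordered by Python's default string comparison, which may differ in detail from the filing order used in the actual binders (for instance in how carbon numbers are ordered).

```python
from bisect import bisect_left


def find_by_prefix(sorted_codes, prefix):
    """Return every code in a sorted list that starts with the prefix."""
    # Entries sharing a prefix sit next to each other in a sorted list,
    # so jump to the first candidate and read until the prefix changes.
    start = bisect_left(sorted_codes, prefix)
    matches = []
    for code in sorted_codes[start:]:
        if not code.startswith(prefix):
            break
        matches.append(code)
    return matches


# Codes taken from the paper; general headings and full codings together.
index = sorted(["Wl/T", "Wl/TJ", "Wl/TJ2CA/12", "Wu23/J/18", "Ve/V2B/16"])
print(find_by_prefix(index, "Wl/T"))
# ['Wl/T', 'Wl/TJ', 'Wl/TJ2CA/12']
```

Taking the coding "only far enough to cover the general class" corresponds to shortening the prefix, which widens the run of neighbouring entries returned.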


TABLE III

Aa   Reduction          Am   Nitration          Za   Characterization
Ab   Dehalogenation     An   Nitrosation        Zb   Stability
Ac   Halogenation       Ap   Disproportion      Zc   Optical Resolution
Ad   Hydrolysis         Aq   Sulfur             Ze   Physical Properties
Ae   CO                 Ar   Sulfonation        Zm   Metabolism
Af   Oxidation          As   Desulfurization    Zn   Nomenclature
Ag   CO2                Au   Metals             Zs   Structure
Ai   Isomerization      Aw   Dehydration        Zx   Isolation
Aj   Ammonia            Ax   Polymerization     Zy   Availability
Al   HCN or (CN)2       Az   Degradation        Zz   Methods of synthesis
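The six-letter reaction key introduced before Table III can be split into its parts with a few lines of Python. The layout assumed here (starting group and product group written together, then a stroke, then the reagent group) is inferred from the single example GaJq/Ja; the function name `parse_reaction_key` is hypothetical, and the way the A and Z operator letters enter a key is not described in the text, so the sketch does not model it.

```python
def parse_reaction_key(key):
    """Split a key such as 'GaJq/Ja' into (starting, product, reagent)."""
    pair, reagent = key.split("/")
    # Each part is a two-letter group symbol, six letters in all.
    if len(pair) != 4 or len(reagent) != 2:
        raise ValueError(f"expected six letters in the form XxYy/Zz: {key!r}")
    return pair[:2], pair[2:], reagent


# Carboxylic acid (Ga) converted into an amide (Jq) with an amine (Ja).
print(parse_reaction_key("GaJq/Ja"))  # ('Ga', 'Jq', 'Ja')
```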
