CHEMICAL CODING FOR INFORMATION RETRIEVAL
By W. M. DUFFIN
The Wellcome Research Laboratories, Langley Court, Beckenham, Kent, England
Chemical coding systems nowadays generally are designed for use with punched cards by either manual or mechanical operation. Large numbers (say, more than ten thousand) of edge-punched cards are unsatisfactory. Mechanically sorted cards have been brought to a high efficiency but are not convenient for use away from the machines. Moreover, it is not normally possible to add detailed information to the cards (a) for want of space and (b) because of the difficulty of extracting each separate card as required.

Our objectives were to index all compounds which had been tested by us, all compounds marketed or patented in the pharmaceutical and veterinary fields, and all chemicals which are commercially available. We also wanted to be able to add to the information indexed so that any card showed on inspection all that was known about the compound.

The system here outlined is now dealing satisfactorily with some 70,000 compounds. 5,000 compounds are added each year and some 200 entries per day are made on existing cards. Moreover, by grouping like compounds together, search for one compound automatically reveals others of the same type, and all tests done on them. The index will not permit the immediate location of a ring system which is not the senior ring present, but experience has shown that this is rarely required, nearly all the enquiries being directed to the senior system. Enquiries directed merely to, say, ether groups do not make much sense without some indication of the other groups present, because there are so many of them. An enquiry for all the thiosemicarbazones was answered in three hours, the answer giving not merely their reference numbers (and there were over a hundred), but all the test results. Again such enquiries are rare, this being the only such occasion in three years.

Apart from such routine enquiries as to whether a compound has been examined (about ten a day), an average of three a day are for more detailed information. Nearly all are answered by telephone. The index is run by one graduate and one non-graduate with research experience, with such clerical assistance as is needed for typing reports.

The compounds are filed on cards, or in loose-leaf binders, in alphabetical order of their coding. Loose-leaf binders are preferred, as cards tend to be misplaced or "borrowed."

Each chemical grouping or ring system is allocated two letters and the coding for a particular compound is obtained by assembling these in reverse alphabetical order. In the majority of cases the coding for a group is abbreviated to the first letter, and the second is added only in special cases, the second letter being lower case, e.g., Jq.

The coding for a compound consists of 3 parts, α/β/γ:
α indicates senior ring present (Table I)
β indicates other rings and groups (Table II)
γ indicates total number of carbons present

In order to bring like compounds together, piperidine, morpholine and pyrrolidine (unsubstituted or substituted by alkyl or halogen), in the presence of other ring systems and acting only as tertiary amines theoretically replaceable by -NMe2 without change of chemical type, are not coded by rings, but as amines, as in
PhCH2CH2N(ring) [structural formula not recoverable from the source].

In the absence of other ring systems, or where any other substituent is attached to a carbon of the ring, e.g., [structural formula not recoverable from the source], this rule does not apply, and they qualify to be coded as rings. For a similar reason, methylenedioxy compounds are coded as diethers and not as ring systems.

Each group attached to carbon of a ring is coded individually without reference to other parts of the molecule, e.g., [structural formula not recoverable from the source] is coded as piperidine + 2 carbonyl groups, not as amide. In those cases where an exocyclic group is attached to the hetero-atom of a ring, the hetero-atom is considered as part of the group, e.g., (ring)NCOCH3 is coded as an amide, (ring)NCONH2 as a urea, etc.
In those cases where a group may be considered in isomeric forms, the senior coding is used, e.g., uracil, where the oxygen-containing groups are coded as carbonyl, not hydroxyl.

If the structure is unknown, the coding is X/ followed by the name.

TABLE I
α, Ring System.--In order of seniority [no distinction made between saturated or unsaturated rings, except Th Cyclohexyl, Ve Piperazine]

Wy    Not otherwise included
Ww    ON rings not otherwise included
Wv    OS rings not otherwise included
Wu    NS rings not otherwise included
Wt    S rings not otherwise included
Ws    O rings not otherwise included
Wr    N rings not otherwise included
      (Wy to Wr: these letters are followed by two numbers. The first shows the total number of heterocyclic atoms, the second the total number of rings, e.g., phenothiazine, Wu23; pyrrocoline, Wr12.)
Wq    Tetrazine
Wp    Triazine
Wn    Purine
Wm    Pyrazine
Wl    Pyrimidine
Wk, Wg, Wf, We, Wd, Wc, Wb, Wa
      Pyridazine; Acridine; Phenanthridine; Quinoline; Pyridine; Thiopyran; Pyran [eight symbols but only seven names survive in the source, so the pairing is not recoverable]
Vv    Tetrazole
Vu    1,2,4-Triazole
Vt    1,2,3-Triazole
Vr    Pyrazole
Vq    Imidazole
Vp    Thiazole
Vn    Oxazole
Vh    Carbazole
      Indole [symbol illegible in the source]
      Pyrrole [symbol illegible in the source]
Ve    Piperazine
Vd    Thiophen
Vb    Furan
Tⁿ    n-Fused carbon rings
Tm    Homocyclic > 7C atoms
Tl    Cycloheptane
Tk    Cyclopropane
Tj    Cyclobutane
Ti    Cyclopentane
Th    Cyclohexane and Hydrobenzenes
Tg    Tri- and Tetraphenylmethane
Tf    Diphenylmethane
Tn    n-Unfused benzene rings
T     Benzene
0     No ring present
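The two-number suffix carried by the symbols Wy to Wr can be read mechanically. What follows is a minimal Python sketch and not part of the original system: the name `parse_alpha` and the lookup dictionary are illustrative, and the sketch assumes that each count is a single digit, since the source does not say how larger counts would be written.

```python
import re

# Table I symbols that carry a two-number suffix (illustrative lookup table).
RING_CLASSES = {
    "Wy": "Not otherwise included",
    "Ww": "ON rings not otherwise included",
    "Wv": "OS rings not otherwise included",
    "Wu": "NS rings not otherwise included",
    "Wt": "S rings not otherwise included",
    "Ws": "O rings not otherwise included",
    "Wr": "N rings not otherwise included",
}


def parse_alpha(symbol):
    """Split a symbol such as 'Wu23' into ring class, hetero-atom count and ring count."""
    # Assumption: two letters followed by exactly two single-digit numbers.
    match = re.fullmatch(r"([A-Z][a-z])(\d)(\d)", symbol)
    if match is None or match.group(1) not in RING_CLASSES:
        raise ValueError(f"not a suffixed ring symbol: {symbol!r}")
    letters, hetero, rings = match.groups()
    return {
        "symbol": letters,
        "ring_class": RING_CLASSES[letters],
        "hetero_atoms": int(hetero),  # first number: total heterocyclic atoms
        "rings": int(rings),          # second number: total rings
    }


print(parse_alpha("Wu23"))  # phenothiazine, per the note to Table I
```

Read this way, phenothiazine (Wu23) is an NS ring system with two hetero-atoms in three rings, and pyrrocoline (Wr12) an N ring system with one hetero-atom in two rings.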
Referring to Table I, when benzene rings are the senior ring present, and there are 'n' of them, use the symbol Tn in the α-position of the code. In all other cases of duplication of the senior ring, use the simple symbol in the α-position, and indicate the other(s) in the β-position. If it is required to break down groups Wr to Ww further, this may be done either by indicating the "component" nuclei by their separated letters or by indicating the number of heterocyclic atoms in each, e.g., [structural formula not recoverable from the source], Wr32, may be sub-divided as Wl, Vf or as 2,1.

TABLE II
β, Other Rings and Groups.--In order of seniority. Normally only the first letter is used.

W    As Table I.
V    As Table I.
U    Ux Element other than C, H, N, O, S or halogen, where x is the symbol of the element. Alkali or alkaline earth metals occurring as salts of acids are not included.
T    As Table I. Normally abbreviated to T, but for Tf put T2, for Tg put T3 or T4, for Tn put Tn.
S    Sa Thiocyanates; Se isothiocyanates; Sg thiosemicarbazides and thiosemicarbazones; Sj thiourea; Sk -NHCSSH; Sn sulfamic acids.
R    Ra Sulfonic acids; Rb sulfonyl halides; Rc sulfonamides; Rd sulfonates; Rj sulfones; Rm sulfoxides; Rp -COSH; Rq -CSOH; Rt -CSSH; Ru sulfinic acids; Ru sulfenic acids.
Q    Qa Mercaptans; Q[second letter illegible in the source] sulfur chlorides; Qe sulfides; Qj disulfides; Qp thioketones.
P    Pa Ureas; Pe guanidines; Pf semicarbazides; Pg semicarbazones; Pj urethans; Pp amidines; Pq amidoximes.
N    Nd Cyanates; Ne isocyanates; Nf C-nitroso; Ng N-nitroso; Nj hydroxylamines or N-oxides; Nk oximes; Np hydroxamic acids.
M    Ma Nitro; Mz nitramine.
L    La Cyanides; Lm isocyanides; Ls cyanamides.
K    Ka Diazonium compounds; Kb azo compounds; Kc diazoamino; Kd azoxy; Kh hydrazines and hydrazides; Km acid azides; Kn diazoketones; Kp azides.
J    Ja Amines; Jl 3-haloamines; Jq amides; Jv proteins and peptides; Jy imino-ethers; Jz imino-chlorides.
H    Ha Esters of carboxylic acids; He acid halides; Hj acid anhydrides.
G    Ga Carboxylic acids.
F    Fa Oxides; Fb ozonides; Fc peroxides; Fe ethers; Ff acetals; Fm ketenes; Fs carbohydrates; Ft glycosides.
E    Ea Carbonyl; Ez quinones.
D    Da Hydroxyl.
C    Ca Halides.
B    Ba Olefinic bonds in chain; Bc acetylenic bonds in chain.
A    Alkyl group (with no codable substituent) attached to C of ring.

If a grouping occurs more than once, a subscript number is added to the symbol (see Examples). "Onium" compounds are indicated by a superscript +, e.g., J+ ammonium; C+ iodonium.

EXAMPLES
Formulas surviving in the source: CH3CH2OH; PhCH(OH)CHMeNHMe.
Codings surviving in the source (their structural formulas are not recoverable): Ve/V2B/16; Wu23/J/18; Ve/J2+/26.
It has been found convenient to treat phosphorus compounds and others of the U class in an arbitrary rather than strictly chemical manner. Phosphorus esters P(OR) are coded F (as if they were ethers), PS and P(SR) as Q and P(NH2) as J. P(O), P-O-P and PCl are ignored. Example: (EtO)2P(S)O[CH2]2SEt is coded 0/UpQ2F3/8.
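The assembly of a full coding from its three parts can be illustrated with a short Python sketch. It is only an approximation of the manual procedure: the function name `assemble_code` is hypothetical, subscripts are written inline as plain digits, and an ordinary reverse string sort stands in for "reverse alphabetical order" (the source does not state how a two-letter symbol such as Jq is placed relative to the bare letter J).

```python
def assemble_code(senior_ring, groups, carbons):
    """Build an alpha/beta/gamma coding.

    senior_ring: Table I symbol for the senior ring, or "0" if no ring is present
    groups:      mapping of Table II symbol -> number of occurrences
    carbons:     total number of carbon atoms
    """
    parts = []
    # Assumption: plain reverse string order approximates the seniority order.
    for symbol in sorted(groups, reverse=True):
        count = groups[symbol]
        # A grouping that occurs more than once carries its count (a subscript in print).
        parts.append(symbol if count == 1 else f"{symbol}{count}")
    return f"{senior_ring}/{''.join(parts)}/{carbons}"


# (EtO)2P(S)O[CH2]2SEt: no ring; phosphorus (Up); P=S and the sulfide as Q;
# three P(OR) esters as F; eight carbons.
print(assemble_code("0", {"F": 3, "Q": 2, "Up": 1}, 8))
```

For the phosphorus example in the text this reproduces 0/UpQ2F3/8. The chemical judgment (which groups are present, which ring is senior, which phosphorus linkages are ignored) still has to be supplied by the coder; the sketch only orders and joins the symbols.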
In use, if details of a particular compound are required, the full coding is worked out and the record is found immediately. Homologs, differing only in the number of carbon atoms, will be found on each side of it. If more general information is required, or a coding is required for a wide group of compounds (e.g., patents), the coding is taken only far enough to cover the general class. For example, phenylpyrimidines are covered by Wl/T and under this heading will be found general references covering this series of compounds. Following it, in order of coding, will be found references where more detailed coding is possible, e.g., Wl/TJ for amino derivatives, and finally the individual compounds, e.g., pyrimethamine at Wl/TJ2CA/12. Because of the ease with which a particular compound can be located (seconds only) it is a simple matter to add further information to an existing sheet, or to see what analogs have been examined.

A mechanically sorted punched card installation which has been maintained for 14 years is so much slower than the master index that it is to be abandoned. The cards used were standard I.C.T. 65-column cards, with a column allotted to each item of the β-coding, and a group of columns provided for the α-coding.
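The two kinds of lookup in the master index (a full coding with its homologs alongside, and a truncated coding covering a general class) can be sketched in Python as follows. This is an illustration under stated assumptions, not a description of the manual files: the names `split_code`, `homologs` and `in_class` are hypothetical, the index is a plain list, and every entry in `INDEX` other than Wl/TJ2CA/12, Wu23/J/18 and 0/UpQ2F3/8 is invented for the example.

```python
def split_code(code):
    """Split 'alpha/beta/gamma' into (alpha, beta, carbon count)."""
    alpha, beta, gamma = code.split("/")
    return alpha, beta, int(gamma)


def homologs(code, index):
    """Entries with the same ring and group coding, ordered by number of carbons."""
    alpha, beta, _ = split_code(code)
    same = [c for c in index if split_code(c)[:2] == (alpha, beta)]
    return sorted(same, key=lambda c: split_code(c)[2])


def in_class(prefix, index):
    """Entries whose coding begins with a truncated coding such as 'Wl/T'."""
    return [c for c in index if c.startswith(prefix)]


# Hypothetical index; the /11, /13 and Wl/TJ/10 entries are made up.
INDEX = [
    "Wl/TJ2CA/13",
    "0/UpQ2F3/8",
    "Wl/TJ2CA/12",
    "Wl/TJ/10",
    "Wu23/J/18",
    "Wl/TJ2CA/11",
]

print(homologs("Wl/TJ2CA/12", INDEX))  # neighbours differing only in carbon count
print(in_class("Wl/T", INDEX))         # everything filed under phenylpyrimidines
```

A prefix match is used here because the text describes taking the coding "only far enough to cover the general class"; the actual filing order of the sheets is not reproduced.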
The punched-card system was of use for location of "junior" groupings but took a considerable time because all or nearly all the cards had to be searched, and when the search was done the numbers obtained had to be referred to the manual index to get the test results. The chief use made of the punched cards in recent years has been the production of a complete index each year (in order of coding) which was distributed to the various research laboratories for their information. The cessation of this index will be a considerable loss, but is being overcome by more efficient arrangements for answering questions from the manual index by telephone. Again, experience has shown that many people have preferred to ask the central index: it is less trouble to themselves, it is up-to-date, and it contains all the information.

This coding also is being used for a classified index to reactions. For this, six letters are used. For example, GaJq/Ja includes all those methods by which a carboxylic acid (Ga) is converted into an amide (Jq) by reaction with an amine (Ja). For this purpose A and Z letters as "operators" are used (Table III). Y letters are used for biological activities.
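The six-letter reaction coding can be taken apart in the same spirit. The Python sketch below is an assumption-laden illustration: the name `parse_reaction` is hypothetical, and the layout (starting group, product group, then a third two-letter symbol after the stroke) is inferred from the single example GaJq/Ja given in the text.

```python
import re

# Only the three symbols used in the worked example; meanings as given in the text.
GROUP_NAMES = {
    "Ga": "carboxylic acid",
    "Jq": "amide",
    "Ja": "amine",
}


def parse_reaction(code):
    """Split a coding such as 'GaJq/Ja' into its three two-letter symbols."""
    # Assumed layout: <starting group><product group>/<reagent or operator>.
    match = re.fullmatch(r"([A-Z][a-z])([A-Z][a-z])/([A-Z][a-z])", code)
    if match is None:
        raise ValueError(f"not a six-letter reaction coding: {code!r}")
    start, product, third = match.groups()
    return {"from": start, "to": product, "with": third}


parts = parse_reaction("GaJq/Ja")
print({role: GROUP_NAMES.get(symbol, symbol) for role, symbol in parts.items()})
```

The third position is assumed to accept an A or Z "operator" symbol in place of a reagent group, so a coding ending in, say, Aa would parse the same way; the source does not spell this out.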
TABLE III

Aa  Reduction          Am  Nitration          Za  Characterization
Ab  Dehalogenation     An  Nitrosation        Zb  Stability
Ac  Halogenation       Ap  Disproportion      Zc  Optical Resolution
Ad  Hydrolysis         Aq  Sulfur             Ze  Physical Properties
Ae  CO                 Ar  Sulfonation        Zm  Metabolism
Af  Oxidation          As  Desulfurization    Zn  Nomenclature
Ag  CO2                Au  Metals             Zs  Structure
Ai  Isomerization      Aw  Dehydration        Zx  Isolation
Aj  Ammonia            Ax  Polymerization     Zy  Availability
Al  HCN or (CN)2       Az  Degradation        Zz  Methods of synthesis

SYNTACTIC STUDY OF CHINESE AND ENGLISH
A comparative syntactic study of the Chinese and English languages will be undertaken with an NSF grant to the Ohio State University Research Foundation. The purpose of the study is to facilitate machine translation and information retrieval.

BRIDGMAN RESEARCH PAPERS
The collected research papers of the late Dr. Percy W. Bridgman will be published by Harvard University with the assistance of the National Science Foundation. The collection consists of about 200 papers to be published in seven volumes.