INFORMATION THEORY AND OTHER QUANTITATIVE FACTORS IN CODE DESIGN FOR DOCUMENT CARD SYSTEMS* By EUGENE GARFIELD, Director Institute for Scientific Information, 1122 Spring Garden Street, Philadelphia 23, Pa.

In the past ten years, the field of information retrieval has witnessed the development of many new systems, devices, and theories. In particular, two opposing "schools" of thought on card indexing systems have developed. One claims that the term card (unit term) or "collating" system is the most desirable. The other advocates the document card (unit record) or "scanning" system. Dr. Whaley has noted many of the advantages and disadvantages of collating and scanning systems, and I am glad to adopt his terminology and agree with most of his comments. For the record, however, I wish to remind the proponents of term card systems that theirs was no new finding. Costello says Batten anticipated Taube by 15 years. Batten was anticipated by at least another 35 years. One term card system began at the turn of the century at Johns Hopkins Hospital. Subsequently, it went through all the evolutionary stages which clearly demonstrate the inherent similarities between term card and document card systems. This does not mean that the rediscovery of the term card system was an insignificant development. After all, many useful ideas and inventions are rediscovered and we are grateful for these discoveries. However, when appropriate, our precursors ought to be given credit. Even the ten column posting card was anticipated by Paul Otlet, founder of the modern documentation movement.
Indeed, long ago, the term card system was used in several medical institutions, including Johns Hopkins Hospital and the Mayo Clinic. Texts on medical records management demonstrate such systems. These consist of one 3 x 5 card for each disease (term). Each card then lists the case history document numbers for all patients so diagnosed. Ultimately, the number of case history numbers grew larger and the time required to make any correlations between two diagnostic term cards increased to ridiculous, exponential proportions. Somewhere along the line it was decided that the document card system should be employed. At Johns Hopkins and Mayo, Hollerith cards were in use as early as the 1920's. The School of Public Health at Johns Hopkins was one of the earliest users of punched-card machines. Their equipment is still of early vintage. At Johns Hopkins, even the IBM card finally became a problem as the volume of patients grew into the hundreds of thousands. The "vicious circle" was continued when it was decided to use duplicate sets of cards, i.e., rotated files, not unlike the system used at the Chemical-Biological Coordination Center (CBCC) several years ago. Finally, this semi-collating, semi-scanning system was abandoned because of the high cost of storing millions of cards. The entire file was tabulated on printed sheets and the punched-cards thrown out. This printed index arrangement is very similar to the original term card arrangement. However, in a separate section, the equivalent of the document card is also printed. Thus, one is able to do a search by both methods. Depending upon the individual search either one or both may be used. Pre-coordinations were made where appropriate before printing the index.
The Mayo Clinic long ago attacked the space problem in another fashion. The storage density of the IBM card was increased by a system of binary coding. These IBM methods, I believe, are still used there. The binary coding utilizes all of the 4,096 combinations possible in a 12 position punched-card column. It is understandable that a group of statisticians would discover this method. After all, statisticians work with probability data constantly. However, it is interesting that many people, including the statisticians, have been clever in finding ways of increasing the number of codes that can be crammed on a card (Wise, Mooers, etc.). However, the problem of how many times each was used was not considered as important. This aspect first troubled me while working with the IBM 101 at the Welch Medical Library Indexing Project.
Some readers may recall the experimental 101 system we demonstrated in 1953 using five digit decimal codes, randomly strung along the first sixty columns of an IBM card. For each subject heading or descriptor there was one five digit decimal number. Each card contained 12 such numbers. The details are described in the final report of the project. To use the same code length for all descriptors regardless of their frequency was rather inefficient in terms of space utilization, input time and searching cost. Obviously, others have arrived at similar conclusions because their coding systems intuitively employ a statistical approach. It is surprising, however, how many extant systems still do not make provisions for "normal distribution." A good example is the CBCC system, and the same is true of Uniterm, Zatocoding and others. To reiterate: they all use the same amount of coding space for each descriptor, regardless of its frequency of use. Working with the CBCC system, and utilizing Heumann's statistical data on about 25,000 chemical compounds coded with this system, it was possible to design a code which reduced significantly card space and the time and cost of searching.

*Presented at the American Documentation Institute Annual Meeting, October 22, 1959, Lehigh University, Bethlehem, Pa.

For the moment it is sufficient to state briefly that the statistical information available on the CBCC file was used to construct a normal distribution curve giving the frequency of use of each alpha-numerical code. One then arbitrarily breaks into the frequency curves in various sections to determine the space allocations for the descriptors. If a descriptor, such as benzene, occurs in half the chemicals and the code for uranium occurs rarely, why devote the same amount of space to both? Obviously, as Wiswesser, Steidle and many others have found, it is quite sufficient to assign permanent card locations to frequently occurring codes. On the other hand, descriptors which occur infrequently can be assigned some coding configuration which requires, relatively, a great deal of card space. This will be of little consequence since it will crop up so rarely. These "rare" birds are treated as a class and codes are used that permit many combinations in a larger space. The Mayo system is one example; another is the Zator system, as applied by Schultz. Indeed, one of the primary shortcomings of Mooers' Zator system is the indiscriminate, i.e., random assignment of an equal number of code symbols regardless of actual occurrence in the file.
This results in excess noise, i.e., false drops. Incidentally, I wish to point out that I am well aware of Mooers' early attempt in American Documentation to set Wise straight on the folly of a superimposed coding scheme for the now defunct Rapid Selector. However, to use probability theory is one thing; information theory is something else. We all readily can visualize methods of utilizing card space that will grossly take advantage of the facts revealed by a statistical analysis of the use made of a particular descriptor dictionary or subject heading list. The theoretician, however, wants precise quantitative criteria for allocating code space to individual descriptors or groups of descriptors. Here is where Information Theory comes to the rescue. The design of the most efficient coding system does not depend upon the meaning of terms. The terms, by themselves, have no informational value. Rather, it is the frequency of use of a particular descriptor which determines its informational content. One can only measure the amount of information in the word benzene when transmitting it in English text. As a code or term in a document collection dictionary, the word has no value. It is only significant in so far as it occurs with a particular frequency.
If half of the chemicals coded contain benzene then the knowledge that a particular chemical contains benzene reduces the remaining choices to one half. Having cleared the cobwebs on what the real "coding" problem is in documentation systems it is then relatively simple to apply Shannon's basic formula for measuring informational content. I might mention that it is difficult, at first, to think of the card searching problem as a transmission problem. However, if you think in terms of magnetic tape systems (Univac) or paper tape systems such as the Western Reserve Scanner, it is easier to see an analogy between "transmission" and searching. The information content of a document file is neither the number of descriptors used, nor the number of documents which the various combinations of descriptors constitute. The information content of a document collection is a function of the probabilities of the descriptors in the dictionary. H, the familiar thermodynamic entropy function, and Shannon's measure of information, is equal to the negative of the sum of the individual probabilities multiplied by the logarithm of the individual probabilities, i.e.,
$$H = -(p_1 \log p_1 + p_2 \log p_2 + \cdots + p_n \log p_n).$$
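As a small worked illustration of the quantities entering this formula (an added sketch, not part of the original argument), consider the benzene example: learning that a chemical carries a descriptor of probability p narrows the remaining choices by the factor p, which corresponds to -log2 p bits. The rare-descriptor probability of 1/1024 below is a hypothetical figure chosen only to keep the arithmetic simple, and base-2 logarithms are assumed so that the unit is the bit.

```latex
\begin{align*}
-\log_2 \tfrac{1}{2}    &= 1 \text{ bit}   && \text{(benzene, } p = 1/2\text{)} \\
-\log_2 \tfrac{1}{1024} &= 10 \text{ bits} && \text{(hypothetical rare descriptor, } p = 1/1024\text{)}
\end{align*}
```

On this reading, the frequent descriptor tells the searcher little each time it appears, while the rare one tells a great deal but appears seldom; H weights each of these amounts by its probability.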

From this we are able to draw many interesting conclusions. For example, a document collection of 1,000 documents may contain no more information than a document collection of one million documents. This fact accounts for the intuitive decision of the Patent Office to use a "composited" card, which in certain cases is quite justifiable. It also can be shown that the informational equality in two such files can be changed readily if the depth of indexing is altered. Indeed, if the informational content remains constant during such a growth one must either conclude that unnecessary cards remain in the file, new sub-dividing terms are required, or noise is present during a search. This situation is illustrated perfectly by our experience in coding steroid chemicals using the Patent Office code. In many instances a dozen different steroids were coded exactly alike. If the code dictionary is not changed, it is properly concluded that it is more economical to "composite" the 12 cards into one. However, one could increase the specificity of the coding. From the point of view of the Patent Office, with emphasis on the generic approach, the former conclusion, compositing, may appear simplest. From the point of view of the research chemist the latter approach, more specificity in coding, is more desirable.
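The first conclusion above, that a small file may carry as much information as a very large one, can be checked numerically. The following is a minimal sketch under assumed data: the descriptor counts are hypothetical, and `entropy_bits` is a helper name introduced here for illustration only. It computes H from usage counts and shows that multiplying every count by the same factor leaves H unchanged, because only the relative frequencies enter the formula.

```python
from math import log2


def entropy_bits(counts):
    """Shannon's H in bits, computed from descriptor usage counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


# Hypothetical usage counts for four descriptors in a file of 1,000 codes.
small_file = [400, 300, 200, 100]
# The same relative frequencies in a file one thousand times larger.
large_file = [c * 1000 for c in small_file]

h_small = entropy_bits(small_file)
h_large = entropy_bits(large_file)

# Both values agree: growth in file size alone adds no information per code.
print(h_small, h_large)
```

If the larger file were indexed in greater depth, new descriptors would enter the dictionary, the probabilities would shift, and the two values would no longer agree.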
Taube's paper at the ICSI Conference implies that a term card system for the same steroid file could be used as readily as the Patent Office document card system. This has a theoretical validity in view of the fact that in both systems no attention whatsoever is devoted to the frequency of occurrence of the various codes. (The Patent Office uses one punched hole position for each descriptor and the Uniterm system uses a 4 digit document number for each descriptor.) Indeed, from a tabulation of the coding done by the Patent Office of over 2,500 U.S. patents, involving about 35,000 codes, it is no coincidence to find that seven descriptors account for over 9,200 codes, 16 additional account for another 9,100, the next 52 another 9,400 and all the remaining 359 descriptors 6,800. Deciding the relative merits of working with a term card involving 1,500 document numbers (the highest frequency code) or the time to run 2,500 cards through a machine with a speed varying (according to price) from 500 to 2,000 cards per minute is meaningless. This becomes particularly ludicrous if one then considers the time required to find those chemicals containing both a 3-Hydroxy steroid code and a 17-Hydroxy steroid code, which occurs with almost equal frequency (1,200 occurrences). Instead of matching numbers on Uniterm cards by eye, one can speed this up by "collating" on an IBM machine at speeds comparable to the sorting operation. Using a Ramac system or a high speed computer this can be speeded further. The point is that each system, according to the circumstances, has advantages and for this reason, in certain cases, I have used a combination of both, even going so far as to maintain two independent systems. This is commonly done, but not admitted, in many installations.
Returning to the discussion of the now measurable quantity of an information file, to explain how this measure of information is determined and used, I must resort to basic Information Theory. For that I have paraphrased Shannon's own words, to which I refer those who are not yet familiar with Information Theory.
Information theory is concerned with the discovery of mathematical laws governing systems designed to communicate or manipulate information. It sets up quantitative measures of information and the capacity to transmit, store and process information. Information is interpreted to include the messages occurring in standard communication media, computers, and even the nerve networks of animals. The signals or messages need not be meaningful in any ordinary sense. Information Theory is quite different from classical communication engineering theory, which deals with the devices used, not with that which is communicated. I submit that most of the polemics concerning devices, i.e., term card vs. document card systems, have kept us in the dark ages of conventional engineering theory. Relatively speaking, we have paid little attention to the nature of the information itself. This led to the failure to design really efficient searching devices; anyone who rents an IBM machine knows this.
The measure of information, H, is important because it determines the saving in transmission time that is possible, by proper encoding, due to the statistics of the message source. Consider a model language in which there are only four letters, A, B, C, and D. These letters have the probabilities 1/2, 1/4, 1/8 and 1/8. In a long text, A will occur 1/2 the time, B one quarter, and C and D each 1/8.
Suppose this language is to be encoded into binary digits, 0 or 1, as in a pulse system with two types of pulse. The most direct code is: A equal 00, B equal 01, C equal 10, and D equal 11. This code requires 2 binary digits per letter. However, a better code can be constructed, with A equal 0, B equal 10, C equal 110 and D equal 111. The number of binary digits used in this code is smaller on the average. It will equal 1/2 (1) + 1/4 (2) + 1/8 (3) + 1/8 (3) = 1 3/4, where the first term derives from letter A, the second from B, etc. This is just the value of H found if the probability functions are calculated. The result verified for this special case holds generally: if the information rate of the message is H bits per letter, it is possible to encode it into binary digits using, on the average, only H binary digits per letter of text. There is no method of encoding which uses less than this amount if the original message is to be recovered without noise. An average of 1 1/4 bits is possible if the message is allowed to be noisy, i.e., not a completely faithful rendition of the original message.
Before we can consider how information is to be measured it is necessary to clarify the precise meaning of "Information" to the communication engineer. In general, messages to be transmitted have "meaning," but this has no bearing on the problem of transmitting the information. It is as difficult to transmit nonsense words or syllables as meaningful text (more so in fact). The significant point is that one particular message is chosen from a set of possible messages.
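The four-letter example above can be verified with a few lines of code. This is a minimal sketch: the probabilities and both codes are taken from the text, while the function names, the `decode` helper, and the sample message `ABACADAB` are introduced here purely for illustration. It compares the average length of the two codes with H and confirms that the shorter code can still be decoded without ambiguity.

```python
from math import log2

# Letter probabilities and the two codes given in the text.
probs = {"A": 1 / 2, "B": 1 / 4, "C": 1 / 8, "D": 1 / 8}
fixed_code = {"A": "00", "B": "01", "C": "10", "D": "11"}
variable_code = {"A": "0", "B": "10", "C": "110", "D": "111"}


def average_length(code, probs):
    """Expected number of binary digits per letter."""
    return sum(probs[s] * len(code[s]) for s in probs)


def entropy_bits(probs):
    """Shannon's H in bits per letter."""
    return -sum(p * log2(p) for p in probs.values() if p > 0)


def decode(bits, code):
    """Decode a bit string by matching codewords from left to right.

    This works because no codeword is the beginning of another one.
    """
    inverse = {word: symbol for symbol, word in code.items()}
    letters, word = [], ""
    for bit in bits:
        word += bit
        if word in inverse:
            letters.append(inverse[word])
            word = ""
    return "".join(letters)


message = "ABACADAB"  # illustrative sample text
encoded = "".join(variable_code[s] for s in message)

print(average_length(fixed_code, probs))          # 2.0
print(average_length(variable_code, probs))       # 1.75
print(entropy_bits(probs))                        # 1.75
print(decode(encoded, variable_code) == message)  # True
```

The variable-length code reaches H exactly here because every probability is a power of 1/2; for other distributions the average length of the best code of this kind generally lies somewhat above H.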
What must be transmitted is a specification of the particular message chosen by the information source. The original message can be reconstructed at the receiving point only if such an unambiguous specification is transmitted. Thus "information" is associated with the notion of a choice from a set of possibilities. Furthermore, these choices occur with certain probabilities; some messages are more frequent than others. The simplest type of choice is from two possibilities, each with probability 1/2, as when a coin is tossed. It is convenient, but not necessary, to use as the basic unit the binary digit or bit. If there are $n$ possibilities, all equally likely, the amount of information is given by $\log_2 n$. If the probabilities are not equal, the formula is more complicated. When the choices have probabilities $p_1, p_2, \ldots, p_n$, the amount of information H is given by the equation above. An information source produces a message which consists not of a single choice but of a sequence of choices, for example, the letters of a printed text or the elementary words or sounds of speech. In these cases, by an application of a generalized formula for H, the rate of production of information can be calculated. This "information" rate for English text is roughly one bit per letter, when statistical structure out to sentence length is considered (see Bell System Tech. J., October 1949, or the "Encyclopedia Britannica" article on Information Theory).


The problem of applying information theory to documentation, I believe, is to be solved in properly defining the information source, which is the totality of descriptors assigned in any file. The next problem is defining the language units, i.e., the descriptors and/or their components. A classification number, e.g., has built into it much more information than a Uniterm. Each facet of the class number must be taken into consideration when measuring the information content of a classification system. It is then necessary to determine the probabilities of the units involved.
I will further hazard the statement that in the design of a document card of the IBM type the most efficient space utilization will be obtained when the informational content of all card fields approaches equality. For example, in the case of the steroid file mentioned above, a card of four basic fields could be designed in which about 25% of the information was contained in each. The first "field" would consist of one column of 12 punches. The twelve most frequently occurring codes would be assigned one to each of the twelve locations. The next eighteen codes would be accommodated in another column divided into six sections, each of which could accommodate three different mutually exclusive codes. You cannot have a steroid which is both an 11-keto and an 11-hydroxy compound.
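The capacities quoted for these fields, and for the fields described next, follow from simple counting on a 12 position column. The sketch below is an illustration, not the author's procedure: the function names are introduced here, and it assumes that a section of k punching positions holds one of its 2^k - 1 non-empty punch patterns, which is one reading of "three different mutually exclusive codes" in a two-position section.

```python
from math import comb

COLUMN_POSITIONS = 12  # punching positions in one IBM card column


def direct_codes(positions=COLUMN_POSITIONS):
    """One punching position reserved for each descriptor."""
    return positions


def sectioned_codes(sections, positions=COLUMN_POSITIONS):
    """Column split into equal sections of mutually exclusive codes.

    Assumption: each section records exactly one of its non-empty
    punch patterns, so a section of k positions offers 2**k - 1 codes.
    """
    if positions % sections:
        raise ValueError("sections must divide the column evenly")
    per_section = positions // sections
    return sections * (2 ** per_section - 1)


def fixed_hole_codes(holes, positions=COLUMN_POSITIONS):
    """Patterns that punch exactly `holes` of the available positions."""
    return comb(positions, holes)


print(direct_codes())            # 12 most frequent codes, one per position
print(sectioned_codes(6))        # 18 codes: six sections of three codes each
print(sectioned_codes(4))        # 28 codes: four sections of three punches
print(fixed_hole_codes(4))       # 495 four-hole patterns in one column
print(2 ** COLUMN_POSITIONS)     # 4096 punch patterns of any kind
```

The trade-off is visible in the numbers: direct positions cost a full hole per descriptor but allow any combination on one card, whereas the 495 four-hole patterns give ample room for the 359 rarely used descriptors at the cost of one column per code recorded.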
In actual punched-card application I suspect that one would continue to use the first five columns, at least, for direct codes covering the first 60 most frequently occurring descriptors. If not, another field could be used to accommodate the next 28 codes, dividing one or more columns into 4 sections, each containing 3 punches. To accommodate the remaining 359 codes in one field would be quite simple by using all the 495 combinations (binary) of four hole punching patterns possible. The number of columns in the field would depend upon the average number of such codes possible in a single compound. Specific characteristics of existing equipment may modify this decision. The preceding example of applying measures of information content to the design of an IBM card has been very brief and may not be entirely clear to those not familiar with IBM machines. It is important, at this point, to make clear the similarity between this simple code for an IBM card and a similar code that can be used for a variety of document card or scanning card systems.
Let us take up a brief discussion of the qualitative aspects of document card systems, particularly as they relate to coding. By document card systems, as contrasted to term card systems, we mean systems wherein all descriptors, or codes for descriptors, are retained together in the particular storage medium involved.
Thus, in a punched-card document card system, e.g., McBee, E-Z Sort, IBM, Remington Rand, Underwood-Samas, etc., the holes or perforations are used to encode descriptors assigned to individual documents. In a limited sense, the card is the document. Indeed, if the coding were sufficiently elaborate and detailed the card could be the document. The original Luhn Scanner employed an IBM card in which semantically factored words were stretched across the card to form an encoded telegraphic style message. The IBM card employed was the standard 80 column card with a total of 960 punching positions.
Punched card document card systems have their counterparts in film (Filmorex and Minicard) where again all the descriptor codes are assembled together on a single piece of unitized film. The coding patterns may or may not be exactly of the type found on punched-cards. However, black or white spots correspond to perforations or the lack of perforations. The film-card (microfiche) may also contain a microimage of the original document. Similarly, an IBM card could contain the same microimage in a microfilm insert (Filmsort). Similarly, the Magnacard is the magnetic analog of a punched card. In this case information is coded as magnetized spots on magnetic tape. The unit-card characteristic common to punched-cards, film cards, and magnetic cards is not only found in document-card systems.
The same information found on Magnacards can be stored on continuous magnetic tape. This is done on Univac and the IBM 700 series computers. The mechanisms employed to scan the "card" (sections of tape) are naturally somewhat different. Similarly, the defunct Rapid Selector was a continuous series of Filmorex cards strung out on one reel of film. In the Benson-Lehner Flip system, the Rapid Selector system is partially revived. A compromise between Filmorex and the Rapid Selector was suggested in the AMFIS system by Avakian. The serial counterpart of perforated cards can be found in the Flexowriter tape used at Western Reserve where each document is represented by a series of codes exactly as in the fashion of the Luhn scanner. This is no different from teletype tape except for the number of channels involved and the selector circuitry.
The Zator card is another version of the punched card. The coding method employed has no basic dependence upon the card. It can be used with any type of document card system. Superimposition of codes is employed to make more efficient use of space. I mentioned earlier some of the limitations of Zator coding theory.
There are, obviously, many factors to consider in evaluating document card systems. Cost is one factor, but I believe its relative importance has been overly stressed by Taube and others.
Document card systems are not inherently expensive, e.g., small collections of manual punched-cards. Dr. Whaley has covered more than adequately many other factors which may favor the document-card or scanning card system. He particularly stressed the need, sometimes, to retain relationships between various descriptors. He did not stress adequately the advantages in terms of input convenience and cost, where it is equally advantageous to keep codes together. Preparing a single IBM card is simpler than posting a dozen or more document numbers to individual term cards. It is also simpler than duplicating the same card a dozen times, with the copies filed in twelve different file locations. At the present time, punching a really efficient IBM card is difficult because the IBM machines are not designed for retrieval purposes exclusively. However, in my own experience, preparing elaborately punched cards is not an insurmountable obstacle. Key-punching costs are not considered major problems when a file is used repeatedly. Another factor to consider is searching time for large files. This can be cut down by converting to speedier machines, if time is a problem.
The major criticism of existing document-card systems is the need to operate in a "scanning" sense, i.e., each card or each unit of tape or file must physically pass by a scanning unit. When there are large volumes of records involved very high speeds may be required.
This is not only costly, but it will be obvious that there is a limit to the speeds we can reach in mechanically transporting cards, film, etc. It is phenomenal how fast some sorting and scanning devices do work, and possibly these speeds will satisfy most requirements for a long time. However, these speeds are generally available only at a relatively high price. IBM machine rentals are higher in proportion to the speed at which they work, presumably because of greater maintenance and engineering cost. IBM tabulator rentals also vary according to the speed at which they are operated.

An ideal document card system would be one in which the basic advantages are retained: unit record input and storage, logical capabilities, etc. However, one would like to eliminate the need to scan the entire document file in a physical sense, i.e., by passing cards through a sorter, running magnetic tape past a reading head, or running film by a photoelectric cell. I believe such a system is possible, and required, particularly if we are to achieve the ultimate in access time. Such a system would be a truly random access system and not a term card system using so-called random access. Systems such as RAMAC or AMFIS do not appear to be as energy consuming as high speed tape readers or sorters on punched cards, but their mechanical characteristics would seem to be limiting.
It is comparable to solving the problem of sorting at high speeds by using a dozen sorters all at once. Similarly, to use the equivalent of a dozen magnetic tape readers is no fundamental solution.


In the ideal, the file will remain completely stationary and the scanning mechanism will be able to identify the existence of desired codes by scanning in a non-mechanical fashion. An approach in this direction is seen in the Bell Telephone system of routing long distance calls by use of special punched cards. Verner W. Clapp once asked me why you couldn't wave a flashlight at a file and have it throw out the answers. This is not impossible. I have been exploring a similar principle utilizing electromagnetic phenomena which I have called Radio Retrieval.

In conclusion, I have tried to show the fundamental similarities between so-called term card and document card systems by tracing the cyclical evolution of a term card system into a document card system, then into a semi-document card system employing collating methods, and finally back to a term card printed index arrangement. I maintain that the differences between term and document card systems are basically illusory. You will find vigorous proponents for each system depending upon the circumstances. If one had no indexing system at all in the first place, any system is an improvement. Once a system is adopted, thereby improving access to documents, a proposal to merely change the mechanics will not usually excite people.

An area of research which requires more fundamental work is in coding.
No matter what system is used, the same amount of information is produced if one uses the same code dictionary and code frequencies. The Patent Office Steroid Code would be, theoretically, as efficient with a term card system as in its present document card system. From a practical point of view, it would not. Using Information Theory, the coding space required in a document card system can be reduced considerably. Similar efficiencies may be possible in designing term card systems, but these are not yet apparent and may be difficult to find. In other words, term card systems are inherently inefficient because they seemingly cannot take advantage of the variations in code frequencies which are inherent to information systems.

According to Keckley, "there is a central tendency for 90% of the activity to be concentrated within 25% of the classifications." This appears to be well substantiated in the coding of 2,500 steroid patents and, independently, the coding of 8,500 steroid compounds from the literature. Furthermore, term card space requirements may increase exponentially as the size of the collection grows. A collection of 1,000 documents requires less than 7 bits per descriptor assignment. A collection of 10,000 requires about 12 bits per descriptor assignment, one of 100,000 requires 16 bits, and one of 1,000,000 requires 20 bits.
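The effect of unequal code frequencies on coding space can be illustrated with a minimal sketch. It assumes a hypothetical dictionary of 1,000 codes in which 25% of the codes carry 90% of the usage, in the spirit of Keckley's observation, and it assumes usage is spread evenly within each of the two groups. The function names and the dictionary size are illustrative only and do not come from the Patent Office data.

```python
import math


def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def skewed_distribution(n_codes, top_share=0.90, top_fraction=0.25):
    """Hypothetical usage distribution: `top_fraction` of the codes carry
    `top_share` of all assignments, spread evenly within each group."""
    if n_codes < 2:
        raise ValueError("need at least two codes")
    n_top = max(1, round(n_codes * top_fraction))
    n_rest = n_codes - n_top
    if n_rest < 1:
        raise ValueError("top_fraction leaves no codes in the second group")
    probs = [top_share / n_top] * n_top
    probs += [(1.0 - top_share) / n_rest] * n_rest
    return probs


n_codes = 1000  # assumed dictionary size, for illustration only
probs = skewed_distribution(n_codes)

# Bits per assignment if every code is given the same length.
fixed_bits = math.log2(n_codes)
# Lower bound on average bits per assignment if code length follows frequency.
average_bits = entropy_bits(probs)

print(f"fixed-length code:    {fixed_bits:.2f} bits per assignment")
print(f"frequency-based code: {average_bits:.2f} bits per assignment")
```

Because usage is spread evenly within each group, the sketch understates the skew of a real code. The point is only that the average falls below the fixed-length figure as soon as frequencies are unequal.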


FACTORS INFLUENCING CODE DESIGN FOR DOCUMENT CARD SYSTEMS

Mooers deserves credit for recognizing the value of Information Theory for retrieval theory. However, it is just as inefficient to use five punched holes for every descriptor on a document card as it is to use a five digit document number on a term card. By proper application of descriptor probabilities, Information Theory can make Zatocoding even more powerful. It has been shown that one can quantitatively measure the amount of information in a document collection by the Shannon formula
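Assuming the formula meant here is Shannon's standard entropy measure, it can be written as below. The symbols are assumptions introduced for illustration: n is the number of distinct descriptors in the code dictionary, p_i is the relative frequency with which descriptor i is assigned, and H is the average information per descriptor assignment, in bits.

```latex
H = -\sum_{i=1}^{n} p_i \log_2 p_i
```

Only the relative frequencies p_i enter the expression. The number of documents in the collection does not appear in it.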

As a result of this expression, it is concluded that the size of a document collection is no realistic measure of its "information content." Indeed, two collections of entirely different size contain the "same" information if they use exactly the same code or dictionary with the same percentage distribution of descriptors. Thus, in this sense the Library of Congress Subject Catalog contains no more information than the local Public Library Catalog. This may sound startling or ridiculous to librarians. However, as long as the local Library uses the LC Subject Heading Authority List, it may even contain more information because it may add further refinements to the existing LC dictionary or use it with varying frequency assignments. A special library is of more use to its clientele than is the Library of Congress. To alter the information content of a collection one must index in greater depth, not index more documents. This point is most important in industry.

Analysis of the Patent Office steroid code frequencies illustrates in a simple case how Information Theory may be put to use. A brief summary and review of Shannon's Information Theory has been presented to show that the past preoccupation of documentalists with devices is comparable to the earlier preoccupation of communication engineers with machines rather than the information they were transmitting. The main problem in applying information theory in documentation is in defining the "information source" and the "channel." A completely successful retrieval system must combine the advantages of both term and document card systems in such a way that all inertial characteristics of existing systems are removed.
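The claim that collection size does not change information content, while deeper indexing does, can be checked with a small sketch. It assumes that "information content" means the Shannon entropy of the descriptor frequency distribution. The descriptor names and counts are hypothetical and are not taken from any catalog discussed here.

```python
import math
from collections import Counter


def descriptor_entropy(assignments):
    """Average information per descriptor assignment, in bits, computed
    from the relative frequency of each descriptor."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical small catalog: ten descriptor assignments.
local_catalog = ["steroid"] * 5 + ["ketone"] * 3 + ["ester"] * 2

# Same dictionary and same percentage distribution, 1,000 times larger.
national_catalog = local_catalog * 1000

# Same size as the small catalog, but indexed in greater depth:
# the "steroid" heading is refined into two narrower headings.
deeper_catalog = (
    ["steroid, class A"] * 3
    + ["steroid, class B"] * 2
    + ["ketone"] * 3
    + ["ester"] * 2
)

h_local = descriptor_entropy(local_catalog)
h_national = descriptor_entropy(national_catalog)
h_deeper = descriptor_entropy(deeper_catalog)

print(f"local catalog:    {h_local:.3f} bits per assignment")
print(f"national catalog: {h_national:.3f} bits per assignment")
print(f"deeper indexing:  {h_deeper:.3f} bits per assignment")
```

Under these assumptions the small and large catalogs give the same figure, and only the refined dictionary raises it.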
