The Indexing Problem

THE INDEXING PROBLEM*
BY CHARLES L. BERNIER
Chemical Abstracts Service, Ohio State University, Columbus, Ohio

first shelf of the first row of shelves, we wonder by exactly what clues the needed information will be detected. We turn through the issue page by page, reading every word. Several days later we find, after careful experiment, that it is possible to scan pages and yet discover significant information without missing any. We have already found a few bits of information and have jotted them and their references onto a pad. We have also taken precautions to keep track of the issues, journals, shelves, and rows that we have already examined.

As the first week comes to a close, we wonder if it would not be possible at times to pass by entire journals or whole issues after merely examining the journal titles or tables of contents. We discover that none of the information we select comes from pages of advertising and that, from the nature of our search, these pages will likely always prove to be barren.

During the second week, before scanning the pages of papers we try to predict, from journal titles and tables of contents, those papers that will turn out to be fruitful. We carefully keep a record of how well our predictions have turned out. From the success of our predictions during this and following weeks, we gradually gain confidence in our ability to exclude whole issues and even some journals. This discovery makes the work go much more rapidly since we can lay aside issues with only a glance or merely a rapid reading of the tables of contents.
Being of a cautious nature, we check further from time to time, using a random sample of those issues that we have predicted to be barren, just to make sure that we are not missing information. The discovery that we can predict some barren issues and titles and can omit pages of advertising reduces our scanning time to perhaps ten per cent of what it was before.

As weeks pass, and our mind wanders momentarily from the task, we hypothesize that since our response to material rejected and selected is triggered solely by symbols (usually words) on paper, it might have been possible to have preselected these symbols, and thus to have saved ourselves much of the time taken to examine all periodicals, one by one.
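The cautious spot check described above, a random sample drawn from the issues we predicted to be barren, can be pictured in a few lines of Python. This is only a sketch under stated assumptions: the issue labels, the function name `audit_barren`, and the stand-in test for "actually fruitful" are hypothetical and are not part of the original account.

```python
import random

def audit_barren(predicted_barren, is_actually_fruitful, sample_size, seed=0):
    """Spot-check a random sample of issues that were predicted to be barren.

    Returns the sampled issues that turned out to hold relevant information.
    """
    rng = random.Random(seed)  # fixed seed so the same audit can be repeated
    k = min(sample_size, len(predicted_barren))
    sample = rng.sample(predicted_barren, k)
    return [issue for issue in sample if is_actually_fruitful(issue)]

# Hypothetical data: twenty skipped issues, and a stand-in for reading an
# issue in full (here, only issue 7 is assumed to contain something useful).
skipped = [f"Journal A, issue {n}" for n in range(1, 21)]
misses = audit_barren(skipped, lambda issue: issue.endswith("issue 7"), sample_size=5)
```

Any issue that appears in `misses` is a signal that our predictions of barrenness are too aggressive and that the exclusion rule should be reconsidered.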

Electronic equipment that very rapidly performs programmed mathematical and other operations has stimulated much thinking as to how it can be used for the storage and retrieval of verbal (not oral) information. During the last twenty years subject indexing has been "rediscovered" by many whose training has been largely in fields other than documentation and librarianship. Documentalists have been re-examining their methods. During this period of awakened interest in dealing with information, ways of keying information have been studied again. An example is the correlative index, which facilitates a selection of documents by the correlation of two or more terms and is said to date back to the time of cuneiform writing. Chemical Abstracts pioneered in the indexing of chemical information and has grown in this direction by continually re-examining its methods and procedures.

The study of electronic and other equipment sometimes has proved to be a distraction from investigation of the fundamentals of the indexing problem. These fundamentals are important no matter how machines may enter into the picture.

Sometimes it helps in studying a problem to start with a very simple model. Let us start our study of the indexing problem by imagining a rudimentary library of unclassified periodicals. The library has no subject catalog or similar retrieval device. We can assume, to make the picture more rational, that there is no librarian.
Let us say that we are assigned to the task of finding all information about a given subject in the stacks of periodicals piled in order of acquisition on the shelves. Let us further assume that we have no knowledge of library techniques and no way of knowing at what date the desired information may have been published. Our naivete saves us from being disheartened or resentful. So we roll up our mental and physical sleeves, provide ourselves with paper, ballpoint pen, table, and chair. Resolutely we start. As we examine the first issue of the first journal from the first pile of periodicals on the

*Gordon Conference, New Hampton, N. H., July, 1961.


In order to test this hypothesis, we experiment by predicting terms in titles and tables of contents that led to rejection of barren periodicals and papers. As our list of these rejection terms grows longer day after day, we slowly come to see that our list will eventually include an unabridged dictionary of words, and nearly all new words that have come into the language. We decide to abandon the use of a comprehensive rejection list; its use will be impractical because of (1) the time required to compile it, (2) the time needed to consult it, and (3) the lack of completeness at the moment we need it; there will always be new words coming in so frequently that the list will need revision every day or oftener.

The use of a list of terms for selection rather than rejection of documents comes to seem more promising as we work along and thoughtfully mull over the problem. As the next experiment, we try predicting and recording terms in titles and tables of contents that are actually found to lead us to select fruitful periodicals and papers. Our success is immediate and heartening. We find that predicted title terms of papers lead to selection of about 21% of the papers that actually contain information related to our search. The reason why more than 21% of relevant papers cannot be predicted by terms in titles of papers and periodicals is mainly that the titles do not contain terms that we are using for selection. About 66% of the papers we select are chosen by terms (not necessarily those on our list) that we find in the bodies of papers rather than in their titles.
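Selection by predicted title terms, as just described, amounts to asking whether a title contains at least one term from our list. The following Python sketch illustrates that idea only; the titles, the selection terms, and the function names are hypothetical, and whole-word matching is an assumption of the sketch rather than anything stated in the account above.

```python
import re

def words(text):
    """Lower-case word tokens of a title."""
    return set(re.findall(r"[a-z]+", text.lower()))

def select_by_title(titles, selection_terms):
    """Keep titles that contain at least one selection term as a whole word."""
    terms = {t.lower() for t in selection_terms}
    return [title for title in titles if words(title) & terms]

# Hypothetical titles and selection list.
titles = [
    "Catalysis of ester hydrolysis",
    "Annual report of the society",
    "Kinetics of a reaction in solution",
]
selected = select_by_title(titles, ["hydrolysis", "kinetics"])
```

A paper whose title carries none of the listed terms is passed over by such a test, which is the limitation noted above: relevant papers are missed mainly because their titles do not contain the terms we are using for selection.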
An analysis of the terms in the experimental list we use for selection shows that about 64% of them are identical with the terms actually used in the papers. About 17% are synonymous, and about 18% are more generic, more specific, or merely suggestive, in ill-defined and vague ways, of terms actually in the papers.

As progress is made in preparing a list of terms for use in selection of periodicals and papers, our statistics show that we are selecting about 18% of documents, as just mentioned, not on the basis of our list of terms, nor sometimes on the basis of anything that we have been able to predict, but on the basis of vague and often poorly defined associations in our minds between the problem that started the search and the contents of these documents. As we select a document that falls into this 18% category, we may say to ourselves, "The terms I am using for selection are not in the paper, but it is about the subject," or "I have a hunch that this paper will be useful," or "This paper seems to be related to the problem in its larger aspects, but I can't tell exactly how," or "If the solution to the problem takes a certain turn, then this document must be used," or "This paper deals with the solution to an analogous problem (and don't ask me to define 'analogous')."

In spite of the failures of our listed terms in about 36% of the cases, the success is so obvious that we are encouraged to pursue this line of thought further.

Titles of journals are often seen to have terms more generic than those used in the selection of papers. This gives us a clue as to a way of organizing our list of terms. We reason that there should be fewer of these generic terms than specific searching terms. An examination of our list shows this to be true. Thus, one generic term can nearly always be used to stand for several, or many, of the more specific search terms. Here is the way that we can organize our thinking and our list. The more specific terms can be arranged under the more general. Thus, we rediscover hierarchical classification. Also we find that the more general terms are more useful in selecting by title.

If the terms on our selection list of terms had been applied to titles and bodies of papers before we made our search, if the journals had been arranged by title or subject on the shelves, and if all documents had been related in a list of references to each term by means of abbreviations for the journal name, volume number, serial number, issue number, and pages covering the paper, then our present task would have been speeded beyond imagination for about 60% of the significant documents that we actually located. The months of work that we expended in examining every periodical would have been, in fact, reduced to using the list of references related to our problem in selecting periodicals and papers from the stacks and to copying the information found there.
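The two ideas in the preceding paragraphs, specific terms arranged under more general ones and a list of references attached to each term, can be pictured together in a short Python sketch. Everything named here is an assumption made for illustration: the terms, the journal abbreviations, the function names, and the choice to record only journal, volume, issue, and pages.

```python
from collections import defaultdict

# Hypothetical hierarchy: each generic term stands for several specific terms.
hierarchy = {
    "halogens": ["chlorine", "bromine"],
    "solvents": ["benzene", "ethanol"],
}

# Term -> list of references, each (journal abbreviation, volume, issue, pages).
references = defaultdict(list)

def add_reference(term, journal, volume, issue, pages):
    """File one reference under one index term."""
    references[term].append((journal, volume, issue, pages))

def lookup(term):
    """References filed under a term, plus those under its specific terms."""
    found = list(references.get(term, []))
    for specific in hierarchy.get(term, []):
        found.extend(references.get(specific, []))
    return found

# Hypothetical entries prepared before any search is made.
add_reference("chlorine", "J. Hyp. Chem.", 12, 3, "101-109")
add_reference("bromine", "Hyp. Rev.", 4, 1, "7-15")
add_reference("benzene", "J. Hyp. Chem.", 12, 4, "210-214")
```

With such a list prepared in advance, `lookup("halogens")` gathers the references filed under both "chlorine" and "bromine", so the searcher goes directly to the cited pages instead of examining every periodical.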
This would have given us about 60% of the references needed. The only references that would not have been on the list of references would have been those without selection terms, those that seemed "analogous," those on which we "had a hunch," and the like.

With our original search finally completed and, with this background knowledge, we start to devise a system that would have saved us these months of careful, patient, page-by-page, issue-by-issue, and journal-by-journal searching. It is clear that the prepared list of references about which we dream would have solved our present search problem to the extent of about 60%. It is equally apparent that it would not have been effective in an entirely different search. In order to take care of all searches, it will be necessary for us to develop a universal searching tool that will save page-by-page scanning of documents in looking for the answers to all questions.

We have become convinced that our searching is based solely on intelligible symbols (usually words) on paper and that these symbols are to be used principally for selection rather than rejection. We know that the terms used for selection or rejection can be grouped into hierarchies under generic terms. Also, and most


important, we became convinced during our tedious task that, "something must be done about this literature problem." With these facts and this motivation, we set out to plan and conquer in the following way.

Words representing new ideas, new facts, new data, in fact, all news, will be associated in our planned system with references to documents. Novelty or newness is an important criterion since we do not want to waste time of ourselves or other searchers on what is old, on what is already well known. Another criterion for selecting terms will be their relation to the subject of the document, i.e., based on new concepts that the author intended to communicate, not on what words he used. Thus, we must "subject index" and not "word index." As we think about it, this criterion will be a very subtle one and difficult to explain. Another criterion will be that the terms must be commonly used or popular ones that the searcher will most probably seek first. Still another criterion will be selection of the most specific term justified by the document and avoidance of generalization unless the author has sanctioned our doing so, i.e., we will be truthful in our selection and not push the author beyond limits he has already set. It is clear to us that if we index one document with general terms, thus pushing the author beyond the limits that he has set, then we must index all documents in the same way and with a pre-selected set of terms in order to ensure consistency. This will, we see, create a generic or classified index.

With our collection of selected terms and their associated references, our next problem is how to organize them.
We find that the terms can be arranged either into straight alphabetical order or into groups with like meanings to form a classification. The former will give an alphabetical index (of sorts) and the latter a classified index (of sorts). If all of the terms for indexing had been chosen from a prepared list of general terms, then we would have a generic index (of sorts).

We decide in favor of the alphabetical arrangement of specific terms since the searcher can approach it directly with the words that he knows rather than first having to turn to a classification schedule and from there to the classified index. We could have chosen a classified arrangement in order to help with generic questions and to suggest analogous answers in the event that the exact information sought is unavailable. During our page-by-page search, we made the discovery that analogous information can often be used.

Use of this simple alphabetical index of terms and references will, we believe, be much better than going through the literature page by page. However, as we visualize this simple index, it will present difficulties in actual use. We can see that some index terms will be very popular and have many references associated


with them. We can picture how frustrating it will be to look up all undifferentiated references under a popular term and find that, perhaps, no reference is suitable. Because of this difficulty we decide to use modifying phrases, or expressions associated with each of the original indexing terms to help in differentiating among references under the terms. We suspect that the optimum grammar and diction of these modifying phrases or auxiliary terms will need careful study and development.

In an alphabetical index, we realize that terms related by meaning often will be separated by the arbitrary order of the alphabet. We plan to solve this problem by indicating semantic relationships among the terms locked into alphabetical order. We can choose indicators external to the index, e.g., a thesaurus or a list, or internal indicators, e.g., cross references, or cross references and notes alphabetized along with the index entries. We chose internal indicators for greater convenience to the index user.

Alternatively, we could have chosen a classified index in which like concepts were brought together so far as possible. Also, we could have chosen to produce both kinds of indexes. We believe that it will be best to use alphabetical order for terms in both indexes because no other order is almost universally remembered.

In our search through chronological, unclassified stacks of periodicals, we also came to understand that shelf arrangement and a record of this arrangement would have saved us weeks of work in handling piles of journals, most of which actually proved to be barren.
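The alphabetical index planned above, with a modifying phrase to tell references under a popular term apart and with cross references filed in the same alphabet, might be sketched as follows. The entries, the cross reference, the output layout, and the name `build_index` are all hypothetical; the sketch makes no claim about the form of entry actually used by any indexing service.

```python
# Hypothetical entries: (index term, modifying phrase, reference).
entries = [
    ("Ethanol", "oxidation of, by chromate", "J. Hyp. Chem. 12, 101"),
    ("Benzene", "nitration of", "Hyp. Rev. 4, 7"),
    ("Ethanol", "as solvent for resins", "J. Hyp. Chem. 13, 55"),
]

# Hypothetical internal cross references, alphabetized along with the entries.
cross_refs = {"Alcohol, ethyl": "Ethanol"}

def build_index(entries, cross_refs):
    """Return printable index lines sorted by term, then by modifying phrase."""
    lines = [(term.lower(), mod.lower(), f"{term}, {mod}: {ref}")
             for term, mod, ref in entries]
    lines += [(src.lower(), "", f"{src}. See {dst}")
              for src, dst in cross_refs.items()]
    return [text for _, _, text in sorted(lines)]

index_lines = build_index(entries, cross_refs)
```

The two references under "Ethanol" are now distinguished by their modifying phrases, and a searcher who looks first under "Alcohol, ethyl" is directed to the term under which the references are actually filed.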


CONCLUSION

Through our imaginary, model search we have re-invented library classification, shelf arrangement, subject indexes, cross references, classified indexes, and generic indexes. We have decided that, no matter how inadequate classifications and indexes may be, they are far superior to none at all, and we suspect that these retrieval tools can be vastly different in effectiveness and cost, depending on how they are built. We have seen that documents are selected by some, but not all, of the intelligible terms in or on them, that listing terms for rejection of documents is not so effective as listing them for selection, that 34% of the documents are selected by terms in titles, that 66% are selected by terms in the bodies of documents, that 18 to 36% are selected by unpredicted terms, some of which seem always to be unpredictable. It should be pointed out that true subject indexing as practiced by the indexing staff of Chemical Abstracts uses index entries generated by all of the 18 to 36% of unpredicted and unpredictable terms. We have come, perhaps, to an appreciation of the fact that both classified and alphabetical indexes serve useful, and somewhat different, purposes.
