Current Research at Chemical Abstracts -

Chemical Abstracts Service, Ohio State University, Columbus, Ohio. The principal research projects which have been inaugurated at Chemical Abstracts d...
0 downloads 0 Views 488KB Size
CURRENT RESEARCH AT CHEMICAL ABSTRACTS* By G. MALCOLM DYSON Chemical Abstracts Service, Ohio State University, Columbus, Ohio

The principal r e s e a r c h p r o j e c t s which have been inaugurated a t Chemical Abstracts during the past y e a r a r e : (1) the study of annual index production; ( 2 ) the study of cumulative index production; ( 3 ) the organization of chemical knowledge with special r e g a r d t o s t o r a g e and r e t r i e v a l ; (4)the study of new journals and s e r v i c e s not covered by ( 3 ) ; (5) a study of chemical s e m a n t i c s ; (6) the production of a r e g i s t e r of chemical compounds. In addition, t h e r e is a joint project centered a t Chemical A b s t r a c t s and sponsored by the Synthetic Chemical Manuf a c t u r e r s Association (SOCMA) and the A m e r i can Chemical Society, f o r the production of a "Lexicon of Non-systematic and T r a d e Names for Organic Compounds .I1 I propose to d i s c u s s e a c h of t h e s e , in t u r n , paying m o s t attention to those which involve fundamental r e s e a r c h considerations. 1. T h e r e i s no need to s t r e s s the necessity for a study of index production. Chemical Abs t r a c t s indexes a r e probably of higher indexing quality and g r e a t e r indexing depth than any other scientific indexes. The thoroughness with which they a r e p r e p a r e d and the careful r e v i sion which they a r e given n e c e s s i t a t e expendit u r e of considerable t i m e . The economical u s e of t i m e in connection with indexes a l r e a d y has been the subject of a communication (Chemical & Engineering News, 38, 70 (1960)) elsewhere; what we a r e m o s t concerned with in o u r annual indexes i s the u s e of m o d e r n methods of cold type and l i n e - c a m e r a to the actual production of the index volumes. We have proved e x p e r i m e n tally by the u s e of Varityper and F o t o l i s t e r (both of which machines have been installed a t Col u m b u s ) , that a perfectly acceptable index page can be produced by this method, and that the p r i c e of printing such work by the litho-offset p r o c e s s i s a t l e a s t a s economical a s the t r a d i tional method. Indeed, a p a r t f r o m the potential saving in c o s t , c e r t a i n other advantages of this method have proved apparent: namely, g r e a t e r readability of s u b s c r i p t n u m e r a l s (always s o m e what painful in o u r traditional indexes) and a c l e a n e r -looking and m o r e readable page, s t e m ming f r o m the u s e of two columns instead of one. .This does not, a s might be expected, lead t o the u s e of m o r e pages; in'three-column f o r m a t t h e r e a r e so many s h o r t l i n e s that transf e r e n c e t o two-column f o r m a t often gives fewer l i n e s ; that is, a t h r e e - c o l u m n page of one hund r e d l i n e s p e r column often yields only one hundred and eighty to one hundred and ninety l i n e s instead of the t h e o r e t i c a l two hundred; i n c e r t a i n c a s e s the r e v e r s e is t r u e , but the s u r p r i s i n g fact e m e r g e s that the choice between

two- and three-column f o r m a t makes little diff e r e n c e t o the number of pages required for an index. Some o b s e r v e r s have formed the opinion that the two-column f o r m a t is m o r e e a s i l y r e a d , especially w h e r e the subject m a t t e r consists of organic n a m e s . The organizational difficulties in using this method a r e those associated with the conv e r s i o n of a batch p r o c e s s t o continuous p r o duction. A method is being worked out by which a s f a s t a s indexing i s done the e n t r i e s a r e s e t i n cold type, accumulated in p r o p e r indexing o r d e r and new e n t r i e s p e l d e d with the old, superfluous headings thrown out s o that the index is being continually s e t and proofed. This involves much collaborative investigation with our colleagues in the indexing and editing d e p a r t m e n t s , and when this problem is c o m pletely solved, a s i t will be f a i r l y soon, we shall have saved many months i n the t i m e elapsing between the end of a given y e a r and the appearance of the indexes, without s a c r i ficing anything in quality that is of value t o the user. 2 . The extension of t h e s e principles t o the production of cumulative indexes should be richly rewarding; the annual index c a r d s which, i n the future, will c a r r y the volume number, can be interfiled; u n n e c e s s a r y headings can be cut, and identical subject headings coalesced by exactly the s a m e techniques a s will be used f o r the annual indexes so that, a p a r t f r o m e d i t o r i a l revision f o r e r r o r s and consistency, no additional setting will be required. This should not only eliminate m o s t of the c o s t of r e - s e t t i n g but should considerably cut the delay in i s s u e .

3. Information Storage and Retrieval Extensive p r e l i m i n a r y work with v a r i o u s m e d i a and methods has established that machines are not the difficulty in this field; s u i t able devices exist f o r handling and selecting d a t a f r o m any s t o r e , provided that the l a t t e r is suitably organized. No new and wonderful m a chines a r e needed a t p r e s e n t . The p r o b l e m s -and they a r e s e r i o u s ones -- lie in the organization of the e n t r y of data into the s t o r e , the "good housekeeping" of the data once inside the s y s t e m , and the method of recall, especially i n relation t o the design of enquiries posed t o the s y s t e m . 3.1. Modus of Entry.--It is d e s i r a b l e that all information be e n t e r e d into the s y s t e m by a s i m p l e device. Philosophically, this "device" will be i n two parts: a code o r s e r i e s of codes

'Presented before the Division of Chemical Literature, New Yo& ACS Meeting, September 13, 1360.

24

CURRENT RESEARCH AT CHEMICAL ABSTRACTS

which symbolize the d a t a , and a permanent r e c o r d of the d a t a so coded. The f o r m e r will be d i s c u s s e d l a t e r ; f o r the l a t t e r we have selected a n o r m a l IBM c a r d , punched with the standard IBM026 Keypunch. Slight a l t e r a t i o n s a r e m a d e t o the punch so that signs such as #$e*, not used in coding, a r e replaced by m o r e useful symbols. This does not a l t e r the punching p a t t e r n , but m e r e l y the symbols associated with that pattern. 3.2. Modus of Decoding.-To t r a n s l a t e the encoded information back into a printed f o r m , the c a r d is r u i n through the sensing keypunch which is connected to a Document-writing an e l e c t r i c typewriter. This has upfeature p e r and lower c a s e l e t t e r s , n o r m a l s u b s c r i p t and s u p e r s c r i p t n u m e r a l s and a v a r i e t y of actuated b r a c k e t s and punctuation m a r k s , by a double-skuft solenoid-operated key-bank. T h r e e of t h e symbols on the punch a r e used as m o n i t o r s t o s e t the shifts o r t o underline. Thus, f o r example, if we punch *D into the c a r d actuates the first shift but p r i n t s nothing until the next column is r e a d so that l o w e r - c a s e I'd" is printed. Thus i f "2" is punched alone it is so a printed; if however, it is preceded by s u b s c r i p t "2" is printed. In this way, with t h r e e monitoring s i g n s , the whole of the above symbols can be used. 3.3. The P r i m a r y Record.-The f i r s t c a r d p r e p a r e d r e l a t e s t o the chemical n a t u r e of the substance. Thus, a c a r d is being p r e p a r e d f o r each chemical compound of definite composition. The r e c o r d f o r such a c a r d might read:

-

s., *

322543:B6ZN:4/C3:3N(A6)/2Cg:5X/C2:2N(C2)2 207022 This r e c o r d is i n t e r p r e t e d as follows: 3.31.-The s e r i e s of numbers up t o the first colon r e p r e s e n t s the molecular f o r m u l a c2 5 H43N302 3.32.-The s t r u c t u r e code r e p r e s e n t s the st r u c t u r e :

-

U 3.33.-The final figure-207022is the CA Register Number. This number is unique f o r the compound, and although it conveys no information concerning the n a t u r e of the s t r u c t u r e , it f o r m s the "address" by which the compound is found throughout our internal s y s t e m . Thus, in concept f i l e s ( s e e below), where the p r o p e r t i e s and attributes of compounds a r e filed, the r e g i s t e r number of the compound is used r a t h e r than its chemical code. In this way, it is proposed t o build up a complete Registry of a l l compounds ( s t a r t i n g

25

with the Organic division) which will be s e t in cold type and printed by l i n e - c a m e r a f r o m t i m e t o t i m e until a complete r e c o r d is obtained. It may be added that the example above is m e r e l y illustrative of our p r e s e n t pilot-plant experiments, other codes o r number s e r i e s may be used on the main project. So f a r , we have mounted all the compounds (both organic and inorganic) mentioned in the Formula Indexes of Chemical A b s t r a c t s according t o t h e i r molecular formula on edge-notched c a r d s ; all t h e s e c a r d s have been punched with the nature and number of elements p r e s e n t , so that it i s possible t o subdivide this pack according to any element o r combination of elements. As a pilot r u n we have selected the organic fluorine compounds and have built a subsidiary pack of s o m e 14,000 of t h e s e , which will s e r v e f o r a pilot run through the s y s t e m . F r o m this pack, IBM c a r d s of the nature described above a r e being p r e p a r e d . These, when complete, will constitute one stage in the development of the pilot run. 3.4. The Direct Access F i l e s . - - F o r i n t e r n a l u s e , and even when dealing with external inquiries where only a single compound i s in question, i t is important t o have a file that can be consulted manually. The file described above is not suitable f o r this purpose, s o that i t s information is replicated in a second file of IBM c a r d s containing a p e r t u r e s . The a p e r t u r e of any given c a r d is filled with a single f r a m e of 16 m m . film showing (a) the Registry number, (b) the s t r u c t u r a l formula drawn out and numbered, (c) the name used officially in %Indexes, and (d) a lead number to the bibliographic file which is a l s o punched in the c a r d . A u s e r identifies the required compound by Registry number (if already known t o him) o r via the s t r u c t u r e code /to r e g i s t r y number handbook, and then proceeds to the c o r r e c t number in the D i r e c t Access f i l e s , where all compounds a r e filed in o r d e r of t h e i r r e g i s t e r numbers. The m a t e r i a l in the a p e r t u r e can be viewed on a simple r e a d e r and copies made by M.M.M. reader -printer. Thus, an indexer o r other e n q u i r e r can m a k e a rapid approach to data on a few selected enquiries anywhere in the f i l e s . Since the files a r e kept in o r d e r of the Register Numbers, additions f r o m c u r r e n t work going through the shop can be m a d e without disorganizing the f i l e s , m e r e l y by adding a t the end. Experiments a r e being made, with quite successful r e s u l t s so far, to u s e a semiautomatic device f o r photographing the s t r u c t u r a l f o r m u l a s . These a r e photographed a t high speed and a t g r e a t reduction f r o m adjustable c a r d s with which the formulas a r e built up. The photocopies a r e mounted automatically in the a p e r t u r e s of the IBM c a r d s . 3.5. Concept Code Files.--The fundamental unit of the concept code c a r d pack i s a single

G. Malcolm Dyson

26

a b s t r a c t , the r e f e r e n c e number of which is placed on the c a r d together with the r e g i s t e r numbers of the compounds mentioned i n the a b s t r a c t and concept r e l a t e d t o the compounds mentioned. Thus, supposing that a single a b s t r a c t (No. 3435367) mentions the compounds 20027, 20028, 20029, 43789, and 68574 all i n connection with t h e i r u s e in trypanosomiasis in m a n , all substances having a moderate positive curative action using the intravenous route. The IBM c a r d f o r t h e s e d a t a will have the e n t r i e s :

=

20027 20028 20029 43789 68574

2.745.66

3435367

JACS59:2345

The f i r s t group of numbers r e p r e s e n t s the compounds, the second is a n a r b i t r a r y codification of the pharmacological d a t a and the t h i r d e n t r y is the a b s t r a c t . The final e n t r y is t h e r e f e r e n c e t o the original communication. The c h a r a c t e r i s t i c number "7" of the 2.745.66 indic a t e s that the physiological/pharmacological code is being used; "45" indicates human trypanos o m i a s i s and the ".66" the route by which the drug was administered in the experiments. The initial "2" is a r e c o r d , on an a r b i t r a r y s c a l e , of degree of activity. Compounds in which no a c tivity had been detected would have been e n t e r e d on another c a r d and the e n t r y for the physiological data would have r e a d 0.745.66 and " 0 " indicated null activity. S i m i l a r c a r d packs a r e being built f o r each phase of the p r o p e r t i e s of chemical s t r u c t u r e -- physical p r o p e r t i e s , p a t e n t s , industrial applications, e&. One probl e m that has had t o be solved i s a means of citing any of the 9,000 j o u r n a l s i n which chemic a l data m a y be found, without using too much of the c a r d . This has been done by constructing a f o u r - l e t t e r code f o r all the journals. A cons i d e r a b l e degree of mnemonic quality has been retained in t h e s e abbreviations -- thus, JACS f o r the J o u r n a l of the A m e r i c a n Chemical Society, JCSL for the J o u r n a l of the Chemical Society of London, JORG for the J o u r n a l of Organic Chemi s t r y r e c a l l t o mind the journal in question, often without r e c o u r s e t o the handbook. The full s e t of t h e s e abbreviations will, when checked, be published. 3.6. Methods of Retrieval.--The r e t r i e v a l s y s t e m is based on the IBM 1401 computer, which we hope t o have delivered shortly. The c a r d e n t r i e s f r o m the m a i n s t r u c t u r e file, duly s o r t e d into sections r e p r e s e n t i n g the elements and important groups will be converted to m a g netic t a p e s , a s a l s o will the c a r d s r e p r e s e n t i n g the concepts and r e f e r e n c e s . L e t u s consider a problem such as the p r e p a r a t i o n of a bibliography of r e f e r e n c e s dealing with those halogenated pyridine d e r i v a tives which have shown positive tuberculostatic

-

activity i n experimental a n i m a l s . The halogen/ nitrogen section of the m a i n file i s s e a r c h e d f o r such e n t r i e s a s contain e n t r i e s f o r pyridine and halogen; the answer i s a s e r i e s of r e g i s t e r numbers. This s e a r c h is m a d e , of c o u r s e , on the magnetic tape and not on the generating c a r d s . The numbers obtained i n this s e a r c h a r e then s e t t o s e a r c h f o r t h e s a m e numbers in the physiological /pharmacological f i l e , but the s e a r c h i s planned t o r e c o r d only those i n s t a n c e s where the r e g i s t e r number is identical and where the physiological code figure is of the f o r m a.bcd.mno where a is not z e r o and c s a r e 45 (the code f o r tuberculosis). The answer t o this s e a r c h is a tape showing a l l compounds of pyridine with halogens having positive phar macological action in t u b e r c u l o s i s . The output can be converted t o a printed l i s t showing a l l the d a t a of the concept file c a r d s concerned. F r o m this the a b s t r a c t s o r original r e f e r e n c e s can be taken off. This in itself m a y be the answer r e q u i r e d , o r the enquiree m a y p r e f e r a p e r t u r e c a r d s of the a b s t r a c t s themselves, o r even copies of t h e a b s t r a c t s i n chronological o r d e r printed xerographically. All t h e s e f o r m s a r e now the subject of experimental evaluation. It will have been noted tf.at the answer t o the question s e t out in the previous paragraph is on a somewhat b r o a d e r b a s i s than the question itself; the question included the p h r a s e "in experimental animals" and t h e answer covered all observations on the tuberculostatic activity of the compounds in question. This provision of r a t h e r m o r e m a t e r i a l than the question a p p e a r s to demand -- sometimes called the p r o vision of "display" -- is n e c e s s a r y t o make c e r t a i n that nothing of i n t e r e s t t o the e n q u i r e r h a s been overlooked. The subject is one that cannot be discussed fully h e r e ; i t might well be the subject of a s e p a r a t e talk.

-

4. New J o u r n a l s and S e r v i c e s P l a n s a l r e a d y have been made t o introduce a new j o u r n a l , Chemical T i t l e s , as f r o m January of 1961. This journal l i s t s the l a t e s t t i t l e s r e ceived at Chemical A b s t r a c t s according t o the Keyword-in-Context s y s t e m and provides a quick approach t o c u r r e n t chemical l i t e r a t u r e . This journal h a s been sufficiently introduced t o m e m b e r s of the Society not to need f u r t h e r d i s cussion h e r e . F r o m the experimental point of view, one of the m o s t i n t e r e s t i n g phases of the development of Chemical Titles has been the programming of the computer and building up of a m e m o r y s t o p - l i s t t o prevent u s e l e s s index e n t r i e s being included i n the publication. It i s believed that this is the f i r s t computer -produced journal of any magnitude t o be issued. Naturally, o u r i n t e r e s t in the r e s e a r c h d i vision has moved t o other types of new journals and s e r v i c e s . T h e r e a p p e a r s t o be a definite

27

CURRENT RES.EARCH AT CHEMICAL ABSTRACTS

need for a classified l i s t of all organic c o m pounds d e s c r i b e d and examined in a given period on a s c u r r e n t a b a s i s a s possible. We have been experimenting f o r s o m e considerable t i m e with the product of ,such a l i s t . The m a t e r i a l f o r a given i s s u e would c o r r e s p o n d t o the t i t l e s of the corresponding i s s u e of Chemical Titles; this would m e a n that the condensed r e f e r e n c e s f o r the new R e s t r i c t e d e x p r e s s l i s t s ( R E L ) would be used a s the "trace-back" for relating c o m pound and l i t e r a t u r e . E n t r i e s would be made under the molecular f o r m u l a s , subdivisionunder e a c h m o l e c u l a r formula heading would be by s t r u c t u r e notation and, in addition t o t h e s e , the r e g i s t e r n u m b e r s and r e f e r e n c e s t o original t i t l e s will be uised. A distinguishing m a r k will be placed on compounds which a r e new, in the s e n s e that they have not appeared previously in our r e c o r d s . We r e g a r d the u s e of this m a r k a s a distinct advance; heretofore t h e r e has been nothing in any index to indicate the newness of a compound; should this m a r k appear on a c o m pound it will immediately convey t o the o b s e r v e r that t h e r e is no n e c e s s i t y t o s e a r c h the e a r l i e r l i t e r a t u r e f o r a.dditiona1 information. 4.1. Lexicography.--The pilot run comp r i s i n g all data! on the organic fluorine c o m will bring together a m a s s of m a t e r i a l pounds which has not previously been s e l e c t e d f r o m the chemical r e c o r d . In a c a s e such a s t h i s , it i s v e r y probable that the collected data will have a wide i n t e r e s t ; it i s c e r t a i n that the obtaining of such data f r o m the r e c o r d would be a long and tedious t a s k , and that having been done once should not be done again. With this in mind it i s proposed t o u s e the m a t e r i a l of this section t o produce a "Lexicon of Organic Fluor i n e Compounds," which will show f o r e a c h compound: (1) r e g i s t e r number; (2) m o l e c u l a r f o r m u l a , ( 3 ) s t r u c t u r e notation, (4) bibliography, (5) physical p r o p e r t i e s , and such other data a s m a y a p p e a r t o be generally useful. Studies in the publication of such volumes will indicate w h e r e and how the principle m a y be applied ef fectively t o other groups of compounds. So f a r we have made no decisions a s to whether to choose the next pilot group on the b a s i s of a the b r o m i n e compounds, the single element organo-tin comlpounds, the organo-phosphorus o r on the b a s i s of a significant compounds group a s the nitro-compounds, the hydroc a r b o n s , o r s o m e s i m i l a r aggregate.

-

--

-

--

--

--

5. Chemical Semantics One i s plunged into s e m a n t i c p r o b l e m s i m mediately r e s e a r c h in the field of chemical documentation i s s t a r t e d . One of our e a r l i e s t s t u d i e s , on Chemical T i t l e s , posed the question a s t o which words i n a t i t l e a r e significant and which a r e not w o r t h inclusion i n an index. This i s not the s i m p l e p r o b l e m that s o m e imagine;

whether t o include "analysis ,I' "detection," and "determination" a s indexing points i s not t o be decided by some s i m p l e r u l e , but i s to be cons i d e r e d in relation t o the u s e r s ! needs. In the s a m e way we have decided not to u s e "synthes i s " but to r e t a i n "syntheses"; to r e j e c t "acid" but t o r e t a i n "acids" on the grounds, not of any r u l e , hut of the consideration of e a c h c a s e in relation t o the u s e r s 1 p r o b l e m s . This may p e r haps r e s u l t in be$ter t i t l e s ; we have a l r e a d y had one instance of an i r a t e chemist whose paper on "An Introduction to the Study of O r ganic B a s e s " was completely unindexed because all of the words in the title w e r e judged by the computer t o be non-significant. We have s t a r t e d a stqdy of the language u s e d by c h e m i s t s -- that is, the language they original c o m u s e in €orma1 printed r e c o r d s munications and a b s t r a c t s . A method has been devised, using only the s i m p l e standard IBM equipment, by which words and p h r a s e s used in text can be segregated and l i s t e d in context; this i s leading u s to a study of synonyms and what i s wrongly called the " t h e s a u r i c content" of language - - used in this field. So f a r we have r e s t r i c t e d o u r work to Section 11 of Chemical the Section dealing with physioAbstracts logical m a t t e r s . This choice was made p a r t l y t o avoid the m a s s e s of s t r u c t u r a l n a m e s of the organic division and p a r t l y because our iinmediate needs f o r concept coding lay in this a r e a . We r e g a r d this a s l o n g - t e r m r e s e a r c h ; we a r e not solving an immediate p r o b l e m , but s u r v e y ing the g e n e r a l sub-discipline. It will be some t i m e , I expect, before we can d e m o n s t r a t e the effect of this work i n our day-to-day routines. N e v e r t h e l e s s , since the whole of o u r work is fundamentally based on language, this study cannot help but be rewarding.

--

-

6. The CA R e e i s t e r of ComDounds Reference a l r e a d y has been m a d e t o the p r e p a r a t i o n of this r e g i s t e r . F o r o u r own i n t e r n a l p u r p o s e s it i s n e c e s s a r y to have such a r e g i s t e r , and we have decided t o give e v e r y s t r u c t u r e -- organic and inorganic a sevendigit n u m b e r , a s an "address." T h e s e n u m b e r s a r e s t r i c t l y given a t random; they a r e "idiot" n u m b e r s in the s e n s e that they have no significance but t h e i r uniqueness. This i s n e c e s s a r y in o r d e r e a s i l y t o manage files and t o allow f o r continual updating without destroying previous o r d e r . It i s hoped that o t h e r s may find such n u m b e r s useful in communicating s t r u c t u r e s by w r i t t e n o r t e l e g r a p h i c m e a n s . With t h i s in view we shall publish, in due t i m e , a l i s t of such n u m b e r s , together with the s t r u c t u r e s t o which they r e f e r . This dictionary will, of c o u r s e , be "two-way" in the s e n s e that one can look up both n u m b e r s and s t r u c t u r e s , This p r e s e n t s v a r i o u s p r o b l e m s which we a r e endeavoring

--

28

G. Malcolm Dyson

to solve; i t i s , however, too e a r l y a s yet t o r e port on the list.

7. Lexicon of Non-systematic and Trade Names f o r Organic Compounds The American Chemical Society in collaboration with the Synthetic Organic Chemical Manufacturers Association has agreed t o p r o duce a lexicon with the above title. There has b'een need f o r a long t i m e now of some compilation of non-systematic names used in organic chemistry with their official equivalents. The work of this compilation is being undertaken by the Research Division of Chemical Abstracts. There will be two s e p a r a t e editions one a full scientific lexicon containing all the names found, and a second containing those names used in industry. It s e e m s that the f o r m e r will contain s o m e 35,000 e n t r i e s and the l a t t e r about 7,000 e n t r i e s . Each edition will be in two p a r t s ;

-

the first will contain names, a r r a n g e d alphabetically and Register Numbers. Every name whether synonymous o r not will appear in this l i s t . The second p a r t of the work will contain the Register Number (in o r d e r of which the e n t r i e s in this section will be arranged); the molecular formula, the s t r u c t u r e notation, the drawn and numbered s t r u c t u r e of the compounds, together with the official name, the other scientific names which have been given t o the compounds and any other "trivial" o r t r a d e names. Thus, a u s e r a s c e r t a i n s f r o m the first list the Register Number of the compound he is interested in; looks up that number in the second p a r t and finds displayed all the nomenclatural data on the compound that is available. So far s o m e 18,000 e n t r i e s have been collected. This r e c o r d shows what has been done in the Division during the last twelve months; I hope f r o m t i m e t o t i m e t o add further p r o g r e s s r e p o r t s either h e r e o r i n Chemical h Engineering News.

CA