A Computer Program to Index or Search Linear Notations*†

KURT D. OFER
Scientific Department, Ministry of Defence, Tel-Aviv, Israel

Received May 21, 1968

The program will select from a file those notations which contain specified Boolean expressions of notation symbols; a wide range of selection criteria is available and batching of questions is possible. The program can be used to prepare a Key Symbol Out of Context index, to disseminate notices on additions to the file to interested chemists, to search the file for compounds with certain structural elements and list and/or count the selected items, or to convert from a linear to a fragmentation code. Although the program was designed to process Wiswesser line notations, it can be adapted to other linear notations with trivial changes; it is coded in FORTRAN and can thus be implemented on almost any computer.

As a rule, programs for processing notations are written in assembler language; the reason for this is that most problem-oriented languages cannot conveniently handle the character strings which form a notation. It takes a lot of time to code and debug an assembler program; it can be run only on a specific machine and is not easily modified when requirements change, as they are bound to do in the experimental phase of the work. PL/I and certain dialects of FORTRAN IV (notably those implemented on CDC computers) have character-handling capability together with all the other advantages of high-level languages, but they are handicapped by their dependence on a certain computer series and its character set. The program to be described here is written in FORTRAN II and can therefore be run on almost any computer; symbols having a special meaning are defined by an input card, thus the implementation does not depend upon the character set used. Indeed, the same program is able to process several files of the same notation punched in different character sets; since it is easily changed, it should be ideal for experimental work such as:

1. Investigating the efficacy of classification numbers (2) or other sieves.
2. Obtaining statistics of the file.
3. Comparing modifications of the notation syntax (such as the effect of methyl contraction or multipliers).
4. Defining groups for estimation of physical properties (3).

It can, of course, also be used for production runs until a more efficient special-purpose program is available. The program has been run satisfactorily on a Philco 2000 and on a CDC 3400 without any change (as a matter of fact, it was originally coded in ALTAC and then automatically translated into FORTRAN by a program written for this task).

* Presented before the Division of Chemical Literature, 155th Meeting, ACS, San Francisco, Calif., April 4, 1968.
† The program deck is available from Dr. H. T. Bonnett, G. D. Searle & Co., P.O. Box 5110, Chicago, Ill. 60680.
JOURNAL OF CHEMICAL DOCUMENTATION, Vol. 8, No. 3, August 1968

There are many uses for a notation file or parts thereof (such as periodical additions), the main ones being:

a. To search the file for notations with specified terms (retrieval), the selected items to be listed and/or counted.
b. To index the file.
c. To disseminate notices on additions to the file.
d. To prepare a file in a different notation such as a fragmentation code (mainly in installations where an earlier fragmentation file is in use).

The common denominator of all these processes is the selection of records from the file according to specified selection criteria. The output takes different forms: printed listings or magnetic tapes (or disks) for subsequent sorting (the program, as written, provides for printed output only; the change to writing on tape or disk is trivial and most installations have their own sorting routines).

The file is expected to be on cards with the following format: card columns (c.c.) 1-6, serial number; c.c. 7-80, notation. Up to two trailer cards (same serial number, no other indication required) are allowed, thus notations with a maximum length of 223 symbols can be processed. Change to other formats or to tape input is again trivial. Information separated from the notation by more than two delimiter symbols (normally blanks) will be treated as comment and not processed but will appear in the printed output.

The selection criteria have evolved from those described earlier (4) in connection with a Key Symbol Out of Context (KSOC) program. They are specified by cards at the start of each run, preceded by a definition card of the format given in Table I.
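The card layout described above can be sketched in a few lines. The following Python fragment is only an illustration under stated assumptions, not the FORTRAN II code of the program: the names Record, parse_card, split_comment and read_file are hypothetical, a blank serial field is read as zero, the delimiter is a single character, and the end-of-file convention (a card with zero or negative serial number) is the one the paper gives when it describes the run deck.

```python
from dataclasses import dataclass


@dataclass
class Record:
    serial: int
    notation: str
    comment: str


def parse_card(card: str) -> tuple[int, str]:
    """Split one 80-column card image into (serial number, text field)."""
    image = card.rstrip("\n").ljust(80)[:80]
    serial_field = image[0:6].strip()          # card columns 1-6
    # Assumption: a blank serial field is read as zero.
    serial = int(serial_field) if serial_field else 0
    return serial, image[6:80]                 # card columns 7-80


def split_comment(text: str, delimiter: str = " ") -> tuple[str, str]:
    """Cut the text at the first run of more than two delimiter symbols."""
    cut = text.find(delimiter * 3)
    if cut < 0:
        return text.rstrip(delimiter), ""
    return text[:cut], text[cut:].strip(delimiter)


def read_file(cards, delimiter: str = " ", max_trailers: int = 2) -> list[Record]:
    """Group cards by serial number and stop at the end-of-file card."""
    records: list[Record] = []
    serial, fields = None, []

    def flush() -> None:
        # Store the record collected so far, if there is one.
        if serial is not None:
            notation, comment = split_comment("".join(fields), delimiter)
            records.append(Record(serial, notation, comment))

    for card in cards:
        number, text = parse_card(card)
        if number <= 0:                        # zero or negative serial: end of file
            break
        if number == serial and len(fields) < 1 + max_trailers:
            fields.append(text)                # trailer card continues the record
        else:
            flush()
            serial, fields = number, [text]
    flush()
    return records
```

Trailer cards are simply concatenated with the first card before the comment is split off, so a notation that fills columns 7-80 continues on the next card without a gap.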

Table I. Layout of Definition Card

Definition                      Example
Numeral 1                       1
Numeral 2                       2
Numeral 3                       3
Numeral 4                       4
Numeral 5                       5
Numeral 6                       6
Numeral 7                       7
Numeral 8                       8
Numeral 9                       9
Numeral 10                      0
General Numeral (Alkyl)         A
Delimiter                       blank
End of profile                  #
Halogen 1                       E
Halogen 2                       F
Halogen 3                       G
Halogen 4                       I
Halogen 5                       J
Negation                        Downward arrow
Start of OR bracket             (
End of OR bracket
End of term in OR bracket
"Don't care" symbol
Switch symbol
Profile in profile
Search for Alkyl
Search for Halogen

(The example symbols for the last seven entries are only partly legible in the source; the surviving fragments read "1", "-", "Dollar", "Asterisk".)

Table II. Definitions of Selection Criteria

<Switch identifier> ::= Numeral 1 | Numeral 2 | ... | Numeral 9
<Search identifier> ::= any character which is not a <Switch identifier>

The definitions of the selection criteria in Backus Normal Form (1) are detailed in Table II. These cards are followed by a card with a delimiter symbol in c.c. 1; after this come the file cards, the end-of-file being signalled by a card with zero (or negative) serial number.

The program will print the selection criteria, then, for each selected notation, the profile identification (unless this is a switch designation), the multiplicity of this profile in the notation, the serial number, and the full notation including any comments (see above). At the end-of-file signal, the number of notations answering each profile is printed.

The program as written allows the simultaneous processing of up to 100 profiles with not more than 1000 characters; it occupies (together with the compiler-generated input-output routines) 6090 words of memory in the CDC 3400. The processing speed is limited by the input-output, which is not overlapped in this machine. The 10 notations and 15 profiles of the example took 1.43 minutes for compile and go. Considerable speeding-up is possible when the multiplicity of hits is not required.

Although the program was designed primarily to process Wiswesser Line-Formula Chemical Notations, only a few modifications are required to adapt it to any other linear notation.
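The printed output described above (one line per hit, then a count per profile at the end of the file) can be illustrated by a short Python sketch. It is a simplification and not the published program: each profile is assumed to be a plain string of notation symbols, so the operators of Table II are not modelled, and the names run, profiles and records are hypothetical. The paper does not say how multiplicity is counted; the sketch uses str.count, which counts non-overlapping occurrences.

```python
def run(profiles: dict[str, str], records: list[tuple[int, str]]):
    """Return the hit lines and the number of notations answering each profile."""
    lines = []
    totals = {identifier: 0 for identifier in profiles}
    for serial, notation in records:
        for identifier, symbols in profiles.items():
            multiplicity = notation.count(symbols)   # non-overlapping occurrences
            if multiplicity:
                # One output line: profile, multiplicity, serial number, notation.
                lines.append((identifier, multiplicity, serial, notation))
                totals[identifier] += 1              # counts notations, not occurrences
    return lines, totals
```

When the multiplicity is not required, the test `symbols in notation` would suffice and can stop at the first occurrence, which is in the spirit of the speeding-up mentioned in the text.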


Table II. Definitions of Selection Criteria (continued)

<Close-switch operator> ::= Numeral 2
    Will close the designated switch.
<Open-switch operator> ::= Numeral 3
    Will open the designated switch.
<Profile identifier> ::= <Search identifier> <Search operator> | <Switch identifier> <Close-switch operator> | <Switch identifier> <Open-switch operator>
<Operator symbol> ::= Delimiter | End of profile | Negation | Start OR | End OR | End OR term | Switch symbol | Profile in profile | Search for Alkyl | Search for Halogen | Don't care
<Notation symbol> ::= any character which is not an <Operator symbol>
<Notation character> ::= <Notation symbol> | Don't care
<Primitive term> ::= <Notation character> | <Notation character> <Primitive term> | Profile in profile <Search identifier> | Switch symbol <Switch identifier>
    'Profile in profile' will retrieve any predefined profile which is answered by the notation. 'Switch' will be a hit if the designated switch is closed.
<OR term> ::= <Primitive term> End OR term | <OR term> <OR term>
<OR bracket> ::= Start OR <OR term> End OR
    Will retrieve any of the OR's.
<Term> ::= <Primitive term> | <OR bracket> | <Term> <Term> | <Term> Negation <Term>
<Profile> ::= <Profile identifier> <Term> End of profile
<Notation> ::= <Notation symbol> Delimiter Delimiter Delimiter | <Notation symbol> <Notation>
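One possible reading of the Boolean part of Table II is sketched below in Python. It is an interpretation under stated assumptions, not the published program: adjacent terms are assumed to mean that every term must be present, 'Negation' is assumed to mean that the left term is present and the right term absent, and "?" is a hypothetical stand-in for the "Don't care" symbol, which in the real program is set on the definition card. The class and function names are hypothetical, terms are built directly as objects rather than parsed from profile cards, and 'Profile in profile' and switches are left out.

```python
from dataclasses import dataclass

DONT_CARE = "?"  # hypothetical stand-in; the real symbol comes from the definition card


@dataclass
class Prim:
    """<Primitive term>: a string of notation characters."""
    pattern: str


@dataclass
class Or:
    """<OR bracket>: any one of the enclosed terms may be present."""
    terms: list


@dataclass
class And:
    """<Term> <Term>: assumed to mean that every term must be present."""
    terms: list


@dataclass
class ButNot:
    """<Term> Negation <Term>: assumed to mean 'left present, right absent'."""
    keep: object
    drop: object


def occurrences(pattern: str, notation: str) -> int:
    """Count the positions (overlaps included) where the pattern fits."""
    n = len(pattern)
    if n == 0:
        return 0
    return sum(
        all(p == DONT_CARE or p == c for p, c in zip(pattern, notation[i:i + n]))
        for i in range(len(notation) - n + 1)
    )


def hit(term, notation: str) -> bool:
    """Decide whether a notation answers a term."""
    if isinstance(term, Prim):
        return occurrences(term.pattern, notation) > 0
    if isinstance(term, Or):
        return any(hit(t, notation) for t in term.terms)
    if isinstance(term, And):
        return all(hit(t, notation) for t in term.terms)
    if isinstance(term, ButNot):
        return hit(term.keep, notation) and not hit(term.drop, notation)
    raise TypeError(f"unknown term: {term!r}")
```

For example, `hit(Or([Prim("E"), Prim("G")]), notation)` selects a notation that contains either symbol, and `occurrences` gives one way of obtaining a multiplicity for a primitive term.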

The subject of the notation need not even be chemical; indeed, concordances or statistics of syllables, words, and phrases in texts of natural languages can be prepared with the unmodified program.

ACKNOWLEDGMENT

My thanks are due to Dr. A. Betser for encouragement in the work, and to the NSF for the travel grant which made my attendance at this meeting possible.

LITERATURE CITED

(1) Backus, J. W., Proc. Intern. Conf. Inf. Proc., UNESCO, Paris, pp. 125-32, 1959, Oldenburg, Munich, 1960.
(2) Bonnett, H. T., and D. W. Calhoun, J. Chem. Doc. 2, 2-6 (1962).
(3) Brasie, W. C., and D. W. Liou, Chem. Eng. Progr. 61 (5), 102-8 (1965).
(4) Ofer, K. D., C. S. Rice, R. B. Bourne, and S. W. Logan, "A Pilot Study for the Input to a Chemical-Structure Retrieval System," 151st Meeting, ACS, Pittsburgh, Pa., March 23, 1966.
(5) Smith, E. G., W. Wiswesser, "Revised Rules of W. J. Wiswesser Line-Formula Chemical Notation," 2nd ed., Crowell Co., New York, 1967.
