Plausibility Check of Chemical Coding by Computer Using a Modified

Plausibility Check of Chemical Coding by Computer Using a Modified Derwent Robins Program. J. Fitting · H. Lehna · K. Specht · Cite This:J. Chem...
2 downloads 0 Views 370KB Size
J. FITTING, H.LEHNA,AND K. SPECHT

Plausibility Check of Chemical Coding by Computer Using a Modified Derwent Robins Program J. FITTING, H. LEHNA, and K. SPECHT' Research Laboratories, Schering AG, Berlin/Bergkamen, Germany Received December 27. 1973

The checking of chemical coding (Ringcode) by computer Is described. The check is based on the normal search procedure for Ringcode coded files. Therefore the different logical conditions can be changed easily. Adoption to every chemical Ringcode is possible.

The plausibility check gives us the automatic possibility of finding errors in chemical coding. The coders get alert information about type and frequency of their own mistakes. This leads to optimal coding work. Time and.costs consumed in manual checks of the coding quality are reduced to spot checks. The main reasons for incorrect files are 1. Wrong interpretation of publications to be coded 2. Punching errors 3. Coding errors The wrong interpretation of publications cannot be checked by computer, but, in this area, spot checks are sufficient to prove the standard of a coder. Punching errors which occur nonsystematically can be found to a certain extent, but with an obligatory double check of the keypunching this type of error will be very rare. Pure coding errors can be discovered by computer because of the logical connection of the different fragments. This detection will of course not be total, but a considerable number of mistakes will be found. At first, the different logical correlations between the fragments of the chemical codes are determined. There are two possibilities to proceed: 1. to write a program which considers these correlations to a certain extent 2. to use a retrieval program and to define the correlations similar to the procedure for computer search runs We decided on the second possibility because to us this solution seemed to be more flexible and has the option to be transferable to every similar chemical fragmentation code used now and in the future. We combined a certain sum of check queries. After having them tested and evaluated we combined 31 Check Queries for the General Chemical Code 15 Check Queries for the Steroid Code for the plausibility checks. To illustrate the plausibility check method, an example is given: Figure 1 shows the encoding of a chemical compound with the General Chemical Code. The coding is wrong; the position 7/11 is missing. Using the plausibility check for the General Chemical Code, an error message will be given by the computer: SEVERAL X IN ONE RING Simultaneously a card with the wrong coding will be punched; the corresponding file on the tape is obliterated. The logical correlations for this mentioned error are as follows. The positions 812-817 define the positions of het* Author to whom correspondence should be addressed

76

Journal of Chemical Documentation, Vol. 14, No. 2, 1974

eroatoms in rings with more than one heteroatom. Additionally the positions 7/11 or 714 must be punched. Therefore, we use the following check formula (812 v ai3 v ai4 v ai5 v 816 v 8/71 A (7111v 7/41 "

I

A

-

or if A, then B ( A

-

B B1

On the other hand, if the fragments 7/11 and 7/4 are present, there must be statements in the positions 8/2-8/7, too. Therefore, the logical condition if B, then A (B

+

A1

is correct, too. The condition is alternate. If we use the following definition of a query ( A A E)

V

(A

A

B)

we shall get no error message if both, A and B, are present, and if both are absent. Considering the fact that very frequently coordinate coding terms are both missing, we found it necessary not to check only A and B alternately but to check additionally one or both by further check positions: Condensed heterocycles, for example, have the punch positions (311 V 4/d V 6/11)

c and always need statements in 814-817 (A'). A' has two more possible positions than A. The new logical condition is if C, then A' (C

-

A')

Because A' = ( A

V

816

V

8/11 = > A

it is not possible to detect quantitatively the combined missing of A and B. But the possibility of overlooking erroneous codings is remarkably limited. Generally we try to set up a very fine mesh when we define the check queries. Figure 2 shows all the punch positions which may be checked using our question set with regard to the coding of the substance shown in Figure 1. Some changes of the Derwent Robins Program had to be made. For example, it was necessary to check every card of a file. To identify the erroneous cards, the card's sequence numbers have been included in the address. Files which do not have numbered cards will be numbered by the pro-

CHECK OF CHEMICAL CODING BY A

MODIFIEDDERWENTROBINSPROGRAM

P

Klart e x t :

316

116

216

ANORGAN IUWANI

llGlNl(H

516

616

716

616

719

819

618

E7 619 I

17/12

18112

1

19112

-

21112

PO1Y

SOL7

20111

iq-iqx AN F

GRUPPt AN F

2010

(4;OH

R-NH;

I

20112

jj?Jq-E AN f

1

I

Am3 RIIRO

I

HfllRO (KLUI

p. 28

21111

2211 1

H

GRUPPi AN F

R

2011

2111

C'

2,%14010. AKIlVt -2610 On, O M 01 SH, IM IS

R-NHF

2012

2112

>P-

OH

'p,

R-NRT

1915 R-NR?

1916

R-N(CH;

2,014,

/Si, 2015 \

92016 GWPt I RING

2017

y=o 1918

N+ rrcllstn 1719

I

1819

1919

HOYOLDGI

AllG

OZOHID 4110

NHR

2113

2213

2613

OR

KtIIt

NR2

2114

2214

SH, SI 2115 OM, SM 01,

sr

2116

NTz 2117

-0F-I

-OI--S'

OH, OM

-

g

2216

ANDIRt HtTtRO

2218 IDllSllGf

2019

2119

22/9

0-C

2713

3013

2714 si

4 3015 Andere

2715

P 5

Hexosan

R+ x12617

N-0

-0-s

2

c _

2217

-

Mg

Hg -

2616

-

2118

7

Sp 17 ( :y ( *-Y

or

Y=S

HOMOLO(

2614

0X.NX.I

2215 OR,

LI

2712

-

NTZ

2018

Y=NT

No

I

I

1914

2711

"2

2612

I

2013 'y

2611

1 --

1912

30112 Ribose

2711 1

2110

:S=Y

Sp. 29

\I

I

k l

IPURtN

YOMPLEX

ADO VtRB

2719

-

MI1A L l O R l

HOMOLOCI

ANOlRl

31

32

33

62

65

66

n =MISTAKE Figure 1.

Journal of Chemical Documentation, Vol. 14, No. 2,1974

77

-* J. FITTING,H. LEHNA,AND K. SPECHT

Klart e x t :

11/12

IC OM Y LDUL SRS. E I

c

fOLY

12/12

1311P

14/12

15/12

16/12

POLY

11111

2

Ill0

1210 I

3

-fH

I

1111

D

1211,

4

1112

114 KUNITIT U A POLYM

1 1 21f,,

6b

913

3N

,

1013

-(I

7

1212 (H3X

3/10 1113

1213

'1/16 -(H,N 'KtTTti

-1114

3/43

IIOlItRUNG

1.2

3

12/4

217 7

115

215

915

315

1.3

1015

1115

1

1.34

I

I

1116 ANORGAN IUBITANZ

KiARTlXT

tlGtNICU

DARII

zs

2

317

1117

1217

3

$tX

PRY1

9/8

11018

1118

4/6 1119

17112 POLY

I I 18112 POLY

19112

20112

21/12

-

22112

20111

21111

-

g

22111

GRUPPt

2110

,S=Y 2011

-N = N-

W O M , OR I H , IM. I R

2211

2611

=NPN

NHz

2013

2113

2213

2613

>p:y

OR

KtTTt

N Rz

2015 \

,B-

2016 GRUPPt I RING

2215

OM, IM

01,ST

2116

NTz

2614 1X.NX.I)

OH, OM OR, 01

Anmt

R'X-

HtItRO

2617

2217 N-0

2118 -0-Si

2218

2618

IONIliGf

KOMPltX ADD VERB

2119

2219

HOMOlOGt

iOMOLOG1

R' R-

2619

-

-

MULLLOR[ RADIKPL

4

2716

5 6

2717 HAL

7

2718

8

2719 ANOlRl

31

@ =CHECKABLE

Figure 2.

Journal of Chemical Documentation, Vol. 14, No. 2 , 1974

SI

-

--

-

-

78

2714

P S

2616

2018

2019 y = NT

Mannose

-

2216

2117 -os--ss.

Y=S

p-

2715

2017 y=o

-05--IS-

Galactose

--

NTZ

-

/Joi? Glucose

-

2214

2115

Yonodssoxipintom

NHR

-

SH,SR

ANOtRt

2612

2212 + -NsN

2114

24esoxyribose

-

OH

2f14, ,Si,

2112

2610

2210

>P-

2012

-NR]

2111

Ribose

H

AN f -

2010

3011 2

26112

POL(

POLY

E7 -

32

33

62

65

66

DRS-A

USER ORIENTEDINFORMATION RETRIEVALSYSTEM

Coded compound

Punch positions

output

erroneously absent

OH

a-CH-CHz-0

6/12

6/6

100825

01

0 2 9 ISOLATED R I N G

0 1 4 UNSATURATED HETEROCYCLE

\

do

0

10437s

2/1 25/11

25/0

01

035

SIDE CHAIK

0 4 4 COL.

25/7

104375

23/7

02

25

003 Y I N COL. 2 3

Figure 3.

gram. A new listprogram was written. This program allows the printing of accession number and card number of erroneous cards together with all mistakes (see Figure 3). Contrary to a fixed check program, the procedure de-

DRS-A

scribed here gives the possibility of changing the logical conditions very easily. Thus it is possible to consider the changing of coding rules a.s.0. without changing the program.

User Oriented Information Retrieval System GEZA SZONYI

Polaroid Corporation, Cambridge, Massachusetts 02139 Received December 6, 1973 DRS (Data Retrieval System), a user-oriented computerized information retrieval system, is described. The system allows data base generation and information retrieval by users without programming and systems knowledge. As an illustration, a chemical information system using the Wlswesser Line Notation (WLN) is generated, and retrieval examples are given using this data base. Because of its easily understandable English command language, DRS is suitable for a wide variety of information retrieval systems. Polaroid was instrumental in many of the advanced features of DRS developed during the past three years.

FAMULUS, a user-oriented computerized information retrieval system was presented in the fall of 1969.l This system, consisting of eight main and three peripheral computer programs, allows the user to set up data bases with relative ease. Having done-this, data manipulation, such as addition, correction, and deletion of information units or records, can be carried out relatively simply. Subsequently, searching of the data base by Boolean logic or by keywords is accomplished by English commands, and the retrieved subset can then be sorted alphabetically. In addition, data subunits or files can be merged, and printouts of the retrieved subset can be displayed in several output formats.* The system was mainly designed to handle personal reference collections but appeared to have applicability for all types of computerized information retrieval systems3 Unfortunately, the system was designed for IBM System/S60

Control Data Corp. 6400 and 6600, UNIVAC 1108, or Xerox Sigma 7 type computers.',* These are large machines and not particularly suited for operation by individual users without system and programming knowledge. Also, searches with this system are carried out in the batch mode, rather than interactively, making the information retrieval rather awkward. A more serious drawback of FAMULUS is its magnetic tape orientation; i.e., searching is done sequentially rather than a t random. In addition, typical data bases contain 3000-5000 records, each record having about 4000 characters.* We were interested in FAMULUS-type retrieval systems, i . e . , having all the attributes of such a system, but one which could be implemented on minicomputers, since these could potentially be operated by single users without system or programming staff support. Further, data bases Journal of Chemical Documentation, Vol. 1 4 , No. 2, 1974

79