Todai Scientific Information Retrieval (TSIR-1) System. II. Generation

Generation of a Scientific Literature Data Base in a Center-Oriented Format by ... Journal of Chemical Information and Computer Sciences 1985 25 (3), ...
group(s) indicated. It is seen that by far the majority of the profiles would be covered by search in three of the five section groups. The present division in odd and even issues has the result that 16 of the 41 profiles will need to be searched in both issues; similar results would be obtained if other section group combinations were used, like [(1, 2, 3)-(4, 5)], [(1, 3, 4)-(2, 5)], or [(1)-(2, 3, 4, 5)]. There would, therefore, be no point in trying to change the present division of the section groups; on the other hand, the results show that, at least in our case, it is hardly worth paying a separate conversion of the odd and even issue tapes now, when it is possible to restrict the search to certain sections. Further, the recent developments in search techniques, which will reduce search expense considerably (5), seem to favor a joint search of the two presently separate tapes.

CONCLUSION

Our conclusion on the experiment must be that the recent introduction of the subject section numbers on the Condensates tapes can be a valuable tool to reduce computer expense provided that the users are willing to cooperate actively in the testing of the profile to select the number of sections required to obtain a satisfactory recall. The distribution of the relevant answers between the odd and even issues of Chemical Abstracts showed that more than 40% of the profiles must be searched in both issues; the consequence of this finding combined with introduction of the section numbers and recent developments in search techniques will, as far as our center is concerned, most likely be that odd and even issues will be searched in one operation.

LITERATURE CITED

(1) Barker, H. F., "UKCIS CA Condensates Evaluation," Research Report presented at the meeting of "The European Association of Scientific Information Dissemination Centres," Frankfurt am Main, November 1970.

(2) Berg Hansen, I., "An Evaluation of the Database CA Condensates Compared with Chemical Titles," J. Chem. Doc. 12, 101-10 (1972).

(3) Johansson, A., Kallner, A., and Markusson, K., "Literature Documentation Service through Chemical Abstracts Condensates - an Evaluation," Kem. Tidskr. 82, 24-6 (1970).

(4) Spiegel, M. R., "Theory and Problems of Statistics," Schaum Publishing Co., New York, 1961.

(5) van Eijk van Voorthuijsen, J. J. B., and de Heer, T., "SDI Software Development for CA Condensates Tapes in Standard Distribution Format," Report presented at the meeting of "The European Association of Scientific Information Dissemination Centres," Paris, May 1971.

Todai Scientific Information Retrieval (TSIR-1) System. II. Generation of a Scientific Literature Data Base in a Center-Oriented Format by a Tape-to-Tape Conversion of CAS SDF Data Base

TAKEO YAMAMOTO,* MAMORU USHIMARU, TOSIYASU L. KUNII, HIDETOSI TAKAHASI, and SHIZUO FUJIWARA
The University of Tokyo, Hongo, Tokyo 113, Japan

Received November 3, 1971

Conversion of the Chemical Abstracts Service scientific literature data base from Standard Distribution Format to a center-oriented file format, STF, is described. Change of the number of tracks, density, maximum blocksize, block structure, and character code was attained during two steps of tape-to-tape conversion using a FACOM 230-60 system and a HITAC 5020 system. A tape file format is described in which variable length logical records, possibly longer than the blocksize, may be stored safely.

We are building a scientific information retrieval system for the University of Tokyo, TSIR-1 (1, 2), which uses a HITAC 5020 TSS with expansions in the on-line tape IOCS (3) and the disc file control system. The information retrieval system is based on a file format, STF (2), which is closely related to the CAS SDF (4, 5, 6). Thus, a CAS SDF data base may be converted to generate a center-oriented data base in STF by a comparatively simple method. In the present paper, our method and experience in generating the STF data base on 7-track tapes from the CAS SDF 9-track tape files will be described. Such an operation, though simple in principle, was made nontrivial by the bulkiness of the data and the requirements imposed on the conversion procedure by the locally available hardware and software (Figure 1).

A new block structure for STF was created to accommodate the change in the maximum blocksize from 3520 bytes in SDF to 255 words in STF. Precautions were taken to minimize errors in the resultant files caused by the complicated process of international and domestic air mail, and the handling of the original and working tapes during the conversion.

* To whom correspondence should be addressed at Department of Chemistry, Faculty of Science, the University of Tokyo, Hongo, Tokyo 113, Japan.

Journal of Chemical Documentation, Vol. 12, No. 2, 1972


Figure 1. Flow chart of the operation (labels recoverable from the chart: UNLOAD program; HITAC 5020 (Tokyo); console message; reload, error, end; block/logical record/character count, error)

logical record and the tape block (physical record): thus a block structure for STF different from the one for SDF becomes necessary. In SDF, a block may contain several logical records whereas a logical record is always contained in one block. The second word (after the block descriptor word) of each block is the beginning of a logical record. In STF, it is possible for a logical record to occupy several blocks. As the logical records are variable in length, the block structure has to provide a convenient way of locating the beginning of a logical record after an input data error during retrieval. The rules for the block structure in STF were chosen as follows.

1. A block is headed by two words of block descriptor. The first word is the blocklength (in bytes), as in SDF. The second word is the file serial number of the first (or the only) logical record in the block, converted into the binary code from the original ASCII-8. The rest of the block contains as many logical records as is allowed by the blocksize, except when a logical record has to be divided into more than one block.
2. When a logical record is divided into more than one block

Figure 2. Listing of STF (left) and SDF (right) files

ture. Multireel SDF files with or without IBM standard labels could be used as the input data. The 7-track working tapes were converted into 7-track STF tapes by using a HITAC 5020 system at the Computer Centre, University of Tokyo. Each logical record was accessed for the character code and the block structure conversion. A cross check against possible handling and data errors in the input files was made by:

1. Comparing the header and the trailer label data, if any, with the control data given through the card reader
2. Comparing the total number of logical records accessed with the one given in the last logical record of the file
3. Checking the presence of the file sequence data in the last logical record
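The three cross checks listed above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Trailer` class, the `cross_check` function, and all field names are hypothetical, and the paper does not say how the label data, record count, or file sequence data were actually represented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trailer:
    """Hypothetical view of the last logical record of a file."""
    record_count: int    # total number of logical records stated by the file
    file_sequence: str   # file sequence data; empty string if absent


def cross_check(label: Optional[str], control: str,
                records_read: int, trailer: Trailer) -> List[str]:
    """Return a list of error messages; an empty list means the file passed.

    label        -- header/trailer label data, or None for unlabeled tapes
    control      -- control data supplied by the operator (card reader)
    records_read -- number of logical records actually accessed
    """
    errors = []
    # Check 1: label data, if any, must agree with the control data.
    if label is not None and label != control:
        errors.append("label data differ from control data")
    # Check 2: records accessed must match the count in the last record.
    if records_read != trailer.record_count:
        errors.append(
            f"read {records_read} records, last record states "
            f"{trailer.record_count}"
        )
    # Check 3: the file sequence data must be present in the last record.
    if not trailer.file_sequence:
        errors.append("file sequence data missing from last logical record")
    return errors
```

Returning every failed check, rather than stopping at the first, matches the spirit of the detailed error listing the authors describe, although the exact reporting format is an assumption here.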

RESULTS

Several reels of Chemical Abstracts Condensates tapes in SDF have been converted into STF tapes by the above procedure. Part of the listing of an STF tape is shown in Figure 2, together with the corresponding listing of the SDF tape (7). On the average, it took us about 8 minutes of operation time, including 50 seconds of CPU time, per reel of data (one file) for the first step and about 9.5 minutes of operation time, including about 490 seconds of CPU time, per reel of data, for the second step. Both are well within practical limits. In both steps, the following measures were found to be important to avoid mishandling of the tapes, in addition to the usual label check technique:

1. Using only one tape device each for the input and the output files to simplify the operation
2. Providing the operator with detailed written instructions stressing the expected response to an erroneous data file
3. Monitoring the operation by having a fairly detailed line printer output of the accounting and the error records

Most errors were caused by faulty input tapes with creases presumably formed in shipment. The ordinary plastic containers currently used for packaging seem to be inadequate for protecting the tapes from the trying conditions of international air mail, customs clearance, and domestic mail.

DISCUSSION

There are two reasons for converting the file format of a large data base, such as CAS SDF tapes, into a local, center-oriented format such as STF, rather than adapting the system to the format of the data base as much as possible.

1. A local information center such as ours has a user population whose interest is not covered by any single external data source. Data records from several external sources may be combined most effectively and used after they are converted into records with a unified, center-oriented format. They may be stored in a unified file, and may be searched and retrieved using similar retrieval programs and queries for all of the sources.
2. By developing a center-oriented file format, it is possible to combine records originating in the large-scale external data sources with locally generated data records. These data records, which may be highly important for the members of the local scientific community, may be distributed through the system and used commonly among the local users.

In generating a sequential file consisting of variable length logical records, it may sometimes be better, or necessary, to choose a maximum blocklength which is smaller than the maximum record length. The present block structure for STF should generally be applicable in the above situation; it will certainly be convenient when most logical records are expected to be much shorter than the maximum blocklength.
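A minimal sketch of such a block structure is given here, following the first STF rule described earlier: two descriptor words (the blocklength in bytes, then the file serial number of the first logical record in the block), followed by as many logical records as fit. Several details are assumptions made for illustration and are not taken from the paper: a word is taken to be 4 bytes, words are packed big-endian, records are padded to whole words, and a record longer than one block is continued in following blocks that repeat its serial number. The names `pack_blocks` and `block_header` are hypothetical.

```python
import struct

WORD_BYTES = 4          # assumption: 4-byte words (the paper states sizes in words)
MAX_BLOCK_WORDS = 255   # maximum STF blocksize given in the paper
HEADER_WORDS = 2        # blocklength word + serial-number word


def pack_blocks(records):
    """Pack (serial_ascii, payload) logical records into STF-like blocks.

    serial_ascii is the serial number as ASCII digits (bytes); payload is
    the record body (bytes). Returns a list of blocks as bytes objects.
    """
    capacity = (MAX_BLOCK_WORDS - HEADER_WORDS) * WORD_BYTES
    blocks = []
    first_serial, data = None, bytearray()

    def flush():
        nonlocal first_serial, data
        if data:
            length = HEADER_WORDS * WORD_BYTES + len(data)
            header = struct.pack(">II", length, first_serial)
            blocks.append(header + bytes(data))
        first_serial, data = None, bytearray()

    for serial_ascii, payload in records:
        serial = int(serial_ascii.decode("ascii"))  # ASCII digits -> binary
        payload += b"\x00" * (-len(payload) % WORD_BYTES)  # pad to whole words
        if len(payload) > capacity - len(data):
            flush()                       # record does not fit in what is left
        while len(payload) > capacity:    # assumed rule for over-long records
            first_serial, data = serial, bytearray(payload[:capacity])
            flush()
            payload = payload[capacity:]
        if first_serial is None:
            first_serial = serial
        data += payload
    flush()
    return blocks


def block_header(block):
    """Return (blocklength_in_bytes, serial_of_first_record) for a block."""
    return struct.unpack(">II", block[:HEADER_WORDS * WORD_BYTES])
```

Because every block carries the serial number of its first logical record, a reader that hits a bad block can skip to the next readable block and learn from `block_header` where it is in the file, which is the recovery property the block structure was designed to provide.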

ACKNOWLEDGMENT

The first step of the operation was made possible with the intellectual, technical, and administrative help of Satoru Hoshino, Ichiro Nakagawa, Yasuyuki Sakamoto, and several other members of the Data Processing Center, Kyoto University. The SDF tapes were purchased with funds given by the Ministry of Education of Japan, Tokutei Kenkyu I, Showa 45, No. 99042 and Showa 46, No. 99032.

LITERATURE CITED

(1) Yamamoto, T., and S. Fujiwara, "Syntactical Proximity - Partial Syntactical Analysis of Natural Language Data Records," J. Chem. Doc. 11, 256-7 (1971).

(2) Yamamoto, T., T. Kumai, K. Nakano, C. Ikeda, T. L. Kunii, H. Takahasi, and S. Fujiwara, "Todai Scientific Information Retrieval (TSIR-1) System. I. Generation, Updating and Listing of a Scientific Literature Data Base by Conversational Input," J. Chem. Doc. 11, 228-31 (1971).

(3) Yamamoto, T., K. Nakano, C. Ikeda, T. L. Kunii, H. Takahasi, and S. Fujiwara, "On-Line Tape IOCS for an Information Retrieval System," unpublished work.

(4) Anzelmo, F. D., "A Data Storage Format for Information System Files," IEEE Trans. Computers C-20(1), 39 (1971).

(5) Chemical Abstracts Service, "Standard Distribution Format Technical Specifications," Columbus, Ohio, 1970.

(6) Chemical Abstracts Service, "Data Content Specifications for CA Condensates in Standard Distribution Format," Columbus, Ohio, 1970.

(7) Chemical Abstracts Service, "Samples of Chemical Abstracts Condensates Data Records in Standard Distribution Format," Columbus, Ohio, 1971.

Du Pont Information Flow System*

WARREN S. HOFFMAN
Information Systems Division, Secretary's Department, E. I. du Pont de Nemours & Co., Inc., Wilmington, Del. 19898

Received February 23, 1972

The Information Flow System is a large-scale information retrieval system developed for processing of Du Pont information files. As currently implemented, the system stores and retrieves information on company technical reports. Important features of the system include the use of threaded lists in addition to inverted files to permit optimum searching. Users prepare searches in a free format query language, which is then optimized by the system to make most efficient use of the file structure. Answers are in the form of accession numbers or abstracts. Extensions of the system for handling chemical structure information and on-line processing are also discussed.

In 1964, Du Pont established a centralized group for indexing and searching company technical reports. Mainly to service this operation, an information retrieval system was developed using the IBM 1410 computer. The system was originally designed as an interim system, which would have an anticipated lifetime of less than five years and would be replaced by a more modern system when file size and economics dictated. The application of this system in Du Pont's Central Report Index has been discussed (1). In recent years, the volume of input and searches using this system has steadily increased, and the file size has grown substantially.

In late 1966, a study was undertaken to determine future machine processing requirements of the report group. The objective of the study was to design an evolving system to handle increasing work loads and new services more efficiently and in a shorter cycle time. The new system was to make use of modern hardware and software concepts to provide a higher level of service than available in the past. The Information Flow System (IFS), which was started up during 1971, is essentially the product envisaged during the initial study. Between 1966 and 1971, a detailed file organization scheme was developed. The design and proposed system facilities were reviewed with the intended users. A basic data report specifying functional attributes of the proposed system was prepared, followed by detailed design and implementation. Conversion programs to generate the master files were also written.

* Presented before Division of Chemical Documentation, Mid-Atlantic Regional Meeting, ACS, Philadelphia, Pa., Feb. 15, 1972.