INGE BERG HANSEN
ACKNOWLEDGMENT
The author is indebted to the Danish Council for Scientific and Industrial Research for the financial support of the work. We further wish to express our thanks to the staff of the Documentation Dept., DTB, especially Birgit Pedersen, for assistance during the experiment and to the users participating in the experiment for their evaluation of the output. We are extremely grateful to the staff of I/S DATACENTRALEN for their active interest in the development of improved techniques in the utilization of the CAS tapes.
LITERATURE CITED
(1) Amick, D., “Multivariate Statistical Analysis of the Use of a Scientific Computer-Based Current Awareness Information Retrieval System,” J. Amer. Soc. Inform. Sci. 21, 171-8 (1970).
(2) Arnett, E. M., “Computer-Based Chemical Information Services,” Science 170, 1370-6 (1970).
(3) Baker, D. B., Tate, F. A., and Rowlett, R. J., Jr., “Changing Patterns in the International Communication of Chemical Research and Technology,” J. Chem. Doc. 11, 90-8 (1971).
(4) Barker, F. H., Kent, A. K., and Veal, D. C., “Report on the Evaluation of an Experimental Computer-Based Current Awareness Service for Chemists,” United Kingdom Chemical Information Service, University of Nottingham, 1970.
(5) Berg Hansen, I., “Chemical Literature on Tape,” Kem. Teollisuus 26, 997-1005 (1969).
(6) Berg Hansen, I., “Computer-Based Chemical Information Systems at the Danish Technological University Library,” IATUL Proc. 5, 14-20 (1970).
(7) Berg Hansen, I., “A Comparative Study of Some Information Retrieval Systems in Chemistry and Biomedicine,” Ind. Chim. Belge 37, 25-30 (1972).
(8) Boman, M., “Computer-Based Chemical Information Retrieval,” Sv. Kem. Tidskr. 80, 379-83 (1968).
(9) Johansson, A., Kallner, A., and Markusson, K., “Literature Documentation Service through Chemical Abstracts Condensates-an Evaluation,” Kem. Tidskr. 82, 24-6 (1970).
(10) Skov, H. J., “An Electronic SDI Service for the Danish Chemical Industry and Research,” Libri 18, 204-15 (1968).
(11) Spiegel, M. R., “Theory and Problems of Statistics,” Schaum Publishing Co., New York, 1961.
(12) Swets, J. A., “Effectiveness of Information Retrieval Methods,” Amer. Doc. 20, 72-89 (1969).
(13) Veal, D. C., “United Kingdom Experiences in the Operation of a Retrieval and Dissemination Service Based on CAS Search Tapes,” Neue Tech. A. 11, 281-95 (1969).
Subject Compatibility between Chemical Abstracts Subject Sections and Search Profiles Used for Computerized Information Retrieval
INGE BERG HANSEN
The Documentation Department, The National Technological Library of Denmark (DTB), Lyngby, Denmark
Received October 20, 1971
The need for introducing subject section numbers on the CA Condensates tapes was studied by analysis of the distribution of relevant answers to 41 search profiles among the 80 subject sections of Chemical Abstracts. The average profile requires 10 CA-subject sections for adequate coverage. The average printing expense per profile could be reduced 25% by searching the individual profiles in the appropriate subject sections.
In connection with our study of the CAS database CA Condensates (2), a detailed investigation was made of how well the 80 subject sections used in Chemical Abstracts match the subject interests expressed in each individual search profile. The background for this investigation was that the first tape version of CA Condensates did not identify the subject section to which each reference belonged. Redundant answers from subject sections far from the user's field of interest caused considerable irritation among the users of CA Condensates, and the documentation centers repeatedly asked CAS to introduce the section numbers on the Condensates tapes. To see how useful the section numbers actually would be, DTB decided to examine how the relevant answers to the individual profiles were distributed among the 80 subject sections of CA. As this investigation was completed by the time DTB was ready to use the new CAS Standard Distribution Format, which includes the section numbers, it turned out to be a valuable means of making the best possible use of this new facility.
Journal of Chemical Documentation, Vol. 12, No. 2, 1972
METHODS
For every profile, a base corresponding to approximately 100 relevant answers (based upon output relevance) was selected, and the percentage distribution among the 80 subject sections was calculated for:
    Total answers
    Relevant answers (output relevance)
    Answers marked relevant after reading of the article or, if insufficient data regarding final relevance were available, answers considered as highly relevant judged from the output
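The calculation described above can be illustrated with a minimal sketch. It assumes that each retrieved answer is represented simply by its CA section number; the function names and the sample profile are hypothetical and are not taken from the study. The second function counts how many sections, taken from the richest downward, are needed to reach a given recall level, which is one plausible reading of how the per-profile curves were obtained.

```python
from collections import Counter


def percentage_distribution(section_numbers):
    """Return {section: percent of answers} for a list of CA section numbers."""
    counts = Counter(section_numbers)
    total = sum(counts.values())
    return {section: 100.0 * n / total for section, n in counts.items()}


def sections_for_recall(section_numbers, recall_percent):
    """Smallest number of sections that together hold at least
    recall_percent of the answers (richest sections taken first)."""
    counts = sorted(Counter(section_numbers).values(), reverse=True)
    total = sum(counts)
    covered = 0
    for used, n in enumerate(counts, start=1):
        covered += n
        # Integer comparison avoids floating-point rounding at the threshold.
        if covered * 100 >= recall_percent * total:
            return used
    return len(counts)


# Hypothetical profile: 100 relevant answers spread over six sections.
relevant = [2] * 50 + [3] * 25 + [8] * 15 + [17] * 5 + [54] * 3 + [79] * 2

for level in (100, 95, 90):
    print(level, sections_for_recall(relevant, level))
```

Averaging the result of `sections_for_recall` over all profiles would give figures of the kind reported per recall level; whether the original calculation ranked sections in exactly this way is an assumption.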
Table I. Average Distribution of Retrieved Answers among the 80 CA Sections (41 profiles)

Sections 1-40
Section group: 1, 2, 3
Total answers, %, mean: 0.1 8.4 3.6 1.3 0.7 1.3 4.1 4.7 2.5 1.5 4.1 1.9 1.6 0.6 2.3 1.2 3.8 0.5 2.0 1.1 0.6 0.4 0.1 0.4 0.1 0.3 0.3 0.3 0.2 1.2 0.2 0.2 0.8 1.4 1.0 0.3 0.6 0.1
Total answers, %, range: 0-0.8 43.3 65.4 14.8 5.3 12.3 30.9 33.6 47.5 15.9 29.6 12.5 34.8 5.8 12.7 17.1 39.5 3.8 52.9 13.4 0-0.3 11.7 6.8 1.8 7.4 1.0 3.7 1.7 5.2 4.6 56.1 3.4 3.8 0-9.2 30.6 22.3 3.6 10.9 1.7
Relevant answers (output rel.), %, mean: 0.2 6.1 4.3 1.2 0.3 1.9 4.1 5.2 2.1 1.9 3.9 2.0 1.4 0.5 1.6 1.6 5.6 0.4 2.7 1.2 0.9 0.5 0.3 0.2 0.1 0.1 0.1 0.4 1.4 0.6 0.3 0.3 1.1 0.8 0.1 0.1 -
Relevant answers (output rel.), %, range: 0-1.1 28.9 35.2 14.4 1.9 5.9 41.9 41.2 46.4 28.0 40.1 18.9 37.3 4.6 12.3 29.0 44.3 4.3 57.1 28.2 0-2.9 26.7 13.3 11.1 8.9 2.9 4.9 3.4 14.3 2.4 0.7 50.0 15.8 1.3 0-8.4 39.8 28.9 2.2 3.3 0.4

Sections 41-80
Section group: 4, 5
Total answers, %, mean: 0.1 0.9 2.0 0.4 0.5 0.3 0.2 0.4 1.8 0.3 1.4 0.2 1.1 2.7 1.1 2.0 2.3 0.3 0.6 0.5 0.4 0.2 0.7 0.2 1.8 0.7 0.9 1.9 0.9 2.5 2.9 0.5 3.7 1.5 0.8 1.2 3.0 1.3 3.8 0.4
Total answers, %, range: 1.7 6.6 48.3 4.5 11.4 2.6 0-1.4 6.0 23.0 2.8 20.1 3.0 13.4 78.5 9.4 35.9 17.5 2.5 8.2 6.8 1.6 4.1 5.6 3.7 0-15.9 6.5 13.0 15.3 8.8 34.0 13.4 4.7 58.1 1.2 7.9 13.2 69.3 25.4 53.6 4.7
Relevant answers (output rel.), %, mean: 0.6 1.5 0.9 1.0 0.2 - 0.2 1.5 0.2 0.2 0.4 5.4 0.5 2.0 1.0 0.1 0.3 0.6 0.2 0.5 0.2 2.0 0.6 1.1 2.2 0.8 3.3 1.8 0.2 2.7 2.0 1.5 2.6 2.0 3.7 0.3
Relevant answers (output rel.), %, range: 1.0 10.8 42.4 12.9 34.4 4.3 0-1.6 3.4 48.5 2.2 3.9 0.5 2.4 91.2 13.3 44.2 20.4 4.9 4.5 6.6 3.9 1.6 7.5 2.4 0-26.0 16.8 31.5 32.7 18.1 37.8 20.2 2.9 61.1 2.9 43.9 67.0 34.3 45.1 2.9

Section group 1 = Biochemistry sections. 2 = Organic Chemistry sections. 3 = Macromolecular Chemistry sections. 4 = Applied Chemistry and Chemical Engineering sections. 5 = Physical and Analytical Chemistry sections.

For all of the profiles, it was illustrated graphically (example shown in Figure 1) how many subject sections each individual profile required to obtain a certain percentage of the relevant answers (= a certain recall level). The average number of subject sections required to obtain 100, 95, and 90% of the relevant references retrieved by the profile in Condensates was calculated and is presented in Table III. In addition, we determined how many of the five section groups used in Condensates would be necessary to obtain a similar recall level (Table IV). Finally, as the economic implications of the results were of particular interest to us, we analyzed the percentage of answers we could have avoided if only the groups corresponding to a 95% recall level had been searched. As our computer expenses are based on the formula

    Accounting units = 120.0 + 0.03T + 0.00013ST + 0.1L

    T = No. of text units on the tape
    S = No. of search terms
    L = No. of printed lines

our expenses have a close relationship to the number of hits, and an important consequence of the introduction of the section numbers could therefore be that a considerable proportion of our computer expenses might be saved.
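As a small worked illustration of the accounting formula quoted in METHODS, the sketch below evaluates it for one run and for the same run with fewer printed lines. The function name and all input figures are hypothetical; only the four coefficients come from the text.

```python
def accounting_units(text_units, search_terms, printed_lines):
    """Accounting units = 120.0 + 0.03T + 0.00013ST + 0.1L."""
    fixed = 120.0                                      # constant term
    per_text_unit = 0.03 * text_units                  # term proportional to T
    per_comparison = 0.00013 * search_terms * text_units  # term proportional to S*T
    per_line = 0.1 * printed_lines                     # term proportional to L
    return fixed + per_text_unit + per_comparison + per_line


# Hypothetical run: 10,000 text units, 50 search terms, 1,000 printed lines.
T, S, L = 10_000, 50, 1_000
before = accounting_units(T, S, L)

# Same run if a quarter of the printed lines could be avoided.
after = accounting_units(T, S, L - 250)

print(f"before: {before:.1f}  after: {after:.1f}  saved: {before - after:.1f}")
```

Only the last term depends on the output volume, so in this sketch restricting the printout changes the total by 0.1 accounting unit per line avoided, while the terms in T and ST are unaffected unless the tape itself is searched selectively.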
RESULTS AND DISCUSSION
Table I shows the average distribution of total and relevant answers for all of the profiles among the 80 subject sections of CA. The average distributions for total and relevant answers must be considered identical, and the majority of the sections each contribute 1 to 2% of the total number of hits. Table II, which shows the distribution among the five section groups, illustrates clearly that certain parts of CA Condensates are utilized more extensively than others, at least in the Danish environment. From the composition of our experimental group (2), we expected that the biochemistry sections would be heavily used, but in spite of this, we were surprised to find that section group 1 was responsible for almost half of the answers. We also knew that section group 2 would have a low representation, but that it should account for fewer than 5% of the answers came as a surprise. Whether the lack of interest in the organic chemistry sections should be ascribed to a poor performance in this field or to a lack of need for a computerized service among the Danish organic chemists, this result suggests that dropping this part of Condensates entirely in our local environment should probably be considered. Other investigations have shown a heavier utilization of the organic chemistry sections than the present one (1, 3), but, first, those investigations used selected populations of profiles offered free of charge or at a nominal charge, whereas our users constitute a population of commercial subscribers, and second, the same investigations have pointed out the inadequacy of Condensates for computer search in this field.

Table II. Distribution of Retrieved Answers among the Five CA Section Groups
Groups                   1      2      3      4      5
Total answers, %        47.3    4.3    8.4   16.4   27.8
Relevant answers, %     48.2    4.9    6.6   13.1   26.8

Figure 1. An example of the correlation between recall level and the required No. of subject sections (horizontal axis: No. of sections).
    ___  Recall based on output relevance
    -.-  Recall based on final relevance

Figure 1 shows an example of the correlation between the number of sections searched and the Condensates recall obtained; the average figures for the total population for the number of sections required to retrieve 100, 95, and 90% of the relevant references are given in Table III. At the 100% recall level, there is a difference of four sections between the results for output relevant and finally relevant answers. According to the present results, this difference is probably significant by a t test (4). The values for the 95% recall level for the three different relevance estimates used here are identical, and these values are further identical to the 100% recall values based upon final relevance or major output relevance. We, therefore, conclude that it should be possible after a test period to select a number of subject sections which should give the user a satisfactory coverage of his interest; the selected groups should in our opinion be those which would cover 95% of the references considered relevant on the basis of the output. It would in our opinion be futile to hope for a 100% recall level by search in selected subject sections, as the last 5% of the references in most cases are found in sections where the user never would have expected to find relevant material.

Table III. Correlation between Recall Level and Required Number of CA Sections
                          No. of Required Sections
Condensates     (a) Based on          (b) Based on          (c) Based on Major
Recall, %       Output Relevance      Final Relevance (R)   Output Relevance
                Mean     S.D.         Mean     S.D.         Mean     S.D.
100             14.2     6.1          10.3     2.4          10.3     4.2
 95              9.7     4.4           8.2     2.3           7.9     3.4
 90              6.9     3.2           6.2     1.8           6.0     3.0

The economic implications of these findings are that it should be possible to reduce the printing expenses if it were decided either to search only the groups which were required to obtain a 95% recall level or, alternatively, only to print the answers representing those groups. We, therefore, calculated the percentage of the printed answers we could have saved per profile, and we found that, compared to the present procedure, where we only have the possibility to restrict the search to either odd or even issues of CA, we could save an average of 26.1 ± 13.7% of the printing expense per profile.

We continued to examine the distribution of the relevant answers among the five section groups, as we found that other means of reducing the computer expense might be either to split the present tapes further or possibly to combine the section groups in a way different from the one used at present. Table IV shows the number of profiles which would be adequately covered (95% recall) in the section group(s) indicated. It is seen that by far the majority of the profiles would be covered by search in three of the five section groups. The present division in odd and even issues has the result that 16 of the 41 profiles will need to be searched in both issues; similar results would be obtained if other section group combinations were used, like [(1, 2, 3)-(4, 5)], [(1, 3, 4)-(2, 5)], or [(1)-(2, 3, 4, 5)]. There would, therefore, be no point in trying to change the present division of the section groups; on the other hand, the results show that, at least in our case, it is hardly worth paying for a separate conversion of the odd and even issue tapes now, when it is possible to restrict the search to certain sections. Further, the recent developments in search techniques, which will reduce search expense considerably (5), seem to favor a joint search of the two presently separate tapes.

Table IV. No. of Profiles Receiving 95% Coverage in Different Section Groups of CA
Section Group(s)      No. of Profiles
1                     11
4                      1
5                      3
                          = 15
1+3                    4
1+4                    2
2+5                    1
3+4                    1
3+5                    1
4+5                    7
                          = 16
1+2+3                  1
1+2+4                  1
1+2+5                  1
1+3+4                  1
1+4+5                  3
3+4+5                  1
                          = 8
1+2+3+4                2
                          = 2
Total                 41

CONCLUSION
Our conclusion on the experiment must be that the recent introduction of the subject section numbers on the Condensates tapes can be a valuable tool to reduce computer expense, provided that the users are willing to cooperate actively in the testing of the profile to select the number of sections required to obtain a satisfactory recall. The distribution of the relevant answers between the odd and even issues of Chemical Abstracts showed that more than 40% of the profiles must be searched in both issues; the consequence of this finding, combined with the introduction of the section numbers and recent developments in search techniques, will, as far as our center is concerned, most likely be that odd and even issues will be searched in one operation.

LITERATURE CITED
(1) Barker, H. F., “UKCIS CA Condensates Evaluation,” Research Report presented at the meeting of “The European Association of Scientific Information Dissemination Centres,” Frankfurt am Main, November 1970.
(2) Berg Hansen, I., “An Evaluation of the Database CA Condensates Compared with Chemical Titles,” J. Chem. Doc. 12, 101-10 (1972).
(3) Johansson, A., Kallner, A., and Markusson, K., “Literature Documentation Service through Chemical Abstracts Condensates-an Evaluation,” Kem. Tidskr. 82, 24-6 (1970).
(4) Spiegel, M. R., “Theory and Problems of Statistics,” Schaum Publishing Co., New York, 1961.
(5) van Eijk van Voorthuijsen, J. J. B., and de Heer, T., “SDI Software Development for CA Condensates Tapes in Standard Distribution Format,” Report presented at the meeting of “The European Association of Scientific Information Dissemination Centres,” Paris, May 1971.
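The argument about dividing the section groups between two tapes, made in RESULTS AND DISCUSSION on the basis of Table IV, can be checked with a short calculation. The sketch below is illustrative only: the profile counts are as read from Table IV, the function name is hypothetical, and the assignment of section groups 1 and 2 to the odd issues and groups 3 to 5 to the even issues is an assumption that the paper does not spell out.

```python
# Table IV: section groups needed for 95% recall -> number of profiles.
TABLE_IV = {
    (1,): 11, (4,): 1, (5,): 3,
    (1, 3): 4, (1, 4): 2, (2, 5): 1, (3, 4): 1, (3, 5): 1, (4, 5): 7,
    (1, 2, 3): 1, (1, 2, 4): 1, (1, 2, 5): 1,
    (1, 3, 4): 1, (1, 4, 5): 3, (3, 4, 5): 1,
    (1, 2, 3, 4): 2,
}


def profiles_needing_both(tape_a, tape_b, table=TABLE_IV):
    """Count profiles whose required section groups are split across
    two tapes, so that both tapes would have to be searched."""
    a, b = set(tape_a), set(tape_b)
    both = 0
    for groups, n_profiles in table.items():
        needed = set(groups)
        if not (needed <= a or needed <= b):
            both += n_profiles
    return both


# Assumed present division (odd issues: groups 1-2, even issues: groups 3-5),
# followed by the three alternative divisions named in the text.
divisions = [
    ((1, 2), (3, 4, 5)),
    ((1, 2, 3), (4, 5)),
    ((1, 3, 4), (2, 5)),
    ((1,), (2, 3, 4, 5)),
]
for tape_a, tape_b in divisions:
    print(tape_a, tape_b, profiles_needing_both(tape_a, tape_b))
```

Under the assumed odd/even assignment the count for the present division comes out at the 16 of 41 profiles quoted in the text, and the three alternatives give counts between 14 and 17, in line with the statement that other combinations would give similar results.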
Todai Scientific Information Retrieval (TSIR-1) System. II. Generation of a Scientific Literature Data Base in a Center-Oriented Format by a Tape-to-Tape Conversion of CAS SDF Data Base
TAKEO YAMAMOTO,* MAMORU USHIMARU, TOSIYASU L. KUNII, HIDETOSI TAKAHASI, and SHIZUO FUJIWARA
The University of Tokyo, Hongo, Tokyo 113, Japan
Received November 3, 1971
Conversion of the Chemical Abstracts Service scientific literature data base from Standard Distribution Format to a center-oriented file format, STF, is described. Change of the number of tracks, density, maximum blocksize, block structure, and character code was attained during two steps of tape-to-tape conversion using a FACOM 230-60 system and a HITAC 5020 system. A tape file format is described in which variable length logical records, possibly longer than the blocksize, may be stored safely.
We are building a scientific information retrieval system for the University of Tokyo, TSIR-1 (1, 2), which uses a HITAC 5020 TSS with expansions in the on-line tape IOCS (3) and the disc file control system. The information retrieval system is based on a file format, STF (2), which is closely related to the CAS SDF (4-6). Thus, a CAS SDF data base may be converted to generate a center-oriented data base in STF by a comparatively simple method. In the present paper, our method and experience in generating the STF data base on 7-track tapes from the CAS SDF 9-track tape files will be described. Such an operation, though simple in principle, was made nontrivial by the bulkiness of the data and the requirements imposed on the conversion procedure by the locally available hardware and software (Figure 1). A new block structure for STF was created to accommodate the change in the maximum blocksize from 3520 bytes in SDF to 255 words in STF. Precautions were taken to minimize errors in the resultant files caused by the complicated process of international and domestic air mail, and the handling of the original and working tapes during the conversion.

*To whom correspondence should be addressed at Department of Chemistry, Faculty of Science, the University of Tokyo, Hongo, Tokyo 113, Japan.
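The idea of storing a logical record that is longer than the maximum blocksize can be illustrated with a generic sketch. It is not the STF block structure, whose layout is not given in this part of the paper: the two-word header (record number, segments remaining) and both function names are hypothetical. Only the 255-word limit is taken from the text.

```python
MAX_BLOCK_WORDS = 255   # maximum STF blocksize quoted in the text
HEADER_WORDS = 2        # hypothetical header: record number, segments remaining


def split_record(record_no, words):
    """Cut one logical record into blocks of at most MAX_BLOCK_WORDS words."""
    payload = MAX_BLOCK_WORDS - HEADER_WORDS
    chunks = [words[i:i + payload] for i in range(0, len(words), payload)] or [[]]
    blocks = []
    for index, chunk in enumerate(chunks):
        remaining = len(chunks) - index - 1
        blocks.append([record_no, remaining] + list(chunk))
    return blocks


def join_blocks(blocks):
    """Rebuild logical records from a block stream; raise ValueError if a
    block is missing or misplaced, or if the stream ends inside a record."""
    records, current, current_no, expect = [], [], None, None
    for block in blocks:
        record_no, remaining, data = block[0], block[1], block[2:]
        if current_no is None:
            current_no = record_no          # first segment of a new record
        elif record_no != current_no or remaining != expect:
            raise ValueError("missing or misplaced block")
        current.extend(data)
        if remaining == 0:                  # last segment: record is complete
            records.append((current_no, current))
            current, current_no, expect = [], None, None
        else:
            expect = remaining - 1
    if current_no is not None:
        raise ValueError("record truncated at end of tape")
    return records


# Hypothetical 600-word record, longer than one block.
record = list(range(600))
blocks = split_record(7, record)
print(len(blocks), [len(b) for b in blocks])
```

Carrying a countdown in every block is one simple way to let a reader detect a lost or duplicated block before a damaged record is accepted; the actual safeguards used in STF may well differ.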