
and text modification terms provide considerably more retrieval capability, based on single term occurrence, than can be found in titles alone, keywords alone, or KI phrases alone. It also provides retrieval capability beyond that which is available using both titles and keywords. Beyond the question of objective retrieval capability, since the CASIA file contains phrases, it is possible to use this context information to screen or narrow down preliminary search results. If this file were to be made available on-line, it would greatly enhance the access to Chemical Abstracts that is now available through CA Condensates with its titles and keywords.

Comparison of the Retrieval Effectiveness of CA Condensates (CACon) and CA Subject Index Alert (CASIA)

D. L. DAYTON,* M. J. FLETCHER, C. W. MOULTON, J. J. POLLOCK, and A. ZAMORA

Chemical Abstracts Service, P. O. Box 3012, Columbus, Ohio 43210

Received November 23, 1976

CA Condensates (CACon) is the computer-readable file corresponding to Chemical Abstracts (CA) containing bibliographic information and keywords. CA Subject Index Alert (CASIA) is the computer-readable file containing index entries for the Chemical Substance, General Subject, and Formula Indexes of the CA Volume Indexes. This paper studies the vocabulary characteristics of the two files and the relative retrieval effectiveness of CACon and CASIA for a range of general subject and chemical substance topics. A vocabulary study quantifies the unique substantive word content of each of the files, the distribution of this vocabulary over citations, and the overlap of the file vocabularies. The results demonstrate that chemical substance nomenclature is a distinctive content feature of CASIA. A search study shows that CASIA gives significantly better recall than CACon for chemical substance topics and that both files give comparable recall for general subject topics when appropriate methodology is used.

INTRODUCTION

Chemical Abstracts Service (CAS) started distributing computer-readable chemical information files in 1965 with the introduction of Chemical Titles on magnetic tape. Since then, CAS has introduced a variety of computer-readable chemical data bases which are aimed at meeting various information needs of the chemical community. Many computer-readable files distributed by CAS are now available on-line, making it possible for organizations without computer facilities to use these files. The increased number of searchers using the CAS data base without direct contact with CAS has created a need for disseminating information detailing the characteristics of these files.

The purpose of this paper is to report on our research into the retrieval effectiveness of the computer-readable files CA Condensates (CACon) and Chemical Abstracts Subject Index Alert (CASIA) based on demonstration searches, and then to correlate this with file vocabulary characteristics.

The CAS Data Base1 consists of a large body of bibliographic information, abstract text, and index data which is supported by several authority files, such as the Registry Files. Computer-readable files like CASIA and CACon provide access to selected portions of the CAS Data Base.

CACon, introduced by CAS in 1969 and covering the period from July 1968 to the present, corresponds to the documents identified in the printed issues of Chemical Abstracts (CA). It includes names and affiliations of authors, patentees, and patent assignees; titles of papers, books, patents, and conference proceedings; and source document bibliographic citations, including CA section and subsection numbers and CA reference numbers. The keyword phrases used to create the Chemical Abstracts Keyword Index for each issue comprise a major search component of the file.
These keyword phrases do not contain the CAS systematic nomenclature used in the CA Volume Subject Indexes. Substances are indexed using author terminology or compound class names.

† Presented, in part, before the Division of Chemical Information, 172nd National Meeting of the American Chemical Society, San Francisco, Calif., Aug 31, 1976.
* Author to whom correspondence should be addressed.

Soon after CACon was introduced, CAS experimented with a computer-readable version of the Chemical Abstracts Volume Index called the Chemical Abstracts Integrated Subject File. This file was divided into Chemical Substance Index entries and General Subject Index entries and was in index entry order. This file was the subject of a study by Zipperer et al.2,3 With the emphasis on current awareness and document-oriented retrieval, CAS introduced CASIA in order to supply these volume index entries on a more timely basis. CASIA files are issued every two weeks and contain the index entries for the Chemical Substance and General Subject Indexes of the Chemical Abstracts Volume Indexes sorted by CA publication citations. Retrospective searching is also possible in this new packaging format because document-ordered CASIA files have been made available to cover the period from January 1967 to the present.

The General Subject Index entries basically consist of concept headings selected from a controlled vocabulary, with uncontrolled-vocabulary text modifications which provide the context of the subject of the index heading in the original document. Each Chemical Substance Index entry contains a substance name and, usually, a text modification which indicates the context in which the substance was mentioned in the original document. In addition, Chemical Substance Index entries in CASIA contain molecular formulas and CAS Registry Numbers.

At the outset of our research we had two main goals: to investigate data-base vocabulary characteristics to assist users of computer-readable files in understanding the substantive content and retrieval capability differences between CASIA and CACon, and to search CASIA and CACon to demonstrate the relative retrieval performances which can be expected for different question types using each file.
O’Donohue4 and Prewitt5 have stressed the importance of using search aids for effective profiling, and we also feel that such methodology is important to enable other users of CAS files to duplicate our results. In addition to quantifying the characteristics of the files for their more effective use, this vocabulary study of CASIA and CACon provided the opportunity to confirm that the content of the CAS data base reflected appropriate application of current editorial policies.

Journal of Chemical Information and Computer Sciences, Vol. 17, No. 1, 1977

DATA-BASE CHARACTERISTICS

The substantive data contained in CASIA and CACon were compared by selecting from each file those data elements that would normally be searched, and then, for each document, comparing the unique contributions from CASIA and from CACon. It is necessary for our discussion, therefore, to describe the contents of both files in terms of their searchable data elements.

The document title and the keyword phrases are the basic data elements used for searching CACon. Another valuable access point is the CA publication section number, which enables a searcher to select documents in any of 80 areas of particular interest, such as pharmacodynamics or nuclear technology.

CASIA, as indicated earlier, contains Chemical Substance and General Subject Index records. Both types of records contain the CA publication section number and text modifications, but in addition Chemical Substance Index records contain CAS Registry Numbers and molecular formulas. Substance names are composed of up to six data elements which together describe a substance. The index heading parent consists of the name of the basic skeleton of a substance and may include a suffix that indicates the principal functional group. Substituents indicate additions to the basic skeleton, whereas name modifications cite derivatives of certain functional groups. Finally, stereochemical descriptors specify the three-dimensional characteristics of substances. For certain types of substances, two other data elements (homograph definition and line formula) are used to remove ambiguity. For example, the homograph definitions “sheep” and “ox” are used to distinguish different kinds of insulin.
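The substance-name data elements just described can be pictured as the fields of a record. The Python sketch below is only an illustration of that structure: the class name, the field names, and the placeholder section number are assumptions made here and do not reproduce the actual CASIA tape format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# The six substance-name data elements named in the text (field names are hypothetical).
NAME_ELEMENTS = (
    "parent", "substituents", "name_modification",
    "stereo_descriptor", "homograph_definition", "line_formula",
)


@dataclass
class SubstanceIndexRecord:
    """Illustrative model of one Chemical Substance Index record (not the real layout)."""
    section: int                                  # CA publication section number
    parent: str                                   # index heading parent (PAR)
    substituents: Optional[str] = None            # SUB
    name_modification: Optional[str] = None       # MOD
    stereo_descriptor: Optional[str] = None
    homograph_definition: Optional[str] = None
    line_formula: Optional[str] = None
    text_modification: Optional[str] = None       # TMD
    registry_number: Optional[str] = None
    molecular_formula: Optional[str] = None

    def name_elements(self) -> List[Tuple[str, str]]:
        """Return the populated substance-name data elements in a fixed order."""
        return [(n, getattr(self, n)) for n in NAME_ELEMENTS if getattr(self, n)]


# Two records that differ only in the homograph definition.
# The section number is a placeholder, not a real CA section assignment.
sheep = SubstanceIndexRecord(section=1, parent="Insulin", homograph_definition="sheep")
ox = SubstanceIndexRecord(section=1, parent="Insulin", homograph_definition="ox")
print(sheep.name_elements())
```

Keeping the name elements as separate fields mirrors the point made in the text: a search can address the parent, the substituents, or the modification individually rather than a single undivided name string.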
The General Subject Index records in CASIA generally consist of concept headings selected from a controlled vocabulary and text modifications which indicate the context in which the concept headings occurred in the original document indexed. Some concept headings such as “rust” require homograph definitions (e.g., “plant”) to remove ambiguity. Certain frequently indexed General Subject and Chemical Substance Index headings in CASIA are categorized, that is, further classified by: (a) a functional category data element which subdivides derivatives of a substance into classes such as polymers, esters, or oxides; or (b) a qualifier data element which indicates the context in which the substance occurred in the original document, e.g., its preparation, reactions, or analysis.

Although these descriptions of CASIA and CACon are necessarily brief, they emphasize the data elements which are most important for information retrieval.

VOCABULARY STUDY

Methodology. There are several ways of comparing the retrieval potential of two data bases. A frequently used method is to examine the retrieval performance of the data bases when searched against a standard set of questions (Text Search Study). A second technique is to compare the vocabulary assigned to corresponding citations in the data bases to determine which provides more access points for each citation (Vocabulary Study). Both of these approaches were used in our study.

Williams6 used the vocabulary comparison technique to compare the relative retrieval potential of selected pairs of data elements in CASIA and CACon. Our study expands that approach by analyzing a larger number of citations (11 016) and by extending the comparative analysis to include groups of data elements for each citation. This makes it possible to characterize the content of the files using the combinations of data elements typically used for searching. Our study of word overlap between CASIA and CACon has enabled us to predict the effects of augmenting either file with data elements from the other. In addition, we have been able to correlate the average number of unique words for each citation in CASIA and CACon with the retrieval performances of these files.

To study word overlap we developed a computer program which extracted the “substantive words” in any set of data elements contained in a CASIA citation and compared them with the “substantive words” in the titles and keyword phrases for the corresponding CACon citation. For each citation the program tallied the number of words contained in both files, as well as the unique contribution from each file. A substantive word is one that is a useful access point to the data base. The operational definition which we used produces units that are both intuitively reasonable and can be expected to correlate with retrieval capability. Basically, we define a substantive word as a string of two or more alphabetic characters which is delimited at each end by a nonalphabetic character or the data boundary and which passes through a series of screens. Our screening techniques involved the following:

(a) Stoplist. Each word extracted from a file is checked against a stoplist and removed if found. The stoplist is the same as that used for Chemical Titles (CT) except that it has been augmented with a few frequently occurring chemical nomenclature fragments (e.g., oxybis, ethanediyl). The words in this list are normally not considered useful for searching.

(b) Removal of Duplicates. Only one instance of a given word is allowed for each citation; subsequent occurrences are discarded.
(c) Prefix Harmonization. If two (or more) words differ only in the presence or absence of certain nomenclature multiplier prefixes (i.e., di-, tri-, tetra-, penta-, or hexa-), then they are counted as a single word. Thus, the program counts the following as only one word: trimethyl, tetramethyl, hexamethyl.

(d) Suffix Harmonization. If words differ only in the presence or absence of several selected suffixes, then they are counted as a single word; that is, the program is designed to remove common inflectional variants. Powder, powdering, powdered, and powders would thus be counted as only one word.

(e) Harmonization of Abbreviations. Words in titles and concept headings were preprocessed by a program which applied ACS abbreviation rules to make it possible to match against the abbreviated words in the text modifications and keywords.

A major goal of the word-screening algorithm is to identify the unique, substantive words associated with each citation. However, while eliminating duplicate words is simple, distinguishing between substantive and nonsubstantive ones is not, since it depends on judgment rather than quantitative measurements. In our opinion, the words on the CT stoplist are both nonsubstantive and ineffective for retrieval. Similarly, we consider certain nomenclature fragments (e.g., oxybis and ethanediyl) that serve mainly to connect meaning-bearing units to be equally nonsubstantive. However, nomenclature units are clearly different types of “words” from natural language words, and it is not obvious how one could satisfactorily compare the relative retrieval merits of, say, adsorption and amino.

The harmonization of inflectional variants is much more clear-cut. With respect to suffixes, it intuitively seems very reasonable to regard variations such as ketone and ketones as the same word and not to count them separately. This argument is bolstered by the practical observation that most searchers use right-hand truncation, so that the term appearing in profiles would probably not distinguish between the two variants. In the case of prefix harmonization, we feel that inflated counts would result from counting dichloro and chloro as separate words. Moreover, chloro in the keyword data element of CACon may refer to the same entity as trichloro in a text modification from CASIA, the multiplier in the former case having been removed to make the term more searchable. Thus, we feel that prefix harmonization increases the validity of the results with regard to both word counts and file overlap. Suffix harmonization generally affects general subject words while prefix harmonization applies exclusively to nomenclature terms.

The inflection harmonization routine used in the algorithm is fairly simple and certainly could be improved. However, it accomplishes almost all the harmonization we desire with only a relatively small number of errors. While a manual harmonization of inflections could in theory be both more flexible and more accurate, it is not likely to be so in practice with such a large number of citations (11 016); the work involved is both tedious and laborious but must be carried out very carefully for the results to be reliable.

For each citation, the program builds a list of unique meaningful words while at the same time it records the source of the word (CASIA or CACon). If a word, or a suitable harmonization, is already on the list, then only its source is noted. When the citation has been completely processed, every word in the list has been tagged to show whether it occurs in CASIA, CACon, or both files. The program derives a large amount of statistical data from this list and in addition creates lists of the words occurring only in CASIA or only in CACon for subsequent analysis.

Table I

Data element                        Occurrences/citation
General subject heading (CTH)       2.13
Chemical substance heading (PAR)    4.18
Keyword phrase                      2.26
Title                               1.00

Data Base. The data base analyzed for our vocabulary study consisted of citations common to two CASIA and two CACon issues.
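The word-screening and tallying procedure described under Methodology can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program used in the study: the stoplist is a tiny stand-in for the CT stoplist, the suffix set is a guess at the "several selected suffixes", the function names are invented here, and step (e), abbreviation harmonization, is omitted.

```python
import re

# Tiny stand-in for the CT stoplist plus the two nomenclature fragments named in the text.
STOPLIST = {"the", "of", "and", "in", "for", "oxybis", "ethanediyl"}
MULTIPLIERS = ("hexa", "penta", "tetra", "tri", "di")   # prefixes listed in step (c)
SUFFIXES = ("ing", "ed", "s")                            # assumed inflectional suffixes


def harmonize(word):
    """Crudely strip one multiplier prefix and one inflectional suffix."""
    for prefix in MULTIPLIERS:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word


def substantive_words(text):
    """Runs of two or more letters, minus stoplist words, harmonized and de-duplicated."""
    words = re.findall(r"[a-z]{2,}", text.lower())
    return {harmonize(w) for w in words if w not in STOPLIST}


def tally(casia_text, cacon_text):
    """Count words common to both files and words unique to each, for one citation."""
    casia = substantive_words(casia_text)
    cacon = substantive_words(cacon_text)
    return {
        "both": len(casia & cacon),
        "casia_only": len(casia - cacon),
        "cacon_only": len(cacon - casia),
    }


# Invented example strings, not taken from either file.
print(tally("Powdered trimethyl ketones, oxybis",
            "powders of tetramethyl ketone in adsorption"))
```

Like the routine described in the text, this harmonizer is simple and makes errors: any word that merely begins with "di" or "tri" (for example, "diffusion") is also stripped, which a production version would have to guard against.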
The CACon tapes corresponded to CA Volume 84, issues 2 and 3, but the CASIA tapes did not contain all these citations since CASIA citations are packaged using different criteria. There were 11 016 citations common to the two files: 6321 from issue 2 and 4695 from issue 3. The data base for the vocabulary study did not overlap at all with that used for the text search study discussed later. This was purposely done to determine whether the statistical results from one study could be correlated with a search study on a completely different sample of the data base. Any successful correlation could then be considered to be caused by a property of the data base and not by an artifact of the particular sample.

Data Element Frequencies. The frequencies of some of the data elements studied are given in Table I. Our sample contained 6.31 index heading data elements per citation on the average. This corresponds very closely to the average (6.33) for the Volume 83 indexes. Note that a general subject heading data element may contain more than one word and that a keyword phrase data element normally consists of two to four words.

Vocabulary Study Results. The main results of our vocabulary study are summarized in the figures and tables. Figure 1 illustrates the distribution of citations with respect to numbers of unique substantive words in CASIA. The CASIA data elements used to obtain this figure were those generally searched, that is, concept heading (CTH), text modification (TMD), plus the parent (PAR), substituent (SUB), and modification (MOD) for substance names. In our

Figure 1. Citation distribution of unique substantive words from CASIA. (Histogram; horizontal axis: words, vertical axis: citations.)

Figure 2. Citation distribution of unique substantive words from CACon. (Histogram; horizontal axis: words, vertical axis: citations.)

Table II

            CASIA         CACon         CTH/TMD       PAR/SUB/MOD
Mean        11.88         7.73          7.42          5.58
Mode        [illegible]   [illegible]   7             0
Median      10            [illegible]   [illegible]   3
Std dev     8.73          3.05          3.89          [illegible]
Range       1->99         1-27          1-52          0->99

illustrated by Figure 3. The remarkable similarity of Figures 2 and 3 is accentuated by the fact that the average number of unique words per citation for Figure 3 is 7.4, the standard deviation is 3.9, and the most common number of words per citation is 7. Figure 4 illustrates the distribution for the CASIA nomenclature data elements. The reason for the peculiar shape of this histogram, and for its mode being 0, is that many citations have no chemical substance entries. [Figure 4 can be approximated by an exponential equation of the form y = b e^{-ax}, where y is the number of citations, x is the number of unique nomenclature words per citation, b = 1636, and a = 0.17.] Indeed, the distribution of chemical substance entries over CA sections is highly skewed (as indicated by the large standard deviations and the discrepancy between the range and the other statistical parameters); e.g., the average number of substance entries per citation in Volume 83 ranged from 0.29 for one section to 21.63 for another. Thus, a citation may have no substance entries, or sometimes it may have dozens or, occasionally, hundreds. The frequency distributions of the unique substantive words for CASIA, CACon, and two subsets of CASIA are summarized in Table II.

The practical reasons for the differences in the above statistical distributions are clear from the nature of the data and CAS indexing policies. Consider, for example, the number of words that might be derived from each of the data elements studied. There is only one title per citation, and it is limited to a fairly small number of words for practical reasons. Similarly, while there may be several keyword phrases, each will contain only three or four words. Keyword phrases tend to be used to describe the major features of a citation and thus tend to be limited to a fairly small number, though the limitation is not one of CAS indexing policy.
Thus, the title and keyword data elements are likely to contain a relatively small number of words both for practical reasons and because they are descriptors at the citation level. This is also likely to be true of those parts of CASIA which are associated with general subjects (i.e., CTH’s and TMD’s) rather than specific chemical substances (i.e., the nomenclature data elements: PAR, SUB, and MOD). It is unusual to have a large number of general subjects associated with a given citation; also, the number of concept headings per citation is more evenly distributed than the number of chemical substances. (For example, in Volume 83, the average number of concept headings per citation in CA sections varied from 0.61 to 3.5, a much narrower range than the corresponding one for chemical substance entries.) Text modifications contain more words than concept headings but are limited in length by practical considerations. Moreover, there will be, on the average, less than one text modification per index entry, and a significant number of these will be duplicates and therefore not contribute unique words. Thus, the number of unique words derived from text modifications per citation will be fairly regular. In contrast to the data elements just discussed, the distribution of chemical substance entries per citation will be irregular and will have a very wide range. CAS indexes most substances appearing in a citation; thus, it is quite possible to index 100 compounds for a citation though it is very hard to imagine one with 100 different general subject headings.
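The exponential approximation quoted above for Figure 4 is easy to evaluate numerically. The short Python sketch below uses only the two constants given in the text (b = 1636, a = 0.17); the function and variable names are choices made here, and the output is the value of the fitted curve, not the observed histogram counts.

```python
import math

B = 1636.0   # coefficient b quoted in the text
A = 0.17     # decay constant a quoted in the text


def predicted_citations(x):
    """Fitted number of citations having x unique nomenclature words: y = b * exp(-a * x)."""
    return B * math.exp(-A * x)


# Number of additional words over which the fitted count falls by half.
halving_distance = math.log(2) / A

for x in (0, 5, 10, 20):
    print(x, round(predicted_citations(x)))
print("count halves roughly every", round(halving_distance, 1), "words")
```

Under this fit the count halves about every four words, which is consistent with the long tail described for the nomenclature data elements.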

Figure 4. Distribution of unique substantive words from CASIA nomenclature data elements. (Histogram; horizontal axis: words, vertical axis: citations.)

Figure 5. Overlap of unique substantive words per citation between files: (a) CACon vs. CASIA; (b) CACon vs. CASIA without nomenclature; (c) CACon vs. CASIA nomenclature only; (d) CACon vs. CASIA text modification only.

The above argument explains why the histograms of Figures 1 and 4 have long “tails” (and large standard deviations) while those of Figures 2 and 3 extend over a much smaller word frequency range. The chemical substance nomenclature component of CASIA has quite different statistical characteristics from both the nonnomenclature part and from CACon, while the latter two are quite similar. Whether CACon and the nonnomenclature (CTH/TMD) component of CASIA will have equally similar retrieval characteristics will, of course, depend on their degree of overlap and the relative retrieval capabilities of their unique, nonoverlapping portions.

Figure 5 summarizes the word overlap results. Of the unique substantive words associated with a citation, 46% appeared only in CASIA, 18% appeared only in CACon, and 36% were common to both files (see Figure 5a). Another way of looking at this is that 82% of the total citation words appeared on CASIA and 54% on CACon, and thus the ratio of unique substantive words between CASIA and CACon was 1.5 to 1. One would intuitively suspect that a large proportion of the words unique to CASIA is derived from its chemical substance nomenclature data elements, and Figures 5b and 5c show that this is indeed the case. Figure 5b shows that there is a very considerable overlap (46%) of the unique substantive words on CACon and CASIA without nomenclature (CTH/TMD), with each file contributing roughly half the mutually exclusive words (that is, 75% of the total words appear in CACon, compared to 71% in CASIA, and the ratio of CASIA to CACon is now 1 to 1.06). If we assume that the number of unique substantive words correlates with retrieval capability, then we would predict from this ratio that CACon and CASIA without nomenclature will give comparable retrieval results for queries which do not require specific nomenclature elements. Figure 5c presents a completely different picture; the very low amount of overlap (10%) indicates that few CAS index nomenclature words appear in CACon.

Figure 5d can be used to assess the role of text modifications in providing additional access points for a citation. When contrasted to Figure 5b, it can be seen that most of the vocabulary overlap between CACon and CASIA without nomenclature can be derived from text modifications, which contain words also used in concept headings. Quantitatively, the concept heading data element provided an average of 1.20 unique substantive words per citation not duplicated in the text modification. Of these 1.20 words per citation, 0.41 word was duplicated in CACon while the remaining 0.79 word was unique to CASIA. The statistical parameters for the overlapping and nonoverlapping words for various file comparisons are given in Tables III-V (the units are unique substantive words per citation). The mean values from these tables were used to construct Figure 5.

Table III. Distribution of Words Common to Files Compared

            CASIA vs. CACon   CTH/TMD vs. CACon   PAR/SUB/MOD vs. CACon
Mean        5.20              4.76                1.26
Mode        4                 4                   0
Median      5                 5                   1
Std dev     2.24              2.12                1.44
Range       0-25              0-15                0-17

Table IV. Distribution of Words Unique to CASIA Subsets

            CASIA vs. CACon   CTH/TMD vs. CACon   PAR/SUB/MOD vs. CACon
Mean        6.68              2.64                4.32
Mode        0                 0                   0
Median      4                 2                   2
Std dev     8.20              3.05                6.86
Range       0->99             0-48                0->99

Table V. Distribution of Words Unique to CACon

            CASIA vs. CACon   CTH/TMD vs. CACon   PAR/SUB/MOD vs. CACon
Mean        2.53              2.96                6.48
Mode        1                 2                   6
Median      2                 3                   6
Std dev     2.07              2.34                2.83
Range       0-18              0-20                0-22

The prefix and suffix harmonization techniques explained earlier reduced the number of unique words per citation by 13.5% of the combined set of CASIA and CACon words after removal of duplicates. Suffix harmonization accounted for 11.7% of the reduction and prefix harmonization for the remaining 1.8%. Thus, prefix harmonization had a very small effect and could have been omitted without seriously influencing the statistical results.

All the statistical results discussed so far refer to the overlap of words in CASIA and CACon on a citation-by-citation basis and give no indication of the degree to which the total vocabularies of the two files overlap. A comparison of the unique vocabularies of the files indicates that the ratio of unique CASIA to unique CACon words was 1.1 to 1.
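The percentages quoted for Figure 5 follow directly from the mean values in Tables III-V. As a check on that arithmetic, the Python sketch below recomputes them; the function name and dictionary keys are choices made here, and the inputs are the table means (words unique to CASIA, words unique to CACon, and words common to both, per citation).

```python
def overlap_shares(casia_only, cacon_only, common):
    """Express the three mean word counts as rounded percentages of their total."""
    total = casia_only + cacon_only + common
    return {
        "casia_only": round(100 * casia_only / total),
        "cacon_only": round(100 * cacon_only / total),
        "common": round(100 * common / total),
    }


# Means from Tables IV, V, and III respectively, for each comparison.
fig_5a = overlap_shares(6.68, 2.53, 5.20)   # CACon vs. CASIA
fig_5b = overlap_shares(2.64, 2.96, 4.76)   # CACon vs. CASIA without nomenclature
fig_5c = overlap_shares(4.32, 6.48, 1.26)   # CACon vs. CASIA nomenclature only

print(fig_5a, fig_5b, fig_5c)

# Ratio of mean unique substantive words per citation, CASIA to CACon.
print(round((6.68 + 5.20) / (2.53 + 5.20), 1))
```

The same sums also reproduce the per-file means: 6.68 + 5.20 = 11.88 words per citation for CASIA, and 2.53 + 5.20 = 7.73 for CACon.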
Further analysis of the unique vocabularies on the basis of the total files revealed that 61% of the words unique to CASIA were chemical substance terminology whereas only 30% of the words unique to CACon were chemical substance terminology. The great majority of the unique chemical substance words referred to organic rather than to inorganic groups or compounds, and CACon, in particular, had many occurrences of generic names such as “alkylthiohydroxycyclohexanone”.

A number of conclusions can be drawn from the statistical data gathered in this part of our study. CASIA contains two distinct vocabularies which are used in the chemical substance nomenclature (PAR, SUB, and MOD) and nonnomenclature (CTH and TMD) data elements. These have quite different statistical characteristics, and their word contents do not overlap to any great extent. The nonnomenclature parts of CASIA (CTH and TMD) have similar statistical properties to CACon (titles and keywords) and contain approximately the same number of unique substantive words per citation; their word contents overlap to a marked extent. If the individual words in CACon and the nonnomenclature parts of CASIA are, on the average, equally useful for retrieval, then CACon will be equivalent to CASIA without nomenclature for retrieval purposes. CASIA without nomenclature and CACon will complement each other for retrieval purposes because of incomplete overlap in the unique substantive words per citation which each contains. A unique feature of CASIA is its systematic nomenclature component, and this must be used in profile construction in order to exploit CASIA’s full potential.

Based on the above data and conclusions, and assuming that the retrieval capability of a file is related to the number of unique substantive words per citation that it contains, one would predict that CACon and CASIA would perform equally well on queries requiring general subject terms or trivial chemical names, but would produce complementary rather than identical results, while CASIA would perform much better than CACon on queries requiring systematic chemical nomenclature terms, i.e., substance-oriented searches for classes of compounds with designated structural features or for specific compounds. (The latter effect will, of course, be reinforced by the existence of Registry Numbers and molecular formulas on CASIA.)
It should be emphasized that these predictions are based on the statistical data only and that the second part of our study deals with the actual search results.

TEXT SEARCH STUDY

Comparison of search results from two data bases using equivalent search profiles makes it possible to evaluate the relative search effectiveness of the data bases in the areas relevant to the search profiles. If the relative performance characteristics of the data bases demonstrate a pattern for several different search profiles, then it may be assumed that an intrinsic difference in the data bases is responsible for the pattern. Our text search study comparing the retrieval effectiveness of CACon and CASIA used a set of ten questions dealing with various topics of current interest in the chemical community. Another study comparing CACon and CASIA was completed recently (8); our work expands on that study by emphasizing the most salient features of CASIA, namely, CAS Registry Numbers and chemical nomenclature. These features can be used very effectively with the help of available search aids; therefore, our study stressed the methodology for formulating queries as one of the most important factors affecting search results.

Search Environment. Ten search questions were framed to cover a range of topics of current technical interest. Some of the questions were of a broad, general nature, while others emphasized features or properties of chemical substances. Table VI lists the topics covered by the questions and a brief statement of the scope of each question. The data base searched consisted of 31 855 citations selected from issues 21, 22, 23, and 24 of CA Volume 82 which occurred in both CACon and CASIA. [The difference (29) between the number of citations on the data base searched and in the corresponding printed issues is due mainly to citations which are delayed from appearing in the CASIA tapes because of additional editing requirements.] Search profiles were coded for a batch text-search program with simple Boolean logic capability which permitted right and left truncation; the program also allowed coordination of terms occurring in different data elements.

Table VI. Search Questions

Topic                     Scope of question
Electrodes                Ion-selective electrodes for chloride and perchlorate
Epitaxial growth          Epitaxial growth of gallium arsenide and gallium phosphide for semiconductor devices
Fuel gas                  Fuel gas from gasification of coal
Steel corrosion           Corrosion resistance of stainless steels to acids
Magnetic properties       Magnetic properties of rare earth ferrites (including magnets made from such materials)
SO2 pollution             Atmospheric pollution by sulfur dioxide
Heterocyclic carcinogens  Carcinogenic heterocyclic compounds containing only carbon and nitrogen in the rings
Aromatic nitration        Nitration of carbocyclic ring systems containing four or fewer component rings at least one of which is fully unsaturated. Not more than one ring may have five atoms; the other rings must have six atoms
β-Keto acids              β-Keto acids and esters, C(=O)-CR2-C(=O)
Optical activity          Optically active organochalcogen compounds

Table VII. Search Aids for CAS Services

CA Condensates Search Aid Package
  CA Condensates Word Frequency List, Volume 78 (1973)
  Key-Letter-In-Context Index of CA Condensates Words of frequency greater than ten, Volume 78 (1973)
  Phrases from CA Condensates Document Titles, Volume 78 (1973)

CASIA Search Aid Package
  Manual-List of standard subdivisions for chemical substances and subdivided Chemical Substance Headings
  Index Guide
  CA General Subject Index Headings List
  CA Taxonomic Subject Index Heading List
  Chemical Substance Name-Registry Number List

Formulating the Searches. The searchable data in CACon, as mentioned earlier, consist of titles and keywords, representing essentially uncontrolled author terminology. CASIA, on the other hand, uses a controlled index vocabulary for concepts and for substances; natural language and author terminology are used in CASIA only in text modifications. The differences in content of CACon and CASIA make it necessary to use different approaches for formulating the search profiles. CACon can be searched effectively using “common-sense” search terms with support from word frequency lists and Key-Letter-In-Context indexes to add supplementary terms or to assess the effect of truncation on the relevance of the retrieval. This same approach in CASIA is generally destined to produce poor results. Since most of the vocabulary in CASIA is controlled, it is necessary to use the preferred index terms when searching concepts or chemical substance nomenclature. Text modifications, on the other hand, can be searched basically in the same way as CACon. The controlled terminology used in CASIA is given in publications such as the CA Index Guide and its supplements. It must be remarked, however, that many of the controlled general subject terms used in CASIA coincide with natural language terms which may be successfully, but not optimally, searched without recourse to the CA Index Guide and other search aids. Table VII lists some of the search aids (9) which can assist in the use of CAS products.
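The right and left truncation permitted by the batch search program can be pictured with a short sketch. This is a minimal illustration and not the CAS program itself: the function name `term_matches` and the convention that a leading or trailing `*` marks truncation are assumptions made here for the example.

```python
def term_matches(pattern: str, word: str) -> bool:
    """Test one word against one search term with optional truncation.

    Assumed notation (for illustration only): a leading '*' means left
    truncation, a trailing '*' means right truncation.
    """
    pattern, word = pattern.upper(), word.upper()
    left = pattern.startswith("*")    # any prefix may precede the stem
    right = pattern.endswith("*")     # any suffix may follow the stem
    stem = pattern.strip("*")
    if left and right:
        return stem in word
    if left:
        return word.endswith(stem)
    if right:
        return word.startswith(stem)
    return word == stem


print(term_matches("NITRAT*", "nitration"))       # right truncation
print(term_matches("*NITRATION", "denitration"))  # left truncation
print(term_matches("*BENZ*", "dibenzofuran"))     # both sides truncated
print(term_matches("NITRATION", "nitrations"))    # no truncation, no match
```

Broad truncation such as `*BENZ*` admits more words, which is why the word frequency lists and Key-Letter-In-Context indexes are useful for judging its effect on the relevance of the retrieval.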
Nomenclature terms useful for retrieving substance-oriented data can be found in several search aids prepared for Substructure Searching Via Nomenclature, all of which have been described in previous papers (9, 10). These search aids make it possible to start with features of a substance such as functional groups, ring systems, or nonsystematic names and derive terms which may be used to search CASIA nomenclature effectively. In fact, these techniques have been developed to the point that it is possible to perform chemical substructure searches using CASIA nomenclature terms (10). CAS Registry Numbers are also very effective for searching for specific compounds in CASIA; CAS Registry Numbers can be found in many primary journals and in the CA Index Guide, the CA Formula Index, and CA Chemical Substance Index, or in chemical dictionaries such as NLM’s CHEMLINE and Lockheed’s CHEMNAME.

Table VII (continued).

The 9CI Substructure Searching via Nomenclature Manual and Search Aids
  Nonsystematic Name Guide for the 9CI
  Functional Group Guide for the 9CI
  Hetero-Atom-In-Context Index of Component Elemental Ring Systems for the 9CI
  Keyword-Out-Of-Context Inverted Index of Ring Systems for the 9CI
  Unique and Non-Unique CA Index Name Nomenclature Words and Volume Frequency for Volume 77 (1972)
  Key-Letter-In-Context Index of CA Nomenclature Words for Volume 77 (1972)

In order to control the formulation of the search profiles, only publicly available search aids were used to derive search terms. This method may be exemplified by the development of a profile for “aromatic nitration”. As indicated in Table VI, the term “aromatic nitration” was taken to mean the introduction of a nitro group into a carbocyclic system of not more than four rings, not more than one of which contained five atoms and the others six atoms. At least one of the rings was required to be fully unsaturated, and spiro systems were excluded. The Index Guide was consulted for possible cross-references. Accordingly, for the CASIA search the concepts “Aromatic compounds”, “Aromatic hydrocarbons”, and “Nitration” were used. For the CASIA search, the substance profile was built up by constructing, with the aid of the CA Index Guide-Index of Ring Systems, the parent names for all possible ring systems within the scope of the topic as framed above, then reducing the terms to common roots by specifying truncation where feasible. The main concept in this example is the coordination of a ring system with the word “nitration”. Although this example did not provide the opportunity to illustrate the use of the “CAS Functional Group Guide for the 9CI”, it must be mentioned that this is one of the principal search aids required for searching nomenclature successfully. The Functional Group Guide provides a link between the groups of atoms in chemical structures and the corresponding CASIA nomenclature terms. The CACon search corresponding to this topic included common names such as “aniline” and “toluene” as well as the substance terms used in the CASIA search. One of the profiles used to search CASIA is illustrated in Figure 6.
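The coordination of a truncated ring-system term with the word “nitration” can be sketched as follows. This is an illustration only: the `IndexEntry` record, its two-field layout (heading and text modification), and the short list of ring roots are assumptions made for the example, not the actual CASIA record format or the full profile.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    heading: str        # index heading parent, e.g. "Benzene, methyl-"
    modification: str   # text modification, e.g. "nitration of"


# Hypothetical subset of ring-system roots, each searched as *ROOT*.
RING_ROOTS = ["BENZ", "NAPHTH", "ANTHR", "PHENANTHR"]


def entry_matches(entry: IndexEntry) -> bool:
    """True when a ring root in the heading is coordinated with
    NITRATION in the text modification of the same index entry."""
    heading = entry.heading.upper()
    has_ring = any(root in heading for root in RING_ROOTS)
    has_nitration = "NITRATION" in entry.modification.upper()
    return has_ring and has_nitration


entries = [
    IndexEntry("Benzene, methyl-", "nitration of"),
    IndexEntry("Pentane, 3-methyl-", "nitration of"),
    IndexEntry("Naphthalene", "oxidation of"),
]
hits = [e.heading for e in entries if entry_matches(e)]
print(hits)
```

Only the first entry satisfies both conditions; the second has the reaction but no ring root, and the third has a ring root but a different reaction.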
It is interesting to note that for this profile chemical nomenclature terms retrieved 49 index entries, 40 of which were relevant, whereas the concept heading terms produced 31 index entries, 19 of which were relevant. The relevant index entries corresponded to 26 different citations. The CACon searches for this topic produced significantly fewer relevant retrievals, as can be deduced from the results in Figure 7. In an effort to obtain the maximum recall from CACon, the word “nitration” was used as the sole search term since it was the limiting term of the profile; this produced 25 citations, only 12 of which were relevant. These results illustrate the value of nomenclature in CASIA, and at the same time serve to point out that CA indexing policy requires index entries to be specific. Generic entries such as “nitration” are made when the major emphasis is on the concept, but generally all substances would still be indexed individually.

Figure 6. Sample profile used to search the topic “Aromatic Nitration” in CASIA.

Index Heading Parent Terms: *BENZ*, *INDEN*, *NAPHTH*, *ACENAPHTH*, *FLUOREN*, *ANTHR*, *PHENALEN*, *PHENANTHR*, *ACEANTH*, *ACEPHENANTH*, *FLUORANTH*, *CHRYSEN*, *PYREN*, *TRIPHENYLEN*, *BIPHENYLEN*, *TERPHENYL*, *QUATERPHENYL*, *PHENOL*, *CYCLOPENT* (but not *CYCLOPENTANE* or *CYCLOPENTENE*)
Concept Heading Terms: *AROMATIC COMPOUNDS*, *AROMATIC HYDROCARBONS*, *NITRATION*
Text Modification Terms: *NITRATION*
Heading terms are coordinated (AND) with text modification terms. * denotes truncation.

Figure 7. Results of the text search experiment.

                                     CACon                           CASIA
Topic                     Iteration  Rel. recall, %  Relevance, %    Rel. recall, %  Relevance, %
Electrodes                (1)        75              16              100             29
                          (2)        100             29              100             67
Epitaxial growth          (1)        47              100             57              94
                          (2)        100             42              87              46
                          (3)        80              77              87              70
Fuel gas                  (1)        88              64              74              89
                          (2)        65              71              79              66
                          (3)        79              87              79              82
Steel corrosion           (1)        86              62              82              61
                          (2)        33              88              52              74
Magnetic properties       (1)        81              26              94              34
                          (2)        44              70              81              81
SO2 pollution             (1)        26              100             36              96
                          (2)        64              84              86              92
Heterocyclic carcinogens  (1)        29              22              67              23
                          (2)        33              23              95              33
                          (3)        38              31              95              42
Aromatic nitration        (1)        13              13              88              35
                          (2)        28              82              81              65
                          (3)        38              48              75              86
β-Keto acids              (1)        22              21              93              43
                          (2)        13              22              100             44
Optical activity          (1)        7               7               97              44
                          (2)        6               11              90              79
                          (3)        8               11              98              91

Rel. recall = percent relative recall; relevance = percent relevance.

Figure 8. Maximum recall. (Horizontal axis: search questions 1 to 10.)

CACon and CASIA require different modes of searching. The keywords in CACon generally supplement the title of a document, making it desirable to combine the words in the title and in the keyword phrases and search them as a unit. This same technique in CASIA will increase the false coordination of terms. Suppose, for instance, that the index entry “Pentane, 3-methyl-, nitration of” co-occurs with the entry “Benzene, ...”. If all text modifications and nomenclature terms for a citation were searched as a unit, the citation would be considered relevant to the topic “aromatic nitration” when in reality it is not. Hence, the precision of a search on CASIA is improved by searching each index entry rather than the set of terms from all index entries of the citation as a unit.

Text Search Results. The general search strategy was to code CACon and CASIA profiles corresponding to each other as closely as possible, recognizing their different data element characteristics. Quite broad profiles were made to establish recall bases for the particular questions so as to determine the maximum number of citations that acceptably answered the stated questions. The recall base used in this paper consisted of all the unique relevant citations retrieved for all the searches on a specific topic. The typical CASIA profile was generally larger than its CACon counterpart.
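The recall base defined above lends itself to a short worked sketch. The citation numbers and relevance judgments below are invented for illustration, and the function names are hypothetical; relevance is taken here as the percentage of retrieved citations judged relevant, which is an assumption about the measure reported in Figure 7.

```python
def recall_base(searches, relevant):
    """Unique relevant citations retrieved by any search on the topic."""
    base = set()
    for retrieved in searches.values():
        base |= retrieved & relevant
    return base


def percent_relative_recall(retrieved, relevant, base):
    """Relevant citations retrieved, as a percentage of the recall base."""
    return round(100 * len(retrieved & relevant) / len(base))


def percent_relevance(retrieved, relevant):
    """Relevant citations retrieved, as a percentage of all retrieved."""
    return round(100 * len(retrieved & relevant) / len(retrieved))


# Invented data: citation numbers retrieved by each search on one topic.
searches = {
    "CACon (1)": {1, 2, 3, 9},
    "CASIA (1)": {1, 2, 4, 5, 6},
}
relevant = {1, 2, 4, 5, 7}  # citation 7 is relevant but was never retrieved

base = recall_base(searches, relevant)  # {1, 2, 4, 5}
for name, retrieved in searches.items():
    print(name,
          percent_relative_recall(retrieved, relevant, base),
          percent_relevance(retrieved, relevant))
```

Because citation 7 was never retrieved by any search, it is not part of the recall base; this is why the recall figures are relative rather than absolute.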
The average CASIA profile had 1.4 times as many terms as the average CACon profile; however, 43% of the CASIA terms were directed toward systematic nomenclature. After evaluating the citations retrieved by the first search, the questions were refined: (a) by adding other potentially useful terms that were encountered during analysis; (b) by narrowing the search profile to increase the proportions of relevant retrievals by correlating terms in a more restrictive manner; and (c) by eliminating terms that either resulted in only false retrievals or retrieved nothing useful. Although in a few cases second refinements were also performed, the aim of this experiment was not to optimize recall and relevancy to the point of complete satisfaction, but rather to assure that essentially equivalent profiles were being used for both files. The objective was to obtain meaningful comparisons of the search results.

Figure 7 summarizes the results of the text search study. Each topic was searched two or three times; the numbers in parentheses are the iteration numbers for each topic. As can be noticed, the best results were not always obtained in the latest searches. Figure 8 illustrates graphically the maximum relative recall obtained for each of the ten topics searched using CACon and CASIA. The maximum relative recall represents the best performance we obtained from these files. Ordering the topics in decreasing order of relative recall for CACon emphasizes the differences of the files in terms of retrieval performance. The greatest divergence in performance is seen in the organic substance-oriented topics, for which CASIA performs significantly better than CACon. Buntrock (11) reported similar observations when comparing retrieval of organic substances in CACon and in the printed CA indexes; he also noted that CACon can be searched advantageously for inorganic substances.

The average relevance obtained for the searches (Figure 9) was generally better for CASIA, particularly for the “optical activity” topic, which made use of the stereochemical descriptor in the substance names. (Note that the topics have been reordered to reflect decreasing CASIA relevance.)

Figure 9. Average relevancy. (Horizontal axis: search questions.)

It makes sense to ask why the uncontrolled vocabulary of CACon can perform as well as the controlled general subject vocabulary of CASIA, particularly since it is generally assumed that a controlled-vocabulary file should be superior to an uncontrolled-vocabulary one for searching. However, an uncontrolled-vocabulary file can in principle produce results at least as good as those from a controlled-vocabulary one (12). If the uncontrolled-vocabulary file is searched by an expert who is fully conversant with the field of the query and can thus generate all the necessary profile terms, including synonyms and related terms, then the results will be at least as good as those from a controlled-vocabulary file. In fact, they may be slightly better because a controlled vocabulary sacrifices precision in favor of predictability; i.e., it is constrained to some extent in order to use a limited set of terms.
Thus, the difference between controlled and uncontrolled vocabularies lies not so much in their intrinsic or absolute searchability (though the uncontrolled vocabulary may be somewhat more machine-searchable in principle) but in their relationship to the expertise of the searcher. That is, the more conversant the searcher is with the field of knowledge to which the query relates, the better will be the results using the uncontrolled vocabulary (relatively speaking), and conversely. Thus, an ideal searcher might do somewhat better with the uncontrolled vocabulary, but one who was not personally familiar with the particular field of knowledge would undoubtedly be wise to use both the controlled vocabulary and some kind of thesaurus.

The superiority of CASIA for substance-oriented queries is due primarily to the fact that the systematic nomenclature contained in CASIA is generally not found in CACon. However, there is a fundamental difference between the vocabularies used in CASIA to index concepts and substances. A systematic name for a substance is not merely a label attached to a certain chemical compound; it is a description of its structure. The name can usually be split into smaller information-bearing units that indicate structural features and the relationships between these features. Thus, systematic nomenclature is a precise notational system for describing the structure of chemical compounds. No comparable notation yet exists for describing concepts. General subject terms are simply labels that are, by common consent, associated with certain phenomena, and there is generally no way to deduce from the terms alone the actual relationships of similar phenomena. (This is not entirely true, as the presence of certain suffixes in the term may convey information about the nature of the phenomenon; e.g., -olysis and -ylation in chemical reactions and -itis or -oma in pathology.) Thus, substructure search is a successful and productive technique for systematic nomenclature but virtually inapplicable to general subject concepts.

CONCLUSIONS

This paper has compared the relative effectiveness of CACon and CASIA for information retrieval. Our vocabulary study has highlighted the differences in substantive word content between the two files. In addition, figures have been presented which quantify the contributions to the vocabularies of CASIA and CACon by various data elements. The data base analyzed was sufficiently large that the predictions which could be made from the data base characteristics were confirmed by text search studies on another sample of the data base. The text search results indicate that, when using appropriate search methodology, CASIA will give significantly better recall than CACon for substance-oriented topics while being roughly equivalent for general subject topics.
Relevance is generally slightly higher for CASIA results than for corresponding CACon results. Our study emphasizes that knowledge of the data base content and its vocabulary gained through the use of search aids is very important for searching CASIA, particularly in the formulation of search profiles which include chemical nomenclature terms.

This study provided insight which will help to improve CAS computer-readable files. One possibility which is being explored includes the combination of CACon bibliographic and keyword data with CASIA’s powerful substance-retrieval capability. This combination would result in the elimination of redundancy between the files while at the same time retaining the unique content contribution of each file and the vital relationship of the data elements and other logical file units.

Human factors also need to be studied. The temptation to use ad hoc terms in the preparation of search profiles and the tendency to transfer CACon search techniques to CASIA need to be tempered by education in the data-base content, the use of search aids, and exploration of techniques for computer-assisted profile development. CAS recently augmented its user education program to address some of these problems (13). The education program makes use of problem workbooks and workshops using on-line terminals to provide guidance in the use of CAS publications and files. This interface with the actual users of CAS services also provides feedback which can lead to improvements of the CAS Data Base.

LITERATURE CITED

(1) R. E. O’Dette, “The CAS Data Base Concept”, J. Chem. Inf. Comput. Sci., 15, 165-169 (1975).
(2) W. C. Zipperer, R. E. Stearns, Jr., and M. K. Park, “The Integrated Subject File. I. Data Base Characteristics”, J. Chem. Doc., 13, 92-98 (1973).
(3) W. C. Zipperer, M. K. Park, and J. L. Carmon, “The CA Integrated Subject File. II. Evaluation of Alternative Data Base Organizations”, J. Chem. Doc., 14, 15-23 (1974).
(4) C. H. O’Donohue, “Profiling, the Key to Successful Information Retrieval”, J. Chem. Doc., 14, 29-31 (1974).
(5) B. G. Prewitt, “On-Line Searching of Computer Data Bases”, J. Chem. Doc., 14, 115-117 (1974).
(6) M. E. Williams, “Analysis of Terminology in Various CAS Data Files as Access Points for Retrieval”, presented before the Division of Chemical Information, 170th National Meeting of the American Chemical Society, Chicago, Ill., Aug 24, 1975; J. Chem. Inf. Comput. Sci., paper in this issue.
(7) C. P. Bourne and D. F. Ford, “A Study of the Statistics of Letters in English Words”, Inf. Control, 4, 48-67 (1961).
(8) P. B. Schipma et al., “Final Report: Study of Indexing and Information Display”, Report No. C6299-2, IIT Research Institute, 1975.
(9) D. L. Dayton, “New Aids for Formulating Searches of CAS Indexes and Computer-Readable Files”, CAS Report No. 4, Aug 1975, pp 3-8.
(10) W. Fisanick, L. D. Mitchell, J. A. Scott, and G. G. Vander Stouw, “Substructure Searching of Computer-Readable Chemical Abstracts Service Ninth Collective Index Chemical Nomenclature Files”, J. Chem. Inf. Comput. Sci., 15, 73-84 (1975).
(11) R. E. Buntrock, “Searching Chemical Abstracts vs. CA Condensates”, J. Chem. Inf. Comput. Sci., 15, 174-176 (1975).
(12) F. W. Lancaster, “Vocabulary Control for Information Retrieval”, Information Resources Press, Washington, D.C., 1972.
(13) R. J. Rowlett, Jr., “Symposium on User Reactions to CAS Data and Bibliographic Services. Concluding Remarks”, J. Chem. Inf. Comput. Sci., 15, 186-189 (1975).

Feedback Control of Journal Manuscript Receipts

JOSEPH F. BUNNETT* and HELGA KOIVISTO
University of California, Santa Cruz, California 95064
Received November 5, 1976

The rate of receipt of manuscripts by Accounts of Chemical Research correlates inversely with the number of “Articles Accepted for Publication in Future Issues” as listed on the inside back cover. A feedback mechanism regulating manuscript flow is shown to operate. Feedback is delayed, and in consequence manuscript receipts and the backlog of unpublished articles are cyclic in character.

Without really intending to do so, the journal Accounts of Chemical Research has during the past eight years conducted an experiment relevant to factors that influence the number of manuscripts submitted for publication in a scientific journal.

THE JOURNAL AND ITS PRACTICES

Accounts of Chemical Research (hereafter abbreviated Accounts) is a journal of short reviews, usually of developments at or near the research frontier in a limited area of special interest to the author. Articles in Accounts are often concerned in large part with developments in the author’s own laboratory. Critical analyses of problems currently in a state of flux are also welcomed. Accounts is published by the American Chemical Society. A substantial fraction (about 75%) of the manuscripts published stem from invitations extended by the editor. Of the others, most are proposed by the author to the editor in preliminary correspondence, and then encouraged by the editor, but some come in “over the transom” without any prior contact. All manuscripts are sent to at least two reviewers for criticism, and not infrequently authors are asked to revise manuscripts as suggested by the reviewers or editor (1). The intended, ideal space allocation is six journal pages per article, and overlength manuscripts are returned to authors for shortening. Actual practice falls long of the ideal, for the average length of articles published through 1975 is 7.2 pages.
An honorarium of $250 is paid per manuscript upon publication. In case of coauthorship, it is divided equally among the coauthors. No page charge is levied, but reprints must be purchased at usual rates.

CONDITIONS OF THE “EXPERIMENT”

Three features of Accounts have served the purpose of what we may regard retrospectively as an experiment. One is that the annual number of text pages (exclusive of covers, title page, and indexes) has been virtually fixed: 380 pages in 1968, 379 in 1969, 421 in 1972, and 427 pages in other years through 1975. The second is that articles, by virtue of editorial enforcement of the length restriction, are nearly constant in length (65% of those published in Volumes 1-8, inclusive, were 6, 7, or 8 pages long). The third is that articles accepted for publication and to appear in future issues have, since June 1968, been listed by author(s) and title on the inside back cover (2). The number so listed has varied from a low of 6 to a high of 30. Finally, records have been kept of manuscript receipts, month-by-month (3), and of invitations issued by the editor to prospective authors.

The principal question on which the accumulated data are now brought to bear is whether or how the number of manuscripts received in a month is related to the number of articles accepted for publication as listed on the inside back cover. An associated question is whether or how manuscript receipts are related to the number of invitations extended by the editor in the recent past.

RESULTS

Our data concerning manuscript receipts and the number of articles listed on the inside back cover are plotted month-by-month on a chronological basis in Figure 1. Fluctuations are evident: long-term swells and troughs in the number of articles listed, and short-term fluttering in monthly manuscript receipts. Nevertheless some trends in the latter seem apparent: a surge of manuscript receipts in late 1970 and early 1971, another in 1972-73, and a third in 1974-75, with “dry seasons” especially in 1971-72 and the latter parts of 1973 and 1975. These “dry seasons” of low manuscript receipts happen to coincide with swells in the number of articles listed, in late 1971, 1973, and 1975. This suggests a correlation. Accordingly the data for December 1968 through December 1975 were subjected to linear regression analysis, correlation being attempted between manuscript receipts and articles listed the same month, the previous month, and so forth back to the fifth month before. Correlations were also attempted with

Journal of Chemical Information and Computer Sciences, Vol. 17, No. 1, 1977