SMITH, ANDERSON, A N D JACKSON H C = C = O 1 E3 MULTIPUNCH POSITIONS: 04&
C1-6 chain
042
Unbranched
050
C3-& chain
058
Not a polyvalent chain
069
Hydrocarbon
530 59&, 60-, 62-, 620
717
>1 double bond, not conjugated No rings Aliphatic molecule
1972-mid ‘76, 286 references
Figure 8. Simple compound multipunch retrieval.
ceptably low relevance. The multipunch code may be very effective for complicated molecules; but it is relat&ely ineffective for simple ones. Other retrieval problems stem from the fact that products are indexed, but not starting materials, again reflecting the interests of the end-product-oriented pharmaceutical and agricultural industries. It is very difficult to retrieve information on the utilization of a given starting material. Catalyst retrieval, too, is sketchy at the moment. For example, an attempt to isolate rhodium catalysis in the oxo test search gave only middling results. Hopefully the new catalysis manual codes introduced this year will help. We saw how effective title keywords were for retrieval in the oxo example. They would have been even more effective had it not been for a number of typographical errors: hydroformalation, hydroformulation, and hydro-formylation. The misspelling rate in Derwent titles is far too high, and the unstandardized use of various punctuation marks causes one to miss relevant items. New hyphenation rules promise to help the punctuation problem. Users should certainly try to truncate terms wherever possible, to allow for variant endings, misspellings near the end of the word, and hyphenated combinations. Use the “Neighbor” command liberally, too. But of course misspellings at the start of words are going to be missed no matter what you do, and there is a great need for Derwent to tighten up on the proofreading of their abstract headings. One more point for searchers: remember to use
both American and British spellings. Tests are underway on the addition of indexing keywords to Plasdoc, and, while this is welcome, it would seem more important to add this feature in other sections first, since Plasdoc retrieval facilities are strong in comparison with those for most other sections. In particular, Chemdoc needs better retrieval of starting materials, processes, and simple compounds. One of the problems in retrieval stems from the fact that changes in indexing have occurred over the years. Improved indexing is, of course, beneficial to users, but the searcher often has to devise different search strategies for different time periods. Given a system as complex as the CPI, one of the greatest needs for users is detailed instructional materials. In fact the telegraphic style of Derwent headings extends to their instruction manuals, and leaves much unexplained. Recent communiques from Derwent have acknowledged that fact and promised more detailed manuals for the future. These will be very welcome, as will a promised listing of past multipunch codings, which will be an invaluable aid in devising multipunch search strategy. Retrieval, then, presents more problems than does alerting. Nevertheless, there is considerable retrieval capability, and it has been greatly enhanced by the advent of the on-line file. There is evidence that titles, always a Derwent strong point, have improved and will continue to be improved. Hopefully the keywording experiment will succeed and will be extended to additional sections of CPI. In the meantime, CPI searchers are advised strongly to make use of all of the available, complementary, search parameters. This paper has not attempted to touch on every aspect of the Derwent products, much less explore them all in detail. Its aim has been to present a balanced picture of the most important strengths and limitations of the system. If there has been a stress on some of those limitations, it has been in the spirit of striving for improvements in a powerful but imperfect information resource. Let there be no misunderstanding: based on extensive experience we regard the Derwent CPI-WPI system as an invaluable information asset, indispensible for anyone who needs to know about the new technology of the present and the recent past.
On-Line Retrieval of Chemical Patent Information. An Overview and a Brief Comparison of Three Major Files? ROGER GRANT SMITH,* LOUISE P. ANDERSON, and SUSAN K. JACKSON Merck & Co., Inc., Rahway, New Jersey 07065 Received April 21, 1977 Databases which contain patent information and are available to the public via on-line computer networks are listed. A comparison of document and retrieval parameters is made for those on-line databases which provide comprehensive coverage of the chemical patent literature, Le., CHEMCON, CLAIMSTM,and WPI. Examples of representative searches run in each of these three databases are presented and discussed. The conclusion is reached that no one database may be relied upon to provide all possible relevant answers.
INTRODUCTION of on-line The ability to retrieve patents by to computers is a relatively recent phenomenon. Producers Presented in the symposium on “Meeting the Challenges of the Changing Patent Literature”, Division of Chemical Information, 173rd National Meeting of the American Chemical Society, New Orleans, La., March 21, 1977, and 1 Ith Middle Atlantic Regional Meeting of the American Chemical Society, Newark, Del., April 21, 1977.
*Author to whom inquires should be sent. 148
of machine-readable patent databases have been slow to capitalize on the advantages of on-line retrieval for obvious economic reasons: comprehensive patent databases are large and therefore expensive to maintain on-line, while their appeal has always been limited. Considering the cost factors, it would seem unlikely that many organizations would care to sustain the entire cost of operating an in-house on-line retrieval system for a large file of patents. One company that did perform pioneering work
Journal of Chemical Information and Computer Sciences, Vol. 17, No. 3, 1977
ON-LINERETRIEVAL OF CHEMICAL PATENTINFORMATION Table I. On-Line Databases Which Provide Access to Patents ON-LINL UATA BZSLb A H I L H PROVIDL ACCLSS TO I’ATLNTS PRODUCER
l:ULL NAML
41’1 1 C
American Petroleum I n s t i t u t e P a t e n t Indexa
American P e t r o l e u m I n s t i t u t e Nev Y o r k , N e w York
Air Pollution Technical Information Center
L n v i r o n m e n t a l P r o t e c t i o n Agency Research Triangle Park, N.C.
Chemical A b s t r a c t s Condensates CA PAll.hT CVNCORDANCL C A S lA/CtlLMhMll.
b
Chemical A b s t r a c t s S e r v i c e Columbus, Ohio
Chemical . A b s t r a c t s P a t e n t Concordance
(‘hemical A b s t r a c t s S e r v i c e Columbus, Ohio
Clieniical . \ h s t rlic t
Subj e c t Index
Chemical A b s t r a c t s S e r v i c e Columbus, Ohio
lit ivities
Chemical A b s t r a c t s S e r v i c e Columbus, Ohio
b
A 1e r t d
Chemical-Biolok I c a l
C l a s s Code, i 5 s i g n e e , I n d e x , Nethod Search/Chemis t r y
l F I / P l e n u m D a t a Company Arlington, Virginia
C l d s s Code, Ashipnee,
I n d e x , Method Search/General, L l e c t r i c a l , Nech a n i c a 1
I F I / P l e n u m D a t a Company Arlington, Virginia
C l a s s C o d e , r\h.;ignee, I n d e x , hlethod S e a r c h / U . S . P a t en t C l a s s i f i c a t l o n e
Il:I/l’lenum D a t a Company Arlington, Virginia
L I I e r g y Re 5 e a r c 11 Ab b t r 3 c t 5’
L n c r g y R e s c a r c l i And IJevelopnient Aclniinist r a t i o n , h a s h i n g t o n , i). C .
‘Ihe I n f o r n i a t i o n B d n h
New Y o r h Times I n f o r m a t i o n S e r v i c e s P a r s i p p a n y , Nek J e r s e y
I n t e r n a t i o n a l InCornidtion S e r v i c e s i n P h y s i c s , ~ l e c t r u t e c h n o l o g y , Colilputers And C o n t r o l c
I n s t i t u t e of Electrical Engineers London, Lngland
MKl S
Maritime Research Information Service
Maritime Transportation Research Board, Washington, D.C.
N’I 1 S
National Technical Information S e r v i c e R i b l i o g r a p h i c Data b i l c
National Technical Information Service Springfield, Virginia
P A PI-. RCH1.M
Paper Chemistry I n f o r m a t i o n
I n s t i t u t e Of P a p e r C h e m i s t r y c\ppleton, P’isconsin
PRLDICASTS F6!5 INDEX
I ’ r e d i c a s t s F4S Index”
Predicasts, Inc. C l e v e l a n d , Ohio
PREDICASTS MARKI‘T
t ’ r e d i c a s t s blarke t Abs t r a c t s a
Predicasts, Inc. C l e v e l a n d , Ohio
Ph 1
P h a r m a c e u t i c a l News I n d e x
Data Courier Inc. L o u i s v i l l e , kentucky
P O L LU T I ON A B ST RA CTS
Po 1l u t i o n A b s t r a c t s
Data C o u r i e r , I n c . L o u i s v i l l e , Kentucky
STAR
S c i e n t i f i c And T e c h n i c a 1 Aerospace Reports
U . S . N a t i o n a l A e r o n a u t i c s And S p a c e A d m i n i s t r a t i o n , Washington, D.C.
SI 1
Specialized Textile Information Servicea
Shirley Institute M a n c h e s t e r , England
swt
S e l e c t e d Water R e s o u r c e s A b s t r a c t s
Water Resources S c i e n t i f i c I n f o r m a t i o n C e n t e r , Washington, D.C.
Toxic M a t e r i a l s Information Center General F I le”
Toxic Materials Information Center Oak R i d g e , T e n n e s s e e
P e t r o leum 4 b s t r a c t s
U n i v e r s i t y Of T u l s a I n f o r m a t i o n Services Department, T u l s a , Oklahoma
W o r 1d .A 1umi num Ab s t r a c t s
American S o c i e t y F o r M e t a l s M e t a l s P a r k , Ohio ( F o r Aluminum A s s o c i a t i o n O f A m e r i c a )
World P a t e n t s I n d e x a
Derwent P u b l i c a t i o n s , L t d . London, U . K .
)A BST RACT S
I 0x 1C
w ’ r I:R I 4 1, s
TULSA
hPI
Restricted access. I>
Offered a s separate f i l e s , e.g.,
‘ Offered For
USE
a s two s e p a r a t e f i l e s :
1 9 7 0 - 1 9 7 1 a n d 1 9 7 2 - p r e s e n t by S D C ; 1 9 7 0 - 1 9 7 1 , 1 9 7 2 - 1 9 7 6 , 1 9 7 7 - p r e s e n t by Lockheed. ( 1 ) P h y s i c s a n d ( 2 ) E l e c t r o n i c s & Computers
w i t h CHEMCON.
e F o r u s e w i t h CLAIMSTM/CHEM a n d CLAIMSTM/GEM
Journal of Chemical Information and Computer Sciences, Vol. 17, No. 3, 1977
149
SMITH, ANDERSON, AND JACKSON Table 11. Details o n On-Line Databases Containing at Least 5 % Patent Information (as of March 21, 1976) APIPA: APPROXIMATE & U M B E R OF PATENTS I \ OK-LINE FILE OF FILE H H l C H I S PATENT I RFORVAT! OS
500,000
I
I
I
400,000
I 19%
100%
7%
100%
409,000’
v i a CHE:dCOU
I
CLAIMS~~ICHEM
CHEMCOK
CASIAICHEMHAME
CONCORJ,ANCE
j,600
:03.030
--
CA PAlENT
A P T IC
I
( v i a CL4IMSlChEhl and CI.AIMS/GEM!
loo\
I COUITRIES
COVE RED^
Prevention and
SUBJECT ARMS COVERED
I
YE4R R4\GE COVERED
BROKERS
7 O N - L I \ E COST
I
O F F - L l h E P R I S T COST
RESTRICTI35S IMPOSED B i PRODUCER
hates
a
- -- -
ICIREPAT C a u n t r ; C o d e s a r e u s e d , ~ . e . ,AR = A r g e n t i n a , AU A u s t r a l l a , BE = B e l g l u m , B R = B ~ a i i l , = C a n a d a , C H = S h l t i e r l a n d , CS = C z e c h o s l o v a k i a , DK Denmark, DL * E a s t Germans, = h e s t German), E l = I r e l a n d , ES = S p a i n , F R F r a n c e , GB G r e a t B r i t a i n , HU Hungar), I L = I s r a e l , IS = I n d i a , I T = I t a l y , 5 4 J a p a n , KL N e t h e r l a n d s , NO = NOT*a!’, OE .Austria, PO = P o l a n d . PT = P o r t u g a l . RU Romania, SF = Finland. SU = Soviet U n i o n . Sh = S h e d e n , US United S t a t e s , iA South A f r l c a CA DT
-
.
-
-
P h a r m a c e u t i c a l s 1 9 6 3 - P r e s e n t ; A g r i ~ u l t u r a l 1~9 6 5 . P r e s e n t ; Polymers 1 9 6 6 - P r e r e n t , A l l Chemirtr) I S ’ O - P r e s e n t ; A l l o t h e r p a t e n t s 1 9 7 4 - P ~ e ~ e n t .
in on-line retrieval of patent information was G. D. Searle Co. which developed the SOLD System for on-line retrieval from a large portion of Derwent’s Central Patents Index file in 1973.’ For the most part, however, it remained for the on-line information “brokers”, such as Lockheed Missiles & Space Co. (Lockheed), System Development Corp. (SDC), and Bibliographic Retrieval Services (BRS), to make on-line retrieval of patent information available to anyone with a computer terminal and sufficient money to pay the hourly usage fee. Currently, there are over 80 databases of all types on-line. The fact that some of these databases contain patent information is of little significance to the brokers. Their main concern is that each database generates sufficient income to justify its continued on-line availability. ON-LINE DATABASES CONTAINING PATENT INFORMATION At the present time there are 26 on-line databases which contain patent information. These are listed in Table I. More details are given in Table I1 for those databases in which patent information constitutes at least 5% of the total file. Additional information on all of these databases may be readily obtained from either the producers, the on-line broker^,^,^ or the Directory of Computer Readable Bibliographic Data Bases compiled by Williams and Rouse.4 THREE ON-LINE DATABASES PROVIDING COMPREHENSIVE COVERAGE OF CHEMICAL PATENTS Of the databases listed in Table 11, there are only three which attempt to provide comprehensive coverage of the 150
chemical patent literature. These three are Chemical Abstracts Condensates (CHEMCON), Class Code Assignee Index Method Search/Chemistry (CLAIMS), and World Patents Index (WPI). CHEMCON is produced by Chemical Abstracts Service of Columbus, Ohio. CLAIMS is produced by IFI/Plenum Data Company of Arlington, Virginia. WPI is produced by Derwent Publications Limited of London, England. CHEMCON is carried by three brokers, Lockheed, SDC, and BRS, while CLAIMS is carried exclusively by Lockheed, and WPI is carried exclusively by SDC. It is important to note that the last two of these, CLAIMS and WPI, are pure patent databases and thus geared exclusively toward patent collection and indexing. On the other hand, CHEMCON is a literature service which looks upon patents as simply another form of technical literature. CHEMCON may be used in conjunction with its companion files CA PATENT CONCORDANCE, CASIA, and CHEMNAME, which are also produced by Chemical Abstracts and are currently offered only by Lockheed. CASIA and CHEMNAME do not refer directly to citations but provide additional search terms which can be used either alone or in conjunction with search terms from CHEMCON to find patents in CHEMCON. CLAIMS also has a companion file called CLAIMS/CLASS which enables the searcher to determine appropriate U.S. Patent Office Classification Numbers for use in finding patents in CLAIMS. The balance of this paper will deal exclusively with a comparison between CHEMCON (including its companion files), CLAIMS, and WPI. PREVIOUS COMPARISONS There is extremely little research literature comparing the on-line retrieval afforded by different chemical patent da-
Journal of Chemical Information and Computer Sciences, Vol. 17, No. 3, 1977
ON-LINERETRIEVAL OF CHEMICAL PATENT INFORMATION
CLAIYS~'/
FAPLRCHEM
GEM
7 .I I 100,003
45,033
2,000
54.000
-,600
1,200,000
30%
168
IDC?
5DC
L o c k;ee d
SDC
SDC, pL S i S t s m 5 ,
Lockheed
Snlrle) lnitltilte
S6i/hour (SDC d Lockheed: Z0.10/Cl
5 i2jIhcdr
#man-subscriber r a t e :
tatlo"
s0.;0/cltatlan (non.subscriber ratel
iSDC i Lockheed)
i
I
\one
hone
Si0i?2-l:
Sc.lo~c,tatlon
SliOIhcur Inan-subscriber r a t e ) S0.li/cltatlon
("an-subscriber rate)
In-depth !ndexing T e r m s L i m i t e d To
S u b s c r i b e r s Cnl!
C u r r e n t l y o n l y r e s t d e n t p a t e n t e e s O T patentees f r o m c o u n t r i e s o t h e r t!la: t h e 26 listed are c o v e l r d l n A h , C h , C S , D h , D L . E S , H U , l i , I L , \ 3 , P O , RU. S F , S U , S H . Depends upon c o n t r a c t e d hcurs o f u s a g e .
tabases. This is due to the short period that these databases have been available in the on-line mode. For example, WPI, which contains the largest number of patents, did not go on-line until February 1976. The first comparative study, as far as the authors are aware, was presented by Hare in March 1976.5 His study of the results of two searches conducted in CHEMCON and WPI indicated that WPI was noticeably superior in the specific fields he searched, namely, quinoxaline N-oxides and the fermentation production of citric acid by Japanese companies. Hare concluded that WPI was weaker than CHEMCON in term searching but made up for this lack with its unique chemical fragment codes. COMPARISON METHODOLOGY The methodology of this comparison will involve the use of two basic approaches: comparison of database coverage and comparison of parallel search results from each of the three files. Coverage will be analyzed in two ways: first, in terms of the documents which are included in each database, and, second, in terms of the retrieval parameters-both subject and non-subject-which may be accessed in each database. The parallel search results will be analyzed in terms of total relevant answers retrieved and a concept which the authors call "exclusive relevant answers". The answers which were missed by each system will also be analyzed and their causes discussed. Recall and precision calculations will not be made for reasons which will be apparent in the discussion of the search results. Other comparative criteria, such as response time, user effort, output form, and cost will not be discussed in this paper. Discussion of the first three would be meaningless at this point for a variety of reasons; e.g., significant degrees of difference
exist between the searchers in their experience with on-line systems and their familiarity with the databases. Thus, reproducible measurement of these factors was not possible. The last-mentioned criteria, output form, is basically similar for all three databases, Le., a printout of bibliographic citations obtained either on-line or off-line. Typical citations include titles, patent numbers, database accession numbers, patentee names or codes, patent filing information, and classification codes. The format of the citations can be tailored to the user's needs. Neither abstracts nor full text are currently available from any of the three databases. The parallel searches which are studied will be "parallel" in the sense that the same questions will be asked of each system. The results will thus depend on the document coverage and retrieval parameters available in each database, as well as the strategy employed by the search specialist. The searches run in CHEMCON have the further qualification that only the active file, from 1972 to present, was used. CHEM7071 was excluded for reasons of expediency: the keyword indexing for the 1970-1971 period tends to be less detailed than the indexing used in the current file and the possibility of using both left-hand and right-hand truncation in the SDC file would require a separate search strategy to have been devised, executed, and analyzed for each of the sample questions. Because of the small number of sample searches this comparison is not intended to be a complete evaluation of the capabilities of each database. The conclusions which are drawn should be viewed in the context of the sample searches only. DOCUMENT COVERAGE
A tabular comparison of the document coverage of each of the three databases is shown in Table 111, arranged by country.
Journal of Chemical Information and Computer Sciences, Vol. 17, No. 3, 1977
151
SMITH, ANDERSON, A N D JACKSON Table 111. Document Coverage Countries
CHEMCON‘
Argentina Australia Austria Belgium Brazil Canada C2echoilovakia Denmark Finland France Germany, East Germany, h e s t
....
Table V. Retrieval Parameters-Subject kP1
CLAIMS-CHEM
19iO-Presentc 1970-Present 1970-Present 1976-Present 1970-Present
.... .... 19 jo-Presentb
.... ....
Great Britain Hungary
19i0-PresentC
1975-present
....
lndia Ireland Israel Italy Japan Netherlands Norway Poland Portugal Romania South A f r i c a Spain Sweden Switzerland United S t a t e s U.S.S.R.
1970-PresentC
....
....
....
1963-1969 197i-Present
....
....
Clasrlflcat>an index Terms
C*SIA
CLAIMS plPl x
x
x
x
xa
.
x
x
x
C.A. Publication S e c t i o n
....
19-4-Present
1963-Present 1363-Present 1963-Present 1963-Present
lgjO.Presentb
Patent Classiflcatlon (Orlglnal)
u.S.
....
1950-PTeSe”tb 195G-Presentb
x
U.S. Patent Classification
(Cross Reference) x
Classification Code
....
....
....
1963-Present 1963-Present 19*4-Prezent
....
x
Derwenr Manual Code
1950. presentb ....
1970-Present‘
....
....
....
1974-Present 1975-Present 1963-Present
....
l9:o-?resentc 1970-Present 197O-PresentC 19 7 0 -presentc 1970-PresentC 19?0-Present 197O-Present‘
~ n t e ~ n a t t o n aPatent l
Chemcon (Lockheed1 x
Title Terms
....
197O-Prese”tc lgiO.Presentc 19:0-PresentC 19iO-Present 197O-Prese”t= 1970-Prese”t 1970-Present
1970-PresentC 1370-1976 197F-Present 1970-Present 197O-Prese“tc
Chemcon (SDC)
Pa rame t e T
....
1975-1976 1963-1969 1975-Present 1963-Present 1976-Present 1963-Present 1975-Present 1974-Present
nerwent Class
xa
Derwent Punch Code
x
.
....
.... 1974-Present 1963-Present 1963-Present 1963-Present
.... ....
Ring Index Number
....
Rare Fragment Numbers
Xb
.
x=
-
....
1950-Present
....
a
1970-Present 1972-Present 1972-1975
a
Includes CHEM70il Coverage.
Table VI. Searches Used in Study
Covered as equivalents to U.S. only.
‘ Currently patentees
covers patents issued to resident patentees o r f r o m c o u n t r i e s other than the 26 covered by CAS
SEARCH
__
CHEMCVN
CHEhKVh [Lockheed]
(SDC)
C.A. A c c e s s i o n %umber
x
SPECIFIC COMPOUND
BLEOMYClh’, ITS DERIVATIVES AND RADIOISOTOPES
2
COMPANY
BRITISH ALUMINIUM
3
CHEMICAL STRUCTURE
12 -PHENYL (1,3)0XAZ IN0 ( 3 , 2 - d) - (1,4)BENZODIAZEPlh‘ES
4
CHEMICAL STRUCTURE
2-ANILINONICOTINIC ACIDS, INCLUDING NIFLUMIC ACID
5
COMPANY/ ACTIVITY
E. MERCK TRANQUILIZERS
6
EQUIVALENTS
U.S.
9
x
D e m e n t A c c e s s i o n humber xa
Xa
Accession Year
Company Code cornpan) sane
OT
Patent Asilgnee
Author
t
Xb
x
x
Patent Number
x
x
patent Count*)
x
x
Priority Date
xc
OT
Inrentor
P~lorityhwnber
c
Claim Type (C, P , MI
Notes:
a
Through l m i t command u s i n g
acce551on
number ranges
By I C s - each term I S indexed separately, e . g . ,
SyntexICS,
Lilly/CS and Eli/CS By Stringrearch Only available 1970-Present e
Only available 1972-Present through
limit
command.
O n l y available 1973-Present?
It should be noted that Derwent’s coverage of chemistry developed stepwise beginning with pharmaceuticals in 1963 and followed by agricultural chemicals in 1965, polymers in 1966, and finally full coverage of chemistry in 1970. It should also be kept in mind that each database employs a slightly different definition of what constitutes a chemical patent. Derwent, for example, uses a somewhat broader definition than Chemical Abstracts, and for that reason picks up a large number of “fringe” patents. 152
3,769,282 PATENT FAMILY MEMBERS
RETRIEVAL PARAMETERS
x
Prlorlty countrv
TITLE
1
Table IV. Retrieval Parameters-Nonsubject Fararneter
SEARCH
NUMBER _ _ _ TYPE _
The retrieval parameters which are accessible in each of the three databases are shown in Tables IV and V. CHEMCON offered by SDC and CHEMCON offered by Lockheed are listed separately because the Chemical Abstracts Condensates files can be accessed differently owing to differences in the retrieval programs of the two brokers. CASIA is also listed separately to illustrate that in itself it contains no patent information but provides additional index entry points to data in CHEMCON. The footnotes should be read carefully, since they indicate restrictions on how certain parameters may be used. It is apparent from Table V that the only subject parameter common to all three databases is the title terms category. Furthermore, the significance of this overlap should be minimized since the titles are frequently enhanced by the different database producers and consist of uncontrolled vocabulary. In essence, there is no subject retrieval parameter which appears consistently the same in all three databases. Of the three files only WPI makes available on-line all of the retrieval parameters which are available in the original manual or batch searchable database. Both CHEMCON and CLAIMS are designed to provide quick, nonexhaustive answers. These answers may then be used as aids in searching for additional answers in the printed Chemical Abstracts Subject Indexes or the batch-searchable IF1 comprehensive Database of Patents.
Journal of Chemical Information and Computer Sciences, Vol. 17, No. 3, 1977
ON-LINERETRIEVAL OF CHEMICAL PATENT INFORMATION Table VII. Search Strategies and Performance Dates SEARCH
1.
B L E O M Y C I N , ITS DERIVATIVES A N D
CHEMCON*/CASIA
AND P ( U C )
BLEOMYCIN:
D A T t PtRFORMED:
B R I T I S H ALUMIhIUM
2/4/77
WPI
C L A I M S S E A R C H STRATEG) AND DATE PERFORMED
on
SEARCH STRATtCY AND DATE P E R F O R M E D
H L E U M Y C I K S f r o m E X P A N D BLLObl
{[BOZ-B(MC) OR C O Z - B ( M C ) ] O R [ O l l / B , B 1 ( M P ) ] OR [ Oll/C , C 1 ( M P ) AhD S T R i N G S E A R C H :BLEO: (TI)
DATi. I ' L R F O R M I D :
DATL P E R F O R L I E D :
BLEOMYCIN O R BLEOMYCINIC
CIiEMCON:
RADIOISOTOPES
2.
SEARCH STR4TEGY
AND DATE P E R F O R M E D
1/14/77
1)
1/7/77
CHtMCUi: B R I T I S I I / C S AND ALUVIUIUM/CS AND P ( U C ) DATL PLRFORMLD:
3.
1 2 - P H E N Y L (1,3)OX.::IKO (3,Z-d) (1,4) BENZODIAZEFINtS
1/3/>-
[ (PllthYL:OW2lhO: BLN:ODI.AZtPIh:) OR
CHLMCON:
(Ox-2:lKo: -
B L K ; O D I A Z L P I N : ) on (OXAZINO AhD IiIh:ODIA\;I.PIU. JJ i i D
( L k P A N U U X A L I N AhD I X P A N D 8 L h Z O D I A L I . P ) OR ( L X P A N D P H I h Y L U X Z I N AhD L X P , \ K D BL\:UUIA~I.Pl l a 1 1 relevant tern!, here selccted f r o m each e x p a n d )
40388 (RR) OR ({[BOb-LOS(MC) OR C O b - t O S ( M C ) OR L O b - C O S ( M C ) ] OR Z b d / B U1 82 C C1 CZ E J ( M P 1 I O R
I 'AND '[ 373/n,Bl,BZ,C;~i.C2,~3(MP)} rco:Drogb/oiiicii
&hD
I'LUC)
STRINGSLARCH :OXAZINO: :BtNZODIAZlPIN: (TI))
(TI) AKD
belect: 1,3[h) OWiI.VO(4Ii) In B LN Z 0D I iL P 1h L CI I L >INAMI
i
U A l L PIKFOK\IIU
1/14/77 UA1L I'LHFORMEU:
1/7/77
1 3(h)OXA"INO ''le ( ~ k ) B L h Z 0 ~ 1 4 Z 1 I ' ~ P r i n t R e g l s t r y h u m b e r of ret r ie v a I cit a t Ions,
S e l e c t R e g i s t r y h u m b e r from CASIA f i l e . L i m i t /PAT. DATL P t n t o m L U :
3.
1 2 - P H E N Y L ( 1 , 3 ) 0 X ~ ZI h O (3,2-d1(1,41 BtN2ODIAZEPIYES
CHEMCOK:
i/5/7:
[(PHEKYL:OUZIUO:B E N Z O D I A Z E P I N : ) on (OXAZINO: B E K Z O D I A Z E P I N : ) OR
(OXAZIKO A N D AND
BFN:ODIlELPIK:)]
( E X P A N D O X A Z I N AND LXPAND BEirZODIAZTP) OR (EXPAND P H E N Y L O X A L I N AND L X P A N U BEYZODIAZtP) ( a l l r e l e v a n t t e r m s kere s e l e c t e d f r o m ea,:h e x p a n d )
40388 (RR) OR ( { [ B 0 6 - € O S ( M C ) OR C 0 6 - E 0 S ( M C ) OR E 0 6 - E 0 5 ( M C ) ] O R 2 6 6 / B B 1 BZ C C 1 CZ E J ( M P ) ] O R ~ c o i ~ ~ o g i i o ij ( i 'cA N ~D [ 373/B,BI,BZ,C,Cl,C2,E3(MF)}
1
AND
STRINGSEARCH :OXAZINO: ( T I ) AND : B E N Z O D I A Z E P I N : (TI))
P(UC) D4TL PI RPOR\lID.
1/14/77 DATE P E R F O R M E D :
1/7/77
1 , 3(h)OXAZlhO P r i n t R e g i s t r y Number of retrieval citations.
Select R e g i s t r y hwnber from CASIA f i l e . I.imit /PIT. DATL P F R F O R M t D
1.
2-Ahll.lNO N I C O T I I ~ I C A C l U S , lNCLUUlNG NIFLUMIC A C I D
CHIMCDN:
1/5/--
[ A N I L I K O : K I T i l II K : O R A N I L I N O .WU NICOTIN
( I X F A N D A N I L l h O N I C O T I N ) OR (LXPAND PHtNYLAMINONICOTINri) OR
{ [ B 0 7 - D 0 4 O R C O 7 - D O 4 OR E 0 7 - D O 4 ( M C ) ] AND [ ( 3 1 4 AND 3 7 3 AND 4 7 4 )
NICOTIU: OR P H L N Y I . W I N O AND N I C O T I N : on P H I . K Y L , I M I N O : -
(EXPAND T R I FLUOROMETHYLAN1L I N O N I C O T I N ) O R I F X F A N D TRIF1,IIORO-
/B,Bl,B2,C,Cl,C2,E3(MP)])
O n PHI..+I~..LMINO:
AND l(466 OR 4 6 - ] / B , B l , B Z , C , C l , C Z , EJ(MP)]
PYRIDIKI.CARdOXYL1C
OR P H t S Y L A M I h O AND FYRIDlhLCARBOXYLIC O R A N I L I h O P Y R 1 DINI; CIRBOXYLIC OR Akll.lNO
DATL P E R F O R M E D : A h I L I N O O R TRIFLUOROMETHYL
ri)
1/7/77
'
( a l l relevant t e r m s here i e l e c t e d from e a c h e x p a n d )
( 3 ) S PHENYLAMINO 3 A N D (1 O R 2 )
(4)
' r i n t e d R e g i s t r y Numbers j R h = i n C,\SI,\ file. irni t F A T ,
JATL F t R F 0 R M I . D :
(Kh's)
1/5/77
_____~
,.
E . MLRCK I R A h Q U l L I i I R S
CHtMCUh
(.I=: 0 5 4 1 4 4 O R
M L R C K / C S I U D P A T E h T / C S jIND A L L TRkhQUII.: A N D P (UC]
,IC = 0 5 4 1 4 6 )
AND T R A N Q U I L ? DATI. PLRFORMTD:
M E R E ( P C ) ANlJ { [ B I Z - C I O ( M C ) o n C ~ Z - C ~ O ( M C )on I [600/B, B l ,BZ , C , C 1 , C Z (MPI I]
2/17/77 DATE P E R F O R M E D :
1/7/77
~~
1.
U.S. 3 , 7 6 9 , 2 8 2
CA P A T E N T c u ~ C u m \ h c i : :
F A I E K T FAbIILY MLMBERS
L X P A N D P4 = U S 3 7 6 9 2 8 2 D A T L PLRFORMED:
*
U S 3 7 6 9 2 8 2 (FN)
1/24/77
DATL P t R F O R M C D :
1/24/77
DATE PERFORMED:
1/24/77
A L L CHEMCON s e a r c h e s were p e r f o r m e d o n t h e S D C f i l e .
Journal of Chemical Information and Computer Sciences, Vol. 17, No. 3, 1977
153
SMITH,
ANDERSON, AND JACKSON
Table VIII. Search Results
LRIHCO\,CISI 4
TOTAL H I T S
I
11
CL*IMS
I
1
WPI I2
I
1 DESCRIPTION OF PARALLEL SEARCHES
A total of six questions was employed in the study (see Table VI). The questions were hypothetical ones which the authors considered to be representive of the types of questions which are posed in the industrial/pharmaceutical environment in which they work. This environmental bias was considered an unavoidable factor since it enabled the authors to draw upon their areas of expertise in preparing search strategies and in analyzing the results.
with all on-line searches, was to be cost effective. At the end of each strategy, the date the search was performed is indicated. Since each of the databases is regularly updated, the performance date has a direct bearing on the size and content of the file at the time of the search. The use made of CHEMCON's companion files, CASIA and CA Patent Concordance, is noted in the CHEMCON column of Table VII. Hereafter, when reference is made to the CHEMCON search results, it will be assumed that the answers contributed by CHEMCON's companion files are
SEARCH STRATEGIES AND PERFORMANCE DATES The search strategies used to answer each of the six questions in the three-databases are presented in Table VII. The Boolean logic used in these searches is typical of the search systems provided by most on-line brokers. Search strategies are not necessarily exhaustive; the object of the searches, as 154
SEARCH RESULTS The bibliographic citations which resulted from the six searches were screened for relevancy by one or more of the authors by means of the corresponding hard copy abstract publications, Le., CHEMCON via Chemical Abstracts,
Journal of Chemical Information and Computer Sciences, Vol. 17, NO. 3, 1977
ON-LINERETRIEVAL OF CHEMICAL PATENT INFORMATION CLAIMS via the U.S. Patent Office Official Gazette, and WPI via Central Patents Index Basic Abstract Journal. These hard copy sources reflect the different meaning that each database ascribes to the word “citation”: CAS and Derwent consider it to be the first-issuing patent in a family of equivalent patents, both US.and foreign, whereas IFI/Plenum considers each U S . patent to be a citation, regardless of whether it is a continuation or divisional of another U.S. patent. These differing interpretations of the concept of a citation are a potential source for confusion which must be kept in mind when analyzing the search results. The results of the six searches are tabulated in Tables VII1.A through VII1.F. The authors employed the following conventions in totaling the search results in Tables VII1.A-E: Total hits is simply the total number of bibliographic citations which were produced as search output for a given search by each of the databases. Relevant hits is the number of citations considered to be relevant, based on a review of the corresponding abstract. Relevant inventions is the number of relevant hits minus “duplicate” hits. These “duplicate” hits are equivalents (i.e., members of the same patent families) of other hits in the search output. Exclusive relevant inventions is the number of relevant inventions found in that particular database only. Relevant inventions not retrieved is calculated by determining the total number of distinct inventions (Le., individual patent families) found collectively by the three databases and then substracting relevant inventions. Explanation of why relevant inventions were not retrieved is the authors’ determination of the reasons why relevant inventions were not retrieved by the search. Relevant U.S., 1972-present is calculated for the sole purpose of comparing the only area of document coverage having a high degree of overlap between all three databases. The figures indicate the number of relevant U S . patents found out of the total number of relevant U S . patents issued during the period, 1972-present. The authors employed a different set of conventions in analyzing the answers to the equivalents search (Table VII1.F). In this case, the object of the search was to find as many members of a single patent family as possible. Thus, all answers pertain to one invention only. In this context, “total hits”, “relevant hits”, and “exclusive relevant hits” refer to family members found while “relevant hits not retrieved” is the number of known family members not found. For the purposes of this comparison, the set of total known patent family members was restricted to patents found only in the three databases under consideration. DISCUSSION OF SEARCH RESULTS The obvious fact which emerges from the parallel search results is that no one database found all of the answers all of the time. This is shown clearly in Table IX which focuses on the number of relevant answers found in each database. (Answers in this context mean relevant inventions for searches 1-5 and relevant patent family members for search 6.) In general WPI achieved the best results. This is due to WPI’s combination of broad country coverage and high-specificity retrieval parameters such as company codes and chemical fragment codes which provided superior results in these instances. CHEMCON achieved high results in searches 1 and 4 because it contains as index terms the names of specific compounds (BLEOMYCIN) and chemical fragments (NICOTINIC, NIFLUMIC, etc.). CLAIMS failed to surpass its competitors in any of the searches because of its limited country coverage and its heavy reliance on title terms for subject matter retrieval. Search 6, the equivalents search,
Table IX. Summary of Search Results SEARCH _u1 _
SEARCH
SEARCH
SEARCH
_1 2 _ _1 3 _ - i d - -
SEARCH
SEARCH
i5
i6
POSSIBLE RELEVANT ANSWERS
30
68
13
51
35
10
CHEMCON
24
14
5
26
3
6
9
31
3
11
3
4
24
66
11
34
32
10
CYIIMS
WPI
Table X. Exclusive Relevant Answers SEARCH
SEARCH
SEARCH
SEARCH
SEARCH
SEARCH
2
0
I1 u2 Y3 Y4 XS 46 CHEMCON
6
0
1
11
CLAIMS
0
22
1
2
1
0
WPI
4
29
6
20
29
4
seems to indicate the superior coverage of WPI, since only WPI found all ten patent family members. Another way of looking at the search results is to focus on the relevant answers which were retrieved by only one of the three on-line databases. These so-called “Exclusive Relevant Answers” are an indication of how many answers would have been missed if that particular database were not searched. A summary of exclusive relevant answers is shown in Table X. Although WPI generally supplies the highest number of exclusive relevant answers, it is readily apparent that CHEMCON and CLAIMS are also quite capable of providing exclusive relevant answers. The ability of each database to provide unique sets of answers is a direct consequence of their differences in document coverage and retrieval parameters. Precision and recall ratios have not been calculated since to do so requires that a single meaning be applied to the word “citation” so that the results from each database are directly comparable to the others. Whichever definition of “citation” is chosen will necessarily enhance the Precision and Recall figures for one database to the detriment of the others. Moreover, to present a rigorous comparison of retrieval capability, it would be necessary to restrict the comparison to the only area of overlap between the databases, namely the US.Chemical patents, 1972 to present. This area of overlap will be examined below, but to focus on it exclusively would drastically limit the size of the files being compared, and would fail to reflect the coverage advantage enjoyed by each of the systems, e.g., CHEMCON’s and WPI’s broad foreign coverage and CLAIMS’ coverage back to 1950. The authors prefer the approach of highlighting the database differences by discussing the relevant answers which each of the databases failed to retrieve. A summary of answers which were missed in all six searches is presented categorically in Table XI. Keeping in mind that these figures are directly dependent on the questions which were posed, there are a few obvious inferences which may be drawn. The first category-Too Early For File-reaffirms what was determined by the coverage comparison, Le., that CHEMCON and WPI are inferior to CLAIMS if older patents are to be included in the search. Similarly, the second category shows how a lack of foreign coverage results in CLAIMS missing many answers which are outside the scope of its coverage. CHEMCON also missed a number of answers owing to its exclusion of patents on the fringe of chemistry. A more serious problem is the third category of answers which ought to have been in the file based on the boundaries of its coverage, but are missing. All three files are guilty of such lapses. The fourth category of answers, those which were missed due to title terms not being present (only CLAIMS and WPI were searched using this parameter), serves to
Journal of Chemical Information and Computer Sciences, Vol. 17, No. 3, 1977
155
SMITH, ANDERSON, AND JACKSON Table XIII. Exclusive Relevant U.S. Patents, 1972-Present
Table XI. Summary of Missed Answers
r
I
1
SEARCH
MISSED LVSWERS
SEARCH
84 -
SEARCH
CHEMCON
0
0
1
1
0
Too E a r l y For F i l e
71
0
21
CLAIMS
0
0
1
2
1
2
O u t s i d e S c o p e Of F i l e
23
129
2
WPI
0
3'
2
4
3
3.
M l s ~ i n g From F i l e
4.
Terms N o t I n T i t l e
Indexed Elsewhere In F i l e
1 1 11 1 1
2
6
0
15
5
*
IS -
Non-chemical
Table XIV. Summary of Missed U.S. Patents Issued since 1972. Searches 1 through 5
1;
2:
ON-LINE F I L E
CAUSE O F MISSED ANSWERS
Misspelled Retrieval Parameter
CLAIMS
CHEMCON I
I
R e t r i e v a l Program Limitation
WPI I
1.
Terms N o t I n T i t l e
0
9
2
2.
indexed Elsewhere In F i l e
11
0
6
3.
R e t r i e v a l Program Limitation
0
2
0
4.
U.S.
I
TOTALS
I
Table MI. Search Results-US. Patents, 1972-Present Only SEARCH
~1 1
SEARCH 12 _
SEARCH
SEARCH
13
(4
_
~
SEARCH *5
POSSIBLE RELEVANT U . S . PATENTS
12
10.
5
11
6
CHEMCON
10
3
3
8
0
CLAIMS
11
7
2
7
3
UP1
11
10
4
6
5
-
C i t e d As Equivalent To Pre-1972 Nan-U.S. Patents
-TOTALS L
3 of 1 0 are n o n - c h e m i c a l .
underscore the danger in relying upon title terms for thorough searching. The fifth category reflects the searchers' inability to predict all of the retrieval parameters associated with the relevant hits. (In some of the cases for CHEMCON this was due to incomplete indexing of the chemical compounds.) The sixth category indicates that CHEMCON is the only one of the three files to exhibit the human errors of misspelled retrieval terms. (Obviously, the authors are not convinced on this limited evidence that human errors are absent in CLAIMS and WPI!) The final missed answers result from the fact that left-hand truncation was not available for the CLAIMS and CHEMCON files; otherwise, the terms embedded in the titles might have been used to retrieve the patents which were missed. DISCUSSION OF SEARCH RESULTS FOR 1972-PRESENT U.S. PATENTS ONLY Coverage differences can be eliminated theoretically by examining only those search results which fall in the overlap area between the three on-line files. As mentioned above, this overlap area consists of U.S. patents issued since 1972. Table XI1 indicates the relevant U.S. patents which were found in each of the first five searches. Search 6 is excluded for the obvious reason that the only relevant U.S. patent was known a priori. The obvious inference which may be drawn from Table XI1 is that when only U S . patents are sought CLAIMS is now on a roughly equal footing with the other two on-line files. Except for WPI in search 2, none of the databases are able to supply all of the answers to any of the searches. 156
83 -
~~~
1.
7.
SEARCH
12 -
CHEMCON
-
1 1:
SEARCH
(1 -
ON-LINh FILE CAUSI O F
Journal of Chemical Information and Computer Sciences, Vol. 17, No. 3, 1977
I 4 I o I o I 15
11
8
INTERACTIVE SUBSTRUCTURE SEARCH SYSTEM
retrieval becomes outdated. Improvements are taking place so fast (in databases, search programs, and communications techniques and equipment) that there is little which can be said now which will remain unchanged for more than a few months. Users are advised to stay in close contact with the database producers, on-line brokers, and communications vendors to keep abreast of the changes. LITERATURE CITED ( 1 ) G. V. O’Bleness and G. Cohen, “Presentation of ‘SOLD” Computer Retrieval System”, paper presented at Central Patents Index Subscribers
Meeting, Washington, D.C., May 1973. (2) R. Donati (Lockheed Information Systems), “Overview of Patents Covered by DIALOG Retrieval Service’’, presented at Session on Patent Literature Systems from the Symposium on Intellectual Property as Sources for Scientific, Technical and Business Information sponsored by the Canadian Patent Office and others, Ottawa, Ontario, Nov 24, 1976. (3) M. Bonner (System Development Corporation), private communication, Dec 1976. (4) M. E. Williams and S. H. Rouse, Ed., “Computer-Readable Bibliographic Data Bases - A Directory And Data Sourcebook”, American Society for Information Science, Washington, D.C., 1976. (5) J. B. Hare, “On-Line Searching for Patent Information-A Comparison Of The CHEMCON And WPI Data Bases”, paper presented at Pharmaceutical Manufacturers Association Science Information Subsection Meeting, Hot Springs, Va., March 1976.
An Interactive Substructure Search System R. J. FELDMANN and G. W. A. MILNE National Institutes of Health, Bethesda, Maryland 20014
S . R. HELLER* Environmental Protection Agency, Washington, D.C. 20460 A. FEIN, J. A. MILLER, and B. KOCH Fein-Marquart Associates Inc., Towson, Maryland 21 21 2 Received April 29, 1977 A family of programs for searching on the basis of chemical structure through data bases of chemical information has been assembled and is now publically available on a commercial computer network. The design of and results obtained with these programs are reported, and the status of the system is described and discussed with particular reference to the NIH-EPA Chemical Information System (CIS) and the Toxic Substances Control Act (TSCA).
INTRODUCTION The ability to use a computer to search for a particular chemical structure or substructure in files of chemical data has for some time been sought after by chemists, and the need for such capability is currently becoming very pressing. A widening interest in the relationships between chemical structure, on the one hand, and various properties, such as toxicity, pharmacological activity, and mutagenicity, on the other, has led in recent years to considerable efforts to generate computer programs which will enable the scientist to locate all Occurrences of a given structure or substructure in chemical databases. Further decisive pressure behind these developments has been provided by the enactment, in November 1976, of the Toxic Substances Control Act (Public Law 94-469). This law will require that chemical compounds whose use in commerce is envisaged must first be located within Governmental regulatory files. If they are not in these files, they are deemed “new” and their use in commerce becomes subject to a series of regulations, depending upon their respective toxicities. In this paper, we describe the NIH-EPA substructure search system, a family of interactive computer programs which allow the user to define a chemical structure or substructure and then to search for occurrences of the structure or substructure in the various databases of the NIH-EPA Chemical Information System. During the past 25 years, a considerable number of methods of machine representation and handling of chemical structure have been proposed and studied for their utility in manual and
automatic data retrieval methods. Some of the better known among these include the German GREMAS system,’ the British CROSSBOW system,* and, in the U S . , the programs developed at the National Cancer I n s t i t ~ t e Walter ,~ Reed Army Institute of Research: Chemical Abstracts Service,’ and the Army’s Chemical Information Data System.6 With the resulting progress in the area. of computer-handling of chemical structures, it has become clear that structure records in the form of two-dimensional connection tables are absolutely necessary for structural representation and that both linear notations and chemical nomenclature are at a serious disadvantage vis-5-vis connection tables as far as unambiguity and completeness are concerned. For many years, however, there was no adequate means of screening such connection tables and so, in spite of their intrinsic value, they were not used in any retrieval system. In the area of structure retrieval, most effort was expended in the development of systems that were designed to fulfill a specific local need. A system of this type that is currently perhaps the most widely used by the chemical industry is the CROSSBOW program,2 a dozen or so versions of which are in operation around the world. Other systems, such as the GREMAS system’ or the Walter Reed system4require special equipment that is not generally available. While the larger industrial organizations can often afford such luxuries and often also demand in-house facilities of this sort, such systems are of little value to the general chemist. It is this dilemma which led to the development of the N I H nested tree structure searching system, which can operate on a connection table database and which is susceptible to wide dissemination and
Journal of Chemical Information and Computer Sciences, Vol. 17, No. 3, 1977
157