Stephen R. Heller
Environmental Protection Agency
Washington, D.C. 20460

George W. A. Milne
National Institutes of Health
Bethesda, Md. 20205

A major activity in modern chemistry is the identification of chemical substances from laboratory measurements made on those substances. The National Institutes of Health/Environmental Protection Agency (NIH/EPA) Chemical Information System (CIS) permits searching through libraries of numeric data in a variety of efficient and inexpensive ways. CIS is used by scientists all over the world seeking to identify unknown chemicals.

A challenging task now dominating CIS development is the prediction of a substance's properties and behavior from its molecular structure. The short-term promise of such predictive ability is a tremendous saving in resources: large numbers of expensive and time-consuming laboratory measurements can be obviated by matching the properties of all substances in a set against selected experimental measurements. A longer-term goal of the CIS development effort is an understanding of the relationships between structure and properties.

This paper will concentrate on the use of the CIS spectral data bases in support of Toxic Substances Control Act (TSCA)-related monitoring activities. Because it is effective in solving problems related to TSCA, the CIS is being put to broader use than was first envisioned.

With modern computer technology and electronics, the cost of computation has come down, while access to computers has increased through the use of computer networks accessible over standard telephone lines. Consequently, we have been able to develop a highly interactive, disk-oriented chemical information system of numerical data which is readily and inexpensively available to our own agencies' laboratories, to our contractors, grantees, and scientific collaborators, and to the general public.

Environmental Science & Technology
This article not subject to U.S. Copyright. Published 1979 American Chemical Society
A major problem that a chemist has in searching a chemical data base is that the best questions are often not known. An interactive system can provide the answer to a question immediately, and this enables the user to see the deficiencies in the question and frame a new query. In this way, a feedback loop is completed: the scientist acts as a transducer, "tuning" the query until the system reports precisely what is required. The NIH/EPA Chemical Information System has been designed around this general approach, which requires that all of the appropriate information be gathered together within one computer system. This is an ideal that the CIS has tried to achieve by using the data available in those areas of science that are felt to be most valuable in solving the enormous and diverse problems of improving health and the environment.
System design
CIS consists of a collection of chemical data bases plus a battery of computer programs for interactive searching through these disk-stored data bases. In addition, CIS has a data referral capability as well as a data analysis software system. Thus, CIS has four main features: numerical data bases; data analysis software; a structure and nomenclature search system; and data base referral.

The numeric data bases include files of mass spectra, carbon-13 nuclear magnetic resonance (nmr), X-ray diffraction data for single crystals and powders, acute toxicity data, and aquatic toxicity data. There are several bibliographic data bases directly associated with the mass spectrometry and X-ray crystallography files. The analytical programs include a family of statistical analysis and mathematical modeling algorithms, and programs for the second-order analysis of nmr spectra and energy minimization of chemical conformations. Programs that design chemical syntheses are being tested and may, if viable, become part of CIS in the future.

At the center of CIS (Figure 1) is the Structure and Nomenclature Search System (SANSS). The data base associated with SANSS contains the names, synonyms and structure records of some 140 000 chemical substances, including all the compounds for which CIS has some numeric data. This single, large data base can be searched by name, structure or substructure and provides the Chemical Abstracts Service (CAS) registry number for any compounds retrieved. The registry number can be used, in turn, to locate the compound in any of the other CIS data bases, and retrieve the data pertaining to the compound. The system can be used in the reverse sense: a chemist, having identified an unknown from its mass spectrum, will be provided with the CAS registry number of the possible compound, and may use this number within SANSS to learn the structure of the compound.

In addition to the chemical substances in CIS, the master file that is searched by SANSS also includes all the chemicals in some 32 other files of chemical compounds. These are files such as the Merck Index, the TSCA Inventory and the International Trade Commission list. In this sense, SANSS can be used as a referral system, and it is a very useful means of locating chemicals in its 41 files.

The entire CIS structure may be viewed as a network of interlinked but independent numerical data bases, linked together by using the CAS registry number as the universal, unique identifier for each compound. The use of this registry number to tag all CIS files was codified in EPA regulation No. 2800.2 in 1975 and, with the passage of TSCA in late 1976, it was extended to the TSCA inventory, thus establishing the link between regulatory data and scientific data.
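The role of the CAS registry number as the single key that ties independent files together can be illustrated with a minimal sketch. Everything here is hypothetical: the file names, the records, and the placeholder keys such as "RN-0001" are invented for illustration and do not reproduce any CIS file format.

```python
# Hypothetical miniature of the CIS layout: independent files that share
# only the CAS registry number as a key. Every record is an invented
# placeholder; "RN-0001" stands in for a real registry number.
NAMES = {
    "RN-0001": ["compound A", "alpha-substance"],
    "RN-0002": ["compound B"],
}
FILES = {
    "mass_spectra": {"RN-0001": {"base_peak": 43}},
    "toxicity": {"RN-0001": {"note": "placeholder"},
                 "RN-0002": {"note": "placeholder"}},
}


def search_by_name(fragment):
    """Name or name-fragment search: return matching registry numbers."""
    fragment = fragment.lower()
    return sorted(rn for rn, names in NAMES.items()
                  if any(fragment in name.lower() for name in names))


def referral(registry_number):
    """Referral: list the files that hold data for this registry number."""
    return sorted(name for name, records in FILES.items()
                  if registry_number in records)


def retrieve(registry_number):
    """Collect the data for one compound from every file that has it."""
    return {name: records[registry_number]
            for name, records in FILES.items()
            if registry_number in records}


if __name__ == "__main__":
    for rn in search_by_name("compound"):
        print(rn, referral(rn), retrieve(rn))
```

Because each file is addressed only through the shared key, a new file can be added to the collection without touching the others, which is the property the network-of-files description relies on.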
[Figure 1. The NIH-EPA Chemical Information System. Diagram not reproduced; some components are marked as under development.]

Volume 13, Number 7, July 1979

System development
A general protocol for updating CIS components, or adding new components, has been established. A schematic diagram of this protocol is shown in Figure 2. In the first phase, a data base is acquired from one of a variety of sources. Some of the CIS data bases have been developed specifically for the CIS, an example of this being the mass spectral data base. Others, such as the Cambridge Crystal File, are leased for use in the CIS, and still others, such as the X-ray powder diffraction file, are operated within CIS by their owners, in this case the Joint Committee on Powder Diffraction Standards. In other cases, the information comes from other government agencies which retain responsibility for the file, its contents and its maintenance. An example of such a file is the NIOSH RTECS.

Once inverted lists have been prepared, usually on the NIH IBM 370-168 computer, they are transferred to the NIH PDP-10 computer, which is primarily a time-sharing computer. Then the programs for generating the "searchable" files, and for searching through these files, are written. Out of this work, there finally emerges a pilot version of each CIS component. This pilot version is then allowed to run on the NIH PDP-10, and access to it is provided to a small number of people who can log into the NIH computer by telephone, using long-distance calls if necessary. These users are provided with free computation and, in return, they test the component thoroughly for errors and deficiencies. Such problems are reported to the development team, which attempts to deal with them. Depending upon the size and complexity of the component, this testing phase can last as long as eighteen months.

When testing is complete, the entire component is exported to a "networked" PDP-10 in the private sector and access to the version on the NIH computer is no longer supported. The component in the private sector is available to the general scientific community, including government agencies, and is used on a fee-for-service basis. In this phase, the government retains no financial interest; the routine operation of CIS components in the private sector is not subsidized.

CIS programs have been designed for use with a "networked" DEC PDP-10 computer system because the alternative, exporting programs and data bases to locally operated PDP-10 computers, is less workable and contains a number of deficiencies that are overcome by a network. For instance, a "networked" machine means that data bases need only be stored once, at the center of the network. A great deal of money is saved because duplicate storage is not necessary. Furthermore, a single copy of a data base is easy to maintain, whereas updating of a data base that resides in many computers is virtually impossible. Finally, communication among systems personnel and users is very simple in a network environment, as is monitoring of system performance.

The only equipment that is required to establish access to a computer network is a telephone-coupled computer terminal. Typewriter terminals are becoming very common and relatively inexpensive. Such a terminal can be purchased from a variety of manufacturers for between $1000 and $3000 and, in general, will operate at 300 baud (30 characters/s). A cathode-ray terminal capable of running at 1200 baud can be purchased for as little as $2000. Equipment of this sort can either be leased or purchased.
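To give a feel for what the two terminal speeds mean in practice, the arithmetic can be sketched as follows. The only figure taken from the text is that 300 baud corresponds to 30 characters/s, which implies 10 transmitted bits per character; the 24-line listing is a hypothetical example.

```python
# Assumption: 10 transmitted bits per character, which is what the
# quoted figure of 300 baud = 30 characters/s implies.
BITS_PER_CHARACTER = 10


def characters_per_second(baud):
    """Convert a line speed in baud to characters per second."""
    return baud / BITS_PER_CHARACTER


def seconds_to_transmit(characters, baud):
    """Time needed to print a listing of the given length."""
    return characters / characters_per_second(baud)


if __name__ == "__main__":
    page = 24 * 80  # hypothetical listing: 24 lines of 80 characters
    for baud in (300, 1200):
        print(baud, "baud:", seconds_to_transmit(page, baud), "s")
```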
SANSS
All the compounds in the files of CIS have been assigned a registry number by the CAS [for registry numbers see Anal. Chem., 51, 567 (1979)]. The registry number is a unique identifier for that compound, and may be used to retrieve from the CAS Master Registry of over 4.5 million entries all the synonyms that the CAS has identified for the compound. Further, the registry number can be used to locate the connection table for the compound's structure. This is a two-dimensional record of all atoms in the molecule together with the atoms to which each is bonded and the nature of the bonds. The connection table is the basis of the substructure search component of the CIS.

The purpose of the Structure and Nomenclature Search System (SANSS) is to permit a search for a user-defined structure or substructure through data bases of the CIS. If a substructure is found to be in a CIS data base, then, armed with its CAS registry number, the user can "access" that file, locate the compound and retrieve whatever data are available on it.

The main ways in which the CIS Unified Data Base can be searched are the following: name/name fragment search (NPROBE); nucleus/ring search (RPROBE); atom-centered fragment search (FPROBE); structure code search (SPROBE); molecular weight, molecular formula, and partial molecular formula searches; total atom-by-atom, bond-by-bond substructure search (SUBS); and full structure search (IDENT).

SANSS files

File                                                   Substances
TSCA inventory                                         43 278
CIS mass spectrometry                                  25 560
CIS carbon 13 nmr spectrometry                         3805
EPA pesticides active ingredients                      1453
EPA OHM/TADS                                           858
Cambridge X-ray crystal                                14 854
Merck Index                                            8959
EPA pesticides analytical reference standards          473
EPA STORET                                             234
EPA chemical spills                                    577
EPA AEROS SOTDAT                                       572
NIMH psychotropic drugs                                2039
EPA AEROS SAROAD                                       65
NBS proton affinities                                  440
CPCS CHEMRIC                                           890
EPA pesticides registered inert ingredients            735
NBS gaseous ions                                       3167
NFPA hazardous chemicals                               396
FDA/EPA pesticides reference standards                 613
U.S. International Trade Commission                    9193
NBS X-ray crystal                                      18 338
EPA effluent guidelines                                125
EPA organic chemical producers                         375
IPC chemical product                                   104
IPC chemical plant                                     103
NSF chemicals list                                     225
EROICA thermodynamics                                  4492
PHS149 carcinogenic activity                           4447
NIOSH RTECS                                            19 882
NIOSH NOHS                                             4560
ORNL EMIC                                              4030
ORNL ETIC                                              3241
EPA selected organic air pollutants                    579
Clean Air Act Section 112                              4
EPA/NCTR study (1976)                                  91
EPA environmental carcinogen assessment program        21
EPA restricted use pesticides                          23
EPA compounds for mutagenicity evaluation              25
CIIT priority chemicals lists (toxicological)          27
NMFS survey of trace elements                          15

The ability to search for a chemical by name or partial name (NPROBE) can be very useful. For example, if one wishes to search for a drug or pesticide, most of which have simple, short, trivial names but relatively complex structures, a name search is very useful. NPROBE was used to locate the entry for the potent carcinogen dioxin, CAS registry number 828-00-2, in under one minute total time. This exercise also led to the structure and molecular formula of dioxin, and the names of the nine files, including the TSCA Candidate List and the NIOSH Registry of Toxic Effects of Chemical Substances (RTECS), in which information on dioxin is found. Finally, the various synonyms for the compound were all listed.

As the first step in a structural search, the user must define the substructure of interest to the computer. This is done with a family of structure-generation programs which can, for example, create a ring of a given size, a chain of a given length, a fused ring system and so on. As the query structure is developed by using these commands, the computer stores the growing connection table. If the user wishes to view the current structure at any point, the display command (D) can be invoked. This command, using the current connection table, generates a structure diagram that can be printed on a conventional teletype terminal.

When the appropriate query structure has been generated, a number of search options can be invoked to find occurrences of this query structure in the data base. Two very useful search options are the fragment probe and the ring probe.
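A connection table of the kind described above, together with a ring-building command, might be sketched as follows. This is a minimal illustration under stated assumptions: the class, the `make_ring` helper, and the text listing produced by `display` are hypothetical stand-ins, hydrogens are left implicit, and the real display command draws a structure diagram rather than a listing.

```python
from dataclasses import dataclass, field


@dataclass
class ConnectionTable:
    """Atoms plus the bonds between them (hydrogens left implicit)."""
    atoms: list = field(default_factory=list)  # element symbol per atom index
    bonds: dict = field(default_factory=dict)  # (i, j) with i < j -> bond order

    def add_atom(self, element="C"):
        self.atoms.append(element)
        return len(self.atoms) - 1

    def add_bond(self, i, j, order=1):
        self.bonds[(min(i, j), max(i, j))] = order

    def neighbors(self, i):
        """Return sorted (atom index, bond order) pairs bonded to atom i."""
        result = []
        for (a, b), order in self.bonds.items():
            if a == i:
                result.append((b, order))
            elif b == i:
                result.append((a, order))
        return sorted(result)


def make_ring(size):
    """Hypothetical structure-generation command: an all-carbon ring."""
    table = ConnectionTable()
    for _ in range(size):
        table.add_atom("C")
    for i in range(size):
        table.add_bond(i, (i + 1) % size)
    return table


def display(table):
    """Text listing of the current table (a stand-in for a drawn diagram)."""
    lines = []
    for i, element in enumerate(table.atoms):
        bonded = ", ".join(f"{j}({order})" for j, order in table.neighbors(i))
        lines.append(f"{i} {element}: {bonded}")
    return "\n".join(lines)


# Build furan: a five-membered ring, one oxygen, two double bonds.
furan = make_ring(5)
furan.atoms[0] = "O"
furan.add_bond(1, 2, order=2)
furan.add_bond(3, 4, order=2)

if __name__ == "__main__":
    print(display(furan))
```

Each command only appends to or overwrites entries in the table, so the stored structure always reflects the query as built so far and can be listed at any point.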
The fragment probe will search through the assembled connection tables of the data base for all occurrences of a particular atom-centered fragment, that is, a specific atom, together with all its neighbors and bonds. The ring probe searches for all structures in the data base containing the same ring or rings as the query structure. A ring that is considered to be an answer to such a query must be the same size as that in the query structure. It must also contain at least as many non-carbon atoms (heteroatoms) as the query structure. The type of bonding is not considered in an RPROBE search. Thus with a query structure of furan, the only exact answer is furan, but the user may permit the retrieval of "imbedded" answers, which would include furan, tetrahydrofuran and thiophene.

In addition to these structural searches, there are a number of "special properties" searches that often prove to be very useful in reducing the large list of answers resulting from structure searches. The special properties searches include those for a specific molecular weight or range of molecular weights, and for compounds containing a given number of rings of a given size. Searches may also be conducted for the molecular formula corresponding to the query structure, or for other user-defined molecular formulas. These may be specified completely or partially.

If one's purpose is to determine only the presence or absence in a data base of a specific structure, this can be accomplished with the search option "IDENT." This program "hash-encodes" the query structure's connection table, and searches through a file of hash-encoded connection tables for an exact match. The search, which is very fast by substructure search standards, has been designed specifically for those users who, to comply with the Toxic Substances Control Act, have to determine the presence or absence of specific compounds in EPA's files.

Finally, if one has completed ring probe and fragment probe searches for a specific query structure, and is still confronted with a sizeable file of compounds that satisfy the criteria that were nominated, a substructure search through this file may be carried out. This involves an atom-by-atom, bond-by-bond comparison of every structure, and will retrieve any compound that contains the query structure.

The structure and nomenclature search system is the center of CIS and operates on a unified data base of 41 files (Figure 2). An additional 140145 files, including the list of some 20 000 chemicals covered by the Japanese Toxic Substances law, are being processed for later addition.
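The fragment probe and the exact-match lookup described above can be sketched in a few lines. This is an illustration under stated assumptions, not the CIS encoding: molecules are keyed by name rather than registry number, hydrogens are implicit, and the lookup key is simply the multiset of atom-centered fragments, which is independent of atom numbering but is not a true canonical code (distinct structures could in principle share a key).

```python
from collections import Counter


def ring(elements, orders):
    """Build a single-ring molecule; orders[i] bonds atom i to atom i+1."""
    n = len(elements)
    bonds = {}
    for i in range(n):
        a, b = i, (i + 1) % n
        bonds[(min(a, b), max(a, b))] = orders[i]
    return {"atoms": list(elements), "bonds": bonds}


def environment(mol, i):
    """Sorted (neighbor element, bond order) pairs around atom i."""
    out = []
    for (a, b), order in mol["bonds"].items():
        if a == i:
            out.append((mol["atoms"][b], order))
        elif b == i:
            out.append((mol["atoms"][a], order))
    return tuple(sorted(out))


def fragments(mol):
    """Multiset of atom-centered fragments for one molecule."""
    return Counter((element, environment(mol, i))
                   for i, element in enumerate(mol["atoms"]))


def fragment_probe(library, fragment):
    """Names of all library structures containing the given fragment."""
    return sorted(name for name, mol in library.items()
                  if fragment in fragments(mol))


def ident_key(mol):
    """Numbering-independent key; the dict that stores it does the hashing."""
    return frozenset(fragments(mol).items())


LIBRARY = {
    "furan": ring("OCCCC", [1, 2, 1, 2, 1]),
    "tetrahydrofuran": ring("OCCCC", [1, 1, 1, 1, 1]),
    "thiophene": ring("SCCCC", [1, 2, 1, 2, 1]),
}
INDEX = {ident_key(mol): name for name, mol in LIBRARY.items()}


def ident(query):
    """Exact-match lookup: the stored name, or None if absent."""
    return INDEX.get(ident_key(query))


# An oxygen atom carrying two singly bonded carbons.
ETHER_OXYGEN = ("O", (("C", 1), ("C", 1)))

if __name__ == "__main__":
    print(fragment_probe(LIBRARY, ETHER_OXYGEN))
    # Furan drawn with a different atom numbering still matches.
    print(ident(ring("CCOCC", [2, 1, 1, 2, 1])))
```

A lookup of this kind touches one index entry instead of comparing the query with every stored structure, which is why an exact-match search can be much faster than the atom-by-atom substructure comparison.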
Mass spectral search system
The Mass Spectral Search System (MSSS), developed in 1971 by the NIH, the EPA, the National Bureau of Standards and the Mass Spectrometry Data Centre in England, uses a data base containing about 33 000 mass spectra representing the same number of compounds. Every compound in the file has been assigned its CAS registry number, which is used to find the compound in other CIS files, and provides structure and synonym search capabilities throughout the CIS.

Searches through the MSSS data base can be carried out in a number of ways. In the most commonly used search, the program will retrieve all the spectra containing a particular peak, specified by mass/electron charge (m/e) and relative abundance. Then a second m/e and abundance pair can be entered and all spectra containing both peaks retrieved. This search converges: usually one is left with two or three spectra after entry of four or five peaks. Molecular weights or formulas may also be used in this search, which then increases the convergence rate markedly.
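The way a peak-by-peak search converges can be sketched as a sequence of filters, each applied to the survivors of the previous one. The three-spectrum library, the placeholder keys, and the abundance windows are all invented for illustration; this is not the MSSS file format or matching rule.

```python
# Hypothetical library: placeholder key -> {m/e: relative abundance (%)}.
LIBRARY = {
    "RN-A": {43: 100, 58: 30, 71: 12},
    "RN-B": {43: 100, 57: 80, 71: 45},
    "RN-C": {41: 60, 43: 95, 57: 100},
}


def peak_search(candidates, mz, low, high):
    """Keep candidates with a peak at mz whose abundance lies in [low, high].

    A missing peak counts as abundance 0, so use low > 0 to require presence.
    """
    return {rn for rn in candidates if low <= LIBRARY[rn].get(mz, 0) <= high}


def converge(peaks):
    """Apply successive peak filters; return survivors and the hit counts."""
    remaining = set(LIBRARY)
    history = []
    for mz, low, high in peaks:
        remaining = peak_search(remaining, mz, low, high)
        history.append(len(remaining))
    return remaining, history


if __name__ == "__main__":
    survivors, counts = converge([(43, 90, 100), (57, 70, 100), (71, 40, 50)])
    print(survivors, counts)
```

Each added peak can only shrink the candidate set, which is the sense in which the search converges.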
Searches of this sort are used a great deal by chemists attempting to identify unknowns, but there is also a great demand for a batch search into which one can enter a number of spectra, which will then be searched for, one by one, without user intervention. A program that can do this is available within MSSS. After each search, this program reports the ten best fits from the library for each unknown, and gives a “similarity index” for each one.

Finally, a reverse search is contained within MSSS. This program checks to see if each library spectrum is contained within the unknown spectrum. If such containment occurs, the library spectrum is subtracted from the unknown and the matching process resumes. In this way, it is possible to identify components of a mixture from an examination of the mass spectra of the mixture.

Once an identification has been made, the name and registry number of the data base compound are reported to the user. If necessary, the data base spectrum can be listed or, if a CRT terminal is being used, plotted to facilitate direct comparison of the unknown and standard spectra.

Also within the MSSS are the accumulated files of the Mass Spectrometry Bulletin, a serial publication of the Mass Spectrometry Data Centre (UKCIS, Nottingham, England). The Bulletin, which since 1967 has collected about 60 000 citations to papers on mass spectrometry, can be searched for all papers by given authors, on specific subjects, or on particular elements. In addition, citations dealing with general index terms may also be retrieved. Simple Boolean logic is available, and thus searches may be conducted for papers by Smith and Jones, or Smith but not Jones, and so on. Citations retrieved may be limited to specified publication years between 1967 and the present.

The MSSS has been widely available through computer networks since 1971. It currently resides at the Interactive Sciences Corporation (ISC) computer where, every month, over 3000 searches and 2000 other transactions, such as retrievals, are carried out by over 300 laboratories.
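The reverse search described above, in which a library spectrum found inside the unknown is subtracted before matching resumes, can be sketched as follows. The containment test used here (every library peak must be present in the unknown) and the choice of scale factor are simplifying assumptions, and the spectra are invented; the actual MSSS criteria are not given in the text.

```python
def contained(library_spectrum, unknown):
    """Return the scale at which the library spectrum fits inside the
    unknown, or None if any library peak is missing from the unknown."""
    ratios = []
    for mz, abundance in library_spectrum.items():
        if unknown.get(mz, 0.0) <= 0.0:
            return None
        ratios.append(unknown[mz] / abundance)
    # The smallest ratio guarantees no peak goes negative on subtraction.
    return min(ratios)


def subtract(unknown, library_spectrum, scale):
    """Remove the scaled library spectrum; drop peaks that reach zero."""
    residual = {}
    for mz, abundance in unknown.items():
        left = abundance - scale * library_spectrum.get(mz, 0.0)
        if left > 1e-9:
            residual[mz] = left
    return residual


def reverse_search(unknown, library):
    """Identify mixture components by repeated containment and subtraction."""
    found = []
    residual = dict(unknown)
    progress = True
    while residual and progress:
        progress = False
        for name, spectrum in library.items():
            if name in found:
                continue
            scale = contained(spectrum, residual)
            if scale is not None:
                found.append(name)
                residual = subtract(residual, spectrum, scale)
                progress = True
                break
    return found, residual


# Hypothetical library and a mixture of X with half as much Y.
LIBRARY = {
    "X": {43: 100.0, 58: 50.0},
    "Y": {57: 100.0, 71: 40.0},
    "Z": {91: 100.0, 92: 60.0},
}
MIXTURE = {43: 100.0, 58: 50.0, 57: 50.0, 71: 20.0}

if __name__ == "__main__":
    print(reverse_search(MIXTURE, LIBRARY))
```

A practical matcher would also need to tolerate measurement error and overlapping peaks, which this sketch ignores.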
Other numeric data bases
The CIS contains a dozen numeric data bases which are mostly modeled after MSSS. In general, it is possible within each component to learn the CAS registry number of the compound or compounds that give data corresponding to that entered by the user in a search. The reverse process, retrieving the data corresponding to a compound whose registry number is entered, is also possible in each of the CIS components.

The file of carbon-13 nmr contains over 8800 spectra and can be searched by chemical shift and molecular formula. Any spectrum can be retrieved from this file by entry of either the CAS registry number or the spectrum number. Likewise, the Cambridge Crystal Data Base, which contains crystallographic and bibliographic data on some 22 000 compounds, including full atomic coordinate data for 15 000 of them, can be searched within the CIS. The NBS Single Crystal Data Base is currently being merged into the CIS, and it should be possible, using this file, to identify crystalline compounds from searches based upon the dimensions and lattice type of the crystal.

A third X-ray data base in the CIS is the file of standard powder diffraction patterns that has been assembled by the Joint Committee on Powder Diffraction Standards (Swarthmore, Pa.). Since powder diffraction patterns are most typically measured on mixtures, provision has been made in this case for reverse searching. This permits one to identify components of a mixture from the diffraction pattern obtained from the mixture.

A trio of programs that operate without data bases has also been included in the CIS. The mathematical modeling system, MLAB, is a generalized statistical analysis package, capable of regression, curve-fitting and differential calculus. A program that can carry out the iterative second-order analysis of complex nmr spectra is in the CIS. And finally, the program set CAMSEQ-II has also been added to the system. CAMSEQ-II permits the calculation of the internal energy of any particular conformation of a molecule and, by successive changes of torsional angles and recalculation of this internal energy, it can estimate the most stable conformation for the molecule in solution. This program can accept a CAS registry number as input and, in this sense, it is fully merged into the CIS.
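The idea of stepping a torsional angle and recalculating the energy at each step can be illustrated with a toy scan. The energy function below is a generic cosine torsional potential chosen only so that the example runs; it is an assumption for illustration and is not the energy model used by CAMSEQ-II, and the barrier values are arbitrary.

```python
import math


def torsional_energy(angle_degrees, v1=1.0, v3=3.0):
    """Hypothetical stand-in energy: one-fold plus three-fold cosine terms."""
    phi = math.radians(angle_degrees)
    return (0.5 * v1 * (1 + math.cos(phi))
            + 0.5 * v3 * (1 + math.cos(3 * phi)))


def scan(step_degrees=10):
    """Step the torsional angle through a full turn, recomputing the energy
    each time, and return the angle with the lowest energy."""
    angles = range(0, 360, step_degrees)
    return min(angles, key=torsional_energy)


if __name__ == "__main__":
    best = scan()
    print(best, torsional_energy(best))
```

With several rotatable bonds the same scan becomes a search over combinations of angles, which is where most of the computing time in a conformational analysis is spent.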
NIOSH RTECS search system
The National Institute for Occupational Safety and Health (NIOSH), created in 1970, is required by law to prepare a list containing all the toxic effects of chemicals that have been recorded. The Registry of Toxic Effects of Chemical Substances (RTECS) is the data base created and updated annually by NIOSH to comply with this law. In 1977, the data base consisted of some 25 000 chemicals and the toxicity associated with each of these chemicals.

The NIOSH RTECS is the first nonspectroscopic CIS data base. It has proven to be a very valuable addition to the system. Interest in the data base has been shown by many groups within EPA involved in the implementation of TSCA. For example, a link has been provided within the CIS between spectral data and the NIOSH toxicity data so that, following a mass spectral identification, the EPA laboratory in question can quickly be informed if the chemical it identified is toxic and hence requires immediate action.

A typical search through the NIOSH RTECS can be carried out in less than a minute to find, for example, the three reported measurements on compounds with a rodent oral LD50 value of less than 75 micrograms per kilogram of body weight. The references to such measurements can be listed, together with the NIOSH number and the CAS registry number for each compound and the actual LD50 values cited.
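A search of the kind just described, for rodent oral LD50 values below a threshold, amounts to a filter over toxicity records. The record layout, field names, and every value below are hypothetical placeholders, not RTECS data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    registry_number: str      # placeholder key, not a real CAS number
    species: str
    route: str
    ld50_ug_per_kg: float     # micrograms per kilogram of body weight
    reference: str


# Invented records for illustration only.
MEASUREMENTS = [
    Measurement("RN-0001", "rat", "oral", 20.0, "placeholder ref 1"),
    Measurement("RN-0002", "mouse", "oral", 60.0, "placeholder ref 2"),
    Measurement("RN-0003", "rat", "oral", 5000.0, "placeholder ref 3"),
    Measurement("RN-0004", "rat", "dermal", 10.0, "placeholder ref 4"),
]
RODENTS = frozenset({"rat", "mouse"})


def search_ld50(measurements, route, limit_ug_per_kg, species=RODENTS):
    """Measurements by the given route and species with LD50 below the limit,
    most toxic first."""
    hits = [m for m in measurements
            if m.route == route
            and m.species in species
            and m.ld50_ug_per_kg < limit_ug_per_kg]
    return sorted(hits, key=lambda m: m.ld50_ug_per_kg)


if __name__ == "__main__":
    for m in search_ld50(MEASUREMENTS, "oral", 75.0):
        print(m.registry_number, m.species, m.ld50_ug_per_kg, m.reference)
```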
WaterDROP
Over the past few years, pollutant monitoring activities, both in Europe and North America, have begun to provide some information on chemicals that are important as pollutants, and on the places in which such pollutants are to be found. The EPA research laboratory in Athens, Ga., realizing the need for a centralized source for collection, storage and dissemination of this information, has started to develop a Distribution Register of Organic Pollutants in Water (WaterDROP). WaterDROP, which was begun in the summer of 1978, contains the identity of the chemical found, the sampling site and date, the reporting laboratory, the analytical method used and the date of entry into the system.

These data will be collected in a number of ways. An important source of data is expected to be the system shown schematically in Figure 3. Here, MSSS users identify the unknown toxic pollutant using the batch search option of MSSS. This program has been modified; EPA laboratories are required to enter additional information when conducting a search. As each laboratory identifies an unknown, the central computer builds up a data base of information for WaterDROP. The result of all these mass spectral identifications will be a centralized report file, such as the centralized report included in Figure 3, in which one can see that the usual MSSS results, the river, river mile, longitude, latitude, date, and laboratory entering the data have been recorded. Centralized reporting of this sort has been sorely needed. The data bank will be published by the EPA.

[Figure 3. Data input to the WaterDROP file. Diagram not reproduced. The sample centralized report lists a registry number (62759), compound names (N-nitrosodimethylamine and synonyms, including DMN and DMNA), river (Ohio), river mile, longitude, latitude, date, S.I. value, and laboratory.]

When the WaterDROP system is available, sometime in late 1979, it will be possible to answer questions concerning the number of locations in which a given chemical (or class of chemicals) is found. Patterns of distribution of chemicals found in water, which indicate problems with plant effluents, may also be detected with this system. Answers to these and other similar questions, coupled with the toxicity data from RTECS and other CIS sources, should provide valuable technical facts to enable governments to regulate and control pollutants in a rational fashion.
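The kind of question WaterDROP is meant to answer, such as the number of locations at which a given chemical has been found, can be sketched as a grouping over report records. The record fields loosely follow those named in the text; the field names, the placeholder registry numbers, and all values are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    registry_number: str   # placeholder key, not a real CAS number
    river: str
    river_mile: float
    sample_date: str
    laboratory: str


# Invented reports for illustration only.
REPORTS = [
    Report("RN-0001", "River X", 12.0, "1979-01-05", "Lab 1"),
    Report("RN-0001", "River X", 12.0, "1979-02-09", "Lab 2"),
    Report("RN-0001", "River Y", 3.5, "1979-02-11", "Lab 1"),
    Report("RN-0002", "River Y", 3.5, "1979-03-01", "Lab 3"),
]


def locations_by_chemical(reports):
    """Distinct (river, river mile) sites at which each chemical was found."""
    sites = defaultdict(set)
    for report in reports:
        sites[report.registry_number].add((report.river, report.river_mile))
    return {rn: sorted(found) for rn, found in sites.items()}


def chemicals_by_river(reports):
    """Chemicals reported on each river, to look for distribution patterns."""
    chemicals = defaultdict(set)
    for report in reports:
        chemicals[report.river].add(report.registry_number)
    return {river: sorted(found) for river, found in chemicals.items()}


if __name__ == "__main__":
    for rn, sites in locations_by_chemical(REPORTS).items():
        print(rn, "found at", len(sites), "location(s):", sites)
    print(chemicals_by_river(REPORTS))
```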
Aquatic toxicity
A data bank of aquatic toxicity is being developed by EPA in conjunction with ASTM Committee E35.21.01. This data bank, expected to be available for testing on CIS in late 1979, will have information on the chemicals found in fish, their reported toxicities, the literature citations, common and scientific names of the species studied, temperature, pH and hardness or salinity of the water in the study, and a comments section for other desired information related to the study. AQUATOX, the Aquatic Toxicity Data Base, will be searchable on all the fields mentioned above and, in addition, the whole file will be searchable for specific chemicals by using SANSS. It is hoped that in the long term, the aquatic toxicity system will prove to be a useful complement to the WaterDROP file.
Summing up
One of the first goals of the CIS was to produce a series of "searchable" chemical data bases for use by working analytical chemists having no special computer expertise. The second aim was to link these data bases together so that the user need not be restricted to a consideration of only mass spectral data, for example. The various problems inherent in these plans included acquisition of data bases, design of programs, dissemination of the resulting system and linking, via CAS registration numbers, of the various CIS components. These problems have been solved conceptually and, to a large extent, practically. The CIS, as it now stands, is the result.

In summary, it is felt that, to date, programs with the CIS have demonstrated economic feasibility and scientific value in support of TSCA. The test before us is whether we can capitalize on this to explore the new and exciting possibilities inherent in the use of computer systems to support TSCA goals, and assist in meeting the needs of the country for a safer and healthier environment.

For details on how to gain access to CIS, please contact Kay Pool, CIS Project Manager, Suite 500, ISC, 918 16th St., N.W., Washington, DC 20006 (202-223-6503 or 800-424-9600).