Computer Networking and Chemistry - ACS Publications - American

Networking and a Collaborative Research. Community: a Case Study Using the DENDRAL. Programs. RAYMOND E. CARHART, SUZANNE M. JOHNSON, DENNIS H. SMITH,...
0 downloads 0 Views 3MB Size
13

N e t w o r k i n g and a Collaborative Research Community:

a Case Study Using the D E N D R A L

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

Programs RAYMOND E. CARHART, SUZANNE M. JOHNSON, DENNIS H. SMITH, BRUCE G. BUCHANAN, R. GEOFFREY DROMEY, and JOSHUA LEDERBERG Departments of Computer Science, Genetics, and Chemistry, Stanford University, Stanford, Calif. 94305 Computer Science is one of the newest, but also one of the least "cumulative" of the sciences. Gordon (1) has recently pointed out the upsetting disparity between the number of potentially sharable programs in existence and the number which are easily accessible to a given researcher. Although some mechanisms exist for the systematic exchange of program resources, for example the World List of Crystallographic Computer Programs (2), a great deal of programming effort i s duplicated among different research groups with common interests. The reasons for this are understandable: these groups are separated by geography, by incompatibilities in computer facilities and by a lack of a means to keep abreast of a rapidly changing field. The emergence of more economical technologies for data communications provides, in principle, a method for lowering these geographical and operational barriers; for creating, through computer networking, remote sites at which functionally specialized capabilities are concentrated. The SUMEX-AIM (Stanford University Medical Experimental computer - A r t i f i c i a l Intelligence in Medicine) project is an experiment in reducing this principle to practice, in the specific area of a r t i f i c i a l intelligence research applied to health sciences. The SUMEX-AIM computer facility (3) is a National Shared Computing Resource being developed and operated by Stanford University, in partnership with and with financial support from the Biotechnology Resources Branch of the Division of Research Resources, National Institutes of Health. It is national in scope in that a major portion of its computing capacity is being made available to authorized research groups throughout the country by means of communications networks. Aside from demonstrating, on managerial, administrative and technical levels, that such a national computing resource is a viable concept, the primary objective of SUMEX-AIM is the building of a collaborative research community. The aim is to encourage individual participants not only to investigate applications of a r t i f i c i a l intelligence in health science, but also to share their 192 Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

13.

CARHART

E T AL.

Collaborative

Research

Community

193

programs and d i s c u s s t h e i r ideas w i t h other r e s e a r c h e r s . T h i s places a r e s p o n s i b i l i t y upon SUMEX-AIM to develop e f f e c t i v e means of communication among community members and among the programs they w r i t e . I t a l s o places r e s p o n s i b i l i t y upon those members to design and document programs that r e a d i l y can be used and understood by others. Another aspect of the SUMEX f a c i l i t y i s p r o v i d i n g s e r v i c e to i n d i v i d u a l s whose i n t e r e s t i s i n u s i n g , r a t h e r then developing, the a v a i l a b l e computer programs. Although t h i s i s not a primary c o n s i d e r a t i o n , i t i s an e s s e n t i a l p a r t of the growth of these programs. Most of the SUMEX-AIM p r o j e c t s have formed, or are forming, t h e i r own user communities which provide v a l u a b l e " r e a l world" experience. F i g u r e 1 d e p i c t s the t y p i c a l i n t e r a c t i o n of such a p r o j e c t with i t s user community, and w i t h other p r o j e c t s . The p a r t i c i p a t i o n by users i n program development i s not j u s t r e s t r i c ted to suggestions, but can a l s o i n c l u d e software created by comp u t e r - o r i e n t e d users to s a t i s f y s p e c i a l needs. In some p r o j e c t s , methods are being considered to f u r t h e r promote t h i s k i n d of participation. The purposes of t h i s paper are t h r e e f o l d : f i r s t , to i n d i c a t e the range of research p r o j e c t s c u r r e n t l y a c t i v e at SUMEX; second, to d e s c r i b e i n d e t a i l one of these p r o j e c t s , DENDRAL, which i s of p a r t i c u l a r i n t e r e s t to chemists; and t h i r d , to d i s c u s s some problems and p o s s i b l e s o l u t i o n s r e l a t e d to networking and communitybuilding. I.

RESEARCH ACTIVITIES AT SUMEX-AIM

The community of p a r t i c i p a n t s i n -SUMEX-AIM can be d i v i d e d g e o g r a p h i c a l l y i n t o l o c a l (i._e., Stanford-based) p r o j e c t s and remote p r o j e c t s , and below i s given a b r i e f d e s c r i p t i o n of the major r e p r e s e n t a t i v e s of each. Communication with the remote proj e c t s i s accomplished through one or both of the communications networks shown i n F i g u r e 2 . In most cases, connection with SUMEXAIM from these remote s i t e s i n v o l v e s only a l o c a l telephone c a l l to the nearest network "node". The SUMEX-AIM system i s i t s e l f undergoing constant improvement which deserves to be c a l l e d research, and thus a t h i r d s e c t i o n i s i n c l u d e d here to represent system development. Remote P r o j e c t s The Rutgers P r o j e c t . O r i g i n a t i n g from Rutgers U n i v e r s i t y are s e v e r a l research e f f o r t s designed t o introduce advanced methods i n computer s c i e n c e - p a r t i c u l a r l y i n a r t i f i c i a l i n t e l l i g e n c e and i n t e r a c t i v e data base systems - i n t o s p e c i f i c areas of biomedical research. One such e f f o r t i n v o l v e s the development of computerbased c o n s u l t a t i o n systems f o r diseases of the eye, s p e c i f i c a l l y the establishment of a n a t i o n a l network of c o l l a b o r a t o r s f o r d i a g n o s i s and recommendations f o r treatment of glaucoma by computer.

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

COMPUTER NETWORKING AND CHEMISTRY

OTHER RESEARCH PROJECTS

^

exchange of research programs ^ and ideas

Figure 1.

suggested developments

^ RESEARCH PROJECT

use of production programs

PROJECT-RELATED USER COMMUNITY ^

Interactions in the S UMEX-AIM

Community

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

SUMEX

ARPANET

TYMNET Phone Lines

Direct Lines

m.

-7

LOCAL USERS

TYMNET ACCESS POINT

Lxal Call

Figure 2.

Figure 3.

REMOTE USERS

Access to

Local Call

ARPANET ACCESS POINT

SUMEX-AIM

Total ion current vs. spectrum number in a GC/LRMS run

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

13.

CARHART

ET

AL.

Collaborative

Research

Community

195

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

Another p r o j e c t concerns the BELIEVER program, which represents a theory of how people a r r i v e at an i n t e r p r e t a t i o n of the s o c i a l a c t i o n s of others. SUMEX-AIM provides an e x c e l l e n t medium f o r c o l l a b o r a t i o n i n the development and t e s t i n g of t h i s theory. The Rutgers p r o j e c t i n c l u d e s , i n a d d i t i o n , s e v e r a l fundamental s t u d i e s i n a r t i f i c i a l i n t e l l i g e n c e and system design, which provide much of the support needed f o r the development of such complex systems. The DIALOG P r o j e c t . The DIAgnostic LOGic p r o j e c t , based at the U n i v e r s i t y of P i t t s b u r g h , i s a l a r g e s c a l e , computerized medic a l d i a g n o s t i c system that makes use of the methods and s t r u c t u r e s of a r t i f i c i a l i n t e l l i g e n c e . U n l i k e most other computer d i a g n o s t i c programs, which are o r i e n t e d to d i f f e r e n t i a l diagnosis i n a rather l i m i t e d area, the DIALOG system has been designed to d e a l with the general problem of diagnosis i n i n t e r n a l medicine and c u r r e n t l y accesses a medical data base which encompasses approximately f i f t y percent of the major diseases i n i n t e r n a l medicine. The MISL P r o j e c t . The Medical Information Systems Laboratory at the U n i v e r s i t y of I l l i n o i s (Chicago C i r c l e campus) has been e s t a b l i s h e d to explore i n f e r e n t i a l r e l a t i o n s h i p s between a n a l y t i c data and the n a t u r a l h i s t o r y of s e l e c t e d eye diseases, both i n t r e a t e d and untreated forms. This p r o j e c t w i l l u t i l i z e the SUMEXAIM resource to b u i l d a data base which could then be used as a t e s t bed f o r the development of c l i n i c a l d e c i s i o n support algorithms. D i s t r i b u t e d Data-Base System f o r Chronic Diseases. This p r o j e c t , based at the U n i v e r s i t y of Hawaii, seeks to use the SUMEXAIM f a c i l i t y to e s t a b l i s h a resource sharing p r o j e c t f o r the development of computer systems f o r c o n s u l t a t i o n and research, and to make these systems a v a i l a b l e to c l i n i c a l f a c i l i t i e s from a set of d i s t r i b u t e d data bases. The r a d i o and s a t e l l i t e l i n k s which compose the communication network known as the AL0HANET, i n conj u n c t i o n with the ARPANET, w i l l make these programs a v a i l a b l e to other Hawaiian i s l a n d s and to remote areas of the P a c i f i c b a s i n . This p r o j e c t could w e l l have a s i g n i f i c a n t l y b e n e f i c i a l e f f e c t on the q u a l i t y of h e a l t h care d e l i v e r y i n these l o c a t i o n s . Modelling of Higher Mental Functions. A p r o j e c t at the U n i v e r s i t y of C a l i f o r n i a at Los Angeles i s using the SUMEX-AIM f a c i l i t y to c o n s t r u c t , t e s t , and v a l i d a t e an improved v e r s i o n of the computer s i m u l a t i o n of paranoid processes which has been developed. These simulations have c l i n i c a l i m p l i c a t i o n s f o r the understanding, treatment, and prevention of paranoid d i s o r d e r s . The current i n t e r a c t i v e v e r s i o n (PARRY) has been running on SUMEXAIM and has provided a b a s i s f o r improvement of the f u t u r e v e r sion's language c a p a b i l i t y .

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

196

COMPUTER

NETWORKING AND CHEMISTRY

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

Local Projects The P r o t e i n C r y s t a l l o g r a p h y P r o j e c t . The P r o t e i n C r y s t a l l o g r a p h y p r o j e c t i n v o l v e s s c i e n t i s t s a t two d i f f e r e n t u n i v e r s i t i e s (Stanford and the U n i v e r s i t y of C a l i f o r n i a a t San Diego), p o o l i n g t h e i r r e s p e c t i v e t a l e n t s i n p r o t e i n c r y s t a l l o g r a p h y and computer s c i e n c e , and u s i n g the SUMEX-AIM f a c i l i t y as the c e n t r a l r e p o s i t o r y f o r programs, data and other i n f o r m a t i o n of common i n t e r e s t . The general o b j e c t i v e of the p r o j e c t i s to apply problem s o l v i n g techniques, which have emerged from a r t i f i c i a l i n t e l l i g e n c e research, to the w e l l known "phase problem" of x-ray c r y s t a l l o g r a p h y , i n order to determine the three-dimensional s t r u c tures of p r o t e i n s . The work i s intended t o be of p r a c t i c a l as w e l l as t h e o r e t i c a l value to both computer s c i e n c e ( p a r t i c u l a r l y a r t i f i c i a l i n t e l l i g e n c e research) and p r o t e i n c r y s t a l l o g r a p h y . The MYCIN P r o j e c t . MYCIN i s an e v o l v i n g program that has been developed to a s s i s t p h y s i c i a n n o n s p e c i a l i s t s w i t h the s e l e c t i o n of therapy f o r p a t i e n t s w i t h b a c t e r i a l i n f e c t i o n s . The proj e c t has i n v o l v e d both p h y s i c i a n s , with e x p e r t i s e i n the c l i n i c a l pharmacology of b a c t e r i a l i n f e c t i o n s , and computer s c i e n t i s t s , with i n t e r e s t s i n a r t i f i c i a l i n t e l l i g e n c e and medical computing. The MYCIN program attempts to model the d e c i s i o n processes of the medical experts. I t c o n s i s t s of three c l o s e l y i n t e g r a t e d components: the C o n s u l t a t i o n System asks questions, makes c o n c l u s i o n s , and gives advice; the E x p l a n a t i o n System answers questions from the user to j u s t i f y the program's advice and e x p l a i n i t s methods; and the R u l e - A c q u i s i t i o n System permits the user to teach the s y s tem new d e c i s i o n r u l e s , or to a l t e r p r e - e x i s t i n g r u l e s that a r e judged to be inadequate or i n c o r r e c t . The DENDRAL P r o j e c t . T h i s p r o j e c t , being of p a r t i c u l a r chemical i n t e r e s t , i s d e s c r i b e d i n d e t a i l i n S e c t i o n I I . Through the SUMEX-AIM f a c i l i t y DENDRAL has gained a growing community of p r o d u c t i o n - l e v e l users whose experience with the programs i s a v a l u a b l e guide to f u r t h e r development. Although t e c h n i c a l l y u s e r s , some members of t h i s community might b e t t e r be d e s c r i b e d as c o l l a b o r a t o r s because they have provided SUMEX-AIM w i t h v a r i o u s special-purpose programs which a r e of i n t e r e s t to other chemists and which extend the usefulness of the DENDRAL programs. SUMEX-AIM System Development Current research a c t i v i t i e s a t SUMEX-AIM are developing along s e v e r a l l i n e s . On a system development l e v e l there are ongoing p r o j e c t s designed to make the system more user o r i e n t e d . Curr e n t l y , the system can be expected to provide help to the user who i s confused about what i s expected i n response to a c e r t a i n prompt. A " ? " typed by the user, w i l l , i n most cases, provide a l i s t of p o s s i b l e responses from which to choose. A l s o a v a i l a b l e

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

13.

CARHART

E T

AL.

Collaborative

Research

Community

197

i n response to t y p i n g "HELP" to the monitor i s a general help d e s c r i p t i o n c o n t a i n i n g p o i n t e r s to f i l e s l i k e l y to be of i n t e r e s t to a new user. In an e f f o r t to f a c i l i t a t e communication between c o l l a b o r a t o r s , a program c a l l e d CONFER has been developed to prov i d e an o r d e r l y method f o r m u l t i p l e p a r t i c i p a n t t e l e t y p e "conference c a l l s " . B a s i c a l l y , the program a c t s as a character processor f o r a l l the terminals l i n k e d i n the conference, a c c e p t i n g input from only one at a time, and passing i t out to the remaining t e r minals. In t h i s way, the conference, i n e f f e c t , has a "moderat o r " t e r m i n a l , thus a l l o w i n g f o r a more o r d e r l y t r a n s f e r of ideas and i n f o r m a t i o n . SUMEX-AIM i s a l s o aware of the n e c e s s i t y of making i t s f a c i l i t i e s a v a i l a b l e f o r t r i a l use by p o t e n t i a l users and c o l l a b o r a t o r s . To t h i s end, a GUEST mechanism has been e s t a b l i s h e d f o r persons who wish to have b r i e f , t r i a l access to c e r t a i n programs they f e e l may be of value to them, and about which they would l i k e to o b t a i n more knowledge. T h i s provides a convenient mechanism whereby persons, who have been given an a p p r o p r i a t e phone number and LOGIN procedure, can d i a l up SUMEX-AIM and r e c e i v e a c t u a l experience u s i n g a program they may only have heard about. Another area of system development c u r r e n t l y being explored at SUMEX-AIM i s that of c r e a t i n g a comprehensive " b u l l e t i n board" f a c i l i t y where users can f i l e " b u l l e t i n s " , that i s , messages of i n t e r e s t to the SUMEX-AIM community. The f a c i l i t y w i l l a l s o a l e r t users to new b u l l e t i n s which are l i k e l y to be of i n t e r e s t to them, as determined by i n d i v i d u a l u s e r - i n t e r e s t p r o f i l e . I I . DENDRAL - CHEMICAL APPLICATIONS OF INTERACTIVE COMPUTING IN A NETWORK ENVIRONMENT The major research i n t e r e s t of the DENDRAL P r o j e c t at Stanford U n i v e r s i t y i s a p p l i c a t i o n of a r t i f i c i a l i n t e l l i g e n c e techniques f o r chemical i n f e r e n c e , f o c u s i n g i n p a r t i c u l a r on molecular s t r u c t u r e e l u c i d a t i o n . P o r t i o n s of our research are i n the area of combined gas chromatography/high r e s o l u t i o n mass spectrometry and i n c l u d e instrumentation and data a c q u i s i t i o n hardware and software development. T h i s area i s beyond the scope of t h i s r e p o r t ; we focus i n s t e a d on the concurrent development of programs to a s s i s t chemists i n v a r i o u s phases of s t r u c t u r e e l u c i d a t i o n beyond the p o i n t of i n i t i a l data c o l l e c t i o n . SUMEX-AIM provides the computer support f o r development and a p p l i c a t i o n of these programs. Another aspect of our research i s our commitment to share developments among a wider community. We f e e l that s e v e r a l of our programs are advanced enough to be u s e f u l to chemists engaged i n r e l a t e d work i n mass spectrometry and s t r u c t u r e e l u c i d a t i o n i n general. These programs are w r i t t e n p r i m a r i l y i n the programming language INTERLISP, and thus are not e a s i l y exportable (exceptions are i n d i c a t e d subsequently). SUMEX-AIM provides a mechanism f o r

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

198

COMPUTER NETWORKING AND CHEMISTRY

a l l o w i n g others access to the programs without the requirement f o r any s p e c i a l programming or computer system e x p e r t i s e . The a v a i l a b i l i t y of the SUMEX f a c i l i t y over nationwide networks allows remote users to access the programs, i n many instances v i a a l o c a l telephone c a l l . Much of the f o l l o w i n g d i s c u s s i o n i s p r e l i m i n a r y because our programs have only r e c e n t l y been r e l e a s e d f o r outside use. Some announcement of t h e i r a v a i l a b i l i t y has been made, and other announcements w i l l occur i n the near f u t u r e , through t a l k s , p u b l i c a t i o n s i n press, demonstrations and i n f o r m a l d i s c u s s i o n s . Although most of our experience has been with l o c a l users, they have been good models of remote users i n that t h e i r previous exposure to the a c t u a l programs and computer systems i s minimal. T h e i r experience has been extremely u s e f u l i n h e l p i n g us to smooth out clumsy i n t e r a c t i o n s with programs and to l o c a t e and f i x program bugs. Such p o l i s h i n g i s important f o r programs which may be u t i l i z e d by users from widely d i f f e r i n g backgrounds with respect to computers, networks and time sharing systems. We are i n the processes of b u i l d i n g a community of remote users. We a c t i v e l y encourage such use f o r two reasons: 1) we f e e l the programs a r e capable of a s s i s t i n g others i n s o l v i n g c e r t a i n molecular s t r u c t u r e problems, and 2) such experience with outside users w i l l be a tremendous a s s i s t a n c e i n i n c r e a s i n g the power of our programs as the programs a r e forced to confront new real-world problems. The remainder of t h i s s e c t i o n o u t l i n e s the programs which a r e a v a i l a b l e v i a SUMEX, the u t i l i z a t i o n of these programs i n h e l p i n g to solve s t r u c t u r e e l u c i d a t i o n problems and the l i m i t a t i o n s we see to t h e i r use. We d i s c u s s current a p p l i c a t i o n s of the programs to our research and the research of other users to i l l u s t r a t e b e t t e r the v a r i e t y of p o t e n t i a l a p p l i c a t i o n s and to stimulate an i n t e r change of ideas. Where a p p r o p r i a t e , we p o i n t out current d i f f i c u l t i e s with the use both of our programs and of SUMEX. New a p p l i c a t i o n s and wider use w i l l c e r t a i n l y change the nature of these problems; we s t r i v e t o s o l v e current problems, but new ones w i l l always a r i s e t o take t h e i r p l a c e . DENDRAL Programs We have s e v e r a l programs which we employ i n d e a l i n g with v a r i o u s aspects of problems i n v o l v i n g unknown s t r u c t u r e s . Some of these programs a r e exportable, w h i l e the remainder are a v a i l a b l e at SUMEX. The a v a i l a b i l i t y o f each program i s discussed below. Our i n i t i a l emphasis i n studying a p p l i c a t i o n s of a r t i f i c i a l i n t e l l i g e n c e f o r chemical i n f e r e n c e was i n the area of mass spectrometry (4-6). This emphasis remains because many of our problems r e q u i r e mass spectrometry as the a n a l y t i c a l t o o l of choice i n p r o v i d i n g s t r u c t u r a l information on small q u a n t i t i e s of sample. More r e c e n t l y , we have been developing a program (C0NGEN, below) d i r e c t e d a t more general aspects of s t r u c t u r e e l u c i d a t i o n . This has extended the scope of problems f o r which we can provide

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

13.

CARHART

ET

AL.

Collaborative

Research

Community

199

computer a s s i s t a n c e . We w i l l begin, however, with d i s c u s s i o n of the mass spectrometry programs. The examples used i n the d i s c u s s i o n are c h a r a c t e r i s t i c of our c u r r e n t research problems, although we have focused on r e l a t i v e l y simple problems to keep the p r e s e n t a t i o n b r i e f . We t r a c e , i n what might be c h r o n o l o g i c a l terms, the a p p l i c a t i o n of the programs to v a r i o u s phases of a s t r u c t u r e problem. In t h i s way we hope to i l l u s t r a t e the p l a c e of each program i n the a n a l y s i s . We begin by d i s c u s s i n g preprocessing of mass s p e c t r a l data (CLEANUP and MOLION). Subsequent a n a l y s i s of such data i n terms of s t r u c t u r e i s then covered (PLANNER). The use of CONGEN i s discussed f o r problems which cannot be handled by the previous programs. F i n a l l y , we d i s c u s s e f f o r t s to d i s c o v e r , with the use of the computer, systematics i n the behavior of known substances i n the mass spectrometer as a means of extending the knowledge of the system f o r a p p l i c a t i o n s i n new areas (INTSUM and RULEGEN). Programs f o r Molecular S t r u c t u r e Problems The f i r s t three programs, CLEANUP, the l i b r a r y - s e a r c h program and MOLION are i n a sense u t i l i t y programs, but a l l three p l a y a c r i t i c a l r o l e i n p r o c e s s i n g mass s p e c t r a l data. Subsequent a p p l i c a t i o n s of programs (e.£., PLANNER) f o r more d e t a i l e d s p e c t r a l a n a l y s i s i n terms of s t r u c t u r e depend on the s u c c e s s f u l treatment of the data by CLEANUP and MOLION, w h i l e the l i b r a r y search program f i l t e r s out common s p e c t r a which need not undergo a f u l l a n a l y s i s . The examples used are drawn from our c o l l a b o r a t i o n w i t h persons i n the Genetics Research Center a t Stanford H o s p i t a l . The experimental data which are c o l l e c t e d are the r e s u l t s of combined gas chromatographic/low r e s o l u t i o n mass s p e c t r a l (GC/LRMS) analys i s of organic components (chemically f r a c t i o n a t e d and d e r i v a t i z e d where necessary) of body f l u i d s , blood, u r i n e . A t y p i c a l experiment c o n s i s t s of 500-600 i n d i v i d u a l mass s p e c t r a f o r each f r a c t i o n , taken s e q u e n t i a l l y over time as the v a r i o u s components, l a r g e l y separated from one another, e l u t e from the gas chromatograph and pass i n t o the mass spectrometer. Each mass spectrum c o n s i s t s of the mass analyzed fragment ions of the component(s) i n the mass spectrometer a t the time the spectrum was taken. Such s p e c t r a are r e l a t e d , i n d i r e c t l y , to the molecular s t r u c t u r e of the component(s). CLEANUP(7). The i n d i v i d u a l mass s p e c t r a obtained from f r a c t i o n a t e d GC/LRMS a n a l y s i s are q u i t e o f t e n poor r e p r e s e n t a t i o n s of corresponding s p e c t r a taken from pure compounds. They can be contaminated by the presence of a d d i t i o n a l peaks and/or d i s t o r t i o n s of the i n t e n s i t i e s of e x i s t i n g peaks i n the spectrum. Fragment ions from e i t h e r the l i q u i d phase of the GC column or from components incompletely separated by the gas chromatograph are r e s p o n s i b l e f o r the contamination. We have developed a program, r e f e r r e d to here as CLEANUP, which examines a l l mass s p e c t r a i n a

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

200

C O M P U T E R NETWORKING AND

CHEMISTRY

GC/LRMS run, s e l e c t s those s p e c t r a which contain ions other than background i m p u r i t i e s , and removes c o n t r i b u t i o n s from background and overlapping components. A spectrum r e s u l t s which compares f a v o r a b l y with the spectrum of a pure component. B i l l e r and Biemann (8) have developed a s i m i l a r but l e s s powerful program. For example, the CLEANUP program detected components at p o i n t s marked with a v e r t i c a l bar i n the ( p a r t i a l ) p l o t of t o t a l i o n current v s . scan number (time), F i g u r e 3. Note that o v e r l a p ping components were detected under the envelopes of the GC peaks i n the r e g i o n of scans 485-488, 525-529 and 539-552. We focus our a t t e n t i o n on the spectrum recorded at scan 492. The raw data, p r i o r to cleanup, are presented i n F i g u r e 4 (top). The spectrum r e s u l t i n g from CLEANUP i s presented i n Figure 4 (bottom). Note that the l a r g e ions (e.g., m/e 207, 221 and 315) from background i m p u r i t i e s are removed, and that the i n t e n s i t y r a t i o s of peaks at lower masses (e.g., 51 and 77) have been adjusted to r e f l e c t t h e i r true i n t e n s i t i e s i n the spectrum. The CLEANUP program i s capable of d e t e c t i o n of q u i t e l o w - l e v e l components i n complex mixtures as i n d i c a t e d by some of the areas of the t o t a l i o n current p l o t (Figure 3) where components were detected. I t i s completely general because nothing i n the program code i s s e n s i t i v e to the types of compounds analyzed or the c h a r a c t e r i s t i c s of p o s s i b l e i m p u r i t i e s a s s o c i a t e d with the compounds or from the GC column. I t s major l i m i t a t i o n i s that mass s p e c t r a must be taken r e p e t i t i v e l y during the course of a GC/MS run. I t s performance i s enhanced when such s p e c t r a are measured c l o s e l y i n time. The program i s o f f e r e d v i a SUMEX as an adjunct to use of our other programs; i t i s not o f f e r e d as a r o u t i n e s e r v i c e . Because the program i s w r i t t e n i n FORTRAN, we r o u t i n e l y use i t on our data a c q u i s i t i o n computer system so as not to burden SUMEX with tasks b e t t e r done elsewhere. S i m i l a r l y , we would a s s i s t other frequent users to mount the program on t h e i r own systems. L i b r a r y Search. With a s e t of " c l e a n " mass s p e c t r a a v a i l a b l e , the next problem i s i d e n t i f i c a t i o n of the v a r i o u s components. Over the course of s e v e r a l years, l i b r a r i e s of mass s p e c t r a l data have been assembled (9). These l i b r a r i e s can be very u s e f u l i n weeding out from a group of s p e c t r a those which represent known compounds (10). C l e a r l y , one should spend time on s o l v i n g the s t r u c t u r e s of unknown compounds, not on r e d i s c o v e r i n g o l d ones. The CLEANUP program provides mass s p e c t r a which are of s u f f i c i e n t q u a l i t y to expect that known compounds would be i d e n t i f i e d e a s i l y from such l i b r a r i e s . The s p e c t r a detected by CLEANUP i n the r e g i o n of scans 480 to 580 (Figure 3) were matched against the l i b r a r y of b i o m e d i c a l l y r e l e v a n t s p e c t r a compiled by S. Markey (National I n s t i t u t e s of Health) and our extensions to that l i b r a r y (we wish to thank S. Grotch, J e t P r o p u l s i o n Laboratory, Pasadena, Ca. f o r p r o v i d i n g some of the l i b r a r y matchings). E x c e l l e n t matches with the

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

4.

Spectrum number 492 from the GC/LRMS

by

trace shown in Figure 3: (top) raw data; (bottom) spectrum output CLEANUP

m/e

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

h-

1

to ο

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

202

C O M P U T E R NETWORKING AND

CHEMISTRY

l i b r a r y were obtained f o r scans 492, 496, 509, 519, 529 and 548. The components are i n d o l e a c e t i c a c i d methyl e s t e r , d i - n - b u t l y phthalate, c a f f e i n e , s a l i c y l u r i c a c i d methyl e s t e r , methoxyh i p p u r i c a c i d methyl e s t e r and n-C24 hydrocarbon r e s p e c t i v e l y ( s t r u c t u r e s given i n Figures 3 and 4). Spectra scans at 485, 487, 525, 530, 536, 539, 554 and 576 d i d not match w e l l w i t h any spectrum i n the l i b r a r y and thus must be examined f u r t h e r f o r s t r u c t u r i n g information. The n e c e s s i t y f o r preprocessing the data using CLEANUP p r i o r to l i b r a r y matching i s i l l u s t r a t e d from i n d o l e a c e t i c a c i d methyl e s t e r (scan 492). The " c l e a n " spectrum (Figure 4, bottom) was matched to the l i b r a r y spectrum of t h i s compound much b e t t e r than to that of any other compound. The raw spectrum (Figure 4, top), when compared to the l i b r a r y r e s u l t e d i n eleven other compounds which matched more c l o s e l y that the c o r r e c t one. This b r i e f example i l l u s t r a t e s the obvious v a l u e and l i m i t a t i o n s of l i b r a r y searching. The most i n t e r e s t i n g compounds f o r subsequent a n a l y s i s are those which are unknown. The f r a c t i o n s of u r i n e e x t r a c t s are r e p l e t e with u n i d e n t i f i e d compounds because of the inadequacy of current l i b r a r y compilations. As new compounds are i d e n t i f i e d they are, of course, added to the l i b r a r y so that f u t u r e analyses need not r e i n v e s t i g a t e the same m a t e r i a l . We c u r r e n t l y perform l i b r a r y searching on our data a c q u i s i t i o n and r e d u c t i o n computer systems. We can, i f necessary, o f f e r l i m i t e d l i b r a r y search f a c i l i t i e s v i a SUMEX. However, because commercial f a c i l i t i e s are a v a i l a b l e (e.&., over the GE network), r o u t i n e l i b r a r y search s e r v i c e i s not a v a i l a b l e on SUMEX. MOLION(ll). At t h i s stage we are l e f t with a c o l l e c t i o n of mass spectra of unknown compounds. The l i b r a r y search r e s u l t s may have provided some clues as to the type of compound present, je.j£., compound c l a s s . Structure e l u c i d a t i o n now begins i n earnest. The key elements i n problems of s t r u c t u r e e l u c i d a t i o n are the molecul a r weight and e m p i r i c a l formula of a compound. Without these e s s e n t i a l data, the s t r u c t u r a l p o s s i b i l i t i e s are u s u a l l y too immense to proceed f u r t h e r . Mass spectrometry i s f r e q u e n t l y used to determine molecular weights and formulae, but there i s no guarantee that the mass spectrum of a compound d i s p l a y s an i o n c o r r e s ponding to the i n t a c t molecule. For example, many of the d e r i v a t i v e s of the amino a c i d f r a c t i o n s of u r i n e d i s p l a y no molecular i o n s . When we are given only the mass spectrum (and f o r GC/MS a n a l y s i s a mass spectrum may be a l l that i s a v a i l a b l e ) we must somehow p r e d i c t l i k e l y molecular i o n candidates, The program MOLION performs t h i s task. Given a mass spectrum, i t p r e d i c t s and ranks l i k e l y molecular i o n candidates independent of the presence or absence of an i o n i n the spectrum corresponding to the i n t a c t molecule. The published manuscript (11) provides many examples of the performance of the program.

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

13.

CARHART

Collaborative

E T AL.

Research

Community

203

The mass spectrum of an example, unknown X, (which we w i l l pursue i n more d e t a i l below) i s given i n F i g u r e 5. The r e s u l t s obtained from MOLION are summarized i n Table I . The observed i o n at m/e 263 i s ranked as the most l i k e l y candidate.

138

100.

C

11H12N°3F3

-56

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

Π 94 93

150

100

263

-73

206

IJJ. • • Ifc.L •

50

z69_

190

250

200

m/e SUPERATOMS c

4

HgOOC-

C O N G E N RESULT 1

^ - ^ C O O C

4

H

g

Figure 5. Low-resolution mass spectrum of unknown X. The indicated superatoms were deduced from the spectrum and the chemical history of the sample. Based on these and other constraints, CONGEN obtains the indicated result.

Table I .

Results of Molecular Ion Determination f o r the Unknown Compound, X, whose Mass Spectrum i s Presented i n F i g u r e 5. CANDIDATE 263.0 307.0 299.0 295.0 281.0

RANKING INDEX 100 41 38 34 25

The MOLION program i s w r i t t e n to operate on e i t h e r low o r high r e s o l u t i o n mass s p e c t r a . The program has c e r t a i n l i m i t a t i o n s which have been summarized i n d e t a i l p r e v i o u s l y (11). MOLION i s a v a i l a b l e on SUMEX. A FORTRAN v e r s i o n , i n i t i a l l y f o r low r e s o l u t i o n mass s p e c t r a , i s being w r i t t e n so that the pro­ gram can be run on smaller computers and exported to o t h e r s . How­ ever, i t w i l l continue to be a v a i l a b l e v i a SUMEX so that others can access i t e a s i l y . MOLION i s contained w i t h i n PLANNER as one of the a v a i l a b l e methods f o r d e t e c t i n g candidate molecular i o n s .

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

204

COMPUTER

NETWORKING AND

CHEMISTRY

PLANNER(12). The PLANNER program i s designed to analyze the mass spectrum of a compound or of a mixture of r e l a t e d compounds. Because there i s no ab i n i t i o way of r e l a t i n g a mass spectrum of a complex organic molecule to the s t r u c t u r e of that molecule, PLANNER r e q u i r e s fragmentation r u l e s f o r the c l a s s of compounds to which the unknown belongs. T h i s i s i t s major l i m i t a t i o n . For our example the c l a s s was unknown, f o r c i n g us to r e s o r t to other means of a s s i s t a n c e . A p p l i c a t i o n s and l i m i t a t i o n s of PLANNER have been discussed e x t e n s i v e l y (12,13). The program i s very powerful i n instances where mass spectrometry r u l e s are strong (i.e.., g e n e r a l , w i t h few e x c e p t i o n s ) . In instances where r u l e s are weak or nonexistent, a d d i t i o n a l work on known s t r u c t u r e s and s p e c t r a may y i e l d u s e f u l r u l e s to make PLANNER a p p l i c a b l e (see INTSUM and RULEGEN, below) . One important f e a t u r e of PLANNER i s i t s a b i l i t y to analyze the s p e c t r a of mixtures i n a systematic and thorough way. Thus, i t can be a p p l i e d to s p e c t r a obtained as mixtures when GC/MS data are u n a v a i l a b l e or impossible to o b t a i n . PLANNER i s a v a i l a b l e i n an i n t e r a c t i v e v e r s i o n over SUMEX, r e q u i r i n g three kinds of informat i o n as i n p u t : the h i g h or low r e s o l u t i o n mass spectrum^ the c h a r a c t e r i s t i c s k e l e t a l s t r u c t u r e f o r molecules i n the s p e c i f i c compound c l a s s , and the fragmentation r u l e s f o r the c l a s s . Addit i o n a l knowledge about the unknown can be used by the program to c o n s t r a i n the s t r u c t u r a l p o s s i b i l i t i e s . C0NGEN(14,15). S t r u c t u r e problems are u s u a l l y not solved with mass spectrometry alone. Even when sample s i z e i s too l i m i t e d f o r o b t a i n i n g other s p e c t r o s c o p i c data, knowledge of chemical i s o l a t i o n and r e s u l t s of d e r i v a t i z a t i o n procedures f r e quently a c t as powerful c o n s t r a i n t s on s t r u c t u r a l p o s s i b i l i t i e s . Larger amounts of sample permit determination of other s p e c t r o s c o p i c data. Taken together, t h i s i n f o r m a t i o n allows determinat i o n of s t r u c t u r a l f e a t u r e s (substructures) of the molecule and c o n s t r a i n t s on the p l a u s i b i l i t y of ways i n which the substructures may be assembled. The CONGEN program i s capable of prov i d i n g a s s i s t a n c e i n s o l u t i o n of such problems. CONGEN performs the task of c o n s t r u c t i o n , or generation, of s t r u c t u r a l isomers under c o n s t r a i n t s . The program accepts as input known s t r u c t u r a l fragments of the molecule ("superatoms") and any remaining atoms (C,N,0,P,...), together w i t h c o n s t r a i n t s on how they may be assembled. I t i s based on the exhaustive s t r u c t u r e generator (16,17) and extensions (18) which permit a stepwise assembly of s t r u c t u r e s . In an i n t e r a c t i v e s e s s i o n with the program, a user s u p p l i e s s t r u c t u r a l i n f o r m a t i o n determined by h i s own a n a l y s i s of the data (perhaps w i t h the help of the above programs), together w i t h whatever other c o n s t r a i n t s are a v a i l a b l e concerning d e s i r e d and undes i r e d s t r u c t u r a l f e a t u r e s , r i n g s i z e s and so f o r t h . The program b u i l d s s t r u c t u r e s i n a s e r i e s of s t e p s , during which a user can i n t e r a c t f u r t h e r w i t h the procedure, f o r example, to add new

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

13.

CARHART

E T

AL.

Collaborative

Research

Community

205

c o n s t r a i n t s . Although very much a developing program, i t s a b i l i t y to accept u s e r - i n f e r r e d c o n s t r a i n t s from many data sources makes CONGEN a general t o o l f o r s t r u c t u r e e l u c i d a t i o n which we are mak­ i n g a v a i l a b l e v i a SUMEX-AIM i n i t s c u r r e n t form. For the unknown X, the observed fragment ions from the molecular i o n (M) a t m/e 263 (Figure 5) suggest s e v e r a l s t r u c t u r a l features when coupled w i t h the knowledge of the chemical d e r i v a t i z a t i o n procedures used on t h i s f r a c t i o n of the u r i n e e x t r a c t . The i o n at m/e 194 represents l o s s of 69 amu, probably CF3, from f r a g ­ mentation of a t r i f l u o r o a c e t y l d e r i v a t i v e of an amine. T h i s sug­ gests the p a r t i a l s t r u c t u r e 2, F i g u r e 5. The ions a t m/e 190 (M-74 amu) and m/e 162 (M-101 amu) suggest the c h a r a c t e r i s t i c fragmentation of an η-butyl e s t e r r e s u l t i n g from the second d e r i v a t i z a t i o n procedure, formation of the η-butyl e s t e r s of f r e e c a r b o x y l i c a c i d f u n c t i o n s . This suggests the p a r t i a l s t r u c t u r e _1, F i g u r e 5. Taken together, a l l the above i n f o r m a t i o n i m p l i e s ( i f no other elements are present) that the e m p i r i c a l formula contains an odd number of n i t r o g e n atoms, at l e a s t three oxygen atoms, three f l u o r i n e atoms and at l e a s t seven carbon atoms. I n t e r e s t ­ i n g l y , there i s only one p l a u s i b l e e m p i r i c a l formula under these c o n s t r a i n t s , C11H12N03F3. S t r u c t u r a l fragments ("superatoms") _1 and 2 were s u p p l i e d to CONGEN, together with the remaining four carbon atoms and three degrees of u n s a t u r a t i o n (that i s , r i n g s p l u s m u l t i p l e bonds). With no a d d i t i o n a l c o n s t r a i n t s , 155 s t r u c t u r e s r e s u l t . The i n c l u ­ s i o n of other p l a u s i b l e c o n s t r a i n t s (e._g., no aliènes, a c e t y l e n e s , cyclopropenes, cyclobutenes) reduces the number of s t r u c t u r a l cand i d a t e s to j u s t the two isomeric forms of _2, F i g u r e 5. This problem represents a simple example of a l a r g e c l a s s of such problems. Although a chemist could probably reach the same c o n c l u s i o n s q u i c k l y i n t h i s case, i n the general case, p i e c i n g together p o t e n t i a l s o l u t i o n s i s not a t r i v i a l task. Although s t i l l a developing program, CONGEN i s capable of considerable a s s i s t a n c e i n a wide v a r i e t y of s t r u c t u r e problems. Some areas of current a p p l i c a t i o n are summarized i n the subsequent s e c t i o n . I t i s already proving i t s v a l u e i n s t r u c t u r e e l u c i d a t i o n problems by suggesting s o l u t i o n s w i t h a guarantee that no p l a u s i b l e a l t e r n a t i v e s have been overlooked. The program has a great d e a l of f l e x i b i l i t y . Many of the types of c o n s t r a i n t s normally brought to bear on s t r u c t u r e e l u c i d a t i o n problems can be expressed. However, some types of cons t r a i n t s cannot be e a s i l y expressed (e.g., d i s j u n c t i o n s of f e a t u r e s and s t e r e o - c o n s t r a i n t s ) . Recent work by our group and Wipke's (19) w i l l make i t p o s s i b l e to add c o n s i d e r a t i o n s of stereoisomerism r e l a t i v e l y e a s i l y (a good example of c o l l a b o r a t i o n v i a SUMEX). We are depending on a broad user community to help us guide f u r t h e r development of CONGEN.

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

206

COMPUTER

NETWORKING

AND CHEMISTRY

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

Programs f o r Knowledge A c q u i s i t i o n INTSUM(20) and RULEGEN(21). When the mass spectrometry r u l e s f o r a given c l a s s of compounds are not known, the INTSUM and RULEGEN programs can help a chemist formulate those r u l e s . Essent i a l l y , these programs c a t e g o r i z e the p l a u s i b l e fragmentations f o r a c l a s s of compounds by l o o k i n g a t the mass s p e c t r a of s e v e r a l molecules i n the c l a s s . A l l molecules a r e assumed to belong to one c l a s s whose s k e l e t a l s t r u c t u r e must be s p e c i f i e d . A l s o , the mass s p e c t r a and the s t r u c t u r e s of a l l the molecules must be given to the program. INTSUM c o l l e c t s evidence f o r a l l p o s s i b l e fragmentations (within u s e r - s p e c i f i e d c o n s t r a i n t s ) and summarizes the r e s u l t s . For example, a user may be i n t e r e s t e d i n a l l fragmentations i n v o l v i n g one or two bonds, but not three; aromatic r i n g s may be known to be unfragmented; and the user may be i n t e r e s t e d only i n fragmentations r e s u l t i n g i n an i o n c o n t a i n i n g a heteroatom. Under these c o n s t r a i n t s , the program c o r r e l a t e s a l l peaks i n the mass s p e c t r a with a l l p o s s i b l e fragmentations. The summary of r e s u l t s shows the molecules whose s p e c t r a d i s p l a y evidence f o r each p a r t i c u l a r fragmentation, along w i t h the t o t a l (and average) i o n current a s s o c i a t e d with the fragmentation. The RULEGEN program attempts to e x p l a i n the r e g u l a r i t i e s found by INTSUM i n terms of the u n d e r l y i n g s t r u c t u r a l f e a t u r e s around the bonds i n question that seem to " d i r e c t " the fragmentat i o n s . For example, INTSUM w i l l n o t i c e s i g n i f i c a n t fragmentation of the two d i f f e r e n t bonds alpha to the carbonyl group i n a l i p h a t i c ketones. I t i s l e f t to RULEGEN to d i s c o v e r that these a r e both instances of the same fundamental alpha-cleavage process that can be p r e d i c t e d any time a bond i s alpha to a carbonyl group. These programs are p a r t of the s o - c a l l e d Meta-DENDRAL e f f o r t , whose general goal i s to understand r u l e formation a c t i v i t i e s . Both INTSUM and RULEGEN a r e a v a i l a b l e as i n t e r a c t i v e programs on SUMEX, the former being much more h i g h l y developed that the l a t t e r ; Although these programs can be very u s e f u l to chemists i n t e r e s t e d i n f i n d i n g new mass spectrometry r u l e s , they r e q u i r e having the c o l l e c t i o n of mass s p e c t r a and molecular s t r u c t u r e d e s c r i p t i o n s a v a i l a b l e i n one computer f i l e . Because of t h i s , they have been used mostly by chemists a t Stanford. A p p l i c a t i o n s and Resource

Sharing

The DENDRAL programs are being developed t o serve a broad community of chemists with s t r u c t u r e e l u c i d a t i o n problems. Our experience i s admittedly l i m i t e d . In t h i s s e c t i o n we d i s c u s s some of the a p p l i c a t i o n s , both l o c a l and from remote s i t e s , where these programs have proven u s e f u l . CLEANUP and MOLION. These programs are i n r o u t i n e use as p a r t of the Genetics Research Center's GC/LRMS e f f o r t s . In

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

13.

CARHART ET

AL.

Collaborative

Research

Community

207

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

a d d i t i o n , MOLION has been incorporated as part of PLANNER. T h e i r g e n e r a l i t y has proven very u s e f u l i n a p p l i c a t i o n s to a v a r i e t y of GC/MS problems i n v o l v i n g s t r u c t u r a l s t u d i e s of u r i n a r y metabolites. PLANNER. The planning program has been used to i n f e r p l a u s i b l e placement of s u b s t i t u e n t s around a s k e l e t a l s t r u c t u r e f o r numerous t e s t problems i n which the c l a s s of the sample was known and the fragmentation r u l e s f o r the c l a s s were known. Those t e s t s have r e s u l t e d i n a program that we b e l i e v e i s general. We have a p p l i e d t h i s program to unknown mixtures of estrogenic s t e r i o d s (13). We are preparing to use PLANNER f o r screening mass spectra of marine s t e r o l s to i d e n t i f y q u i c k l y those spectra of known compounds and to suggest s t r u c t u r e s f o r spectra of new compounds. CONGEN. CONGEN i s being used l o c a l l y and from remote s i t e s i n a wide v a r i e t y of a p p l i c a t i o n s . We have used i t f o r construct i o n of r i n g systems under c o n s t r a i n t s (22) and f o r generation of s t r u c t u r e s of chlorocarbons (23). We have i n v e s t i g a t e d s e v e r a l monoterpenoid and sesquiterpenoid s t r u c t u r e problems to suggest s o l u t i o n s and to ensure that a l l a l t e r n a t i v e s had been considered. We are c u r r e n t l y i n v e s t i g a t i n g the scope of terpenoid isomerism. Two problems r e l a t i n g to unknown photochemical r e a c t i o n products have been analyzed and r e s u l t s used to suggest f u r t h e r e x p e r i ments. In most cases we do not know the p r e c i s e problems under study by remote users, only that they are using the program. CONGEN w i l l perhaps be the most widely used (by remote users) program of those mentioned above as a c c e s s i b l e through SUMEX. This i s p r i m a r i l y a r e s u l t of the wider scope of problems which might b e n e f i t from use of the program. However, the need f o r remote users to have t h e i r mass s p e c t r a l data a v a i l a b l e at SUMEX f o r a n a l y s i s present a s i g n i f i c a n t energy b a r r i e r to use of the programs which r e q u i r e these data. INTSUM and RULEGEN. INTSUM i s e s s e n t i a l l y a production program now, and i s being used as such i n a v a r i e t y of a p p l i c a t i o n s i n v o l v i n g c o r r e l a t i o n s of molecular s t r u c t u r e s with t h e i r respective spectra. Recent or current a p p l i c a t i o n s i n c l u d e analys i s of the mass spectra of progestérones and r e l a t e d s t e r o i d s , androstanes, macrolide a n t i b i o t i c s , i n s e c t j u v e n i l e hormones and phytoecdysones. These s t u d i e s serve to develop fragmentation r u l e s which, i f of s u f f i c i e n t g e n e r a l i t y , can i n turn be used i n PLANNER i n the study of unknown compounds. III.

PROBLEMS RELATED TO NETWORKING

During t h i s f i r s t year of operation, the SUMEX-AIM f a c i l i t y has encountered a v a r i e t y of problems a r i s i n g from i t s network availability. In most cases, there has been no c l e a r precedent

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

208

COMPUTER NETWORKING A N D CHEMISTRY

f o r the handling of these s i t u a t i o n s , i n f a c t , many problem-areas s t i l l r e f l e c t the i n f l u e n c e s of a yet-developing p o l i c y . The hope i s that t h i s p r e s e n t a t i o n and d i s c u s s i o n of problems and t h e i r s o l u t i o n s may give f o r e s i g h t t o others who contemplate networking or network use. The problems to be discussed can be l o o s e l y a s s o c i a t e d i n t o three c l a s s e s ; those r e l a t e d t o the management of the f a c i l i t y , those p e r t a i n i n g to research a c t i v i t i e s on the s y s tem, and those i n v o l v i n g p s y c h o l o g i c a l b a r r i e r s to network use.

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

Managerial

Problems

"Gatekeeping". The most general problem faced by the organizers of the SUMEX-AIM f a c i l i t y i s the question of "gatekeeping." In order t o i n s u r e a high q u a l i t y of p e r t i n e n t research, some k i n d of r e f e r e e i n g system i s needed to assess the value of proposed new p r o j e c t s . The o r g a n i z e r s of the f a c i l i t y would seem to be the best source of such judgements; y e t , because we are both organizers and members of the SUMEX community, there i s a danger that our d e c i s i o n s would u n f a i r l y favor l o c a l p r i o r i t i e s . In order to e s t a b l i s h c r e d i b i l i t y i n SUMEX-AIM as a t r u l y n a t i o n a l resource, a management system has been i n s t i t u t e d that a l l o c a t e s a defined f r a c t i o n ( i n i t i a l l y 50%) of the SUMEX resource to e x t e r n a l users, under the j u r i s d i c t i o n of an independent n a t i o n a l committee (the AIM advisory group). The remaining 50%, a l l o c a t e d f o r l o c a l use, contains a p o r t i o n f o r f l e x i b l e experiments outside of l o c a l p r o j e c t s , but on our own r e s p o n s i b i l i t y . Choice of computer and operating system. A second management l e v e l problem i s the choice of a computer and operating system which optimize the usefulness of the f a c i l i t y f o r a m a j o r i t y of users, and which encourage intercommunication between remote c o l l a b o r a t o r s . Because SUMEX-AIM i s intended to be used p r i m a r i l y f o r a p p l i c a t i o n s of a r t i f i c i a l i n t e l l i g e n c e , and because i n t e r a c t i v e LISP (24) i s a primary language i n t h i s type of work, the choice of TENEX (25) as an operating system was d i c t a t e d somewhat by n e c e s s i t y . TENEX incorporates m u l t i p l e address spaces, thereby a l l o w i n g m u l t i p l e " f o r k " s t r u c t u r e and paging, a design which i s necessary to c r e a t e the l a r g e memory v i r t u a l machine r e q u i r e d by INTERLISP. The PDP-10 i s a popular machine f o r i n t e r a c t i v e computing of a l l s o r t s i n u n i v e r s i t y research environments, and thus an added b e n e f i t of t h i s choice was expected - the p o s s i b i l i t y of e a s i l y t r a n s f e r r i n g to SUMEX programs developed a t other s i t e s . Many of these programs were w r i t t e n not under TENEX but under the 10/50 monitor s u p p l i e d by the manufacturer. Because a l a r g e and u s e f u l program l i b r a r y was already a v a i l a b l e under the 10/50 monitor, one of the design c r i t e r i a of TENEX was c o m p a t i b i l i t y with such programs; when a 10/50 program i s run under TENEX, a s p e c i a l "compati b i l i t y package" of r o u t i n e s i s invoked to t r a n s l a t e 10/50 monitor c a l l s i n t o equivalent TENEX monitor c a l l s . Although the concept

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

13.

CARHART E T A L .

Collaborative

Research

Community

209

i s sound, we have found that i n p r a c t i c e very few programs w r i t t e n f o r the 10/50 monitor are able t o run under TENEX without extensive m o d i f i c a t i o n . Other problems with TENEX i n c l u d e weaknesses i n the support of p e r i p h e r a l devices and the l a c k of a d e f a u l t l i n e - e d i t o r . The l a t t e r has caused a p r o l i f e r a t i o n of e d i t i n g programs, and some confusion has r e s u l t e d because e d i t o r conventions vary from program t o program. These d i f f i c u l t i e s have dampened somewhat our i n i t i a l enthusiasm f o r the TENEX system. Nonetheless, TENEX provides some f e a t u r e s which a r e c r u c i a l to a comfortable network environment. The standard support programs i n c l u d e d with t h i s system f a c i l i t a t e both the sending of messages to other users ( e i t h e r a t the same s i t e or a t other s i t e s on the ARPA network) and the t r a n s f e r of data and programs from s i t e t o s i t e on that network; a l s o the a b i l i t y to " l i n k " two or more terminals allows users t o communicate e a s i l y and immediately. Both the l i n k i n g and message f a c i l i t y have been found to be i n v a l u a b l e a i d s i n inter-group communications and i n such problems as i n t e r a c t i v e program debugging. When two t e r m i n a l s are l i n k e d , t h e i r output streams are merged, thus a l l o w i n g each t e r m i n a l to d i s p l a y everything typed a t the other t e r m i n a l . Since only the output stream i s a f f e c t e d under these circumstances, i t i s s t i l l p o s s i b l e f o r each t e r m i n a l to be used to provide input to separate programs, i n a d d i t i o n to being used i n a c o n v e r s a t i o n a l mode. Resource a l l o c a t i o n . As noted above, the computational resources of the SUMEX-AIM f a c i l i t y a r e apportioned by the AIM advisory group and SUMEX management. Some extensions to the b a s i c TENEX system have been made to r e f l e c t t h i s a p p o r t i o n i n g i n the a c t u a l use of the f a c i l i t y . B a s i c a l l y , i t was recognized that users of the f a c i l i t y are members o f groups working on s p e c i f i c p r o j e c t s , and i t i s among these p r o j e c t s that the f a c i l i t y i s apportioned. Disk space and cpu c y c l e s a r e now d i s t r i b u t e d among groups i n s t e a d of among i n d i v i d u a l u s e r s . For example, a user may exceed h i s i n d i v i d u a l d i s k a l l o c a t i o n somewhat without any i l l e f f e c t , so long as the t o t a l a l l o c a t i o n of h i s group remains w i t h i n the l i m i t s . S i m i l a r l y , a Reserve A l l o c a t i o n Scheduler has been added to TENEX which t r i e s to match the a d m i n i s t r a t i v e c y c l e d i s t r i b u t i o n over a n i n e t y second time frame. Thus a p a r t i c u l a r group cannot dominate the machine i f a l o t of i t s members are logged i n a t one time. I t i s t y p i c a l f o r usage of a f a c i l i t y to peak through the middle hours of the day. Indeed, one of the advantages of having users from around the country i s the spreading of the load caused by the d i f f e r e n c e i n time zones. Even so, the f a c i l i t y could o f f e r b e t t e r s e r v i c e i f more people would s h i f t t h e i r main usage hours toward e i t h e r end of the day. To encourage " s o f t - s c h e d u l i n g " w i t h i n groups on the system, SUMEX-AIM p u b l i s h e s a weekly p l o t of d i u r n a l l o a d i n g . T h i s p l o t shows the t o t a l number of jobs on the system as w e l l as the number of LISP j o b s , s i n c e these jobs seem to make the b i g g e s t demands of system resources. The r e s u l t

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

210

COMPUTER

NETWORKING AND

CHEMISTRY

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

has been an Increased awareness by users of system l o a d i n g and a n o t i c e a b l e i n c r e a s e i n the number of users at a l l hours of the n i g h t and e a r l y morning. P r o t e c t i o n and system s e c u r i t y . P r o t e c t i o n f o r a computer system covers a range of i d e a s . I t means the a b i l i t y to maintain secrecy - f o r example, to guarantee the p r i v a c y of p a t i e n t records. I t a l s o guarantees i n t e g r i t y by a s s u r i n g that programs and data are not modified by an unauthorized p a r t y . Questions of p r o t e c t i o n g e n e r a l l y become more i n t e r e s t i n g and complex as more s h a r i n g i s i n v o l v e d . Consider the example of a p r o p r i e t o r y program which generates l a y o u t s given a user's c i r c u i t data. The program owner demands assurance that he w i l l be p a i d whenever h i s program i s used and that copies of the program cannot be made. The user wants guarantees that h i s data s e t s cannot be destroyed or copied f o r a competitor. Yet the user must have access to the program and the program must have access to the d a t a Unable to support such complicated examples of p r o t e c t i o n , SUMEXAIM assumes that s h a r i n g takes p l a c e between f r i e n d l y u s e r s . T h i s i s not to imply that i s s u e s of p r o t e c t i o n and s h a r i n g have not appeared. For example, i n an e f f o r t to improve the human e n g i n e e r i n g of programs f o r p u b l i c use, the c a p a b i l i t y of r e c o r d i n g a s e s s i o n has been b u i l t i n t o s e v e r a l of the programs. Studied by the programs' designers to p i n p o i n t confusing aspects of programs, these recordings serve to improve program design. Since the i s s u e of v i o l a t i o n of p r i v a c y has been r a i s e d , some of these programs now request permission to r e c o r d a s e s s i o n b e f o r d doing so. At t h i s time, any guarantee of p r i v a c y must be provided by the program designer because TENEX i t s e l f does not have the a b i l i t y to render p r o t e c t i o n . The general design f o r systems o f f e r i n g " s t a t e of the a r t " p r o t e c t i o n i n v o l v e s a t o l e r a n c e f o r f a i l u r e ; that i s , i f a potent i a l offender succeeds i n breaking through some of the defenses, he s t i l l does not p l a c e the e n t i r e computer system at h i s mercy. Encrypting of data f i l e s provides an a d d i t i o n a l l i n e of defense. This method i s used by at l e a s t two calendar or appointment programs on the computer. At t h i s time, however, there are no general e n c r y p t i n g f a c i l i t i e s a v a i l a b l e and users must do t h i s f o r themselves as needed. TENEX provides the u s u a l keyword p r o t e c t i o n at l o g i n time and a measure of f i l e p r o t e c t i o n . Owners of a f i l e may a s s i g n a prot e c t i o n number which s p e c i f i e s some combination of READ, WRITE, EXECUTE, or APPEND access to a f i l e f o r owners, members of a group, or other users. T h i s l e v e l of p r o t e c t i o n i s b a s i c a l l y enough to prevent a c c i d e n t s and most m i s c h i e f . System programmer's around the country are aware of a number of TENEX bugs which permit t h i s access to be v i o l a t e d . One user of our system found a way to p l a c e h i m s e l f i n a mode where he could modify any f i l e on the system. To date, we have no examples of such a c t i v i t y a c t u a l l y having a d e l e t e r i o u s e f f e c t on SUMEX-AIM.

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

13.

CARHART

E T

AL.

CoUaborative

Research

Community

211

To make the use of SUMEX-AIM programs easy, on a t r i a l b a s i s f o r p r o s p e c t i v e users, a 'guest account system has been established. Since t h i s makes l o g g i n g i n t o SUMEX-AIM so easy, i t has i n v i t e d some misuse by people using t h i s account to p l a y computer games. A proposed extension to the system now being implemented i s a s p e c i a l "guest EXEC" which would extend the p r o t e c t i o n of the TENEX monitor by a l l o w i n g guest accounts access to only a more r e s t r i c t e d set of programs.

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

1

F i l e backup. In order to assure the user maximum p r o t e c t i o n against l o s s of v a l u a b l e work, SUMEX operates a m u l t i - l e v e l f i l e backup system. In a d d i t i o n to a r o u t i n e f i l e backup system, there are f a c i l i t i e s to enable the user to s e l e c t i v e l y a r c h i v e h i s or her d i s k f i l e s . By i s s u i n g a simple command to the TENEX execut i v e , the user can transmit a message to the operator to copy s p e c i f i e d f i l e s to magnetic tape. Each such f i l e i s copied to two magnetic tapes w i t h i n 24 hours of i s s u i n g the a r c h i v e command. F i l e r e t r i e v a l i s a f f e c t e d by a s i m i l a r process. The user a l s o has the a l t e r n a t i v e o p t i o n of being able to lodge f i l e s i n a s p e c i a l backup d i r e c t o r y . F i l e s are h e l d i n t h i s d i r e c t o r y u n t i l the next e x c l u s i v e f i l e dump (see below) a t which time they are d e l e t e d . In t h i s way the user can remove f i l e s from h i s d i r e c t o r y at h i s own choosing knowing they w i l l be a r c h i v e d by the e x c l u s i v e dump. On a system l e v e l , an e f f o r t i s made to maintain f i l e backups such that the maximum p o s s i b l e l o s s , i n the event of a crash f a t a l to the f i l e system, would amount to no more that one day's work. Once each day a l l f i l e s that have been read or w r i t ten w i t h i n the l a s t 48 hours are dumped onto magnetic tape. F i l e s that e x i s t f o r 48 hours are thus h e l d on two separate tapes. The r o t a t i o n p e r i o d f o r f i l e s dumped i n t h i s way i s 60 days. Once each week a f u l l f i l e dump i s made to separate d i s k storage. Each such dump i s kept f o r two weeks a t which time i t i s r e p l a c e d by a new f i l e dump. Each month there i s a f u l l system dump from d i s k to magnetic tape. F i l e s can be recovered from the system backup by sending a message to the operator s p e c i f y i n g the f i l e name(s) and when the f i l e was l a s t read or w r i t t e n ( i f such i n f o r m a t i o n is available). Excessive demand f o r p r o d u c t i o n programs. One of the concepts behind the c r e a t i o n of a shared resource i s e l i m i n a t i o n of the problems which a r i s e when l a r g e , complex computer programs are exported. Since, i n theory, e x p o r t a b i l i t y i s no longer a problem, there i s greater l a t i t u d e i n choice of a language i n which program development can take p l a c e . In the case of some of the DENDRAL programs, i t was thought that program development should take p l a c e i n INTERLISP, a language that lends i t s e l f w e l l to the a r t i f i c i a l i n t e l l i g e n c e nature of these programs, but does not l e a d to p a r t i c u l a r l y e f f i c i e n t run-time code.

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

212

COMPUTER NETWORKING

AND

CHEMISTRY

In order to a s c e r t a i n the u s e f u l n e s s of these programs and to determine what areas remain i n need of work, chemist c o l l a b o r a t o r s are being sought. As these users i n c r e a s e i n number and begin to use the programs more f r e q u e n t l y , i t i s almost c e r t a i n that the inherent slowness of the predominately LISP code w i l l a f f e c t the whole system as w e l l as handicap the e f f i c i e n t use of the DENDRAL programs. A d d i t i o n a l l y , some of the chemist-users who are f i n d i n g the programs most u s e f u l and who are most e n t h u s i a s t i c about t h e i r p o t e n t i a l use, are persons working i n i n d u s t r y . Although, i n one sense, t h i s i n t e r e s t from i n d u s t r y could be i n t e r p r e t e d as an i n d i c a t i o n of the " r e a l - w o r l d " u s e f u l n e s s of the programs, i t came as r a t h e r a s u r p r i s e to both SUMEX and DENDRAL personnel. The f a c t that SUMEX-AIM i s funded by NIH as a n a t i o n a l resource p r o h i b i t s the f a c i l i t y from p r o v i d i n g a s e r v i c e , at taxpayer's expense, to a p r i v a t e i n d u s t r y . Although there i s precedent f o r a s i t e funded v i a government grant to charge a fee f o r s e r v i c e , such an arrangement leads to h i g h l y complicated bookkeeping, and i s contrary to the e s s e n t i a l purpose of SUMEX-AIM; to be a r e s e a r c h - o r i e n t e d r a t h e r than a s e r v i c e - o r i e n t e d f a c i l i t y . This leaves the i n d u s t r i a l users i n the p o s i t i o n of being more than w i l l i n g to pay f o r the use of the programs, but of having no mechanism whereby they can be charged. Furthermore, the f a c t that the programs are coded i n LISP f o r a h i g h l y s p e c i a l i z e d environment, almost guarantees the i m p o s s i b i l i t y of export, except to an almost i d e n t i c a l computer system. An intermediate s o l u t i o n that w i l l help to s o l v e the problem of i n d u s t r i a l users on SUMEX and w i l l help to a l l e v i a t e the system l o a d i n g r e s u l t i n g from heavy usage of LISP coded p r o d u c t i o n programs, i s to mount CONGEN on a c l o s e l y r e l a t e d computer which i s operated on a fee f o r s e r v i c e b a s i s . However, i n order to make t h i s program a v a i l a b l e at a reasonable f e e , i t has become evident that i t w i l l be necessary to recode the LISP s e c t i o n s of the p r o gram i n t o a more e f f i c i e n t and e a s i l y exportable language. Research-oriented

Problems

Community mindedness. Those i n v o l v e d i n computer s c i e n c e research at SUMEX face a general problem which i s absent or g r e a t l y lessened at non-network s i t e s ; the problem of community mindedness. The network provides a l a r g e and v a r i e d s e t of other researchers and users who have an i n t e r e s t i n t h e i r work. Although the network-TENEX combination provides new forms of communication w i t h these remote p a r t i e s , the t r a d i t i o n a l means of f u l l y d e s c r i b i n g the use and s t r u c t u r e of a complex program, a d e t a i l e d personto-person d i s c u s s i o n , i s not convenient. Comprehensive documentat i o n gains importance i n such a s i t u a t i o n , and w i t h i n the DENDRAL p r o j e c t a great d e a l of time has been needed i n the development of program d e s c r i p t i o n s which are adequate f o r a d i v e r s e audience. A l s o , i n both DENDRAL and MYCIN, e f f o r t has been and i s being d i r e c t e d toward "human e n g i n e e r i n g " i n program design; to p r o v i d e

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

13.

CARHART E T AL.

Collaborative

Research

Community

213

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

the user with commands which a s s i s t him i n using the programs, i n understanding the l o g i c by which the programs reach c e r t a i n d e c i sions and i n communicating questions or comments on the programs' operation to those r e s p o n s i b l e f o r development. Such "housekeeping" tasks can o f t e n be neglected, y e t are q u i t e important i n smoothing i n t e r a c t i o n with the community. Choice of programming language. High l e v e l programming languages which are designed f o r ease of program development are f r e q u e n t l y poor as p r o d u c t i o n - l e v e l languages. This i s because developmental languages f r e e the researcher from a r a f t of programming d e t a i l s , thus a l l o w i n g him to concentrate upon the c e n t r a l l o g i c a l i s s u e s of the problem, but the automatic handling of these d e t a i l s i s seldom optimal. A l s o , because such languages tend to be s p e c i a l i z e d f o r c e r t a i n computers and operating systems, the e x p o r t a t i o n of programs can be a s e r i o u s problem. One s o l u t i o n to these problems i s the recoding of r e s e a r c h - l e v e l programs i n t o more e f f i c i e n t language when f a s t and exportable v e r s i o n s are needed. Networking g r e a t l y eases the problem of e x p o r t a b i l i t y , but can a l s o aggrevate the problem of e f f i c i e n c y . As mentioned i n the previous s e c t i o n , DENDRAL programs, which are undergoing constant development, found a s u b s t a n t i a l number of p r o d u c t i o n - l e v e l u s e r s . Because of the i n e f f i c i e n c i e s of INTERLISP (a 50- to 100-fold improvement i n running time i s not uncommon when an INTERLISP program i s t r a n s l a t e d i n t o FORTRAN), t h i s use adversely impacted the e n t i r e system. Because the DENDRAL programs are q u i t e l a r g e and complex, t h e i r t r a n s l a t i o n i n t o other languages i s i m p r a c t i c a l l y tedious. A p a r t i a l s o l u t i o n to t h i s problem i s provided by the TENEX operating system, which allows some i n t e r f a c e between programs w r i t t e n i n d i f f e r e n t languages. With such intercommunicat i o n , time-consuming segments of an INTERLISP program which are not undergoing a c t i v e development can be reprogrammed i n another more e f f i c i e n t language. The developmental p a r t s of the program are l e f t i n INTERLISP, where m o d i f i c a t i o n s can e a s i l y be made and t e s t e d . The CONGEN program uses three languages; INTERLISP, FORTRAN and SAIL (26). The SAIL segment was added when a new f e a t u r e , whose implementation was f a i r l y s t r a i g h t f o r w a r d , was included i n CONGEN. Since then, the SAIL p o r t i o n g r a d u a l l y has been taking over some of the more time-consuming tasks. T h i s method allows a balance i n the t r a d e o f f between ease of program development and e f f i c i e n c y of the f i n a l program. Accumulation of expert knowledge i n knowledge-based programs. Just as s t a t i s t i c s - b a s e d programs need to worry about accumulation of l a r g e data bases, knowledge-based programs need to worry about the accumulation of l a r g e amounts of e x p e r t i s e . The performance of these programs i s t i e d d i r e c t l y to the amount of knowledge they have about the task domain — i n a phrase, knowledge i s power. Therefore, one of the goals of a r t i f i c i a l i n t e l l i g e n c e research i s

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

214

COMPUTER

NETWORKING AND

CHEMISTRY

to b u i l d systems that not only perform as w e l l as an expert but that a l s o can accumulate knowledge from s e v e r a l experts. Simple a c c r e t i o n of knowledge i s p o s s i b l e only when the " f a c t s " , or i n f e r e n c e r u l e s , that are being added to the program are e n t i r e l y separate from one another. I t i s unreasonable to expect a body of knowledge to be so w e l l organized that the f a c t s or r u l e s do not o v e r l a p . ( I f i s were so w e l l organized, i t i s u n l i k e l y that an a r t i f i c i a l i n t e l l i g e n c e program would be the best encoding of the problem s o l v e r . ) One way of d e a l i n g with the overlap i s to examine the new r u l e s on an i n d i v i d u a l b a s i s , as they are added to the system i n order to remove the o v e r l a p . T h i s was the s t r a t e g y f o r developing the e a r l y DENDRAL programs. However, i t i s very i n e f f i c i e n t and becomes i n c r e a s i n g l y more d i f f i c u l t as the body of knowledge grows. The problem of removing c o n f l i c t s , or p o t e n t i a l c o n f l i c t s , from o v e r l a p p i n g r u l e s becomes more acute when more that one expert adds new r u l e s to the knowledge base. Of course, the advantages of a l l o w i n g s e v e r a l experts to "teach" the system are enormous — not only i s the program's breadth of knowledge p o t e n t i a l l y greater than that of a s i n g l e expert, but the r u l e s are more apt to be r e f i n e d when looked at by s e v e r a l experts. On the other hand, one can expect not only a greater volume of new r u l e s but a higher percentage of c o n f l i c t s when s e v e r a l experts are adding r u l e s . Having a computer program that can accumulate knowledge presupposes having an o r g a n i z a t i o n of the program and i t s knowledge base that allows accumulation. I f the knowledge i s b u i l t i n t o the program as sequences of l o w - l e v e l program statements — as o f t e n happens — then changing the program becomes impossible. Thus, current a r t i f i c i a l i n t e l l i g e n c e research s t r e s s e s the importance of s e p a r a t i n g problem-solving knowledge from the c o n t r o l s t r u c t u r e of the program that uses that knowledge. Another problem, at a p o l i t i c a l r a t h e r than a programming l e v e l , becomes apparent with one accumulation process: how does the program d i s t i n g u i s h an expert from a novice? In the MYCIN p r o gram we have circumvented the problem by having the program ask the current user f o r a keyword that would i d e n t i f y him as an expert. I t i s then a b u r e a u c r a t i c d e c i s i o n as to which users are given that keyword. There i s nothing s u b t l e i n t h i s s o l u t i o n , and one can imagine f a r b e t t e r schemes f o r accomplishing the same t h i n g . The p o i n t here i s that not every user should have the p r i v i l e g e of changing r u l e s that experts have added to the system, and that some safeguards must be implemented. "Human nature" b a r r i e r s to SUMEX use Countering d i s b e l i e f . There i s sometimes a tendency among those u n f a m i l i a r with the c a p a b i l i t i e s and l i m i t a t i o n s of computers and computer programs to express d i s b e l i e f . T h i s i s not d i s b e l i e f i n the sense of worrying that the programs have e r r o r s and produce erroneous r e s u l t s . Indeed, the f a c t that a problem i s being done

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

13.

CARHART

E T

AL.

Collaborative

Research

Community

215

by a computer seems to generate some f a i t h that i t might be r i g h t , or a t l e a s t s i g n i f i c a n t l y reduces questions about c o r r e c t n e s s . The d i s b e l i e f i s that programs, which are designed to model, or to emulate, human problem s o l v i n g w i l l not be capable of u s e f u l performance. T h i s , of course, i s the c l a s s i c argument a g a i n s t a r t i f i c i a l i n t e l l i g e n c e - we t h i n k i n mysterious ways and have such a complex b r a i n that a computer program must be i n f e r i o r . In some cases, authors of a r t i f i c i a l i n t e l l i g e n c e programs have brought such c r i t i c i s m upon themselves by not s t r e s s i n g l i m i t a t i o n s , or by making extravagant c l a i m s . In the DENDRAL p r o j e c t , we have t r i e d to counter t h i s type of d i s b e l i e f i n a number of ways. We have t r i e d to s t r e s s that our programs are designed to a s s i s t , not r e p l a c e chemists. We have always d i s c u s s e d l i m i t a t i o n s to give a reasonable p e r s p e c t i v e on c a p a b i l i t i e s v s . l i m i t a t i o n s of a program. Most importantly however, we have focused on those aspects of problems which are amenable to systematic a n a l y s i s , j i . e . , those problems which can be done manually, but only with d i f f i c u l t y and w i t h the consumption of a great d e a l of time which a chemist could b e t t e r spend on more productive p u r s u i t s . Examples of t h i s would i n c l u d e the a p p l i c a t i o n of PLANNER to mixtures where a l l fragmentations may have to be considered as p o s s i b l e fragments of every molecular i o n , the systematic a n a l y s i s by INTSUM of p o s s i b l e fragmentation processes, the c o n s i d e r a t i o n by MOLION of a l l p l a u s i b l e p o s s i b i l i t i e s , and the s t r u c t u r e generation c a p a b i l i t i e s of CONGEN. We have a l s o t r i e d to reduce chemists' d i s b e l i e f by b l u r r i n g the " o u t s i d e r - i n s i d e r " d i s t i n c t i o n , i n p a r t i c u l a r by having t r a i n e d chemists work on the programs and make them u s e f u l to themselves first. F u r t h e r , when " o u t s i d e " chemists are f i r s t introduced to the programs, the i n t r o d u c t i o n i s done by another chemist who has already thought through and can r e a d i l y e x p l a i n many of the c h e m i s t r y - r e l a t e d problems. The u l t i m a t e way to counter d i s b e l i e f , however, i s to i l l u s t r a t e h i g h l e v e l s of performance. I f a p o t e n t i a l user i s aware of the goals ( i n t e n t ) of a program and i t s l i m i t a t i o n s , a few examples of r e s u l t s which would be extremely d i f f i c u l t to o b t a i n without the program are very c o n v i n c i n g . The " s e c u r i t y " of a l o c a l f a c i l i t y . Networking i s s t i l l a r e l a t i v e l y new concept to many people, and there i s a r e s i s t a n c e to departing from the " t r a d i t i o n a l " modes of computing. There i s a sense of s e c u r i t y i n having a l o c a l computing f a c i l i t y w i t h knowledgeable c o n s u l t a n t s w i t h i n walking d i s t a n c e , and i n having "hard" forms of input (e^j*., boxes of computer cards) and output (e.£., voluminous l i s t i n g s ) . These props are d i f f i c u l t to simul a t e over a network connection - i n most cases a user ' s i n t e r a c t i o n with the remote s i t e takes p l a c e e x c l u s i v e l y through a computer terminal - yet the q u a l i t y of s e r v i c e can match or exceed that of a l o c a l f a c i l i t y ; programs and l a r g e data s e t s can be entered and stored on secondary storage as can l a r g e output f i l e s ; a l l types

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

216

COMPUTER

NETWORKING AND

CHEMISTRY

of program and data e d i t i n g can be done w i t h i n t e r a c t i v e e d i t i n g programs; programs can be w r i t t e n i n an i n t e r a c t i v e mode so that small amounts of c o n t r o l i n f o r m a t i o n can be input and key r e s u l t s output i n " r e a l time" over the t e r m i n a l ; and as noted i n a p r e vious s e c t i o n , c o n s u l t a t i o n can be s i g n i f i c a n t l y more productive p r o v i d i n g that the remote operating systems supports the a p p r o p r i ate types of communication p o s s i b i l i t i e s . There can, of course, be no denying that there are problems i n l e a r n i n g to use a d i s t a n t computer system, be i t f o r program development or f o r the use of c e r t a i n programs. Whether or not overcoming these problems to gain access to the s p e c i a l resources which are a v a i l a b l e , i s worth the e f f o r t , i s a question answerable only by the i n d i v i d u a l s i n v o l v e d . F o r t u n a t e l y , there w i l l always be those persons who have a p r e s s i n g problem i n need of s o l u t i o n and who are w i l l i n g to t r y a new approach; r e g a r d l e s s of whether or not they have had p r i o r network experience. IV.

THE

SUMEX-AIM FACILITY

The SUMEX-AIM computer f a c i l i t y c o n s i s t s of a D i g i t a l Equipment Corporation model KI-10 c e n t r a l processor o p e r a t i n g under the TENEX time s h a r i n g monitor. I t i s l o c a t e d at Stanford U n i v e r s i t y Medical Center, Stanford, C a l i f o r n i a . The system has 256K words (36 b i t ) of high speed memory; 1.6 m i l l i o n words of swapping storage; 70 m i l l i o n words of d i s k s t o r age; two 9-track, 800 b p i i n d u s t r y tape u n i t s ; one dual DEC-tape u n i t ; a l i n e p r i n t e r ; and communications network i n t e r f a c e s providing user t e r m i n a l access v i a both TYMNET and ARPANET. Software support has evolved, and w i l l continue to evolve, based on user research goals and requirements. Major user l a n guages c u r r e n t l y i n c l u d e INTERLISP, SAIL, FORTRAN-10, BLISS-10, BASIC and MACRO-10. Major software packages a v a i l a b l e i n c l u d e OMNIGRAPH, f o r graphics support of m u l t i p l e t e r m i n a l types, and MLAB, f o r mathematical modelling. The SUMEX-AIM computer g e n e r a l l y i s l e f t w i t h no operator i n attendance; thereby h e l p i n g to e l i m i n a t e some overhead, but a l s o c r e a t i n g some problems. Users who wish to run jobs r e q u i r i n g tapes must make arrangements to mount t h e i r own tapes. Likewise, o b t a i n i n g l i s t i n g s from the l i n e p r i n t e r can be somewhat d i f f i c u l t s i n c e there i s no r e g u l a r schedule f o r d i s t r i b u t i o n of t h i s output. The s o l u t i o n to these two problems has been to make keys to the machine room a v a i l a b l e at s t r a t e g i c l o c a t i o n s , convenient to a l l groups of l o c a l u s e r s . T h i s experiment i n b a s i c "resource s h a r i n g " has not r e s u l t e d i n any of the major problems one might expect from having a f a i r l y l a r g e group of people with hands-on access to a computer.

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

13. carhart et al.

Collaborative Research Community

217

Acknowledgements The SUMEX-AIM facility is supported by the National Institutes of Health (grant number RR 00785-02), as is the DENDRAL project (grant number RR 00612-05A1). We wish to thank Robert Engelmore, Mark Stefik, Janice Aikens and Peter Friedland for their contributions to this report and Tom Rindfleisch and William White for valuable discussions.

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

Literature Cited (1) Gordon, R. Μ., Datamation(1975), 21, (2), 127. (2) "World List of Crystallographic Computer Programs", Second Edition, D. P. Shoemaker, Ed., Bronder-Offset, Rotterdam, 1966. (3) Professor Joshua Lederberg, Principal Investigator. (4) Lederberg, J., Sutherland, G. L . , Buchanan, B. G., Feigenbaum, Ε. Α., Robertson, Α. V., Duffield, A. M. and Djerassi, C., J. Amer. Chem. Soc. (1969), 91, 2973. (5) Duffield, A. M., Robertson, Α. V., Djerassi, C., Buchanan, B. G., Sutherland, G. L . , Feigenbaum, E. A, and Lederberg, J., J. Amer. Chem. Soc. (1969), 91, 2977. (6) Buchanan, B. G., Duffield, A. M. and Robertson, Α. V., "Mass Spectrometry: Techniques and Applications," G. W. A. Milne, Ed., p. 121, John Wiley and Sons, New York, 1971. (7) Dromey, R. G., unpublished results, preprint available on request, Dept. of Computer Science, Serra House, Stanford University, Stanford, Calif. 94305. (8) Biller, J. E. and Biemann, Κ., Anal. Lett. (1974), 7, 515. (9) Several libraries of mass spectral data are available in various forms. The Aldermaston Data Centre (Mass Spectrometry Data Center, AWRE, Aldermaston, Reading RG7 4PR, England) can provide information on the availability of such libraries. (10) Hertz, H. S., Hites, R. A. and Biemann, K., Anal. Chem. (1971), 43, 681. (11) Dromey, R. G., Buchanan, B. G., Smith, D. H., Lederberg, J. and Djerassi, C., J. Org. Chem. (1975), 40, 770. (12) Smith, D. H., Buchanan, B. G., Engelmore, R. S., Duffield, A. M., Yeo, Α., Feigenbaum, Ε. Α., Lederberg, J. and Djerassi, C., J. Amer. Chem. Soc. (1972), 94, 5962. (13) Smith, D. H., Buchanan, B. G., Engelmore, R. S., Aldercreutz, H. and Djerassi, C., J. Amer. Chem. Soc. (1973), 95, 6078. (14) Smith, D. H. and Carhart, R. Ε., Abstracts, 169th Meeting of the American Chemical Society, Philadelphia, April 6-11, 1975. (15) Carhart, R. E . , Smith, D. H., Brown, H. and Djerassi, C., J. Amer. Chem. Soc., submitted for publication. (16) Masinter, L. M., Sridharan, N. S., Lederberg, J. and Smith, D. H., J. Amer. Chem. Soc. (1974). 96, 7702. (17) Masinter, L. M., Sridharan, N. S., Carhart, R. E . , and Smith, D. H., J. Amer. Chem. Soc. (1974), 96, 7714. (18) Brown, H., SIAM Journal on Computing, submitted for

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.

Downloaded by AUBURN UNIV on December 27, 2017 | http://pubs.acs.org Publication Date: June 1, 1975 | doi: 10.1021/bk-1975-0019.ch013

218

COMPUTER NETWORKING AND CHEMISTRY

publication. (19) Wipke, W. T. and Dyott, T. M., J . Amer. Chem. Soc. (1974), 96, 4825. (20) Smith, D. H., Buchanan, B. G., White, W. C., Feigenbaum, E. Α., Lederberg, J. and Djerassi, C., Tetrahedron(1973), 29, 3117. (21) Buchanan, B. G., to appear in the Proceedings of the NATO Advanced Study Institute on Computer Oriented Learning Processes, 1974, Bonas, France. (22) Carhart, R. E . , Smith, D. H. and Brown, H., J. Chem. Inf. Comp. Sci., in press (May, 1975). (23) Smith, D. H., Anal. Chem., in press (May, 1975). (24) Teitelman, W., "INTERLISP Reference Manual," Xerox Corp. (Palo Alto Research Center), Palo Alto, Calif., 1974. (25) Bobrow, D. G., Burchfiel, J. D. and Tomlinson, R. S., Commun. ACM(1972), 15, (3), 135. (26) VanLehn, Κ. Α., "SAIL User Manual," Stanford Artificial Intelligence Laboratory, Stanford, Calif., 1973.

Lykos; Computer Networking and Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1975.