Computer-Aided Selection of Novel Antitumor Drugs for Animal

Nov 28, 1979 - An earlier paper (1) described a method for estimating the probability of activity of compounds. In that report the method was applied ...
26

Computer-Aided Selection of Novel Antitumor Drugs for Animal Screening

LOUIS HODES
National Cancer Institute, 8300 Colesville Road, Silver Spring, MD 20910

An earlier paper (1) described a method for estimating the probability of activity of compounds. In that report the method was applied to a small set of data for the purpose of comparing the performance to that of more sophisticated pattern recognition methods. The intended application was the large volume of data from the antitumor screening program of the National Cancer Institute (NCI).

Early in 1976 a new panel of animal models for screening was adopted. These include mouse colon, breast, and lung tumors, and corresponding human tumor xenografts in athymic mice. Most compounds are required to show activity in an in vivo pre-screen in order to receive the more extensive testing. P388 mouse lymphocytic leukemia was chosen as the pre-screen and, thus, P388 became the logical choice for a trial of the statistical-heuristic method. There were already substantial amounts of P388 data from earlier tests, and roughly 15,000 compounds a year are currently being screened in P388.

From these data two trials of the method were designed to approximate its use in the operating environment. These are described later as experiments 1 and 2. A series of enhancements followed. Some are reported on here, while others are still being worked on.

There are two aspects of this method in the selection of compounds for antitumor screening. The first can be considered an enrichment in the yield of actives. For example, suppose we can achieve an enrichment factor of 1.2 by taking arbitrarily just those compounds scoring in the top 50%. In that case, instead of screening 15,000 compounds, NCI can obtain the same number of P388 actives by screening 12,500 compounds which were derived from an initial evaluation of 25,000 compounds by this computer method.
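The arithmetic behind this example can be written out as a short sketch. The function name and parameters below are illustrative assumptions, not part of the NCI system; the only inputs taken from the text are the 15,000 compounds, the enrichment factor of 1.2, and the top-50% cutoff.

```python
def compounds_to_screen(baseline_screened, enrichment, selected_fraction):
    """Return (screened, evaluated) needed to match the baseline number of actives.

    baseline_screened: compounds screened per year without preselection
    enrichment:        assumed ratio of the yield of actives after selection
                       to the yield before selection
    selected_fraction: fraction of evaluated compounds passed on to screening
    """
    # With a higher yield, proportionally fewer compounds give the same actives.
    screened = round(baseline_screened / enrichment)
    # Only selected_fraction of the evaluated compounds are screened.
    evaluated = round(screened / selected_fraction)
    return screened, evaluated


screened, evaluated = compounds_to_screen(15_000, 1.2, 0.5)
print(screened, evaluated)
```

With the figures quoted above, this reproduces the 12,500 compounds to be screened out of 25,000 evaluated.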
This chapter not subject to U.S. copyright. Published 1979 American Chemical Society

Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.

Such a model is envisioned as part of a preselection module attached at the beginning of the Drug Research and Development automated chemical information system. This module will evaluate chemical structures prior to acquisition. It will include the currently performed search for duplicates, as well as a comparison with an extensive list of skeleton structures to detect analogs. A comprehensive report would then be examined by the chemist responsible for acquisition, who will decide whether to acquire the compound and, if so, whether special consideration or testing is called for.

The second aspect of the method is selection by surveillance. This would involve running even larger numbers of compounds through the model, perhaps all new Chemical Abstracts registrations. Under these conditions, only the high scoring compounds would be reviewed manually. In contrast, under the first aspect, even the lowest scoring can be reviewed.

The main requirement under both aspects is the ability to detect new active classes of compounds. Therefore, the first enhancement of the method was the elimination of familiar drugs from the training set, following the original version of the method (2). It will be seen that the resulting model gives higher scores to compounds with relatively unfamiliar structures which have appeared in active P388 tests. These compounds are more in demand for further development than are analogs of known drugs.

Also, certain other properties of the method point it toward novel structures. Since each feature is treated independently, new combinations of features which have not appeared together before are more likely to be selected in this method than in other more elaborate methods. Moreover, as will be seen later, a modification of the method gives more weight to lower incidence features. No feature is discarded because of low incidence alone.
In addition, a separate program prints out, for each candidate compound, its lowest incidence feature in P388 testing as well as all features not yet having occurred in P388 testing.

It must be emphasized that although this method runs on a computer, it is not designed to automatically pass or reject compounds. Rather, it is proposed as a tool to aid the medicinal chemist in selecting compounds. Although its assumption of feature independence is a strong limitation, the unbiased use of much data should make the scores useful.

Review of the Method. The method was described previously (1) so only a brief review will be attempted here. Some modifications are discussed in later sections of this paper.

A model is based on a collection of compounds of known activity, called the training set. A set of features pertaining to these compounds is produced; we use molecular substructure descriptors. These features must, of course, be generally relevant to activity for the method to work.

An activity weight is assigned to each feature independently. This weight is based on circumstantial evidence, that is, on how often the feature occurs in the active compounds of the training set relative to how often it is expected to occur assuming that the feature is not relevant to activity. In computing the expected incidence, we use as reference the incidence of the feature in the entire Drug Research and Development collection. The weight is expressed as a number of standard deviations using the statistical distribution derived from considering the active compounds as a random selection from the file. This procedure compensates for the wide range of incidences among structure features.

The features currently are those routinely generated as keys for the substructure inquiry system (3). This system incorporates an open-ended feature set as opposed to a dictionary. The main types of keys are the augmented atom (AA), the ganglia augmented atom (GAA), the ring key and two kinds of nucleus key, in addition to individual element keys. The number of features appearing in the P388 data is roughly 2000 without the GAA keys, increasing to roughly 8000 when the GAA keys are substituted for the AA keys. Other features being considered are keys used in the BASIC search system (4) and an algorithmic approximation to functional groups. The different classes of features will be covered later.

An activity score is produced for a compound of presumably unknown activity by adding the weights of the features it presents. The score is not intended to estimate the strength of activity, but only some measure of the likelihood that the compound is active.

Experiments with a Large Data Set. Several experiments were performed to establish the feasibility of this method on large amounts of data.
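The weighting and scoring scheme reviewed under Review of the Method can be illustrated with a minimal sketch. It assumes the binomial form of the standard deviation used in the original version of the method; the function names, the dictionary of weights, and the example key names are hypothetical, and giving zero weight to a key absent from the model is an assumption made here for simplicity.

```python
import math


def feature_weight(actives_with_feature, n_actives, file_with_feature, file_size):
    """Weight of one feature, in standard deviations from its expected incidence.

    The expected incidence among the actives assumes the actives are a random
    selection from the whole file (binomial model).
    """
    p = file_with_feature / file_size            # incidence in the reference file
    expected = n_actives * p                     # expected count among the actives
    sd = math.sqrt(n_actives * p * (1.0 - p))    # binomial standard deviation
    return (actives_with_feature - expected) / sd


def activity_score(compound_features, weights):
    """Score a compound by adding the weights of the features it presents.

    Features missing from the model contribute nothing (an assumption).
    """
    return sum(weights.get(feature, 0.0) for feature in compound_features)


# Hypothetical example: a key present in 10% of the file but in 19 of 100 actives.
weights = {"key_a": feature_weight(19, 100, 50, 500),
           "key_b": feature_weight(5, 100, 50, 500)}
print(weights["key_a"], weights["key_b"])
print(activity_score({"key_a", "key_b", "unseen_key"}, weights))
```

Because each weight is computed independently, a candidate compound can be scored from any combination of keys, including combinations that never occurred together in the training set.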
These experiments required a few searches of the Drug Research and Development biological file to find compounds that met various criteria as a result of P388 testing. These criteria are usually stated in terms of the T/C ratio, which is the median life-span of the treated animals divided by that of the controls.

The output of each biological search is a list of NSC numbers. These numbers are assigned successively to each compound upon registration of that compound in the chemical information system (CIS). This occurs upon acceptance of the compound for screening. NSC number 1 was assigned in 1955 and current (1979) NSC numbers are over 300,000. The results of the biological search are then sent to the CIS for extraction of the structure keys to be used as features.

This section describes the experiments that were performed before the well known classes were removed. They showed that the large number of compounds and features could easily be handled. It will be seen later that removing the classes of compounds led to some basic changes in the model. Also, it will be interesting to compare the results of the modified model.

Preliminary Experiments. In November, 1975 a search of the biological files produced a list of active and a list of inactive compounds with respect to P388 according to the following criteria.

The actives were picked from a file of roughly 6000 selected agents and amounted to 489 compounds with a manually assigned code indicating significant activity in P388 (this was based on a confirmed T/C of 175% or greater). The inactive compounds, numbering 4174, were collected from two sources. (1) Compounds from the selected agents file which had been assigned a manual code signifying negative in P388. (2) Compounds within the latest 100,000 NSC number range which had definitive P388 testing with T/C always less than 175%. Definitive testing requires a regimen of at least three dose levels with the bottom two non-toxic. Toxicity is defined as a T/C less than 85% or less than 2/3 survivors on initial toxicity day. An additional weight loss criterion was not used.

The preliminary experiments used these compounds mostly as training sets. At first, every tenth compound in the NSC sequence beginning with the first was selected to be included in a test set, leaving the remaining 90% as the training set. A second cut consisted of every tenth compound starting with the second. The actives and inactives were cut separately to equalize their relative incidence in the test and training sets. Five such cuts were run.

In summary, these five cuts contained 236 active compounds. (Of the 489 active compounds, several were disqualified because they lacked chemical structure keys, so that 236 are about half of those remaining.)
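The cutting procedure just described (every tenth compound in NSC order, with actives and inactives cut separately) might be sketched as follows. The function names and the made-up NSC numbers are illustrative only; the original program is not reproduced here.

```python
def tenth_cut(nsc_numbers, offset):
    """Split NSC numbers into (training, test), taking every tenth one as test.

    offset 0 starts with the first compound in NSC order, offset 1 with the
    second, and so on.
    """
    ordered = sorted(nsc_numbers)
    test = ordered[offset::10]
    test_set = set(test)
    training = [nsc for nsc in ordered if nsc not in test_set]
    return training, test


def stratified_cut(actives, inactives, offset):
    """Cut actives and inactives separately so both sets keep the same mix."""
    active_train, active_test = tenth_cut(actives, offset)
    inactive_train, inactive_test = tenth_cut(inactives, offset)
    return active_train + inactive_train, active_test + inactive_test


# Made-up NSC numbers: 40 actives and 400 inactives.
actives = range(1, 41)
inactives = range(1001, 1401)
training, test = stratified_cut(actives, inactives, offset=0)
print(len(training), len(test))
```

Each cut leaves about 90% of the compounds for training, and successive offsets give disjoint test sets.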
Only 5 of the 236 compounds ranked in the lower 50% of the activity scores in their respective cuts. This remarkable performance is due in large measure to heavily populated classes of similar compounds occurring in the file of selected agents. Nevertheless, these results were considered sufficiently encouraging to try data closer to current input.

This trial also revealed properties of the method as applied to the P388 prescreen that warrant further attention. First, the inactivity score was not useful, so that the results were equally good when it was ignored. Second, conspicuous in the small number of low scoring P388 active compounds were 5-fluorouracil, platinums and ellipticines, since they are of great interest in the treatment of human cancer. Also, these failures represented the three main categories of failure. The platinum scores increased upon better data as described soon. More powerful features are required for compounds like 5-fluorouracil, whereas ellipticines are more intractable since there are many inactives with the same basic structure. All examples of these three compounds were later removed as analogs of well known classes.

Test Set 1. A more realistic trial was the selection, as a test set, of compounds as they had been input into the program. The 5,000 compounds from NSC number 260,001 through 265,000 were selected as having a fair amount of P388 testing at the time (October, 1976). These were categorized by manual review of the screening data summaries as follows according to their activity in P388.

'A': T/C at least 175% in two separate screening tests.
'C': T/C at least 120% in two separate screening tests but not in category A. Two tests in P388 with T/C of 120% or greater is sufficient to be considered a candidate for testing in the tumor panel.
'D': T/C at least 120% in one screening test and not yet retested.
'E': T/C at least 120% in one screening test and T/C less than 120% in a separate screening test and not in category C or A.
'N': T/C less than 120% in all screening tests. Must have a test with at least three dose levels, the bottom two non-toxic.
'M': insufficient testing in P388.
'T': not tested at all in P388.

The Training Set. There were reasons to expect that performance on the C's could be improved by beginning again with a new training set. First, there were a few false negative platinum compounds, and it was noticed that the platinum key was not weighted highly simply because of a time lag in assigning compounds to the selected agents file with the appropriate manual code. Second, it was felt that a training set containing C's as well as A's should perform better on incoming selections which would yield T/C between 120% and 175%. Third, a search of the entire file rather than the most recent 100,000 compounds, and a T/C cutoff of 120%, might make the inactive training set more useful.

Thus, a more comprehensive training set was collected from the biological file. The search took place early in 1977 and covered NSC numbers 1 through 260,000.
NSC number 260,000 corresponds chronologically to near the end of 1975, so biological testing should have been fairly complete. Five lists of NSC numbers were collected according to degree of activity in P388. The search of the biological file was specified according to the categories A, C, D, E and N established earlier. From the truncated file of 260,000 compounds, category A yielded 880 compounds, category C yielded 1916 compounds, category D yielded 1402 compounds, category E yielded 1787 compounds, and category N yielded 15524 compounds, for a training set total of 21509 compounds in P388.

Experiment 1. The test set for this experiment consisted of the 5000 compounds from NSC number 260001 to 265000, of which 2322 satisfied the biological testing and chemical structure data requirements. Various combinations of weights and scores derived from the sets of A's, C's and N's of the training set were tried with the following results.

First, the 880 A's alone were used as the training set and the 2322 compounds were scored and ranked. As shown in Table I, all 26 A's scored in the upper half but there were 11 C's out of 85 in the lower half.


Table I. Ranking of Actives in Test Set 1. Training set A.

RANK        NUMBER OF      CUM.       NUMBER OF      CUM.
(Deciles)   A COMPOUNDS    PERCENT    C COMPOUNDS    PERCENT
10          18             69         41             48
9           5              88         19             71
8           1              92         5              76
7           2              100        3              80
6           0                         6              87
5           0                         5              93
4           0                         1              94
3           0                         4              99
2           0                         1              100
1           0                         0
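The cumulative percentages in a decile table of this kind follow directly from the per-decile counts. As a minimal sketch (the function name is illustrative and not taken from the original program), using the Table I counts listed from decile 10 down to decile 1:

```python
def cumulative_percent(counts):
    """Running percentage of actives captured, from the top decile downward."""
    total = sum(counts)
    running = 0
    result = []
    for count in counts:
        running += count
        # Percentage of all actives found at or above this decile.
        result.append(round(100 * running / total))
    return result


a_counts = [18, 5, 1, 2, 0, 0, 0, 0, 0, 0]   # A compounds, deciles 10..1
c_counts = [41, 19, 5, 3, 6, 5, 1, 4, 1, 0]  # C compounds, deciles 10..1
print(cumulative_percent(a_counts))
print(cumulative_percent(c_counts))
```

The printed table leaves the cumulative column blank for deciles that add no compounds; the sketch simply repeats the previous value there.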

A new training set was formed by combining the A's and C's so that there were now 880 + 1916 = 2796 compounds in the training set. Actually, each of the A compounds was counted as appearing twice in the training set, so the weightings resulted in what may be considered a 2A+C model. Now, 1 A out of 26 in the test set and 7 C's of 85 fell into the lower half. See Table II. Again, the weights derived from the rather large set of inactives did not help the performance.

Table II. Ranking of Actives in Test Set 1. Training set 2A+C.

RANK        NUMBER OF      CUM.       NUMBER OF      CUM.
(Deciles)   A COMPOUNDS    PERCENT    C COMPOUNDS    PERCENT
10          15             58         41             48
9           8              88         20             72
8           0                         8              81
7           2              92         5              87
6           0                         4              92
5           0                         2              94
4           1              100        1              95
3           0                         2              98
2           0                         2              100
1           0                         0
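One way to read the 2A+C model is as a weighted count: each A compound contributes twice to the incidence of each of its keys among the actives, and each C compound once. The sketch below illustrates that bookkeeping only. The function name, the representation of a compound as a set of key strings, the example keys, and the treatment of the effective number of actives as 2A+C are all assumptions made for illustration.

```python
from collections import Counter


def weighted_active_incidence(a_compounds, c_compounds, a_multiplicity=2):
    """Count key incidence among actives, presenting each A several times.

    a_compounds, c_compounds: iterables of key collections, one per compound.
    Returns (incidence per key, effective number of active compounds).
    """
    a_compounds = list(a_compounds)
    c_compounds = list(c_compounds)
    incidence = Counter()
    for keys in a_compounds:
        for key in set(keys):            # presence of a key, not its multiplicity
            incidence[key] += a_multiplicity
    for keys in c_compounds:
        for key in set(keys):
            incidence[key] += 1
    effective_actives = a_multiplicity * len(a_compounds) + len(c_compounds)
    return incidence, effective_actives


# Hypothetical keys for one A compound and two C compounds.
incidence, n_actives = weighted_active_incidence(
    [{"key_1", "key_2"}], [{"key_2"}, {"key_3"}])
print(dict(incidence), n_actives)
```

Raising `a_multiplicity` shifts the model toward the highly active compounds; the same function would cover a model that presents each A more than twice.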

Expressed as an enrichment of the yield of actives, these results show that selecting the upper half under the 2A+C model would have changed the yield of actives from 111 out of 2322, or 4.8%, to 103 out of 1161, or 8.9%.

Experiment 2. An attempt was now made to select a sequence of compounds which seemed relatively free of analogs. The new test set chosen consisted of compounds with NSC numbers 268001 through 272000. This test set, like the earlier one, was rated into biological categories A, C, D, E, N, M, T by manual review of the screening data summaries. Excluding M's and T's and compounds with no chemical structure keys, there were 3239 compounds left in total, of which 32 were A's and 145 were C's. Upon running the same 2A+C model, all 32 A's scored in the upper half and 45 out of 145 C's scored in the lower half. The results are summarized in Table III. Expressed as an enrichment, the yield went from 177 out of 3239, or 5.5%, to 132 out of 1620 for the upper half, or 8.1%. This time all the A's scored in the upper 50% with respect to the 2A+C model.


Table III. Ranking of Actives in Test Set 2. Training set 2A+C.

RANK        NUMBER OF      CUM.       NUMBER OF      CUM.
(Deciles)   A COMPOUNDS    PERCENT    C COMPOUNDS    PERCENT
10          27             84         41             28
9           5              100        24             45
8           0                         15             55
7           0                         13             64
6           0                         7              69
5           0                         10             76
4           0                         10             83
3           0                         11             90
2           0                         7              95
1           0                         9              100
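The enrichment figures quoted for Experiments 1 and 2 can be checked with a few lines of arithmetic. This is a sketch only: the function name is hypothetical, and rounding the upper half up to a whole compound is an assumption chosen to match the 1620 quoted for the 3239 compounds of test set 2.

```python
def upper_half_enrichment(total_compounds, total_actives, actives_in_lower_half):
    """Return (baseline yield, upper-half yield, enrichment factor)."""
    upper_half_size = (total_compounds + 1) // 2      # round up for odd totals
    baseline_yield = total_actives / total_compounds
    upper_yield = (total_actives - actives_in_lower_half) / upper_half_size
    return baseline_yield, upper_yield, upper_yield / baseline_yield


# Counts taken from the text: A's plus C's in each test set and in its lower half.
experiments = {
    "Experiment 1": (2322, 26 + 85, 1 + 7),
    "Experiment 2": (3239, 32 + 145, 0 + 45),
}
for label, args in experiments.items():
    base, upper, factor = upper_half_enrichment(*args)
    print(f"{label}: {base:.1%} -> {upper:.1%} (factor {factor:.2f})")
```

The yields agree with the percentages given in the text; the enrichment factor itself is not quoted in the text and is shown only as a derived quantity.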

Removal of Analogs. At this point the model could be considered extremely successful from the point of view of enrichment of the yield of actives. But its performance depends to a large extent on the continued testing of analogs of well known active compounds. The main purpose of P388 testing and also of the model remains the discovery of new active classes. The proficiency of the model toward this end must be enhanced and its effectiveness measured.

Analogs of the well known active compounds, which will now be referred to simply as analogs, make up a large part of the training set. More than 85% of the highly active A's and over one third of the C's belong to these classes. Therefore, any new training set formed without them should produce substantially different results.

It is clear that the new model will not perform as well on any further analogs of those removed. Many of the previously most highly weighted features will now be decreased in weight or else be completely absent from the model. Hence, the new model will tend to show decreased predictability on a test set to the extent that it contains analogs. Recall that a provision for detecting analogs is included in the selection process.

On the other hand, the presence of large numbers of analogs in the training set can degrade performance of the model on those compounds which may provide new leads. This occurs when a known active group such as an alkylating function is combined with otherwise inactive moieties, producing an active compound. The weights of all its keys are raised. When such keys occur in inactive compounds they can raise their score sufficiently to yield many false positives. False positives, of course, will lower the relative rank of the true positives which score below them.

Removal of analogs from the training set was first performed on the highly active A's by Ken Paull, then with Starks C. P., the acquisitions contractor. He presented Table IV, which classifies analogs among the 846 A's with defined structure. Familiar alkylating agents, which include mustards, methanesulfonates, epoxides, aziridines and nitrosoureas, make up 52% of the 846. Other, more specialized classes reduced the A's further until they were cut by 87%.


Table IV. Active Classes in A Training Set. 846 Structures with P388 T/C > 175%.

CLASS                           NUMBER    CUM. PERCENT
Alkylating Agents               440       52%
Anthracyclines                  50
Wander                          50        64%
Antifols                        32
Actinomycins                    23        70%
Nucleosides                     22
9-Aminoacridine                 20
Platinum                        19
Planar Quaternary Polycyclics   18
Trityl cysteines                16        82%
Quinoliniums                    14
Thiosemicarbazones              9
Hydrazonoyl halides             9
Thioxanthenones                 8
Diazo compounds                 7         87%

The reduction on the C's was less drastic, the number of compounds going from 1916 to 1182. Several additional classes were removed as shown in Table V. The selection was performed by Ken Paull in coordination with Mike Hazard and Bob Ing of NCI.

Table V. Additional Active Classes Removed from Training Sets.

Bruceantin       Podophyllotoxin
Halacanthane     Phenazine
Colchicine       Cycloleucine
Antimony         Mithramycin
Hydroxyurea      Charged Nitrogen
Phosphonium      Sulfonium
Triazeno         Ellipticine
Emetine          Vincristine
Methyl GAG       Styryl Quinoline

Of course, many compounds in most of these active classes tested negative in P388. A great many others were not tested at all in P388. Ideally, one would desire a new file with all these compounds eliminated. Recall that the method uses the incidence of a feature in the entire file as a reference. The analogs in the file would distort the statistics for the same reasons given earlier. However, it was impractical to search all those classes over the entire file. Machine searches would consume too much computer time.

It was practical to remove the analogs from the set of P388 inactives. Machine searches for all the classes were performed by Mike Hazard and Frank Sordyl, reducing the inactives from 15524 to 14357 compounds.

The method must now be revised so that the training set rather than the entire file provides the reference for the expected incidence of a feature among the actives. Thus, a feature would have positive weight if its proportional incidence were greater in the actives than in the inactives.

Correction of the Method. A correction to a more precise model became more imperative when the removal of analogs induced the universe of compounds to be limited to the training set, as depicted in the preceding section. The method as described in (1) assumes, under the null hypothesis, that the active compounds are a random selection of compounds from the file. The probability of feature incidence in the actives was said to follow the binomial distribution. Robert Tarone of NCI advised me as follows.
1) The binomial distribution holds only when the random selection is done with replacement.
2) A more precise model should assume random selection without replacement.
3) Feature incidence in the actives then follows the hypergeometric distribution.
4) The standard deviation, S, is now given by

$$S^2 = \frac{F(T-F)\,N(T-N)}{T^2(T-1)}.$$

Here F is the number of compounds with the feature and T is the total number of compounds, now the number of compounds in the training set. N is the number of actives, so now T-N is the number of inactives.

Under these conditions there is effectively one weight for each feature, since the active and inactive weights become equal in magnitude and opposite in sign. The same holds true for the scores for each predicted compound. Thus, when the training set is the universe of compounds, there is only one activity score. The inactivity score, which had not proved useful anyway, is redundant and can be ignored.

If N is much less than T then the hypergeometric distribution becomes indistinguishable from the binomial distribution, where $S^2 = P(1-P)N$ and P is simply F/T. This condition certainly holds when the entire file is used as a reference. It is even a good approximation when only the training set is used, since the actives are much less numerous than the inactives.

However, the correction was necessary upon consideration of weights for multiple occurrences of features. Here, for example, when computing the weight for two or more occurrences of a feature the above formula for the standard deviation becomes

$$S^2 = \frac{F_2(F-F_2)\,N_1(F-N_1)}{F^2(F-1)}.$$

Here F is, as before, the number of compounds with one or more occurrences of the feature, $F_2$ is the number of compounds with two or more occurrences and $N_1$ is the number of actives with the feature. Now, if the feature happens to be more prominent in the actives than in the inactives, then $N_1$ can be significant compared with F. Thus, in some cases the corrected version will show a significantly different weight. These differences are much more likely when the training set becomes the reference universe.

Modification of the Method. The removal of analogs accentuated a weakness of the method which allowed some highly prevalent features to receive weights of much larger magnitude than seems reasonable. For example, keys like benzene and six-carbon sugars showed incidences in the actives that were between five and ten standard deviations away from their expected incidences. Before the removal of analogs, these numbers were dominated by the higher weighted analog keys such as aziridine, which ranged into the twenties and higher. In the new model the presumably innocuous keys exerted a dominant influence and became more obviously troublesome. It was necessary to modify the method so that such features are automatically deemphasized.

The anomalous effect is due to the fact that a relatively minor deviation from the expected incidence in the actives of a high incidence feature has a large statistical significance. For example, a feature with 60% incidence, which is expected to occur in 1000 out of 1667 actives, has standard deviation approximately 20. Thus, an occurrence in more than 1100 or fewer than 900 actives is more than five standard deviations away from the expected value. The model is based on random selection, but of course the selection of compounds is not really random. The large deviations are due to selection parameters which may or may not be relevant to activity, but which tend to give high incidence features increased effect.

A heuristic for reducing the magnitude of these weights is simply the division of the computed weight of each feature by the logarithm of its incidence. We use the incidence in the training set of about 15000 compounds, after the removal of analogs. To illustrate, we can divide the weight by the logarithm to the base ten of the incidence for every feature with incidence greater than ten, so that no weight is increased. Recall that log 10 = 1. So, for example, a feature that occurs 100 times will have its weight divided by two; if it occurs 1000 times its weight is divided by 3; etc.
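The corrected standard deviation and the incidence heuristic can be combined in a brief sketch. It follows the symbols used in the text (F, T, N), but the function names are hypothetical, and taking the expected incidence among the actives as N·F/T is the standard hypergeometric mean rather than a formula quoted from the text.

```python
import math


def hypergeometric_sd(F, T, N):
    """Standard deviation of feature incidence among N actives drawn from T
    compounds without replacement, F of which carry the feature."""
    return math.sqrt(F * (T - F) * N * (T - N) / (T ** 2 * (T - 1)))


def binomial_sd(F, T, N):
    """Standard deviation under selection with replacement, with P = F/T."""
    p = F / T
    return math.sqrt(p * (1 - p) * N)


def corrected_weight(actives_with_feature, F, T, N):
    """Deviation from the expected incidence N*F/T, in standard deviations."""
    expected = N * F / T
    return (actives_with_feature - expected) / hypergeometric_sd(F, T, N)


def damped_weight(weight, incidence):
    """Divide the weight by log10 of the incidence when the incidence exceeds
    ten, so that no weight is ever increased."""
    if incidence > 10:
        return weight / math.log10(incidence)
    return weight


# The 60% incidence example: about 1000 expected among 1667 actives.
print(binomial_sd(600, 1000, 1667))
# A weight of 6 on a key occurring 100 times, then 1000 times.
print(damped_weight(6.0, 100), damped_weight(6.0, 1000))
```

The hypergeometric value is never larger than the binomial one, and the two agree closely whenever N is small compared with T, which is the condition stated in the text.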
This modification of the weights has the serendipitous effect of emphasizing low incidence features, which helps point the model toward the discovery of new classes.

Results without Analogs.

A new model consisted of weights generated from the training set without analogs. This model required the correction described earlier. Results are to be compared with those from the old model including analogs. A further model incorporated the modification based on incidence. Since the ratio of C's to A's after the removal of analogs increased to more than ten to one, a 5A+C model, presenting each A five times, replaced the 2A+C model used earlier. The five to one ratio was a heuristic compromise; something like ten to one would give too much influence to individual compounds.

Results on the Test Set. Results on test set 2, which contained fewer analogs, are more comparable to the earlier results. The new results are shown in Table VI. Compared with Table III, performance on the A's deteriorates somewhat, but results on the C's are about the same. Overall, the top 50% contained 70% of the actives, compared with 75% earlier.
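The trade-off behind the 5A+C choice can be illustrated with a rough calculation. The sketch below assumes that presenting each A several times simply multiplies its contribution to the pool of active presentations; the counts of 83 A's and 1182 C's are those quoted later in the paper for the training set without analogs, and the function name is hypothetical.

```python
def active_share(n_a: int, n_c: int, a_multiplier: int) -> float:
    """Fraction of the active presentations contributed by A compounds
    when each A is presented a_multiplier times and each C once."""
    weighted_a = a_multiplier * n_a
    return weighted_a / (weighted_a + n_c)

# Training set without analogs: 83 A's and 1182 C's.
N_A, N_C = 83, 1182

# Percentage share of the A's under 2A+C, 5A+C and a hypothetical 10A+C model.
shares = {m: round(100 * active_share(N_A, N_C, m)) for m in (2, 5, 10)}
```

Under these assumptions the A's carry roughly an eighth of the active presentations at 2A+C, about a quarter at 5A+C, and about two fifths at 10A+C, which is one way to read the remark that ten to one gives individual compounds too much influence.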

Table VI. Ranking of Actives in Test Set 2. Training set 5A+C.

RANK         NUMBER OF      CUM.        NUMBER OF      CUM.
(Deciles)    A COMPOUNDS    PERCENT     C COMPOUNDS    PERCENT

   10            19            59           40            28
    9             3            68           15            40
    8             2            75           11            46
    7             1            78           21            60
    6             1            81           12            68

    5             0                         13            77
    4             1            84           10            84
    3             3            94            9            90
    2             2           100           11            98
    1             0                          3           100

The results after the incidence modification are shown in Table VII. Performance on the A compounds is effectively the same. Performance on the C's is improved by the incidence modification, but not significantly.

Table VII. Ranking of Actives in Test Set 2. Training set 5A+C. After Incidence Modification.

RANK         NUMBER OF      CUM.        NUMBER OF      CUM.
(Deciles)    A COMPOUNDS    PERCENT     C COMPOUNDS    PERCENT

   10            18            56           42            29
    9             4            69           17            41
    8             2            75           17            52
    7             2            81           16            63
    6             0                         12            72

    5             0                          9            78
    4             1            84           15            88
    3             3            94            4            91
    2             2           100           10            98
    1             0                          3           100
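The cumulative percentages in Table VII follow directly from the decile counts. As a minimal sketch, the following recomputes the C column from the listed numbers of C compounds; the function name is illustrative, and rounding to the nearest whole percent is an assumption that happens to reproduce the printed column.

```python
from itertools import accumulate

def cumulative_percent(counts):
    """Cumulative percentage of compounds captured, reading from decile 10
    (highest scores) down to decile 1."""
    total = sum(counts)
    return [round(100 * running / total) for running in accumulate(counts)]

# Number of C compounds per decile (10 down to 1), taken from Table VII.
c_counts = [42, 17, 17, 16, 12, 9, 15, 4, 10, 3]
c_cum = cumulative_percent(c_counts)
```

The fifth entry of the result is the share of C compounds captured in the top 50% of the ranking.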

As expected, the removal of analogs lowers the overall performance of the model. However, it is also possible to get a better indication of its performance on novel compounds, as follows. The actives (A+C) in the test sets were examined to flag analogs according to the criteria used in removing them from the training set. There were only twenty active non-analogs in the 2322 compounds of test set 1 and 72 in the 3239 compounds of test set 2. Further, these compounds tend to come in small clusters of similar structures, so the diversity is quite limited, especially in test set 1. The results on the 68 non-analogs that rated C in test set 2 are shown in Table VIII for three versions of the model. The two versions of the 5A+C-analogs model, before and after the incidence modification, did not show much difference, but the difference from the 2A+C model may be significant.

Table VIII. Ranking of Non-Analogs in Test Set 2.

RANK         2A+C MODEL       5A+C MODEL       WITH INCID. MOD.
(Deciles)    NO.    CUM.%     NO.    CUM.%     NO.    CUM.%

   10        11      16       18      26       20      29
    9        10      31        9      40        8      41
    8         8      43        6      49        6      50
    7         7      53        6      57        6      59
    6         5      60        5      65        4      65

    5         4      66        7      75        6      74
    4         8      78        5      82        7      84
    3         6      87        3      87        2      87
    2         3      91        7      97        7      97
    1         6     100        2     100        2     100

The non-analogs do not do quite as well as the total C's in their respective models. At least in this test set, the analogs tend to score fairly high, even when they have been removed from the training set.

The Training Set as a Test Set. It has just been shown that the results on the test set do not provide a completely satisfactory indication of any effect of removing analogs from the training set. However, the training set minus analogs does provide a diverse body of novel compounds on which to compare both models. Since these compounds were used in the construction of the models, the performance should be better than on a new test set; but since they were used in both models, one should obtain a direct indication of any improvement.

The training set of A's, C's and N's minus analogs was combined and run through the same three versions of the model as applied to test set 2. The graph in Figure 1 shows the performance on all the actives, as they rank among the N's. The A's and the C's are graphed separately. Recall that the C's are more than ten times as numerous as the A's. Again, the 5A+C model was run before and after the modification for incidence.

Figure 1 shows a definite improvement on the novel compounds upon the removal of analogs. Although there are two basic differences in the construction of the models, i.e., the A/C ratio and the reference incidence, the results are comparable

Figure 1. Cumulative percent active for the training set without analogs vs. decile. Percentages are plotted at the lower end of each decile and joined linearly. ( ) A's; ( ) C's; (Q) original 2A+C model; (Π) 5A+C-analogs; and (A) 5A+C-analogs with incidence modification.

Figure 2. The branching carbon atom yields exactly the same AA keys in the two structures shown, but different GAA keys (AA keys cannot distinguish the central carbon atom in these structures; GAA keys can).

because each model was suited to the conditions under which it was constructed.

What is most surprising is the effect of the incidence modification. By this change, the features whose weights are diminished are those most prevalent in the training set, so one would think these features would have induced greater separation at their original higher weight, at least on the training set. In order to achieve the improvement shown in Figure 1, this effect must be more than balanced by the effect due to the relatively increased weight of low incidence features. Sufficiently novel compounds will often have one or more low incidence features. It was gratifying to see that this influenced enough compounds to outweigh the loss due to diminished weight of the high incidence features, even on the training set.
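Ranking the actives among the N's, as plotted in Figure 1, amounts to asking what fraction of the inactives score below each active. The sketch below is one possible way to do this; the function names, the handling of ties, and the toy scores are assumptions made for illustration and are not taken from the original program.

```python
from bisect import bisect_left

def decile_among(reference_scores, score):
    """Decile (1 = lowest, 10 = highest) of `score` relative to reference_scores,
    based on how many reference compounds score strictly lower."""
    ordered = sorted(reference_scores)
    below = bisect_left(ordered, score)
    return min(10, below * 10 // len(ordered) + 1)

def decile_histogram(reference_scores, active_scores):
    """Count actives per decile; index 0 holds decile 10, index 9 holds decile 1."""
    counts = [0] * 10
    for s in active_scores:
        counts[10 - decile_among(reference_scores, s)] += 1
    return counts

# Toy data: 100 inactive scores (0..99) and four hypothetical active scores.
inactive_scores = list(range(100))
example = decile_histogram(inactive_scores, [95, 97, 50, 5])
```

The resulting list has the same layout as the NO. columns of the ranking tables, so cumulative percentages can be derived from it directly.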

The GAA Model.

Thus far the features used were as described in (1). These are augmented atom (AA) keys and certain kinds of ring keys. The AA keys consist of a central atom, its bonds, and the neighboring atoms attached to the bonds. All combinations of attachments are permitted. There was also available an alternate set of keys for searching, which consists of features slightly larger than the AA keys. These were called the ganglia augmented atom (GAA) keys. They include the AA key and all the bonds attached to all the atoms. The GAA keys, and not the AA keys, are capable of distinguishing the central carbon atom in Figures 2a and 2b.

Examination of some of the failures in the preceding models indicated that the GAA keys would improve performance, although there were other examples which were less tractable. The easy availability of these features became the determining factor in deciding to use them. To avoid questions of redundancy, the GAA keys were used in place of the AA keys.

The results on test set 2 shown in Table IX should be compared to Tables VII and VIII. There appears to be a significant improvement, especially in the central column, which lists the total C rated compounds.
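The distinction between AA and GAA keys described above can be illustrated with a toy molecular graph. This is a simplified sketch under stated assumptions: hydrogens are implicit, only the complete neighborhood of the central atom is encoded (the sub-combinations of attachments and the ring keys are omitted), and the data layout, function names, and example fragments are hypothetical rather than the NCI key definitions.

```python
from typing import Dict, List, Tuple

Atoms = Dict[int, str]              # atom index -> element symbol (hydrogens implicit)
Bonds = List[Tuple[int, int, int]]  # (atom i, atom j, bond order)

def attachments(bonds: Bonds, atom: int) -> List[Tuple[int, int]]:
    """Return (neighbor index, bond order) pairs for one atom."""
    found = []
    for i, j, order in bonds:
        if i == atom:
            found.append((j, order))
        elif j == atom:
            found.append((i, order))
    return found

def aa_key(atoms: Atoms, bonds: Bonds, center: int):
    """Central atom, its bonds, and the neighboring atoms on those bonds."""
    shell = sorted((order, atoms[n]) for n, order in attachments(bonds, center))
    return (atoms[center], tuple(shell))

def gaa_key(atoms: Atoms, bonds: Bonds, center: int):
    """AA key plus the orders of the further bonds (ganglia) on each neighbor."""
    shell = []
    for n, order in attachments(bonds, center):
        ganglia = tuple(sorted(o for m, o in attachments(bonds, n) if m != center))
        shell.append((order, atoms[n], ganglia))
    return (atoms[center], tuple(sorted(shell)))

# Two hypothetical four-carbon fragments that differ only in one outer bond order.
atoms = {0: "C", 1: "C", 2: "C", 3: "C"}
bonds_a = [(0, 1, 1), (0, 2, 1), (2, 3, 1)]
bonds_b = [(0, 1, 1), (0, 2, 1), (2, 3, 2)]

same_aa = aa_key(atoms, bonds_a, 0) == aa_key(atoms, bonds_b, 0)
same_gaa = gaa_key(atoms, bonds_a, 0) == gaa_key(atoms, bonds_b, 0)
```

For atom 0 the two fragments give identical AA keys, because the immediate neighbors and bonds are the same, while the GAA keys differ because one neighbor carries a double bond in the second fragment.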

Table IX. Ranking of Actives in Test Set 2. GAA Key 5A+C Model with Incidence Modification.

RANK         A COMPOUNDS      C COMPOUNDS      C NON-ANALOGS
(Deciles)    NO.    CUM.%     NO.    CUM.%     NO.    CUM.%

   10        19      58       60      40       20      29
    9         1      61       25      57       14      50
    8         1      64       11      64        4      56
    7         3      73        9      70        3      60
    6         2      79       13      79        6      69

    5         0               11      86        8      81
    4         4      91        7      91        5      88
    3         2      97        5      94        2      91
    2         1     100        4      97        4      97
    1         0                5     100        2     100

The results on the training set itself are compared with the AA models in Figure 3. The GAA model is 5A+C-analogs with incidence modification, as is the best AA model. On the training set, the main improvement lies in performance on the A's.

The Rarest Key Report. Since the emphasis of the screening program remains the testing of novel compounds, a crude approximation to a measure of novelty was attempted. A program was written to simply print out, for each compound, its key which has least incidence in P388 testing and also any new keys, which have not yet appeared in P388. The rarest key also shows significant differences in switching from AA keys to GAA keys, as seen from the example in Figure 4. The rarest AA key has occurred in 107 compounds which have been tested in P388, but the rarest GAA key, with a different central atom, has occurred only once.

Work in Progress.

Operational Model. During the past year an operational model has been installed at Chemical Abstracts Service, where the NCI Chemical Information System is maintained. This model will yield reports on activity score and rarest key beginning April 1, 1979. The installed model is based on an extension of the training set without analogs as described earlier. A search of the biological file, NSC numbers 260001 and above, was performed May, 1978, followed by a machine search to remove analogs similar to the search performed on the earlier training set. The augmented training set has the A's increased from 83 to 115, the C's from

Figure 3. The performance of the GAA model on the same training set is shown in (Q) together with the graphs of Figure 1. Note that the 5A+C-analogs model is now represented by (Φ) instead of (Π).

[Figure 4 labels: Is this compound novel? No: AA key, 107 occurrences. Yes: GAA key, 1 occurrence.]

Figure 4. The rarest key often turns out to be quite different in the AA key set from that of the GAA key set.
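The rarest key report can be sketched as a lookup over an incidence table. In the sketch below the function name, the dictionary layout, and the key labels are hypothetical; only the idea of reporting the least frequent key and any keys not yet seen in P388 testing comes from the text, and the tie-breaking rule is an assumption.

```python
def rarest_key_report(compound_keys, incidence):
    """For one compound, return (rarest_key, its_incidence, new_keys).

    `incidence` maps each key to the number of P388-tested compounds containing it.
    Keys missing from the mapping, or with a count of zero, are reported as new.
    Ties on incidence are broken alphabetically (an assumption of this sketch).
    """
    new_keys = sorted(k for k in compound_keys if incidence.get(k, 0) == 0)
    seen = [k for k in compound_keys if incidence.get(k, 0) > 0]
    if not seen:
        return None, 0, new_keys
    rarest = min(seen, key=lambda k: (incidence[k], k))
    return rarest, incidence[rarest], new_keys

# Hypothetical incidence table and compound; the key names are placeholders.
p388_incidence = {"key_a": 107, "key_b": 5200, "key_c": 1}
report = rarest_key_report({"key_a", "key_b", "key_x"}, p388_incidence)
```

Running the same function once with an AA incidence table and once with a GAA table would give the kind of contrast shown in Figure 4, where the rarest key changes from a fairly common one to one seen only once.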

1182 to 1950 and the N's from 14357 to 33828. The model was run first with AA, then with GAA keys. At this time a new version of the GAA key model was generated; here keys which had ganglia on the central atom were eliminated. Thus only the largest GAA keys were kept. This amounts to a decrease in redundancy which is easier to perform on the GAA than the AA keys.

Performance was again compared by running the extended training set through the respective models. The model using the largest GAA key was decidedly the best, with about 45% of the actives in the top 10%. A more appropriate indication of performance would involve setting aside ten or twenty percent cuts as test sets. This is being worked on now and will be reported soon.

The Test of 1000. An assortment of 1000 potential acquisitions was provided by Starks C. P., the acquisitions contractor, as a test of the model. These compounds were first reviewed by Ken Paull and were assigned, on the basis of their chemical structure only, to one of the following three categories: active, novel, or inactive. Then they were run through the models at CAS. In addition, the entire 1000 have been put into P388 testing.

This experiment serves several purposes. It was undertaken mainly to familiarize Starks C. P. with the model. Also, it shows how unselected compounds rank against the NCI file. If one uses the scale of the training set, then 23-30 of the 1000 fall in the highest decile and 350-500 in the lowest decile, depending on the version of the model.

Another purpose of the test of 1000 is to compare the relative performance of an experienced chemist with the performance of the model, not to see which is better (undoubtedly the chemist) but to see if there are some things the model will catch which are not obvious to the chemist. In fact, review by the chemist of the 30 compounds which scored in the top decile led him to revise his opinion of activity from originally 15 to almost all, i.e., 28 as possibly active. A second chemist, Frank Williams, performed the review. This experiment should be completed in about a year, when the P388 testing is finished. It will be reported at that time.

Other Structure Features. In this type of work, more specific features tend to yield better results, as is evidenced by the use of the GAA keys. With permission from the Basel Information Center for Chemistry (BASIC), we have been experimenting with their keys on our data at CAS. The BASIC keys include linear sequences, which are chains of atoms of length four to six with bonds specified only as to their ring or non-ring character. BASIC has for some time been performing substructure search on the entire CAS file. Thus, BASIC keys are routinely generated in large volume for compounds newly registered at CAS and can be used for surveillance at low cost.
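The linear sequences mentioned above can be sketched as simple paths of four to six atoms in a molecular graph. The following is an illustrative sketch only: the encoding ('-' for a chain bond, '@' for a ring bond), the requirement that ring membership be supplied with each bond, and the function name are assumptions, and the actual BASIC key definitions may differ.

```python
def linear_sequences(atoms, bonds, min_len=4, max_len=6):
    """Enumerate distinct linear sequences: simple paths of min_len..max_len atoms.

    atoms: atom index -> element symbol (hydrogens implicit).
    bonds: (i, j, in_ring) triples; bond order is deliberately ignored.
    A path and its reverse are treated as the same sequence.
    """
    adjacency = {i: [] for i in atoms}
    for i, j, in_ring in bonds:
        mark = "@" if in_ring else "-"
        adjacency[i].append((j, mark))
        adjacency[j].append((i, mark))

    found = set()

    def extend(path, tokens):
        if len(path) >= min_len:
            forward = "".join(tokens)
            backward = "".join(reversed(tokens))
            found.add(min(forward, backward))   # canonical direction
        if len(path) == max_len:
            return
        for nxt, mark in adjacency[path[-1]]:
            if nxt not in path:                 # keep the path simple (no repeats)
                extend(path + [nxt], tokens + [mark, atoms[nxt]])

    for start in atoms:
        extend([start], [atoms[start]])
    return found

# Hypothetical acyclic fragment: C-C-C-C-O.
chain_atoms = {0: "C", 1: "C", 2: "C", 3: "C", 4: "O"}
chain_bonds = [(0, 1, False), (1, 2, False), (2, 3, False), (3, 4, False)]
sequences = linear_sequences(chain_atoms, chain_bonds)
```

The function returns distinct sequence types rather than occurrence counts; counting occurrences per compound would be a straightforward extension.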

Performance on test set 2 is shown in Table X. The A's and C's are combined in these statistics for a comparison of performance on all the actives. So far the BASIC keys are not as discriminating as the NCI keys. Improvement was expected from the use of conditional probabilities among linear sequences to cut down redundancy, but early results show no such improvement. This is being worked on by Paul Blower of CAS, who is also investigating the use of co-occurrences.

Table X. Cumulative Percent Active in Test Set 2. BASIC and NCI models.

DECILE     BASIC     NCI-AA     NCI-GAA

  10       30.5      33.9       43.5
   9       40.1      45.8       58.7

Another set of structure features is given by an algorithmic definition of a functional group. This is essentially any connected subset of atoms which does not contain a carbon-carbon single bond or a carbon-carbon ring alternating bond. A program for generating such groups has been presented to the author by Paul Blower and this will be worked on soon. An intriguing combination would be linear sequences with internal carbon atoms and functional groups.

Summary.

Current animal screening methods at the National Cancer Institute include the use of a standard pre-screen (P388) to test roughly 15,000 new compounds a year. Results from P388 testing have been used to create an experimental structure-activity model to aid in selecting compounds likely to be active in P388. In order to better detect new leads, well-known active classes of compounds have been eliminated from the training data. There were still approximately 1300 active compounds and 14,000 inactives, so that a statistical model for prediction of activity was the method of choice. The predictors are molecular structure features previously used in searching the chemical structure file. The method assumes independence of features, so new combinations can be easily detected. Emphasis on low incidence features also helps point toward novel compounds. In this vein, for each compound the feature occurring least often and all features not yet in P388 testing are flagged.

At this time there is some general agreement regarding the usefulness of the methods. The results on test data are quite encouraging, especially with the GAA model. The methods are being put into operational use so more concrete results are expected. Meanwhile, improvements are planned for the models.

Acknowledgement. Besides the people mentioned in the paper, I owe thanks to Sid Richman and Ruth Geran and several others at NCI and CAS.

Literature Cited.

1. Hodes, Louis et al., J. Med. Chem. (1977), 20, 469.
2. Cramer, Richard D. et al., J. Med. Chem. (1974), 17, 533.
3. Richman, Sidney et al., in Retrieval of Medicinal Chemical Information, ACS Symposium Series (1978).
4. Schenk, H.R. and Wegmuller, F., J. Chem. Inf. Comput. Sci. (1976), 16, 153.

RECEIVED June 8, 1979.
