9 A Computer System for Structure-Activity Studies Using Chemical Structure Information Handling and Pattern Recognition Techniques
Downloaded by UNIV OF PITTSBURGH on May 3, 2015 | http://pubs.acs.org Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0052.ch009
A. J. STUPER, W. E. BRUGGER, and P. C. JURS Department of Chemistry, The Pennsylvania State University, University Park, PA 16802
The study of relationships between chemical structures and their biological activity is currently receiving a great deal of attention. The term biological activity covers a range from pharmaceuticals and drugs, to agricultural chemicals such as pesticides and herbicides, to toxic reactions such as those of poisons, carcinogens, teratogens, and mutagens. A variety of methods have been exploited for structure-activity studies:

(1) The semiempirical linear free energy related (LFER) or extrathermodynamic model developed by Hansch and co-workers. The LFER method is applied to homologous series of compounds that are related in that they are formed by placing substituents on a parent compound. The method depends on defining quantitative correlations between physicochemical parameters of a compound and the biological response observed. An equation of the form

$$\log(1/C) = a\pi^2 + b\pi + \rho\sigma + cE_s + d$$

is fit to the set of data using linear regression. The variables are as follows: C is the concentration of the compound necessary to produce a standard biological response; π is the difference between the logarithm of the 1-octanol/water partition coefficient of the parent compound and the substituted compound; σ is the Hammett substituent constant that provides a measure of the electronic effect on the reaction rate; and $E_s$ is a steric factor which compares sizes of substituents to that of methyl taken as a standard.

(2) The de novo or additivity model proposed by Free and Wilson. In this approach the contributions to the parameter defining biological response by each substituent group are assumed to be additive. The equation is

$$A_i = \mu + \sum_j a_{jp}$$
where μ is the overall average activity (the contribution of the constant part of the molecule, the parent structure), $a_{jp}$ is the contribution to the activity from the jth substituent in the pth position in the parent structure, and $A_i$ is the standard biological response for drug compound i. Regression analysis is used to obtain numerical values for the substituent contributions.

(3) Quantum mechanical methods. These methods have been used to calculate parameters to be correlated with activity and for the determination of preferred conformations of biologically active molecules.

The purpose of the present project was to apply the ADAPT computer system to specific structure-activity problems. The ADAPT computer system combines techniques of chemical structure information handling and pattern recognition for the study of chemical structure-biological activity relations. This system can be used to enter and store a set of diverse chemical structures, generate structural descriptors, and analyze them using pattern recognition methods. These three steps are illustrated in Figure 1.

Several premises are involved in this approach to the study of structure-activity relations:
- Structure and biological activity are related.
- Structures of compounds can be adequately represented as a set of molecular descriptors.
- A relation can be discovered between the structure and activity by applying pattern recognition methods to a set of tested compounds.
- The relation can be extrapolated to untested compounds.

In Chemometrics: Theory and Application; Kowalski, B.; ACS Symposium Series; American Chemical Society: Washington, DC, 1977.

Introduction to Pattern Recognition
Chemical and biological data are being produced at a prodigious rate. This has led to burgeoning interest in computer-assisted methods for the accumulation, handling, and interpretation of these data. Standard approaches to the interpretation problem include statistical interpretation, curve fitting, and model fitting. The development or verification of mathematical expressions relating independent variables and observable dependent variables is the goal of such studies. The intent is to create a model whose parameters represent quantities with physical significance. Then best values for the parameters are developed from the data by model fitting. In the absence of a mathematical model, curve fitting using general functions, e.g., polynomials, can be employed.

Not all problems faced by the chemist, however, lend themselves to such exacting solution: frequently, equations describing processes of interest are difficult or impossible to obtain, and a host of problems have not yielded to a satisfactory or usable theoretical explanation. In the absence of theoretically-based solutions, empirically-derived methods will often suffice to yield useful and practical solutions to complex problems.

[Figure 1. Steps in experimental procedure: entry and storage of chemical structures -> connection tables -> descriptor generation -> data matrix -> pattern recognition analysis.]

Standard approaches to the extraction of information from complex data forms have included linear optimization, information theory, and a plethora of statistical analysis techniques. Since the early 1950's pattern recognition methods have also been applied to a variety of data interpretation problems and have paralleled the computer's growth in speed and sophistication with a corresponding expansion in scope and capacity. Pattern recognition techniques have found application in such varied fields as computer and information science, engineering, statistics, biology, physics, medicine, and physiology. Each of these disciplines has adapted the basic methods of pattern recognition to its own specific requirements.

Pattern recognition comprises the detection, perception, and recognition of invariant properties among sets of measurements of objects or events. The purpose of pattern recognition is generally to categorize a sample of observed data as a member of the class to which it belongs. This general approach has been applied to problems from a number of diverse fields. Several excellent reviews of the pattern recognition literature have appeared which dramatize the enormous breadth of pattern recognition applications (1-5).

There is a growing literature addressed to the applications of pattern recognition to chemical data interpretation. Pattern recognition methods are uniquely suited to a variety of studies because of several novel attributes. No mathematical model is used, but rather relationships are sought which provide definitions of similarity between diverse groups of data.
Pattern recognition techniques are able to deal with high dimensional data (data for which more than three measurements are used to represent each object). Such high dimensional data cannot be directly visualized or displayed. In addition, pattern recognition techniques can deal with multisource data or data in which the relationships are discontinuous. In multisource data each measurement can be the result of an independent generating algorithm or experiment, and each can have a different scale, origin, distribution, etc. from all the other measurements. Therefore, there will be no direct functional relationship between the measurements in multisource data as there must be, for example, in an absorbance vs. concentration plot. In applications of pattern recognition to structure-activity relations, the data are always multisource data.

For problems providing multisource data, it is difficult to know in advance whether an appropriate set of measurements has been generated to effect a satisfactory solution. The generation of sufficiently informative multisource measurements can become in itself a major part of the overall pattern recognition experiment. When a number of measurements are available, pattern recognition can be used to judge their relative quality or utility with regard to specific questions. It is this ability to define relations through use of a diverse set of measurements which affords pattern recognition techniques their utility in such a wide variety of fields.

When properly used, pattern recognition techniques allow the chemist to develop criteria which relate the presence of properties to a particular subset of the total number of measurements. Once the important measurements are identified, they can be used to guide the development of subsequent experiments.
For example, if a chemist were to find that ten structural parameters were important indicators of a particular biological effect, then he might hypothesize several as yet unstudied structures, and use the results from the pattern recognition analysis to make an educated guess as to their effects. Alternatively, the fact that the particular ten parameters were shown to be important may lead to added insights into the problem. This ability to pick a subset of the original measurements which contains the bulk of the total information content is extremely desirable. As relations between several variables are not easily deduced through observation, this is an extremely useful capability of pattern recognition.

Basic Pattern Recognition System. A general pattern recognition system for structure-activity studies must be capable of accepting numerical descriptors from the descriptor development routines, performing prior feature selection, preprocessing the data, and classifying the compound. A schematic representation of this basic system is shown in Figure 2. It consists of four interrelated subunits: prior feature selection, preprocessing, classification, and feedback feature selection. The prior feature selection routine accepts the data to be classified and transforms them to make the classification task easier.
Then, the preprocessor attempts to pursue the following two goals simultaneously: (a) to reduce or eliminate the fraction of information contained in the raw data that is irrelevant or even confusing; and (b) to preserve sufficient information to allow discrimination among the pattern classes. The classifier operates on the transformed pattern vector to produce a classification decision. The feedback loop indicates that the pattern recognition system may use the results of its classification to develop a superior feature extraction approach. The entire pattern recognition system is generally implemented with computer software.
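The data flow through the four subunits can be pictured with a short Python sketch. It is only an illustration under stated assumptions: the chapter does not say what each subunit computes, so the rules used here (dropping constant descriptors, autoscaling, keeping the largest weights) and every function name are hypothetical stand-ins chosen to make the flow concrete.

```python
from statistics import mean, pstdev

def prior_feature_selection(rows):
    """Hypothetical rule: drop descriptors that are constant over all compounds."""
    columns = list(zip(*rows))
    keep = [j for j, col in enumerate(columns) if pstdev(col) > 0.0]
    return [[row[j] for j in keep] for row in rows], keep

def preprocess(rows):
    """Hypothetical rule: autoscale each descriptor to zero mean, unit variance."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]

def classify(rows, weights):
    """Linear classifier: class +1 or -1 from the sign of the weighted sum."""
    return [1 if sum(w * x for w, x in zip(weights, row)) > 0.0 else -1
            for row in rows]

def feedback_feature_selection(weights, n_keep):
    """Hypothetical rule: keep the n_keep descriptors with the largest |weight|."""
    ranked = sorted(range(len(weights)), key=lambda j: -abs(weights[j]))
    return sorted(ranked[:n_keep])

# Toy data matrix: 4 compounds x 3 descriptors (the third never varies).
raw = [[1.0, 10.0, 5.0],
       [2.0, 12.0, 5.0],
       [3.0, 30.0, 5.0],
       [4.0, 32.0, 5.0]]

selected, kept = prior_feature_selection(raw)              # kept == [0, 1]
scaled = preprocess(selected)
weights = [0.2, 1.0]                                       # illustrative, not trained
classes = classify(scaled, weights)                        # [-1, -1, 1, 1]
retained = feedback_feature_selection(weights, n_keep=1)   # [1]
```

In a working system the indices returned by the feedback step would be fed back to the descriptor selection stage, closing the loop described above.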
Classifiers. Methods of classification fall naturally into two categories: parametric and nonparametric methods. Parametric training methods consist of estimating the statistical parameters of the samples forming the training set and then using these statistical parameters for the specification of the discriminant function. Nonparametric discriminant functions are developed directly from a sample of the data themselves.

Learning Machines. Data to be used in pattern recognition studies are represented as vectors, $X = (x_1, x_2, \ldots, x_n)$, where $x_j$ represents one observation. Structures of molecules can be coded in this format using numerical descriptors for the $x_j$ entries. For example, entries could include the molecular weight, numbers of oxygen atoms, length, volume, lipophilicity, dipole moment, number of times a particular substructure is imbedded in the structure, etc. For computational convenience an extra descriptor, whose value is set equal to a constant, is added to each pattern vector.

Data represented as vectors can be thought of either as points in an n-dimensional Euclidean space or as vectors pointing from the origin to those points, hence pattern vectors. Thus, a set of data such as a collection of mass spectra or a set of suitably encoded chemical structures can be represented as a set of n-dimensional pattern vectors. Experience shows that points representing patterns with common characteristics cluster in limited regions of the pattern space.
For example, a set of points representing the molecular structures of compounds active as tranquilizers may cluster in a different region.

There is an important relationship connecting the number of points in a data set, m, and the number of descriptors per point, n, the dimensionality of the space. As shown by Nilsson (6) and by Tou and Gonzalez (7), the ability of a binary pattern classifier (BPC) to separate points is high, even for random points, if m is less than twice as large as n. The probability of finding a linear decision surface capable of separating any randomly placed 50 points in a 25-dimensional space is nearly unity. Direct tests in our laboratory are in agreement with the theory of BPC's and show that one has not eliminated the possibility of meaningless training until m is 2.5 or 3 times as large as n. Thus, if one finds a separating linear decision surface for 75 points in a 25-space, then the probability is overwhelming that the separation is meaningful, and it is not a mathematical artifact.

If the clusters are dense and are far apart from each other, and if the dimensionality of the space is sufficiently low, then display or mapping techniques can be used. This is done by performing a one-to-one mapping of pattern points from the original n-dimensional space to a 2- or 3-dimensional space with as little distortion as possible. If these techniques can be successfully employed, then one can observe the clusters directly on a 2- or 3-dimensional plot.

An alternative way to investigate the structure of the set of points is to separate the clusters from one another by decision surfaces. The simplest decision surface is a hyperplane. Two clusters of points which can be completely separated by a hyperplane are said to be linearly separable. Any hyperplane has associated with it a normal vector, called here the weight vector. The weight vector consists of an ordered sequence of components, $W = (w_1, w_2, \ldots, w_n)$, which stands in one-to-one correspondence with the components of the patterns to be classified. Specification of the components of the weight vector is completely equivalent to specification of the position of a hyperplane decision surface.

Any pattern point in a hyperspace can be classified with respect to a hyperplane decision surface by taking the dot product between that pattern vector and the normal vector, or weight vector:
$$s = W \cdot X = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = |W|\,|X| \cos\theta$$
in which θ is the angle between the two vectors. Since |W| and |X| are always positive, the value of θ determines the sign of the dot product. For patterns on one side of the plane the dot product is always positive, and for patterns on the opposite side the dot product is always negative. The dot product is normally computed from the summation of pairwise products of the components of the two vectors for convenience. The correspondence between category 1 and category 2 and the two sides of the hyperplane is arbitrary.

The logical operation described above is performed by a threshold logic unit or TLU. The TLU accepts the pattern vector to be classified, calculates the dot product between the pattern vector and the weight vector, compares the dot product against zero, and classifies the pattern according to the sign of the dot product.

Discriminant Function Development. Given the system discussed above for performing classifications, the outstanding problem in the development of useful pattern classifiers becomes that of finding useful decision surfaces. This can be done, for the nonparametric systems of interest, by a method called training. A training set of patterns whose correct classifications are known is used to develop an effective decision surface.

The members of the training set of objects are presented to the TLU being trained one at a time. The weight vector being trained is initialized arbitrarily. When an incorrect classification is made, the weight vector is altered. The alteration is performed in such a way as to ensure that the new weight vector will correctly classify the pattern. This process continues until all the patterns of the training set are correctly classified. If the procedure does not find a weight vector capable of correctly classifying all the members of the training set, then the routine is terminated in order to conserve computer time.

Learning Machine Attributes. The capabilities and performance of learning machines can be described in terms of three principal attributes: recognition, convergence rate, and prediction.

Recognition is the ability of the trained binary pattern classifier to correctly classify the members of its training set. Recognition is 100% for a binary pattern classifier whose decision surface is in the region between two separated clusters. That is, after training is complete for such a case, the TLU can correctly categorize any of the members of the training set.

Convergence rate refers to the rate at which a TLU approaches 100% recognition. Since computer time is an expensive commodity, it is of interest to minimize training time. The training procedures used to find useful TLU's are commonly altered so as to force rapid learning.

Prediction refers to the ability of the TLU to correctly classify unknowns which were not members of the training set.
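The TLU and its training loop can be sketched in a few lines of Python. This is a minimal illustration, not the ADAPT implementation: the text only requires that each alteration make the new weight vector classify the offending pattern correctly, so the specific correction used here (W' = W + cX, with c chosen so that the new dot product is the negative of the old one) is an assumed, commonly used choice. The function names, the `max_passes` cutoff, and the toy data are likewise hypothetical.

```python
def dot(u, v):
    """Summation of pairwise products of the components."""
    return sum(a * b for a, b in zip(u, v))

def tlu_classify(w, x):
    """Threshold logic unit: category +1 if W.X > 0, else category -1."""
    return 1 if dot(w, x) > 0.0 else -1

def train_tlu(patterns, labels, w_init, max_passes=50):
    """Error-correction training; returns (weight vector, converged flag)."""
    w = list(w_init)
    for _ in range(max_passes):
        errors = 0
        for x, y in zip(patterns, labels):
            s = dot(w, x)
            if tlu_classify(w, x) != y:
                errors += 1
                # Assumed rule: W' = W + c X, with c chosen so the new dot
                # product is -s (or y when s is exactly zero).
                target = -s if s != 0.0 else float(y)
                c = (target - s) / dot(x, x)
                w = [wi + c * xi for wi, xi in zip(w, x)]
        if errors == 0:
            return w, True      # 100% recognition reached
    return w, False             # terminated to conserve computer time

def percent_correct(w, patterns, labels):
    hits = sum(tlu_classify(w, x) == y for x, y in zip(patterns, labels))
    return 100.0 * hits / len(patterns)

# Toy training set: three descriptors plus the constant extra descriptor 1.
train_x = [[1, 1, 1, 1], [1, 1, -1, 1], [1, -1, 1, 1], [1, -1, -1, 1],
           [-1, 1, 1, 1], [-1, 1, -1, 1], [-1, -1, 1, 1], [-1, -1, -1, 1]]
train_y = [1, 1, 1, 1, -1, -1, -1, -1]

w, converged = train_tlu(train_x, train_y, w_init=[0.0, 0.0, 0.0, 1.0])
recognition = percent_correct(w, train_x, train_y)      # 100.0

# Unknowns that were not members of the training set.
unknown_x = [[2.0, 0.5, -0.5, 1.0], [-3.0, 1.0, 1.0, 1.0]]
unknown_y = [1, -1]
prediction = percent_correct(w, unknown_x, unknown_y)   # 100.0
```

With this starting vector the loop makes two corrections in the first pass and finishes the second pass without error, ending at W = (1, 0, 0, 0). A different initialization or presentation order would generally end at a different, equally valid weight vector.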
Prediction is the most interesting and potentially useful of the attributes because high predictive ability demonstrates that the TLU has been able to learn something about how to discriminate between the two classes being trained for, and the ability to correctly classify unknown spectra into useful chemical categories is one drive behind all automation of chemical data interpretation.

Predictive ability is normally tested by splitting the available data set into two parts: a training set and a prediction set. After training is complete, and without further adjustment of the weight vector, the members of the prediction set are classified and the percentage correct is taken as the predictive ability. Another approach, known as the leave-one-out procedure, involves training a BPC using a training set containing all the data on hand except one member, and then predicting the class of the one unknown after training is complete. When averaged over a number of independent trials, the percentage of unknowns correctly classified is a measure of the predictive ability.

Feedback Feature Selection. After a series of weight vectors have been trained for the same question, they can be used to perform feedback feature selection. One method that has been used for a number of problems is weight-sign feature selection.
Implementation of this method takes advantage of the fact that the exact orientation of a trained weight vector (that is, the relative magnitudes of its components) depends on the initialization used prior to training, the magnitude of the nth component of the pattern vectors, $x_n$, the feedback training methods employed, the sequence in which the members of the training set are presented to the classifier during training, and several other factors. In other words, the exact orientation of a trained weight vector depends on the details of how it was found. In weight-sign feature selection a pair of weight vectors is developed for the same question but using slightly different approaches, e.g., different initializations. Then the algebraic signs of their components are compared pairwise. When the components of the two weight vectors that both correspond to a particular descriptor disagree in sign, that descriptor is discarded; when the signs agree, the descriptor is retained. The procedure is repeated iteratively until two weight vectors are trained that are in complete agreement for all descriptors that are most useful for a particular classification.

More recently, a new feedback feature selection procedure much superior to the weight-sign method has been developed. The variance feature selection method also takes advantage of the fact that the orientation of a trained weight vector is dependent on how it was developed. Here, a group of weight vectors are trained for a classification problem in a manner designed to exploit these dependencies. The series of weight vectors is then used to rank the descriptors that were most useful in separating the two classes under investigation.
The ranking is done by developing an ordered list of the descriptors based on the relative variation of the corresponding weight vector components among the series of trained weight vectors. Then the intrinsic descriptors (those forming the minimal set of descriptors sufficient to effect separation) can be discarded. The variance feature selection method has been applied to a wide variety of problems in our laboratory.

Chemical Applications of Pattern Recognition. Application studies of chemical problems using pattern recognition techniques have been reported in a number of areas (8-14). These are listed in subsets because each general area requires some different approaches and techniques.

(1) Spectral Data Analysis. Elucidation of chemical structure information from spectroscopic data is the area that has received the most attention from those practicing pattern recognition. Studies have been done with mass spectra, infrared spectra, stationary electrode polarograms, gamma-ray spectra, and proton and ¹³C nuclear magnetic resonance spectra.

(2) Materials Science. The classification of materials as to origin or suitability with respect to production specifications has been reported. The data used are generally multisource data coming from a variety of analytical techniques.

(3) Classification of Complex Mixtures. The identification of petroleum samples by analyzing analytical data by pattern recognition techniques has been reported. Data used for classification in different studies have included gas chromatograms, infrared spectra, fluorescence spectra, and trace metals concentrations. A second example of data analysis of complex mixtures comes from biological mixtures, e.g., serum; such analyses are feasible and have been reported.

(4) Modeling of Chemical Experiments. Pattern recognition techniques have been used to model complex chemical systems where the details of the chemical and/or physical interactions were not completely understood, e.g., relative retention of compounds on different chromatographic liquid phases.

(5) Prediction of Properties from Molecular Structure. A number of studies of the application of pattern recognition to the problem of searching for correlations between molecular structure and biological activity have been reported.
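Returning to the two feedback feature selection methods described earlier in this section, the following Python sketch shows one way the comparisons could be carried out once trained weight vectors are in hand. It is an illustration under assumptions: the weight vectors are made-up numbers rather than trained results, and "relative variation" is taken here to mean the standard deviation of a component divided by the magnitude of its mean, which is one plausible reading rather than a definition given in the text.

```python
from statistics import mean, pstdev

def weight_sign_selection(w_a, w_b):
    """Indices of descriptors whose weights agree in sign in both vectors."""
    return [j for j, (a, b) in enumerate(zip(w_a, w_b)) if a * b > 0.0]

def relative_variation(components):
    """Assumed measure: standard deviation divided by |mean|."""
    m = mean(components)
    return pstdev(components) / abs(m) if m != 0.0 else float("inf")

def variance_ranking(weight_vectors):
    """Descriptor indices ordered from least to most relative variation."""
    columns = list(zip(*weight_vectors))
    scores = [relative_variation(col) for col in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j])

# Hypothetical weight vectors trained for the same question from
# different initializations (one component per descriptor).
w1 = [0.9, -0.4, 0.25, -0.7]
w2 = [1.1, 0.3, 0.5, -0.5]
w3 = [1.0, -0.1, -0.75, -0.6]

retained = weight_sign_selection(w1, w2)    # [0, 2, 3]: descriptor 1 disagrees
ranking = variance_ranking([w1, w2, w3])    # [0, 3, 1, 2]
```

A single weight-sign comparison is shown; in the procedure described above the retained descriptors would be used to train a new pair of weight vectors and the comparison repeated until the signs agree for every remaining descriptor.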
Applications of Pattern Recognition to Structure-Activity Relations

Applications to Structure-Activity Relations. Within the last few years reports have begun to appear of work dealing with cluster analysis and pattern recognition applications to drug structure-activity relation studies. A paper by Hansch, Unger, and Forsythe (15) discussed the application of hierarchical cluster analysis techniques to the problem of selection of substituents. The data used to represent each drug were the lipophilic π constant, electronic parameters, the approximate steric molar refractivity and molecular weight constants, all of them physicochemical parameters. A paper by Hill et al. (16) discussed the problem of drug design as approached by using a three-layer perceptron network. Forty-six 1,3-dioxane molecules were used as the data set for training and prediction of perceptrons to determine anticonvulsant activity. Predictive abilities in the range of 68 to 76 percent were reported. A paper by Ting et al. (17) reported correlations between the low resolution mass spectra of sixty-six drugs and their pharmacological activity as sedatives or tranquilizers. This paper was criticized with regard to the set of drugs used in the analysis (18) and with regard to the number of drugs used and their relative similarities (19). Several papers (20-22) have recently appeared reporting studies in which molecules were represented by a list of structural features of the molecules.
Adamson and Bush (20) used l i b r a r y searching programs t o generate a l l s t r u c t u r a l fragments i n t h e i r data s e t and represented the drugs by l i s t s o f the number of occurences o f each substructure i n the molecules. Chu (21) used a number o f p a t t e r n r e c o g n i t i o n and c l u s t e r a n a l y s i s programs to analyze a s e t o f s i x t y - s i x drugs represented by f o r t y - s i x fragments. Kowalski and Bender (22) used three p a t t e r n r e c o g n i t i o n c l a s s i f i e r s t o attempt t o c l a s s i f y 200 drugs w i t h respect to a c t i v i t y f o r the Adenocarcinoma 755 B i o l o g i c a l A c t i v i t y T e s t . T h e i r paper has been c r i t i c i z e d f o r the choice o f the twenty d e s c r i p t o r s used (23). Chu e t a l . (24) reported on the a p p l i c a t i o n o f p a t t e r n r e c o g n i t i o n and s u b s t r u c t u r a l a n a l y s i s t o the problem o f i n v e s t i g a t i n g the a n t i n e o p l i a s t i c a c t i v i t y o f a s e t o f drugs i n the
experimental mouse brain tumor system. The set of molecules was represented by augmented atom fragments, "heteropath" fragments, and ring fragments. Nearest neighbor and learning machine methods of classification were employed, and it was concluded that these methods could be successfully applied to the problem. Craig and Waite (25) have reported the application of pattern recognition techniques to the prediction of toxicity of organic compounds.
Structure-Activity Studies Using Pattern Recognition
In order to apply pattern recognition techniques to studies of molecular structure-biological activity correlations, the data must be taken through a number of individual steps. These are listed in order to show how interrelated the steps become.
(a) Identify data set.
(b) Enter molecular structures. A complete description of the structure of each molecule must be entered into a file.
(c) Generate usable file. A subset of compounds must be selected from the master structure file. This may involve searching of keys for the structures, and will require carrying along an identifying label for each structure.
(d) Descriptor development. The molecular structures stored in a general purpose form (e.g., connection tables) must be decomposed into sets of descriptors. The three general classes are topological, geometrical, and externally generated descriptors.
(e) Form data matrix. The subset of the available descriptors to be used is identified, and a matrix of data is generated. It may be partitioned into a training set and a prediction set.
(f) Prior feature selection. Techniques can be applied to determine which descriptors are expected to be most important.
(g) Discriminant development. The data set is used to develop a discriminant function. After development, the discriminant function can be tested on unknowns to assess predictive ability.
(h) Feedback feature selection. The results of classification can be used to identify the most useful descriptors.
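Steps (e) and (g) can be illustrated with a small sketch. It assumes a linear discriminant trained by error-correction feedback (a simple learning machine); the function names, the descriptor values, and the split into training and prediction sets are all invented for illustration and are not taken from ADAPT.

```python
def train_linear_discriminant(patterns, classes, max_passes=100):
    """Error-correction training of a linear discriminant.

    patterns: list of descriptor vectors
    classes:  +1 (active) or -1 (inactive) for each pattern
    Returns the weight vector; its last component is the bias term.
    """
    weights = [0.0] * (len(patterns[0]) + 1)
    for _ in range(max_passes):
        errors = 0
        for x, c in zip(patterns, classes):
            augmented = list(x) + [1.0]
            s = sum(w * v for w, v in zip(weights, augmented))
            if s * c <= 0:  # misclassified or on the boundary: correct the weights
                weights = [w + c * v for w, v in zip(weights, augmented)]
                errors += 1
        if errors == 0:  # every training pattern is classified correctly
            break
    return weights


def classify(weights, x):
    """Return +1 or -1 according to the sign of the discriminant."""
    augmented = list(x) + [1.0]
    return 1 if sum(w * v for w, v in zip(weights, augmented)) > 0 else -1


# (e) Data matrix: rows are compounds, columns are descriptors (made-up numbers).
data = [[2.0, 1.0], [3.0, 2.0], [2.5, 3.0], [-1.0, -2.0], [-2.0, -1.0], [-3.0, -2.5]]
labels = [1, 1, 1, -1, -1, -1]

# Partition into a training set and a prediction set.
training_x, training_y = data[:2] + data[3:5], labels[:2] + labels[3:5]
prediction_x, prediction_y = [data[2], data[5]], [labels[2], labels[5]]

# (g) Develop the discriminant, then test it on compounds it has not seen.
weights = train_linear_discriminant(training_x, training_y)
predicted = [classify(weights, x) for x in prediction_x]
```

Comparing `predicted` with `prediction_y` gives the predictive ability on the held-out compounds; the weights themselves are one possible input to feedback feature selection in step (h).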
One of the primary prerequisites for a useful general purpose pattern recognition system is a general, data-independent, file management system. A general purpose system has been developed (26) that consists of a set of interactive computer routines known collectively as ADAPT (Automated Data Analysis using Pattern recognition Techniques). This system provides a generalized framework that takes into account the practical considerations inherent in the implementation of the pattern recognition framework shown in Figure 1.
ADAPT Architecture. Figure 1 does not make clear the inherent diversity of the data handling problem. Not only must measurements from the transducer(s) be input, but they must be stored and labelled. Each data point must be given a class designation and identification number. Class designations must be easily assigned or modified. This ease of definition and redefinition is of utmost importance in the overall data analysis. The source of the data is also important. Sources such as digitized spectra or complex molecular structures would have widely different storage requirements. Since the operations performed on one type of data may bear little similarity to the operations performed on other types of data, a system designed with a high degree of modularity is required. To accommodate these requirements, the ADAPT system is implemented in independent segments. Each segment can execute independently, obtaining all necessary information either from a set of disc storage files or by interaction with the user. This mode of operation offers several advantages, the most obvious of which is a savings in core storage. The modularity decreases the complexity of the system and provides a means to incorporate additional algorithms into the system at any time. Thus the entire system is adapted to any user's individual requirements since only those overlays which are relevant to the particular problem at hand need be executed.
In addition, these routines are relatively inexpensive to use because they do not require large scale facilities for execution. Finally, the system is interactive in the sense that the user directs which manipulations are to be performed upon the data. ADAPT thus consists of a framework within which an unlimited number of independent segments can be supported. Each segment performs a specific, independent operation ranging from initial input of data to final output of results. The general utility of the system arises from the fact that the user has a large number of options to choose from, and he can conveniently interact with his data set. Interaction with ADAPT is provided via a Tektronix 4010 CRT terminal. Data is stored in a series of defined files on cartridge discs. This allows fast access and ease of manipulation. Currently, ADAPT consists of approximately 70 defined files which use 2.4 million bytes of storage (one cartridge disc). The ADAPT routine uses approximately 90,000 bytes of core storage for its largest overlay and is currently implemented using a sixteen-bit MODCOMP II/25 computer located in the Department of Chemistry at The Pennsylvania State University. The segments of the ADAPT system can be broken down into the following list:
(1) File generator, including graphical input of structures
(2) Class maker
(3) Three-dimensional model builder
(4) Descriptor developer
(5) Collator
(6) Preprocessor
(7) Prior feature selector
(8) Discriminant developer
(9) Feedback feature selector
(1) File Generator. The library of drugs to be studied is entered through the file generator routine. Structures are input by drawing them in two dimensions on the screen of an interactive graphics terminal under the control of a general structural input routine, UDRAW, which has been fully described elsewhere (27). A molecule's structure, along with corresponding pharmacological data, is entered into a disc resident permanent file. Information saved for future use includes a compressed connection table, ring information, a list of reported activities, the two-dimensional coordinates of the atoms when entered (for possible redrawing of the structures later), an identification number, and the chemical name of the compound. In addition to generation, the file can be altered: the information stored for a drug can be changed, a drug can be deleted entirely from the file, and any file member can be displayed. A selection of recallable molecular backbones can be stored for more convenient entry of series of structurally related compounds. These structures can then be made to appear upon the initial UDRAW sketch pad and a complete molecule can be built up starting from this backbone. This allows the user to input a series of structurally similar compounds without redrawing the base structure each time. The routine that oversees structure input and file generation can maintain a file of 1000 structures and associated auxiliary information.
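The stored information listed above maps naturally onto a record type. The following sketch shows one way such a record and its file might be modelled today; the class names, field names, and the in-memory dictionary are assumptions made for illustration and do not describe the actual ADAPT disc file layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class StructureRecord:
    """One library entry; field names are illustrative, not ADAPT's own."""
    id_number: int
    name: str
    # Connection table: atom index -> (element symbol, {neighbour index: bond order})
    connection_table: Dict[int, Tuple[str, Dict[int, int]]]
    rings: List[List[int]] = field(default_factory=list)      # atoms in each ring
    activities: List[str] = field(default_factory=list)       # reported activities
    coordinates_2d: Dict[int, Tuple[float, float]] = field(default_factory=dict)


class StructureFile:
    """In-memory stand-in for the disc resident file (capacity taken from the text)."""
    CAPACITY = 1000

    def __init__(self) -> None:
        self._records: Dict[int, StructureRecord] = {}

    def add(self, record: StructureRecord) -> None:
        """Store a new record, or overwrite (alter) an existing one."""
        is_new = record.id_number not in self._records
        if is_new and len(self._records) >= self.CAPACITY:
            raise ValueError("structure file is full")
        self._records[record.id_number] = record

    def delete(self, id_number: int) -> None:
        del self._records[id_number]

    def get(self, id_number: int) -> StructureRecord:
        return self._records[id_number]

    def __len__(self) -> int:
        return len(self._records)


# A made-up entry: ethanol drawn as three non-hydrogen atoms.
library = StructureFile()
library.add(StructureRecord(
    id_number=1,
    name="ethanol",
    connection_table={1: ("C", {2: 1}), 2: ("C", {1: 1, 3: 1}), 3: ("O", {2: 1})},
    activities=["depressant"],
    coordinates_2d={1: (0.0, 0.0), 2: (1.0, 0.5), 3: (2.0, 0.0)},
))
```

Keeping the drawing coordinates alongside the connection table is what allows a structure to be redisplayed later without being redrawn.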
The first structure file now stored in the system consists of approximately one thousand central nervous system agents taken from the literature (28). Among the biological activity classes reported there are analgesics, anticonvulsants, depressants, hypnotics, relaxants, sedatives, stimulants, and tranquilizers; there are approximately forty classes altogether, many of which overlap. The second file of molecular structures currently resident on the ADAPT disc file consists of 184 5,5-disubstituted barbiturates taken from a reference volume (29). A study using this data set will be discussed in a later section. The third file contains approximately 500 compounds comprising an olfaction data set taken from Amoore (30). Molecules reported to have musk, camphor, mint, ether, floral, pungent, and putrid odors are present. This data set is being used in studies of the relation between molecular structure and odor quality. The fourth file consists of a set of molecules comprising an olfaction data set taken in a study of trigeminal detection of compounds. These compounds are being employed in a study of the similarities and differences observed in trigeminal as opposed to
olfactory detection of chemicals by humans.
(2) Class Maker. The class maker routine is used to access the library file and to create sets of library members that satisfy queries entered by the user. Thus, the set of all file members which have been reported to be sedatives can be formed into an active data set. This routine is used to generate classes of structures to be used as data sets for the development of discriminants by another section of ADAPT. When the property being sought is known quantitatively, the data set is assembled in increasing sequence. Then a series of discriminants can be trained for different threshold cutoffs between the active and inactive classes without moving any data but only by reallocating class memberships.

(3) Three-Dimensional Model Builder. The three-dimensional molecular model builder routine is used to derive information on the spatial conformation of molecules. A molecule can be viewed as a collection of particles held together by simple harmonic or elastic forces. These forces can be defined by potential energy functions whose variables are the atom coordinates of the molecule. This function can then be minimized to obtain a strain-free three-dimensional model of the molecule. Geometric parameters can then be extracted. A wealth of information already exists describing the procedures and results of several different molecular mechanics algorithms (31,32). Therefore, finding and implementing an algorithm to model sets of molecules is a relatively straightforward procedure.
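The idea of minimizing a harmonic strain function over atom coordinates can be sketched in a few lines. The example below is a toy under stated assumptions: it uses only a bond-stretch term with an invented force constant and ideal bond length, and it minimizes with a simple derivative-free compass (pattern) search chosen here for brevity. It is not the MOLMEC routine.

```python
import math


def bond_strain(coords, bonds, k=1.0):
    """Sum of harmonic (Hooke's law) bond-stretch terms k * (r - r0)**2.

    coords: list of [x, y, z] per atom
    bonds:  list of (i, j, r0) with r0 the ideal bond length
    """
    total = 0.0
    for i, j, r0 in bonds:
        r = math.dist(coords[i], coords[j])
        total += k * (r - r0) ** 2
    return total


def pattern_search(energy, coords, step=0.5, tol=1e-6, max_iter=10000):
    """Derivative-free minimization: try +/- step on every coordinate,
    keep any move that lowers the energy, halve the step when none does."""
    coords = [list(p) for p in coords]
    best = energy(coords)
    for _ in range(max_iter):
        improved = False
        for atom in coords:
            for axis in range(len(atom)):
                old = atom[axis]
                for delta in (step, -step):
                    atom[axis] = old + delta
                    trial = energy(coords)
                    if trial < best:
                        best = trial
                        improved = True
                        break          # keep the improved coordinate
                    atom[axis] = old   # undo the unsuccessful move
        if not improved:
            step /= 2.0
            if step < tol:
                break
    return coords, best


# Three atoms in a chain, deliberately drawn with bonds that are too long.
bonds = [(0, 1, 1.5), (1, 2, 1.5)]
start = [[0.0, 0.0, 0.0], [2.3, 0.4, 0.0], [3.0, 2.5, 0.7]]
final, strain = pattern_search(lambda c: bond_strain(c, bonds), start)
lengths = [math.dist(final[i], final[j]) for i, j, _ in bonds]
```

A search of this kind needs only energy evaluations, which is convenient when analytical derivatives of the strain function are not available; a realistic strain function would add angle, torsion, and non-bonded terms.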
A modified version of the molecular mechanics routine described by Wipke et al. (33-35) has been developed and interfaced to the ADAPT system so that geometric descriptors can be derived from the resulting molecular structure. The molecular mechanics routine, MOLMEC, used in conjunction with the ADAPT system is highly interactive and relies on graphical input and output. A graphics unit is also supported and is utilized by MOLMEC for displaying the molecule being modelled. The structure input section of MOLMEC has been designed to allow the user either to read the molecule's connection table from ADAPT's disc files or to enter the structure from the CRT via UDRAW (27). Thus, MOLMEC can be used independently of the ADAPT system. Once the molecule has been entered, control branches to the interactive section where the user can direct the different phases of modelling as well as monitor the results. In the strain minimization section, the atom coordinates are systematically altered until a minimum is found in the strain or potential energy function. The actual strain function used in MOLMEC is:
$$E_{\text{strain}} = E_{\text{bond}} + E_{\text{angle}} + E_{\text{torsion}} + E_{\text{non-bond}} + E_{\text{stereo}}$$

The first four terms of the function are commonly found in all
molecular mechanics strain functions and are modified Hooke's Law functions. The last term of the function has been added to assure the proper stereochemistry about an asymmetric atom. The actual minimization of the function is best accomplished by some type of nonlinear programming method (e.g., steepest descent). In MOLMEC, an adaptive pattern search routine (36) is used because it does not require analytical derivatives. The amount of time necessary to obtain good molecular models depends upon the number of atoms in the molecule, the initial strain of the molecule, and the degrees of freedom in the structure. If a small molecule is being modelled, only one pass through the minimization section may be sufficient to obtain a good structure. However, this is seldom the case. Usually, the molecules are rather large and require several passes. The actual amount of time per pass is limited by a cutoff parameter so that the user may analyze the progress of the modelling at different intervals. The graphics interaction section of MOLMEC contains routines capable of rotating and aligning the molecule into any desired position. Since the graphics unit is only a two-dimensional screen, rotation is essential to obtain a good view of the structure. Furthermore, these routines are useful in locating atoms trapped in local minima. If such an atom is found, the user can move the trapped atom to a new position by a MOVE routine found in the graphics section.
Naturally, if the structure is altered the molecule should be passed through the minimization routine at least once more. When the molecule is finally in a low strain energy conformation, either the molecular parameters can be listed on an output device or the structure's coordinate matrix can be stored on a disc file for further processing. An automatic version of MOLMEC has also been developed so that large molecular data sets can be modelled without continuous supervision. The program consists of an input section, which reads the molecule's connection table and present coordinate matrix from the ADAPT files, a minimization section with all output suppressed, and a section which stores the final coordinate matrix. Good models can easily be obtained in this manner. However, before the coordinate matrices can be used for calculating descriptors, the structures are reviewed to make sure that the molecules are in acceptable conformations. Once modelling is complete, geometric descriptors can be derived. Descriptors currently being used include the absolute or relative magnitudes of the principal moments of inertia of the molecule, the presence or absence of particular spatial arrangements of atoms which have been called pharmacophores, and the molecular volume.

(4) Descriptor Developer. The next step in studies of structure-activity relations is the development of descriptors for the molecules contained in the active data set. This subject has been
discussed in a recent publication (37). Descriptors belong to two general classes: topological and geometrical. Topological descriptors are derived from the topological representation of a compound, the connection table. Geometrical descriptors are derived from the three-dimensional model of the molecule. The individual descriptors that have been used in reported studies are described in the following paragraphs.

(a) Atom and bond descriptors (fragment descriptors). Atom descriptors include the number of C, N, O, S, P, F, Cl, Br, I atoms in the structure. Numbers of bonds of each type are also generated. Both atom and bond descriptors are developed directly from the stored connection table.

(b) Substructure Descriptors. Searching the molecule for the presence of larger fragments provides an alternative method for generating descriptors. If the substructure is found in the molecule, the descriptor can be given a value of one. Otherwise, it has a value of zero. Therefore, to generate substructure descriptors for a given molecular data set, two things are needed: a substructure searching algorithm and a library of appropriate substructures.

Algorithms for substructure searching fall into two general categories. The first, atom-by-atom searching, is the easiest to implement on a digital computer because it simply matches the structure and substructure atoms and associated bonds one at a time using all possible combinations. However, for large structures and substructures the time required for a single search becomes prohibitive because the number of possible combinations increases factorially. The second category utilizes set reduction techniques to accomplish the substructure search, and factorial calculations are not involved. Although they are more complex than atom-by-atom searching techniques, algorithms implementing set reduction are very attractive because of their searching speed. Several different algorithms have been described which use set reduction (38-40). In the ADAPT system, a variation of the techniques described by Sussenguth (38) is used for generating substructure descriptors. The modifications allow for greater substructure specificity, a wider variety of substructure types, and numeric instead of binary searches. A discussion of the changes made in Sussenguth's algorithm has been reported (41) and will not be detailed here.

The problem of creating a substructure library is not as easy to solve as obtaining a good substructure searching algorithm. One approach to this problem involves the systematic combining of the basic atom and bond fragments into substructures. However, the final number of substructures generated in this manner would be totally unmanageable. The discrimination between usable and useless substructures would require some type of pattern recognition system, and this approach is not feasible.
A more workable approach to the problem is to study the data set of molecules under investigation and allow the chemist to decide on a collection of
substructures to be applied to the data set. The ADAPT system utilizes this second method to generate a substructure library. A set of substructure descriptors can now be generated. Two types of searches are possible. For a general search, a match is made if the indicated substructure is located anywhere in the molecule; all ring information is ignored. However, during a specific search, ring information is taken into consideration. Therefore, if the substructure is not specified to be in a ring, it cannot possibly be matched to a molecular fragment that is contained in a ring system.

The actual information contained in any one substructural descriptor depends highly upon the judgement of the person selecting the substructure library. In some applications, good descriptors can be obtained immediately because sufficient a priori knowledge exists. However, in other cases, a trial-and-error procedure may be warranted where a large number of possible substructures are generated and poor descriptors are eliminated by some prescreening criterion. In general, substructure descriptors serve a very important purpose in that they restore a portion of the structural information lost in the atom and bond fragmentation. Nevertheless, considerable structural information is still missing.

(c) Environment Descriptors. The description of structures using fragment and substructure descriptors indicates the components of a molecule. However, the manner in which these individual parts are connected is not described.
Environment descriptors take into account how different areas of a molecule fit together and provide a measure of the "environment" in which a single atom fragment finds itself. The environment descriptor describes the fragment's surroundings by including its first and second nearest neighbors and their bonds into a single parameter which reflects the atom and bond types connected to it. There may be more than one identical fragment in a molecule but they do not necessarily belong to the same functional group. For example, the fragment, -C-, is found once in both structures A and B below, but twice in structure C:
[Structural diagrams of compounds (A), (B), and (C)]
Obviously, the environment seen by this fragment would be different in each of the three cases. Of course, this difference is dependent upon the definition incorporated to calculate the environment descriptor. In the ADAPT system, the three forms most often used are: bond environment descriptors (BED), weighted environment descriptors (WED), and augmented environment descriptors (AED).
The procedure used to calculate these three parameters for a particular environment fragment is as follows:
(1) Assign arbitrary values to each type of atom and bond. The values already employed in the connection table will suffice.
(2) For "BED", sum the number of bonds connected to the fragment's first and second nearest neighbors.
(3) For "WED", sum the values assigned to each bond type instead of merely counting the bonds.
(4) For "AED", sum the product of the bond's assigned value and the assigned values for the two atoms which form the bond.

The BED, WED, and AED values for the fragment and structures given above are as follows: for structure A, BED = 5, WED = 6, AED = 11; for structure B, BED = 5, WED = 6, AED = 6; for structure C, BED = 12, WED = 15, AED = 17.

Since there may be more than one fragment present, the environment descriptor indicates the sum of all the environments for a given fragment. This feature makes them useful when used in conjunction with substructure descriptors. The substructure descriptors indicate the number of times a particular fragment is found in the molecule and the environment descriptors indicate the context in which the fragment is found.

The routine that generates the environment descriptors must have access to the file of molecular structures and to the atom centered fragment library which is constructed by the user. The actual calculation of the environment descriptors proceeds extremely rapidly since both the fragment location and necessary calculations are easily done by a computer.
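The four steps can be sketched as follows. This is one reading of the procedure, with several assumptions: the environment is taken to be the bonds from the fragment's central atom to its first neighbors plus the bonds from those neighbors to the second neighbors, hydrogens are omitted, and the atom and bond values are invented. Since the values ADAPT assigns are not given here, the sketch is not expected to reproduce the numbers quoted for structures A, B, and C.

```python
def environment_descriptors(atoms, bonds, centers, atom_value, bond_value):
    """Return (BED, WED, AED) summed over every occurrence of the fragment.

    atoms:      dict of atom index -> element symbol
    bonds:      dict of atom index -> {neighbour index: bond type}
    centers:    atom indices at which the fragment has been located
    atom_value: dict of element symbol -> assigned value   (step 1)
    bond_value: dict of bond type -> assigned value        (step 1)
    """
    bed = wed = aed = 0
    for c in centers:
        # Collect each bond once: centre to first neighbours,
        # then first neighbours to second neighbours.
        environment = set()
        for n1 in bonds[c]:
            environment.add(frozenset((c, n1)))
            for n2 in bonds[n1]:
                environment.add(frozenset((n1, n2)))
        for pair in environment:
            i, j = tuple(pair)
            value = bond_value[bonds[i][j]]
            bed += 1                                                       # step 2
            wed += value                                                   # step 3
            aed += value * atom_value[atoms[i]] * atom_value[atoms[j]]     # step 4
    return bed, wed, aed


# 2-Butanone without hydrogens: C1-C2(=O3)-C4-C5, fragment centred on C2.
atoms = {1: "C", 2: "C", 3: "O", 4: "C", 5: "C"}
bonds = {1: {2: 1}, 2: {1: 1, 3: 2, 4: 1}, 3: {2: 2}, 4: {2: 1, 5: 1}, 5: {4: 1}}

# Hypothetical code values: carbon = 1, oxygen = 2; single = 1, double = 2.
bed, wed, aed = environment_descriptors(
    atoms, bonds, centers=[2],
    atom_value={"C": 1, "O": 2},
    bond_value={1: 1, 2: 2},
)
```

For this example the four environment bonds give BED = 4, WED = 5, and AED = 7; swapping the assigned values for other atomic or bond properties changes WED and AED without touching the code.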
The concept of the environment is not limited to connectivities, but could take into account electron densities, bond distances, electronegativities, or other physical parameters. This can be done by replacing the values assigned in step one by the desired parameters. In this manner, more informative descriptors may be obtained. Use of the environment descriptors may reveal relations which are not particularly obvious. Note that both structures A and B have the same BED and WED values. These structures, which at first glance appear quite different, do indeed have these parameters in common. However, when one takes into account the type of atoms connected to these bonds the difference becomes apparent. Such relationships may or may not prove significant. Their ultimate utility depends on the type of environment measure, the molecule being coded, and the problem being attacked.

(d) Geometric Descriptors. Geometric descriptors are derived from the three-dimensional configuration as generated by MOLMEC. Presently, two basic types of geometric descriptors are calculated from the molecular structures. The three principal axes of the molecule form the basis for the first type of geometric descriptor. Since the orientation of the original molecule in space is
essentially random, the radii must be sorted in some manner. This is done by arbitrarily assigning X to the longest radius, Y to the second longest radius, and Z to the shortest radius. Once sorted, the three ratios, X/Y, X/Z, and Y/Z, are also calculated. Due to their small values, all of the radii are multiplied by some constant scaling factor to prevent loss of information during truncation. These six geometric parameters are then used as new descriptors and constitute the first set of geometric descriptors.

The van der Waals volume of a molecule is the other type of geometric descriptor generated in the ADAPT system. Before this calculation can be done, the bond distances and the van der Waals radii of the atoms must be known. The bond distances are easily obtained from the molecular modelling results. For the van der Waals radii, an article published by A. Bondi (42) was consulted. The volume occupied by an atom is taken as that of a sphere with radius equal to the van der Waals radius of the atom minus the volume of overlap with adjacent atoms. The overlap volumes are calculated from standard spherical geometry formulas. The actual volume is not found for two reasons: the assumption of sphere and spherical segments is not totally correct, and the radii used were selected as being the "best" values from a large collection of data using an empirical selection method. The total molecular volume for the molecule is taken as the sum of the contributions for each atom found as described above.
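The sphere-minus-overlap calculation can be sketched as follows. This is an illustration under stated assumptions, not the ADAPT code: it assumes each atom loses the spherical cap lying beyond the plane where its sphere crosses a bonded neighbor's sphere, it ignores regions shared by three or more spheres, and all function names are hypothetical. The conversion from cubic angstroms per molecule to cc per mole uses Avogadro's number.

```python
import math

AVOGADRO = 6.02214076e23  # molecules per mole


def cap_volume(radius, height):
    """Volume of a spherical cap of the given height on a sphere."""
    return math.pi * height ** 2 * (3.0 * radius - height) / 3.0


def atom_volume(radius, neighbors):
    """Sphere volume minus the caps cut off by bonded neighbors.

    neighbors: list of (neighbor_radius, bond_distance) pairs in angstroms.
    """
    volume = 4.0 / 3.0 * math.pi * radius ** 3
    for other_radius, distance in neighbors:
        # Distance from this atom's center to the plane where the spheres cross.
        plane = (distance ** 2 + radius ** 2 - other_radius ** 2) / (2.0 * distance)
        height = radius - plane
        if 0.0 < height < 2.0 * radius:  # handle partial overlap only
            volume -= cap_volume(radius, height)
    return volume


def molecular_volume(radii, bonds):
    """Sum the atomic contributions; result in cubic angstroms per molecule.

    radii: list of van der Waals radii; bonds: list of (i, j, distance).
    """
    neighbors = [[] for _ in radii]
    for i, j, distance in bonds:
        neighbors[i].append((radii[j], distance))
        neighbors[j].append((radii[i], distance))
    return sum(atom_volume(r, n) for r, n in zip(radii, neighbors))


def to_cc_per_mole(cubic_angstroms):
    """Convert cubic angstroms per molecule to cubic centimeters per mole."""
    return cubic_angstroms * 1.0e-24 * AVOGADRO


# Two unit spheres one angstrom apart (an artificial test case).
print(to_cc_per_mole(molecular_volume([1.0, 1.0], [(0, 1, 1.0)])))
```

Passing either standard or modelled bond distances in `bonds` mirrors the option described in the text.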
The volume contributions of attached hydrogens are also included in the calculation of the total volume. In order to make the routine more versatile, the option of either using standard bond distances or modelled bond distances is included. Since MOLMEC uses the standard bond distances to determine a low strain geometry, it is not surprising that for a well modelled data set, the molecular volumes calculated using the two different bond distances are very similar. However, discrepancies can arise when the molecule contains rings of five or fewer atoms which cause a large amount of bond strain. The volumes are initially calculated in units of cubic Angstroms per atom but are then converted to units of cc per mole. The molecular volume can then be used as another geometric descriptor.

Each geometric descriptor contains some information about the molecule. The radii and ratios describe the general shape of the molecule which may be very important in systems where receptor sites are involved. However, this is only a relative shape since the model obtained is for the molecule in a vacuum: in some environments, the molecule's shape will change, especially if long chains are present. On the other hand, the molecular volume is essentially constant regardless of how the molecule is bent. However, like any other descriptor, the actual value of any geometric descriptor depends upon the specific application in which it is used.

(5) Collator. The collator routine is used to select which of the
available descriptors will be included in the data set to be passed to other parts of ADAPT. The experimenter has complete flexibility in deciding which data set or subset to use and how to structure problems when they are to be passed to the prior feature selection algorithms or the discriminant development algorithms. This routine is used to select first one subset of the available descriptors to be used for discriminant development, and then on subsequent trials other subsets of descriptors. Thus, overall performance of the system can be evaluated with respect to which descriptors are being included in the analysis.

(6) Preprocessor. The preprocessor routine accepts the raw descriptors developed by the descriptor development routines and performs the desired preprocessing necessary for further processing. One example of such preprocessing is autoscaling, where each descriptor over a data set is altered so that the mean is zero and the standard deviation is unity. The statistics literature calls this procedure standardizing the variables.

(7) Prior Feature Selection. After a set of drugs has been formed into a labelled data set ready for presentation to the discriminant developer, it is desirable to submit it to feature selection if possible. One method for selecting the descriptors expected to be most useful has been the use of the well-known Fisher ratio (e.g., 21).
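As an illustration of ranking descriptors one at a time, the sketch below uses a common form of the Fisher ratio for two classes: the squared difference of the class means divided by the sum of the class variances. The exact form used in ADAPT is not given here, so this definition, the use of population variances, and the function names are assumptions.

```python
from statistics import mean, pvariance


def fisher_ratio(values_a, values_b):
    """Squared difference of class means over the sum of class variances."""
    spread = pvariance(values_a) + pvariance(values_b)
    return (mean(values_a) - mean(values_b)) ** 2 / spread


def rank_descriptors(patterns, labels):
    """Return descriptor indices ordered from highest to lowest Fisher ratio.

    patterns: list of descriptor lists; labels: +1 or -1 for each pattern.
    """
    ratios = []
    for k in range(len(patterns[0])):
        a = [p[k] for p, y in zip(patterns, labels) if y > 0]
        b = [p[k] for p, y in zip(patterns, labels) if y < 0]
        ratios.append((fisher_ratio(a, b), k))
    return [k for _, k in sorted(ratios, reverse=True)]


# Made-up data: descriptor 0 separates the classes, descriptor 1 does not.
patterns = [[1, 5], [2, 1], [3, 4], [7, 2], [8, 6], [9, 3]]
labels = [1, 1, 1, -1, -1, -1]
print(rank_descriptors(patterns, labels))  # [0, 1]
```

A descriptor that is constant within both classes gives a zero denominator and would need separate handling.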
A number of other statistically based methods suggest themselves, but they mostly require making the assumption that the best, i.e., most separating, descriptors identified one at a time will also be the best set of descriptors. This assumption is rarely valid. In the studies performed to date, we have usually tried to select subsets of descriptors in as wise a manner as we could devise; we have relied on being able to investigate a large enough number of subsets of descriptors to feel reasonably confident that we have found good descriptor sets. Feature selection is performed as an integral part of stepwise discriminant analysis such as that implemented in the BMD (43) package as BMD07M. This will be discussed later in the section on discriminant development and feedback feature selection.

(8) Discriminant Developer. The discriminant developer accepts the set of data generated by the previous sections of ADAPT and attempts to develop discriminant functions capable of correctly classifying the data. The development of such discriminants can proceed through the use of (a) error correction feedback learning machines, (b) iterative least squares development of linear discriminant functions, (c) other parametric and nonparametric routines. The error correction feedback training method has been used in the studies on barbiturates to be described in the following section of this article.
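A minimal sketch of an error correction feedback learning machine is given below. It assumes one common form of the correction, in which a misclassified pattern X with dot product s = W·X changes the weight vector to W + cX with c = -2s/(X·X), so that the new dot product is -s. Whether ADAPT uses exactly this increment is not stated here; the function names, the handling of s = 0, and the pass limit are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train(patterns, labels, initial=1.0, extra=1.0, max_passes=1000):
    """Train a weight vector by error correction feedback.

    patterns: list of descriptor lists; labels: +1 or -1 for each pattern.
    Each pattern is augmented with one extra component before training.
    Returns the weight vector, or None if training did not converge.
    """
    augmented = [list(p) + [extra] for p in patterns]
    weights = [initial] * len(augmented[0])
    for _ in range(max_passes):
        errors = 0
        for x, label in zip(augmented, labels):
            s = dot(weights, x)
            if s * label > 0:
                continue  # correctly classified, no feedback needed
            # Reflect the decision surface about the pattern so the new dot
            # product is -s; if s is exactly zero, step to the correct side.
            c = -2.0 * s / dot(x, x) if s != 0 else label / dot(x, x)
            weights = [w + c * xi for w, xi in zip(weights, x)]
            errors += 1
        if errors == 0:
            return weights
    return None


def classify(weights, pattern, extra=1.0):
    """Return +1 or -1 according to the sign of the dot product."""
    return 1 if dot(weights, list(pattern) + [extra]) > 0 else -1


# Made-up one-descriptor training set.
w = train([[2.0], [-2.0]], [1, -1], initial=-1.0)
print(w, classify(w, [2.0]), classify(w, [-2.0]))
```

Training stops on the first complete pass without errors, which corresponds to 100% recognition of the training set.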
The iterative least squares development method was developed several years ago in this laboratory (44) and has been
interfaced into ADAPT.

(9) Feedback Feature Selection. In many chemical applications of pattern recognition a set of data is coded using more descriptors than are necessary to correctly classify the members. However, the necessary and unnecessary descriptors cannot usually be identified a priori. (When they can, this is obviously the method of choice.) Therefore, feature selection must often be approached from a systems viewpoint, whereby the results of classification are used to try to identify the minimal set of necessary descriptors. This approach is shown by the feedback loop in Figure 2.

An early approach to feedback feature selection was weight-sign feature selection. Here, two weight vectors, initialized with each component equal to +1 or -1, respectively, were developed using error correction feedback training with identical training sets. A component by component comparison was made between the two trained weight vectors, and those descriptors corresponding to weight vector components with sign disagreements were discarded. This method was shown to be effective for some classes of data in several studies.

The variance feature selection method, described earlier, has been incorporated into ADAPT and has been used effectively on several types of data. The variance method allows rapid extraction of features responsible for linear separability. It is much superior to the weight-sign method in terms of speed and reliability.
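The component by component comparison used in weight-sign feature selection can be sketched as below. The two weight vectors are assumed to have been trained elsewhere from the +1 and -1 initializations, and each is assumed to carry one trailing extra component that is not a descriptor. The function name and the treatment of a zero weight as a disagreement are hypothetical choices.

```python
def weight_sign_select(w_plus, w_minus):
    """Return indices of descriptors whose two trained weights agree in sign.

    w_plus:  weights trained from an all +1 start
    w_minus: weights trained from an all -1 start
    The last component of each vector is the extra (non-descriptor) term.
    """
    n_descriptors = len(w_plus) - 1
    return [k for k in range(n_descriptors) if w_plus[k] * w_minus[k] > 0]


# Made-up trained weights: descriptor 1 disagrees in sign and is discarded.
print(weight_sign_select([0.5, -1.2, 0.3, 2.0], [0.7, 0.4, 0.1, -1.0]))  # [0, 2]
```

In practice the retained descriptors would be used to retrain, and the comparison repeated until no further descriptors are discarded or separability is lost.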
Barbiturate Study

The set of compounds used in the present study consists of 160 5,5'-substituted barbiturates selected from a standard reference (29). These compounds range in molecular weight from 172 to 276 and have duration times ranging from 10 minutes to 600 minutes. The method of administration was either intraperitoneal or subcutaneous, using mice, rats, or rabbits as test animals.

The compounds were grouped into classes according to the duration of depressant effect. These classes were formed by dividing the duration time expressed in minutes by ten. The resulting class designation was rounded up if the remainder was five or greater, and down otherwise. Thus a compound whose duration time was 227 minutes would be placed in class 23, whereas a compound having a duration time of 223 minutes would be placed into class 22. Compounds with a duration greater than 650 minutes were placed into class 65. This resulted in a total of 65 different classes which are distributed as shown in Fig. 3.

Three types of descriptors were employed for these studies: numeric fragment descriptors, substructural descriptors, and environmental descriptors. The descriptors were generated using the automated descriptor packages described previously. A list of the initial set of descriptors used is given in Table I. Each descriptor is contained in a minimum of 20% of the structures. In no case
Figure 2. Basic pattern recognition system for studies of structure-activity relationships. [Block diagram; blocks shown: Numerical Descriptors, Preprocessing, Prior Feature Selection, Discriminant Function Development, Analysis of Results, and a Feedback Feature Selection loop.]
Figure 3. Histogram of barbiturate duration times for the drugs in the data set. [Horizontal axis: duration time (min).]
TABLE I. Molecular Structure Descriptors for the Barbiturate Data Set

ATOM AND BOND DESCRIPTORS
  1  Number of atoms             2  Number of bonds
  3  Number of Carbon atoms      4  Number of Nitrogen atoms
  5  Number of Oxygen atoms      6  Number of single bonds
  7  Number of double bonds      8  Length (a)

ENVIRONMENT DESCRIPTORS (b)
  Descriptors   Atom Centered Fragment   General    Cyclic
   9 - 11       CH3-                     1, 2, 3
  12 - 14       -CH2-                    1, 2, 3
  15 - 17       >CH-                     1, 2, 3
  18 - 23       >C<                      1, 2, 3    1, 2, 3
  24 - 26       O=                       1, 2, 3
  27 - 29       -HC=                     1, 2, 3
  30 - 35       >C=                      1, 2, 3    1, 2, 3

SUBSTRUCTURAL DESCRIPTORS
  36  C H [subscripts not recoverable]
  37  -CH(CH3)CH2-
  38  CH3-
  39  C H [subscripts not recoverable]
  40  -CH2CH2-
  41  CH3CH2CH2-
  42  -CH2-
  43  -HC=

(a) Length = 4*(Number of single bonds) + 2*(Number of double bonds)
(b) 1 = BED, 2 = WED, 3 = AED
does any one descriptor, or any binary combination of descriptors, contain sufficient information to successfully classify the data. Thus the active data set consists of 160 compounds each coded with 43 descriptors.

Preprocessing of the raw data prior to classification consisted of autoscaling so that each descriptor had an average of zero and a standard deviation of 127. This allowed the data to be truncated to integer values with a negligible loss of precision. (Loss of precision is known to be negligible as recalculation after truncation yielded a standard deviation of 127 and a mean of 0 ± 0.17.) Net retention of information was assured by testing the predictive ability for each descriptor before and after preprocessing. A value of 250 was used for X_{n+1} because it provided fast training and high predictive ability.

Since the data were collected from a series of studies on different animals, at different laboratories, it is not unreasonable to expect the classifications to differ. It was therefore felt that an error range would take into account the variations due to different classification methods. Thus, any one classifier will develop a rule which answers the question, "Is the duration time less than x minutes?", where there is a deadzone of several minutes around this level.
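The autoscaling and truncation step described above can be sketched as follows. This is only an illustration: the use of the population standard deviation, truncation toward zero, and the function name are assumptions, and a descriptor that is constant over the data set would need separate handling.

```python
from statistics import mean, pstdev


def autoscale_to_int(column, target_sd=127):
    """Scale one descriptor to mean 0 and the target standard deviation,
    then truncate each value to an integer."""
    m, s = mean(column), pstdev(column)
    return [int((x - m) / s * target_sd) for x in column]


# Made-up descriptor values for five compounds.
scaled = autoscale_to_int([1, 2, 3, 4, 5])
print(scaled)                        # [-179, -89, 0, 89, 179]
print(mean(scaled), pstdev(scaled))  # recheck mean and spread after truncation
```

Recomputing the mean and standard deviation of the truncated values, as in the last line, is the kind of check the text reports for the barbiturate data.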
Thus, to test for discrimination ability at a threshold level of 100 minutes using a thirty minute deadzone, all members from classes 1 through 10 would constitute one category, and all members from 14 through 65 would constitute the other category.

The linear learning machine was used to develop discriminant functions which bisect the data with as many different thresholds as possible, obtaining 100% recognition ability for each range. Attempts at such discrimination were accomplished using first a fifty, and later a thirty, minute error range.

To generate a preliminary estimate of the clustering and self consistency of the data the following experiment was done. Five training set/prediction sets were chosen with seven compounds in each prediction set and the remaining compounds in each training set. The overall data set is divisible into halves by 59 thresholds using 50 minute error ranges. All five training sets were used to develop independent discriminants at each of the 59 thresholds. These discriminants were then used to predict the seven unknowns in the respective prediction set. The class assignments were made by examining the sequence of responses produced by the 59 predictions; if only one change from answers of "greater than" to "less than" occurred, this point was taken as the predicted duration time. If there were several changes in predicted response, then the predicted duration time was taken as 30 minutes greater than the shortest duration time indicated by the first change in response.
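The class assignment, the deadzone split, and the decoding of a sequence of threshold responses can be sketched together. Several details are assumptions rather than statements from the text: durations are whole minutes, the deadzone lies above the threshold (as in the 100 minute example, where classes 11 through 13 are left out), the "point" of a change is taken to be the first threshold answering "less than", and a sequence with no change returns no prediction. All function names are hypothetical.

```python
def duration_class(minutes, top_class=65):
    """Class = duration / 10, rounded half up, capped at the top class."""
    return min((minutes + 5) // 10, top_class)


def category(cls, threshold, deadzone):
    """Return -1 at or below the threshold, +1 above the deadzone, and
    None inside the deadzone. threshold and deadzone are in minutes."""
    if cls <= threshold // 10:
        return -1
    if cls > (threshold + deadzone) // 10:
        return 1
    return None


def predict_duration(thresholds, greater_than, offset=30):
    """Decode responses from classifiers ordered by increasing threshold.

    greater_than[k] is True when classifier k answers "greater than".
    """
    changes = [k for k in range(1, len(greater_than))
               if greater_than[k] != greater_than[k - 1]]
    if not changes:
        return None  # no change in response, so no estimate (assumption)
    first = thresholds[changes[0]]
    return first if len(changes) == 1 else first + offset


print(duration_class(227), duration_class(223))                     # 23 22
print(category(10, 100, 30), category(12, 100, 30), category(14, 100, 30))
print(predict_duration([50, 100, 150, 200], [True, True, False, False]))  # 150
print(predict_duration([50, 100, 150, 200], [True, False, True, False]))  # 130
```

With a threshold of 240 minutes and a 30 minute deadzone, `category` reproduces the split into classes 1 through 24 and 28 through 65.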
When this procedure was used, 19 of the 35 unknowns were classified as having duration times within 20 minutes of the actual value and 31 were classified as having duration times within 50 minutes of the actual value. The duration times
TABLE II. Final Sets of Molecular Structure Descriptors Supporting Linear Discriminant Functions at Thresholds I and II

ATOM AND BOND DESCRIPTORS (as listed across the two thresholds)
  Number of Oxygen atoms; Number of Oxygen atoms; Number of double bonds; Number of single bonds
  [the assignment of the two bond counts to Threshold I or II is not recoverable here]

THRESHOLD I
  Substructural descriptors:     CH3-, -CH2-, -CH2CH2-, CH3CH2-
  Environment descriptors (a):   CH3-(G,2), >CH-(G,1), -HC=(G,2), >C=(G,3) (C,1)
  Average predictive ability (b): 93.8%

THRESHOLD II
  Substructural descriptors:     CH3-, -CH2CH2-, -CH2CH(CH3)2
  Environment descriptors (a):   -HC-(G,1), -HC=(G,1), >C=(G,3), >C<(C,3)
  Average predictive ability (b): 94.9%

(a) G = General search, C = Cyclic search, 1 = BED, 2 = WED, 3 = AED
(b) Predictive ability measured using leave one out procedure
of only four compounds were in error by more than the 50 minute error range used for each threshold. Thus this preliminary experiment showed that a set of linear classifiers working in concert could predict the duration times of the compounds in the data set reasonably accurately. Similar results were obtained for the 61 possible thresholds developed using a 30 minute error range.

In order to gain a better insight into these relationships two thresholds were subjected to exhaustive feature selection. The threshold I data includes classes 1 through 10 and 14 through 65. The threshold II data includes classes 1 through 24 and 28 through
65. The results of the selection process and a predictive ability estimation are reported in Table II.

Through application of the variance feature selection method, a set of features responsible for the separability of the data was found. Removing any of these descriptors results in the loss of linear separability. Therefore, the descriptors selected constitute a minimum set capable of supporting the relationship within the data. The predictive ability, estimated by the leave one out procedure (45), indicated that these features were capable of providing accurate information concerning the duration of barbiturate activity. Thus, it is clear that a relationship is present which is readily identified using the ADAPT system.

Further investigations using this data set have uncovered several interesting correlations. Details of the experimental results are reported elsewhere (46). What has been sought here is a clear demonstration of the utility of ADAPT in elucidating relations within a large body of data. Note that feature selection of the two specific thresholds was easily accomplished as was initial development of discriminants for 61 different classes. Clearly such studies would be inconvenient without the degree of organization provided by automation of the descriptive, storage, and pattern recognition techniques.
The ADAPT system has consistently shown high utility in several areas and promises to continue to aid in the application of pattern recognition to problems in chemistry.
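The leave one out estimate of predictive ability cited for Table II can be sketched generically: each compound is withheld in turn, a classifier is trained on the rest, and the withheld compound is predicted. The nearest-centroid classifier below is only a stand-in chosen to keep the example short; it is not the linear learning machine used in the study, and all names are hypothetical.

```python
def leave_one_out(patterns, labels, train, predict):
    """Fraction of patterns predicted correctly when each is withheld in turn."""
    correct = 0
    for k in range(len(patterns)):
        rest_x = patterns[:k] + patterns[k + 1:]
        rest_y = labels[:k] + labels[k + 1:]
        model = train(rest_x, rest_y)
        if predict(model, patterns[k]) == labels[k]:
            correct += 1
    return correct / len(patterns)


def train_centroids(patterns, labels):
    """Stand-in classifier: store the mean pattern of each class."""
    groups = {}
    for x, y in zip(patterns, labels):
        groups.setdefault(y, []).append(x)
    return {y: [sum(col) / len(xs) for col in zip(*xs)]
            for y, xs in groups.items()}


def predict_centroid(model, x):
    """Assign the class whose mean pattern is closest to x."""
    return min(model, key=lambda y: sum((a - b) ** 2
                                        for a, b in zip(model[y], x)))


# Made-up, well separated one-descriptor data.
patterns = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
labels = [-1, -1, -1, 1, 1, 1]
print(leave_one_out(patterns, labels, train_centroids, predict_centroid))  # 1.0
```

Any pair of training and prediction functions with the same call signatures can be substituted for the stand-in classifier.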
Literature Cited

1. Minsky, Marvin, Proc. IEEE, 49, 8 (1961).
2. Solomonoff, R.J., Proc. IEEE, 54, 1687 (1966).
3. Rosen, C.A., Science, 156, 38 (1967).
4. Nagy, George, Proc. IEEE, 56, 836 (1968).
5. Levine, M.D., Proc. IEEE, 57, 1391 (1969).
6. Nilsson, N.J., Learning Machines, McGraw-Hill Book Co., New York, 1965.
7. Tou, J.T. and Gonzalez, R.C., Pattern Recognition Principles, Addison-Wesley Publishing Co., Reading, Mass., 1974.
8. Kowalski, B.R. and Bender, C.F., Jour. Amer. Chem. Soc., 94, 5632 (1972); 95, 686 (1973).
9. Isenhour, T.L., Kowalski, B.R., Jurs, P.C., Crit. Rev. Anal. Chem., 4, 1 (1974).
10. Kowalski, B.R., "Pattern Recognition in Chemical Research," in Computers in Chemical and Biochemical Research, Vol. 2, C.E. Klopfenstein and C.L. Wilkins, Eds., Academic Press, New York, 1974.
11. Jurs, P.C. and Isenhour, T.L., Chemical Applications of Pattern Recognition, Wiley-Interscience, New York, 1975.
12. Kowalski, B.R. and Bender, C.F., Naturwissenschaften, 62, 10 (1975).
13. Kowalski, B.R., Anal. Chem., 47, 1152A (1975).
14. Jurs, P.C., Proceedings of the Workshop on Chemical Applications of Pattern Recognition, Washington, D.C., May 1975.
15. Hansch, C., Unger, S.H., Forsythe, A.B., Jour. Med. Chem., 16, 1217 (1973).
16. Hiller, S.A., et al., Comp. Biomed. Res., 6, 411 (1973).
17. Ting, K.-L.H., et al., Science, 180, 417 (1973).
18. Perrin, C.L., Science, 183, 551 (1974).
19. Clerc, J.T., Naegeli, P., Seibl, J., Chimia, 27, 639 (1973).
20. Adamson, G.W. and Bush, J.A., Nature, 248, 406 (1974).
21. Chu, K.C., Anal. Chem., 46, 1181 (1974).
22. Kowalski, B.R. and Bender, C.F., Jour. Amer. Chem. Soc., 96, 916 (1974).
23. Unger, S.H., Cancer Chem. Rpts., Part 2, 4(4), 45 (1974).
24. Chu, K.C., et al., Jour. Med. Chem., 18, 639 (1975).
25. Craig, P.N. and Waite, J.H., Analysis and Trial Application of Correlation Methodologies for Predicting Toxicity of Organic Chemicals, EPA Office of Toxic Substances, 1976.
26. Stuper, A.J. and Jurs, P.C., Jour. Chem. Infor. Comp. Sci., 16, 99 (1976).
27. Brugger, W.E. and Jurs, P.C., Anal. Chem., 47, 781 (1975).
28. Usdin, E. and Efron, D.H., Psychotropic Drugs and Related Compounds, 2nd ed., DHEW Publication No. (HSM) 72-9074, 1972.
29. Doran, W.J., Medicinal Chemistry, Vol. IV, John Wiley and Sons, New York, 1959.
30. Amoore, J.E., Molecular Basis of Odor, Thomas, Springfield, Ill., 1970.
31. Engler, E.M., Andose, J.D., Schleyer, P. von R., Jour. Amer. Chem. Soc., 95, 8005 (1973).
32. Williams, J.E., Stang, P.J., Schleyer, P. von R., Ann. Rev. Phys. Chem., 19, 531 (1968).
33. Wipke, W.T., Dyott, T.M., Verbalis, J.G., Abstract, 161st American Chemical Society National Meeting, Los Angeles, CA, March 1971.
34. Wipke, W.T., Gund, P., Verbalis, J.G., Dyott, T.M., Abstract, 162nd American Chemical Society National Meeting, Washington, DC, September 1971.
35. Wipke, W.T., Gund, P., Dyott, T.M., Verbalis, J.G., unpublished manuscript.
36. Buffa, E.S. and Taubert, W.H., "Production-Inventory Systems, Planning and Control," Rev. Ed., R.D. Irwin, Inc., Homewood, Ill., 1972.
37. Brugger, W.E., Stuper, A.J., Jurs, P.C., Jour. Chem. Infor. Comp. Sci., 16, 105 (1976).
38. Sussenguth, E.H., Jr., Jour. Chem. Doc., 5, 36 (1965).
39. Ming, T.-K. and Tauber, S.J., Jour. Chem. Doc., 11, 47 (1971).
40. Figueras, J., Jour. Chem. Doc., 12, 237 (1972).
41. Zander, G.S. and Jurs, P.C., Anal. Chem., 47, 1562 (1975).
42. Bondi, A., Jour. Phys. Chem., 68, 441 (1964).
43. Dixon, W.J., Ed., BMD-Biomedical Computer Programs, 3rd Ed., Univ. of Calif. Press, Berkeley, CA, 1973.
44. Pietrantonio, L. and Jurs, P.C., Pattern Recog., 4, 391 (1972).
45. Lachenbruch, P.A. and Mickey, M.R., Technometrics, 10, 1 (1968).
46. Stuper, A.J. and Jurs, P.C., submitted for publication.