Computer-Assisted Drug Design - ACS Publications - American

The number and r e l a t i v e f r e q u e n c y f o r 0, 1,. 2 .... 0. 0.2. 7. 1.1. 5. 0.9. 5. 0.3. 0. 0.4. 7. 0.0. 9. 0 . 0. 1. 3. 2. 0. 219. 0. 0.2...
0 downloads 0 Views 1MB Size
5 Chance Factors in QSAR Studies JOHN G. TOPLISS and ROBERT P. EDWARDS

Downloaded via TUFTS UNIV on July 12, 2018 at 10:44:41 (UTC). See https://pubs.acs.org/sharingguidelines for options on how to legitimately share published articles.

Schering-Plough Research Division, Bloomfield, NJ 07003

­

During the past decade quantitative structure -activity relationships (QSAR) have emerged as an im­ portant tool in drug design ( 1 ) . The role of the computer has been essential in allowing practical use of s t a t i s t i c a l methods such as multiple regression analysis (2). However, there is a risk of arriving at fortuitous correlations when too many variables are screened relative to the number of available ob­ servations. In this regard a critical distinction must be made between the number of variables screened for possible correlation and the number which actu­ ally appear in the regression equation. By way of i l l u s t r a t i o n take a case where there are ten observations and corresponding activity values A to Α and eight variables V to V are con­ sidered. Multiple regression analysis yields equa­ tion (1) where r = 0.85, F = 19.8 (p < 0.005) and 1

10

1

8

2

2,7

A = aV1 + bV + c 2

(1)

V and V are each significant at the level p < 0.01. However, the s t a t i s t i c a l test for significance of the equation only relates to the variables included in the actual correlation equation and does not take account of the fact that eight variables were screened for possible correlation. It is clear that the greater the number of variables which are screened for possible inclusion the greater w i l l be the prob­ ability that a chance correlation w i l l occur. The p value of < 0.005 for the equation shown overstates to some extent the probability that the relationship given by the equation is real rather than occurring by chance. The question is how misleading is this p value? The studies to be described w i l l attempt to provide some answer to this question. 1

2

0-8412-0521-3/79/47-112-131$05.00/0 © 1979 American Chemical Society

Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.

132

COMPUTER-ASSISTED

DRUG

DESIGN

E a r l i e r work on t h i s problem was r e p o r t e d some y e a r s ago (^). However, t h e s e s t u d i e s were t o o l i m i t ­ ed i n scope t o p r o v i d e more t h a n a d e m o n s t r a t i o n t h a t i n d e e d t h e r e was a problem from chance c o r r e l a t i o n s under c e r t a i n c o n d i t i o n s and t o p r o v i d e v e r y g e n e r a l guidelines. A c c o r d i n g l y , i n view o f t h e i n t e r e s t i n t h i s problem f o l l o w i n g p u b l i c a t i o n o f t h e e a r l i e r study, p l a n s were made t o g e n e r a t e more e x t e n s i v e and d e t a i l e d data. Method The b a s i c approach was t o s e t up a l a r g e number and wide range o f s i m u l a t e d c o r r e l a t i o n s t u d i e s u s i n g random numbers f o r b o t h o b s e r v a t i o n s and s c r e e n e d v a r i a b l e s and a s s e s s t h e i n c i d e n c e and degree o f any resulting correlations. Computer System. The program used was d e v e l o p e d by m o d i f y i n g an e x i s t i n g FORTRAN M u l t i p l e R e g r e s s i o n A n a l y s i s Program (BMD 02R), t h e essence o f t h e modi­ f i c a t i o n s b e i n g t o g e n e r a t e t h e d a t a t o be a n a l y z e d u s i n g a pseudo-random number g e n e r a t o r and t o accumu­ l a t e s t a t i s t i c s from each o f t h e i n d i v i d u a l simula­ t i o n s and produce output r e p o r t s , f r e q u e n c y d i s t r i ­ b u t i o n s and summary t a b l e s . The program was r u n on an IBM 560/158 computer p e r f o r m i n g under 0S/VS2. Random Number G e n e r a t o r . The pseudo-random num­ b e r s were g e n e r a t e d u s i n g t h e IBM RANDU s u b r o u t i n e ( k ) which employs t h e M u l t i p l i c a t i v e C o n g r u e n t i a l Method and r e t u r n s a v e c t o r o f random numbers, each u n i f o r m l y d i s t r i b u t e d i n t h e range 0 t o 1. The random numbers were s c a l e d by a f a c t o r o f 1000. S t a r t i n g from a u s e r - d e f i n e d seed v a l u e , RANDU gen­ e r a t e s a stream o f pseudo-random numbers and w i l l produce 2 terms b e f o r e r e p e a t i n g . By comparison of t h e o b s e r v e d as a g a i n s t t h e e x p e c t e d f r e q u e n c y d i s t r i b u t i o n ( u n i f o r m ) o f numbers, RANDU i s an ade­ quate pseudo-random number g e n e r a t o r . To b e t t e r approximate pure randomness each computer r u n had a d i f f e r e n t e x t e r n a l l y imposed i n i t i a l seed v a l u e . I n a d d i t i o n o n l y e v e r y f o u r t h pseudo-random number was sampled from t h e stream. 1

2 9

Computer Program Methodology. The f i r s t s t e p i n each s i m u l a t i o n i s t o g e n e r a t e t h e d a t a m a t r i x w i t h r rows (one f o r each o b s e r v a t i o n ) and η + 1 columns (n independent v a r i a b l e s and one dependent). The v a l u e s o f r and η a r e s p e c i f i e d when t h e program i s

Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.

5.

TOPLiss A N D

EDWARDS

QSAR Studies

133

run. The d a t a are t h e n a n a l y z e d u s i n g a stepwise M u l t i p l e R e g r e s s i o n Technique ( 2 ) . At each stage o f t h i s p r o c e d u r e the independent v a r i a b l e most h i g h l y c o r r e l a t e d w i t h the dependent v a r i a b l e i s brought i n t o the model p r o v i d e d t h a t the p a r t i a l F - t e s t v a l u e f o r t h a t v a r i a b l e i s s i g n i f i c a n t at the 10$ l e v e l . Each independent v a r i a b l e i n the r e g r e s s i o n model i s t h e n re-examined t o see i f i t i s s t i l l making a s i g n i f i c a n t c o n t r i b u t i o n . Any v a r i a b l e whose p a r t i a l F - t e s t v a l u e i s not s i g n i f i c a n t at the 10$ l e v e l i s dropped from t h e model. T h i s p r o c e s s c o n t i n u e s u n t i l no more v a r i a b l e s e n t e r the model and none are r e j e c t ed. From each s i m u l a t i o n , the number o f v a r i a b l e s e n t e r i n g the r e g r e s s i o n (zero t o n) i s s t o r e d as w e l l as the r s t a t i s t i c i n the case o f at l e a s t one v a r i a b l e e n t e r i n g the r e g r e s s i o n a n a l y s i s . T h i s whole p r o c e s s i s r e p e a t e d f o r each s i m u l a t i o n , the number o f s i m u l a t i o n s t o be c a r r i e d out b e i n g s p e c i f i e d when the program i s run. The r e s u l t s o f a l l the simulat i o n s are t h e n c o n s o l i d a t e d t o g i v e the f o l l o w i n g statistics : (1) The number and r e l a t i v e f r e q u e n c y f o r 0, 1, 2, 3 e t c . v a r i a b l e s e n t e r i n g the r e g r e s s i o n a n a l y s i s . (2) The number and r e l a t i v e f r e q u e n c y f o r at l e a s t one v a r i a b l e e n t e r i n g the r e g r e s s i o n a n a l y s i s . (3) F o r each s i g n i f i c a n t c o r r e l a t i o n o b s e r v e d (a) mean r s t a t i s t i c (b) median r s t a t i s t i c ( c ) minimum r s t a t i s t i c (d) maximum r statistic. (4) The number and r e l a t i v e f r e q u e n c y o f obs e r v e d c o r r e l a t i o n s by r interval: 1.0-0.9;0.9-0.8; 0.8-0.7 etc. 2

2

2

2

2

2

Scope o f Study. The number o f v a r i a b l e s s c r e e n e d was 3-15> 20, 30, ho and the number o f obs e r v a t i o n s was 5-30, 50, 100, 200, kOO, 600. Altog e t h e r some 300 c o m b i n a t i o n s o f v a r i a b l e s and observ a t i o n s were examined. The number of runs f o r each c o m b i n a t i o n ranged from 120-2190. I t should be noted t h a t the study was o r i g i n a l l y d e s i g n e d f o r f a r fewer combinations of v a r i a b l e s and o b s e r v a t i o n s and runs per combination. However, due t o the unexpected a v a i l a b i l i t y , f o r a s h o r t p e r i o d , of a computer which was b e i n g phased out, the scope o f the study was g r e a t l y expanded. Results Due to space l i m i t a t i o n s i t i s not f e a s i b l e t o p r e s e n t i n t a b u l a r form a l l o f the d a t a o b t a i n e d i n

Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.

134

COMPUTER-ASSISTED

DRUG

DESIGN

t h e study. However, T a b l e I p r o v i d e s an i l l u s t r a t i v e sample. The columns i n t h e t a b l e g i v e t h e number o f independent v a r i a b l e s , t h e number o f o b s e r v a t i o n s , the number o f r u n s , t h e f r e q u e n c y o f a chance c o r r e l a t i o n , t h e average number o f v a r i a b l e s e n t e r e d ( f o r those runs f o r which a t l e a s t one v a r i a b l e i s ent e r e d ) , t h e maximum, minimum and mean r v a l u e s , and the f r e q u e n c y o f those runs where r ^ 0 . 5 and ^ 0 . 8 . F o r t h e s i t u a t i o n w i t h t e n independent v a r i a b l e s s c r e e n e d and t e n o b s e r v a t i o n s , some t w o - t h i r d s o f t h e runs gave a c o r r e l a t i o n and f o r these an average o f 2.20 v a r i a b l e s e n t e r e d . R ranged from 0 . 2 8 t o 1.00 w i t h a mean v a l u e o f 0 . 6 k . Almost h a l f o f t h e t o t a l runs made gave c o r r e l a t i o n s w i t h r v a l u e s o f 0 . 5 o r h i g h e r w h i l e about one q u a r t e r o f t h e t o t a l runs made gave c o r r e l a t i o n s w i t h r ^ v a l u e s o f 0 . 8 o r h i g h e r . As t h e number o f o b s e r v a t i o n s i n c r e a s e d t h e r e was a downward t r e n d i n r v a l u e s so t h a t f o r 15 observat i o n s o n l y h^o o f t h e t o t a l runs gave c o r r e l a t i o n s w i t h r ^ . 0 . 8 and f o r 30 o b s e r v a t i o n s no c o r r e l a t i o n s w i t h r ^ 0 . 8 were found. F o r 20 v a r i a b l e s s c r e e n e d and o n l y 15 observat i o n s , p r a c t i c a l l y a l l o f t h e runs r e s u l t e d i n some l e v e l o f chance c o r r e l a t i o n w i t h about kklo o f t h e t o t a l runs r e s u l t i n g i n a c o r r e l a t i o n w i t h r ^ 0 . 8 . As t h e number o f o b s e r v a t i o n s i n c r e a s e d t o 30 the number o f runs y i e l d i n g c o r r e l a t i o n s w i t h r ^ 0 . 8 d e c l i n e d t o about 1#. P-values f o r t h e observed c o r r e l a t i o n s ranged from a minimum v a l u e o f 0.10 up t o 0 . 0 0 . A t t h i s p o i n t i t must be noted t h a t the degree o f chance c o r r e l a t i o n found i n t h i s study was subs t a n t i a l l y l e s s t h a n t h a t r e p o r t e d i n an e a r l i e r study (jj). Mean r v a l u e s averaged about o n e - h a l f those r e p o r t e d i n t h e e a r l i e r study, r a n g i n g from t h r e e - q u a r t e r s down t o o n e - q u a r t e r . The cause o f t h i s d i s c r e p a n c y c o u l d not be determined s i n c e t h e b a s i c r e c o r d s from t h e e a r l i e r study were not retained. However, i t must be c o n c l u d e d t h a t t h e e a r l i e r study was i n e r r o r . The p r e s e n t study was v e r y c a r e f u l l y checked and s e l e c t e d combinations setup and performed manually showed s i m i l a r r e s u l t s . From d a t a g e n e r a t e d i n t h e p r e s e n t study i t i s now p o s s i b l e t o answer t h e q u e s t i o n r a i s e d e a r l i e r c o n c e r n i n g the p - v a l u e o f e q u a t i o n 1. With t e n obs e r v a t i o n s and e i g h t s c r e e n e d v a r i a b l e s t h e expected f r e q u e n c y o f chance c o r r e l a t i o n s comes out t o be about 12#. Thus, t h e v a l i d i t y o f t h e c o r r e l a t i o n e q u a t i o n i s f a r l e s s c e r t a i n t h a n the r e p o r t e d p - v a l u e indicates. 2

2

2

2

2

2

2

2

2

2

Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.

Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.

ko

40

1.26 1.15

0.26 0.27 0.28

2190 2190 2190 120 398 2190 2190 120 398 398 2190 120 120 398 2190 2190 120 120 2190 1991 1991 1991 199 199

5 10 20 5 7 10 20 10 12 15 30 10 15 17 20 30 15 20 30 50 50 100 50 100

3 3 3 5 5 5 5 10 10 10 10 15 15 15 15 15 20 20 20 20 30 30 l . k l

o.ko

0.67 0.71 0.71 0.74 0.88 0.87 0.86 0.85 0.85 0.94 0.97 0.95 0.95 0.99 0.99 1. 00 1.00

O.kl

5.00 3.53 3.53 3.41 5.59 5.17 7. ik 7. ok

2.48

3.33 2.77 2.6l

k.k2

1.84

I.30 2.20 2.09 1.92

0.39

O.kk

1.87 1.5*

1.1k

AV. NO. VAR. ENTERED

FREQ. CHANCE CORR.

NO. RUNS

NO. OBS.

0.73 0.76 0.52 0.90 0. ^7

0.9k

1.00 1.00 0.98 0.97 0.86 1. 00 0.98

0.7k

1. 00 0.95 0.63 1. 00 1.00 0.99 0.79 1.00 1.00 0.98

R

2

02 0.03

0.0k

0.

0. Ok

0.65 0.30 0. Ik 0.6l 0.45 O.3O 0. Ik 0.28 0.23 0.18 0.08 0.26 0.19 0.l6 0.13 0. 08 0.17 0.13 0.08 0. Ok

MIN R

2

0.37 0.18 0.49 0.24

0.24

0.23 0.79 O.60 0.53 0.45 0.30 0.73 Ο.56 0.39

0.46

0.57

0.64

0.83 0.47 0.24 O.80 0.73 0. $k 0.28

MEAN R

u s i n g Random Numbers

MAX

Table 1 Correlations

). IND. VAR.

Data from S i m u l a t e d 2

0.34 0.11 0.76 0.58 0.29 0.03 0.19 0. 00 0.44 0.01

0.46

0.28 0.03 0.74 0.56

0.46 0.41

0.26 0.09 0.01 0.44 0.34 0.21 0.03

0.15 0. 01 0. 00 0.28 0.15 0. o4 0.00 0.23 0.14 0.04 0. 00 0. 54 0.28 0.14 0.06 0.00 0.44 0.17 0. 01 0.00 0. 00 0.00 0. 02 0. 00

2

FREQ. R > 0.8 0.5

136

COMPUTER-ASSISTED DRUG

DESIGN

The r e s u l t s a r e f u r t h e r i l l u s t r a t e d i n F i g u r e s 1-7· The r e l a t i o n s h i p between r and t h e number o f v a r i a b l e s s c r e e n e d f o r v a r y i n g numbers o f observa­ tions i s i l l u s t r a t e d i n Figure 1· I t can be seen t h a t f o r a c o n s t a n t number o f o b s e r v a t i o n s r i n ­ c r e a s e s as t h e number o f v a r i a b l e s s c r e e n e d i n c r e a s e s . A l s o , f o r a g i v e n number o f v a r i a b l e s screened, r i n c r e a s e s as t h e number o f o b s e r v a t i o n s d e c r e a s e s . In F i g u r e 2 t h e r e l a t i o n s h i p between the number o f o b s e r v a t i o n s and t h e p r o b a b i l i t y o f a chance c o r r e ­ l a t i o n w i t h t e n screened v a r i a b l e s f o r v a r i o u s r v a l u e s i s shown. From the graph c o r r e s p o n d i n g t o r ^ 0 , 8 i t may be seen t h a t the p r o b a b i l i t y o f en­ c o u n t e r i n g a chance c o r r e l a t i o n a t t h i s l e v e l i s about 22$ f o r t e n o b s e r v a t i o n s , r e a c h i n g zero f o r 23 observations. The graph shown i n F i g u r e 3 g i v e s t h e r e l a t i o n ­ s h i p between t h e number o f o b s e r v a t i o n s and t h e prob­ a b i l i t y o f a chance c o r r e l a t i o n w i t h r ^ 0 . 8 f o r f i v e , t e n and f i f t e e n v a r i a b l e s . Thus, f o r f i v e screened v a r i a b l e s t h e p r o b a b i l i t y o f e n c o u n t e r i n g a chance c o r r e l a t i o n i s about 22$ f o r s i x o b s e r v a t i o n s , 10$ f o r e i g h t o b s e r v a t i o n s , 3$ f o r t e n o b s e r v a t i o n s and 1$ f o r twelve o b s e r v a t i o n s , A g u i d e l i n e o f t e n s t a t e d i s t h a t a c e r t a i n num­ ber o f o b s e r v a t i o n s a r e r e q u i r e d i n o r d e r t o have much c o n f i d e n c e i n a g i v e n c o r r e l a t i o n e q u a t i o n . T h i s r u l e o f thumb i s d e f i c i e n t on two c o u n t s ; f i r s t , v a r i a b l e s screened r a t h e r than v a r i a b l e s i n c l u d e d i n the e q u a t i o n should be c o n s i d e r e d and second, t h e number o f o b s e r v a t i o n s p e r v a r i a b l e i s not a l i n e a r f u n c t i o n o f t h e number o f v a r i a b l e s as i t p e r t a i n s t o a s p e c i f i e d l e v e l o f chance c o r r e l a t i o n . In Figure h are p l o t t e d the number o f o b s e r v a t i o n s p e r v a r i a b l e a g a i n s t t h e number o f v a r i a b l e s f o r chance c o r r e l a ­ t i o n l e v e l s o f 1$ f o r r v a l u e s o f ] ^ 0 . 5 , 0 . 8 and 0 . 9 · C o n s i d e r i n g t h e r ^ 0 . 8 curve, t h e number o f obser­ v a t i o n s p e r v a r i a b l e i s 3 · 7 f o r t h r e e v a r i a b l e s drop­ p i n g t o 1.75 f o r t h i r t e e n v a r i a b l e s . F i g u r e 5 shows t h e r e l a t i o n s h i p between observa­ t i o n s and v a r i a b l e s f o r a chance c o r r e l a t i o n p r o b a b i l ­ i t y o f 1$ o r l e s s f o r v a r i o u s r l e v e l s . The l i n e a r r e l a t i o n s h i p s shown a r e each s t a t i s t i c a l l y s i g n i f i c a n t a t t h e ρ < 0.0001 l e v e l . T h i s graph p e r m i t s the de­ t e r m i n a t i o n o f t h e number o f o b s e r v a t i o n s r e q u i r e d t o s c r e e n , f o r example, t e n v a r i a b l e s w h i l e k e e p i n g t h e p r o b a b i l i t y o f e n c o u n t e r i n g a chance c o r r e l a t i o n w i t h r ^ 0 . 8 a t t h e 1$ l e v e l o r l e s s . From t h e graph i t can be e s t i m a t e d t h a t t h i s number o f o b s e r v a t i o n s i s about 20. F o r r ^ 0 . 9 , t h e number r e q u i r e d i s l e s s , 2

2

2

2

2

2

2

2

2

2

Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.

5. TOPLiss

A N D EDWARDS

QSAR Studies

137

Figure 1. Rehtionship between mean r values and number of variables screened (number of observations: (V) 7; (O )10; (A) 15; (Π) 20; (0 ) 25; (Φ) 30; (M) 50; (V)100) 2

Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.

138

COMPUTER-ASSISTED

DRUG

DESIGN

0.301-

N0. OBSERVATIONS

Figure 2. Rehtionship between number of observations and probability of a chance correlation (Pc) for 10 screened variables with r ^ 0.7 (O), 0.8 (A), and 0.9(H) 2

NO. OBSERVATIONS

Figure 3. Rehtionship between number of observations and probability of a chance correlation for 5 (O), 10 (A), and 15 (Q) screened variables with r ^ 0.8 2

Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.

5.

TOPLISS

QSAR Studies

A N D EDWARDS

139

>4 Ο

e

3h

10 15 NO. VARIABLES

20

Figure 4. Number of observations/number of variables as a function of number of variables for specified levels of chance correlation (Pc < 0.01: R > 0.5 (O); R > 0.8(A); R > 0.9(D)) 2

2

2

i at % 8

6

8

J

I

I L_

10 12 14 16 18 20 22 24 26 28 30 NO. OBSERVATIONS

Figure 5. Rehtionship between number of observations and number of variables for chance correlation level Pc ^ 0.01 with r ^ 0.7 (O), 0.8 (A), and 0.9 (\J) 2

Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.

140

COMPUTER-ASSISTED DRUG

DESIGN

16· E x t r a p o l a t i o n o f t h e p r e v i o u s graph a l l o w s one ( F i g u r e 6 ) t o make s i m i l a r e s t i m a t i o n s f o r l a r g e r numbers o f screened v a r i a b l e s . Thus, some 56 obser­ v a t i o n s would be r e q u i r e d t o s c r e e n ko v a r i a b l e s w h i l e k e e p i n g t h e p r o b a b i l i t y o f e n c o u n t e r i n g a chance c o r r e l a t i o n with r 0 . 8 a t t h e 1$ l e v e l o r l e s s . The p r e c e d i n g graphs make p o s s i b l e t h e e s t i m a t i o n o f t h e number o f o b s e r v a t i o n s needed t o s c r e e n a spec­ i f i e d number o f v a r i a b l e s w h i l e m a i n t a i n i n g t h e prob­ a b i l i t y o f e n c o u n t e r i n g a chance c o r r e l a t i o n o f de­ f i n e d r l e v e l a t 1$ o r l e s s . I n F i g u r e 7 t h e graph shows t h e e f f e c t on t h e number o f r e q u i r e d observa­ t i o n s as t h e t o l e r a t e d chance c o r r e l a t i o n p r o b a b i l i t y i n c r e a s e s t o 5 o r 10$. A g a i n t h e d e p i c t e d l i n e a r r e l a t i o n s h i p s a r e each s t a t i s t i c a l l y s i g n i f i c a n t a t the ρ < 0.0001 l e v e l . Thus, f o r 10 v a r i a b l e s , 20 o b s e r v a t i o n s a r e needed a t a chance c o r r e l a t i o n l e v e l o f 1$ which reduces t o 15 a t t h e 5$ l e v e l and 13-14 at the 10$ l e v e l . Some l i m i t a t i o n s must be p o i n t e d out t o t h e approach d e s c r i b e d i n t h i s study f o r e s t i m a t i n g prob­ a b l e chance c o r r e l a t i o n l e v e l s . In actual p r a c t i c e s c r e e n e d v a r i a b l e s have v a r y i n g degrees o f c o l l i n e a r i t y and not i n f r e q u e n t l y some v a r i a b l e s a r e h i g h l y collinear. G e n e r a l l y , c o l l i n e a r i t y w i l l be more i n e v i d e n c e i n a c t u a l p r a c t i c e among a group o f s c r e e n e d v a r i a b l e s t h e n among random number s i m u l a t e d s c r e e n e d variables. To t h e e x t e n t t h a t t h i s o c c u r s i t has t h e e f f e c t o f r e d u c i n g t h e number o f independent v a r i a b l e s a c t u a l l y o p e r a t i v e as such. To r e f e r e n c e a group o f s c r e e n e d v a r i a b l e s used i n an a c t u a l problem w i t h t h e random number d a t a i t i s t h e r e f o r e suggested t h a t when the c o r r e l a t i o n c o e f f i c i e n t between two s c r e e n e d v a r i ­ a b l e s i s 0 . 8 o r h i g h e r , t h e n o n l y one o f t h e s e v a r i ­ a b l e s i s counted i n a r r i v i n g a t t h e e f f e c t i v e number o f independent v a r i a b l e s s c r e e n e d . T h i s approximate c o r r e c t i o n device should avoid s e r i o u s o v e r e s t i m a t i o n o f chance c o r r e l a t i o n e f f e c t s . A second l i m i t a t i o n has t o do w i t h use o f t h e s t e p w i s e m u l t i p l e r e g r e s s i o n method. T h i s method p r o b a b l y misses some c o r r e l a t i o n s and as such l e a d s t o some u n d e r e s t i m a t i o n o f t h e i n c i d e n c e o f chance correlations. O v e r a l l , the r e s i d u a l b i a s a f t e r c o r r e c t i o n f o r t h e c o l l i n e a r i t y e f f e c t tends t o c o u n t e r b a l a n c e t h e b i a s from use o f t h e s t e p w i s e r e g r e s s i o n method so t h e net e s t i m a t i o n e r r o r may not be t h a t l a r g e . about

2

2

Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.

5.

TOPLiss A N D

10

141

QSAR Studies

EDWARDS

20

30

40

50

60

70

80

90

100

Να OBSERVATIONS

Figure 6. Rehtionship between number of observations and number of variables for chance correlation level Pc $ζ 0.01 with r ^ 0.7 (O), 0.8 (A), and 0.9 (Q) (extrapolated) 2

J

I

I

I

I

2

4

6

8

10

I

I

I

I

1—ι—ι—ι—ι

1

12 14 16 18 20 NO. OBSERVATIONS

22

24

26

28

30

Figure 7. Relationship between number of observations and number of variables with r ^ 0.80 at specified probability levels: Pc < 0.01 (O); Pc < 0.05 (A); Pc < 0.10 (Π) 2

Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.

142

COMPUTER-ASSISTED DRUG

DESIGN

A p p l i c a t i o n s to Reported C o r r e l a t i o n Studies I n the l i g h t o f t h e chance c o r r e l a t i o n phenomenon d e s c r i b e d , some comments w i l l be made on a number o f c o r r e l a t i o n s t u d i e s r e p o r t e d i n the l i t e r a t u r e which i n v o l v e d l a r g e numbers o f screened v a r i a b l e s . The f i r s t o f t h e s e r e p o r t e d by P e r a d e j o r d i _et a l . (6) concerned a quantum c h e m i c a l approach t o s t r u c ­ t u r e - a c t i v i t y r e l a t i o n s h i p s of t e t r a c y c l i n e a n t i b i o t ­ ics. The c o r r e l a t i o n e q u a t i o n (2) c o n t a i n e d seven i n ­ dependent v a r i a b l e s each s i g n i f i c a n t at the 1$ l e v e l log

κ]

=

18.3975+56.1733Qoio - 1.1063E

0 l l

+71.3254Q

0 i 2

+ l 6

*

9 1 5 5 E

Oio

+l8.3655E

+ l f 8 # 7 9 5 6 Q

0 i g

04

+3.3880Q

C

(2) w i t h the o v e r a l l e q u a t i o n s i g n i f i c a n t at the 0.1$ level. The r v a l u e g i v e n was Ο.986 so t h a t the equa­ t i o n accounts f o r almost a l l o f the v a r i a n c e i n the data. There were 20 o b s e r v a t i o n s and some l 8 p o s s i b l e independent v a r i a b l e s were s c r e e n e d . The random num­ b e r s i m u l a t e d c o r r e l a t i o n d a t a g e n e r a t e d i n the cur­ r e n t study d i d not i n c l u d e the c o m b i n a t i o n o f 20 ob­ s e r v a t i o n s and 18 v a r i a b l e s . The c l o s e s t r e f e r e n c e p o i n t s were 20 o b s e r v a t i o n s and 20 v a r i a b l e s f o r which the i n c i d e n c e o f chance c o r r e l a t i o n s w i t h r ^ 0 . 9 was about 8$, and 17 o b s e r v a t i o n s and 15 v a r i a b l e s w i t h a 5$ chance c o r r e l a t i o n i n c i d e n c e . The approximate r i s k of a chance c o r r e l a t i o n i s t h e r e f o r e 5-8$. The r e p o r t e d c o r r e l a t i o n e q u a t i o n , a l t h o u g h i t s t i l l may be v a l i d , thus appears f a r l e s s secure t h a n the r e p o r t e d p - v a l u e o f < 0.001 would i n d i c a t e . It s h o u l d be noted t h a t we are not s a y i n g t h a t t h e r e i s a 5-8$ p r o b a b i l i t y t h a t the r e p o r t e d e q u a t i o n c o u l d a r i s e by chance. We are s a y i n g t h a t from 20 observa­ t i o n s and 18 v a r i a b l e s t h e r e i s a 5-8$ p r o b a b i l i t y t h a t some c o r r e l a t i o n w i t h r ^ 0 . 9 c o u l d emerge from chance f a c t o r s a l o n e . Gibbons jet al.. ( j ) r e p o r t e d on Q u a n t i t a t i v e S t r u c t u r e - A c t i v i t y R e l a t i o n s h i p s among s e l e c t e d P y r i midinones and H i l l R e a c t i o n I n h i b i t i o n . The c o r r e l a ­ t i o n e q u a t i o n (3) a r r i v e d at was based on 17 observa2

2

2

Rplso = - 0 . 4 6 ( ± 0 . 1 ΐ ) τ τ r i n g + 8 . 9 2 ( ± 1 . 8 5 ) σ +0.0^5 (±0.008 )EMRtions,

contained

0.15

(3)

t h r e e independent v a r i a b l e s , had

Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.

an

5.

TOPLiss

AND

QSAR Studies

EDWARDS

143

r = 0,83 and a p - v a l u e o f < 0 . 0 1 . Some 20 v a r i a b l e s were s c r e e n e d f o r p o s s i b l e i n c l u s i o n which may be e q u i v a l e n t to about 18 v a r i a b l e s a l l o w i n g f o r c o l l i n earity factors. T h i s g i v e s the c o m b i n a t i o n 17 obser­ v a t i o n s and 18 v a r i a b l e s . The n e a r e s t reference p o i n t s a v a i l a b l e f o r chance c o r r e l a t i o n s are the com­ b i n a t i o n s 20 o b s e r v a t i o n s and 20 v a r i a b l e s and 15 ob­ s e r v a t i o n s and 15 v a r i a b l e s f o r which the c o r r e l a t i o n frequencies for r ^ , 0 . 8 are 17^ and 28^ r e s p e c t i v e l y . I t must, t h e r e f o r e , be c o n c l u d e d t h a t t h e r e i s a sub­ s t a n t i a l r i s k t h a t the r e p o r t e d e q u a t i o n i s s p u r i o u s i n sharp c o n t r a s t w i t h the r e p o r t e d p - v a l u e o f < 0 . 0 1 . A study by Timmermans and van Zwieten (J3) on QSAR i n C e n t r a l l y A c t i n g I m i d a z o l i n e s S t r u c t u r a l l y R e l a t e d t o C l o n i d i n e l e d to the development o f equa­ tion ( k ) . There were 27 o b s e r v a t i o n s and e f f e c t i v e l y 2

2

log

I/ED30 = -0.00032(±0.0008)(£Par) +0.105(±0.03)EPar 2

- 0.695(±0.17)ApKa°+5.333(±1.89)H0M0(P) (h)

+6.752(±2.25)EE(P)+2.1*94 η = 27,

r

2

= 0.91,

Ρ
10). They a l s o searched f o r c o r r e ­ l a t i o n s where a l l X v a l u e s were r e p l a c e d by random numbers and r e a l Y v a l u e s were r e t a i n e d . S i n c e t h i s does not p r e s e r v e t h e p a t t e r n o f c o l l i n e a r i t y p r e s e n t i n t h e r e a l X v a r i a b l e s e t t h i s l a t t e r p r o c e d u r e may not p r o v i d e as good an e s t i m a t e o f chance e f f e c t s . It i s also possible to replace selected X variables o n l y , w i t h random numbers. T h i s would be u s e f u l i n s i t u a t i o n s where i t i s s u s p e c t e d t h a t i n a c o r r e l a ­ t i o n e q u a t i o n c e r t a i n terms, e g . it, it were r e a l c o r r e l a t i o n s w h i l e o t h e r s were s p u r i o u s . " I n t h i s case, r e a l it and i t numbers would be r e t a i n e d and t h e o t h e r X v a r i a b l e s r e p l a c e d by random numbers. In e v a l u a t i n g c o r r e l a t i o n s f o r chance e f f e c t s a t t e n t i o n s h o u l d a l s o be p a i d t o whether some obser­ v a t i o n s have been dropped from t h e s e t i n o r d e r t o obtain a b e t t e r data f i t . T h i s i s a f a i r l y common o c c u r r e n c e and u s u a l l y g i v e s much h i g h e r r v a l u e s . However, sometimes t h e j u s t i f i c a t i o n g i v e n f o r drop­ ping the poorly f i t observations are questionable. The c o r r e l a t i o n e q u a t i o n o b t a i n e d from u t i l i z i n g a l l o f t h e o b s e r v a t i o n s ( i e . a f t e r adding back i n t h e dropped o b s e r v a t i o n s ) s h o u l d t h e r e f o r e be checked f o r p o s s i b l e chance c o r r e l a t i o n s t o g i v e a b e t t e r o v e r a l l evaluation o f the r e l i a b i l i t y o f the r e s u l t . 2

2

Summary The r e s u l t s o f t h e s t u d i e s d e s c r i b e d show t h a t chance c o r r e l a t i o n s a r e a r e a l phenomenon o c c u r r i n g when t h e number o f v a r i a b l e s s c r e e n e d f o r p o s s i b l e c o r r e l a t i o n i s l a r g e compared t o t h e number o f obser­ vations. F o r t h i s r e a s o n some c o r r e l a t i o n s a r e l e s s s i g n i f i c a n t t h a n t h e i r s t a n d a r d p- v a l u e s i n d i c a t e as has been demonstrated by r e f e r e n c e t o some r e p o r t e d correlations. The p r e s e n t study p r o v i d e s g u i d e l i n e s f o r t h e approximate i n c i d e n c e o f chance c o r r e l a t i o n s a t s p e c i f i e d r v a l u e s f o r v a r i o u s combinations o f obser­ v a t i o n s and s c r e e n e d v a r i a b l e s . Specific correlations o b t a i n e d under c o n d i t i o n s where t h e g u i d e l i n e s i n d i ­ c a t e an a p p r e c i a b l e r i s k o f chance c o r r e l a t i o n s s h o u l d be i n d i v i d u a l l y checked as d e s c r i b e d . 2

Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.

5.

TOPLiss

A N D EDWARDS

QSAR Studies

145

F i n a l l y , i t s h o u l d "be p o i n t e d out t h a t t h e r e i s no i n t e n t i o n o f a d v o c a t i n g c o r r e l a t i o n s t u d i e s o n l y under c o n d i t i o n s where t h e r e i s a m i n i s c u l e r i s k o f chance c o r r e l a t i o n s . However, t h e importance o f having a true idea o f the r e l i a b i l i t y of a c o r r e l a ­ t i o n equation i s obvious. I t i s one t h i n g t o d e v e l o p and use a c o r r e l a t i o n e q u a t i o n which i s known t o be somewhat tenuous b u t q u i t e another t o b e l i e v e t h a t i t has a s o l i d f o u n d a t i o n when i n f a c t i t has n o t . Acknowledgement s The a u t h o r s a r e i n d e b t e d t o L. Weber f o r a s s i s ­ tance i n d a t a t a b u l a t i o n and t o M. M i l l e r and A. Saltzman f o r h e l p f u l d i s c u s s i o n s .

Literature Cited 1. 2. 3. 4.

Martin, Y. C . , "Quantitative Drug Design", Marcel Dekker, Inc., New York, 1978. Draper, N. R., and Smith, H . , "Applied Regression Analysis", Wiley, New York, 1966. Topliss, J . G . , and Costello, R. J., J . Med. Chem. ( 1 9 7 2 ) , 15, 1066. International Business Machines Corporation, "System/360 Scientific Subroutine Package, Version III, Programmer's Manual, Program Number 360A-CM-03X" Manual GH20-0205-4 (Fifth Edition), IBM Corporation, White Plains, New York, August 1970.

5.

6. 7. 8. 9.

International Business Machines Corporation, "Random Number Generation and Testing,, Manual G20-8011, IBM Corporation, White Plains, New York. Peradejordi, F . , Martin, A. N., and Cammarata, Α., J . Pharm. Sci. ( L 9 7 L ) , 60, 576. Gibbons, L. K., Koldenhoven, E. F., Nethery, R. E . , Montgomery, R. E., and Purcell, W. P., J . Agric. Food Chem. (1976), 24, 203. Timmermans, B. M. W. M., and van Zwieten, P. Α . , J. Med. Chem. ( 1 9 7 7 ) , 20, 1636. Kier, L. B., and Hall, L. H . , J. Med. Chem. (1977),

10.

20,

1631.

Kier, L. B., and Hall, L. H . , J. Pharm. Sci. (1978),

RECEIVED

67,

1409.

June 8, 1979.

Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.