5 Chance Factors in QSAR Studies JOHN G. TOPLISS and ROBERT P. EDWARDS
Downloaded via TUFTS UNIV on July 12, 2018 at 10:44:41 (UTC). See https://pubs.acs.org/sharingguidelines for options on how to legitimately share published articles.
Schering-Plough Research Division, Bloomfield, NJ 07003
During the past decade quantitative structure -activity relationships (QSAR) have emerged as an im portant tool in drug design ( 1 ) . The role of the computer has been essential in allowing practical use of s t a t i s t i c a l methods such as multiple regression analysis (2). However, there is a risk of arriving at fortuitous correlations when too many variables are screened relative to the number of available ob servations. In this regard a critical distinction must be made between the number of variables screened for possible correlation and the number which actu ally appear in the regression equation. By way of i l l u s t r a t i o n take a case where there are ten observations and corresponding activity values A to Α and eight variables V to V are con sidered. Multiple regression analysis yields equa tion (1) where r = 0.85, F = 19.8 (p < 0.005) and 1
10
1
8
2
2,7
A = aV1 + bV + c 2
(1)
V and V are each significant at the level p < 0.01. However, the s t a t i s t i c a l test for significance of the equation only relates to the variables included in the actual correlation equation and does not take account of the fact that eight variables were screened for possible correlation. It is clear that the greater the number of variables which are screened for possible inclusion the greater w i l l be the prob ability that a chance correlation w i l l occur. The p value of < 0.005 for the equation shown overstates to some extent the probability that the relationship given by the equation is real rather than occurring by chance. The question is how misleading is this p value? The studies to be described w i l l attempt to provide some answer to this question. 1
2
0-8412-0521-3/79/47-112-131$05.00/0 © 1979 American Chemical Society
Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.
132
COMPUTER-ASSISTED
DRUG
DESIGN
E a r l i e r work on t h i s problem was r e p o r t e d some y e a r s ago (^). However, t h e s e s t u d i e s were t o o l i m i t ed i n scope t o p r o v i d e more t h a n a d e m o n s t r a t i o n t h a t i n d e e d t h e r e was a problem from chance c o r r e l a t i o n s under c e r t a i n c o n d i t i o n s and t o p r o v i d e v e r y g e n e r a l guidelines. A c c o r d i n g l y , i n view o f t h e i n t e r e s t i n t h i s problem f o l l o w i n g p u b l i c a t i o n o f t h e e a r l i e r study, p l a n s were made t o g e n e r a t e more e x t e n s i v e and d e t a i l e d data. Method The b a s i c approach was t o s e t up a l a r g e number and wide range o f s i m u l a t e d c o r r e l a t i o n s t u d i e s u s i n g random numbers f o r b o t h o b s e r v a t i o n s and s c r e e n e d v a r i a b l e s and a s s e s s t h e i n c i d e n c e and degree o f any resulting correlations. Computer System. The program used was d e v e l o p e d by m o d i f y i n g an e x i s t i n g FORTRAN M u l t i p l e R e g r e s s i o n A n a l y s i s Program (BMD 02R), t h e essence o f t h e modi f i c a t i o n s b e i n g t o g e n e r a t e t h e d a t a t o be a n a l y z e d u s i n g a pseudo-random number g e n e r a t o r and t o accumu l a t e s t a t i s t i c s from each o f t h e i n d i v i d u a l simula t i o n s and produce output r e p o r t s , f r e q u e n c y d i s t r i b u t i o n s and summary t a b l e s . The program was r u n on an IBM 560/158 computer p e r f o r m i n g under 0S/VS2. Random Number G e n e r a t o r . The pseudo-random num b e r s were g e n e r a t e d u s i n g t h e IBM RANDU s u b r o u t i n e ( k ) which employs t h e M u l t i p l i c a t i v e C o n g r u e n t i a l Method and r e t u r n s a v e c t o r o f random numbers, each u n i f o r m l y d i s t r i b u t e d i n t h e range 0 t o 1. The random numbers were s c a l e d by a f a c t o r o f 1000. S t a r t i n g from a u s e r - d e f i n e d seed v a l u e , RANDU gen e r a t e s a stream o f pseudo-random numbers and w i l l produce 2 terms b e f o r e r e p e a t i n g . By comparison of t h e o b s e r v e d as a g a i n s t t h e e x p e c t e d f r e q u e n c y d i s t r i b u t i o n ( u n i f o r m ) o f numbers, RANDU i s an ade quate pseudo-random number g e n e r a t o r . To b e t t e r approximate pure randomness each computer r u n had a d i f f e r e n t e x t e r n a l l y imposed i n i t i a l seed v a l u e . I n a d d i t i o n o n l y e v e r y f o u r t h pseudo-random number was sampled from t h e stream. 1
2 9
Computer Program Methodology. The f i r s t s t e p i n each s i m u l a t i o n i s t o g e n e r a t e t h e d a t a m a t r i x w i t h r rows (one f o r each o b s e r v a t i o n ) and η + 1 columns (n independent v a r i a b l e s and one dependent). The v a l u e s o f r and η a r e s p e c i f i e d when t h e program i s
Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.
5.
TOPLiss A N D
EDWARDS
QSAR Studies
133
run. The d a t a are t h e n a n a l y z e d u s i n g a stepwise M u l t i p l e R e g r e s s i o n Technique ( 2 ) . At each stage o f t h i s p r o c e d u r e the independent v a r i a b l e most h i g h l y c o r r e l a t e d w i t h the dependent v a r i a b l e i s brought i n t o the model p r o v i d e d t h a t the p a r t i a l F - t e s t v a l u e f o r t h a t v a r i a b l e i s s i g n i f i c a n t at the 10$ l e v e l . Each independent v a r i a b l e i n the r e g r e s s i o n model i s t h e n re-examined t o see i f i t i s s t i l l making a s i g n i f i c a n t c o n t r i b u t i o n . Any v a r i a b l e whose p a r t i a l F - t e s t v a l u e i s not s i g n i f i c a n t at the 10$ l e v e l i s dropped from t h e model. T h i s p r o c e s s c o n t i n u e s u n t i l no more v a r i a b l e s e n t e r the model and none are r e j e c t ed. From each s i m u l a t i o n , the number o f v a r i a b l e s e n t e r i n g the r e g r e s s i o n (zero t o n) i s s t o r e d as w e l l as the r s t a t i s t i c i n the case o f at l e a s t one v a r i a b l e e n t e r i n g the r e g r e s s i o n a n a l y s i s . T h i s whole p r o c e s s i s r e p e a t e d f o r each s i m u l a t i o n , the number o f s i m u l a t i o n s t o be c a r r i e d out b e i n g s p e c i f i e d when the program i s run. The r e s u l t s o f a l l the simulat i o n s are t h e n c o n s o l i d a t e d t o g i v e the f o l l o w i n g statistics : (1) The number and r e l a t i v e f r e q u e n c y f o r 0, 1, 2, 3 e t c . v a r i a b l e s e n t e r i n g the r e g r e s s i o n a n a l y s i s . (2) The number and r e l a t i v e f r e q u e n c y f o r at l e a s t one v a r i a b l e e n t e r i n g the r e g r e s s i o n a n a l y s i s . (3) F o r each s i g n i f i c a n t c o r r e l a t i o n o b s e r v e d (a) mean r s t a t i s t i c (b) median r s t a t i s t i c ( c ) minimum r s t a t i s t i c (d) maximum r statistic. (4) The number and r e l a t i v e f r e q u e n c y o f obs e r v e d c o r r e l a t i o n s by r interval: 1.0-0.9;0.9-0.8; 0.8-0.7 etc. 2
2
2
2
2
2
Scope o f Study. The number o f v a r i a b l e s s c r e e n e d was 3-15> 20, 30, ho and the number o f obs e r v a t i o n s was 5-30, 50, 100, 200, kOO, 600. Altog e t h e r some 300 c o m b i n a t i o n s o f v a r i a b l e s and observ a t i o n s were examined. The number of runs f o r each c o m b i n a t i o n ranged from 120-2190. I t should be noted t h a t the study was o r i g i n a l l y d e s i g n e d f o r f a r fewer combinations of v a r i a b l e s and o b s e r v a t i o n s and runs per combination. However, due t o the unexpected a v a i l a b i l i t y , f o r a s h o r t p e r i o d , of a computer which was b e i n g phased out, the scope o f the study was g r e a t l y expanded. Results Due to space l i m i t a t i o n s i t i s not f e a s i b l e t o p r e s e n t i n t a b u l a r form a l l o f the d a t a o b t a i n e d i n
Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.
134
COMPUTER-ASSISTED
DRUG
DESIGN
t h e study. However, T a b l e I p r o v i d e s an i l l u s t r a t i v e sample. The columns i n t h e t a b l e g i v e t h e number o f independent v a r i a b l e s , t h e number o f o b s e r v a t i o n s , the number o f r u n s , t h e f r e q u e n c y o f a chance c o r r e l a t i o n , t h e average number o f v a r i a b l e s e n t e r e d ( f o r those runs f o r which a t l e a s t one v a r i a b l e i s ent e r e d ) , t h e maximum, minimum and mean r v a l u e s , and the f r e q u e n c y o f those runs where r ^ 0 . 5 and ^ 0 . 8 . F o r t h e s i t u a t i o n w i t h t e n independent v a r i a b l e s s c r e e n e d and t e n o b s e r v a t i o n s , some t w o - t h i r d s o f t h e runs gave a c o r r e l a t i o n and f o r these an average o f 2.20 v a r i a b l e s e n t e r e d . R ranged from 0 . 2 8 t o 1.00 w i t h a mean v a l u e o f 0 . 6 k . Almost h a l f o f t h e t o t a l runs made gave c o r r e l a t i o n s w i t h r v a l u e s o f 0 . 5 o r h i g h e r w h i l e about one q u a r t e r o f t h e t o t a l runs made gave c o r r e l a t i o n s w i t h r ^ v a l u e s o f 0 . 8 o r h i g h e r . As t h e number o f o b s e r v a t i o n s i n c r e a s e d t h e r e was a downward t r e n d i n r v a l u e s so t h a t f o r 15 observat i o n s o n l y h^o o f t h e t o t a l runs gave c o r r e l a t i o n s w i t h r ^ . 0 . 8 and f o r 30 o b s e r v a t i o n s no c o r r e l a t i o n s w i t h r ^ 0 . 8 were found. F o r 20 v a r i a b l e s s c r e e n e d and o n l y 15 observat i o n s , p r a c t i c a l l y a l l o f t h e runs r e s u l t e d i n some l e v e l o f chance c o r r e l a t i o n w i t h about kklo o f t h e t o t a l runs r e s u l t i n g i n a c o r r e l a t i o n w i t h r ^ 0 . 8 . As t h e number o f o b s e r v a t i o n s i n c r e a s e d t o 30 the number o f runs y i e l d i n g c o r r e l a t i o n s w i t h r ^ 0 . 8 d e c l i n e d t o about 1#. P-values f o r t h e observed c o r r e l a t i o n s ranged from a minimum v a l u e o f 0.10 up t o 0 . 0 0 . A t t h i s p o i n t i t must be noted t h a t the degree o f chance c o r r e l a t i o n found i n t h i s study was subs t a n t i a l l y l e s s t h a n t h a t r e p o r t e d i n an e a r l i e r study (jj). Mean r v a l u e s averaged about o n e - h a l f those r e p o r t e d i n t h e e a r l i e r study, r a n g i n g from t h r e e - q u a r t e r s down t o o n e - q u a r t e r . The cause o f t h i s d i s c r e p a n c y c o u l d not be determined s i n c e t h e b a s i c r e c o r d s from t h e e a r l i e r study were not retained. However, i t must be c o n c l u d e d t h a t t h e e a r l i e r study was i n e r r o r . The p r e s e n t study was v e r y c a r e f u l l y checked and s e l e c t e d combinations setup and performed manually showed s i m i l a r r e s u l t s . From d a t a g e n e r a t e d i n t h e p r e s e n t study i t i s now p o s s i b l e t o answer t h e q u e s t i o n r a i s e d e a r l i e r c o n c e r n i n g the p - v a l u e o f e q u a t i o n 1. With t e n obs e r v a t i o n s and e i g h t s c r e e n e d v a r i a b l e s t h e expected f r e q u e n c y o f chance c o r r e l a t i o n s comes out t o be about 12#. Thus, t h e v a l i d i t y o f t h e c o r r e l a t i o n e q u a t i o n i s f a r l e s s c e r t a i n t h a n the r e p o r t e d p - v a l u e indicates. 2
2
2
2
2
2
2
2
2
2
Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.
Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.
ko
40
1.26 1.15
0.26 0.27 0.28
2190 2190 2190 120 398 2190 2190 120 398 398 2190 120 120 398 2190 2190 120 120 2190 1991 1991 1991 199 199
5 10 20 5 7 10 20 10 12 15 30 10 15 17 20 30 15 20 30 50 50 100 50 100
3 3 3 5 5 5 5 10 10 10 10 15 15 15 15 15 20 20 20 20 30 30 l . k l
o.ko
0.67 0.71 0.71 0.74 0.88 0.87 0.86 0.85 0.85 0.94 0.97 0.95 0.95 0.99 0.99 1. 00 1.00
O.kl
5.00 3.53 3.53 3.41 5.59 5.17 7. ik 7. ok
2.48
3.33 2.77 2.6l
k.k2
1.84
I.30 2.20 2.09 1.92
0.39
O.kk
1.87 1.5*
1.1k
AV. NO. VAR. ENTERED
FREQ. CHANCE CORR.
NO. RUNS
NO. OBS.
0.73 0.76 0.52 0.90 0. ^7
0.9k
1.00 1.00 0.98 0.97 0.86 1. 00 0.98
0.7k
1. 00 0.95 0.63 1. 00 1.00 0.99 0.79 1.00 1.00 0.98
R
2
02 0.03
0.0k
0.
0. Ok
0.65 0.30 0. Ik 0.6l 0.45 O.3O 0. Ik 0.28 0.23 0.18 0.08 0.26 0.19 0.l6 0.13 0. 08 0.17 0.13 0.08 0. Ok
MIN R
2
0.37 0.18 0.49 0.24
0.24
0.23 0.79 O.60 0.53 0.45 0.30 0.73 Ο.56 0.39
0.46
0.57
0.64
0.83 0.47 0.24 O.80 0.73 0. $k 0.28
MEAN R
u s i n g Random Numbers
MAX
Table 1 Correlations
). IND. VAR.
Data from S i m u l a t e d 2
0.34 0.11 0.76 0.58 0.29 0.03 0.19 0. 00 0.44 0.01
0.46
0.28 0.03 0.74 0.56
0.46 0.41
0.26 0.09 0.01 0.44 0.34 0.21 0.03
0.15 0. 01 0. 00 0.28 0.15 0. o4 0.00 0.23 0.14 0.04 0. 00 0. 54 0.28 0.14 0.06 0.00 0.44 0.17 0. 01 0.00 0. 00 0.00 0. 02 0. 00
2
FREQ. R > 0.8 0.5
136
COMPUTER-ASSISTED DRUG
DESIGN
The r e s u l t s a r e f u r t h e r i l l u s t r a t e d i n F i g u r e s 1-7· The r e l a t i o n s h i p between r and t h e number o f v a r i a b l e s s c r e e n e d f o r v a r y i n g numbers o f observa tions i s i l l u s t r a t e d i n Figure 1· I t can be seen t h a t f o r a c o n s t a n t number o f o b s e r v a t i o n s r i n c r e a s e s as t h e number o f v a r i a b l e s s c r e e n e d i n c r e a s e s . A l s o , f o r a g i v e n number o f v a r i a b l e s screened, r i n c r e a s e s as t h e number o f o b s e r v a t i o n s d e c r e a s e s . In F i g u r e 2 t h e r e l a t i o n s h i p between the number o f o b s e r v a t i o n s and t h e p r o b a b i l i t y o f a chance c o r r e l a t i o n w i t h t e n screened v a r i a b l e s f o r v a r i o u s r v a l u e s i s shown. From the graph c o r r e s p o n d i n g t o r ^ 0 , 8 i t may be seen t h a t the p r o b a b i l i t y o f en c o u n t e r i n g a chance c o r r e l a t i o n a t t h i s l e v e l i s about 22$ f o r t e n o b s e r v a t i o n s , r e a c h i n g zero f o r 23 observations. The graph shown i n F i g u r e 3 g i v e s t h e r e l a t i o n s h i p between t h e number o f o b s e r v a t i o n s and t h e prob a b i l i t y o f a chance c o r r e l a t i o n w i t h r ^ 0 . 8 f o r f i v e , t e n and f i f t e e n v a r i a b l e s . Thus, f o r f i v e screened v a r i a b l e s t h e p r o b a b i l i t y o f e n c o u n t e r i n g a chance c o r r e l a t i o n i s about 22$ f o r s i x o b s e r v a t i o n s , 10$ f o r e i g h t o b s e r v a t i o n s , 3$ f o r t e n o b s e r v a t i o n s and 1$ f o r twelve o b s e r v a t i o n s , A g u i d e l i n e o f t e n s t a t e d i s t h a t a c e r t a i n num ber o f o b s e r v a t i o n s a r e r e q u i r e d i n o r d e r t o have much c o n f i d e n c e i n a g i v e n c o r r e l a t i o n e q u a t i o n . T h i s r u l e o f thumb i s d e f i c i e n t on two c o u n t s ; f i r s t , v a r i a b l e s screened r a t h e r than v a r i a b l e s i n c l u d e d i n the e q u a t i o n should be c o n s i d e r e d and second, t h e number o f o b s e r v a t i o n s p e r v a r i a b l e i s not a l i n e a r f u n c t i o n o f t h e number o f v a r i a b l e s as i t p e r t a i n s t o a s p e c i f i e d l e v e l o f chance c o r r e l a t i o n . In Figure h are p l o t t e d the number o f o b s e r v a t i o n s p e r v a r i a b l e a g a i n s t t h e number o f v a r i a b l e s f o r chance c o r r e l a t i o n l e v e l s o f 1$ f o r r v a l u e s o f ] ^ 0 . 5 , 0 . 8 and 0 . 9 · C o n s i d e r i n g t h e r ^ 0 . 8 curve, t h e number o f obser v a t i o n s p e r v a r i a b l e i s 3 · 7 f o r t h r e e v a r i a b l e s drop p i n g t o 1.75 f o r t h i r t e e n v a r i a b l e s . F i g u r e 5 shows t h e r e l a t i o n s h i p between observa t i o n s and v a r i a b l e s f o r a chance c o r r e l a t i o n p r o b a b i l i t y o f 1$ o r l e s s f o r v a r i o u s r l e v e l s . The l i n e a r r e l a t i o n s h i p s shown a r e each s t a t i s t i c a l l y s i g n i f i c a n t a t t h e ρ < 0.0001 l e v e l . T h i s graph p e r m i t s the de t e r m i n a t i o n o f t h e number o f o b s e r v a t i o n s r e q u i r e d t o s c r e e n , f o r example, t e n v a r i a b l e s w h i l e k e e p i n g t h e p r o b a b i l i t y o f e n c o u n t e r i n g a chance c o r r e l a t i o n w i t h r ^ 0 . 8 a t t h e 1$ l e v e l o r l e s s . From t h e graph i t can be e s t i m a t e d t h a t t h i s number o f o b s e r v a t i o n s i s about 20. F o r r ^ 0 . 9 , t h e number r e q u i r e d i s l e s s , 2
2
2
2
2
2
2
2
2
2
Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.
5. TOPLiss
A N D EDWARDS
QSAR Studies
137
Figure 1. Rehtionship between mean r values and number of variables screened (number of observations: (V) 7; (O )10; (A) 15; (Π) 20; (0 ) 25; (Φ) 30; (M) 50; (V)100) 2
Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.
138
COMPUTER-ASSISTED
DRUG
DESIGN
0.301-
N0. OBSERVATIONS
Figure 2. Rehtionship between number of observations and probability of a chance correlation (Pc) for 10 screened variables with r ^ 0.7 (O), 0.8 (A), and 0.9(H) 2
NO. OBSERVATIONS
Figure 3. Rehtionship between number of observations and probability of a chance correlation for 5 (O), 10 (A), and 15 (Q) screened variables with r ^ 0.8 2
Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.
5.
TOPLISS
QSAR Studies
A N D EDWARDS
139
>4 Ο
e
3h
10 15 NO. VARIABLES
20
Figure 4. Number of observations/number of variables as a function of number of variables for specified levels of chance correlation (Pc < 0.01: R > 0.5 (O); R > 0.8(A); R > 0.9(D)) 2
2
2
i at % 8
6
8
J
I
I L_
10 12 14 16 18 20 22 24 26 28 30 NO. OBSERVATIONS
Figure 5. Rehtionship between number of observations and number of variables for chance correlation level Pc ^ 0.01 with r ^ 0.7 (O), 0.8 (A), and 0.9 (\J) 2
Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.
140
COMPUTER-ASSISTED DRUG
DESIGN
16· E x t r a p o l a t i o n o f t h e p r e v i o u s graph a l l o w s one ( F i g u r e 6 ) t o make s i m i l a r e s t i m a t i o n s f o r l a r g e r numbers o f screened v a r i a b l e s . Thus, some 56 obser v a t i o n s would be r e q u i r e d t o s c r e e n ko v a r i a b l e s w h i l e k e e p i n g t h e p r o b a b i l i t y o f e n c o u n t e r i n g a chance c o r r e l a t i o n with r 0 . 8 a t t h e 1$ l e v e l o r l e s s . The p r e c e d i n g graphs make p o s s i b l e t h e e s t i m a t i o n o f t h e number o f o b s e r v a t i o n s needed t o s c r e e n a spec i f i e d number o f v a r i a b l e s w h i l e m a i n t a i n i n g t h e prob a b i l i t y o f e n c o u n t e r i n g a chance c o r r e l a t i o n o f de f i n e d r l e v e l a t 1$ o r l e s s . I n F i g u r e 7 t h e graph shows t h e e f f e c t on t h e number o f r e q u i r e d observa t i o n s as t h e t o l e r a t e d chance c o r r e l a t i o n p r o b a b i l i t y i n c r e a s e s t o 5 o r 10$. A g a i n t h e d e p i c t e d l i n e a r r e l a t i o n s h i p s a r e each s t a t i s t i c a l l y s i g n i f i c a n t a t the ρ < 0.0001 l e v e l . Thus, f o r 10 v a r i a b l e s , 20 o b s e r v a t i o n s a r e needed a t a chance c o r r e l a t i o n l e v e l o f 1$ which reduces t o 15 a t t h e 5$ l e v e l and 13-14 at the 10$ l e v e l . Some l i m i t a t i o n s must be p o i n t e d out t o t h e approach d e s c r i b e d i n t h i s study f o r e s t i m a t i n g prob a b l e chance c o r r e l a t i o n l e v e l s . In actual p r a c t i c e s c r e e n e d v a r i a b l e s have v a r y i n g degrees o f c o l l i n e a r i t y and not i n f r e q u e n t l y some v a r i a b l e s a r e h i g h l y collinear. G e n e r a l l y , c o l l i n e a r i t y w i l l be more i n e v i d e n c e i n a c t u a l p r a c t i c e among a group o f s c r e e n e d v a r i a b l e s t h e n among random number s i m u l a t e d s c r e e n e d variables. To t h e e x t e n t t h a t t h i s o c c u r s i t has t h e e f f e c t o f r e d u c i n g t h e number o f independent v a r i a b l e s a c t u a l l y o p e r a t i v e as such. To r e f e r e n c e a group o f s c r e e n e d v a r i a b l e s used i n an a c t u a l problem w i t h t h e random number d a t a i t i s t h e r e f o r e suggested t h a t when the c o r r e l a t i o n c o e f f i c i e n t between two s c r e e n e d v a r i a b l e s i s 0 . 8 o r h i g h e r , t h e n o n l y one o f t h e s e v a r i a b l e s i s counted i n a r r i v i n g a t t h e e f f e c t i v e number o f independent v a r i a b l e s s c r e e n e d . T h i s approximate c o r r e c t i o n device should avoid s e r i o u s o v e r e s t i m a t i o n o f chance c o r r e l a t i o n e f f e c t s . A second l i m i t a t i o n has t o do w i t h use o f t h e s t e p w i s e m u l t i p l e r e g r e s s i o n method. T h i s method p r o b a b l y misses some c o r r e l a t i o n s and as such l e a d s t o some u n d e r e s t i m a t i o n o f t h e i n c i d e n c e o f chance correlations. O v e r a l l , the r e s i d u a l b i a s a f t e r c o r r e c t i o n f o r t h e c o l l i n e a r i t y e f f e c t tends t o c o u n t e r b a l a n c e t h e b i a s from use o f t h e s t e p w i s e r e g r e s s i o n method so t h e net e s t i m a t i o n e r r o r may not be t h a t l a r g e . about
2
2
Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.
5.
TOPLiss A N D
10
141
QSAR Studies
EDWARDS
20
30
40
50
60
70
80
90
100
Να OBSERVATIONS
Figure 6. Rehtionship between number of observations and number of variables for chance correlation level Pc $ζ 0.01 with r ^ 0.7 (O), 0.8 (A), and 0.9 (Q) (extrapolated) 2
J
I
I
I
I
2
4
6
8
10
I
I
I
I
1—ι—ι—ι—ι
1
12 14 16 18 20 NO. OBSERVATIONS
22
24
26
28
30
Figure 7. Relationship between number of observations and number of variables with r ^ 0.80 at specified probability levels: Pc < 0.01 (O); Pc < 0.05 (A); Pc < 0.10 (Π) 2
Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.
142
COMPUTER-ASSISTED DRUG
DESIGN
A p p l i c a t i o n s to Reported C o r r e l a t i o n Studies I n the l i g h t o f t h e chance c o r r e l a t i o n phenomenon d e s c r i b e d , some comments w i l l be made on a number o f c o r r e l a t i o n s t u d i e s r e p o r t e d i n the l i t e r a t u r e which i n v o l v e d l a r g e numbers o f screened v a r i a b l e s . The f i r s t o f t h e s e r e p o r t e d by P e r a d e j o r d i _et a l . (6) concerned a quantum c h e m i c a l approach t o s t r u c t u r e - a c t i v i t y r e l a t i o n s h i p s of t e t r a c y c l i n e a n t i b i o t ics. The c o r r e l a t i o n e q u a t i o n (2) c o n t a i n e d seven i n dependent v a r i a b l e s each s i g n i f i c a n t at the 1$ l e v e l log
κ]
=
18.3975+56.1733Qoio - 1.1063E
0 l l
+71.3254Q
0 i 2
+ l 6
*
9 1 5 5 E
Oio
+l8.3655E
+ l f 8 # 7 9 5 6 Q
0 i g
04
+3.3880Q
C
(2) w i t h the o v e r a l l e q u a t i o n s i g n i f i c a n t at the 0.1$ level. The r v a l u e g i v e n was Ο.986 so t h a t the equa t i o n accounts f o r almost a l l o f the v a r i a n c e i n the data. There were 20 o b s e r v a t i o n s and some l 8 p o s s i b l e independent v a r i a b l e s were s c r e e n e d . The random num b e r s i m u l a t e d c o r r e l a t i o n d a t a g e n e r a t e d i n the cur r e n t study d i d not i n c l u d e the c o m b i n a t i o n o f 20 ob s e r v a t i o n s and 18 v a r i a b l e s . The c l o s e s t r e f e r e n c e p o i n t s were 20 o b s e r v a t i o n s and 20 v a r i a b l e s f o r which the i n c i d e n c e o f chance c o r r e l a t i o n s w i t h r ^ 0 . 9 was about 8$, and 17 o b s e r v a t i o n s and 15 v a r i a b l e s w i t h a 5$ chance c o r r e l a t i o n i n c i d e n c e . The approximate r i s k of a chance c o r r e l a t i o n i s t h e r e f o r e 5-8$. The r e p o r t e d c o r r e l a t i o n e q u a t i o n , a l t h o u g h i t s t i l l may be v a l i d , thus appears f a r l e s s secure t h a n the r e p o r t e d p - v a l u e o f < 0.001 would i n d i c a t e . It s h o u l d be noted t h a t we are not s a y i n g t h a t t h e r e i s a 5-8$ p r o b a b i l i t y t h a t the r e p o r t e d e q u a t i o n c o u l d a r i s e by chance. We are s a y i n g t h a t from 20 observa t i o n s and 18 v a r i a b l e s t h e r e i s a 5-8$ p r o b a b i l i t y t h a t some c o r r e l a t i o n w i t h r ^ 0 . 9 c o u l d emerge from chance f a c t o r s a l o n e . Gibbons jet al.. ( j ) r e p o r t e d on Q u a n t i t a t i v e S t r u c t u r e - A c t i v i t y R e l a t i o n s h i p s among s e l e c t e d P y r i midinones and H i l l R e a c t i o n I n h i b i t i o n . The c o r r e l a t i o n e q u a t i o n (3) a r r i v e d at was based on 17 observa2
2
2
Rplso = - 0 . 4 6 ( ± 0 . 1 ΐ ) τ τ r i n g + 8 . 9 2 ( ± 1 . 8 5 ) σ +0.0^5 (±0.008 )EMRtions,
contained
0.15
(3)
t h r e e independent v a r i a b l e s , had
Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.
an
5.
TOPLiss
AND
QSAR Studies
EDWARDS
143
r = 0,83 and a p - v a l u e o f < 0 . 0 1 . Some 20 v a r i a b l e s were s c r e e n e d f o r p o s s i b l e i n c l u s i o n which may be e q u i v a l e n t to about 18 v a r i a b l e s a l l o w i n g f o r c o l l i n earity factors. T h i s g i v e s the c o m b i n a t i o n 17 obser v a t i o n s and 18 v a r i a b l e s . The n e a r e s t reference p o i n t s a v a i l a b l e f o r chance c o r r e l a t i o n s are the com b i n a t i o n s 20 o b s e r v a t i o n s and 20 v a r i a b l e s and 15 ob s e r v a t i o n s and 15 v a r i a b l e s f o r which the c o r r e l a t i o n frequencies for r ^ , 0 . 8 are 17^ and 28^ r e s p e c t i v e l y . I t must, t h e r e f o r e , be c o n c l u d e d t h a t t h e r e i s a sub s t a n t i a l r i s k t h a t the r e p o r t e d e q u a t i o n i s s p u r i o u s i n sharp c o n t r a s t w i t h the r e p o r t e d p - v a l u e o f < 0 . 0 1 . A study by Timmermans and van Zwieten (J3) on QSAR i n C e n t r a l l y A c t i n g I m i d a z o l i n e s S t r u c t u r a l l y R e l a t e d t o C l o n i d i n e l e d to the development o f equa tion ( k ) . There were 27 o b s e r v a t i o n s and e f f e c t i v e l y 2
2
log
I/ED30 = -0.00032(±0.0008)(£Par) +0.105(±0.03)EPar 2
- 0.695(±0.17)ApKa°+5.333(±1.89)H0M0(P) (h)
+6.752(±2.25)EE(P)+2.1*94 η = 27,
r
2
= 0.91,
Ρ
10). They a l s o searched f o r c o r r e l a t i o n s where a l l X v a l u e s were r e p l a c e d by random numbers and r e a l Y v a l u e s were r e t a i n e d . S i n c e t h i s does not p r e s e r v e t h e p a t t e r n o f c o l l i n e a r i t y p r e s e n t i n t h e r e a l X v a r i a b l e s e t t h i s l a t t e r p r o c e d u r e may not p r o v i d e as good an e s t i m a t e o f chance e f f e c t s . It i s also possible to replace selected X variables o n l y , w i t h random numbers. T h i s would be u s e f u l i n s i t u a t i o n s where i t i s s u s p e c t e d t h a t i n a c o r r e l a t i o n e q u a t i o n c e r t a i n terms, e g . it, it were r e a l c o r r e l a t i o n s w h i l e o t h e r s were s p u r i o u s . " I n t h i s case, r e a l it and i t numbers would be r e t a i n e d and t h e o t h e r X v a r i a b l e s r e p l a c e d by random numbers. In e v a l u a t i n g c o r r e l a t i o n s f o r chance e f f e c t s a t t e n t i o n s h o u l d a l s o be p a i d t o whether some obser v a t i o n s have been dropped from t h e s e t i n o r d e r t o obtain a b e t t e r data f i t . T h i s i s a f a i r l y common o c c u r r e n c e and u s u a l l y g i v e s much h i g h e r r v a l u e s . However, sometimes t h e j u s t i f i c a t i o n g i v e n f o r drop ping the poorly f i t observations are questionable. The c o r r e l a t i o n e q u a t i o n o b t a i n e d from u t i l i z i n g a l l o f t h e o b s e r v a t i o n s ( i e . a f t e r adding back i n t h e dropped o b s e r v a t i o n s ) s h o u l d t h e r e f o r e be checked f o r p o s s i b l e chance c o r r e l a t i o n s t o g i v e a b e t t e r o v e r a l l evaluation o f the r e l i a b i l i t y o f the r e s u l t . 2
2
Summary The r e s u l t s o f t h e s t u d i e s d e s c r i b e d show t h a t chance c o r r e l a t i o n s a r e a r e a l phenomenon o c c u r r i n g when t h e number o f v a r i a b l e s s c r e e n e d f o r p o s s i b l e c o r r e l a t i o n i s l a r g e compared t o t h e number o f obser vations. F o r t h i s r e a s o n some c o r r e l a t i o n s a r e l e s s s i g n i f i c a n t t h a n t h e i r s t a n d a r d p- v a l u e s i n d i c a t e as has been demonstrated by r e f e r e n c e t o some r e p o r t e d correlations. The p r e s e n t study p r o v i d e s g u i d e l i n e s f o r t h e approximate i n c i d e n c e o f chance c o r r e l a t i o n s a t s p e c i f i e d r v a l u e s f o r v a r i o u s combinations o f obser v a t i o n s and s c r e e n e d v a r i a b l e s . Specific correlations o b t a i n e d under c o n d i t i o n s where t h e g u i d e l i n e s i n d i c a t e an a p p r e c i a b l e r i s k o f chance c o r r e l a t i o n s s h o u l d be i n d i v i d u a l l y checked as d e s c r i b e d . 2
Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.
5.
TOPLiss
A N D EDWARDS
QSAR Studies
145
F i n a l l y , i t s h o u l d "be p o i n t e d out t h a t t h e r e i s no i n t e n t i o n o f a d v o c a t i n g c o r r e l a t i o n s t u d i e s o n l y under c o n d i t i o n s where t h e r e i s a m i n i s c u l e r i s k o f chance c o r r e l a t i o n s . However, t h e importance o f having a true idea o f the r e l i a b i l i t y of a c o r r e l a t i o n equation i s obvious. I t i s one t h i n g t o d e v e l o p and use a c o r r e l a t i o n e q u a t i o n which i s known t o be somewhat tenuous b u t q u i t e another t o b e l i e v e t h a t i t has a s o l i d f o u n d a t i o n when i n f a c t i t has n o t . Acknowledgement s The a u t h o r s a r e i n d e b t e d t o L. Weber f o r a s s i s tance i n d a t a t a b u l a t i o n and t o M. M i l l e r and A. Saltzman f o r h e l p f u l d i s c u s s i o n s .
Literature Cited 1. 2. 3. 4.
Martin, Y. C . , "Quantitative Drug Design", Marcel Dekker, Inc., New York, 1978. Draper, N. R., and Smith, H . , "Applied Regression Analysis", Wiley, New York, 1966. Topliss, J . G . , and Costello, R. J., J . Med. Chem. ( 1 9 7 2 ) , 15, 1066. International Business Machines Corporation, "System/360 Scientific Subroutine Package, Version III, Programmer's Manual, Program Number 360A-CM-03X" Manual GH20-0205-4 (Fifth Edition), IBM Corporation, White Plains, New York, August 1970.
5.
6. 7. 8. 9.
International Business Machines Corporation, "Random Number Generation and Testing,, Manual G20-8011, IBM Corporation, White Plains, New York. Peradejordi, F . , Martin, A. N., and Cammarata, Α., J . Pharm. Sci. ( L 9 7 L ) , 60, 576. Gibbons, L. K., Koldenhoven, E. F., Nethery, R. E . , Montgomery, R. E., and Purcell, W. P., J . Agric. Food Chem. (1976), 24, 203. Timmermans, B. M. W. M., and van Zwieten, P. Α . , J. Med. Chem. ( 1 9 7 7 ) , 20, 1636. Kier, L. B., and Hall, L. H . , J. Med. Chem. (1977),
10.
20,
1631.
Kier, L. B., and Hall, L. H . , J. Pharm. Sci. (1978),
RECEIVED
67,
1409.
June 8, 1979.
Olson and Christoffersen; Computer-Assisted Drug Design ACS Symposium Series; American Chemical Society: Washington, DC, 1979.