J. Chem. Inf. Model. 2008, 48, 2313–2325


Ligand-Target Prediction Using Winnow and Naive Bayesian Algorithms and the Implications of Overall Performance Statistics

Florian Nigsch,† Andreas Bender,‡,§ Jeremy L. Jenkins,‡ and John B. O. Mitchell*,†

†Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom; ‡Lead Discovery Informatics, Center for Proteomic Chemistry, Novartis Institutes for BioMedical Research, 250 Massachusetts Avenue, Cambridge, Massachusetts 02139; §Division of Medicinal Chemistry, Leiden/Amsterdam Center for Drug Research, Leiden University, Einsteinweg 55, 2333 CC, Leiden, The Netherlands

Received March 7, 2008

We compared two algorithms for ligand-target prediction, namely, the Laplacian-modified Bayesian classifier and the Winnow algorithm. A dataset derived from the WOMBAT database, spanning 20 pharmaceutically relevant activity classes with 13 000 compounds, was used for performance assessment in 24 different experiments, each of which was assessed using a 15-fold Monte Carlo cross-validation. Compounds were described by different circular fingerprints, ECFP_4 and MOLPRINT 2D. A detailed analysis of the resulting ≈2.4 million predictions led to very similar measures of overall accuracy for both classifiers, whereas we observed significant differences for individual activity classes. Moreover, we analyzed our data with respect to the numbers of compounds that are exclusively retrieved by one of the algorithms, but never by the other, or by neither of them. This provided detailed information that cannot be obtained by considering the overall performance statistics alone.

1. INTRODUCTION

Classification methods have a central role in contemporary drug discovery. They are used in virtual screening experiments to sift through digital collections of molecules and to decide to which class a given molecule most likely belongs. The class label is typically determined according to some user-defined threshold (e.g., active if the IC50 value lies below a given cutoff). [...]

If the score S_cx exceeds the threshold T_x for the wrong class (a false positive), the weights of the active features in w_c are multiplied by a "demotion" factor d, with d_min ≤ d < 1; we use p_max = 1.3 and d_min = 0.7. For a false positive, the weights of the active features are therefore reduced, according to w_j^new = d·w_j, whereas for a false negative the weights are increased according to w_j^new = p·w_j. The two multiplicative factors d and p are calculated by mapping the distance between score and threshold onto sigmoid functions bounded by [d_min, 1) and (1, p_max], respectively. We also use a threshold exclusion area ("thick threshold"): if the score of a correct prediction is within a certain neighborhood of the threshold, |S_cx − T_x| < ε·T_x, it is equally considered a mistake.32 Consequently, the classifier learns not only upon misclassification, but also when the classification was correct yet too close to the threshold. This leads to an increased separation between correct and incorrect predictions and, ultimately, to a more robust classifier. The "thick threshold" of ε = 0.15 is only used with OSBs.
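The multiplicative update described above can be sketched as follows. The bounds d_min = 0.7, p_max = 1.3, and ε = 0.15 are taken from the text, but the exact sigmoid used to map the score-threshold distance is not specified in this excerpt, so the particular functional form below (and all function names) are illustrative assumptions, not the authors' implementation.

```python
import math

# Bounds and thick-threshold width from the text; the sigmoid shape is assumed.
D_MIN, P_MAX, EPS = 0.7, 1.3, 0.15

def promotion(score, threshold):
    """Promotion factor p in (1, P_MAX], growing with the miss distance."""
    return 1.0 + (P_MAX - 1.0) / (1.0 + math.exp(-(threshold - score)))

def demotion(score, threshold):
    """Demotion factor d in [D_MIN, 1), shrinking as the overshoot grows."""
    return 1.0 - (1.0 - D_MIN) / (1.0 + math.exp(-(score - threshold)))

def winnow_update(w, active_features, score, threshold, is_target):
    """Multiplicative update of one class's weight vector w (dict: feature -> weight).

    A prediction counts as a mistake if it is wrong, or if it is correct but
    falls inside the thick threshold |score - threshold| < EPS * threshold.
    """
    if is_target and score < threshold * (1 + EPS):
        p = promotion(score, threshold)   # false negative (or too close): promote
        for f in active_features:
            w[f] = w.get(f, 1.0) * p      # unseen features start at weight 1.0
    elif not is_target and score > threshold * (1 - EPS):
        d = demotion(score, threshold)    # false positive (or too close): demote
        for f in active_features:
            w[f] = w.get(f, 1.0) * d
```

Because the update is multiplicative rather than additive, weights of irrelevant features decay exponentially fast, which is the property that lets Winnow cope with very large, sparse feature sets.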


Laplacian-Modified Naive Bayesian Classifier. This classifier relies on Bayes' theorem of conditional probabilities:33

P(A|B) = P(B|A)·P(A) / P(B)

Let A be the event that a compound is active, and let B be the event that a group of n features f_i are all present simultaneously in the compound:

P(B) = P(f_1 ∩ f_2 ∩ ··· ∩ f_n)

The assumption of the naive Bayesian classifier is that all features are independent. Two events are independent if and only if P(A ∩ B) = P(A)·P(B). It follows that, in the case of an arbitrary number m of independent events A_m, P(A_1 ∩ A_2 ∩ ··· ∩ A_m) = P(A_1)·P(A_2)···P(A_m). The conditional probability of the event of a compound being active, given that it contains a set of n features, is then

P(A|B) = P(A) · ∏_i^n P(f_i|A) / ∏_i^n P(f_i)
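Numerically, the independence assumption turns the posterior into a product of per-feature likelihood ratios. A toy illustration with invented counts (all numbers below are hypothetical, chosen only to show the arithmetic):

```python
from math import prod

# Toy dataset, purely illustrative: 40 of 100 compounds are active.
n_total, n_active = 100, 40
# Occurrences of each feature among the actives and among all compounds.
count_active = {"f1": 30, "f2": 15}
count_total  = {"f1": 50, "f2": 25}

p_a = n_active / n_total  # prior P(A)
features = ["f1", "f2"]

# Naive-Bayes posterior: P(A|B) = P(A) * prod_i [ P(f_i|A) / P(f_i) ]
p_a_given_b = p_a * prod(
    (count_active[f] / n_active) / (count_total[f] / n_total)
    for f in features
)
```

Here each factor P(f_i|A)/P(f_i) is 1.5, so the prior of 0.4 is boosted to a posterior of 0.9; a feature equally common in actives and inactives would contribute a factor of 1 and leave the prior unchanged.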

Let C_i and T_i be the respective numbers of occurrences of feature f_i in the actives and in the total dataset. The uncorrected conditional probability of a compound being active, given feature f_i, would then be

P(A|f_i) = C_i / T_i

For small values of T_i, i.e., when feature f_i is only scarcely present in the dataset, this estimate would be too large and distant from reality; instead, it should approach P(A) to reflect the dataset. For illustration, consider the case of a feature being present four times in the entire dataset, all of them in active compounds: as a result, P(A|f_i) would be 1. The corrected conditional probability P′(A|f_i) introduces D additional virtual samples to alleviate this problem:25

P′(A|f_i) = (C_i + D·P(A)) / (T_i + D)

If feature f_i is present very rarely, C_i → 0 and T_i → 0, and we find the desired result:

lim_{C_i,T_i→0} P′(A|f_i) = lim_{C_i,T_i→0} (C_i + D·P(A)) / (T_i + D) = P(A)

In the Laplacian correction, D = P(A)^(-1), and we find

P′(A|f_i) = (C_i + 1) / (T_i + P(A)^(-1)) = (C_i + 1) / (P(A)^(-1)·[T_i·P(A) + 1])

Division by P(A) yields the relative probability estimate P_rel:

P_rel(A|f_i) = P′(A|f_i) / P(A) = (C_i + 1) / (T_i·P(A) + 1)

The likelihood P_active of a compound being active, given a set of features f_i, is calculated from the relative estimates P_rel(A|f_i):

P_active = ∏_i P_rel(A|f_i)

To avoid potential numerical problems caused by the resulting very small numbers, and to obtain interpretable results, this is typically implemented using logarithms to yield a combined score S:

S = log P_active = log ∏_i P_rel(A|f_i) = ∑_i log P_rel(A|f_i)
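The scoring scheme above can be sketched directly. The function name and dictionary layout are our own choices, but the formula is exactly the Laplacian-corrected P_rel and the log-sum given in the text:

```python
import math

def log_score(features, class_counts, total_counts, p_active_prior):
    """Combined score S = sum_i log P_rel(A|f_i) with the Laplacian correction.

    class_counts[f]  = C_i, occurrences of f in the actives of this class;
    total_counts[f]  = T_i, occurrences of f in the whole dataset;
    p_active_prior   = P(A), the prior probability of the class.
    """
    s = 0.0
    for f in features:
        c = class_counts.get(f, 0)
        t = total_counts.get(f, 0)
        # P_rel(A|f_i) = (C_i + 1) / (T_i * P(A) + 1)
        s += math.log((c + 1) / (t * p_active_prior + 1))
    return s
```

A feature never seen in training (C_i = T_i = 0) contributes log 1 = 0 and is thus neutral; for multiclass prediction, one such model is built per class and the class with the highest S wins.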

We built such a model for every single class in our dataset. A Python script for this Laplacian-modified multiclass Bayesian classifier, including OSB and MySQL support, is available as Supporting Information, as is a comparison of our implementation with a commercial (SciTegic) one.

Figures of Merit. For both algorithms, we considered the class label corresponding to the highest score as the predicted class label. Starting from an n × n confusion matrix Z = (z_ij) for n classes, it can easily be shown that the sum of false positives, F_p = ∑_n f_p^n, equals the sum of false negatives, F_n = ∑_n f_n^n (the column-wise and row-wise marginal sums minus the diagonal elements, respectively). Thus, the Matthews correlation coefficient (see below) is not an unbiased measure of the overall accuracy of multiclass classification. Therefore, we use another measure, known as accuracy, which is defined as the fraction of correct predictions (FCP):

FCP = tr(Z) / N_tot

where tr(Z) is the trace of the confusion matrix and N_tot is the total number of predictions made. The FCP parameter accounts for the fraction of all correct predictions (diagonal elements of the confusion matrix) with respect to the total number of predictions made. To report the accuracy of prediction for a given class, we mainly use the Matthews correlation coefficient (MCC), which is calculated according to

MCC = (tp·tn − fp·fn) / √[(tp + fp)(tp + fn)(tn + fp)(tn + fn)]

where tp is the number of true positives, tn the number of true negatives, fp the number of false positives, and fn the number of false negatives. We provide an Excel spreadsheet with detailed results of every experiment that we performed as Supporting Information. This spreadsheet includes, for every single class and every experiment, the 15-fold Monte Carlo cross-validated figures (and standard deviations over the 15 runs) for true/false positives/negatives, positive/negative recall, positive/negative precision, MCC, and average percentage class correct (APCC).

4. RESULTS AND DISCUSSION

Overall Analysis of Experiments. To establish the performance of the two algorithms in different settings, we varied the following parameters for our experiments: (1) classification algorithm (n_1 = 2); (2) molecular fingerprint (CFP2, CFP3, or ECFP_4; n_2 = 3); (3) window size w for the creation of orthogonal sparse bigrams (w = 1 or w = 3; n_3 = 2); and (4) size of the training set (30% or 70%; n_4 = 2). Therefore, the number of possible combinations of parameters is n_1 × n_2 × n_3 × n_4 = 24. The cross-validated accuracies, in terms of FCP, with confidence intervals of 99% (CI 99%), for all 24 experiments are summarized in Table 2 and Figure 2.
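The experimental grid described above can be enumerated directly; a quick sanity check of the 24 combinations (variable names are our own):

```python
from itertools import product

# The four varied parameters, with the values listed in the text.
algorithms   = ["Bayes", "Winnow"]         # n1 = 2
fingerprints = ["CFP2", "CFP3", "ECFP_4"]  # n2 = 3
window_sizes = [1, 3]                      # n3 = 2
train_fracs  = [0.30, 0.70]                # n4 = 2

# Cartesian product: 2 * 3 * 2 * 2 = 24 experimental settings.
experiments = list(product(algorithms, fingerprints, window_sizes, train_fracs))
```

Each of these 24 settings was then evaluated with a 15-fold Monte Carlo cross-validation, giving the averages and standard deviations reported in Table 2.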


Table 2. Accuracy of the Different Experiments, Using the Fraction of Correct Predictions (FCP) and the Corresponding Standard Deviation (sFCP)^a

| group | method | window size, w^b | ε^c | % test | CFP2 FCP | CFP2 sFCP | CFP3 FCP | CFP3 sFCP | ECFP_4 FCP | ECFP_4 sFCP |
|-------|--------|------------------|------|--------|----------|-----------|----------|-----------|------------|-------------|
| A | Bayes  | 1 | –    | 30 | 0.6575 | 0.0072 | 0.6398 | 0.0056 | 0.6753 | 0.0041 |
| A | Winnow | 1 | 0    | 30 | 0.6482 | 0.0076 | 0.6417 | 0.0091 | 0.6687 | 0.0091 |
| A | Bayes  | 1 | –    | 70 | 0.6567 | 0.0052 | 0.6550 | 0.0042 | 0.6794 | 0.0044 |
| A | Winnow | 1 | 0    | 70 | 0.6173 | 0.0118 | 0.6257 | 0.0097 | 0.6469 | 0.0122 |
| B | Bayes  | 3 | –    | 30 | 0.6575 | 0.0072 | 0.6398 | 0.0056 | 0.6753 | 0.0041 |
| B | Winnow | 3 | 0.15 | 30 | 0.6805 | 0.0062 | 0.6624 | 0.0062 | 0.6959 | 0.0057 |
| B | Bayes  | 3 | –    | 70 | 0.6567 | 0.0052 | 0.6550 | 0.0042 | 0.6794 | 0.0044 |
| B | Winnow | 3 | 0.15 | 70 | 0.6724 | 0.0048 | 0.6657 | 0.0052 | 0.6896 | 0.0056 |

^a Numbers given for the fraction of correct predictions (FCP) are averages over 15 independent runs for each experiment. Groups A and B are as identified in Figure 2, where B includes OSBs. ^b Window size for the creation of OSBs. ^c "Thick threshold" parameter, which only applies to Winnow.

There are two groups (A and B) with different characteristics, as implied by the dashed vertical line in Figure 2. The difference in experimental setup between groups A and B is the use of orthogonal sparse bigrams (OSBs) in group B. For group A, the Bayesian classifier is almost always better than Winnow, whereas in group B the contrary is true. The inclusion of OSBs does not change the performance of the Bayesian classifier. In addition to this separation, the experiments fall into three groups according to the fingerprint used: the rank order of the fingerprints, with respect to increasing classification performance, is CFP3 < CFP2 < ECFP_4. The ECFP_4 fingerprints perform better than either of the other two fingerprints, and the unhashed circular fingerprints with a radius of 3 (CFP3) perform worse in all experiments than their counterpart with a radius of 2 (CFP2). The average FCPs of all 24 experiments lie within the range 0.60–0.70; the corresponding standard deviations span the range from 0.0041 (Bayes, ECFP_4) to 0.0122 (Winnow, ECFP_4, w = 1). Apart from one exception, the standard deviations are always smaller for the Bayesian classifier, and they become smaller for Winnow when OSBs are included. For the best-performing ECFP_4 fingerprints, all experiments lie within the very narrow interval between 0.6469 and 0.6959. The dependence of accuracy on the size of the training set is conditioned by the use of OSBs: without them, Winnow suffers losses upon reduction of the training set size from 70% to 30% in all cases, whereas the Bayesian classifier is practically immune to these changes. In contrast, when OSBs are used, Winnow is more resilient to changes in the size of the training set in all cases. When 30% of the data are used for training, the best accuracy (FCP = 0.6896) is achieved by Winnow, using ECFP_4 fingerprints and a window size of w = 3. The second-best result (FCP = 0.6794) is observed for the Bayesian classifier using ECFP_4 fingerprints, regardless of OSBs. Using 70% of the data for training, the best accuracy (FCP = 0.6959) is realized through the use of Winnow, ECFP_4 fingerprints, and a window size of w = 3.


In summary, the ECFP_4 fingerprints were determined to be superior to the others; the pair of classifiers with the highest accuracies using 70% of the data for testing were Winnow, with a window size of w = 3, and its counterpart, a Bayesian classifier without OSBs. Furthermore, the Bayesian classifier is insensitive to the inclusion or exclusion of OSBs.

Influence of OSBs. The creation of OSBs increases the total number of available features. Although the Bayesian classifier does not seem to depend on their presence or absence, the Winnow algorithm consistently profits from their inclusion (see Figure 3). For the two experiments using ECFP_4 fingerprints and 70% of the data for training, the inclusion of OSBs results in a relative increase of 4% in the overall accuracy (FCP) for Winnow. The use of additional features through the inclusion of OSBs increases the performance of the Winnow algorithm, in terms of MCC, on all 20 classes. For certain classes in the same pair of experiments, the inclusion of OSBs results in considerable increases in accuracy. Examples are observed with CDK4, CDK5, and mGluR5: the respective increases in MCC are 44%, 43%, and 18%. The fact that Winnow generally performs better when provided with a larger number of features was demonstrated earlier and was indeed the motivation of its inventor: to separate "relevant from irrelevant attributes".31 For all classes in all experiments reported here, the performance of Winnow improved upon the inclusion of OSBs, whereas the Bayesian classifier yields the same results regardless of the presence or absence of OSBs. This indicates that, in the case of the Winnow algorithm, the inclusion of OSBs allows the algorithm to account for nonlinearity in the response function.

Analysis on a Per Class Basis. In terms of positive recall, Bayes outperforms Winnow in 14 of the 20 classes (see Figure 4). The maximum absolute difference in performance is 0.1640 for the CDK5 class (Bayes > Winnow; the relative increase, with respect to Winnow, is 29%), followed by the kappa class, where Bayes is superior by 0.1160 absolute units (a relative increase, with respect to Winnow, of 26%). For the two cholecystokinins 1 and 2 (denoted CCK1 and CCK2, respectively), we see reversed performances: better results are achieved by Bayes for CCK1 (+14%, with respect to Winnow), whereas for CCK2, Winnow outperforms Bayes (+16%, with respect to Bayes). The Bayesian classifier retrieves 15% more compounds for the CDK1/cyclin B class than Winnow, whereas Winnow retrieves many more ligands of the mu receptor (+40%). Variations for other classes are smaller. For positive precision, Winnow outperforms Bayes in 15 of the 20 classes. In 12 of the 14 classes where Bayes has a higher recall of positives, it also has a lower positive precision; these classes are CA II, CCK2, CDK2/cyclin A, CDK4, CDK5, D2, Src, alpha 1a, gamma secretase, kappa, mGluR5, and mu (see Figure 5). In terms of average MCC, there are seven classes that exhibit an overall difference of >5% between the two classifiers (with respect to the lower of the two; see Figure 6). These seven classes are CCK2, CDK1/cyclin B, CDK4, CDK5, gamma secretase, kappa, and mu. The Bayesian classifier is better in three of them, notably CDK1, CDK5, and kappa, with relative increases, with respect to Winnow, of 6%, 10%, and 6%, respectively. For the remaining four

Table 3. Positive Recall (R_p), Positive Precision (P_p), and Matthews Correlation Coefficients (MCC) for the Two Best Classifiers, Using 70% of the Data for Testing^a

| class | Bayesian R_p | Bayesian P_p | Bayesian MCC | Winnow R_p | Winnow P_p | Winnow MCC | Δ R_p | Δ P_p | Δ MCC |
|-------|--------------|--------------|--------------|------------|------------|------------|-------|-------|-------|
| alpha 1a | 0.8979 | 0.8245 | 0.8514 | 0.8162 | 0.9203 | 0.8588 | 0.0817 | -0.0957 | -0.0074 |
| CA II | 0.9685 | 0.9978 | 0.9814 | 0.9924 | 0.9785 | 0.9840 | -0.0239 | 0.0194 | -0.0025 |
| CCK1 | 0.6673 | 0.5798 | 0.6008 | 0.5857 | 0.6823 | 0.6121 | 0.0816 | -0.1025 | -0.0113 |
| CCK2 | 0.6609 | 0.6825 | 0.6519 | 0.7691 | 0.6785 | 0.7033 | -0.1082 | 0.0040 | -0.0514 |
| CDK1/cyclin B | 0.6767 | 0.7164 | 0.6922 | 0.5900 | 0.7309 | 0.6520 | 0.0867 | -0.0145 | 0.0401 |
| CDK2/cyclin A | 0.7891 | 0.8680 | 0.8245 | 0.7882 | 0.9031 | 0.8409 | 0.0009 | -0.0351 | -0.0164 |
| CDK4 | 0.8333 | 0.6048 | 0.7076 | 0.7822 | 0.7157 | 0.7448 | 0.0511 | -0.1110 | -0.0372 |
| CDK5 | 0.7333 | 0.5918 | 0.6550 | 0.5694 | 0.6283 | 0.5939 | 0.1640 | -0.0365 | 0.0611 |
| D2 | 0.9049 | 0.9403 | 0.9095 | 0.9663 | 0.8612 | 0.8964 | -0.0614 | 0.0791 | 0.0131 |
| delta | 0.4495 | 0.4448 | 0.3751 | 0.4612 | 0.4512 | 0.3819 | -0.0116 | -0.0064 | -0.0068 |
| γ secretase | 0.9870 | 0.7887 | 0.8804 | 0.9732 | 0.9325 | 0.9520 | 0.0138 | -0.1438 | -0.0716 |
| kappa | 0.5617 | 0.4004 | 0.4106 | 0.4457 | 0.4490 | 0.3877 | 0.1160 | -0.0486 | 0.0230 |
| m1 | 0.5868 | 0.6019 | 0.5690 | 0.6247 | 0.6038 | 0.5885 | -0.0380 | -0.0019 | -0.0195 |
| m2 | 0.5390 | 0.5579 | 0.5239 | 0.5048 | 0.5750 | 0.5136 | 0.0343 | -0.0171 | 0.0103 |
| mGluR5 | 0.9196 | 0.9491 | 0.9336 | 0.9096 | 0.9557 | 0.9315 | 0.0100 | -0.0066 | 0.0020 |
| mu | 0.2592 | 0.3951 | 0.2417 | 0.3639 | 0.4050 | 0.2945 | -0.1048 | -0.0099 | -0.0528 |
| PDE4 | 0.9010 | 0.9688 | 0.9315 | 0.8925 | 0.9321 | 0.9082 | 0.0085 | 0.0367 | 0.0233 |
| PDE5 | 0.9724 | 0.8694 | 0.9161 | 0.9673 | 0.8861 | 0.9227 | 0.0051 | -0.0167 | -0.0065 |
| Src | 0.8833 | 0.8725 | 0.8758 | 0.8473 | 0.9339 | 0.8877 | 0.0360 | -0.0613 | -0.0118 |
| tubulin | 0.9248 | 0.9468 | 0.9346 | 0.9175 | 0.9459 | 0.9302 | 0.0073 | 0.0009 | 0.0044 |

^a Reported numbers are the averages over 15 independent runs for each experiment. ^b The Δ columns show the difference "Bayesian − Winnow" of the corresponding columns to the left.

Table 4. Distribution of Compounds Correctly Identified Exclusively by One of the Algorithms or by Neither of Them^a

| class | B NOT W | W NOT B | NOT (W OR B) | class size |
|-------|---------|---------|--------------|------------|
| alpha 1a | 52 (6.9) | 6 (0.8) | 29 (3.9) | 753 |
| CCK1 | 17 (2.7) | 18 (2.8) | 83 (13.0) | 639 |
| CCK2 | 1 (0.1) | 64 (8.6) | 37 (5.0) | 741 |
| CDK1/cyclin B | 5 (3.8) | 2 (1.5) | 11 (8.3) | 133 |
| CDK2/cyclin A | 7 (3.3) | 8 (3.8) | 22 (10.4) | 211 |
| CDK4 | 2 (4.7) | 0 (0.0) | 2 (4.7) | 43 |
| CDK5 | 6 (11.1) | 0 (0.0) | 4 (7.4) | 54 |
| delta | 28 (1.9) | 214 (14.5) | 100 (6.8) | 1473 |
| kappa | 39 (3.3) | 44 (3.7) | 67 (5.6) | 1191 |
| m1 | 6 (0.8) | 21 (2.7) | 38 (5.0) | 764 |
| m2 | 25 (3.7) | 13 (1.9) | 64 (9.6) | 670 |
| mGluR5 | 2 (1.9) | 0 (0.0) | 4 (3.8) | 105 |
| mu | 35 (2.1) | 409 (24.6) | 100 (6.0) | 1664 |
| PDE4 | 10 (1.8) | 3 (0.5) | 14 (2.6) | 549 |
| PDE5 | 2 (0.4) | 1 (0.2) | 5 (1.0) | 490 |
| Src | 4 (2.8) | 2 (1.4) | 2 (1.4) | 143 |
| tubulin | 1 (0.6) | 2 (1.1) | 2 (1.1) | 180 |
| CA II | 0 (0.0) | 13 (1.1) | 5 (0.4) | 1183 |
| D2 | 0 (0.0) | 38 (2.0) | 19 (1.0) | 1891 |
| γ secretase | 0 (0.0) | 1 (0.8) | 0 (0.0) | 118 |
| total | 242 (1.9) | 859 (6.6) | 608 (4.7) | 12995 |

^a B = Bayes, W = Winnow. Numbers shown in parentheses are the percentages, with respect to the total class sizes.

classes above the 5% threshold in difference (CCK2, CDK4, γ secretase, and the mu receptor), Winnow has a greater accuracy, by 8%, 5%, 8%, and 22%, respectively. For eight classes, the difference is between 5% and 1% (CCK1, CDK2/cyclin A, D2, delta, m1, m2, PDE4, and Src). For the remaining five classes (alpha 1a, CA II, mGluR5, PDE5, and tubulin), the difference in average MCC is less than 1%.