Ligand Prediction for Orphan Targets Using Support Vector

Training Based on Ligand Efficiency Improves Prediction of Bioactivities of Ligands and Drug Target Proteins in a Machine Learning Approach. Nobuyoshi...
0 downloads 7 Views 3MB Size
J. Chem. Inf. Model. 2009, 49, 2155–2167

2155

ARTICLES

Ligand Prediction for Orphan Targets Using Support Vector Machines and Various Target-Ligand Kernels Is Dominated by Nearest Neighbor Effects Anne Mai Wassermann, Hanna Geppert, and Ju¨rgen Bajorath* Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universita¨t Bonn, Dahlmannstrasse 2, D-53113 Bonn, Germany Received July 21, 2009

Support vector machine (SVM) calculations combining protein and small molecule information have been applied to identify ligands for simulated orphan targets (i.e., targets for which no ligands were available). The combination of protein and ligand information was facilitated through the design of target-ligand kernel functions that account for pairwise ligand and target similarity. The design and biological information content of such kernel functions was expected to play a major role for target-directed ligand prediction. Therefore, a variety of target-ligand kernels were implemented to capture different types of target information including sequence, secondary structure, tertiary structure, biophysical properties, ontologies, or structural taxonomy. These kernels were tested in ligand predictions for simulated orphan targets in two target protein systems characterized by the presence of different intertarget relationships. Surprisingly, although there were target- and set-specific differences in prediction rates for alternative target-ligand kernels, the performance of these kernels was overall similar and also similar to SVM linear combinations. Test calculations designed to better understand possible reasons for these observations revealed that ligand information provided by nearest neighbors of orphan targets significantly influenced SVM performance, much more so than the inclusion of protein information. As long as ligands of closely related neighbors of orphan targets were available for SVM learning, orphan target ligands could be well predicted, regardless of the type and sophistication of the kernel function that was used. These findings suggest simplified strategies for SVMbased ligand prediction for orphan targets. 1. INTRODUCTION

The search for active compounds is one of the major focal points of chemoinformatics research and applications for which machine learning approaches are increasingly utilized.1 The term Support Vector Machine (SVM) refers to an advanced machine learning methodology2,3 that was originally developed for binary object classification. For SVM learning, objects with known class labels are projected into a feature space, and the learning process generally attempts to identify a hyperplane in this space that best separates objects belonging to two classes. The resulting linear function is then applied to predict class labels of other objects. The use of kernel functions4,5 is a key feature of SVM learning because they make it possible to also solve nonlinear classification problems. In chemoinformatics, SVM has become increasingly popular for the classification of active versus inactive compounds and, in addition, the prediction of ligands for given protein targets.6-9 Ligand prediction via SVM includes investigations that aim at predicting active small molecules for target proteins for which no ligand information is available. This search for ligands of so-called orphan targets is highly relevant for computer-aided drug discovery and chemical biology. Regardless of the methods that are * Corresponding author phone: +49-228-2699-306; fax: +49-228-2699341; e-mail: [email protected].

applied for this purpose, finding ligands for orphan targets generally requires to relate biological target and chemical ligand information to each other. Using SVM, this task can be elegantly accomplished by designing separate kernel functions for pairs of proteins and pairs of ligands.8,9 State-of-the-art protein kernels for ligand prediction include, for example, a sequence homologybased classification kernel.9 Furthermore, among various ligand descriptors that can be used, 2D molecular fingerprints have been found to be efficient small molecule representations for SVM.10 Fingerprint similarity can be captured, for example, by encoding the Tanimoto coefficient (Tc)11 as a kernel function. However, many other types of kernel functions combining biological target and chemical ligand information can be envisioned, and one might expect that the information content of such kernels is a critical factor for SVM-based ligand prediction. In this study, we have investigated the role of kernel functions for the prediction of ligands for orphan targets with a particular focus on target kernels capturing protein information at rather different levels, beyond sequence similarity. We have carried out calculations using three different SVM strategies, including standard SVM, the recently introduced SVM linear combination,9 and targetligand kernel SVM.8,9 Three ligand kernels were tested that compare small molecules in different ways, utilizing

10.1021/ci9002624 CCC: $40.75  2009 American Chemical Society Published on Web 09/25/2009

2156

J. Chem. Inf. Model., Vol. 49, No. 10, 2009

2D fingerprints as molecular representations. For the linear combination and target-ligand kernel SVM strategies that learn from multiple targets, eleven alternative and conceptually different target kernel functions were designed. The three SVM strategies were applied to search for inhibitors of individual proteases in two different target sets that were regarded as orphan targets and hence not included during SVM learning. The results obtained in systematic search calculations using alternative kernel functions were rather unexpected, and their further analysis revealed a strong dependence of successful ligand prediction on the nearest neighbor reference target of a simulated orphan target, irrespective of the kernel functions and SVM search strategies that were used. Thus, the information provided by the reference target most closely related to an orphan target largely determined the success rate of ligand predictions. 2. SUPPORT VECTOR MACHINE THEORY 2.1. Simple SVM. SVMs are machine learning algorithms for binary object classification. A linear decision function is built based on a training data set to associate class labels of objects with feature vectors. SVM learning for the purpose of virtual compound screening makes use of training examples (xi,yi) with xi being the feature vector (fingerprint representation) and yi ∈{-1,+1} the class label (positive or negative; active or inactive) of a training compound. By solving a convex quadratic optimization problem, SVM derives the normal vector w and the scalar b to define a hyperplane H ){x|〈w,x〉+b ) 0} that best separates positive from negative training examples. The classification of an unknown test molecule x is based on the decision function f(x) ) sgn(〈x,w〉+b), i.e. compounds with f(x) ) +1 are assigned to the positive class and those with f(x) ) -1 to the negative class. To permit classification functions that do not linearly depend on the training data, scalar products 〈 · , · 〉 occurring in the objective function of the SVM optimization problem can be replaced by a kernel function K( · , · ).4,5 Although SVM was originally developed for binary classification, the approach can also be adapted for multiclass predictions12 or database ranking. In order to transform the classification approach into a ranking function, test molecules are sorted in descending order of g(x) ) K(x,w). This is equivalent to ranking the test compounds by their signed distance from the hyperplane, i.e. from the most distant compound located on the side of the positive training class to the most distant compound on the side of the negative training class. For simple SVM ranking or what we call “homology-based” SVM, ligands of the most closely related target among the reference targets were used as the positive training class and a randomly chosen subset of the screening database as the negative training class to learn the ranking function g(x). 2.2. Target-Ligand Kernel. In recent chemogenomicsoriented studies,7-9 not only compounds but also proteincompound pairs were used to train SVM in order to enable learning and classification for multiple targets. For this purpose, a target-ligand kernel (TLK) was defined to compare two different target-ligand pairs and calculate the scalar products for target-ligand pairs during SVM optimization and

WASSERMANN

ET AL.

ranking. Given the target-ligand pairs (t, xi) and (t′, xj), the target-ligand kernel is generally defined as the product of two separate kernels for the target pair and the ligand pair K((t, xi)(t′, xj)) ) Ktarget(t, t′) × Kligand(xi, xj) The design principle of target-ligand kernels is illustrated in Figure 1. Independent kernels for protein and ligand representations are combined to account for pairwise target and ligand similarities. For SVM training, each reference target was combined with its true ligands in the training set to generate positive training examples and randomly selected screening database molecules were combined with each reference target to build negative training examples. For ranking, test compounds x were combined with the orphanized target torphan for which no known ligands were available during training and the pairs (torphan, x) were ranked according to the signed distance from the hyperplane H, as described in section 2.1. 2.3. Linear Combination. The SVM linear combination (LC) technique was recently introduced.9 For each reference target ti, an individual weight vector wi was derived by learning an SVM classification function with the known ligands of ti as positive training examples and a randomly chosen subset of the screening database as negative examples. To then obtain a ranking function for the orphan target torphan, the weight vector worphan was generated by linearly combining the individual weight vectors wi of the reference targets. Values of the target kernels described in section 3 were used as linear factors so that the linear combination was directly comparable to the SVM TLK strategy worphan )

∑ Ktarget(torphan, ti)wi

For database ranking, all test compounds were sorted according to the value of g(x) ) Kligand(x, worphan) 3. KERNEL DESIGN

Three different ligand kernels were applied for the search strategies described above, using fingerprints as ligand representation. (1) Gaussian kernel (radial basis function kernel)13 KGaussian(xi, xj) ) exp(-γ|xi - xj | 2) (2) Tanimoto kernel14 KTanimoto(xi, xj) )

〈xi, xj〉 〈xi, xi〉 + 〈xj, xj〉 - 〈xi, xj〉

(3) Linear kernel that corresponds to the standard scalar product. In addition to the ligand kernel component, SVM TLK and LC require the calculation of a target component. The 11 target kernels considered in this study differ significantly in their design, information content, and complexity. (a) Uniform kernel between two targets (t,t′) Kuniform(t,t′) ) 1

LIGAND PREDICTION

FOR

ORPHAN TARGETS

J. Chem. Inf. Model., Vol. 49, No. 10, 2009 2157

Figure 1. Target-ligand kernels. The comparison of two target-ligand pairs via a target-ligand kernel function is divided into two independent tasks. In this case, the similarity of protein targets is quantified by sequence comparison, while ligand similarity is assessed through comparison of fingerprint representations. The product of the two similarity scores is taken to recombine target and ligand information. Table 1. Target and Ligand Data Set 1 target

abbreviation

MEROPS identifier

PDB entry

number of ligands

nearest neighbor target

angiotensin-converting enzyme 2 calpain 2 caspase 3 cathepsin D cathepsin L glutamate carboxypeptidase 2 methionine aminopeptidase 2 matrix metalloprotease 2 matrix metalloprotease 8 renin thrombin trypsin

ace2 cal2 cas3 catD catL mgcp metap2 mmp2 mmp8 ren thr try

M02.006 C02.002 C14.003 A01.009 C01.032 M28.010 M24.002 M10.003 M10.002 A01.007 S01.217 S01.127

1r42 1kfu 1cp3 1lyb 1mhw 2oot 1b6a 1qib 1bzs 2ren 1ppb 1trn

28 49 264 70 78 14 254 83 16 164 281 58

mmp2 catL catL ren cal2 mmp8 mgcp mmp8 mmp2 catD try thr

For each target protein, its MEROPS identifier, a corresponding Protein Data Bank (PDB) entry,39 the number of ligands, and the nearest neighbor target are reported. The MEROPS identifier is composed of a family-based component (e.g. thrombin and trypsin both belong to the family S01) and an individual target-based component.

In this case, differences between targets are not considered. For the TLK search strategy, using Kuniform corresponds to pooling the training molecules for all proteins and deriving the SVM on the pooled compounds. (b) Needle kernel is the protein sequence identity SI for a protein pair (t,t′) computed using the Needleman-Wunsch algorithm for pairwise global sequence alignment implemented in EMBOSS15 Kneedle(t,t′) ) SI(t,t′) (c) Water kernel. Each protein pair (t,t′) is also subjected to pairwise local sequence alignment using the SmithWaterman algorithm implemented in EMBOSS, and the alignment scores SSW(t,t′) are expressed in logarithmic form Kwater(t,t′) ) ln SSW(t,t′)

(d) PROFEAT kernel. The PROFEAT server16 computes 1447 protein descriptors from protein sequence including descriptors developed by Dubchak et al.17 that account for the composition, transition, and distribution of structural and physicochemical properties such as hydrophobicity, polarity, charge, and solvent accessibility. Each descriptor is separately normalized to the value range [0,1], and each target t is represented by a vector ΦP(t) of 1447 normalized descriptor values. The PROFEAT kernel is then defined as KPROFEAT(t,t′) ) 〈ΦP(t), ΦP(t′)〉 (e) Spectrum kernel is a string kernel introduced by Leslie et al.18 It compares sequence strings representing k-mers. Here conventional 3-mers were computed for

2158

J. Chem. Inf. Model., Vol. 49, No. 10, 2009

WASSERMANN

ET AL.

mented in the SSEA Web server,20 which yields a score SSS(t,t′) in the range [0,100]. This score is directly used as target kernel KSSEA(t,t′) ) SSS(t,t′) (g) The GO kernel. Gene Ontology (GO)21 terms of the Molecular Function category are extracted for all protein targets from the UniProt Knowledgebase.22,23 The GO kernel for a target pair (t,t′) counts the number of identical GO terms in the GO term sets of t and t′.24 (h) CleaVage kernel. Peptidases act on specific substrates and their catalytic activity is often restricted to specific sequence recognition sites. For all targets, available cleavage sites of their substrates are extracted from the MEROPS25 and CutDB26 databases that collect cleavage sites in natural and synthetic substrates. Cleavage site patterns are reduced to two residues on either side of the scissile bond, and for each target, a position-specific frequency matrix is generated. The columns of the matrix are then concatenated to form a 4 × 20 dimensional feature vector ΦC(t), and the cleavage kernel is calculated as follows Kcleavage(t,t′) ) Figure 2. Target relationships. The relationships between the proteases in the target sets 1 (top) and 2 (bottom) are illustrated. The MEROPS classification scheme25 (i.e., type, clan, family, and subfamily) is applied. From the “subfamily” to the “type” level, target similarity is fading away.

target sequences. Each protein t is represented by a 203 dimensional vector ΦS(t) (for 20 amino acids) where each dimension corresponds to a possible string of three amino acids and reports the count of the number of occurrences of this fragment in the sequence of t. To account for different lengths of protein sequences, the kernel is normalized as follows Kspectrum(t,t′) )

〈ΦS(t), ΦS(t′)〉

√〈ΦS(t), ΦS(t)〉〈ΦS(t′), ΦS(t′)〉

(f) SSEA kernel. For each target, the secondary structure is predicted by PSIPRED19 resulting in a string of residues each represented by one of three letters for the states helix, strand, or coil. Strings for a target pair (t,t′) are then globally aligned using the dynamic programming algorithm imple-

〈ΦC(t), ΦC(t′)〉

√〈ΦC(t), ΦC(t)〉〈ΦC(t′), ΦC(t′)〉

(i) SCOP kernel. The SCOP database27 is hierarchically structured into protein folds, superfamilies, families, and domains and can be represented as a directed acyclic graph (DAG). Each target t is represented by an n-dimensional feature vector ΦSc(t) where n is the number of nodes in the graph and each feature is assigned a value of 1 if the corresponding node occurs in t’s SCOP hierarchy and 0 otherwise. The SCOP kernel is then defined as follows KSCOP(t,t′) ) 2〈ΦSc(t),ΦSc(t′)〉 (j) Topmatch kernel. All protein targets are represented by a 3D substructure comprising all amino acids within an 8 Å radius of the target’s catalytic residues. Residues falling within this radius are computed with MOE28 and thensubjectedtostructurecomparisonusingTopMatch-web.29,30 For a target pair (t,t′), TopMatch-web computes a relative similarity score ST within the range [0,100] that is directly used as the target kernel KTopmatch(t,t′) ) ST(t,t′)

Table 2. Target and Ligand Data Set 2 target

abbreviation

MEROPS identifier

PDB entry

number of ligands

nearest neighbor target

calpain 1 calpain 2 caspase 1 caspase 3 cathepsin B cathepsin K cathepsin L cathepsin S factor Xa thrombin trypsin

cal1 cal2 cas1 cas3 catB catK catL catS faXa thr try

C02.001 C02.002 C14.001 C14.003 C01.060 C01.036 C01.032 C01.034 S01.216 S01.217 S01.127

1tlo 1kfu 1ice 1gfw 1gmy 1yk7 1mhw 1ms6 1mq5 1ppb 1trn

46 49 21 264 17 223 78 221 783 281 58

cal2 cal1 cas3 cas1 catS catS catK catK thr faXa faXa

For each target protein, its MEROPS identifier, a corresponding Protein Data Bank (PDB) entry,39 the number of ligands, and the nearest neighbor target are reported.

LIGAND PREDICTION

FOR

ORPHAN TARGETS

J. Chem. Inf. Model., Vol. 49, No. 10, 2009 2159

Table 3. Methods and Calculations search strategy simple SVM screening database reference targets inactive training class active training class ligand kernel target kernel fingerprints trials

target-ligand kernel

linear combination

subset of ZINC7, 100,000 compounds 1 (nearest neighbor target) all except the orphan target (i.e., 11 targets for data set 1 and 10 for set 2) 1000 ZINC7 compounds 5 inhibitors for the reference target 5 inhibitors per reference targets (i.e., a total of 55 inhibitors for data set 1 and 50 for set 2) Gaussian, linear, Tanimoto linear 11 different kernels (uniform, needle, water, PROFEAT, spectrum, SSEA, GO, cleavage, SCOP, Topmatch, MEROPS) MACCS, TGD 10 with different randomly selected compound reference sets

(k) MEROPS kernel. The MEROPS database25 is hierarchically structured into catalytic types, so-called protein clans, families, and subfamilies and can also be visualized as a DAG. Hence, ΦM(t) can be defined analogously to ΦSc(t), and the MEROPS kernel is given by KMEROPS(t,t′) ) 2〈ΦM(t),ΦM(t′)〉 Some of the above-mentioned functions are not generally valid kernel functions because they are not necessarily positive semidefinite for all protein sets (i.e., for a given set of proteins, an all-versus-all matrix of scores might have some negative eigenvalues). It has been shown that a symmetric matrix can be converted into a positive semidefinite matrix by subtracting its smallest negative eigenvalue from the diagonal.31 However, for the analysis presented herein, this conversion was not required because all score matrices were positive semidefinite for our data sets.

multiple protease targets was only assigned to the target it was most potent against. The second target set included 11 proteases and was described previously.9 These targets included the cysteine proteases calpain 1 and 2, caspase 1 and 3, and cathepsins B, L, K, and S, and the serine proteases factor Xa, thrombin, and trypsin, as summarized in Table 2. The relationship between these targets is also illustrated in Figure 2. Ligand sets for these targets were assembled as described above. For proteases shared among both target sets (i.e. calpain 2, caspase 3, cathepsin L, thrombin, and trypsin), the same ligand sets were used. As illustrated in Figure 2, the intertarget and nearest neighbor relationships differed beTable 4. Search Results for Ligand Prediction Using Simple SVM (Set 1)b kernel linear 100a

1000a

ace2 cal2 cas3 catD catL mgcp metap2 mmp2 mmp8 ren thr try average

0.0 19.1 2.6 15.4 10.4 0.0 0.0 49.6 49.1 30.3 27.7 36.9 20.1

0.9 60.2 9.5 51.5 34.7 0.0 0.5 77.1 59.1 57.9 70.3 61.4 40.3

ace2 cal2 cas3 catD catL mgcp metap2 mmp2 mmp8 ren thr try average

0.9 4.8 0.6 40.5 8.0 0.0 0.0 75.6 57.3 44.3 28.9 46.9 25.6

14.8 30.7 8.5 72.2 23.6 0.0 0.0 94.0 67.3 59.1 80.5 74.4 43.8

4. TARGET AND LIGAND SYSTEMS

Two sets of reference targets were assembled that represented different degrees of intertarget relationships. The first target set included 12 proteases belonging to nine different families (Table 1) and showing four different catalytic mechanisms: cathepsin D and renin are aspartate proteases; thrombin and trypsin serine proteases; cathepsin L, calpain 2, and caspase 3 cysteine proteases; and matrix metalloproteases 2 and 8, methionine aminopeptidase 2, glutamate carboxypeptidase 2, and angiotensin-converting enzyme 2 are metalloproteases. Proteases possessing the same catalytic machinery can either be closely or distantly related in sequence, as illustrated in Figure 2, which organizes targets into clans, families, and subfamilies following the classification scheme of the MEROPS peptidase database. Based on the MEROPS hierarchy, the nearest neighbor target for each protease in our test set was determined. If several nearest neighbor candidates were suggested for a given target based on the MEROPS hierarchy, the protease with highest sequence identity to the target was chosen. For all 12 targets, ligand sets were assembled from the MDL Drug Data Report (MDDR),32 the BindingDB database,33,34 and original literature sources. In total, 1359 different protease inhibitors were collected. As reported in Table 1, each ligand set contained between 14 and 281 compounds that had a potency of at least 1 µM (Ki or IC50) against the target. Ligand sets were mutually exclusive in their composition, i.e. a compound reported to inhibit

Tanimoto 100a

Gaussian

1000a

100a

1000a

MACCS 0.0 21.4 1.9 17.9 9.9 0.0 0.0 55.3 49.1 27.3 26.5 37.3 20.5

0.4 64.8 9.7 56.5 35.6 1.1 0.0 79.2 60.0 56.5 69.8 61.5 41.3

0.0 22.1 2.4 16.0 10.0 0.0 0.0 54.7 50.9 32.0 28.0 38.5 21.2

0.9 63.4 10.0 54.5 35.1 0.0 0.2 78.7 59.1 58.2 71.3 61.2 41.0

TGD 0.4 3.6 0.1 33.2 7.3 0.0 0.0 78.0 57.3 41.6 29.0 45.2 24.6

14.4 28.4 3.9 71.2 27.0 3.3 0.5 94.7 62.7 60.0 80.4 81.5 44.0

0.4 5.0 0.2 38.6 6.2 0.0 0.0 77.6 57.3 41.3 29.7 46.2 25.2

17.4 30.2 5.4 77.1 25.2 0.0 0.0 95.3 68.2 59.9 81.1 78.7 44.9

a Set size. b Recovery rates (in %) are reported for all targets in data set 1 averaged over 10 independent trials per target. The results reported for the Gaussian kernel were obtained with the parameter γ set to 0.01.

2160

J. Chem. Inf. Model., Vol. 49, No. 10, 2009

WASSERMANN

ET AL.

Figure 3. Ligand prediction for target set 1. For the MACCS and TGD fingerprints, recovery rates for SVM TLK and LC strategies are shown for 11 alternative target kernels and selection sets of 1000 database compounds. Recovery rates are averaged over all 12 targets and 10 independent search trials per target.

tween target sets 1 and 2. Whereas each target in set 2 had a nearest neighbor that belonged to the same subfamily, several targets in set 1 had nearest neighbors sharing the same catalytic mechanism but lacking further evidence of evolutionary relationships. The different intertarget relationships found in data sets 1 and 2 were explored in SVM modeling and ligand-target prediction. 5. SEARCH CALCULATIONS

The performance of alternative kernel functions and SVM ranking strategies was evaluated in systematic search calculations on our two protease systems using MACCS structural keys35 and the TGD fingerprint28 as ligand descriptors. TGD represents a two-point pharmacophoretype fingerprint that is calculated from the 2D connectivity table of a molecule. As a background database for SVM analysis, 100,000 compounds were randomly chosen from ZINC7.36 As negative training examples, 1000 database compounds were randomly selected in each case. For simple SVM calculations on each target, only five inhibitors of the nearest neighbor target were used as positive training molecules. For SVM LC and TLK, five inhibitors for each target were used. The inhibitor set of the orphanized target was not used during SVM learning but added to the background database as potential database hits during testing. Kernel functions, SVM strategies, and fingerprint descriptors were systematically combined. For each investigated combination of a kernel function, SVM strategy, and

fingerprint, 10 different randomly selected compound training and test sets were analyzed and the search results were averaged. Table 3 summarizes the different methods and calculation settings. As a measure of performance, recovery rates (RR: number of correctly identified orphan target inhibitors divided by their total number) were calculated for database selection sets of increasing size and averaged over the 10 independent trials per target. All calculations were carried out using SVMlight,37 a freely available SVM implementation.38 All calculation parameters were SVMlight default settings to ensure reproducibility of the calculations. Perl scripts were applied to calculate SVM linear combinations. 6. GLOBAL KERNEL PERFORMANCE

We first investigated the relative performance of ligand kernels in simple SVM calculations searching for active compounds on the basis of randomly selected reference sets. Table 4 reports the compound recovery rates for activity classes of target set 1, the MACCS and TGD fingerprints, and database selection sets of 100 and 1000 compounds. In these calculations, all three ligand kernels produced comparable recovery rates. In some instances, the search calculations failed for any kernel, and, in others, high recovery rates were consistently observed. Because there was no apparent preference for a ligand kernel in our test calculations, we selected the linear kernel, which has the lowest computational

LIGAND PREDICTION

FOR

ORPHAN TARGETS

J. Chem. Inf. Model., Vol. 49, No. 10, 2009 2161

Table 5. Search Results for Ligand Prediction Using SVM LC (Set 1)a uniform

needle

water

PROFEAT

spectrum

SSEA

ace2 cal2 cas3 catD catL mgcp metap2 mmp2 mmp8 ren thr try average

20.9 65.2 35.4 76.0 32.6 0.0 9.6 17.7 19.1 35.3 49.5 37.1 33.2

18.7 63.9 29.9 78.0 31.4 0.0 7.6 41.7 50.0 55.2 66.6 54.4 41.4

20.0 65.2 35.4 86.3 32.9 0.0 9.5 28.6 30.9 46.4 57.2 41.4 37.8

20.4 63.6 35.1 77.1 32.6 0.0 9.3 18.5 18.2 37.0 51.2 38.1 33.4

MACCS 18.7 20.0 65.9 64.6 34.9 34.9 92.5 80.0 31.6 32.9 1.1 0.0 8.8 9.8 39.5 22.6 32.7 24.6 54.2 45.2 50.7 55.5 37.3 42.1 39.0 36.0

ace2 cal2 cas3 catD catL mgcp metap2 mmp2 mmp8 ren thr try average

30.0 33.2 45.8 78.5 14.8 0.0 33.4 24.5 33.6 59.3 60.3 60.8 39.5

32.6 24.3 45.1 83.2 13.7 0.0 29.4 52.7 66.4 59.6 74.7 67.7 45.8

31.7 32.5 46.2 83.9 14.8 0.0 33.8 34.4 43.6 60.0 64.8 65.0 42.6

30.9 29.3 45.8 78.3 14.8 0.0 33.6 26.0 33.6 59.4 61.1 61.2 39.5

30.9 27.3 46.2 86.9 15.1 0.0 34.0 43.2 56.4 59.9 61.8 62.7 43.7

GO

cleavage

SCOP

Topmatch

MEROPS

17.8 64.3 35.3 89.5 36.6 2.2 9.9 34.7 36.4 49.9 62.3 40.2 39.9

10.0 66.8 36.3 90.2 38.6 0.0 8.5 33.5 33.6 55.0 58.8 46.2 39.8

4.8 74.6 34.0 73.9 41.8 0.0 10.1 61.7 55.5 57.8 74.0 61.0 45.8

17.8 63.4 35.5 85.2 37.1 1.1 9.0 68.6 55.5 55.7 67.5 57.5 46.2

10.9 79.1 35.2 74.2 41.1 1.1 9.8 74.4 57.3 57.6 73.7 61.4 48.0

46.5 42.1 47.8 84.2 17.0 0.0 32.8 39.5 49.1 60.0 76.5 63.3 46.6

25.7 38.6 49.5 87.2 18.6 0.0 18.3 30.4 46.4 59.3 76.3 63.7 42.8

33.0 36.1 45.1 83.9 19.0 0.0 35.9 78.3 70.9 59.4 80.9 72.9 51.3

27.4 32.3 48.8 83.5 15.2 0.0 27.3 71.4 73.6 59.8 79.2 73.1 49.3

37.4 44.1 49.0 83.5 18.5 0.0 39.6 89.9 72.7 59.4 80.7 73.7 54.0

TGD 31.7 31.4 45.4 82.9 15.1 0.0 33.5 26.8 36.4 60.2 66.0 65.0 41.2

a Recovery rates (in %) for all targets in data set 1 are reported for selection sets of 1.000 compounds averaged over 10 independent trials per target.

Table 6. Search Results for Ligand Prediction Using SVM TLK (Set 1)a uniform

needle

water

PROFEAT

spectrum

SSEA

ace2 cal2 cas3 catD catL mgcp metap2 mmp2 mmp8 ren thr try average

20.9 46.4 27.9 58.5 28.8 2.2 8.6 22.7 34.6 21.2 55.8 39.0 30.5

20.4 55.5 28.2 70.3 30.8 0.0 7.1 49.2 55.5 54.3 68.7 57.1 41.4

24.4 59.3 32.8 75.2 33.8 1.1 8.1 46.7 58.2 42.1 67.0 51.7 41.7

23.5 40.9 30.4 66.8 29.7 2.2 8.4 23.6 36.4 27.6 58.1 41.5 32.4

MACCS 20.4 23.0 54.3 43.6 34.9 31.4 90.2 71.5 30.8 27.3 1.1 0.0 8.5 8.4 48.3 34.4 44.6 47.3 54.0 39.3 52.8 66.2 37.3 53.3 39.8 37.1

ace2 cal2 cas3 catD catL mgcp metap2 mmp2 mmp8 ren thr try average

39.6 30.9 45.4 80.0 12.9 1.1 42.3 34.0 46.4 57.8 65.8 76.7 44.4

37.0 21.4 45.1 78.6 12.1 0.0 35.4 71.9 71.8 59.5 75.4 74.0 48.5

38.3 32.7 46.3 85.7 13.8 0.0 41.2 56.9 70.0 60.6 72.5 81.7 50.0

38.7 25.0 45.9 79.2 13.2 0.0 41.8 35.0 48.2 58.6 66.8 78.9 44.3

34.8 24.8 45.4 83.2 14.3 0.0 37.4 56.0 70.0 60.1 62.4 72.9 46.8

GO

cleavage

SCOP

Topmatch

MEROPS

0.4 61.8 25.9 81.1 31.0 3.3 4.6 40.6 47.3 58.6 36.4 43.9 36.2

3.9 79.6 34.9 87.5 39.7 1.1 5.5 50.4 56.4 49.1 65.9 52.9 43.9

4.4 74.1 36.0 68.5 41.6 0.0 10.0 65.3 56.4 57.5 72.9 61.4 45.7

17.8 62.5 35.9 67.4 35.3 1.1 7.9 70.1 59.1 55.7 69.6 59.2 45.1

11.3 78.0 36.1 68.6 42.3 0.0 9.0 73.1 57.3 57.4 72.8 60.6 47.2

11.3 36.1 42.6 83.2 16.6 1.1 34.5 53.2 68.2 60.2 67.5 74.8 45.8

19.6 38.2 48.8 82.9 19.7 0.0 20.9 60.4 75.5 59.7 80.3 76.0 48.5

32.6 40.9 43.9 80.6 18.6 0.0 38.6 87.7 70.9 59.8 80.8 78.7 52.8

27.8 36.8 48.8 80.8 14.3 0.0 33.5 88.6 69.1 59.7 80.7 84.4 52.0

43.9 44.8 48.6 77.9 19.2 0.0 42.1 91.8 70.9 59.6 80.8 79.2 54.9

TGD 44.8 24.1 45.2 83.9 12.9 0.0 42.0 43.9 51.8 60.8 75.2 83.7 47.4

a Recovery rates (in %) for all targets in data set 1 are reported for selection sets of 1000 compounds averaged over 10 independent trials per target.

complexity of the three, for further calculations and combined this kernel with the 11 different target kernels. Next we tested the 11 target kernels in SVM TLK and LC ligand prediction calculations. Average results for target

set 1 and selection sets of 1000 database compounds are shown in Figure 3, and Tables 5 and 6 report recovery rates on a per target basis. Supplementary Figure S1 and Tables S1 and S2 report corresponding results for database selection

2162

J. Chem. Inf. Model., Vol. 49, No. 10, 2009

WASSERMANN

ET AL.

Figure 4. Ligand prediction for target set 2. Search results for target set 2 are shown corresponding to Figure 3. Table 7. Search Results for Ligand Prediction Using SVM LC (Set 2)a uniform

needle

water

PROFEAT

spectrum

SSEA

cal1 cal2 cas1 cas3 catB catK catL catS faXa thr try average

76.8 90.5 35.0 36.7 79.2 45.2 57.0 44.4 10.7 37.8 39.8 50.3

67.6 92.5 30.0 46.3 83.3 55.5 74.1 62.1 27.8 63.9 57.0 60.0

77.3 95.2 36.3 40.3 80.0 46.3 59.7 43.2 15.3 49.5 45.1 53.5

77.3 92.3 35.0 37.0 79.2 46.3 58.6 46.1 11.3 39.5 40.8 51.2

MACCS 67.8 77.1 87.3 94.1 36.9 36.3 43.9 40.6 81.7 80.8 54.6 48.5 71.6 62.1 60.6 50.3 24.3 15.9 51.3 49.8 46.0 45.9 56.9 54.7

cal1 cal2 cas1 cas3 catB catK catL catS faXa thr try average

44.6 81.1 56.3 50.6 65.8 49.0 22.6 16.6 35.9 76.5 58.3 50.7

45.4 88.6 70.0 52.9 66.7 48.4 27.0 26.9 52.7 80.4 75.3 57.7

44.9 84.1 58.1 51.7 65.8 50.9 24.0 16.9 43.3 80.1 63.0 53.0

44.9 83.0 56.3 50.8 65.8 49.6 22.5 17.2 37.2 77.1 59.4 51.2

44.6 88.2 63.8 52.7 67.5 47.4 29.2 28.2 51.6 80.3 66.2 56.3

GO

cleavage

SCOP

Topmatch

MEROPS

76.1 98.6 39.4 32.1 86.7 49.1 63.2 50.9 16.1 67.6 44.5 56.8

76.8 98.9 40.0 45.0 76.7 48.1 67.8 48.0 22.8 54.2 51.1 57.2

73.2 99.6 50.0 47.0 82.5 53.6 73.7 57.4 38.0 72.4 61.3 64.4

75.6 98.0 40.0 45.8 85.8 53.8 74.0 55.8 33.8 68.5 57.2 62.6

69.0 95.7 50.0 47.0 77.5 52.1 73.3 60.7 38.0 72.4 61.3 63.4

46.8 83.0 53.8 50.6 70.8 46.3 23.4 19.7 46.9 81.4 63.6 53.3

46.1 84.1 72.5 53.3 64.2 49.6 25.5 16.5 51.0 80.1 69.3 55.6

48.3 89.1 86.9 54.1 65.0 49.2 28.9 25.1 56.0 82.0 84.5 60.8

47.1 87.1 76.9 53.6 66.7 50.4 28.5 23.2 55.8 80.8 76.8 58.8

46.3 89.3 86.9 54.1 65.0 45.8 27.7 28.0 56.0 82.0 84.5 60.5

TGD 45.4 83.4 58.1 51.8 67.5 50.1 23.6 18.2 44.0 80.0 63.4 53.2

a Recovery rates (in %) for all targets in data set 2 are reported for selection sets of 1000 compounds averaged over 10 independent trials per target.

sets of 100 compounds. The SVM TLK and LC search strategies were found to produce similar average compound recovery rates of approximately 30%-55% for both fingerprints and all target kernels for a database selection set size

of 1000 compounds. Moreover, there was relatively little variation in kernel performance, much less so than anticipated given the significant differences in target kernel complexity and encoded protein information. Equivalent trends were

LIGAND PREDICTION

FOR

ORPHAN TARGETS

J. Chem. Inf. Model., Vol. 49, No. 10, 2009 2163

Table 8. Search Results for Ligand Prediction Using SVM TLK (Set 2)a uniform

needle

water

PROFEAT

spectrum

SSEA

cal1 cal2 cas1 cas3 catB catK catL catS faXa thr try average

74.6 71.6 29.4 32.3 78.3 40.5 53.3 37.6 16.1 42.4 40.2 46.9

70.5 84.8 32.5 46.0 80.0 56.5 70.7 62.3 33.8 67.8 60.9 60.5

77.3 84.8 34.4 42.0 80.8 46.0 58.2 37.2 28.6 60.3 52.1 54.7

77.1 81.1 31.3 36.0 82.5 45.2 57.3 42.9 19.3 45.6 43.6 51.1

MACCS 72.2 77.1 81.6 84.8 39.4 35.6 45.4 42.2 83.3 83.3 55.7 48.6 70.1 61.2 61.4 47.9 29.6 29.9 54.2 60.2 47.4 53.6 58.2 56.8

cal1 cal2 cas1 cas3 catB catK catL catS faXa thr try average

42.9 61.1 70.6 51.5 60.0 40.5 20.7 14.9 44.1 77.8 81.1 51.4

44.9 83.9 75.6 53.0 64.2 40.4 28.8 30.3 53.8 81.5 87.6 58.5

46.6 76.8 79.4 52.3 59.2 42.2 23.6 16.7 50.9 81.1 90.2 56.3

45.1 68.4 72.5 51.6 60.8 40.8 20.7 17.2 45.6 78.8 80.0 52.9

45.4 84.8 76.3 53.0 60.8 43.2 30.4 29.6 53.4 80.1 74.0 57.3

GO

cleavage

SCOP

Topmatch

MEROPS

76.3 88.9 38.1 32.2 77.5 45.2 59.0 44.6 31.0 64.6 50.2 55.2

76.3 92.1 51.9 46.4 77.5 46.7 64.7 44.4 34.6 62.1 58.1 59.5

73.7 89.1 46.9 47.5 75.8 52.8 68.4 58.3 39.1 72.4 66.0 62.7

74.9 90.5 44.4 47.0 84.2 53.5 71.0 55.2 37.1 70.1 58.9 62.4

72.4 88.2 54.4 47.3 79.2 52.0 71.0 60.9 39.1 72.2 66.0 63.9

47.8 74.1 75.0 50.7 62.5 40.9 23.4 23.8 53.3 80.8 84.9 56.1

48.5 75.0 83.8 53.9 55.8 43.1 28.2 18.0 52.9 80.9 88.5 57.2

49.0 80.2 80.6 54.2 56.7 40.9 28.2 30.8 53.2 82.1 88.7 58.6

49.0 78.9 88.1 54.1 60.0 42.6 29.7 28.9 54.6 81.3 87.2 59.5

46.6 84.3 85.0 54.1 58.3 41.7 27.8 31.4 54.0 82.0 88.1 59.4

TGD 47.1 74.3 80.0 52.8 61.7 41.3 22.6 20.6 51.1 81.2 87.4 56.4

a Recovery rates (in %) for all targets in data set 2 are reported for selection sets of 1.000 compounds averaged over 10 independent trials per target.

Figure 5. Target-dependent search performance (set 1). For all targets of set 1, recovery rates are shown for simple SVM ranking and for the SVM TLK search strategy in combination with the MEROPS kernel. Recovery rates are compared for selection sets of 1000 compounds averaged over 10 independent trials per target.

observed for database selection sets of 100 compounds. The secondary structure-based SSEA kernel and the PROFEAT kernel, which is based on biophysical descriptors calculated from protein sequence, did not produce higher recovery rates than the uniform kernel that does not take protein similarity into account and hence served as a reference for target kernels. Differences in kernel performance were rather subtle, but the overall highest recovery rates were achieved with

the MEROPS kernel that encodes a hierarchical protein organization scheme. Equivalent observations were made for target set 2. Figure 4 (and Supplementary Figure S2) reports average results of the search calculations on set 2 (corresponding to those reported in Figure 3 and Supplementary Figure S1), and Tables 7 and 8 report recovery rates on a per target basis (Supplementary Tables S3 and S4 report corresponding

2164

J. Chem. Inf. Model., Vol. 49, No. 10, 2009

Figure 6. Cumulative recall curves. Representative recall curves for simple SVM and SVM TLK (MEROPS) calculations are shown for three targets, (A) catD, (B) mmp2, and (C) cal2, using TGD as the ligand descriptor. Recovery rates are averaged over 10 independent trials per target.

results for database selection sets of 100 compounds). In this case, the recovery rates were generally higher than for set 1, ranging on average from approximately 47%-64% for selection sets of 1000 compounds, but differences between SVM search strategies and alternative kernels were even smaller than those observed for target set 1. The uniform kernel produced an average recovery rate of close to 50%, and several kernels taking protein similarity at different levels into account performed only slightly better. Here, the hierarchical SCOP and MEROPS kernels and the Topmatch kernel that is based on active site structural similarity performed equally well but only slightly better than the sequence similarity-based needle kernel. Thus, taken together, the results of systematic SVM calculations on our two target sets revealed surprisingly little differences in search performance for target kernels of different design.

WASSERMANN

ET AL.

for all individual set 1 targets in SVM TLK calculations using the MEROPS kernel and, in addition, simple SVM control calculations. In the latter case, the SVM was trained on the ligands of the target most closely related to the orphanized target. The results of these SVM TLK and simple SVM calculations are shown in Figure 5 (and Supplementary Figure S3). Significant differences in target-dependent search performance were observed (consistent with the results reported in Tables 4 and 6). The search performance was found to be highly dependent on the degree of relatedness between the orphan target and its nearest neighbor. For those targets having a closely related nearest neighbor at the subfamily level (i.e., catD, mmp2, mmp8, ren, thr, try; see Figure 2), highest recovery rates were observed. For these targets, simple SVM calculations using the ligands of the nearest neighbor as positive training examples matched the performance of SVM TLK calculations using the MEROPS kernel. By contrast, simple SVM search calculations produced only low recovery rates, or failed, for targets that had no closely related neighbor (i.e., all cysteine proteases in set 1, mgcp, ace2, and metap2; see Figure 2). The cumulative recall curves shown in Figure 6A,B illustrate the close correspondence between simple SVM and SVM TLK calculations when a closely related nearest neighbor target (see Table 1) was available. However, Figure 6C shows that taking additional target and ligand information into account when no closely related neighbor was available further improved the search performance, a trend that was especially observed for larger database selection sets. Different from target set 1, each target in set 2 had a nearest neighbor at the subfamily level (Figure 2). Accordingly, one would expect better target-dependent search performance for targets in set 2 than in set 1. The SVM TLK (MEROPS) search calculations shown in Figure 7 (and Supplementary Figure S4) confirm this expectation. The majority of targets in set 2 produced recovery rates of at least 40% (with the MACCS and/or TGD fingerprints for ligand representation). In this case, simple SVM control calculations were also carried out after pooling the ligands of all members of the orphan target’s subfamily for training. As illustrated in Figure 7, the recovery rates observed in these SVM control calculations were almost indistinguishable from those of SVM TLK calculations. Furthermore, in Supplementary Table S5, recovery rates for simple SVM calculations on set 2 targets are reported for selection sets of 100 and 1000 compounds when either only ligands of the nearest neighbor target were used for training or, alternatively, ligands of all subfamily members were pooled. The results demonstrate that recovery rates for targets having several closely related subfamily members further improved when ligands from all related targets were taken into account compared to ligands of only the most closely related target.

7. TARGET-DEPENDENT KERNEL PERFORMANCE

As described in the Methods section, the overall bestperforming MEROPS kernel differs from other target kernels in that it assigns high weights to closely related targets, due to its exponential formalism. In order to explore the contributions of the most closely related targets to ligand recovery, we analyzed the search performance

8. NEAREST NEIGHBOR EFFECTS

The findings discussed above reflect a strong influence of ligand information of nearest neighbor targets on ligand prediction for orphanized targets. In order to evaluate the magnitude of nearest neighbor effects, SVM TLK calculations using the MEROPS kernel were also carried out after

LIGAND PREDICTION

FOR

ORPHAN TARGETS

J. Chem. Inf. Model., Vol. 49, No. 10, 2009 2165

Figure 7. Target-dependent search performance (set 2). For all targets of set 2, recovery rates are shown for simple SVM ranking and for the SVM TLK search strategy in combination with the MEROPS kernel. For simple SVM ranking, ligands of all members of the orphanized target’s subfamily were pooled and used as the positive training class. Recovery rates are shown for a selection set of 1000 compounds averaged over 10 independent trials per target.

Figure 8. Dependence of search performance on ligands of nearest neighbor targets. For all set 1 targets, search results are reported for the SVM TLK (MEROPS) search strategy. Blue bars show recovery rates obtained by learning with all target set ligands, whereas red bars show recovery rates obtained when the ligands of the nearest neighbor are excluded from the training set. Recovery rates are shown for a selection set of 1000 compounds averaged over 10 independent trials per target.

2166

J. Chem. Inf. Model., Vol. 49, No. 10, 2009

removal of the ligands of the nearest neighbor target from SVM learning. The search results for set 1 targets are shown in Figure 8 and Supplementary Figure S5, and Supplementary Table S6 reports the comparison of SVM TLK and LC calculations. As can be seen in Figure 8, removal of ligands led to a sharp decline in recovery rates when a nearest neighbor target was available at the subfamily level (effects observed in SVM TLK and LC calculations were similar). By contrast, removal of ligands for targets where no closely related neighbor was available had only little influence on the search performance. Thus, these findings further corroborated the crucial role of nearest neighbor ligand information for orphan target ligand prediction using SVM techniques. 9. CONCLUSIONS

In this study, we have investigated different strategies for SVM-based ligand prediction for simulated orphan targets with special emphasis on the evaluation of alternative target kernel functions that capture protein information at different levels. The approaches investigated here aimed at de novo ligand predictions. The way target information was taken into account presented a major variable in these calculations, and, accordingly, target kernels of different complexity and information content were designed and evaluated. Surprisingly, these alternative kernel functions influenced the calculations much less than one might anticipate. Rather, nearest neighbor effects were found to be the major determinant of ligand prediction performance. In particular, when ligand information from one or more closely related targets was available, simple SVM calculations utilizing this information met or exceeded the search performance of SVM TLK and LC calculations. For SVMbased ligand prediction on orphan targets, these findings have significant implications. Rather than focusing on information provided by reference systems capturing protein hierarchies, searching for targets with known ligands that are closely related to orphan targets (e.g., at the subfamily level) should be a primary objective. For this purpose, simple detection of sequence similarity might often be sufficient. In the presence of strong nearest neighbor relationships, SVM-based strategies for ligand prediction can be simplified. In these cases, simple SVM calculations using nearest neighbor ligands for learning, or corresponding SVM linear combinations, are expected to produce promising results. By contrast, if no closely related targets can be identified, SVM learning using target kernels capturing protein hierarchy information is likely to be a preferred approach. Thus, SVM strategies for ligand prediction can be adjusted based on an initial exploration of target relationships. ACKNOWLEDGMENT

We wish to thank Dagmar Stumpfe and Jens Humrich for help with target and ligand sets and SVMlight, respectively. Supporting Information Available: Supplementary Figures S1-S5 and Tables S1-S6 report the results of test calculations on individual targets using different SVM strategies. This material is available free of charge via the Internet at http://pubs.acs.org.

WASSERMANN

ET AL.

REFERENCES AND NOTES (1) Eckert, H.; Bajorath, J. Molecular similarity analysis in virtual screening: foundations, limitations and novel approaches. Drug DiscoVery Today 2007, 12, 225–233. (2) Vapnik, V. N. The Nature of Statistical Learning Theory, 2nd ed.; Springer: New York, 2000. (3) Boser, B. E.; Guyon, I. M.; Vapnik, V. A training algorithm for optimal margin classifiers, In Proceedings of the 5th Annual Workshop on Computational Learning Theory, Pittsburgh, Pennsylvania, 1992; ACM, New York, 1992; pp 144-152. (4) Mu¨ller, K.-R.; Ra¨tsch, G.; Mika, S.; Tsuda, K.; Scho¨lkopf, B. An introduction to kernel-based learning algorithms. IEEE Neural Networks 2001, 12, 181–201. (5) Scho¨lkopf, B.; Smola, A. Learning with Kernels; MIT Press: Cambridge, MA, 2002. (6) Bock, J. R.; Gough, D. A. Virtual screens for ligands of orphan G protein-coupled receptors. J. Chem. Inf. Model. 2005, 45, 1402–1414. (7) Erhan, D.; L’Heureux, P.-J.; Yue, S. Y.; Bengio, Y. Collaborative filtering on a family of biological targets. J. Chem. Inf. Model. 2006, 46, 626–635. (8) Jacob, L.; Vert, J.-P. Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics 2008, 24, 2149–2156. (9) Geppert, H.; Humrich, J.; Stumpfe, D.; Ga¨rtner, T.; Bajorath, J. Ligand prediction from protein sequence and small molecule information using support vector machines and fingerprint descriptors. J. Chem. Inf. Model. 2009, 49, 767–779. (10) Geppert, H.; Horva´th, T.; Ga¨rtner, T.; Wrobel, S.; Bajorath, J. Supportvector-machine-based ranking significantly improves the effectiveness of similarity searching using 2D fingerprints and multiple reference compounds. J. Chem. Inf. Model. 2008, 48, 742–746. (11) Willett, P.; Barnard, J. M.; Downs, G. M. Chemical similarity searching. J. Chem. Inf. Comput. Sci. 1998, 38, 983–996. (12) Wassermann, A. M.; Geppert, H.; Bajorath, J. Searching for targetselective compounds using different combinations of multiclass support vector machine ranking methods, kernel functions, and fingerprint descriptors. J. Chem. Inf. Model. 2009, 49, 582–592. (13) Powell, M. J. D.; Mason, J. C.; Cox, M. G. Radial Basis Functions for Multivariable Interpolation: A review. In Algorithm for Approximation; Clarendon Press: 1987; pp 143-167. (14) Ralaivola, L.; Swamidass, S. J.; Saigo, H.; Baldi, P. Graph kernels for chemical informatics. Neural Networks 2005, 18, 1093–1110. (15) Rice, P.; Longden, I.; Bleasby, A. EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet. 2000, 16, 276–277. (16) Li, Z. R.; Lin, H. H.; Han, L. Y.; Jiang, L.; Chen, X.; Chen, Y. Z. PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res. 2006, 34, W32–W37. (17) Dubchak, I.; Muchnik, I.; Holbrook, S. R.; Kim, S. H. Prediction of protein folding class using global description of amino acid sequence. Proc. Natl. Acad. Sci. U.S.A. 1995, 92, 8700–8704. (18) Leslie, C.; Eskin, E.; Noble, W. S. The spectrum kernel: a string kernel for SVM protein classification. Pac. Symp. Biocomput. 2002, 564– 575. (19) Jones, D. T. Protein secondary structure prediction based on positionspecific scoring matrices. J. Mol. Biol. 1999, 292, 195–202. (20) Fontana, P.; Bindewald, E.; Toppo, S.; Velasco, R.; Valle, G.; Tosatto, S. C. The SSEA server for protein secondary structure alignment. Bioinformatics 2005, 21, 393–395. (21) Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nat. Genet. 2000, 25, 25-29. (22) Apweiler, R.; Bairoch, A.; Wu, C. H.; Barker, W. C.; Boeckmann, B.; Ferro, S.; Gasteiger, E.; Huang, H.; Lopez, R.; Magrane, M.; Martin, M. J.; Natale, D. A.; O’Donovan, C.; Redaschi, N.; Yeh, L. S. UniProt: the Universal Protein knowledgebase. Nucleic Acids Res. 2004, 32, D115–D119. (23) Camon, E.; Magrane, M.; Barrell, D.; Binns, D.; Fleischmann, W.; Kersey, P.; Mulder, N.; Oinn, T.; Maslen, J.; Cox, A.; Apweiler, R. The Gene Ontology annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Res. 2003, 13, 662– 672. (24) Lei, Z.; Dai, Y. Assessing protein similarity with Gene Ontology and its use in subcellular localization prediction. Bioinformatics 2006, 7, 491. (25) Rawlings, N. D.; Morton, F. R.; Kok, C. Y.; Kong, J.; Barrett, A. J. MEROPS: the peptidase database. Nucleic Acids Res. 2008, 36, D320– D325. (26) Igarashi, Y.; Eroshkin, A.; Gramatikova, S.; Gramatikoff, K.; Zhang, Y.; Smith, J. W.; Osterman, A. L.; Godzik, A. CutDB: a proteolytic event database. Nucleic Acids Res. 2007, 35, D546–549.

LIGAND PREDICTION

FOR

ORPHAN TARGETS

(27) Murzin, A. G.; Brenner, S. E.; Hubbard, T.; Chothia, C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 1995, 247, 536540. (28) MOE (Molecular Operating EnVironment); Chemical Computing Group Inc.: Montreal, Quebec, Canada, 2007. (29) Sippl, M. J.; Wiederstein, M. A note on difficult structure alignment problems. Bioinformatics 2008, 24, 426–427. (30) Sippl, M. J. On distance and similarity in fold space. Bioinformatics 2008, 24, 872–873. (31) Saigo, H.; Vert, J. P.; Ueda, N.; Akutsu, T. Protein homology detection using string alignment kernels. Bioinformatics 2004, 20, 16821689. (32) MDL Drug Data Report (MDDR); Symyx Software: San Ramon, CA, 2005. (33) Liu, T.; Lin, Y.; Wen, X.; Jorrisen, R. N.; Gilson, M. K. BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Res. 2007, 35, D198–D201.

J. Chem. Inf. Model., Vol. 49, No. 10, 2009 2167 (34) Chen, X.; Liu, M.; Gilson, M. K. Binding DB: a web-accessible molecular recognition database. Comb. Chem. High Throughput Screening 2001, 4, 719–725. (35) MACCS Structural Keys; Symyx Software: San Ramon, CA, 2005. (36) Irwin, J. J.; Shoichet, B. K. ZINC - a free database of commercially available compounds for virtual screening. J. Chem. Inf. Model. 2005, 45, 177–182. (37) SVMlight. URL for the publicly available SVM tool. http://svmlight. joachims.org/ (accessed June 2009). (38) Joachims, T. Making large-scale SVM learning practical. In AdVances in Kernel Methods - Support Vector Learning; Scho¨lkopf, B., Burges, C., Smola, A., Eds.; MIT-Press: Cambridge, MA, 1999. (39) Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E. The Protein Data Bank. Nucleic Acids Res. 2000, 28, 235–242.

CI9002624