Self-Organizing Fuzzy Graphs for Structure-Based Comparison of

Sep 30, 2010 - odological concepts of computer-assisted lead discovery and optimization. Various approaches for binding pocket comparison have...
16 downloads 0 Views 3MB Size
Self-Organizing Fuzzy Graphs for Structure-Based Comparison of Protein Pockets Felix Reisen,† Martin Weisel,‡ Jan M. Kriegl,§ and Gisbert Schneider*,† Computer-Assisted Drug Design, Eidgeno¨ssische Technische Hochschule (ETH) Zu ¨ rich, Zu ¨ rich, Switzerland, Pharma Research Scientific Informatics, Hoffmann-La Roche Inc, Nutley, New Jersey, United States, and Lead Discovery, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riβ, Germany Received July 12, 2010

Patterns of receptor-ligand interaction can be conserved in functionally equivalent proteins even in the absence of sequence homology. Therefore, structural comparison of ligand-binding pockets and their pharmacophoric features allow for the characterization of so-called “orphan” proteins with known three-dimensional structure but unknown function, and predict ligand promiscuity of binding pockets. We present an algorithm for rapid pocket comparison (PoLiMorph), in which protein pockets are represented by self-organizing graphs that fill the volume of the cavity. Vertices in these threedimensional frameworks contain information about the local ligand-receptor interaction potential coded by fuzzy property labels. For framework matching, we developed a fast heuristic based on the maximum dispersion problem, as an alternative to techniques utilizing clique detection or geometric hashing algorithms. A sophisticated scoring function was applied that incorporates knowledge about property distributions and ligand-receptor interaction patterns. In an all-against-all virtual screening experiment with 207 pocket frameworks extracted from a subset of PDBbind, PoLiMorph correctly assigned 81% of 69 distinct structural classes and demonstrated sustained ability to group pockets accommodating the same ligand chemotype. We determined a score threshold that indicates “true” pocket similarity with high reliability, which not only supports structure-based drug design but also allows for sequenceindependent studies of the proteome. Keywords: orphan protein • drug design • machine learning • pocketome • graph matching

Introduction Virtually all biochemical processes are based on receptorligand and macromolecule-macromolecule interactions, rendering the structural characterization of molecular interaction sites a valuable starting point for sequence-independent protein comparison, and consequently the design of novel bioactive agents.1-3 Comparative pocket analyses have been performed to define the “pocketome”,4 predict pocket “druggability”,5-8 or understand the role of “orphan” proteins with known structure but unknown function even in the absence of sequence homology.9 Knowledge about pharmacophoric pocket similarities can also be used to predict cross-reactivity of bioactive lead compounds.10,11 Despite great efforts to optimize ligands for specific target molecules, most of the drug candidates cause undesired side effects, often due to interactions with off-target proteins.12 Reliable assessment of target site similarity may help avoid such unintended cross-reactivity of * To whom correspondence should be addressed. Prof. Dr. Gisbert Schneider, Eidgeno¨ssische Technische Hochschule (ETH), Department of Chemistry and Applied Biosciences, Institute of Pharmaceutical Sciences, Wolfgang-Pauli-Str. 10, 8093 Zurich, Switzerland. E-mail: gisbert.schneider@ pharma.ethz.ch. Phone: +41 44 633 73 27. Fax: +41 44 633 13 79. † Eidgeno¨ssische Technische Hochschule (ETH) Zu ¨ rich. ‡ Hoffmann-La Roche Inc. § Boehringer Ingelheim Pharma GmbH & Co. KG.

6498 Journal of Proteome Research 2010, 9, 6498–6510 Published on Web 09/30/2010

drug candidates and provide insights into functional relatedness of pharmacologically relevant macromolecular targets. The obtained information about pocket similarities could then be exploited by rational ligand design resulting in the desired target and activity profile. This view of “pharmacophoric chemical space” will undoubtedly complement current methodological concepts of computer-assisted lead discovery and optimization. Various approaches for binding pocket comparison have been proposed during the last years. It is common to use the amino acid sequence, the protein backbone structure, and the local arrangement of pocket residue atoms as basis for the computational analysis.13 For cross-reactivity prediction as well as orphan protein studies, the latter is preferable because it allows detection of similarities in shape and properties without the strict need for sequence homology. The existing approaches can be grouped into three classes according to the type of structural information used: Type 1 methods are designed to detect similarities in atom, pseudo-center, or residue composition.11,13-17 They perform well if the amino acid sequence lining the pocket, residue motifs or atom positions are conserved.18 Type 2 methods utilize surface-based representations of the pockets,18-20 and pharmacophoric features are mapped onto the molecular surface of the proteins. 10.1021/pr100719n

 2010 American Chemical Society

research articles

Structure-Based Comparison of Binding Pockets

Figure 1. Construction of a fuzzy pocket graph. (A) Grid representation of thymidylate synthase binding pocket (1tsy,47 resolution: 2.2 Å). (B) Neurons of growing neural gas in the pocket grid. (C) Property assignments of the grid, exemplarily colored by its solvatability potential (from red “lipophilic” to blue “hydrophilic”). (D) Property assignment of the fuzzy graph, exemplarily colored by the attributes lipophile (from white “not lipophilic” to red “strongly lipophilic”) and hydrophile (from white “not hydrophilic” to red “strongly hydrophilic”); yellow sphere around central vertex shows the radius of projection. Due to the differentiated projection of electrostatic properties, lipophilic and hydrophilic influences do not cancel each other out, even though the vertex in the center of the sphere is located in the interface of hydrophilic and lipophilic regions.

Type 3 methods represent pockets by a grid placed in the binding site cavity.8,10 Conceptually based on the GRID method originally conceived by Goodford 198521 and related to CoMFA22 and other 3D-QSAR approaches, grid probes are used to calculate different potentials that are deemed important for receptor-ligand interaction. Type 3 methods directly describe the binding site volume that can be adopted by a ligand. The different methods can further be discriminated by the underlying mathematical model and the matching algorithm used for pocket comparison. For instance, pocket information can be stored in a “fingerprint”,3,8,11,18 which allows for rapid calculation of similarity scores between pockets. A disadvantage of this representation often is lack of interpretability, since these methods do not deliver a pocket alignment as part of the outcome. Other techniques, like graph-based methods, explicitly include the pocket alignment in the matching algorithm.17,23 Frequently employed alignment methods are clique detection24 and geometric hashing.25 For a more detailed overview of methods for pocket comparison, we refer to a recent review article by Kellenberger et al. (2008).26 Here, we present a new Type 3 method, which we termed “PoLiMorph” (pocket-ligand morphing). Its central idea is to reduce the shape and properties of a pocket to a threedimensional graph representation. For graph matching, we propose an algorithm based on a heuristic for the maximum dispersion problem.27 The vertices of the graphs are labeled according to fuzzy set theory28 as proposed by Zadeh in 1965 as an extension to classical set theory. It is based on the idea that “many things in life are matters of degree”,29 which allows for a differentiated description and comparison of objects. In previous related studies, fuzzy set theory has been integrated

in ligand30 and pocket31 models. In PoLiMorph, crisp property assignments to the vertices of the pocket graphs are replaced by smooth functions. In the next section, we give a detailed description of the mathematical concept and underlying algorithms. In the following two sections, we present results from experiments with PoLiMorph, show its application to pocket similarity analysis, and compare the method to other recently published approaches.

Materials and Methods Here, binding pockets of the proteins are represented by three-dimensional graphs with multiple fuzzy vertex labels. For graph construction, we relied on potential binding sites of proteins that had previously been identified by the software PocketPicker3 (Figure 1A). In PoLiMorph, positions of the graph vertices in the pocket are generated by a growing neural gas (GNG) approach32 leading to a set of “neurons” (points in space) that mimic the shape of the cavity (Figure 1B). Neuron coordinates are equivalent to vertex coordinates of the pocket graph. Subsequently, properties such as buriedness, solvatability, electrostatics, and potential hydrogen-bonding terms are derived from the protein structure and projected onto the vertices (Figure 1C and D). Finally, rapid pairwise comparison of pocket graphs is carried out by an error-tolerant graph matching algorithm. The resulting score value determines the similarity of the two pockets graphs and depends on the number as well as the properties of mapped vertices. Graph Theory. Mathematical graphs are abstract representations of objects and their relationships to each other.33 In a graph, objects and their relationships are represented by a set Journal of Proteome Research • Vol. 9, No. 12, 2010 6499

research articles

Reisen et al.

of vertices V and a set of edges E, respectively. Depending on the chosen graph form and on the modeled objects, vertices and edges can be described in more detail by labeling functions λ, φ and associated Labels LV and LE, respectively. Thus, a labeled graph G is a quadruple (V,E,λ,φ), where V ) {v1,v2, ..., vn}, E ⊆ V × V, λ:VfLv, and φ:EfLE. An intuitive example for the use of graphs is the representation of molecules: the vertices and edges represent atoms and bonds, respectively; λ assigns atom types to the vertices, φ assigns bond orders to the edges. Graph models allow for a compact description and subsequent comparison of objects. In molecular modeling, an important basic task is the comparison of chemical structures,34 which in turn can be reduced to the comparison of the molecular graphs. For instance, algorithms that identify graph isomorphisms35-38 can be used to identify duplicates in molecular databases. Other approaches discover structural similarity between molecules by means of common subgraph isomorphism algorithms.34,39 In this way, databases can be screened for molecules that are similar to a given query in terms of the number of mapped atoms and edges of the identified common subgraph. For further information about graphs and successful applications in cheminformatics and drug design, we refer to the literature.40,41 Fuzzy Logic. Zadeh proposed fuzzy set theory as an extension to classical set theory in 1965. It is based on the idea that “many things in life are matters of degree” (for example, a typical black and white photo consists not only of black and white, but contains also many gray shaded pixels).29 Fuzzy logic provides a conceptual framework for dealing with such problems in which precisely defined membership criteria are absent. Let A be a set and X a space of objects. In classical set theory each x ∈ X is either a member of A or it is not (binary yes/no assignment). In the context of fuzzy set theory such a classical set is called a ‘crisp’ set. In a fuzzy set, the binary assignment function is replaced by a function η that assigns a grade of membership for each x ∈ X to the fuzzy set Aη. Typically, the membership function is valued in the real unit interval [0,1], so that pseudo-probabilities are obtained. Therefore, the fuzzy sets approach allows for a more refined characterization of values than an approach with crisp sets and strict membership thresholds. The combination of multiple fuzzy sets leads to an even more detailed representation of objects. On the one hand, these sets can measure related properties, for example, different sets for electrostatic properties such as “positive”, “neutral” or “negative”. On the other hand, unrelated attributes can be detected, such as “charge” and “buriedness”. In regard to a description by multiple fuzzy sets, a similarity function ψ can be defined for the comparison of two objects x and y (eq 1).

searches for buried regions on the protein surface by means of an artificial grid approach. Proteins are embedded in a grid and buriedness indices are assigned to the grid points (Figure 1A). The environment of each grid point is scanned for surrounding atoms of the protein. Thirty different direction vectors of length 10 Å and width 0.9 Å emerge from each grid point. Each vector that encounters at least one atom of the protein increments the buriedness index of its grid point by one. After scanning the whole protein, grid points with buriedness indices greater than 14 are clustered. Each cluster incorporates a potential binding site at which large pockets are more likely to be “true” areas for protein-ligand interactions. In previous case studies, PocketPicker succeeded in locating the actual binding site as one of the top-three predicted sites in 85% of the proteins tested.8 Grid Potentials. Labels of the pocket graph vertices are integrated from five potentials of their surrounding grid points. Grid point properties are deduced from the protein structure (Figure 1C). In the present implementation, “Buriedness”,3 “Electrostatics”,42 “Solvatibility”,43,44 “HBondDonor”, and “HBondAcceptor”45 properties are used as grid potentials. Buriedness values Bg for grid points g are directly imported from the output of PocketPicker. Calculation of the electrostatic potential Eg is based on partial charges of the atoms belonging to the protein, as computed by the software PDB2PQR46 using the CHARMM27 forcefield.42 For each grid point g the electrostatic potential Eg is calculated from all partial charges of atoms that lie within a radius c of 10 Å (eq 2).

Eg )

∑ω

1 |F| η

p∈F

ηp(|ηp(x)

- ηp(y)|)

(1)

∑ i

∑ log P(a )*∆ i

6500

Journal of Proteome Research • Vol. 9, No. 12, 2010

2

(2)

c,s(δ(g, ai))

ai∈AL

∑∆

c,s(δ(g, ai))

ai∈AL

exp In eq 1, ηp ∈ F are the associated membership functions of the fuzzy sets characterizing the objects. ωηp are weighting functions that allow us to scale or emphasize the importance of certain fuzzy functions. In our approach, fuzzy sets and their associated membership functions are used to characterize the vertices of the pocket graphs. The similarity function is used to compare vertices of different graphs. PocketPicker. Identification of binding pockets is accomplished by the software package PocketPicker.3 PocketPicker

E

))

In eq 2, ε is the dielectric constant (here: ε ) 1), AE is the set of atoms lying within the radius c around the grid point, χ(ai) is the partial charge of atom ai, and δ(g,ai) is the Euclidean distance between g and ai. Solvatability values Sg are calculated analogously. LogP values of the protein atoms are estimated according to a substructurebased method proposed by Wildmann and Crippen (1999).44 The solvatability values of the grid are then derived from the estimated logP values. As proposed by Heiden et al. (1993),43 the projection is achieved by a sigmoidal function that weights the influence of atoms according to their distance to the grid point under consideration (eq 3).

Sg ) ψ(x, y) ) 1 -

( (

χ(ai) δ(g, ai) 1 * * 14*π*ε a ∈A δ(g, ai) c

(

, where ∆c,d(δ(g, ai)) )

)( (

))

2 ∗ δ(g, ai) - c -2c + 1 * exp +1 s s

-1

(3)

In eq 3, g stands for the grid point, AL is the set of atoms forming the pocket, log P(ai) is the atomistic octanol/water partition coefficient of atom ai. The steepness factor c and the width value s are parameters of the sigmoidal distance function ∆ set to 4 and 1.5 Å, respectively. δ(g,ai) gives the Euclidean distance between g and ai. The hydrogen bond acceptor and donor potentials Ag and Dg are calculated using knowledge-based potentials as proposed by Liu et al. (2008).45 Six different donor types and five

research articles

Structure-Based Comparison of Binding Pockets different acceptor types are used. For each donor-acceptor combination a distance and an angle dependent potential is defined. We first estimate the interaction energies between potential hydrogen donors of the protein and all hypothetical acceptor types located at the grid points. Then, for each grid point in turn, the optimal value of these energies is taken as value for Ag. Estimates depend on the distances as well as on the donor-hydrogen-acceptor angle between the grid points and the potential donor atoms (eq 4).

(∑

Ag ) minacc∈AT

∑ (ϑ

ai∈Dacc hj∈Hi

acc,don(ai)(ai, g)

+

θacc,don(ai)(ai, g, hj))

)

(4)

(∑

ai∈Adon

property (p)

projection function

grid-potential (φ)

b

d

Negative Neutral Positive Hydrophile Amphiphile Lipophile Surface Half-Buried Buried H-Bond-Donor H-Bond-Acceptor

ζ1(g) ζ2(g) ζ3(g) ζ1(g) ζ2(g) ζ3(g) ζ1(g) ζ2(g) ζ3(g) ζ1(g) ζ1(g)

Eg Eg Eg Lg Lg Lg Bg Bg Bg Dg Ag

1 0 1 1 0 1 1 0 1 1 1

-1 1 -1 -1 1 -1 -1 1 -1 -1 -1

a For each property, the projection function, grid potential, and the parameters b and d are given.

In eq 4, g is the grid point under consideration, and AT is the set of acceptor types. ϑacc,don(ai)(ai,g) denotes the distance potential of the grid point’s acceptor type acc and the protein atom’s donor type don(aI). Accordingly, θacc,don(ai)(ai,g,hj) describes the donor-hydrogen-acceptor angular potential, where hj is a hydrogen atom bound to ai. Dacc is the set of “favorable” donor atoms of the protein for a hypothetical acceptor atom of type acc in position of g, so that ϑacc,don(ai)(ai,g) < 0. Hi contains all hydrogen atoms that are covalently bound to a potential donor atom ai, so that θacc,don(ai)(ai,g,hj) < 0. The donor potential of the grid Dg is calculated analogously to Ag (eq 5). In contrast to the calculation of the acceptor values however, the hydrogen positions that are needed for the angular potential are not included in the protein structure. Therefore, a virtual hydrogen atom is placed at a position, so that the donor-hydrogen-acceptor angle is optimal with regard to θ. Dg ) min don∈DT

Table 1. Properties and Parameter Settings for the Coloring of Graph Verticesa

ϑacc(ai),don(ai, g) + θacc(ai),don(ai, g, hg,ai)

)

(5)

In eq 5, g is the grid point under consideration, and DT is the set of potential donor types. ϑacc(ai),don(ai,g) denotes the distance potential of the grid point’s donor type don and the atom’s acceptor type acc(ai). Accordingly, θacc(ai),don(ai,g,hg,ai) describes the angle potential, where hg,ai is the virtually placed hydrogen atom between g and the current acceptor atom ai. Adon is the set of favorable acceptor atoms of the protein for a hypothetical donor atom of type don in position of g, so that ϑacc(ai),don(ai,g) < 0. Fuzzy Graph Mapping. After the grid and its properties have been calculated, a three-dimensional (3D) fuzzy pocket graph GFP is fit into the pocket. Its vertex distribution mirrors the volume of the pocket. Vertex coordinates are determined by the growing neural gas (GNG) method.32 A growing neural gas is an incremental network model, which is able to learn topological relations of a given set of input vectors. In cheminformatics, GNG has already been employed for learning binding pocket topologies.48 A nongrowing neural gas approach was used for automated docking of flexible molecules into receptor binding sites.49 In our case, the grid point coordinates serve as input vectors, and the gas disperses its neurons in the grid during the training process. The training consists of alternating steps of a Hebb-like learning rule50 and steps of adding or removing neurons. The process terminates when the

average distance of the grid points to their nearest neurons is below a user-defined threshold. The coordinates of the neurons serve as coordinates of the vertices of the 3D fuzzy pocket graph GFP. The pseudo-code of the algorithm and its parametrization are given in Figure S1 and Table S1, respectively; an example for a training process is visualized in Video S1 (Supporting Information). GFP is a fully connected graph so that every vertex pair in the graph is connected by an edge. In mathematical terms, GFP can be formulated as quadruple (VFP,EFP,ΛFP,φFP), where VFP ) {v1,v2, ..., vn}, EFP ) VFP × VFP, Λ ) {λp:VFPfLVp|p ∈ P} and φFP:EFPfLE}. φFP labels the edges by the Euclidean distance between their associated vertices. The vertices of the graph are labeled by 11 different fuzzy functions λp ∈ ΛFP that account for different properties p ∈ P, such as “positive charge” or “negative charge”. These functions are refinements of the five grid potentials (Table 1). According to fuzzy set theory, the vertex labeling functions transform the grid values to a real unit interval [0,1]. For these projections, grid points that lie within a radius r ) 3.4 Å around a vertex are considered. This default value was obtained from preliminary experiments, in which radii from 1.5 to 5.0 Å were tested. Depending on property p, only those grid points within r that make a positive contribution to the fuzzy value λp are included in the calculation. For instance, a positively charged grid point is not included in the calculation of the property “negative charge”. This ensures that information about related but contrary properties in the environment of a vertex is conserved in the vertex description. In a naı¨ve approach with only one set describing the electrostatic potential, information loss can occur when vertices are located in the interface of a positive and a negative region. In this case, positive and negative contributions would cancel each other out, leading to a ‘neutral’ coloring of the vertices. These vertices could not be distinguished from vertices located in a grid area consisting only of neutral grid points (Figure 1D). Separate handling of different properties has the additional advantage of individual property weighting. To allow for a consistent handling and projection of the grid potentials, distribution parameters are used as part of the fuzzy labeling functions λP ∈ ΛFP. For this purpose, grid values were calculated for all binding pockets of the PDBbind core set (release 2007).2 For each grid potential φ ∈{Bg,Eg,Sg,Ag,Dg} the means µφ and standard deviations σφ of the calculated values were determined. Similar to the concept of z-scoring, these parameters are incorporated in the labeling functions as described in eq 6. Journal of Proteome Research • Vol. 9, No. 12, 2010 6501

research articles λp(v) )

Reisen et al. 1 |N|

∑ Ω (g), p

g∈N

1 where Ωp(g) ) b + d*exp - ζk(g)2 , 1 e k e 3, 2 µφ - φp(g) φp(g) - µφ ζ1(g) ) max 0, , ζ2(g) ) , σφ σφ φp(g) - µφ ζ3(g) ) max 0, , σφ N ) {g|(δ(g, v) < r) ∧ (ζk(g) > 0)}

(

(

(

)

)

(6)

)

In eq 6, N is the set of grid points that lie within radius r around vertex v and contribute a positive value to λp. The normalizing function ζk, the grid potential φ, b and d vary in regard to p as described in Table 1. In Figure 2, the graphs of the functions Ωp(g) are exemplarily shown for the properties “negative”, “neutral”, and “positive”. Comparison of Graphs. Inspired by clique-based approaches for the exact subgraph matching problem,24 we developed an error-tolerant heuristic for the matching of 3D graphs. Given two input graphs, the heuristic follows a greedy approximation strategy that tries to identify a k-densest subgraph27 in their labeled product graph. A k-densest subgraph Sk of a graph G is a subgraph with k vertices whose sum of the edge labels is maximal under all possible subgraphs with k vertices. Since the determination of a k-densest subgraph is NP-hard,51 we use a greedy approximation algorithm, which runs in O(|V|2): Starting with two fuzzy pocket graphs GFP1 ) (VFP1,EFP1,ΛFP1,φFP1) and GFP2 ) (VFP2,EFP2,ΛFP2,φFP2), the algorithm constructs the fuzzy labeled product graph PGGFP1,GFP2 ) (VPG,EPG},λPG,φPG). Its vertex set consists of all possible vertex pairs (u,v) ∈ VFP1 × VFP2. According to the different fuzzy vertex labels of u ∈ VFP1 and v ∈ VFP2, the vertex labeling function λPG((u,v)) of the product graph is defined as the fuzzy similarity between the vertices u and v (eq 7).

λPG((u, v)) ) ψ(u, v) ) 1 -

1 |P|

∑ ω (|λ (u) - λ (v)|), p

1 p

2 p

p∈P

dif)

where ωp(x) ) xlog(t)/log(σp

(7)

In eq 7, the value of λPG((u,v)) is scaled between 0 and 1. The weighting functions ωp(x) perform two different tasks: On the one hand, they scale the distances |λp(u) - λp(v)| for the different fuzzy sets by using the scaling parameters σpdif. For the calculation of these scaling parameters, all possible vertex pairs from the pocket graphs of the PDBbind core set were constructed and the half-normal distributions of |λp(u) - λp(v)| obtained. σpdif correspond to the respective standard deviations of these half-normal distributions. On the other hand, ωp(x) amplifies the signal of minor differences between the objects in the respective fuzzy sets by the included scaling factor t (set to 0.68, cf. Figure S2, Supporting Information). The edges of the product graph ((u1,v1),(u2,v2)) ∈ EPG are weighted by a labeling function φPG that depends on the vertex labels λPG((u1,v1)) and λPG((u2,v2)), as well as on the similarity 1 2 (u1,u2) and φFP (v1,v2) between the repreof the distances φFP 1 2 and GFP (eq 8). sented vertices in the pocket graphs GFP 6502

Journal of Proteome Research • Vol. 9, No. 12, 2010

Figure 2. Graphs of functions Ωp(g) for the three different properties “negative”, “neutral”, and “positive”. The projection is dependent on the distribution of the electrostatic potential Eg on all grid points of the PDB-Bind Core Set, which is incorporated in the functions ζk(g).

φPG((u1, v1), (u2, v2)) ) 0 if (u1 ) u2) or (v1 ) v2) λPG(u1, v1) + λPG(u2, v2) 1 2 (u1, u2), φFP (v1, v2)) else, *ξ(φFP 2

{

1 2 where ξ(φFP (u1, u2), φFP (v1, v2)) )

( (( exp m*

1 2 |φFP (u1, u2) - φFP (v1, v2)| 1 2 max(φFP (u1, u2), φFP (v1, v2))

)) )

-1

-q

+1

(8)

In eq 8, ξ is a sigmoidal distance function in the real unit interval [0,1], where m ) 20 and q ) 0.3 are scaling factors that were empirically determined in preliminary experiments. φPG((u1,v1),(u2,v2)) reaches large values if the Euclidean distances φ1FP(u1,u2) and φ2FP(v1,v2) are similar, and λPG(u1,v1) and λPG(u2,v2) are close to 1. For graph matching, a dense subgraph of size k ) min (|V1FP|,|V2FP|) with a high average edge weight corresponds to a good ‘fit’ with respect to shape and properties of the origin graphs. The k-densest subgraph algorithm works on the edge matrix of the product graph. First, the sum of each row of the edge matrix is calculated. Successively, the vertex of the product graph with minimal row sum is removed and the remaining sums are updated. This is done until only k vertices are left, which form an approximation of the k-densest subgraph. 1 2 Matched vertices of GFP and GFP can directly be obtained from the returned set of vertices S. Figure 3 displays the workflow of the algorithm. The matching heuristic can lead to solutions where vertices of one graph are matched against multiple vertices of the other graph. In fact, this behavior of the algorithm is favorable, because the stochastic generation of the vertex positions by the growing neural gas does not necessarily create the same number of vertices for the same region of a pocket in different runs. Overrating of these multiple matched vertices should be avoided by the scoring function. Therefore, we chose a scoring function that averages over multiple matched vertices (eq 9). Normalization of this score (to [ -1,1]) is used to decrease the influence of the graph size (eq 10).

Score )

[ (

∑ ∑

u∈V1

(u,vi)∈Su

λPG((u, vi)) - µλsim PG |Su |*σλsim PG

[∑( ∑ v∈V2

(ui,v)∈Sv

)]

+

λPG((ui, v)) - µλsim PG |Sv |*σλsim PG

)]

(9)

research articles

Structure-Based Comparison of Binding Pockets NormalizedScore )

1 2 Score(GPG , GPG )

1 1 2 2 1 ) + Score(GPG , GPG )) , GPG *(Score(GPG 2 (10)

In eq 9, µλPGsim and σλPGsim are the mean and standard deviation of the distribution of vertex similarities that are used to calculate a z-score of the vertex matchings. This distribution was derived from the calculation of the fuzzy similarities λPG for all vertex pairs of all pocket graphs from the PDBBind core set. Su and Sv are subsets of the k-densest subgraph S ⊆ VPG, 1 where Su and Sv contain all vertex pairs that include u ∈ VFP or 2 v ∈ VFP , respectively. Methodological Overview of the Experiments. All protein structures used in the presented study were preprocessed before using them for pocket identification and similarity analysis. This procedure included the removal of all ligands and water molecules, hydrogen adding and positioning, as well as setting of ionization states by the software PDB2PQR.46 In PDB2PQR the options “noopt”, “nodebump”, and “chain” were activated, and the CHARMM27 forcefield42 was selected in each call of the program. In total, four protein collections were prepared for pocket analysis. (i) The PDBbind2 (Database 1) is a subset of the Protein Data Bank52 (PDB) and contains 1300 protein-ligand complexes that meet certain quality standards such as a minimum resolution of 2.5 Å. For our calculations, we used version 2007. To avoid mistakes during pocket detection, predefined pocket structures of the PDBbind were used as input for PocketPicker. These pockets consist of all residues of the original structure that have a distance below 10 Å to the ligand. The largest pocket found in the respective structures served as input for PoLiMorph. Preprocessing was successful for 1289 structures. (ii) The PDBbind core set (Database 2) is a subset of the PDBbind consisting of 70 different protein ligand site classes. Each of these classes contains three PDB structures of the same protein, each bound to a different ligand. One of the 70 classes includes three structures of the mannose binding protein c (PDB-Codes: 1rdi, 1rdj, 1rdl53). The binding sites of these proteins are located on a convex part of the surface and cannot be detected by PocketPicker. Therefore, this class is left out in all analyses. Matchings between all of the remaining 207 proteins from Database 2 were conducted, which resulted in a matrix containing all pairwise matching scores. (iii) The third protein structure database (Database 3) used in this work contains 40 different protein structures with low pairwise sequence identities.54 The binding pockets under consideration are in complex with four different ligand types. Ten of these structure bind to ATP, 10 to NAD, 10 to heme (HEME), and the remaining 10 proteins to five distinct but

Figure 3. Pseudocode of the graph matching heuristic. S is the set of selected vertex pairs returned by the algorithm.

chemical similar steroids. For pocket identification all residues within 10 Å around the ligands served as input for PocketPicker. (iv) The fourth database (Database 4) consists of 23 kinase structures of six distinct classes as proposed by Sciabola et al. (2010).10 Pocket identification was carried out as for Database 3. For construction of hierarchical trees, columns of the input matrices, that contained pairwise similaritiy scores, were normalized and distance matrices containing Euclidean distances between the rows were derived. These matrices served as input for Wards-Clustering method.55 A KolmogorovSmirnoff (KS) test56 was used as statistical test for significant differences between score distributions. All calculations were carried out with the statistical software package [R].57 Graph visualizations in Figure 6 and Figure 7 were calculated with version 2.6.3 of the software CytoScape.58 In the graphs, each vertex represents one structure. In Figure 6 edges between vertices are set if the normalized pairwise matching score of the pockets is greater than 0.15. In Figure 7, edges are obtained from the clustering algorithm. In cross-validation experiments, the 69 protein classes of Database 2 were 10 times randomly divided in training and validation sets, consisting of 35 and 34 classes, respectively. In each run, parameter values were calculated using only proteins of the respective training set, screening experiments were conducted only within the validation set. To receive results that are comparable to results from the experiments using parameters obtained from the compelete PDBbind core set, only the top 5 instead of the top 10 ranked proteins were considered for evaluation. Sequence identities between pockets were calculated by aligning the sequences of the extracted pockets from the PDB structures with ClustalW version 2.0.59

Results To evaluate the performance of our graph matching method for pocket comparison, all pairwise similarities of pockets from the PDBbind core set (Database 2) were calculated. A reliability score for the matching of two pockets was determined that indicates strong structural similarity. This threshold was then used to analyze detected pocket similarities. For visual inspection of the results, we generated a network representation of pocket similarities. We also benchmarked our results against other recently published methods10,18 and present examples of successful pocket screenings. To analyze the performance of PoLiMorph, each pocket of the PDBbind core set was matched against all other pockets of this set. The set comprises 69 predefined clusters, each containing three structures (cf. Materials and Methods for cluster definition). In the following, we refer to pocket structures that belong to the same ligand site class as the query as “expected pockets”. Pockets that do not belong to the same cluster are termed “other pockets”. The goal of this exercise was to rank proteins of the expected pockets at the top. On average, using trivial parallelization the matching of two precalculated pocket graphs took 9.9 ms on a MacPro computer with two 2.26 GHz Quad-Core Intel Xeon Nehalem processors. In all cases the query pocket was ranked first but calculated scores and ranks were not included in subsequent analyses. In 81% of the screenings, one of the two expected pockets was ranked first. In 91% of the screenings, at least one of the two structures was ranked among the top 10. In 66% of the screenings the expected pockets were ranked first and second, Journal of Proteome Research • Vol. 9, No. 12, 2010 6503

research articles

Reisen et al.

Figure 4. Score distributions of pocket comparisons within the PDB-Bind core set. (A) Score distributions were obtained in a single experiment using parameters obtained from the complete PDB-Bind core set. (B) Score distributions of 10 different runs with differing parameters, which were part of a 10 times random 50 + 50 cross validation experiment. “Expected Pocket” gives the distribution of scores within the predefined ligand site clusters. The distribution “Top 10 Ranked of Other Pockets” contains all scores of the top 10 ranked pockets not belonging to the same cluster as the queries (B: Top 5). “All Other Pockets” contains for each query all 204 (B: 99) achieved scores to the pockets not belonging to the same cluster. The reliability threshold τ was chosen based on (A).

respectively. Since these values might be too optimistic we estimated the generalization ability of our predictor in a 10 times random 50 + 50 cross-validation study. Here we obtained 81% (stddev ) 3%), 90% (stddev ) 3%), and 58% (stddev ) 5%), respectively. This outcome suggests that the predictor is reliable with minimal overfitting. To further investigate the scoring ability of PoLiMorph (cf. Materials and Methods for scoring procedure), for each query pocket the scores of the expected pockets were compared to the scores of the top 10 ranked other pockets (Figure 4). The comparison shows that in average scores of the expected pockets are significantly higher (KS-test: p-value 0.15, whereas less than 25% of the top 10 ranked decoys pass this threshold τ. Therefore, scores larger than τ can be seen as a conservative indicator of pocket similarity. τ was determined by visual inspection of the graph presented in Figure 4A. Again, the generalization ability of our predictor was tested in a 10 times random 50 + 50 cross validation study. The obtained results suggest that τ can indeed be seen as a conservative indicator of pocket similarity (Figure 4B), even though only 71% of the expected pockets yield a score value greater than 0.15. To identify pocket similarities between pockets within the PDBbind core set that are not obvious by their cluster membership, we analyzed all pocket pairs which, on the one hand, achieved a score larger than the reliability score threshold and, on the other hand, do not belong to the same protein cluster. Among these pockets were pairs of homologous structures with high sequence identity, as well as nonhomologous pairs. Serine proteases serve as an example for high-scoring matches between homologous structures. Several clusters of the PDBbind core set consist of serine proteases, such as thrombin, trypsin, factor Xa, urokinase, and elastase. These proteins share high pairwise sequence identity (mean ) 0.5, stddev ) 0.12) and are known to have common inhibitors.60 With an average matching score of 0.16, PoLiMorph correctly identifies the strong structural resemblance of these pockets. As indicated by their average sequence identity (mean ) 0.35, stddev ) 0.04), elastase pockets are less similar toward the other five serine proteases. This can also be observed in the average scores calculated by PoLiMorph (mean ) 0.09, stddev ) 0.03). 6504

Journal of Proteome Research • Vol. 9, No. 12, 2010

Still, PDB structures 2bok61 (fXa) and 1elb62 (elastase) achieve a score of 0.17, 1j1763 (trypsin) and 1bma64 (elastase) a score of 0.15, which indicates that PoLiMorph is able to consider them as similar. This is in perfect agreement with the observed promiscuity of ligands, such as serpin ecotin, a picomolar inhibitor of elastase, factor Xa, and trypsin.65,66 An example of a nonhomologous pair with high pocket similarity score is the matching of thymidylate synthase (TS) and dihydrofolate reductase (DHFR). The average pocket similarity score between TS and DHFR structures in the PDBbind core set is 0.13 (stddev ) 0.03) even though the average sequence identity between the structures is only 0.12 (stddev ) 0.04). Several pairs of TS and DHFR structures achieve scores above the reliability threshold t ) 0.15 (e.g., 2d0k67 (DHFR) and 1f4e68 (TS), score ) 0.18). This predicted pocket similarity fits nicely to the related pharmacology of these proteins.69 This example demonstrates that PoLiMorph is able to detect similarities in binding pockets from ‘a ligand’s perspective’ without underlying sequence conservation. Scytalone dehydratase (SD) and retinoic acid receptor γ (RARγ) represent another pair of particularly high-scoring pockets but unrelated sequences (e.g., 3std70 (SD), and 1fcx71 (RARγ), score ) 0.24, sequence identity ) 9%). To the best of our knowledge, no similarity between these structures or their ligands has explicitly been reported so far. Interestingly, the substrate scytalone is a naphthalene derivate, and there are naphthalene derivates acting as inhibitors of RARγ.72 Pocket similarity can be observed by comparing the two bound ligands of SD (MQ0 in 3std) and RARγ (BMS184394 in 1fcx). Both ligands are lipophilic and resemble each other in size and shape (Figure 5). Nevertheless, it should be noted that despite their similar shape, compound BMS184394 is basic and potentially positively charged whereas the cyanocinnoline MQ0 is acidic and potentially negatively charged. These charge differences were not detected by PoLiMorph, which might be a shortcoming of the electrostatics model used. To gain an impression of global similarities between pockets in the PDBbind core set, a graph-based visualization was constructed by means of the software CytoScape (Figure 6). All pockets are represented by vertices and colored according to defined protein classes, such as “kinase” or “aspartic protease”. Vertices are linked by an edge if their calculated

Structure-Based Comparison of Binding Pockets

Figure 5. Alignment of the X-ray conformations of the ligands “BMS184394” and “MQ0” according to the matching of the pocket graphs of RARγ (1fcx) and SD (3std). (A) Pocket graphs of RARγ (top) and SD (bottom) colored according to the similarity between matched vertices (from white “not assigned” to red “strong similarity”). Identical numbers in the graphs indicate assigned vertices. (B) BMS184394 is shown in stick and surface representation (yellow), purple). Pocket graph matching from (A) was used to deduce a rotation matrix, which in turn was employed to align the protein structures together with their bound ligands.

similarity is greater than τ. Once more, this visualization shows the ability of PoLiMorph to cluster different types of related pockets. To complement the information given in Figure 6, a list of all pocket pairs within the PDBbind core set that do not belong to the same predefined cluster and achieve a score greater than a certain threshold τ is given in the Supporting Information (Table S2, Supporting Information). For each pocket, the score distributions and the top 10 ranked proteins are given (Figure S3, Supporting Information). We compared PoLiMorph to other recently published methods, using two additional protein databases. The first database (Database 3) was proposed by Morris et al. (2005)54 and recently employed for parameter optimization by Das et al. (2009)18 for their Type 2 method of pocket comparison. The set contains four different pocket classes binding to four different ligand types namely steroids, adenosine triphosphate (ATP), nicotinamide adenine dinucleotide (NAD), and heme. Despite high variance in pocket structure and pocket size within the classes, PoLiMorph is able to place binding sites of the same ligand class into neighboring regions of the constructed hierarchical tree (Figure 7). The different heme binding sites, as well as the steroid sites are clearly separated from the other classes. According to the hierarchical tree, ATP and NAD binding sites are predicted to be more similar to each other than to the remaining classes. This seems reasonable because of the chemical resemblance of ATP and NAD. Nevertheless, one NAD

research articles containing ligand site of structure 1e3l73 was sorted into an ATP subtree. Interestingly, ATP structures in the binding sites of 1b7674 and 1asz75 both have a conformation that is different from the other bound ATP molecules in the database. Their pockets exhibit more similarity toward each other than to the remainder of the ATP pockets and are thus appropriately located in a separate branch of the ATP subtree. Overall, most of the matches between pockets of the same ligand class do not achieve a score higher than the reliability threshold defined in the previous subsection, but none of the scores of interclass matches exceeds this threshold, either (Figure S4, Supporting Information). This shows that “true” pocket similarities can be observed at lower scores but false-positives above the threshold are rare. We performed KS-tests between distributions of intraand interclass scores. These tests revealed that intraclass scores are significantly higher than interclass scores (heme: p < 2.2 × 10-16; ATP: p ) 2.2 × 10-5; NAD: p < 2.2 × 10-16; steroids: p < 2.2 × 10-16), as expected. For a second benchmark we chose 23 kinase structures belonging to six distinct classes (Database 4) proposed by Sciabola et al. (2010).10 In contrast to the previous test database (Database 3), its structures share high sequence identity across kinase classes in particular in the ATP binding site, which makes the set a challenging task for clustering. For two kinase families of the set, namely cyclin-dependent kinase 2/cyclin-A (CDK2) and glycogen synthase kinase 3 β (GSK3β), nonselective ligands for both targets are known.75,76 Despite strong structural similarity, PoLiMorph clusters most binding pockets of the same protein class in well-defined parts of the hierarchical tree (Figure 8A). The distribution of all pairwise scores (mean ) 0.16, stddev ) 0.04) reflects the high sequence identity among the different kinases. More than 62% of the pairs achieve a score above the reliability threshold of 0.15. The expected strong resemblance of CDK and GSK3β cannot be observed in the tree. Nevertheless, the score distribution of matches between CDK2, GSK3β, and the remaining 15 proteins show that PoLiMorph ranks GSK3β and CDK2 as more similar to each other than to the other kinases (Figure 8B). To demonstrate potential areas of application, we now present the results of some additional test cases for PoLiMorph. In a first example, we compared the pockets of two different hemoglobin structures, 1hvb77 and 2lhb.78 1hvb is a bacterial dimeric structure from Vitreoscilla stercoraria that had been resolved with a resolution of 1.83 Å, whereas 2lhb is a eukaryotic monomeric structure from Petromyzon marinus (resolution: 2.0 Å). Both proteins bind heme and bear the typical globin fold, although sequence alignments of 1vhb to the chains of 2lhb result in sequence identities of only 12%. PoLiMorph scores the similarities between the heme binding site of 1hvb and the binding sites in chain A and chain B with 0.27 and 0.29, respectively, which is clearly greater than the reliability threshold of 0.15. In a prospective study, where only one of the two structures is characterized, these calculated similarities could be used to predict function of the other protein (“deorphanization” of protein function). In another example, we screened the complete PDBbind database (Database 1) for pockets that are similar to the binding site of a major urinary protein (MUP) (1qy179), resolution: 1.7 Å). Major urinary proteins belong to the family of lipocalins that are known to bind and transport small hydrophobic molecules.80 In general, members of the lipocalins do not share high sequence identity but tertiary structures of the proteins are preserved. In our screening, the top-ranked protein was Journal of Proteome Research • Vol. 9, No. 12, 2010 6505

research articles

Reisen et al.

Figure 6. Graph representation of similarities found between pockets within the PDBbind core set created by CytoScape. Colors represent class memberships. Complete predefined clusters of the PDBbind core set that are not included in explicitly annotated classes are colored in different shades of blue. A cluster is defined as “complete”, if all three structures of the cluster are connected by edges.

Figure 7. Clustering of 40 different binding sites binding to the different ligand classes ATP, NAD, heme, and steroids. Each leaf of the unrooted tree is labeled with the represented PDB identifier and the bound ligand. The misclassification of the binding pocket in structure 1e3l, and the two ATP binding pockets 1asz and 1b76 that are bound to ATP molecules with different conformations than the other ATP containing pockets are marked by ellipsoids.

the second MUP structure (1qy2,79 resolution: 1.75 Å) stored in the PDBbind. In place two, three, and six were structures of odorant binding protein (1dzk,81 1g85, and 1hn282), which also belongs to the familiy of the lipocalins. Notably, the protein structure 1dzk is complexed with 2-isobutyl-3-methoxypyra6506

Journal of Proteome Research • Vol. 9, No. 12, 2010

zine, which is also bound to the query pocket of 1qy1. Besides the lipocalins, among the top 13 ranked proteins (∼1% of Database 1) are five different structures of retinoic acid receptor γ1 (RARγ1) and two structures of thyroid hormone receptor β1 (TRs beta). This is a meaningful result, as members of the

Structure-Based Comparison of Binding Pockets

research articles

Figure 8. Ward clustering and score analysis of Database 4. (A) Clustering of the six different kinase types. (B) Score distributions of matches between members within the CDK2 cluster (“CDK”), between members within the GSK3β cluster (“GSK3”), between members of CDK2 and members of GSK3β (“CDK to GSK3β”), between members of CDK2 and the other four kinase classes (“CKD to Others”), and between members of GSK3β and the other four kinase classes (“GSK3β to Others”).

lipocalin family such as the retinoid binding protein83 and the lipocalin-type prostaglandin D synthase84 are known to bind retinoid acids or thyroid hormones, respectively. As a final example, the apo-structure of a bacterial R-amylase (3dc085) was used as query for screening Database 1. Twentyeight different binding pockets reached a score g 0.15. Among these hits we found all six R-amylase structures that are deposited in Database 1. These eukaryotic structures from man (Homo sapiens) and wild boar (sus scrofa) were ranked in places 1-5, and 12. More than 50% of the remaining structures retrieved are annotated as glycosidases (9 beta-glucosidases, 2 endoglucanases, 1 nucleoside hydrolase, 1 mannosidase), just like the query. Three additional pockets belong to antigenbinding fragments that are complexed with different polysaccharides, and only six of the pockets are not annotated or known to be involved in sugar binding. The top-ranking pocket structure of the latter belongs to a methionine aminopeptidase (1wkm86). One may speculate whether ligands with a sugar scaffold are suitable candidates for the development of methionine aminopeptidase inhibitors.

Discussion In this work, we present a novel graph-based method for the comparison of protein binding pockets and showed its capabilities in various examples. First, we examined calculated similarities within the PDBbind core set and showed that similar pockets were retrieved in the screenings. Additionally, we were able to experimentally determine a reliability threshold τ, which indicates “true” pocket similarity and therefore is especially useful in prospective studies. In general, τ proved to be rather conservative since not all matches between similar pockets exceeded this limit. This becomes apparent in the similarity analysis of Database 2, where most comparisons of related pockets do not exceed the threshold score of 0.15. Still, pocket matches within one ligand class score higher than matches between pockets of different classes, leading to meaningful pocket clusters. This outcome demonstrates that score distributions of screening studies depend on both the query and the database under consideration. For a comparison with other recently published pocket matching techniques, we clustered two databases that had been

provided as benchmark sets. In both cases, PoLiMorph was able to yield competitive results. PoLiMorph was able to identify groups of binding pockets binding to similar ligands even in the absence of protein structure homology. The fact that the quality of the clustering proposed by our method is similar to the results reported by Das et al. (2009)18 is especially remarkable because in contrast to Das et al. (2009), we did not use the database for parameter tuning. Because parameter values of PoLiMorph such as means and standard deviations of grid property distributions were optimized on a different training set, the sustained good performance is promising for prospective studies. This statement is supported by the stability of prediction accuracy in the conducted cross validation studies. Clustering of the kinase set published by Sciabola et al. (2010)10 was successful as well. Only pockets of P38 kinase structures were subdivided into disparate branches of the tree. All other kinases were neatly grouped together. From the tree topology alone, however, no similarity of pharmacophoric features between CDK and GSK3β could be deduced. This known resemblance could only be observed by a look at the higher than average scores between members of the CDK and GSK3β structures. This finding highlights the importance and necessity of a careful analysis of both scores and clustering results before a decision regarding pocket similarity is rendered. For further evaluation of PoLiMorph, we exemplarily conducted retrospective screening experiments in the complete PDBbind database with selected query pockets. With the binding pocket of a major urinary protein (MUP) as query, different lipocalins were among the top ranked structures. This corroborates our earlier notion that PoLiMorph is able to identify similar pockets in the absence of high sequence identity. This feature was also observed in the matching of the two different hemoglobin structures 1hvb and 2lhb, where the high matching score indicated similar function or binding partners. Especially from the ligand’s point of view, the result of the screening with MUP was interesting, since the second ranked structure binds to the same ligand as the query. Additionally, structures were found that bind to retionoids or thyroids, which is in accordance with known lipocalins binding to these ligand classes.83,84 Journal of Proteome Research • Vol. 9, No. 12, 2010 6507

research articles In the context of drug design, the ability to predict similar pockets can be exploited in various ways. For example, potential ligand cross-reactions and target profiles might be obtained which could help avoid expensive failures in later stages of the drug discovery pipeline. Also, ideas for ligand scaffolds and selectivity optimization can be obtained by comparing known ligands for similar pockets. The screening example with the R-amylase structure 3dc0 is a good example of how PoLiMorph could be employed for predicting protein function. Since the employed query is an apo-structure, and the top ranked beta-glucosidases or antigen-binding fragments are nonhomologous to the query, this function identification is similar to a scenario where one starts with a real orphan protein. Nevertheless, one should keep in mind that our method does not explicitly address flexible fit issues. In cases where pronounced flexible fit effects occur upon ligand binding, PoLiMorph will most likely not be able to identify “true” hits, for example, in the three tyrosine phosphatase structures (1nwl,88 1c84,89 1nny90) from Database 4. Pockets 1nwl and 1nny are regarded as similar by the algorithm but dissimilar to 1c84 (cf. Figure S3, Supporting Information). A closer look at the protein structures reveals that the pocket of 1c84 indeed differs from the others: here, a loop fills parts of the cavity. One possibility to address this problem might be to perform molecular dynamic simulations with the query structure prior to the actual screening process. Pocket graphs from different time steps of the simulation can then be used for screening.87 The idea of comparing chemical objects by pharmacophore graphs has already been subject of established ligand-based approaches such as feature trees,91 or Discngine’s pharmacophore graphs that are implemented in the software Pipline Pilot.92 In these approaches molecules are condensed into graph representations that retain pharmacophoric features but abstract from the molecular constitution. Pharmacophoric representations have also been applied for the comparison of binding pockets13,17,93 but, in contrast to the majority of related studies,11,13-20 in our approach pockets are represented by their volume rather than their surface, or by the residues that form the ligand-binding cavity. Additionally, important properties for protein ligand interactions are projected from the surrounding residues into the volume. This kind of pocket description has the advantage that the volume directly represents the explicit space where a ligand is located when it interacts with the protein. Moreover, fuzzy set theory is applied for graph labeling, which contributes to a robust pocket description. In other words, pockets are described from the ligand’s point of view in an error-tolerant way. This can be a desirable feature when the method is used for prediction of ligand cross-reactivity. To circumvent expensive comparison of pockets by grid alignments, with up to several hundreds of grid points per pocket, we designed a new form of approximating the grids’ properties by a labeled graph. The construction of this graph includes the usage of a nondeterministic growing neural gas approach, as well as fuzzy labeling of the resulting vertices. Gaussian projections of pocket properties onto graph vertices ensure that the graph labels resemble the grid values. For the actual comparison of pockets, inexact graph matching is performed. Inexact graph matching is known to be an NP-complete problem94 and is consequently a computationally expensive task. In our field of application, the matching algorithm has to be fast enough to permit large-scale analyses of pocket similarities. Additionally, the graph matching algo6508

Journal of Proteome Research • Vol. 9, No. 12, 2010

Reisen et al. rithm has to be capable of compensating differences between the graphs because these differences can occur even if the represented pockets are identical. This behavior results from the stochastic component of the growing neural gas algorithm. We tested different variants of subgraph isomorphism,95 optimal assignment,96 and geometric hashing algorithms25 for the matching of the fuzzy labeled graphs without satisfying success (not shown). Therefore, we developed a new heuristic that meets both aforementioned criteria and is able to identify meaningful matches between similar pocket graphs. The average computation time of 9.9 ms per pocket matching allows for large-scale experiments, and local topological and property distinctions are compensated. We assume that the algorithm is useful in many different fields of application where spatially distributed sets of labeled points have to be matched. On this account, the algorithm can probably be an alternative to the frequently used geometric hashing method, just to name one possibility. The algorithm always matches k vertices, where k is the number of vertices in the smaller graph. This leads to the phenomenon that relatively small graphs in a database receive similar scores for all matches if the similarities between the matched vertices are simply summed up. To overcome this problem we chose to rescore the matches by a z-transformation, which rewards strong similarities and punishes dissimilarities (eq 9). This greatly improved rankings of screening results (preliminary results not shown). The second scaling step transforms the scores of the matching in a fixed range [-1,1] (eq 10). One major advantage of this normalization is to reduce the tendency of large pockets to yield high scores. Score normalization allowed for the deduction of a reliability threshold from the observed score distribution. It is commonly useful when global comparisons of pockets have to be carried out. When the method is employed for identifying small subpockets in structure databases, the second normalization (eq 10) should probably be avoided because it favors scores between pockets of similar size. Generally, our method proved to be robust against small parameter changes, for example, in the number of vertices per pocket or different functions for the grid potential calculations (not shown). A major advantage over related methods is that the obtained graph matching can be used to directly align pockets or whole proteins without the requirement for sequence homology, which can be useful for analyzing and comparing pockets that are predicted to be similar. A major limitation of PoLiMorph and related methods is their strong dependence on the pocket identification algorithm. Even though state-of-the-art techniques such as PocketPicker are able to correctly identify the location of pockets in most cases, subtle changes in the structure of a protein can lead to large differences in the dimension and shape of the calculated pocket. We tried to compensate for this problem by using sophisticated scoring functions, but there are notorious cases where our approach is not sufficiently sensitive. This behavior can be observed for several singleton structures in Figure 6. An example for such a singleton is the structure of a dialkylglycine decarboxylase (1zc997) for which matchings with the other proteins in its class (1m0n,98 1m0q98) do not result in a score greater than τ (cf. Figure S3, Supporting Information). The calculated grids of the three pockets have a common core region but differ otherwise. The scoring function of PoLiMorph is not able to compensate for such pronounced structural differences. We are convinced that a more robust way of

Structure-Based Comparison of Binding Pockets

Figure 9. Redocking of imatinib into the pocket of the tyrosineprotein kinase ABL2 by PoLiMorph graph matching. Pocket and ligand conformations were extracted from PDB structure 3gvu98 (resolution: 2.05 Å). Crystal structure conformation of the bound ligand is shown in light gray, the redocked pose is shown in dark gray.

identifying interaction sites (pocket extraction problem) would lead to further improvement of our results. A challenging future development of the PoLiMorph method might be ligand-ligand and ligand-pocket matching using graph frameworks for molecule representation. For example, ligand graphs can be matched with pocket graphs yielding a structure-based virtual screening tool. In a first attempt, we extracted imatinib and the corresponding binding pocket on tyrosine-protein kinase ABL2 from PDB structure 3gvu99 (Ugochukwu et al. 2009), and calculated the ligand and the pocket graph using the growing neural gas technique. After graph matching, imatinib was placed into the pocket according to the rotation matrix obtained from the graph alignment (Figure 9). The observed rmsd of 1.9 Å is promising first result. Thorough parameter optimization and benchmarking will be needed to further explore the full potential of fuzzy graph matching for ligand-pocket matching. Most importantly, ligand and receptor flexibility should be taken into consideration, potentially by calculating pocket and ligand graphs not only for selected conformations but for conformation ensembles.

Acknowledgment. We are grateful to Ewgenij Proschak, Tim Geppert, Volker Ha¨hnke, Alexander Klenner, Markus Hartenfeller, Matthias Rupp, and Yusuf Tanrikulu for discussion of fuzzy graphs and pocket architecture. Supporting Information Available: Supplementary Figures and Tables and Video S1. This material is available free of charge via the Internet at http://pubs.acs.org. References (1) Kolb, P.; Rosenbaum, D. M.; Irwin, J. J.; Fung, J. J.; Kobilka, B. K.; Shoichet, B. K. Proc. Natl. Acad. Sci. U.S.A. 2009, 106, 6843–6848. (2) Wang, R.; Fang, X.; Lu, Y.; Yang, C.; Wang, S. J. Med. Chem. 2005, 48, 4111–4119. (3) Weisel, M.; Proschak, E.; Schneider, G. Chem. Cent. J. 2007, 1, 7. (4) An, J.; Totrov, M.; Abagyan, R. Mol. Cell. Proteomics 2005, 4, 752– 761. (5) An, J.; Totrov, M.; Abagyan, R. Genome Inform. 2004, 15, 31–41. (6) Brown, D.; Superti-Furga, G. Drug Discovery Today 2003, 8, 1067– 1077. (7) Hajduk, P. J.; Huth, J. R.; Tse, C. Drug Discovery Today 2005, 10, 1675–1682. (8) Weisel, M.; Proschak, E.; Kriegl, J. M.; Schneider, G. Proteomics 2009, 9, 451–459. (9) Stauch, B.; Hofmann, H.; Perkovic´, M.; Weisel, M.; Kopietz, F.; Cichutek, K.; Mu ¨ nk, C.; Schneider, G. Proc. Natl. Acad. Sci. U.S.A. 2009, 106, 12079–12084.

research articles (10) Sciabola, S.; Stanton, R. V.; Mills, J. E.; Flocco, M. M.; Baroni, M.; Cruciani, G.; Perruccio, F.; Mason, J. S. J. Chem. Inf. Model. 2010, 50, 155–169. (11) Weill, N.; Rognan, D. J. Chem. Inf. Model. 2010, 50, 123–135. (12) Ekins, S. Drug Discovery Today 2004, 9, 276–285. (13) Jambon, M.; Imberty, A.; Dele´age, G.; Geourjon, C. Proteins 2003, 52, 137–145. (14) Campagna-Slater, V.; Arrowsmith, A. G.; Zhao, Y.; Schapira, M. J. Chem. Inf. Model. 2010, 50, 358–367. (15) Gold, N. D.; Jackson, R. M. J. Mol. Biol. 2006, 355, 1112–1124. (16) Milik, M.; Szalma, S.; Olszewski, K. A. Protein Eng. 2003, 16, 543– 552. (17) Schmitt, S.; Kuhn, D.; Klebe, G. J. Mol. Biol. 2002, 323, 387–406. (18) Das, S.; Kokardekar, A.; Breneman, C. M. J. Chem. Inf. Model. 2009, 49, 2863–2872. (19) Kinoshita, K.; Nakamura, H. Protein Sci. 2003, 12, 1589–1595. (20) Sael, L.; La, D.; Li, B.; Rustamov, R.; Kihara, D. Proteins: Struct., Funct., Bioinf. 2008, 73, 1–10. (21) Goodford, P. J. J. Med. Chem. 1985, 28, 849–857. (22) Cramer, R. D.; Patterson, D. E.; Bunce, J. D. J. Am. Chem. Soc. 1988, 110, 5959–5967. (23) Schalon, C.; Surgand, J.; Kellenberger, E.; Rognan, D. Proteins: Struct., Funct., Bioinf. 2008, 71, 1755–1778. (24) Levi, G. Calcolo 1973, 9, 341–352. (25) Rigoutsos, I.; Wolfson, H. IEEE Comput. Sci. Eng. 1997, 4, 1070– 9924. (26) Kellenberger, E.; Schalon, C.; Rognan, D. Curr. Comput.-Aided Drug Des. 2008, 4, 209–220. (27) Feige, U.; Peleg, D.; Kortsarz, G. Algorithmica 2001, 29, 410–421. (28) Zadeh, L. Inf. Control 1965, 8, 338–353. (29) Mordeson, J. N.; Nair, P. S. Fuzzy Mathematics. An Introduction for Engineers and Scientists; Physica-Verlag: Heidelberg, 1998. (30) Exner, T. E.; Keil, M.; Brickmann, J. J. Comput. Chem. 2002, 23, 1176–1187. (31) Hullermeier, E.; Weskamp, N.; Klebe, G.; Kuhn, D. In 2007 IEEE International Fuzzy Systems Conference; London, UK, 2007; pp 16. (32) Fritzke, B. In Advances in Neural Information Processing Systems 7: Proceedings of the 1994 Conference; Denver, CO, 1994; pp 625632. (33) Cormen, T. H.; Leiserson, C. E.; Rivest, R. L.; Stein, C. Introduction to Algorithms, 3rd ed.; MIT Press: Cambridge, MA, 2009. (34) Raymond, J. W.; Gardiner, E. J.; Willett, P. Comput. J. 2002, 45, 631–644. (35) Bunke, H.; Messmer, B. In Image Analysis and Processing: 8th International Conference; ICIAP ′95, San Remo, Italy, September 13-15, 1995; Springer: Berlin, 1995; pp 44-55. (36) Cordella, L. P.; Foggia, P.; Sansone, C.; Vento, M. In International Conference on Image Analysis and Processing; IEEE Computer Society: Los Alamitos, CA, 1999; p 1172. (37) McKay, B. D. Congr. Numer. 1981, 30, 45–87. (38) Ullmann, J. R. J. Assoc. Comput. Machinery 1976, 23, 31–42. (39) Proschak, E.; Rupp, M.; Derksen, S.; Schneider, G. J. Comput. Chem. 2008, 29, 108–114. (40) Gillet, V. J.; Leach, A. R. An Introduction to Chemoinformatics, 1st ed.; Springer: Dordrecht, Netherlands, 2003. (41) Gasteiger, J.; Engel, T. Chemoinformatics: A Textbook, 1st ed.; Wiley-VCH: Weinheim, 2003. (42) MacKerell; Bashford, D.; Bellott; Dunbrack; Evanseck, J. D.; Field, M. J.; Fischer, S.; Gao, J.; Guo, H.; Ha, S.; Joseph-McCarthy, D.; Kuchnir, L.; Kuczera, K.; Lau, F. T. K.; Mattos, C.; Michnick, S.; Ngo, T.; Nguyen, D. T.; Prodhom, B.; Reiher, W. E.; Roux, B.; Schlenkrich, M.; Smith, J. C.; Stote, R.; Straub, J.; Watanabe, M.; Wiorkiewicz-Kuczera, J.; Yin, D.; Karplus, M. J. Phys. Chem. B 1998, 102, 3586–3616. (43) Heiden, W.; Moeckel, G.; Brickmann, J. J. Comp.-Aided Mol. Des. 1993, 7, 503–514. (44) Wildman, S. A.; Crippen, G. M. J. Chem. Inf. Comput. Sci. 1999, 39, 868–873. (45) Liu, Z.; Wang, G.; Li, Z.; Wang, R. J. Chem. Theory Comput. 2008, 4, 1959–1973. (46) Dolinsky, T. J.; Czodrowski, P.; Li, H.; Nielsen, J. E.; Jensen, J. H.; Klebe, G.; Baker, N. A. Nucleic Acids Res. 2007, 35, 522–525. (47) Finer-Moore, J. S.; Fauman, E. B.; Morse, R. J.; Santi, D. V.; Stroud, R. M. Protein Eng. 1996, 9, 69–75. (48) Weisel, M.; Kriegl, J. M.; Schneider, G. ChemBioChem 2010, 11, 556–563. (49) Klenner, A.; Weisel, M.; Reisen, F.; Proschak, E.; Schneider, G. Mol. Inf. 2010, 29, 189–193. (50) Hebb, D. O. Conditioned and Unconditioned Reflexes and Inhibition; McGill University: Montreal, Quebec, Canada, 1932.

Journal of Proteome Research • Vol. 9, No. 12, 2010 6509

research articles (51) Asahiro, Y.; Iwama, K.; Tamaki, H.; Tokuyama, T. J. Algorithms 2000, 34, 203–221. (52) Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E. Nucleic Acids Res. 2000, 28, 235–242. (53) Ng, K. K.; Drickamer, K.; Weis, W. I. J. Biol. Chem. 1996, 271, 663– 674. (54) Morris, R. J.; Najmanovich, R. J.; Kahraman, A.; Thornton, J. M. Bioinformatics 2005, 21, 2347–2355. (55) Ward, J. H. J. Am. Stat. Assoc. 1963, 58, 236–244. (56) Massey, E. J. J. Am. Stat. Assoc. 1951, 68–78. (57) R Development Core Team R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria. (58) Shannon, P.; Markiel, A.; Ozier, O.; Baliga, N. S.; Wang, J. T.; Ramage, D.; Amin, N.; Schwikowski, B.; Ideker, T. Genome Res. 2003, 13, 2498–2504. (59) Larkin, M. A.; Blackshields, G.; Brown, N. P.; Chenna, R.; McGettigan, P. A.; McWilliam, H.; Valentin, F.; Wallace, I. M.; Wilm, A.; Lopez, R.; Thompson, J. D.; Gibson, T. J.; Higgins, D. G. Bioinformatics 2007, 23, 2947–2948. (60) Weber, L. Curr. Med. Chem. 2002, 9, 2085–2093. (61) Scha¨rer, K.; Morgenthaler, M.; Paulini, R.; Obst-Sander, U.; Banner, D. W.; Schlatter, D.; Benz, J.; Stihle, M.; Diederich, F. Angew. Chem., Int. Ed. 2005, 44, 4400–4404. (62) Mattos, C.; Rasmussen, B.; Ding, X.; Petsko, G. A.; Ringe, D. Nat. Struct. Biol. 1994, 1, 55–58. (63) Reyda, S.; Sohn, C.; Klebe, G.; Rall, K.; Ullmann, D.; Jakubke, H. D.; Stubbs, M. T. J. Mol. Biol. 2003, 325, 963–977. (64) Peisach, E.; Casebier, D.; Gallion, S. L.; Furth, P.; Petsko, G. A.; Hogan, J. C.; Ringe, D. Science 1995, 269, 66–69. (65) Eggers, C. T.; Murray, I. A.; Delmar, V. A.; Day, A. G.; Craik, C. S. Biochem. J. 2004, 379, 107–118. (66) Gillmor, S.; McGrath, M.; Fletterick, R. Perspect. Drug Discovery Des. 1995, 2, 475–482. (67) Iwakura, M.; Maki, K.; Takahashi, H.; Takenawa, T.; Yokota, A.; Katayanagi, K.; Kamiyama, T.; Gekko, K. J. Biol. Chem. 2006, 281, 13234–13246. (68) Erlanson, D. A.; Braisted, A. C.; Raphael, D. R.; Randal, M.; Stroud, R. M.; Gordon, E. M.; Wells, J. A. Proc. Natl. Acad. Sci. U.S.A. 2000, 97, 9367–9372. (69) Vogt, I.; Mestres, J. Mol. Inf. 2010, 29, 10–14. (70) Chen, J. M.; Xu, S. L.; Wawrzak, Z.; Basarab, G. S.; Jordan, D. B. Biochemistry 1998, 37, 17735–17744. (71) Klaholz, B. P.; Mitschler, A.; Moras, D. J. Mol. Biol. 2000, 302, 155– 170. (72) Bernard, B. A.; Bernardon, J.; Delescluse, C.; Martin, B.; Lenoir, M.; Maignan, J.; Charpentier, B.; Pilgrim, W. R.; Reichert, U.; Shroot, B. Biochem. Biophys. Res. Commun. 1992, 186, 977–983. (73) Svensson, S.; Ho¨o¨g, J. O.; Schneider, G.; Sandalova, T. J. Mol. Biol. 2000, 302, 441–453. (74) Arnez, J. G.; Dock-Bregeon, A. C.; Moras, D. J. Mol. Biol. 1999, 286, 1449–1459. (75) Cavarelli, J.; Eriani, G.; Rees, B.; Ruff, M.; Boeglin, M.; Mitschler, A.; Martin, F.; Gangloff, J.; Thierry, J. C.; Moras, D. EMBO J. 1994, 13, 327–337. (76) Vulpetti, A.; Crivori, P.; Cameron, A.; Bertrand, J.; Brasca, M. G.; D’Alessio, R.; Pevarello, P. J. Chem. Inf. Model. 2005, 45, 1282– 1290.

6510

Journal of Proteome Research • Vol. 9, No. 12, 2010

Reisen et al. (77) Tarricone, C.; Galizzi, A.; Coda, A.; Ascenzi, P.; Bolognesi, M. Structure 1997, 5, 497–507. (78) Honzatko, R. B.; Hendrickson, W. A.; Love, W. E. J. Mol. Biol. 1985, 184, 147–164. (79) Bingham, R. J.; Findlay, J. B. C.; Hsieh, S.; Kalverda, A. P.; Kjellberg, A.; Perazzolo, C.; Phillips, S. E. V.; Seshadri, K.; Trinh, C. H.; Turnbull, W. B.; Bodenhausen, G.; Homans, S. W. J. Am. Chem. Soc. 2004, 126, 1675–1681. (80) Ganfornina, M. D.; Gutierrez, G.; Bastiani, M.; Sanchez, D. Mol. Biol. Evol. 2000, 17, 114–126. (81) Vincent, F.; Spinelli, S.; Ramoni, R.; Grolli, S.; Pelosi, P.; Cambillau, C.; Tegoni, M. J. Mol. Biol. 2000, 300, 127–139. (82) Ramoni, R.; Vincent, F.; Grolli, S.; Conti, V.; Malosse, C.; Boyer, F. D.; Nagnan-Le Meillour, P.; Spinelli, S.; Cambillau, C.; Tegoni, M. J. Biol. Chem. 2001, 276, 7150–7155. (83) Siegenthaler, G.; Saurat, J. Biochem. Biophys. Res. Commun. 1987, 143, 418–423. (84) Beuckmann, C. T.; Aoyagi, M.; Okazaki, I.; Hiroike, T.; Toh, H.; Hayaishi, O.; Urade, Y. Biochemistry 1999, 38, 8006–8013. (85) Alikhajeh, J.; Khaheh, K.; Ranjbar, B.; Naderi-Manseh, M.; NaderiManesh, H.; Chen, C. J. PDB 2008, 3DC0. (86) Copik, A. J.; Nocek, B. P.; Swierczek, S. I.; Ruebush, S.; Jang, S. B.; Meng, L.; D’souza, V. M.; Peters, J. W.; Bennett, B.; Holz, R. C. Biochemistry 2005, 44, 121–129. (87) Tanrikulu, Y.; Proschak, E.; Werner, T.; Geppert, T.; Todoroff, N.; Klenner, A.; Kottke, T.; Sander, K.; Schneider, E.; Seifert, R.; Stark, H.; Clark, T.; Schneider, G. ChemMedChem 2009, 4, 820–827. (88) Erlanson, D. A.; McDowell, R. S.; He, M. M.; Randal, M.; Simmons, R. L.; Kung, J.; Waight, A.; Hansen, S. K. J. Am. Chem. Soc. 2003, 125, 5602–5603. (89) Andersen, H. S.; Iversen, L. F.; Jeppesen, C. B.; Branner, S.; Norris, K.; Rasmussen, H. B.; Møller, K. B.; Møller, N. P. J. Biol. Chem. 2000, 275, 7101–7108. (90) Szczepankiewicz, B. G.; Liu, G.; Hajduk, P. J.; Abad-Zapatero, C.; Pei, Z.; Xin, Z.; Lubben, T. H.; Trevillyan, J. M.; Stashko, M. A.; Ballaron, S. J.; Liang, H.; Huang, F.; Hutchins, C. W.; Fesik, S. W.; Jirousek, M. R. J. Am. Chem. Soc. 2003, 125, 4087–4096. (91) Rarey, M.; Dixon, J. S. J. Comp.-Aided Mol. Des. 1998, 12, 471– 490. (92) SciTegic Pipeline Pilot, version 7.5; Accelrys Inc.: San Diego, 2009. (93) Shulman-Peleg, A.; Nussinov, R.; Wolfson, H. J. J. Mol. Biol. 2004, 339, 607–633. (94) Abdulkader, A. M. Parallel Algorithms for Labelled Graph Matching, Colorado School of Mines, 1998. (95) Bron, C.; Kerbosch, J. Commun. Assoc. Comput. Machinery 1973, 16, 575–577. (96) Rupp, M.; Proschak, E.; Schneider, G. J. Chem. Inf. Model. 2007, 47, 2280–2286. (97) Fogle, E. J.; Liu, W.; Woon, S.; Keller, J. W.; Toney, M. D. Biochemistry 2005, 44, 16392–16404. (98) Liu, W.; Rogers, C. J.; Fisher, A. J.; Toney, M. D. Biochemistry 2002, 41, 12320–12328. (99) Ugochukwu, E.; Salah, E.; Barr, A.; Mahajan, P.; Shrestha, B.; Savitsky, P.; Chaikuad, A.; Filippakopoulos, P.; Roos, A.; Pike, A. C. W.; Von Delft, F.; Bountra, C.; Arrowsmith, H.; Weigelt, J.; Edwards, A.; Knapp S. PDB 2009, 3GVU.

PR100719N