Index-Based Searching of Interaction Patterns in Large Collections of

Jan 27, 2017 - (4, 5) From a more abstract point of view, all of these tasks have in common the fact that they are searching for a specific spatial ar...
0 downloads 0 Views 5MB Size
Article pubs.acs.org/jcim

Index-Based Searching of Interaction Patterns in Large Collections of Protein−Ligand Interfaces Therese Inhester, Stefan Bietz, Matthias Hilbig, Robert Schmidt, and Matthias Rarey* Center for Bioinformatics, University of Hamburg, Bundesstrasse 43, 20146 Hamburg, Germany S Supporting Information *

ABSTRACT: Comparison of three-dimensional interaction patterns in large collections of protein−ligand interfaces is a key element for understanding protein−ligand interactions and supports various steps in the structure-based drug design process. Different methods exist that provide query systems to search for geometrical patterns in protein−ligand complexes. However, these tools do not meet all of the requirements, which are high query variability, an adjustable search set, and high retrieval speed. Here we present a new tool named PELIKAN that is able to search for a variety of geometrical queries in large protein structure collections in a reasonably short time. The data are stored in an SQLite database that can easily be constructed from any set of protein−ligand complexes. We present different test queries demonstrating the performance of the PELIKAN approach. Furthermore, two application scenarios show the usefulness of PELIKAN in structure-based design endeavors.



INTRODUCTION It is a common procedure in medicinal chemistry to use existing protein−ligand complexes in order to deduce hypotheses about preferred interactions between amino acids and parts of small molecules. Such comparisons are the key elements if, for example, chemoisosteric pockets are searched,1 functional groups interacting with similar protein environments are to be found,2,3 or preferred interaction geometries are analyzed.4,5 From a more abstract point of view, all of these tasks have in common the fact that they are searching for a specific spatial arrangements of atoms in the protein−ligand interface. With more than 115 000 structures in the Protein Data Bank (www.rcsb.org),6 a manual search for common geometric features is not feasible anymore. Fortunately, not only has the number of available protein−ligand structures grown constantly over the last years, but also, the quality of the structures has increased7 and the functional coverage of the proteins has widened thanks to different structural genomics projects.8 Advanced tools are needed to rapidly extract the relevant information concerning the spatial arrangement of atoms at the protein−ligand interface. A perfect tool for such a task would require a flexible query mechanism allowing the formulation of various geometrical questions related to protein−ligand interfaces. It would enable searches of large data sets in a © XXXX American Chemical Society

reasonably short time and present the results in a comprehensive manner. Ideally, the hardware requirements would be low, such that even difficult geometrical queries could be answered on a desktop computer or laptop without an underlying server infrastructure. A very convenient means to compare interaction patterns is provided by structural interaction fingerprints (SIFts),9 which represent the presence or absence of interactions between distinct residues and ligand atoms as bit strings that can be compared rapidly. The specific residues of a protein are encoded by specific positions in the bit string, making it impossible to compare interaction patterns of different proteins. In principle, tools searching for pharmacophores in threedimensional (3D) space meet some of the described requirements. The development started in the late 1980s10 and still today11 new approaches are being published, showing their ongoing importance. In contrast to the attempt described here, the search mechanism in these tools is limited to defined features in small molecules. A free search for structural elements of interest is therefore not possible. Furthermore, the search is usually limited to sets of small molecules. The recently published tool CrossMiner11 extends this principle and Received: September 16, 2016

A

DOI: 10.1021/acs.jcim.6b00561 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

the complete algorithm can be transferred to other SQL-based database systems. The Query. A PELIKAN query consists of a set of search points in 3D space that represent potentially interacting atoms. Each search point can be further specified by the following properties: (1) its element type, (2) its interaction type, and (3) the molecule from which it derives (e.g., protein or ligand). If the point represents a protein atom, further properties can be defined: (4) the amino acid type, (5) the location within the residue (e.g., backbone or side chain), (6) the secondary structure into which the amino acid is embedded, and if the amino acid type is given, (7) the atom name. If the point belongs to a small molecule, a functional group determining the environment of matching atoms can be defined. In both cases, protein or small molecule, the chemical environment of a point can be defined using a recursive SMARTS22 pattern. In PELIKAN, not only can the SMARTS pattern be used to define the chemical environment of a search point, but also, a chemical substructure containing multiple search points can be defined by adding labels to atoms in the SMARTS pattern (for an example, see Application Examples). Table S1 in the Supporting Information provides a detailed list of search point attributes and their value ranges. Every two search points can be combined with a point−point constraint. This can be either a distance interval or an interaction type. An overview of the interaction types and their geometrical constraints used here is shown in Figure 1.

is able to perform searches on collections of protein−ligand interfaces. Nevertheless, the limitation to predefined pharmacophoric elements remains. A few methods for geometric analysis of binding sites have been published in the past.12 The tools ASSAM13 and IMAAAGINE14 can be used to search for geometrical arrangements of amino acids (e.g., a catalytic triad). However, the search is restricted to proteins. PDBeMotif15 (http://www. ebi.ac.uk/pdbe-site/pdbemotif) is also mainly focused on proteins. Various patterns describing spatial arrangements of amino acids can be defined and combined with distance constraints to ligand atoms. The search engine of PDBeMotif does not allow the definition of ligand substructures in a query. Searching for ligand substructures in combination with distance and interaction constraints to amino acids can be performed with the tool Prolix.16 Unfortunately, the tool lacks atomic precision for the protein part of the query. Relibase17 and its commercial version Relibase+, which were released in 2003, are to our knowledge still the only tools supporting precise definitions of both protein and ligand query parts.18 Relibase and Relibase+ can only be used with a server infrastructure, making it difficult to build a database out of user-defined collections of protein−ligand complexes. In our hands, computing times for queries could be quite long using Relibase. Recently, commercial software vendors have also started to develop protein−ligand databases with geometric searching.19,20 To date, not very much is known about their underlying algorithmic approaches and their performances. Here we report on the development of the tool PELIKAN and its underlying algorithmic techniques. PELIKAN can be used to efficiently mine variable atom-based 3D queries in protein−ligand interfaces on a standard laptop. We show herein that even difficult geometrical queries can be processed in a reasonably short time. Unfortunately, it is not an easy undertaking to compare our method with existing tools. This is due to the fact that most of them are not publicly available and often the performance of the algorithms has not been examined in detail in the respective publications. At the end of the Results and Discussion, we show two application examples demonstrating the ability of PELIKAN to search for geometrical queries. Since example applications are inappropriate for a detailed performance analysis, we present a test scenario examining the dependence of different aspects of the query and the used data collection in detail. The results of these benchmark experiments are presented in Query Computing Time. In the Supporting Information we provide all of the data sets and queries used, making possible a detailed benchmark on existing and novel approaches.

Figure 1. Overview of the different interaction types used in PELIKAN and their geometrical parameters. vdW = van der Waals.

Every pairwise combination of distance constraints, interactions, or interaction directions at search points can be combined with a range of allowed angles. The geometrical part of a PELIKAN query can be combined with textual and numerical properties of the complex, the protein, the pocket, and the ligand. All of the textual and numerical properties that can be used and their possible values are listed in Table S2. Database Construction. Even though only one database file is used in this application, the tables can be logically grouped into three parts. The first part stores general information about all small molecules, metals, waters, and proteins that is needed to fully reconstruct each protein−ligand complex later on. This part of the database has already been used in iRaise,23 a tool for inverse virtual screening. In the second part, numerical and textual properties of ligands, proteins, and pockets are stored. Data needed in order to answer geometrical queries rapidly is stored in the third part.



METHODS From a computational perspective, we are aiming at a fast geometric pattern recognition approach. Although solvable in polynomial time in practice, these search processes are usually very resource-demanding. In order to achieve short computing times, we apply classical relational database concepts to store the 3D information on a complex as well as auxiliary data extracted from the PDB files. The key to fast query processing is the careful design of 3D indexing techniques that allow trading with space and time requirements. In this first version of PELIKAN, we employ an SQLite21 database, which does not need any server infrastructure and thus represents an easy and light alternative to server-based database systems. In principle, B

DOI: 10.1021/acs.jcim.6b00561 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling The database is constructed from protein−ligand complexes in PDB or PDBxMMCIF format using the NAOMI framework.24 Hydrogen positions in the complete protein−ligand complex are calculated using Protoss.25 For each input file, a set of ligands is detected first. This set includes covalently bound small molecules. The detailed approach used here to detect ligands in PDB files is explained in the Supporting Information (see paragraph S1). All detected ligands with more than five and fewer than 100 heavy atoms are considered as reference ligands and are used to define pockets. A pocket contains all residues, waters, metals, and other ligands that have a minimal atom−atom distance of 6.5 Å to one of the reference ligand’s atoms. Using this distance range, all parts of the pockets that are able to form a direct interaction to the reference ligand are considered. As a first step, general information about the complete protein−ligand complex as well as about each detected pocket is stored in the first part of the database. All numerical and textual properties that can be specified in the query are calculated and stored in the second part of the database. Therein, pocket properties are calculated using the DoGSite26 algorithm. Geometrical properties of each protein− ligand complex are stored in the third part of the database. This includes the calculation of non-covalent interactions between all atoms within the pocket. All of the types of calculated interactions as well as the geometrical constraints used can be seen in Figure 1. Additionally, each heavy atom within the pocket is considered as potential result point (PRP) and stored in the database together with its coordinates, its interaction type, and all of its atom properties. If an atom can form interactions of different types (e.g., the oxygen of a water molecule), the atom is stored several times in the database (e.g., first as a donor and second as an acceptor). All atoms are stored at least once in the database independent of their ability to form predefined interactions. Accelerating Geometric Queries. In general, databases provide fast access to data stored in a single table. Data distributed over different tables, however, requires join operations that can be very time-consuming. The computing time for joins can be dramatically reduced if, instead of joining complete tables, only parts of the tables are selected with a WHERE condition beforehand. In our approach, such join operations are needed if distances between two PRPs are queried. If the distance between two PRPs with rarely occurring properties (e.g., element = Calcium) is queried, the join operation is quite fast. However, if the distance between two carbon atoms is requested, the join operation is very timeconsuming. For this reason, we developed a descriptor that stores the existence of every PRP in one of 15 642 triangles. Schematically, the construction of the triangle descriptor is depicted in Figure 2. For each PRP, all possible triangles with side lengths between 2.5 and 8.5 Å are constructed. In this process, the two remaining corners of the triangle can each be one of the following types: donor, acceptor, aromatic, oxygen, nitrogen, or carbon from either a small-molecule atom or a protein atom. Triangles with two carbon−protein corners appear very often in the database and are thus neglected here. The three distances of the triangle are binned between 2.5 and 8.5 Å in 1 Å intervals. In the database, the triangle descriptor is stored as a bit vector bt for each triangle t. The bit bt[i] is set to 1 exactly if PRP i is part of a triangle of type t. This “inverted index” has previously been used by Sheridan et al.10 in the context of 3D substructure searches in small-molecule collections.

Figure 2. Structure of the triangle descriptor and its calculation. Blue points represent PRPs, purple points represent the two other corner points of a triangle, and black arrows represent the triangle side lengths. Atoms have to match one of the features in the list to be a valid corner point of a triangle. The considered side lengths between the corners of the triangle are listed below. The table schematically displays the triangle descriptor table, which stores a byte array for every triangle. For more information, see Accelerating Geometric Queries.

Search Algorithm. The development of PELIKAN’s stepwise query mechanism has mainly been influenced by two strategies: (i) early reduction of the possible result set and (ii) execution of fast search steps before time-consuming steps. Our algorithm can be separated into five distinct steps that are explained in detail in the following paragraphs. A complete overview of the search algorithm can be seen in Figure 3. Step 1. Our search algorithm starts by querying all of the textual and numerical properties that have been specified for the complex, the protein, the pocket, and the reference ligand. If the substructure defined for a search point is very discriminative (i.e., if it covers more than four atoms or contains elements other than carbon, oxygen, or nitrogen), the SMARTS pattern is additionally evaluated in step 1. Large SMARTS patterns and those containing rare elements are usually highly discriminative since they exclude a large portion of the data. From all of the resulting complexes, proteins, pockets, and ligands from this step, a list of possible pockets is constructed. This list is passed to step 2. Step 2. In the second step, the triangle descriptor is used to get a list of possible PRPs for each search point. All triangles with side lengths between 2.5 and 8.5 Å and the considered elements and interaction types are extracted from the query. From these relevant triangles, a list of PRPs occurring in a given triangle can be generated by querying the triangle descriptor table in the database. During this search, only those PRPs that are part of the possible pockets detected in step 1 are considered. Finally, for each search point that is part of a relevant triangle in the query, a list of possible PRPs is generated. The list of possible pockets and the lists of possible PRPs are then transferred to step 3. Step 3. In step 3, all of the point−point constraints defined in the query are processed sequentially. In the following, a database query for one point−point constraint will be termed a “request”. This terminology should avoid confusion between C

DOI: 10.1021/acs.jcim.6b00561 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

Figure 3. Search algorithm in PELIKAN, divided in five distinct steps. Search points are displayed as green dots, and blue dots represent PRPs. For more information, see Search Algorithm.

Figure 4. Disk space and calculation times of PELIKAN databases with different numbers of protein−ligand complexes. (a) Disk spaces of different databases, each separated into the disk space used by the triangle descriptor table (dark blue) and that used by all other tables (light blue). (b) Counts of pockets (dark red) and potential result points (light red) in different databases. (c) Box-whisker plot displaying the runtimes for adding a protein−ligand complex to a PELIKAN database without triangle descriptor calculation. Each column represents the addition of a subsequent data slice of 10 000 PDB files. Red line = median; red dot = mean; blue box = upper and lower quartiles. (d) Runtimes for calculating the triangle descriptor for a complete database (light blue) and for adding 10% of the database content to the triangle descriptor (dark blue).

single requests and the complete query containing several point−point constraints. First, the processing order of the point−point constraints is calculated by estimating the number of expected results. From a simple count of different element and interaction types in the database, the number of expected results is calculated as the product of the numbers of appearances of the points. Point−

point constraints with smaller products are processed before point−point constraints with larger products. Herein, each processing step starts with a database request detecting all matching PRP pairs for one point−point constraint. In this request, all of the properties of the search points as well as the lists of possible PRPs, if applicable, and the list of possible pockets are used. In the case of an interaction constraint, the D

DOI: 10.1021/acs.jcim.6b00561 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

disk drive (500 GB, model ST500DM002) with an ext4 file system (herein denoted as HDD); (2) a standard PC equipped with an Intel I5-4570 (3.2 GHz) processor, 8 GB of main memory, and a Samsung 950 pro PCIe solid state drive (512 GB, model nvme) with a btrfs file system (herein denoted as SSD). Both platforms run with a standard configuration of a Linux openSUSE 13.1 distribution. Database Construction. The most critical performance parameters for PELIKAN’s database are the size and the construction time. As shown in Figure 4a, the size of the databases depends linearly on the number of inserted protein− ligand complexes. If the complete PDB is used, the resulting database has a size of about 80 GB. For each of the different databases, about 50% of the size is used by the triangle descriptor tables. Figure 4b gives an overview of the numbers of pockets and PRPs stored in the databases. If the complete PDB is used, about 260 000 pockets and almost 70 million PRPs are stored in the database. Upon database construction, first all of the protein−ligand complexes are added to the database, and afterward, the triangle descriptor is calculated for all of the points. Thus, the computing time can be separated into that for merely adding the protein−ligand complexes to the database and that for calculating the triangle descriptor. Figure 4c shows the computing time needed to add a protein−ligand complex to the database. Each column represents the addition of a subsequent data slice of 10 000 PDB files. The mean computing time per PDB file slightly fluctuates between 6.8 s (data block 0−10000) and 9.8 s (data block 50000−60000). However, the construction time does not increase in the course of the plot, which means that the construction time does not increase with database size. The time for calculating the triangle descriptor depends on the size of the database in a linear way (see Figure 4d). In conclusion, constructing a complete database with data set 8000 takes about 22 h. For the complete PDB, about 190 h is required. In our opinion, the typical workflow for our tool will be the single construction of a database of interest, which will be subjected to different geometrical queries. This makes longer construction times tolerable. Even more important might be the possibility to add protein−ligand complexes to an existing database. Thereby, the newest structures of the PDB or in-house structures could be added to an existing database. To this end, again the protein−ligand complexes first have to be added to the database and then the triangle descriptor has to be calculated. The first step does not differ from the addition of protein−ligand complexes during database construction since only new rows are appended to the respective tables. Concerning the triangle descriptor, the new PRPs have to be appended to each descriptor bit string. Figure 4d shows the computing time required for augmenting the triangle descriptor by 10% of the protein−ligand complexes. Thus, about 6000 protein−ligand complexes could be added to a database containing the complete PDB in about 20 h (∼13 h to add the protein−ligand complexes plus ∼7 h to update the triangle descriptor). Correctness. In order to validate the correctness of PELIKAN, we checked our algorithm for false-positive results. Herein, for every resulting match the fulfillment of all required constraints is tested. A much more difficult endeavor is the detection of false-negative results. To this end, we first constructed a database of 200 randomly chosen PDB files. Afterward, every complex of the 200 PDB files was randomly translated and rotated in 3D space. Then, for each pocket, randomly eight to ten atoms of the pocket were used to

type of the interaction is also used. If the point−point constraint contains a distance range, the Euclidian distance between the two PRPs is calculated for each pair during the request and compared with the allowed minimal and maximal distances. After each database request, the lists of possible pockets and possible PRPs for each search point are updated. If no results are found for a point−point constraint in a specific pocket, this pocket is excluded for further database requests. By the use of this procedure, the lists of possible pockets and possible PRPs for each search point get shorter with every database request. After these lists are updated, the next point−point constraint is processed. If no PRPs for a search point remain at some point, the complete process stops. At the end of this step, a list has been generated for every search point containing the possible PRPs that match the particular point and all point−point constraints in which this point is involved. Step 4. In the next step, all valid matches of the complete query are generated by building all possible combinations of the matching PRPs. Herein, a product graph is constructed for every pocket. According to the concept of line graphs from RASCAL,27 a node is generated for each detected combination of a point−point constraint with a PRP pair. Two nodes in this graph are connected if their PRP assignments do not contradict each other. Every clique of size n in this product graph represents a valid match of the complete query, with n being the number of point−point constraints in the query. It should be noted that no absolute stereoconfiguration can be defined using only distance and angle constraints. Thus, the stereoconfigurations of the resulting matches are not compared to the stereoconfiguration of the defined query. Step 5. In the last step of the search algorithm, the substructure constraints that could have been annotated for a search point are checked. PELIKAN supports the processing of SMARTS patterns describing one atom that matches the respective search point. Additionally, labels of other search points can be added to the SMARTS pattern in order to define the chemical relation between search points. For each match generated in step 4, all of the SMARTS patterns are checked, taking all of the matching atoms into consideration. A large number of matches may lead to large runtimes in this step. Therefore, the SMARTS matching process automatically stops after 1000 matches have been detected.



RESULTS AND DISCUSSION The performance of PELIKAN was tested on different data sets. A large database was constructed using all of the protein− ligand complexes from the PDB (accessed November 2016). All PDB files containing at least one reference ligand (69 481 PDB files, 55.9% of the complete PDB) were added to the database. In the following, the set containing all of the PDB files that were added to this database will be called the “complete PDB”. From this collection, random sets of PDB codes containing 2000, 4000, 8000, 16 000, and 32 000 codes were generated. These sets were used to build databases with increasing size. Each database will in the following be denoted by its PDB file count. The exact PDB codes involved in each data set are listed in the Supporting Information. All of the calculations shown here were performed on a single core of one of two different hardware settings: (1) a standard PC equipped with an Intel I5-4570 (3.2 GHz) processor, 8 GB of main memory, and a Seagate Desktop hard E

DOI: 10.1021/acs.jcim.6b00561 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

Figure 5. Definition of test queries. (a) 3D and 2D depictions of the pocket of PDB structure 1j7u around the ligand ANP for standard queries. Atoms used for the query are highlighted in green and marked with the number corresponding to the search point. (b) 3D and 2D depictions of the pocket of PDB structure 1j7u around the ligand ANP for queries with rarely occurring attributes. Atoms used are highlighted in green and marked with the number corresponding to the search point. (c−e) Topologies for all used test queries. Green points represent the search points, and the id for each search point is given as a black number. Black lines represent distance constraints. The distance between the corresponding atoms in PDB structure 1j7u is displayed at each distance constraint.

Figure 6. Runtimes of test queries. Each bar displays the mean value and the standard deviation of three independent experiments. Above each bar, the number of resulting matches is shown. (a) Influence of atoms with rarely occurring attributes on the query runtime. (b) Influence of the hardware reading ability on the query runtime. Left bars: runtimes on HDD hardware. Right bars: runtimes on SSD hardware.

e. In order to assess the influence of one specific attribute on the runtime of a query, this attribute of a query was changed in several steps, and the results were compared with those for the standard case. The exact definitions of all of the test queries can be found in the Supporting Information. If not otherwise stated, all of the tests were conducted on a database containing 16 000 protein−ligand complexes and 1j7u on the SSD hardware setup. In most of our tests, we saw a clear dependence of the query run time on the number of results. The number of results can be reduced by adding more details to the query, e.g., by making the point constraints more precise (see Figure S1a in the Supporting Information), or by adding constraints on the pocket or the complete structure (see Figure S1b). Most of these speed-ups are a result of the reduction of time spent for step 3, which is usually the most time-consuming step (see Figures S2−S6). The distances used in point−point constraints influence the runtime of a query. First, larger distance intervals lead to more results and thus to longer runtimes (see Figure S2a). Second, if larger distances are used in a query by keeping the interval

generate a query with randomized pairwise distance ranges around the actual pairwise distance within the structure. Using this query on the database, we made sure that at least the original pocket was detected. The whole procedure was repeated three times. In all of the experiments, neither falsepositive nor false-negative results were detected. Query Computing Time. In order to get a profound understanding of our algorithm as well as to efficiently use PELIKAN, it is important to know which parts of queries lead to fast completion and which result in long computing times. Therefore, we developed a set of test queries that differ in their number of point constraints and in their topologies (see Figure 5). In order to make sure that for every query at least one result was found, the queries were designed such that they always matched atoms in the protein−ligand interface of the protein− ligand complex with PDB code 1j7u28 (see Figure 5a,b). For each test query, the “standard” setting was defined such that for each atom the molecule type and element type were annotated. All of the point−point constraints were set to the measured length between the corresponding atoms in 1j7u ± 0.5 Å. The topologies of all of the test queries are displayed in Figure 5c− F

DOI: 10.1021/acs.jcim.6b00561 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling length constant, e.g., 9−10 Å instead of 3−4 Å, more results and longer runtimes can be observed (see Figure S2b). Similar findings can be observed if frequently occurring database attributes are replaced by rare attributes, e.g., the element type “phosphorus” or the molecule type “crystallographic water”. In this scenario, the number of results is reduced and also the runtime of the query is shortened (see Figure 6a). However, this is not true for the tetrahedron. Here, the first replacement of a point constraint to a metal leads to fewer results but almost similar runtimes. The reason for this observation is the triangle descriptor, which is only applicable for frequently appearing elements. When the molecule type “metal” is used, the number of descriptor triangles is reduced from nine to four, making the descriptor much less effective. Several steps in the search mechanism select information from the database. Thus, in theory all of these processes should depend on the time needed for reading the database files. Figure 6b shows the runtimes needed for different standard queries on the two different hardware setups. Only the runtimes for the complete PDB are shown since no significant runtime differences were observed for the smaller databases. Overall, the highest achieved speed-up from HDD to SSD is 4.5 for the tetrahedron topology “metal”. Angle constraints have almost no influence on the runtime of a query since the accordance of a match with an angle constraint is checked in step 4 and the time-consuming step 3 is not influenced (see Figure S3). An important feature of PELIKAN is the handling of SMARTS patterns, which can be used to describe the chemical environment of a search point. The exact SMARTS patterns used in this test are displayed in Figure 7. All of the long SMARTS used in our test scenario describe more than four atoms. The short SMARTS describing point 3 contains elements other than oxygen, nitrogen, and carbon. Therefore, these SMARTS filters are also used as SMARTS filters for the reference ligand. Overall, the runtimes for queries with SMARTS patterns are higher than for those without SMARTS

patterns. The reason for this observation is that in all cases the most time-consuming step is the SMARTS matching on all hits entering step 5 of the algorithm (see Figure S4). Thus, the runtime of the query highly depends on the number of matches entering step 5. In line with this, if all possible descriptions are added to the search point or if longer SMARTS patterns are used, the runtime of the queries is strongly reduced. Noticeably, for two points, a shorter runtime is achieved when the short SMARTS are used compared with the long SMARTS, both without further point descriptors. This observation can be explained by the fact that the short SMARTS matches frequently and thus the maximum number of 1000 matches is reached early in step 5. Overall, we can conclude that the runtime of a query highly depends on the number of results reached after step 3 or step 4 of the search process. Thus, in order to achieve short retrieval times, the following rules should be obeyed when constructing a query: (1) Specify the points and add as many properties of the pocket, the protein, and the complete complex to the query as possible; prefer rarely occurring attributes over frequently occurring attributes, if possible. (2) Add as many distance constraints to the query as possible, and prefer small distance ranges in close proximities, if possible. Ideally, several triangles are created by this process that are part of the triangle descriptor. (3) Use a database that is as small as possible. Alternatively, perform the search on a fast SSD. (4) If the environment of a point is defined by a SMARTS pattern, use the largest possible pattern if the remaining query is not very discriminative; otherwise, prefer a smaller SMARTS pattern. In typical application scenarios, we think that a user would not like to examine too many results and thus should build a precise query that results in less than 100 results. When queries that do not contain SMARTS patterns are used, fewer than 100 results are retrieved in less than 50 s on a database containing 16 000 structures. The same holds true if SMARTS patterns in addition to point descriptors are included in the query. Triangle Descriptor. Triangle descriptors for accelerating 3D searches in protein−ligand complexes have already been applied in the pharmacophore search tools Pharmer29 and Recore.13 In both tools, each triangle is stored as a point in a KD tree using each of its side lengths as one coordinate in 3D space. This allows very efficient access to those points that exactly appear in a searched triangle and results in very short computing times. Because we do not use only pharmacophore points but include all atoms of a pocket, storing the triangles in a KD tree would have increased the database size by a factor of 10. With regard to the ever-growing number of protein−ligand complexes, we decided to store the detected triangles in a bit string, which requires less space. The purpose of the triangle descriptor here is the fast calculation of possible PRPs matching the search points. In order to test the effect of the descriptor on the query runtime, different queries with a triangle shape were constructed. One corner of the triangle can be any PRP, while the two other corners of the triangle are all possible combinations of point descriptors used for the triangle descriptor. The distance ranges of the point−point constraints in the queries are all 2.5−3.5 Å, all 5.5−6.5 Å, or all 7.5−8.5 Å. The setup of the queries is schematically displayed in Figure 8a. Thus, in this test every query hits exactly one bit of the triangle descriptor. Since very large numbers of results can be produced during this test, the memory was increased to 16 GB in both hardware settings (SSD and HDD). Figure 8b−d shows the speed-up factors

Figure 7. Influence of the point environment defined by SMARTS on the query runtime. Each bar displays the mean value and the standard deviation of three independent experiments. Above each bar, the number of resulting matches is shown. G

DOI: 10.1021/acs.jcim.6b00561 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

Figure 8. Speed-up factors of the triangle descriptor using triangle queries. (a) Query setup for triangle test queries. Search points of the query are displayed in green. Distance constraints are shown as black arrows. (b−d) Histograms showing resulting speed-up factors gained by the triangle descriptor on all possible test queries with side lengths 2.5−3.5 Å, 5.5−6.5 Å, and 7.5−8.5 Å, respectively. Mean speed-up factors are indicated by the dotted lines.

Figure 9. (a, b) Query construction for the search of an aromatic cage using PELIKAN. (c−f) Examples of PDB structures resulting from the search with the query defined in (b). In each resulting structure, the ring centers matching the search point of the query are highlighted.

gained with the triangle descriptor for all used queries on a database containing 16 000 protein−ligand complexes for the different side lengths, respectively. All of the tests shown in Figure 8b−d were performed using the SSD hardware setting. It can clearly be seen that the triangle descriptor always leads to a speed-up of the query. The mean speed-up for each triangle type is between 1.46 (see Figure 8b) and 1.64 (see Figure 8d). In about 85% of the queries, the speed-up factor is between 1 and 2. Only about 15% have larger speed-up factors over 2. The largest achieved speed-up factor is 3.5 (see Figure 8b). This

speed-up is in most cases independent of the database size and the hardware setting used (see Figure S10). Looking at the different queries in more detail, it seems that large speed-up factors are usually gained if the triangle descriptor is able to substantially reduce the number of interim results during the sequential processing of point−point constraints in step 3. If the number of interim results is already quite low when no triangle descriptor is used, not much speed can be gained by using the descriptor. On the other hand, if the number of results at the end of the query is very high, the H

DOI: 10.1021/acs.jcim.6b00561 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

some cases, the detected pockets had an aromatic cage constructed from even more than three aromatic rings. In the structure of an ABC-transporter choline binding protein complexed with acetylcholine, the aromatic cage contains four aromatic rings (PDB code 2rin; see Figure 9f). The aromatic cage in the structure of the chromodomain MRG2 bound to H3K36me3 even contains five rings (PDB code 4pli; see Figure 9e). Interestingly, four of the resulting PDB files were bound to glycine betaine. As an example, the structure of the binding protein OpuAC (PDB code 3l6h) is shown in Figure 9d. 2. Chemoisosterism. Chemoisosterism has been described as “the property by which different protein environments can tolerate the interaction with the same chemical fragment”.1 To demonstrate the ability of PELIKAN to search for these protein environments, we defined a query starting from the pocket around the ligand deoxythymidine (THM) in PDB structure 2z1a (see Figure 10a). The query contains four search points within an aromatic ring of the ligand (see search points 1−4 in Figure 10b). All of these search points have non-covalent interactions with protein atoms (see search points 5−8 and their point−point constraints to search points 1−4 in Figure 10b). The substructure of the uridine fragment is defined using a SMARTS pattern in search point 1. With the help of labels, the exact chemical relation among search points 1, 2, and 3 is defined. The angle constraint between search points 4 and 8, both aromatic rings, ensures a face-to-face configuration of the π−π interaction. The search of this query took about 90 s using the complete PDB on the SSD hardware. In total, 160 matches in 43 different PDB files were found. The exact runtimes for this query on different hardware settings with and without the triangle descriptor are shown in Table 1. Among them were 24 transferases, which were mainly structures of thymidylate kinases from different organisms. An example of these structures (PDB code 4hld) is shown in Figure 10c. In several structures we detected, the uracil fragment for which we searched was part of UDP, as for example in the structure of a UDP-galactopyranose mutase mutant (PDB code 4u8o; see Figure 10d). Moreover, six structures of the dUTPase were found (e.g., PDB code 4apz; see Figure 10e). Interestingly, the search also revealed the uridine fragment as part of a small nuclear RNA bound to Lsm protein (PDB code 4m7a; see Figure 10f). These results demonstrate the ability of PELIKAN to rapidly search for identical chemical fragments that are bound by specific interactions to different protein environments.

triangle descriptor is not able to largely reduce the number of interim results, and no speed-up can be observed. It is difficult to estimate the value of the triangle descriptor since it highly depends on the query type, the exact definition of the points, and the distance ranges. However, we were able to show that the computing time for a query is never prolonged when the triangle descriptor is used. Furthermore, for queries that do not contain triangles and thus for which the triangle descriptor cannot be used, the additional size of the database due to the triangle descriptor does not influence the runtime of the query process (see Figure S9). Application Examples. Because of the high adjustability of our query system, PELIKAN may contribute to different steps in the drug design process. Here we would like to show how PELIKAN can help solve two different tasks: (1) finding small molecules that bind to a special subpocket of a protein and (2) finding proteins that bind the same ligand substructure (chemoisosterism). 1. Aromatic Cage Formed by Three Aromatic Rings. Cages formed by aromatic rings of the protein are frequently considered as interesting regions in proteins and have been addressed by different small-molecule design projects.30,31 Thus, it might be of interest to find all protein structures featuring such a hydrophobic space in their binding pocket. Here we used the structure of putrescine receptor (PDB code 1a99; see Figure 9a) to create a query that can be used for such a search. The ring centers of residues Trp244, Trp37, and Phe276 were used here to define three search points. For each point−point constraint, the observed distance ±1 Å was used. In order to make sure that the rings face the enclosed volume with their π electrons, we introduced angle constraints between all of the ring normals and the point−point constraints, respectively. It should be noted that for a ring system, the direction of the ring normal is not defined per se. Thus, the angle constraints can be obeyed by both possible directions. However, for each resulting match, all of the angle constraints have to be met with the same direction of each ring normal. For query creation, we used here the ring normal pointing into the enclosed aromatic cage for each ring, respectively. Using a database containing the complete PDB, we could find 49 matches in 20 different PDB files. The search took about 110 s on the SSD hardware. The exact runtimes for this query on different hardware settings with and without the triangle descriptor are shown in Table 1. Among the results is a structure of the RNA binding protein YTH-YTHDF2 bound to a modified adenosine (PDB code 4rdn; see Figure 9c). Furthermore, a structure of the betaine transporter BetP bound to its substrate (PDB code 4llh) was found. Here, the aromatic cage is part of the S1 binding site. In



CONCLUSIONS AND OUTLOOK PELIKAN is a new approach for searching spatial arrangements of atoms in protein−ligand interfaces employing an SQLite database. Databases can be constructed out of any collection of protein−ligand complexes, and new protein−ligand complexes can easily be added to existing databases. The spatial search in our approach is not limited to predefined features or chemical groups but allows a highly flexible definition of search points without the need to recalculate the database. Point−point and angle constraints allow the definition of any topology within the query. In contrast to the frequently used sphere-matching approach in pharmacophore tools, the specific distance ranges for every point−point constraint used here make it possible to define geometrically precise and vague parts within the same query. The geometrical query can be expanded with textual and numerical properties for a large number of different ligand, pocket, protein, or complex properties. Because of the

Table 1. Runtimes for Application Examples on a Database Containing the Complete PDB on Different Hardware Settings Using PELIKAN runtime (s)a

aromatic cage chemoisoterism

HDD with triangle descriptor

HDD without triangle descriptor

SSD with triangle descriptor

SSD without triangle descriptor

471 ± 8.4 269 ± 16

1775 ± 22 322 ± 33

110 ± 8 93 ± 0.2

274 ± 8.6 92 ± 0.8

a

Mean values and standard deviations of three independent experiments are shown. I

DOI: 10.1021/acs.jcim.6b00561 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

Figure 10. (a, b) Query construction for the search for pockets that can bind a specific chemical fragment (chemoisosterism) using PELIKAN. (c−f) Examples of PDB structures resulting from the search with the query defined in (b). In each resulting structure, the atoms/ring centers matching the search point of the query are highlighted. The numbers indicate the specific search points the atoms or ring points are matching.

conceptual division of the database into separate parts, the list of properties could in principle easily be extended. The search algorithm is precise in the sense that exactly all structures matching the query are found. In our view, the typical workflow for PELIKAN would be the construction of a database out of protein−ligand complexes of interest that is afterward subjected to many different spatial queries. We see applications of PELIKAN when precise patterns in protein−ligand interfaces are searched, leading to a small number of results that can be inspected visually. Thus, regarding the computation time, we laid more emphasis on short times for retrieval of complicated geometrical queries than for database construction. PELIKAN runs on standard PC hardware and does not require any connection to a server infrastructure. This standalone tool has the advantages that databases can be constructed out of any collection of protein−ligand complexes. Additionally, fast data reading abilities of the used hardware may accelerate the retrieval time. A valuable addition to our tool in the future would be a web service that would allow PELIKAN to be easily accessed via a web browser. The large number of protein−ligand complexes available nowadays is a great opportunity to gather relevant structural data, but at the same time, it is a challenge to efficiently handle such amounts of data. The expression power of the queries, the easy deployment and use, the precision, and the speed of the search process are the strengths of our approach. Besides use cases, we have presented a series of experiments measuring the performance of the geometric search. We hope that these experiments will be beneficial for future developments of

similar tools. The resulting tool, PELIKAN, offers a new way of ad hoc analysis of protein−ligand complexes that is able to support various steps in structure-based molecular design projects.



ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.6b00561. List of all properties of search points and their value ranges, list of all textual and numerical filters that can be used in PELIKAN and their value ranges, exact definition of the test queries used in this article, additional results of test queries, runtimes of all test queries separated by the different steps of the algorithm, list of all PDB codes used in different databases, and short documentation for using PELIKAN (ZIP)



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. ORCID

Matthias Rarey: 0000-0002-9553-6531 Author Contributions

T.I. and M.R. developed the algorithmic approach behind PELIKAN. T.I. implemented the main PELIKAN method. S.B. implemented the calculation of non-covalent interactions in protein−ligand complexes. M.H. implemented parts of the database. R.S. improved the SMARTS matching algorithm. J

DOI: 10.1021/acs.jcim.6b00561 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

Crystal Structure Databases. J. Chem. Inf. Model. 2012, 52 (6), 1450− 1461. (17) Hendlich, M.; Bergner, A.; Günther, J.; Klebe, G. Relibase: Design and Development of a Database for Comprehensive Analysis of Protein-Ligand Interactions. J. Mol. Biol. 2003, 326 (2), 607−620. (18) Bergner, A.; Günther, J.; Hendlich, M.; Klebe, G.; Verdonk, M. Use of Relibase for Retrieving Complex Three-Dimensional Interaction Patterns Including Crystallographic Packing Effects. Biopolymers 2001, 61 (2), 99−110. (19) PSILO: A Protein Structure Database System; Chemical Computing Group: Montreal, QC; http://www.chemcomp.com/ PSILO-Protein_Structure_Database_System.htm. (20) PLDB; Schrödinger, LLC: New York; https://www.schrodinger. com/pldb. (21) Hipp, D. R.; Kennedy, D.; Mistachkin, J. SQLite. http://www. sqlite.org (accessed Sept 2, 2016). (22) Weininger, J. C. Daylight Theory Tutorial; Daylight Chemical Information Systems: Aliso Viejo, CA; http://www.daylight.com. (23) Schomburg, K. T.; Bietz, S.; Briem, H.; Henzler, A. M.; Urbaczek, S.; Rarey, M. Facing the Challenges of Structure-Based Target Prediction by Inverse Virtual Screening. J. Chem. Inf. Model. 2014, 54 (6), 1676−1686. (24) Urbaczek, S.; Kolodzik, A.; Groth, I.; Heuser, S.; Rarey, M. Reading PDB: Perception of Molecules from 3D Atomic Coordinates. J. Chem. Inf. Model. 2013, 53 (1), 76−87. (25) Bietz, S.; Urbaczek, S.; Schulz, B.; Rarey, M. Protoss: A Holistic Approach to Predict Tautomers and Protonation States in ProteinLigand Complexes. J. Cheminf. 2014, 6, 12. (26) Volkamer, A.; Kuhn, D.; Grombacher, T.; Rippmann, F.; Rarey, M. Combining Global and Local Measures for Structure-Based Druggability Predictions. J. Chem. Inf. Model. 2012, 52 (2), 360−372. (27) Raymond, J. W.; Gardiner, E. J.; Willett, P. RASCAL: Calculation of Graph Similarity Using Maximum Common Edge Subgraphs. Comput. J. 2002, 45 (6), 631−644. (28) Burk, D. L.; Hon, W. C.; Leung, A. K.; Berghuis, A. M. Structural Analyses of Nucleotide Binding to an Aminoglycoside Phosphotransferase. Biochemistry 2001, 40 (30), 8756−8764. (29) Koes, D. R.; Camacho, C. J. Pharmer: Efficient and Exact Pharmacophore Search. J. Chem. Inf. Model. 2011, 51 (6), 1307−1314. (30) Lupas, A. N.; Zhu, H.; Korycinski, M. The Thalidomide-Binding Domain of Cereblon Defines the CULT Domain Family and Is a New Member of the β-Tent Fold. PLoS Comput. Biol. 2015, 11 (1), e1004023. (31) Schärer, K.; Morgenthaler, M.; Paulini, R.; Obst-Sander, U.; Banner, D. W.; Schlatter, D.; Benz, J.; Stihle, M.; Diederich, F. Quantification of Cation−π Interactions in Protein−Ligand Complexes: Crystal-Structure Analysis of Factor Xa Bound to a Quaternary Ammonium Ion Ligand. Angew. Chem., Int. Ed. 2005, 44 (28), 4400− 4404.

M.R. supervised the project. T.I. and M.R. wrote the manuscript. Notes

The authors declare no competing financial interest. PELIKAN is available for Linux, OS X, and Windows from http://www.zbh.uni-hamburg.de/pelikan for free for academic use and evaluation purposes. Databases containing different sets of protein−ligand complexes are provided on this Web site for free. All feedback is greatly appreciated and supports the further development of PELIKAN.



ABBREVIATIONS vdW, van der Waals; SQL, structured query language; PDB, Protein Data Bank; PRP, potential result point



REFERENCES

(1) Jalencas, X.; Mestres, J. Chemoisosterism in the Proteome. J. Chem. Inf. Model. 2013, 53 (2), 279−292. (2) Chan, A. W. E.; Laskowski, R. A.; Selwood, D. L. Chemical Fragments That Hydrogen Bond to Asp, Glu, Arg, and His Side Chains in Protein Binding Sites. J. Med. Chem. 2010, 53 (8), 3086− 3094. (3) Kawai, K.; Nagata, N. Metal−ligand Interactions: An Analysis of Zinc Binding Groups Using the Protein Data Bank. Eur. J. Med. Chem. 2012, 51, 271−276. (4) Sirimulla, S.; Bailey, J. B.; Vegesna, R.; Narayan, M. Halogen Interactions in Protein−Ligand Complexes: Implications of Halogen Bonding for Rational Drug Design. J. Chem. Inf. Model. 2013, 53 (11), 2781−2791. (5) Seebeck, B.; Reulecke, I.; Kämper, A.; Rarey, M. Modeling of Metal Interaction Geometries for Protein−ligand Docking. Proteins: Struct., Funct., Genet. 2008, 71 (3), 1237−1254. (6) Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E. The Protein Data Bank. Nucleic Acids Res. 2000, 28 (1), 235−242. (7) Number of structures in the PDB with resolution < 1.5 Å released before 2010: 5059. Number of structures in the PDB with resolution < 1.5 Å released before 2016: 10780. Searches were performed using the advanced search tool on www.rcsb.org (accessed Dec 7, 2016). (8) Khafizov, K.; Madrid-Aliste, C.; Almo, S. C.; Fiser, A. Trends in Structural Coverage of the Protein Universe and the Impact of the Protein Structure Initiative. Proc. Natl. Acad. Sci. U. S. A. 2014, 111 (10), 3733−3738. (9) Deng, Z.; Chuaqui, C.; Singh, J. Structural Interaction Fingerprint (SIFt): A Novel Method for Analyzing Three-Dimensional ProteinLigand Binding Interactions. J. Med. Chem. 2004, 47 (2), 337−344. (10) Sheridan, R. P.; Nilakantan, R.; Rusinko, A.; Bauman, N.; Haraki, K. S.; Venkataraghavan, R. 3DSEARCH: A System for ThreeDimensional Substructure Searching. J. Chem. Inf. Model. 1989, 29 (4), 255−260. (11) Korb, O.; Kuhn, B.; Hert, J.; Taylor, N.; Cole, J.; Groom, C.; Stahl, M. Interactive and Versatile Navigation of Structural Databases. J. Med. Chem. 2016, 59 (9), 4257−4266. (12) Inhester, T.; Rarey, M. Protein−ligand Interaction Databases: Advanced Tools to Mine Activity Data and Interactions on a Structural Level. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2014, 4 (6), 562−575. (13) PubMed and NCBI. SPRITE and ASSAM: web servers for side chain 3D-motif searching in protein structures. http://www.ncbi.nlm. nih.gov/pubmed/22573174 (accessed Sept 2, 2016). (14) IMAAAGINE: Search a hypothetical 3D protein arrangement in the Protein Data Bank. http://mfrlab.org/grafss/imaaagine/ (accessed Sept 2, 2016). (15) Golovin, A.; Henrick, K. MSDmotif: Exploring Protein Sites and Motifs. BMC Bioinf. 2008, 9, 312. (16) Weisel, M.; Bitter, H.-M.; Diederich, F.; So, W. V.; Kondru, R. PROLIX: Rapid Mining of Protein−Ligand Interactions in Large K

DOI: 10.1021/acs.jcim.6b00561 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX