Article pubs.acs.org/jcim
Comprehensive Classification and Diversity Assessment of Atomic Contacts in Protein−Small Ligand Interactions
Kota Kasahara,† Matsuyuki Shirota,†,‡ and Kengo Kinoshita*,†,‡,§
†Department of Applied Information Sciences, Graduate School of Information Sciences, ‡Tohoku Medical Megabank Organization, and §Institute of Development, Aging, and Cancer, Tohoku University, Miyagi 980-8597, Japan
Supporting Information
ABSTRACT: Elucidating the molecular mechanisms of selective ligand recognition by proteins is a long-standing problem in drug discovery. The rapid increase in the availability of three-dimensional protein structural data makes a data-driven approach to finding the rules that govern protein−ligand interactions increasingly attractive. However, this approach is not straightforward because of the complexity of molecular interactions and our inadequate understanding of the diversity of molecular interactions that occur during ligand recognition. Thus, we aimed to provide a comprehensive classification of the spatial arrangements of ligand atoms based on the local coordinates of each interacting “protein fragment”, which consists of three covalently bonded atoms in an amino acid. We used a pattern recognition technique based on the Gaussian mixture model and found 13 519 patterns in the spatial arrangements of interacting ligand atoms, each of which was described as a Gaussian function of the local coordinates. Some typical well-known interaction patterns such as hydrogen bonds were ubiquitous in several hundred protein families, whereas others were only observed in a few specific protein families. After removing protein sequence redundancy from the data set, we found that 63.4% of ligand atoms interacted via one or more interaction patterns and that 25.7% of ligand atoms interacted without patterns, whereas the remainder had no direct interactions. The top 3115 major patterns included 90% of the interacting pairs of residues and ligand atoms with patterns, while the top 6229 included all of them.
■
INTRODUCTION
Drug discovery is an arduous process. A significant contributor to the problems encountered during drug discovery is the lack of high-efficacy compounds.1 Thus, there is a major need to design high-efficacy compounds during the early stages of drug discovery. In recent years, structure-based drug design, which explores chemical space for new compounds guided by three-dimensional structural data of target proteins, has been extensively studied.2−5 However, this approach has only been partially successful because of our inadequate understanding of the molecular mechanisms of ligand recognition by proteins. In particular, one of the most intricate questions relates to protein selectivity, i.e., how each protein can recognize only a few specific types of compounds from numerous candidate molecules with limited variations in the amino acid residues that form the binding pocket. Complex ligand recognition may be described based on combinations of the atomic contacts between amino acids and ligands. However, the interactions are highly complex at such an elemental level, and the diversity of such atomic interactions is not well understood. The Protein Data Bank (PDB)6 is the primary resource for elucidating the diversity of atomic contacts in protein−ligand interactions, and many statistical analyses of molecular interactions have been performed using this database. This type of approach is attractive because the PDB has been growing rapidly in recent years due to several major structural genomics projects.7−10 However, the vast wealth of structural data
also makes it difficult to extract information or knowledge, and hence, new methods are required to acquire insights into molecular interactions using the structural database.11

Previous analyses of protein−ligand interactions can be divided into a predefined classification-based approach and an unsupervised approach. In the first approach, molecular interactions are classified based on predefined geometrical and chemical patterns. For example, Panigrahi and Desiraju reported the propensities of strong and weak hydrogen bonds between proteins and ligands based on criteria related to distances between hydrogen and acceptor atoms and angles between donor, hydrogen, and acceptor atoms.12 This approach is attractive because it is easy to implement and the results can be interpreted directly, but it can only be applied when the interactions are limited and predefined. It is not practical to list all protein−ligand interactions before analysis because many different types of interactions are used for recognizing specific molecules. Therefore, an alternative approach that does not require a predefined classification of interactions is necessary.

The unsupervised approach can analyze a wide range of interactions without the need for predefined classifications. One of the main approaches in this category is based on the statistics of pairwise interatomic distances, known as “statistical potential” or “knowledge-based potential”. These potential functions have been used for scoring protein−ligand complexes in docking tasks,13−16 and they are useful for docking studies, although protein−ligand interactions do not always behave in an isotropic manner. Thus, a consideration of only the interatomic distances may be insufficient, and an explicit consideration of the spatial arrangements of interacting pairs is necessary. To overcome this difficulty, several studies have used the spatial distributions of protein/ligand atoms around the interacting partner atoms in complexes.17−23 These studies used statistical information to predict interacting atoms, although the knowledge extracted from the database was not summarized sufficiently well to uncover the diversity of protein−ligand interactions because the preferences of proteins and their ligand atoms were summed to generate the final prediction score.

In contrast, there have been some early attempts to obtain knowledge from the PDB directly. Imai et al. classified the interactions of some amino acid residues by using a binary vector to encode interacting ligand atoms.24 Wang et al. reported the types of amino acid residues that preferred to interact with each ligand fragment.25 These analytical studies have provided some useful insights into the propensities of protein−ligand interactions, but they have ignored the majority of the information in the structural data: they have not addressed the spatial arrangements of interacting pairs explicitly, and instead the interactions were discretized coarsely to make them interpretable.

In this study, we obtained a comprehensive classification of the spatial arrangements of interacting pairs of amino acid residues and ligand atoms using an unsupervised parametric pattern recognition technique based on the Gaussian mixture model (GMM). Using this classification, we found diverse interaction patterns; however, some protein families had “unique” interaction patterns that were not observed in other families. Moreover, we performed docking calculations and

Received: August 13, 2012
dx.doi.org/10.1021/ci300377f | J. Chem. Inf. Model. XXXX, XXX, XXX−XXX
Journal of Chemical Information and Modeling

Figure 1. Method overview. The methods consist of the following steps. First, (i) two data sets were constructed. The primary data set consists of protein−small ligand complexes extracted from the PDB. From this data set, the nonredundant data set was constructed based on clustering by protein sequence identity (in addition, complexes with heme ligands were removed). Second, (ii) information on interactions was extracted from the primary data set. In this step, amino acid residues were decomposed into fragments consisting of three covalently linked atoms. A unit of interaction was defined as a contacting pair of a protein fragment and a ligand atom. These interaction units were extracted from the data set and superimposed onto the local coordinate system for each type of interaction unit. Next, (iii) the interaction units were classified into patterns. Statistically overrepresented spatial arrangements of interaction units were found from the superimposed distribution of interaction units by using a pattern recognition technique based on the GMM. Each of the overrepresented interactions was defined as a Gaussian component. Then, all interaction units were assigned to the Gaussian components. If ≥100 interaction units were assigned to a Gaussian component, that component was referred to as an interaction pattern. In this way, we obtained a catalogue of interaction patterns. Several statistical analyses based on this catalogue were performed to address how diverse the interactions are, how diverse the proteins are that use common interactions for recognition, and what kinds of interactions are preferred. In addition, the applicability of the catalogue was evaluated with docking simulations.
types). The types of ligand atoms were defined based on the Tripos force field27 (Supporting Information Table S1). Combinations of fragment and ligand atom types were used to identify the units of interaction. These definitions of fragment and atom types are the key features that determine what information is extracted from the database. In this study, we focused on differences in interactions among amino acid types and encoded them in the type definitions of the protein fragments. For ligand atoms, the physicochemical properties were encoded as Tripos force field types. The interatomic contacts between proteins and ligands were detected in all the protein−ligand complexes in the primary data set using the criterion that the interatomic distance was less than the sum of the van der Waals radii plus an offset value (1.0 Å). The relative geometry of an interacting pair was transformed into the local coordinate system defined by the three atoms in the protein fragment. The spatial distributions of the interactions were determined by gathering the positions of interacting atoms in the local coordinate system for each type of interaction unit. We defined the distributions in orthogonal coordinate systems, in contrast to Rantanen’s work,22 which was based on polar coordinate systems.

Interaction units containing 5 heavy atoms and a molecular weight of ≥80 Da and 2.5 Å in RMSD. The dashed lines show the average values of the ratios, which were 0.84 and 0.49 for the native structures and decoys, respectively.
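As an illustration of the contact criterion and the local coordinate transformation described in the methods text above, the following Python sketch tests whether two atoms are in contact and expresses a ligand atom position in the frame of a three-atom fragment. The van der Waals radii (Bondi values), the axis convention (origin at the central atom, x axis toward the first atom, third atom in the xy plane), and all function names are assumptions made for illustration and are not taken from this work.

```python
import math

# Illustrative Bondi van der Waals radii in Å (assumed; the radii set used in the study is not given here).
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}
OFFSET = 1.0  # Å, offset added to the sum of the radii (value stated in the text)


def _sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def _unit(u):
    norm = math.sqrt(_dot(u, u))
    return tuple(a / norm for a in u)


def in_contact(pos_a, elem_a, pos_b, elem_b, offset=OFFSET):
    """True if the distance is below the sum of the vdW radii plus the offset."""
    cutoff = VDW_RADII[elem_a] + VDW_RADII[elem_b] + offset
    return math.dist(pos_a, pos_b) < cutoff


def local_frame(a, b, c):
    """Orthonormal right-handed frame of a fragment a-b-c (assumed convention).

    The three atoms must not be collinear, otherwise the y axis is undefined.
    """
    x = _unit(_sub(a, b))
    v = _sub(c, b)
    proj = _dot(v, x)
    y = _unit(tuple(vi - proj * xi for vi, xi in zip(v, x)))  # Gram-Schmidt step
    z = _cross(x, y)
    return b, (x, y, z)


def to_local(point, origin, axes):
    """Coordinates of `point` in the fragment frame."""
    d = _sub(point, origin)
    return tuple(_dot(axis, d) for axis in axes)
```

Gathering the output of `to_local` for every contacting ligand atom of a given interaction unit type would give the superimposed Cartesian distribution that the classification operates on.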
were clearly different between the native and decoy structures (the average ratios were 0.84 and 0.49, respectively). This result indicates that interactions belonging to patterns were enriched in the native protein−ligand binding modes, which confirms the validity of our assumption. The interaction patterns defined in this work have potential for application in a new knowledge-based scoring method to predict binding sites and binding modes.
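The ratio compared above can be illustrated with a small Python sketch that computes the fraction of interaction units in a binding mode that fall inside at least one pattern. The membership rule used here (squared Mahalanobis distance to a diagonal-covariance Gaussian of at most 7.81, roughly the 95% chi-square quantile for three dimensions), the interaction unit type label, and the function names are assumptions for illustration; the criterion actually used in this work may differ.

```python
def mahalanobis_sq(point, mean, var):
    """Squared Mahalanobis distance for a Gaussian with diagonal covariance."""
    return sum((p - m) ** 2 / v for p, m, v in zip(point, mean, var))


def in_any_pattern(unit_type, position, patterns, max_d2=7.81):
    """True if the position lies within max_d2 of any pattern of this unit type."""
    return any(mahalanobis_sq(position, mean, var) <= max_d2
               for mean, var in patterns.get(unit_type, []))


def pattern_ratio(units, patterns, max_d2=7.81):
    """Fraction of interaction units that match at least one pattern."""
    if not units:
        return 0.0
    hits = sum(in_any_pattern(t, pos, patterns, max_d2) for t, pos in units)
    return hits / len(units)


# Hypothetical catalogue: one pattern (mean, per-axis variance) for one unit type.
PATTERNS = {"ASP_fragment:O.3": [((3.0, 0.0, 0.0), (0.25, 0.25, 0.25))]}

# Hypothetical binding mode: local coordinates of three contacting ligand atoms.
POSE = [
    ("ASP_fragment:O.3", (3.1, 0.1, 0.0)),
    ("ASP_fragment:O.3", (2.9, 0.0, 0.2)),
    ("ASP_fragment:O.3", (6.0, 0.0, 0.0)),
]

ratio = pattern_ratio(POSE, PATTERNS)
```

Under these assumptions, a higher ratio for a docked pose would suggest that more of its contacts resemble arrangements that are overrepresented in known complexes.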
■
CONCLUSION
This study used an unsupervised pattern recognition technique based on the GMM to determine the spatial distributions of contacting ligand atoms around a three-atom protein fragment, and the atomic contacts were classified into 13 519 patterns. After assessing the uniqueness of each interaction pattern by counting the number of protein families using each pattern, we found that the most common interaction patterns were shared among over 700 protein families, while many patterns were unique to only one family. The ligands were recognized by both common and family-specific interactions. We found that 63.9% of the ligand atoms in the nonredundant data set had at least one of the interaction patterns, 25.5% of the atoms interacted without patterns, and the remaining 10.6% of ligand atoms did not interact with proteins. In most of the complexes, over half of the ligand atoms were recognized by one or more patterns. The classification of interactions was highly redundant, and interacting pairs could be assigned to more than one pattern. The top 3115 main interaction patterns included 90% of the interacting residue−atom pairs with at least one pattern, while the remaining 10% were included in an additional 3114 interaction patterns. Thus, the remaining 6390 patterns were considered to be redundant. After rejecting interactions that did not follow any pattern, most (90%) of the binding modes for the ligand recognition of proteins could be described using a combination of 3115 interaction patterns, while a maximum of 6229 patterns were required for all of the binding modes. The diversity of residue−atom interactions was limited.
■
ASSOCIATED CONTENT
Supporting Information
Figure S1: Definitions of atom types. Table S1 and Figure S2: Additional statistics on the primary and nonredundant data sets. Text and Figures S3−7: More detailed analysis of recognition by each element. This material is available free of charge via the Internet at http://pubs.acs.org.
■
AUTHOR INFORMATION
Corresponding Author
*Tel.: +81-22-795-7179. Fax: +81-22-795-7179. E-mail: [email protected].
Notes
The authors declare no competing financial interest.
■
ACKNOWLEDGMENTS
This work was funded by the “HD-Physiology” Grant-in-Aid for Scientific Research in Innovative Areas (22136005). The supercomputing resource was provided by the Human Genome Center (University of Tokyo). A tool for the visualization of the contours of the Gaussian functions was provided by Dr. Takeshi Kawabata.
■
REFERENCES
(1) Kola, I.; Landis, J. Nat. Rev. Drug Discov. 2004, 3, 711−716.
(2) Dailey, M. M.; Hait, C.; Holt, P. A.; Maguire, J. M.; Meier, J. B.; Miller, M. C.; Petraccone, L.; Trent, J. O. Exp. Molec. Pathol. 2009, 86, 141−150.
(3) Kalyaanamoorthy, S.; Chen, Y.-P. P. Drug Discovery Today 2011, 16, 831−839.
(4) Taboureau, O.; Baell, J. B.; Fernández-Recio, J.; Villoutreix, B. O. Chem. Biol. 2012, 19, 29−41.
(5) Schaffhausen, J. Trends Pharmacol. Sci. 2012, 33, 223.
(6) Berman, H.; Henrick, K.; Nakamura, H.; Markley, J. L. Nucleic Acids Res. 2007, 35, D301−3.
(7) Thomas, C.; Terwilliger, D. S. S. Y. Annu. Rev. Biophys. 2009, 38, 371.
(8) Dessailly, B. H.; Nair, R.; Jaroszewski, L.; Fajardo, J. E.; Kouranov, A.; Lee, D.; Fiser, A.; Godzik, A.; Rost, B.; Orengo, C. Structure 2009, 17, 869−881.
(9) Weigelt, J. Exp. Cell Res. 2010, 316, 1332−1338.
(10) Terwilliger, T. C. J. Struct. Funct. Genomics 2011, 12, 43−44.
(11) Bissantz, C.; Kuhn, B.; Stahl, M. J. Med. Chem. 2010, 53, 5061−5084.
(12) Panigrahi, S. K.; Desiraju, G. R. Proteins 2007, 67, 128−141.
(13) Zhang, C.; Liu, S.; Zhu, Q.; Zhou, Y. J. Med. Chem. 2005, 48, 2325−2335.
(14) Muegge, I. J. Med. Chem. 2006, 49, 5895−5902.
(15) Ozrin, V. D.; Subbotin, M. V.; Nikitin, S. M. J. Comput.-Aided Mol. Des. 2004, 18, 261−270.
(16) Yang, C.-Y.; Wang, R.; Wang, S. J. Med. Chem. 2006, 49, 5903−5911.
(17) Laskowski, R. A.; Thornton, J. M.; Humblet, C.; Singh, J. J. Mol. Biol. 1996, 259, 175−201.
(18) Bruno, I. J.; Cole, J. C.; Lommerse, J. P.; Rowland, R. S.; Taylor, R.; Verdonk, M. L. J. Comput.-Aided Mol. Des. 1997, 11, 525−537.
(19) Verdonk, M. L.; Cole, J. C.; Taylor, R. J. Mol. Biol. 1999, 289, 1093−1108.
(20) Verdonk, M. L.; Cole, J. C.; Watson, P.; Gillet, V.; Willett, P. J. Mol. Biol. 2001, 307, 841−859.
(21) Boer, D. R.; Kroon, J.; Cole, J. C.; Smith, B.; Verdonk, M. L. J. Mol. Biol. 2001, 312, 275−287.
(22) Rantanen, V. V.; Denessiouk, K. A.; Gyllenberg, M.; Koski, T.; Johnson, M. S. J. Mol. Biol. 2001, 313, 197−214.
(23) Rantanen, V.-V.; Gyllenberg, M.; Koski, T.; Johnson, M. S. J. Comput.-Aided Mol. Des. 2003, 17, 435−461.
(24) Imai, Y. N.; Inoue, Y.; Yamamoto, Y. J. Med. Chem. 2007, 50, 1189−1196.
(25) Wang, L.; Xie, Z.; Wipf, P.; Xie, X.-Q. J. Chem. Inf. Model. 2011, 51, 807−815.
(26) O’Boyle, N. M.; Banck, M.; James, C. A.; Morley, C.; Vandermeersch, T.; Hutchison, G. R. J. Cheminform. 2011, 3, 33.
(27) Clark, M.; Cramer, R. D.; Van Opdenbosch, N. J. Comput. Chem. 1989, 10, 982−1012.
(28) Attias, H. Inferring parameters and structure of latent variable models by variational bayes. UAI’99 Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, Stockholm, Sweden, July 30−Aug 1, 1999; pp 21−30.
(29) Morris, G. M.; Huey, R.; Lindstrom, W.; Sanner, M. F.; Belew, R. K.; Goodsell, D. S.; Olson, A. J. J. Comput. Chem. 2009, 30, 2785−2791.
(30) Kinoshita, K.; Sadanami, K.; Kidera, A.; Go, N. Protein Eng. 1999, 12, 11−14.