Subscriber access provided by EAST TENNESSEE STATE UNIV
Article
Statistical Analysis, Investigation and Prediction of the Water Positions in the Binding Sites of Proteins wei xiao, Zenghui He, Meijian Sun, Shiliang Li, and Honglin Li J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.6b00620 • Publication Date (Web): 12 Jun 2017 Downloaded from http://pubs.acs.org on June 23, 2017
Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.
Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
Page 1 of 45
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Statistical Analysis, Investigation and Prediction of the Water Positions in the Binding Sites of Proteins Wei Xiao1, 2, Zenghui He2, Meijian Sun2, Shiliang Li2, Honglin Li1, 2, * 1
School of Information Science and Engineering, East China University of Science and
Technology, Shanghai 200237, China 2
Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China
University of Science and Technology, Shanghai 200237, China
* To whom correspondence should be addressed.
[email protected] Please address correspondence and requests for reprints to: Prof. Honglin Li Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology 130 Meilong Road, Shanghai 200237, P. R. China Phone/Fax: +86-21-64250213 E-mail:
[email protected] ACS Paragon Plus Environment
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
ABSTRACT Water molecules play a crucial role in biomolecular associations by mediating a hydrogen bond network or filling spaces with van der Waals interactions. Although current drug design technologies have taken water molecule interactions into account, their applications are still limited to their reliance on either excessive computer resources or a particular potential energy model. Here, we introduce a statistical method that is based on experimentally-determined water molecules in the binding sites of high-resolution X-ray crystal structures to predict the potential hydration sites in the binding sites of a crystal structures of interest. By clustering and analyzing the various interaction patterns of water molecules in the training dataset, we derived a tetrahedron-water-cluster model based on a series of residue group triplets that form feature triangles of different shapes. In the tetrahedral-water-cluster model, a triplet of three polar atoms in the residue group triplet acts as the vertices of the bottom triangle of the tetrahedron, and the water molecule that interacts with these three polar atoms is set as the top vertex of the tetrahedron. By comparing the shapes of the bottom triangles in the training dataset with the shape of the triangle in the residue group triplets in the crystal structure of interest, we can identify the bottom triangle that is most similar to the one in the residue group triplet of the crystal structure of interest. According to the tetrahedron-water-cluster model, the hydration site for the residue group triplet in the crystal structure of interest can be predicted based on the height of the tetrahedron that has the most similar bottom triangle in the training dataset. A test set containing 193 crystal structures was used to evaluate model performance, and
ACS Paragon Plus Environment
Page 2 of 45
Page 3 of 45
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
extensive comparison with the recently published program Dowser++ revealed that our model is at least as good at providing an accurate set of the potential hydration sites in crystal structures of interest.
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
INTRODUCTION Biomolecules typically exist in an aqueous environment and interactions with water molecules are of key importance in their structures and functions.1 Aside from mediating the formation of hydrogen bonds between proteins and their ligands,2 water molecules can also modify the shape and flexibility of the ligand binding site in crystal structures.3-6 A study of 392 high-resolution complexes retrieved from the Protein Data Bank7 (PDB) found that over 85% of the protein-ligand complexes included at least one water molecule bridging the protein and the ligand. Specifically, the average number of ligand-bound water molecules was found to be 4.6, and 76% of these water molecules were identified as participating in polar interactions with both the protein and the ligand.4 In addition, many studies have shown that considering water molecules may result in improving docking performance,8-13 or facilitating receptor-ligand recognition.4 In general, valuable information about the location of water molecules can be obtained not only from primary X-ray crystallography, but also from neutron diffraction and nuclear magnetic resonance. In many cases, however, the location of water molecules is uncertain14 or the information is missing.15 Hence, computational approaches allowing efficient prediction of potential hydration sites are of great value.16 Generally speaking, three categories of applications have been developed to date to deal with the problem of hydration sites in the crystal structures of interest, particularly in protein-ligand docking procedures. Firstly, empirically-based methods, such as Consolv,17 WaterScore,18 and WATGEN,19 focus on crystallographically
ACS Paragon Plus Environment
Page 4 of 45
Page 5 of 45
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
resolving hydration sites and evaluating them with a set of descriptors. These methods can provide quick solutions to distinguish important and potentially conserved water molecules from those that are easily displaced. Of note, these empirically-based methods rely heavily on the training dataset used. Methods in the second category, such as GRID,20, 21 AcquaAlta,22 AQUARIUS,23 SuperStar,24 the force field Fold-X25 and wPMF,26 are knowledge-based methods that aim to predict hydration sites ab initio, and rely on the training dataset to determine favorable hydrated sites. The drawback of knowledge-based applications is that they do not directly indicate which predicted water molecules are likely to be displaced by the ligand or remain bound to the protein. Different from above, the third category of physics-based methods, including Dowser++,27 JAWS,28 double decoupling,29 WaterMap,30 WATsite,31 WaterDock32 and 3D-RISM,33 model explicit water molecules and extract hydration sites from peaks in water density constructed in the protein, or averaged over the location of the water molecule,34 by running molecular dynamics or Monte Carlo simulations. The advantage of this category of methods is that they include entropic effects in the prediction; however, they are time-consuming to run, especially in scenarios with buried cavities. In addition to the diverse methods described above, many other well-known approaches, such as AutoDock35 and 3D-QSAR,36 full atom models GOLD,12 GLIDE37, 38 and SLIDE,39 and spherical particle models FlexX13, 40, 41
and FITTED,42 attempt to explore different aspects of hydration properties. While
the above three categories of applications confirm the importance of considering water molecules, capturing water molecules in the crystal structure is still
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
challenging.16 In this study, we propose a statistical predictive method that relies on experimentally-identified water molecules in the binding sites of high-resolution crystal structures. The interaction patterns of water molecules in the binding sites of crystal structures in the training dataset were analyzed in order to develop a more effective molecular docking method that considers water molecules. A modified four-body statistical pseudo-potential was also included in the method to analyze the hydrophilic characteristics of residue groups. We propose a tetrahedron-water-cluster model based on a series of residue group triplets that form feature triangles of different shapes. In the tetrahedral-water-cluster model, a triplet of three polar atoms acts as the vertices of the bottom triangle of the tetrahedron, and the water molecule that interacts with these three polar atoms is set as the top vertex of the tetrahedron. By classifying the bottom triangles and the tetrahedrons based on similarity of shape, and establishing the relationship between them, the experimentally-determined water sites in each type of tetrahedron can be superposed. Furthermore, by comparing the shapes of the bottom triangles in the training dataset with the shapes of the triangles in the residue group triplets in the crystal structure of interest, we can identify the bottom triangle that is most similar to the one in the residue group triplet of crystal structure of interest. According to our tetrahedron-water-cluster model, the hydration sites for the residue group triplet in the crystal structure of interest can be predicted based on the height of the tetrahedron that has the most similar bottom triangle in the training dataset. The performance of the model in predicting potential hydration sites
ACS Paragon Plus Environment
Page 6 of 45
Page 7 of 45
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
was evaluated using a test set and compared directly with the program Dowser++. Results showed that the tetrahedron-water-cluster model is at least as good as Dowser++ at providing an accurate set of the potential hydration sites in crystal structures of interest.
EXPERIMENTAL METHODS 1. Training Dataset All available protein-ligand complexes in PDB (Accessed April 10, 2013) were collected. Selection criteria were as follows: neutron diffraction or nuclear magnetic resonance structures, and X-ray crystal structures with a resolution larger than 2.0 Å 4, 12, 17, 19, 26, 43
were removed. Any complexes lacking water molecules in their binding
sites were also not retained. The resulting training dataset consisted of 2003 complexes (see Table S1 for a comprehensive overview of the complexes included).
2. Training Process Key features of our training process are described below: (I) For each complex, the size of the binding site was defined by the protein atoms and water molecules within 7 Å of any ligand atoms.12, 18 (II) On average, each interfacial water molecule had three polar interactions with the surrounding atoms, and was less mobile than protein atoms when they had more than three polar interactions.4 Hence, for each water molecule in a binding site of the crystal structure of interest, we only considered at most three surrounding polar atoms
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
within a hydrogen bonding range of 2.5 Å ~ 3.5 Å. Then, according to their sources, we divided the selected polar atoms into three types, that is, ligands (including halogen atoms and metal ions), water molecules, and amino acid residues. (III) To determine if the selected polar atoms from the amino acid residues are exposed to the surfaces of the binding sites, we introduced a program named NACCESS44 to calculate the atomic accessible surface by rolling a probe of a default size around the van der Waals surface and following the coordinates of the center of the probe.45 Only atoms with an atomic accessible surface greater than zero Å2 were retained as interface atoms. The selected polar atoms from the amino acid residues were then combined, forming sets of different residue groups: triplets, two-tuples and unaries, for further analysis.
3. Design of the Prediction Model As demonstrated in many biophysical investigations, water molecules can make hydrogen bonds with the surrounding atoms and form tetrahedron water clusters.46, 47 In order to detect potential hydration sites in the binding sites of the crystal structures of interest, a tetrahedron-water-cluster model (see Figure 1) was proposed based on a series of residue group triplets. In Figure 1, a triplet of three polar atoms serves as the vertices of a bottom triangle of a tetrahedron, and the water molecule that interacts with these atoms is set as the top vertex of a tetrahedron.
ACS Paragon Plus Environment
Page 8 of 45
Page 9 of 45
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Figure 1. The architecture of the tetrahedron-water-cluster model, where a, b, and c stand for the length of the longest, middle and shortest edges of the bottom triangle, respectively, O represents the center of the bottom triangle, and h represents the height of the tetrahedron.
The shapes of tetrahedrons formed can be classified according to the shape of the bottom triangle and the height of the tetrahedron. First, two ratios are adopted to characterize the shape of the bottom triangle and are used in the similarity calculation. The first ratio is calculated by dividing the longest edge of the bottom triangle a by the shortest edge c in each triplet. The first ratio is then classified into six sectors, i.e., 1.00-1.20, 1.20-1.40, 1.40-1.60, 1.60-1.80, 1.80-2.00 and 2.00-3.00 (Figure 2A), referred to here as fractions. The bottom triangles in each sector are then further divided based on the second ratio, calculated by dividing the middle edge of the same
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
triangle b by the shortest edge c. Subsequently, the bottom triangles are subdivided into more specific sub-fractions (Figure 2B-2G).
Figure 2. Distribution of the ratios used to characterize the shape of the bottom
ACS Paragon Plus Environment
Page 10 of 45
Page 11 of 45
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
triangle: (A) The bottom triangles are divided into six sectors according to the first ratio. (B-G) Each of the six sectors is then subdivided by the second ratio. The two ratios are then aligned to match the shape of the bottom triangle.
Having categorized the shapes of the bottom triangles, the shapes of the tetrahedrons are then determined according to the height, h, from the water molecule to the bottom triangle. By observing the shapes of the tetrahedrons in each sector, the sum of the lengths of the three edges (a + b + c) of the bottom triangle in each case is used to establish the relationship between the bottom triangle and the height h of the tetrahedron. The first two columns of Table 1 give the two ratios of the bottom triangles; the sums of the lengths of the three edges of the bottom triangle are divided into different ranges according to the average height of the tetrahedrons.
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
Page 12 of 45
Table 1. Relationship between the sums of the lengths of the three edges of the bottom triangle and the average heights of the tetrahedrons. Ratio1
1.00-1.20
1.20-1.40
1.40-1.60
1.60-1.80
1.80-2.00
Ratio2
Sums of the lengths of the three edges of the bottom triangle
1.00-1.05
>=13.90
1.05-1.10
>=14.00
1.10-1.15
>=14.00
14.00-13.10
=13.70
13.70-13.17
=13.54
13.54-11.66
=13.87
13.87-12.40
=13.90
13.90-12.67
=14.00
14.00-13.17