J. Chem. Theory Comput. 2008, 4, 1959–1973
1959
Geometrical Preferences of the Hydrogen Bonds on Protein-Ligand Binding Interface Derived from Statistical Surveys and Quantum Mechanics Calculations Zhiguo Liu, Guitao Wang, Zhanting Li, and Renxiao Wang* State Key Laboratory of Bioorganic Chemistry, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, 354 Fenglin Road, Shanghai 200032, P. R. China Received July 8, 2008
Abstract: We have conducted potential of mean force (PMF) analyses to derive the geometrical parameters of various types of hydrogen bonds on protein-ligand binding interface. Our PMF analyses are based on a set of 4535 high-quality protein-ligand complex structures, which are compiled through a systematic mining of the entire Protein Data Bank. Hydrogen bond donor and acceptor atoms are classified into several basic types. Both distance- and angle-dependent statistical potentials are derived for each donor-acceptor pair, from which distance and angle cutoffs are obtained in an objective, unambiguous manner. These donor-acceptor pairs are also studied by quantum mechanics (QM) calculations at the MP2/6-311++G** level on model molecules. Comparison of the outcomes of PMF analyses and QM calculations suggests that QM calculation may serve as an alternative approach for characterizing hydrogen bond geometry. Both of our PMF analyses and QM calculations indicate that C-H · · · O hydrogen bonds are relatively weak as compared to common hydrogen bonds formed between nitrogen and oxygen atoms. A survey on the protein-ligand complex structures in our data set has revealed that CR-H · · · O hydrogen bonds observed in protein-ligand binding are frequently accompanied by bifurcate N-H · · · O hydrogen bonds. Thus, the CR-H · · · O hydrogen bonds in such cases would better be interpreted as secondary interactions.
1. Introduction Hydrogen bonding is probably the most important factor for maintaining the molecular structures and functions of various biological as well as chemical systems.1,2 The very basic characteristics of a hydrogen bond is the D-H · · · A alignment, in which the hydrogen donor (D) is normally a strong electronegative atom such as nitrogen or oxygen, while the hydrogen acceptor (A) is another electronegative atom with at least one electron lone pair. Dissociation energy of a hydrogen bond may vary from 1 kcal/mol for a weak hydrogen bond such as C-H · · · O to 40 kcal/mol for a strong ionic hydrogen bond such as FH · · · F-.3 This feature endows hydrogen bonding an essential dual role: on one hand, hydrogen bonds are relatively weak compared to covalent bonds, thus they may form and break rapidly during the * Corresponding author phone: 86-21-54925128; e-mail: wangrx@ mail.sioc.ac.cn.
process of a conformational change or molecular recognition; on the other hand, due to the considerable strength and directional nature of hydrogen bonds, a desired specificity in structure can be eventually achieved. An in-depth understanding of protein-ligand interactions has laid the foundation of structure-based drug design techniques, such as virtual screening,4-6 de noVo design,7,8 and fragment-based design.9-11 Hydrogen bonding is an essential factor in the binding process of a ligand molecule to its target protein. Many computational studies on this subject need to detect hydrogen bonds with rule-based algorithms, which rely on interpreting the relative positions and orientations, such as the D-A distance and the D-H-A angle, of two interacting chemical groups. Such algorithms are also implemented in some empirical scoring functions, such as the ones in LUDI,12,13 FlexX,14 ChemScore,15,16 GlideScore,17 SCORE,18 and X-Score,19 to estimate the contribution of hydrogen bonds to protein-ligand binding affinities. Char-
10.1021/ct800267x CCC: $40.75 2008 American Chemical Society Published on Web 10/15/2008
1960 J. Chem. Theory Comput., Vol. 4, No. 11, 2008
acterization of hydrogen bonds is also an essential factor in some other theoretical studies, such as protein folding. Therefore, deduction of the preferred geometrical parameters of various types of hydrogen bonds is a meaningful goal for all these studies. Geometrical parameters of hydrogen bonds can be derived from a statistical survey on a large number of crystal structures. Some studies of this kind have been reported before,20-22 which were based on either the Cambridge Structure Database23 (CSD) or the Protein Data Bank (PDB).24 For the purpose of characterizing the hydrogen bonds on protein-ligand binding interface, apparently the latter approach is more straightforward. Due to the rapid progress in structural biology, the total number of available three-dimensional structures of biological macromolecules is growing constantly. While this manuscript is in preparation, over 50,000 structures have already been deposited in PDB. According to our previous analyses,25,26 up to 40% of them can be classified as valid protein-ligand complexes. High-resolution structures of these protein-ligand complexes can serve as a solid basis for conducting statistical surveys regarding the hydrogen bonds on protein-ligand binding interface. In this study, we have applied the potential of mean force (PMF) analysis on a large number of high-quality crystal structures of protein-ligand complexes to derive the geometrical preferences of various types of hydrogen bonds. It must be mentioned that the term “potential of mean force” could be confusing since it is actually more frequently used in other areas of molecular modeling, such as the molecular dynamics simulation of liquid phases. The PMF analysis applied in our study refers to the approach proposed by Sippl et al., which was originally applied to protein folding studies.27,28 In recent years, this approach has been extended to the evaluation of protein-ligand binding by a number of scoring functions, such as PMF-Score,29-31 DrugScore,32,33 BLEEP,34,35 SMoG,36,37 DFIRE,38,39 and M-Score.40 The primary aim of our study is not to develop another PMFbased scoring function. Instead, we apply this approach to the characterization of the hydrogen bonds in protein-ligand binding, which is the first of this kind to the best of our knowledge. Our study covers common hydrogen bonds formed between oxygen and nitrogen atoms as well as C-H · · · O hydrogen bonds. To make comparison with the outcomes of our PMF analyses, we have also employed quantum mechanics (QM) calculations on some model molecules to characterize these hydrogen bonds. The geometrical parameters of various types of hydrogen bonds deduced in our study can be readily utilized by the empirical algorithms for perceiving hydrogen bonds in protein-ligand binding or protein folding studies.
2. Computational Details 2.1. Preparation of Protein-Ligand Complex Structures. Our statistical survey is conducted on a large set of high-quality structures of protein-ligand complexes. These complexes are selected throughout the entire Protein Data Bank (PDB) through a procedure similar to the one devel-
Liu et al.
oped by us in the compilation of the PDBbind database.25,26 This procedure can be described briefly as following. First, the composition of protein-ligand complex is considered. PDB entries which do not contain at least one protein molecule and one valid small-molecule ligand are filtered out. Here, a valid ligand must not be a cofactor/coenzyme (such as Heme, CoA, NAD, FAD, and their derivatives) or any component of an organic solvent and buffer. It also must not contain any uncommon elements, such as Be, B, Si, and metal atoms, and its molecular weight shall not exceed 1000. Note that oligopeptides (up to 9 residues) and oligonucleotides (up to 3 residues) are considered as valid smallmolecule ligands in our study. Second, the quality of protein-ligand complex structure is considered. Only the protein-ligand complex structures which are determined through crystal diffraction with an overall resolution better than or equal to 2.5 Å are accepted. Finally, each qualified complex should be formed by one protein molecule with one ligand molecule in a binary manner, i.e. there should not be multiple ligands residing in close vicinity at the same binding site. In addition, covalently bound complexes are filtered out. All of the above examinations are conducted by a set of computer programs, which make judgments based on the contents of the original structural files downloaded from PDB. Manual inspections and adjustments are also employed whenever necessary. We have screened the entire PDB (as released in January 2006) through the above procedure, and the outcome is a list of 4535 protein-ligand complexes. The structures of all of these complexes in the PDB format are downloaded from the RCSB PDB Web site (http://www.rcsb.org/pdb/). Each complex structure is then processed into appropriate formats for the convenience of subsequent analyses. In brief, each complex is split into a ligand molecule and a complete “biological unit” of the protein molecule, and they are saved in two separated files. Other components in the original PDB file, such as water and other solvent molecules, are ignored. The protein structure is sufficiently presented by the PDB format and thus does not need any additional treatment. The ligand structure, however, needs to be interpreted properly since the atom/bond type information is largely missing in the PDB format. The I-interpret program41 is applied here to tackle this problem. This program interprets the chemical structure of a given organic molecule with a high accuracy merely based on the identities and coordinates of its component atoms. Each processed ligand is saved in the Mol2 format and is further manually inspected in the graphical interface of the Sybyl software42 in order to detect any remaining problems in atom/bond types. Since the primary aim of our study is to analyze hydrogen bonds, it is necessary to specify the explicit positions of hydrogen atoms on the protein and the ligand, which are normally absent in the original PDB structural files. In our study, the “standard” protonation states under neutral pH are applied to the ligand side, i.e. carboxylic, sulfonic, and phosphoric acid groups are set in deprotonated forms, while aliphatic amine groups, guanidine, and amidine groups are set in protonated forms. Hydrogen atoms are added onto the ligand accordingly with the Sybyl software. Situations on
H Bonds on Protein-Ligand Binding Interface
J. Chem. Theory Comput., Vol. 4, No. 11, 2008 1961
Table 1. Hydrogen Bond Donor and Acceptor Types Defined in Our PMF Analyses symbol
SMARTS string
description Donor Types
OD.H ND.3 ND.AM ND.PL3 CD.G CD.A
[$([#8]([#1])[#6])] [$([#7∧3][#1])] [$([#7]([#1])[#6,#15,#16])[#8]),$([#7]([#1])[#6])[#16])] [$([#7;∧2;D3][#1])] [$([#6][#1])] [$([#6]([#7])([#6])[#8])[#1])]
OA.2 OA.H OA.E OA.NC NA.2
[$([#8])*)] [$([#8;D2][#1])] [$([#8;D2;H0])] [$([#8;D1]∼[#6,#15,#16]∼[#8;D1])] [$([#7;D2])
sp3 oxygen atom in a hydroxyl group sp3 nitrogen atom in an amine group, positively charged nitrogen atom in an amide group sp3 or sp2 nitrogen atom with a triangle planar geometrya generic carbon atom alpha-carbon on an amino acid residueb
Acceptor Types
a
Such as the nitrogen atom in pyrrole and the one in aniline.
sp2 oxygen atom sp3 oxygen atom in a hydroxyl group sp3 oxygen atom in an ester or ether group oxygen atom in a carboxylic group, negatively charged sp2 nitrogen atom b
Only applicable to protein molecules.
the protein side are more complicated since the protonation status of an amino acid residue may be affected by its surrounding environment. The PROPKA algorithm43 is employed in our study to determine the protonation status of ionizable residues under neutral pH. This algorithm is chosen since its performance was the best in a recent benchmark.44 Hydrogen atoms are then added onto the protein structure with the AMBER program45 according to the predictions by PROPKA. 2.2. Probing of Donor-Acceptor Pairs. An in-house C++ program, PLHB, is developed based on the open source library in OpenBabel.46 It is used to probe the donor-acceptor pairs on the binding interface of all of the protein-ligand complexes in our data set. Donor atoms and acceptor atoms are classified into several categories according to their chemical natures (Table 1). Combination of these donor and acceptor types covers most common hydrogen bonds observed between proteins and small-molecule ligands. The SMARTS chemical language47 and the Programmable ATom TYper (PATTY) algorithm48 are applied to this typing scheme. With SMARTS and PATTY, flexible and efficient atom type classification can be expressed in text strings that are interpretable to chemists. Our PLHB program also computes the desired geometrical parameters of donor-acceptor pairs, including the D-A distance (d) and the D-H-A angle (θ). Computation of the D-H-A angle needs the coordinates of hydrogen atoms, which are normally not available in the original structural files from PDB. Coordinates of the hydrogen atoms on most chemical groups can be reliably predicted with standard bond lengths, bond angles, and dihedral angles based on the hybridization state of their root atoms. An obvious exception is the hydrogen atom on a hydroxyl group (i.e., R-OH), which may have multiple possible positions around the R-O axis due to a low-energy rotation barrier. A simple searching algorithm is implemented in our PLHB program to tackle this problem: if a hydroxyl group is in close vicinity to an acceptor group on the counteracting molecule, the hydrogen atom on this hydroxyl group will be rotated around the R-O axis systematically to achieve the largest possible value of the D-H-A angle. The final coordinates of this hydrogen atom will be used in our statistical survey. 2.3. Derivation of Statistical Potentials. Pairwise potentials between donors and acceptors are derived from our
data set of protein-ligand complexes through the potential of mean force (PMF) analysis. The basic idea beneath PMF analysis27,28 is that statistically more populated configurations are energetically more favorable, and the ensemble of all accessible configurations are assumed to obey a Boltzmann distribution. In our study, the distance-dependent potential of each donor-acceptor pair is computed as
[
Dij(d) ) -RTln
fij(d) g(d) m0 + mij
m0 + mij
]
(1)
where fij(d) is the relative probability of observing donoracceptor pair i-j at distance d, and g(d) is the relative probability of observing a reference state at the same distance. Since our aim is to derive the geometrical preferences of hydrogen bonds over a nonspecific reference state, a reasonable choice of the reference state is all possible atom pairs, including van der Waals pairs as well as hydrogen bond pairs. In eq 1, fij(d) is computed as
( ) )⁄( ) Dmax
fij(d) ) Fij(d) ⁄ Fij(bulk) )
(
nij(d) 4πd2∆d
)⁄ ∫
And, g(d) is computed as
∑ nij(d)
Dmin
Dmax
Dmin
4πd2∆d
(2)
Dmax
(
g(d) ) Fall(d) ⁄ Fall(bulk))
nall(d) 4πd2∆d
∑ nall(d)
Dmin
∫DD
max
min
4πd2∆d
(3)
Here, Fij(d) is the numerical density of donor-acceptor pair i-j observed at distance d, while Fij(bulk) is the numerical density of donor-acceptor pair i-j observed throughout the entire sampling space. Fall(d) and Fall(bulk) are defined similarly, which are applied to all atom pairs. In our study, the lower bound (Dmin) and the upper bound (Dmax) of distance cutoff are set to 2.0 Å and 8.0 Å, respectively. In order to count the occurrence (nij) of donor-acceptor pair i-j at a particular distance, the spherical sampling space centered at the donor atom is divided into multiple layers (Figure 1A). The bin width, i.e. ∆d, is set to 0.1 Å. The
1962 J. Chem. Theory Comput., Vol. 4, No. 11, 2008
Liu et al.
angle regardless if a hydrogen bond between them is possible. Since θ is only relevant to donor-acceptor pairs, the reference state in eq 4 is different from the one in eq 1. In eq 4, fij(θ) is computed as
(
fij(θ) ) Fij(θ) ⁄ Fij(bulk))
)
nij(θ)d