Subscriber access provided by CORNELL UNIVERSITY LIBRARY
Article
Statistical Analysis and Prediction of Covalent Ligand Targeted Cysteine Residues Weilin Zhang, Jianfeng Pei, and Luhua Lai J. Chem. Inf. Model., Just Accepted Manuscript • Publication Date (Web): 16 May 2017 Downloaded from http://pubs.acs.org on May 17, 2017
Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.
Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
Page 1 of 23
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Statistical Analysis and Prediction of Covalent Ligand Targeted Cysteine Residues Weilin Zhang,† Jianfeng Pei,‡ and Luhua Lai*,†,‡,§ †
Peking-Tsinghua Center for Life Sciences, AAIS, Peking University, Beijing 100871, P.R.
China ‡
Center for Quantitative Biology, AAIS, Peking University, Beijing 100871, P.R. China
§
BNLMS, State Key Laboratory for Structural Chemistry of Unstable and Stable Species,
College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, P.R. China
* Corresponding author. E-mail:
[email protected]. Tel: 86-10-62757486
ACS Paragon Plus Environment
1
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 2 of 23
ABSTRACT: Targeted covalent compounds or drugs have good potency as they can bind to a specific target for a long time with low doses. Most currently known covalent ligands were discovered by chance or by modifying existing non-covalent compounds to make them covalently attached to a nearby reactive residue. Computational methods for novel covalent ligand binding prediction are highly demanded. We performed statistical analysis on protein complexes with covalent ligands attached to cysteine residues. We found that covalent modified cysteine residues have unique features compared to those not attached to covalent ligands, including lower pKa, higher exposure and higher ligand binding affinity. SVM models were built to predict cysteine residues suitable for covalent ligand design with prediction accuracy of 0.73. Given a protein structure, our method can be used to automatically detect druggable Cys residues for covalent ligand design, which is especially useful for identifying novel binding sites for covalent allosteric regulating ligand design.
ACS Paragon Plus Environment
2
Page 3 of 23
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
INTRODUCTION In recent years, there is a resurgence of interest on covalent ligands with high potency and selectivity. About one third of the FDA approved drugs act through covalent mechanisms and some of them are discovered after daily application for years.1 Several computational tools such as CovalentDock2, CovDock3, CovDock-VS4, DOCKovalent5, DOCK-TITE6, AutoDock7 and SCAR8 have been developed for designing covalent ligands. Quantum chemical-based studies have also been applied to fulfill this goal by carefully evaluate the potential energy surface of the covalent bond formation between the involved residue and the warhead.9, 10 Currently, all these programs or protocols are applied to the known nucleophilic residue and the effort is mainly focused on searching covalent candidates with delicate designed covalent bond contribution. Among these programs or protocols, only DOCKovalent and SCAR have been used to guide a forward discovery of new covalent ligands. Covalent binding of ligands is very common in the biological process and cysteine residue in general has the highest nucleophilic reactivity among all the natural amino acid residues. Proteomic profiles have shown that a broad range of cysteine residues could act as reactive sites with the desired probe reagents. 11, 12 With the integration of these proteomics information and H-bond network contribution from corresponding protein structures, Soylu and Marino developed a well-designed web server Cy-preds to predict Cys reactivity.13 However, whether a cysteine is suitable to bind a druglike ligand highly depends on its environment. Although there are several computational approaches to predict functional cysteine residue 13, 14, whether the nucleophile cysteine residues are suitable for druglike covalent ligand design is not well addressed. “Kinase cysteinome” has been proposed for non-catalytic cysteine residues in the active site for protein kinase family.15 Recently, Zhang et.al 16 used multiple sequence alignment
ACS Paragon Plus Environment
3
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 4 of 23
and FTMap17 to detect the reactivity of no-catalytic cysteine in seven kinases. They proposed that higher frequency of H-bonded and non-bonded interaction around cysteine is preferred to form covalent bond. The short distance between the warhead and the thiol group of cysteine is also required. Cysteinome is a comprehensive database manually curated from current literatures, which compiles 462 protein targets with targetable cysteine from 122 different species with 1217 covalent modulators.18 Among all the entries, there are 342 targets with protein structures and 176 targets with covalent modulator complexes. The accumulated information of covalent ligands binding with proteins makes it possible to carry out a comprehensive statistical analysis. Here we first analyzed the general features of cysteine residues that are covalently attached to ligands, then developed a predictive model to identify targeted cysteine residues suitable for druglike covalent ligand discovery. DATASET PREPARATION A statistical survey on binding constants of covalent bound protein-ligand complexes has been done several years ago by Li et.al.19 They selected 79 covalent complexes and most of them were enzymes with serine residues. Other datasets used to develop computational tools were limited to a small number of protein families. 2, 3, 6-8, 16We extracted our dataset “Covalent” (Cov) by the following procedure. We first explored the Protein Data Bank20 for complexes with covalent bond formed between the ligand and cysteine residue as indicated by the text information provided. At this stage all explicit disulfide-bond forming cysteine residues were eliminated. Heme containing proteins and Fe-S proteins structures were not considered within the dataset as their patterns are very special. We also checked the structures and corresponding literatures to make sure such complexes indeed have covalent bonds. All the covalent ligands were manually checked for their non-covalent and warhead form.
ACS Paragon Plus Environment
4
Page 5 of 23
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Another dataset “Non-covalent” (Non-Cov) was also built for comparison. It was comprised of all the other cysteine residues from the same structure, which do not form covalent bond with the ligand in that crystal structure. A special case is small ligands with high reactivity such as BME and DTT, which are routinely used in protein purification and crystallization. In some structures, they form sulfur-sulfur bond with certain cysteines. But considering the normal dosage, not all cysteines are modified by these chemicals, indicating that the bond formation process should depend on the local environment. To make a general exploration, all protein structures with sulfur-sulfur bond forming with these small highly reactive ligands were included into the “Covalent-Active agent” (Cov-Active) dataset. All the other covalent ligands are tagged as “Covalent-ligand”(Cov-Lig). The final “Covalent-ligand” and “Covalent-Active agent” datasets contain 1134 and 328 Cys residues, respectively, which are covalently modified from 1309 PDB entries (1057/257) belonging to 515 targets. These structures come from various protein families and organisms (Supporting Table S1 to S3). Based on Blastclust at 90% sequence identity from RCSB PDB21, these proteins can be clustered into 371 and 144 clusters, respectively. With 40% sequence identity, they could be clustered into 257 and 132 clusters, respectively. The total number of covalent ligands is 531 from 61 warheads which is comparable to that of Cysteinome. The “non-covalent” dataset contains 5705 cysteine residues within the same proteins. The pKa values of cysteine residues were calculated using PROPKA (PROPKA v2.0) 22as suggested by Marino
23
. The solvent
accessible surface areas (SASA) of cysteine residues were calculated using POPS. 24 To obtain the information of potential pocket that a cysteine belongs to or is close by, the protein structures were analyzed using CAVITY, a protein surface cavity detection and druggbility analysis program. 25 If a covalent cysteine is within a CAVITY detected pocket, the number of other
ACS Paragon Plus Environment
5
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 6 of 23
pocket forming residues within 4Å, 6Å, 8Å and 10Å of the sulfur atom of that covalent Cys are counted with proDy package using a custom python script.26 The ratio of each type of amino acid residues within corresponding range is calculated by dividing the total number of residues detected. The descriptors based on amino acid alphabets are added according to their definitions. If a cysteine is either on the exposed surface of or stay inside a protein without a detectable pocket around it, its pocket and all other descriptors are assigned to “None” and not used in the further machine learning exploration. RESULTS AND DISCUSSION Covalent bond length, Cys pKa and exposure Among all the covalent ligands, about 68% form sulfur-carbon bond and 32% form sulfur-sulfur bond. The average bond length of sulfur-carbon bond is shorter than that of sulfursulfur bond with a peak at 1.8 Å (Figure 1). We also calculated the pKa values and SASA for all the cysteines (Figure 2, Supporting Table S4 and S5). About 1386 Cys residues within “NonCov” dataset failed in the PROPKA calculation and these residues are not considered here. According to Marino and Gladyshev27, solvent-exposed Cys residue may easily be polarized, leading to a decreased pKa, and the solvent-exposed Cys residues are estimated to have close pKa to the physiological pH. The average pKa (7.9) of cysteines in the “Cov-Lig” dataset is significantly lower than that of the “Non-Cov” dataset (9.7). These reactive cysteines are relatively more exposed to solvent with an average fraction of accessible surface of 25% (solvent accessible surface area of 36Å2 across 144 Å2 theoretical value), which is about 10% larger than that of the “Non-Cov”. In “Cov-Active”, these parameters are in the middle of “Cov-Lig” and
ACS Paragon Plus Environment
6
Page 7 of 23
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
“Non-Cov” . The tendency in three datasets indicates that pKa and SASA of cysteine highly impact on its nucleophilic reactivity which is an important factor for covalent bond formation.
Figure 1. Distribution of bond length for sulfur-carbon bond and sulfur-sulfur bond. The grey part is the overlap of the two distributions.
ACS Paragon Plus Environment
7
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 8 of 23
Figure 2. Calculated pKa and fraction of solvent accessible area of cysteines in Cov-Lig, CovActive and Non-Cov datasets. Cysteines with covalent bond formation have relative lower pKa and higher solvent accessible area. Analysis of surface cavities near cysteine residues It has been proposed that the covalent bond forming process includes binding and bond formation. The binding process of ligand highly depends on the available binding site 28-33. CAVITY is a program that detects potential binding cavities and their properties. The successful design of compounds for the newly detected pockets have been published in many previous work.25 We used CAVITY in “whole protein mode” to detect all the surface pockets with CavityScore larger than 1.5 in our datasets. Then we checked whether a covalent cysteine is within the region of a pocket by checking each pocket’s amino acid residue composition. As shown in Figure 3, majority of cysteine residues binding with covalent ligands are located within or near one of the pockets, which is in accordance with that suggested by previous literatures (cite ref here). For the Cov-Active and Non-Cov datasets, a significantly less percentage of cysteines were found within or nearby surface pockets.
ACS Paragon Plus Environment
8
Page 9 of 23
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Covalent Dataset
a) Cov-Lig Dataset
c)
Non-Covalent Dataset
b) Cov-Active Dataset
d)
Figure 3. Percentage of Cys residues within in or close to detected pockets. a) The overall percentage of Cys residues associated with detected pockets for both “Cov-Lig” and “CovActive” datasets. b) The percentage of Cys residues associated with detected pockets for “NonCov” dataset. c) The percentage of Cys residues associated with detected pockets for “Cov-Lig” dataset. d) The percentage of Cys residues associated with detected pockets for “Cov-Active” dataset.
ACS Paragon Plus Environment
9
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 10 of 23
Based on the geometrical structure and physical-chemical properties, CAVITY gives CavityScore which is linearly correlated with the binding affinity (average pKd ) of non-covalent ligands25. Wilcoxon tests are applied for CAVITY generated descriptors. As shown in Table 1, the estimated average pKd (pKd Ave) is significantly larger in the “Cov-Lig” dataset than the other two datasets. This parameter is almost the same for “Cov-Active” and “Non-Cov” (Supporting Table S6 and Figure S1, S2). Hydrophobicity and hydrogen bond donor ratio are slighted higher in “Cov-Lig” dataset. The hydrogen bond acceptor ratio is slightly higher in the “CovActive” dataset, which is consistent with the result for kinase pockets16. Besides these physicalchemical parameters, other geometrical parameters also provide valuable information. Large and deep (volume and depth) pockets are preferred for covalent ligand binding. To better describe the binding ability of pockets, we borrowed the idea of “druggability” analysis used by CAVITY 25 to evaluate whether a pocket is suitable for designing a non-covalent ligand. Lip-volume ratio (lipVR) is defined as the number of grids in the lip level over the total grid number. This geometrical parameter is correlated with the predicted binding affinity and could be used to describe the overall shape of the pocket. The distribution of average binding affinity could be split into three lip-volume ratio groups (Figure-4). In the first category, pockets have a wellformed pocket with large vacant and small lip, which could constrain the ligands inside. The distributions of average binding affinities are quite similar for both datasets. This indicates that although we choose all other non-covalently modified cysteine as “Non-Cov” dataset, some of them are within a pocket with high ligand binding potency and may be covalently targeted by other ligands. For the other two categories with larger lip volume ratio, the pockets are either widely open or relative flat and small (Figure 4). The overall distributions of average binding affinities are relatively flat and show significant difference between two datasets. The “Cov-Lig”
ACS Paragon Plus Environment
10
Page 11 of 23
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
dataset has higher population in the higher-end of pKa, while “Non-Cov” shows a large distribution on the lower-end of pKa. The widely open grooves would be targeted by endogenous peptide-like ligands while the small and shallow pockets could be used by endogenous small metabolites-like ligands, which is also very important for allosteric post-translational modification.
Table 1. Two-sample Wilcoxon tests on CAVITY generated descriptors in pockets in Cov and Non-Cov pockets Cov Non-Cov
a
Mean
SD
Mean
SD
Difference
p-value
Openness a
176
124
161
138
14
7.3 e-13
Depth b
17.9
14.1
13.9
11.6
3.4
1.7e-23
Volume a
742
482
576
470
165
9.1e-37
Hydrophobic Ratio
0.27
0.14
0.24
0.16
0.02
9.2e-10
Hydrogen Bond Donor Ratio
0.48
0.09
0.49
0.09
-0.01
4e-4
Hydrogen Bond Acceptor Ratio
0.25
0.08
0.25
0.08
0.002
0.1
Lip-Volume Ratio
6.4
0.6
6
0.8
0.3
2.08e-28
pKdAve
176
124
161
138
14
7.3 e-13
The unit of openness and volume are Å2 and Å3.
b
The unit of depth is the number of layers from the lip level to the bottom level of the cavity grid points.
ACS Paragon Plus Environment
11
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 12 of 23
a)
b)
c)
Cov-Lig
Non-Cov
Figure 4. Average binding affinity categorized by lip-volume ratio. The lipVR range of lipVR-1, lipVR-2 and lipVR-3 are (0, 0.1),(0.1, 0.2),(0.2, 0.4), respectively. a) geometrical illustration of pockets shape in each category. b) predicted average binding affinity of the pockets in each category. c) the distribution of lip/volume ratio in two data sets.
ACS Paragon Plus Environment
12
Page 13 of 23
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Analysis of amino acid composition around covalent cysteines Considering cysteine is one of the least abundant amino acids in proteins, its position and function are highly related to its adjacent environments. To simplify characteristics of the amino acid side chains, different amino acid alphabets were chosen here (derived from various approaches, including physicochemical properties, local structure considerations and sequence alignments (Supporting Table S8)). 34-42 We first examined the total number of amino acid residues around the cysteines in three datasets. On average, there are more residues around covalent cysteine residues (Table 2). This parameter is related to the position of the cysteine within a pocket. The cysteine located inside a pocket will have more neighbors than those on the edge. According to a hydrophobic/polar (H/P) classification by Kameteka et. al37, there are more polar and less hydrophobic residues around the covalently modified cysteines. Details about the comparison of each amino acids and other smaller groups among the three datasets are shown in the supporting information. Table 2. Total number of amino acids and H/P pattern percentage for covalent and non-covalent cysteine residues Non
Cov-Lig
p-value
Amino Acid
Mean
Median
Mean
Median
ENV4A.All
4.76
5.00
3.77
4.00
2.47e-102
ENV6A.All
7.93
8.00
5.93
6.00
4.30e-153
ENV8A.All
12.92
13.00
9.21
9.00
1.65e-179
ENV10A.All
17.35
17.00
12.19
12.00
5.54e-173
ENV4A.Polar
0.74
0.80
0.62
0.67
3.39e-49
ADEGHKNPQRST
ENV4A.Hydrophobic
0.26
0.20
0.38
0.33
3.39e-49
CFILMVWY
ENV6A.Polar
0.69
0.73
0.61
0.63
9.00e-41
ADEGHKNPQRST
ENV6A.Hydrophobic
0.31
0.27
0.39
0.38
9.00e-41
CFILMVWY
ACS Paragon Plus Environment
13
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 14 of 23
Predictive models for covalent cysteine identification We built a predictive model to evaluate whether a cysteine residue could be covalently targeted. We built the training set by assigning Cys residues within in or close to detected pockets from “Cov-Lig” (1027) and “Non-Cov” (2476 with pKa calculation failed entries excluded) as positive and negative data, respectively. As many covalent ligands have been verified by mass spectrometry and other experimental methods without available crystal structures of the complex, we extracted targets with known three-dimensional structures but no complex structures with covalent ligand from Cysteinome database as an external test set.18 This external dataset contains 119 targets, 1663 PDB structures and 6562 cysteine residues (1377/5185). The machine learning process was performed using R package. SVM models were trained with radial basis kernel with cost and gamma specified in Supporting Information by e1071 package. During 10-fold cross-validation, one tenth of the training data was reserved for prediction. The best cut-offs which maximizes the summation of sensitivity and specificity were chosen as the threshold for binomial classification. All ten models generated are validated against the test set with the same cut-off values. In addition to the parameter analyzed in the previous section, the amino acid alphabet from Tanaka et.al 40 was employed, which was originally used for protein design. According to Longo and Blaber39, these representative amino acids could cover most of the properties required for protein folding such as complexity, secondary structure propensity, hydrophobic-hydrophilic patterning and core-packing potential. Models built with other amino acid alphabets were given in the supporting information. As previously discussed physical-chemical descriptors mainly cover Cys reactivity, Cys position and binding affinity of potential non-covalent ligands, while distance based amino acid composition descriptors mainly cover information from sequence.
ACS Paragon Plus Environment
14
Page 15 of 23
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Combining these two type of descriptors can significantly increase the predictive accuracy of the model (Table 3 , Figure 5 and Supporting Table S9 to S12). To further simplify the model, distance based parameters are kept if they have a low p-value (