Prediction of Protein Pairs Sharing Common Active ... - ACS Publications

Aug 25, 2016 - pubs.acs.org/jcim ... DOI: 10.1021/acs.jcim.6b00118. J. Chem .... (bioactivity < 0.1 μM). ... Tanimoto coefficient of 1.0 based on MAC...
0 downloads 0 Views 7MB Size
This is an open access article published under an ACS AuthorChoice License, which permits copying and redistribution of the article or any adaptations for non-commercial purposes.

Article pubs.acs.org/jcim

Prediction of Protein Pairs Sharing Common Active Ligands Using Protein Sequence, Structure, and Ligand Similarity Yu-Chen Chen,†,∥ Robert Tolbert,§ Alex M. Aronov,‡ Georgia McGaughey,‡ W. Patrick Walters,‡,⊥ and Lidio Meireles*,† †

Vertex Pharmaceuticals Incorporated, 11010 Torreyana Road, San Diego, California 92121, United States Vertex Pharmaceuticals Incorporated, 50 Northern Avenue, Boston, Massachusetts 02210, United States § OpenEye Scientific Software, 9 Bisbee Court, Suite D, Santa Fe, New Mexico 87508, United States ‡

S Supporting Information *

ABSTRACT: We benchmarked the ability of comparative computational approaches to correctly discriminate protein pairs sharing a common active ligand (positive protein pairs) from protein pairs with no common active ligands (negative protein pairs). Since the target and the off-targets of a drug share at least a common ligand, i.e., the drug itself, the prediction of positive protein pairs may help identify off-targets. We evaluated representative protein-centric and ligand-centric approaches, including (1) 2D and 3D ligand similarity, (2) several measures of protein sequence similarity in conjunction with different sequence sources (e.g., full protein sequence versus binding site residues), and (3) a newly described pocket shape similarity and alignment program called SiteHopper. While the sequence-based alignment of pocket residues achieved the best overall performance, SiteHopper outperformed sequence-based approaches for unrelated proteins with only 20−30% pocket residue identity. Analogously, among ligand-centric approaches, pathbased fingerprints achieved the best overall performance, but ROCS-based ligand shape similarity outperformed path-based fingerprints for structurally dissimilar ligands (Tanimoto 25%−40%). A significant drop in recognition performance was observed for ligand-centric approaches when PDB ligands were used instead of ChEMBL ligands. Finally, we analyzed the relationship between pocket shape and ligand shape in our data set and found that similar ligands tend to bind to similar pockets while similar pockets may accept a range of differentshaped ligands.



INTRODUCTION Drug target identification is crucial to drug discovery projects from both mechanistic and safety perspectives. Drugs exert their therapeutic benefits by modulating disease biology through binding interactions with specific macromolecular targets. Thus, knowing the disease-modifying targets modulated by pharmacologically active small molecules such as those discovered in phenotypic-based screens constitutes a significant step in the drug discovery process. On the other hand, drugs can also interact with multiple unintended targets, so-called of ftargets, which may result in adverse drug reactions and safety issues. In this case, early detection of off-targets can help researchers to develop optimization strategies that may mitigate these risks. In silico approaches to target/off-target identification constitute a fast and lower-cost complement to experimental techniques and can be categorized into protein-centric and ligand-centric approaches. Ligand-centric approaches rely on the observation that similar ligands are likely to share the same protein targets. These approaches vary mainly on the specifics of how ligand similarity is calculated. Fingerprint-based approaches evaluate ligand similarity on the basis of molecular connectivity, whereas ligand-shape-based approaches measure © XXXX American Chemical Society

ligand similarity on the basis of their three-dimensional (3D) shapes. Keiser et al.1 used the similarity ensemble approach (SEA) to predict drug polypharmacology for 246 protein targets based on Daylight and ECFP fingerprints.2,3 This work led to the experimental confirmation of 23 new drug−target associations, including the inhibition of the 5-hydroxytryptamine transporter by Vadilex. Using the SEA method, GregoriPuigjané et al.4 successfully deorphaned seven FDA-approved drugs with unknown mechanisms of action, and Lounkine et al.5 discovered that chlorotrianisene causes abdominal pain via inhibition of cyclooxygenase-1. AbdulHameed et al.6 showed that the program ROCS (Rapid Overlay of Chemical Structures)7 was efficient in the retrospective prediction of drug targets using the Directory of Useful Decoys (DUD)8 and further discovered previously unknown off-targets for rimantadine, propranolol, and domperidone with subsequent experimental support. In turn, protein-centric approaches for target identification can be sequence-based or structure-based. Protein sequence alignment is commonly used to evaluate sequence similarity, Received: March 2, 2016

A

DOI: 10.1021/acs.jcim.6b00118 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

Figure 1. (A) Criteria for positive and negative protein pairs (see the text for details); (B) Illustration of the pairwise comparisons conducted for scoring each benchmark protein pair.

Thus, by predicting which proteins share a common ligand with the drug target, one can generate hypotheses about the potential off-targets of the drug, even if the actual ligands shared by the protein pairs may be unknown. Kalinina et al.20 successfully predicted new protein−ligand associations by exploiting the premise that if two proteins share a common ligand, then they are likely to share others. Adding protein pairs predicted to share a common ligand to those pairs known to share a common ligand expands the scope of that premise. Among ligand-centric approaches evaluated for positive protein pair recognition, we benchmarked five chemical fingerprints and the shape overlay program ROCS, whereas among protein-centric approaches, we evaluated three measures of protein sequence similarity in conjunction with different sequence sources (e.g., full protein sequence versus binding site residues). In addition, we benchmarked a new pocket shape similarity and alignment program called SiteHopper.21

which may be a consequence of functional relationships between proteins.9 Kruger and Overington10 demonstrated that small-molecule binding is conserved for protein targets with sequence identity above 80%. This similarity can then be utilized as a predictive measure for assessing polypharmacology. In addition to biological sequence, protein similarity can also be evaluated from a structural perspective. Binding pockets have 3D characteristics and can be used to assess sequence-free similarity, which may be useful in recognizing cross-target activity among nonhomologous proteins with similar binding pockets. As an example of this approach, Xie and Bourne11 and Kinnings et al.12 demonstrated the applicability of Cα-atombased pocket similarity by repurposing entacapone, which is indicated for the treatment of Parkinson’s disease, for antitubercular therapeutics. Other approaches to pocket comparison include evaluating pocket similarity on the basis of pseudoatoms,13 all atoms,14 Cα−Cβ vectors,15 spherical harmonics,16 3D Zernike descriptors,17 indexed triangle descriptors,18 and geometric hashing of residues.19 In this work, we evaluated the ability of protein-centric and ligand-centric approaches to correctly discriminate protein pairs sharing a common active ligand (positive protein pairs) from protein pairs with no common active ligands (negative protein pairs). This classification problem is related to the off-target prediction problem since by definition the target and the offtargets of a drug share a common ligand, i.e., the drug itself.



MATERIALS AND METHODS Benchmark Data Generation. The benchmark was formulated as a binary classification problem with the goal of testing the ability of protein-centric and ligand-centric approaches to discriminate protein pairs with common active ligands (positive protein pairs) from protein pairs with no common active ligands (negative protein pairs). This benchmark study was designed to evaluate the ability of different

B

DOI: 10.1021/acs.jcim.6b00118 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

SiteHopper, a Patch Score is defined as 1·ShapeTanimoto + 3· ColorTanimoto. This scoring scheme was chosen on the basis of subjective evaluations rather than fitting of parameters to any particular data set. The following steps are used to create a searchable patch for a binding site: 1. The molecular surface of the protein is created using the Spicoli Toolkit.28 2. The known ligand is used to subset the surface to those surface vertices within 4 Å of the ligand. 3. A spherical Gaussian with a carbon radius is placed at each surface vertex. 4. A subset of the protein residues within 7 Å of the binding site surface is created, and a modified color force field is applied to this subset of residues to create a set of color atoms. In this case, we use a variation of the implicit Mills−Dean force field that is standard in ROCS in which the ring centroid definitions are omitted and the color atom radius is increased from 1.0 to 2.0 Å to account for the higher flexibility of proteins. Finally, color atoms further than 2.5 Å from the surface are removed. 5. This active site shape description is written out in a format that can be searched with the Shape Toolkit29 using the same algorithm as in ROCS. As illustrated in Figure 1B, for each benchmark protein pair with N ligand binding sites collected for one protein and M for the other, N × M pairwise comparisons of binding sites were performed, and the maximum Patch Score returned was assigned to the protein pair. Protein Pair Scoring Using Protein Sequence Similarity. Three protein sequence sources (protein sequence, pocket sequence, and pocket residues) were evaluated in conjunction with three scoring measures (sequence identity, sequence similarity, and normalized alignment score). We defined pocket residues as the amino acid residues surrounding a cocrystallized ligand within a distance of 7 Å. We defined the pocket sequence as the segment of the protein sequence starting from the first pocket residue to the last pocket residue. For each protein pair, the corresponding protein/pocket sequences were aligned using the global sequence alignment algorithm implemented in Biopython30 with a gap opening penalty of 11, a gap extension penalty of 1, and a Blosum62 scoring matrix. The alignment of pocket residues was extracted from the alignment of the corresponding pocket sequences. We scored each alignment by percent identity, percent similarity, and alignment score normalized by alignment length. Protein Pair Scoring Using Ligand-Shape-Based and Fingerprint-Based Ligand Similarity. The program ROCS version 3.2 was used to calculate the 3D shape similarities of ligands. Fingerprint-based ligand similarity was measured by the Tanimoto coefficient using five different fingerprints: Path, Circular, Tree, MACCS, and LINGO (OpenEye GraphSim Tookit). We analyzed the performance of ligand-centric approaches using ligands from two sources: cocrystallized ligands from the PDB and active ligands from ChEMBL (bioactivity < 0.1 μM). As illustrated in Figure 1B, for each benchmark protein pair with N cocrystallized ligands for one protein and M for the other, N × M pairwise comparisons of cocrystallized ligands were performed, and the maximum ROCS TanimotoCombo score returned was assigned to the protein pair (for fingerprint-based methods, the maximum Tanimoto coefficient was returned). The crystallographic

approaches to correctly recognize cross-target activity using information regarding protein target A to identify protein target B. A large benchmark data set of positive and negative protein pairs was assembled by integrating data from the ChEMBL20 bioactivity database22 and the Protein Data Bank (PDB).23 Chemical bioactivity data in the forms of IC50, EC50, Ki, and Kd were collected from ChEMBL.22 For ligands with multiple data points for the same target, we aggregated the bioactivity data of different types (IC50, EC50, Ki, and Kd) by calculating in the negative log space (e.g., pIC50) the mean and standard deviation (pActivity and pStd, respectively). We considered a ligand active against a protein if pActivity − pStd ≥ 7.0, i.e., Activity ≤ 0.1 μM and pStd (i.e., the discrepancy among reported experimental values) is adequately small. If this criterion was not met because pStd was too high, we discarded the data. Likewise, we considered a ligand inactive against a protein if pActivity + pStd ≤ 5.0, i.e., Activity ≥ 10 μM and pStd adequately small. We included data from all organisms and treated orthologous proteins as different targets. We used only bioactivity data with the target annotation “SINGLEPROTEIN” and filtered out non-drug-like molecules using the software OpenEye f ilter (version 2.5.1, default parameters). A protein pair was labeled positive if there was at least one common active ligand. Conversely, a protein pair was labeled negative if it was not a positive protein pair and there were three or more ligands that were inactive against one protein but at least weakly active against the other (Activity ≤ 10 μM and pStd small, as explained above). The criteria for negative protein pairs exclude protein pairs lacking bioactivity data. Protein pairs generated from ChEMBL were filtered to include only those pairs involving proteins having a crystal structure in complex with a drug-like molecule deposited in the PDB (cofactors not included). This filtering step ensured that the same benchmark data set could be consistently used to assess positive protein pair discrimination for methods requiring different data types (i.e., drug-like molecules, protein sequence, and protein structure). Figure 1A summarizes the criteria for positive and negative protein pairs. We filtered out 76 protein pairs involving cytochrome p450 enzymes (CYPs) from the positive set because of their high promiscuity for small-molecule binding; from the negative set we filtered out 5 CYP/CYP pairs but kept 111 CYP/non-CYP pairs. The final benchmark set consisted of 6598 positive and 379 negative protein pairs, representing 497 proteins and 170 563 ligands. The smaller number of negative protein pairs is due to the stringent constraints imposed by us on the definition of a negative protein pair and the lack of experimental data to support labeling protein pairs as negative, resulting in many “unable to determine” pairs. To our knowledge, this is the first time such a large data set of positive and negative protein pairs has been assembled and made publically available. Protein Pair Scoring Using SiteHopper Binding Site Similarity. SiteHopper21 represents ligand binding sites as 3D shapes using a shape similarity technology that has proven robust and predictive in many applications.7,24−27 SiteHopper uses a set of spherical Gaussians centered on a set of vertices selected from the binding site surface to describe the shape of the binding site. Chemical properties of protein atoms near the surface (hydrogen-bond donors, hydrogen-bond acceptors, cations, anions, hydrophobes) are then used to label the spherical Gaussian. Similarity described as Patch Scores is then measured in both shape and chemical (color) overlap. For C

DOI: 10.1021/acs.jcim.6b00118 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

Figure 2. Benchmark data set. (A) Sequence identity of positive and negative protein pairs calculated from the full protein sequence alignment. (B) Structural similarity of cocrystallized ligands in positive and negative protein pairs evaluated by the Tanimoto coefficient based on MACCS fingerprints. For each protein pair, N × M pairwise comparisons of N cocrystallized ligands for one protein and M for the other were performed, and the maximum Tanimoto coefficient was reported for the protein pair.

approaches and SD is the standard deviation of the bootstrapped differences in the AUCs of the two approaches. As Z approximately follows a normal distribution, the p value of Z was calculated accordingly. Statistical Significance of Approach-Specific Scores. The similarity scores generated by two approaches are not directly comparable because their distributions are not necessarily equal. Therefore, we calculated the p values of the similarity scores. The distribution of similarity scores of an approach was approximated using scores calculated from 100 000 random data pairs. The p value of a score was calculated from the cumulative distribution function of the extreme value distribution for each approach.

conformation of co-crystallized ligands was used in the ROCS overlays. For ligands collected from ChEMBL, active ligands of one protein were compared to active ligands of the other protein. The common active ligands were excluded from the pairwise comparisons to avoid biasing the results. The program OMEGA2 (version 2.5.1.4) was used to generate 25 low-energy conformations to be used in the ROCS overlays of ChEMBL ligands. A total of 100 protein pairs from the negative set of 379 pairs could not be scored with a ligand-based approach because one or both proteins of the pair did not have a potent active ligand (ligand-based approaches involve the comparison among active ligands). Bootstrapped AUC Measurement of Positive Protein Pair Discrimination. The performance of protein-centric and ligand-centric approaches in positive protein pair discrimination was evaluated using the receiver operating characteristic (ROC) curve, which is a measure of the ability of a method to discriminate between two groups. The ROC curve is generated by plotting the true positive rate against the false positive rate determined by incremental score thresholds. The area under the ROC curve (AUC) represents the probability that an approach will score a randomly chosen positive instance higher than a randomly chosen negative one. Because this probability (i.e., the AUC) is invariant to imbalanced data sets,31 we used it to study the aggregated performance of different scoring approaches. The bootstrapping operation was performed to estimate the distribution of AUCs for each approach using the pROC module in R.32 The bootstrapping process generated 2000 replicates using nonparametric stratified resampling.33 The 95% confidence interval of the AUC measurement was then calculated on the basis of the distribution of the AUC generated from 2000 bootstrap replicates for each approach evaluated. Pairwise comparison of approaches was performed to determine whether the difference in recognition performance was statistically significant. The AUCs of two approaches were compared using the bootstrap test implemented in the pROC module in R.32 The statistical significance in differentiating AUCs was estimated using the formula Z = (AUC1 − AUC2)/ SD, where AUC1 and AUC2 are the AUCs of the two



RESULTS AND DISCUSSION Characterization of the Benchmark Data Set. We investigated the sequence identity of the protein pairs to ensure that the data set is not biased toward highly homologous sequences. Figure 2A shows the histogram of the sequence identity of the positive and negative protein pairs. For the positive protein pairs, the distribution has a median sequence identity of 17.5%, and 89.4% of the positive protein pairs have sequence identity in the range 10%−40%. Hence, the majority of positive protein pairs do not share significant sequence identity based on the full protein sequence alignment. The applicability of full sequence-based annotation is therefore limited to homologous proteins and might not be appropriate for identifying cross-target activity among unrelated protein pairs. Figure 2B shows the structural similarity of cocrystallized ligands in positive and negative protein pairs evaluated by the Tanimoto coefficient based on MACCS fingerprints. The distribution has a median similarity of 64.9% for the positive pairs. We note that the distribution for positive pairs is slightly shifted toward more similar ligands in comparison with the distribution for negative pairs, yet substantial overlap between the distributions exists. Only a small fraction of cocrystallized ligand pairs from the positive set (303 out of 6598) have a Tanimoto coefficient of 1.0 based on MACCS fingerprints. Those are identical or highly similar ligands. D

DOI: 10.1021/acs.jcim.6b00118 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

Figure 3. Performance of ligand-centric approaches in positive protein pair discrimination. Shown are the bootstrapped ROC curves and AUCs with 95% confidence intervals calculated from 2000 bootstrap replicates for (A) ROCS, (B) MACCS fingerprints, (C) LINGO fingerprints, (D) Circular fingerprints, (E) Path fingerprints, and (F) Tree fingerprints.

Performance of Ligand-Centric Approaches in Positive Protein Pair Discrimination. We evaluated the performance of five chemical fingerprints (MACCS, LINGO,

Path, Circular, and Tree) and ROCS-based ligand shape similarity using cocrystallized ligands from the PDB and active ligands from ChEMBL. Figure 3 shows the bootstrapped ROC E

DOI: 10.1021/acs.jcim.6b00118 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling curve and the AUC with 95% confidence interval for each ligand-centric approach. Path fingerprints using active ligands from ChEMBL achieved the best performance overall (AUC = 77.9%); however, their performance dropped significantly when PDB ligands were used instead of ChEMBL ligands (AUC = 64.1%). Interestingly, the same trend of decreasing performance when switching from ChEMBL ligands to PDB ligands was also observed for all of the ligand-centric approaches analyzed. These results highlight the sensitivity of ligand-centric approaches to the domain of applicability of the data source. With ChEMBL ligands, ROCS-based ligand shape similarity achieved an AUC of 72.9%. Figure 4 shows the pairwise comparison of ligand-centric approaches in positive protein pair discrimination. Path

Figure 5. Performance in positive protein pair discrimination by the effects of two-dimensional structural similarity calculated by Path fingerprints. The AUCs for ROCS and Path fingerprints were calculated for the subset of protein pairs with a Tanimoto coefficient (Path fingerprint) less than each cutoff using ChEMBL ligands. The mean AUCs with 95% confidence intervals were calculated from 2000 bootstrap replicates. Figure 4. Pairwise comparison of ligand-centric approaches in positive protein pair discrimination using ChEMBL ligands. The p value of each pairwise comparison was generated from 2000 bootstrap replicates and is colored red if p < 0.005.

sequence sources (protein sequence, pocket sequence, and pocket residues) in conjunction with three scoring measures (sequence identity, sequence similarity, and normalized alignment score). We also analyzed the performance of SiteHopper pocket similarity. Figure 6 shows the bootstrapped ROC curve and the AUC with 95% confidence interval for each proteincentric approach. Figure 7 shows the pairwise comparison of these approaches. Among sequence-based approaches, the sequence source that was used for the sequence alignment had the most significant impact on the recognition performance. The alignment of pocket residues achieved the best performances (identity AUC = 90.0%, similarity AUC = 90.4%, and normalized score AUC = 90.2%), whereas the alignment of the full protein sequences resulted in the worst performances (identity AUC = 56.9%, similarity AUC = 54.5%, and normalized score AUC = 65.9%). This observation is consistent with Kruger’s study,10 which demonstrated that the conservation of protein functions depends to a larger degree on binding site residues than on the overall sequence. The alignment of pocket sequences resulted in AUCs that were 2− 3.5% lower than those obtained by aligning the pocket residues only. Furthermore, there was no significant difference in performance among scoring by sequence identity, similarity, or normalized score using pocket sequence and residues. However, the normalized score was significantly better than the identity and similarity scores (AUC = 65.9% versus 56.9% and 54.5%, respectively) when when full protein sequences were used. The performance of SiteHopper pocket shape similarity (AUC = 88.5%) was comparable to the performance of pocket sequence alignments (AUC = 86.5%−88.2%) but slightly worse than that of pocket residue alignments (AUC = 90.0%−90.4%) with p < 1.0 × 10−3.

fingerprints outperform all of the other approaches with statistical significance. Circular and Tree fingerprints outperform MACCS and LINGO fingerprints, but there is no significant difference in performance between them. The performance of ROCS is comparable to that of MACCS fingerprints but inferior to those of Circular, Tree, and Path fingerprints in terms of overall AUCs. However, ROCS has a significant advantage when the chemical space is diverse, as discussed below. While the AUC measurement summarizes the overall performance based on the whole benchmark data set, it is important to investigate how 2D ligand similarity influences the recognition performance of 3D ligand shape similarity. Figure 5 shows the AUC achieved by ROCS as a function of the Tanimoto coefficient based on Path fingerprints. Expectedly, the recognition performance of both approaches decreases as the structural similarity drops to 50%. However, for a Tanimoto coefficient below 40% (1605 positive and 195 negative protein pairs), ROCS (AUC = 59.6%) outperforms Path fingerprints (AUC = 51.5%) with a p value of 2.34 × 10−5. This result demonstrates the potential of ROCS in identifying targets shared by structurally distinct ligands with similar shapes. For a Tanimoto coefficient below 25%, the high dispersion of the AUCs indicates that it could not be reliably estimated (because of the small sample size). Performance of Protein-Centric Approaches in Positive Protein Pair Discrimination. We evaluated the performance of sequence-based approaches using three F

DOI: 10.1021/acs.jcim.6b00118 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

Figure 6. Performance of protein-centric approaches in positive protein pair discrimination on the same benchmark data set used for ligand-centric approaches. The bootstrapped ROC curves and AUCs with 95% confidence intervals calculated from 2000 bootstrap replicates for (A) SiteHopper, (B) sequence identity, (C) sequence similarity, and (D) normalized alignment score were evaluated for full protein sequence, pocket sequence, and pocket residues.

shape similarity in identifying cross-target activity among unrelated protein pairs with similar binding pockets. For pocket residue identity below 20%, the high dispersion of the AUCs indicates that it could not be reliably estimated (because of the small sample size). Pros and Cons of Pocket Shape versus Pocket Sequence. We describe three different examples to demonstrate the pros and cons of using pocket shape versus pocket sequence to identify potential cross-target activity. We believe that both methods should be applied, as each method has its advantages and disadvantages. Nuclear Hormone Receptors. Androgen receptor (AR) and estrogen receptor (ER) both belong to the nuclear hormone receptor family, which directly regulates gene expression to control development, homeostasis, and metabolism. The alignment of their pocket sequences and structural superposition are shown in Figure 9A,B, respectively. The identities of the pocket sequences and pocket residues are as low as 23.8% and 29.2%, respectively. However, AR and ER have similar binding pockets with a SiteHopper score of 2.0 (p = 5.88 × 10−7). Their similar pockets thus explain why, despite

It is important to note that for direct comparison of approaches, the AUCs mentioned above for protein-centric approaches were generated using the same benchmark protein pairs as the results obtained for ligand-centric approaches. Thus, 100 out of 379 negative protein pairs were not included because one or both proteins of the pair did not have a potent active ligand recorded in ChEMBL, and hence, these 100 pairs could not be scored with a ligand-based approach. However, we also calculated the AUCs for protein-centric approaches on the whole benchmark data set and obtained virtually the same results, shown in Figure S1 in the Supporting Information. Next, we analyzed the effects of related and unrelated proteins in the recognition performance of protein-centric approaches on the whole benchmark data set. Figure 8 shows the AUC as a function of pocket residue identity. The AUC remains stable as the pocket residue identity of protein pairs decreases to 40%. Interestingly, for pocket residue identities less than 30% (555 positive and 302 negative protein pairs), SiteHopper (AUC = 75.6%) significantly outperforms pocket residue identity (AUC = 53.7%) with a p value of 4.96 × 10−17. This result demonstrates the potential of SiteHopper pocket G

DOI: 10.1021/acs.jcim.6b00118 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

Figure 7. Pairwise comparison of protein-centric approaches on the whole benchmark data set. The statistical p value of each pairwise comparison was generated from 2000 bootstrap replicates and is colored red if p < 0.005.

Hydrolases. Tissue-type plasminogen activator (PLAT) mediates proteolysis, tissue remodeling and degradation, and neuronal migration by converting zymogen plasminogen to plasmin. Prothrombin (F2) plays an important role in the coagulation cascade by converting fibrinogen into fibrin. The identities of their pocket sequences and pocket residues are relatively low (31.2% and 39.0%, respectively; Figure 9C). However, PLAT and F2 have similar pockets with a SiteHopper score of 1.75 (p = 1.28 × 10−7), and indeed, they share two common ligands with bioactivity < 0.1 μM according to the ChEMBL database. Kinases. Cyclin-dependent kinase-like 2 and 3 (CDK2 and CDK3, respectively) are members of the cyclin-dependent protein kinase family, which plays a crucial role in regulating the cell cycle. These two proteins share common active ligands in the ChEMBL database. The identities of their pocket sequences and pocket residues are as high as 65.7% (p = 1.51 × 10−7) and 88.6% (p = 9.15 × 10−10), respectively, as shown in Figure 9E. From that we would expect these two proteins to have similar pockets, but the SiteHopper score was 1.2 (p = 9.08 × 10−3), which is relatively low. Here we found that the ligand 38R in the crystal structure 3ZDU (CDK3) induced a bigger pocket than the ligand TC0 in the crystal structure 4BBM (CDK2). Because of the different pocket sizes, SiteHopper does not recognize them as similar pockets. In contrast, the sequence alignment is independent of the three-dimensional coordinates and thus is particularly useful in identifying closely related targets. Comparison of Pocket Shape versus Ligand Shape among Proteins Sharing Common Ligands. Here we discuss the relationship between pocket shape and ligand shape among protein pairs with common active ligands. One question of interest is whether similar pockets are occupied exclusively by similar-shaped ligands and vice versa. First we analyzed the subset of positive protein pairs with common cocrystallized

Figure 8. Performance in positive protein pair discrimination on the whole benchmark data set by the effects of pocket residue identity. The AUCs for SiteHopper and sequence-based approaches were calculated for the subset of protein pairs with pocket residue identity less than each cutoff. The mean AUCs with 95% confidence intervals were calculated from 2000 bootstrap replicates.

the low sequence identity, AR and ER share some common ligands in the ChEMBL database. For instance, the smallmolecule danazol has been demonstrated to interact with both AR and ER.34 H

DOI: 10.1021/acs.jcim.6b00118 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

Figure 9. Sequence and pocket comparison. (A) Pocket sequence alignment and (B) SiteHopper-based pocket alignment of androgen receptor (PDB entry 1GS4, orange) and estrogen receptor (PDB entry 1GWQ, cyan). (C) Pocket sequence alignment and (D) SiteHopper-based pocket alignment of tissue-type plasminogen activator (PDB entry 1A5H, orange) and prothrombin (PDB entry 1ETS, cyan). (E) Pocket sequence alignment and (F) SiteHopper-based pocket alignment between cyclin-dependent kinase-like 2 (PDB entry 4BBM, cyan) and 3 (PDB entry 3ZDU, orange). The binding site residues are displayed as uppercase letters and colored green.

ligands (263 pairs). Figure 10A shows the root-mean-square deviations (RMSDs) of the bound conformations of common cocrystallized ligands plotted against SiteHopper pocket similarity scores. With some exceptions, common cocrystallized ligands tend to bind to similar pockets with similar conformations (RMSD < 1.0). Next, we analyzed the full data set in terms of SiteHopper pocket shape similarity and

ROCS ligand shape similarity. Among 3346 protein pairs having similar pockets with a SiteHopper score > 1.3 (p = 1.62 × 10−3), 99.4% are positive protein pairs (true positives). Likewise, among 628 protein pairs having similar cocrystallized ligands with a ROCS score > 1.3 (p = 5.20 × 10−4), 96.8% are positive protein pairs (true positives). This result confirms that similar pockets and similar-shaped ligands are both predictive of I

DOI: 10.1021/acs.jcim.6b00118 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

Figure 10. Relationship between pocket shape and ligand shape. (A) Root-mean-square deviation (RMSD) of the bound conformations of common cocrystallized ligands plotted against SiteHopper pocket similarity score. (B) ROCS ligand shape similarity versus SiteHopper pocket shape similarity for all positive protein pairs. The red and blue dashed lines indicate a SiteHopper score of 1.3 and a ROCS TanimotoCombo score of 1.3, respectively.

approaches when PDB ligands were used instead of ChEMBL ligands (Figure 3). We hypothesize that the superior performance attained with ChEMBL ligands might be a consequence of the usual practice in the medicinal chemistry literature to publish the bioactivity data for multiple analogues of the same compound series against a couple of different targets, say target kinase X and related antitargets kinase Y and kinase Z. For this reason, it would be easier to establish an association among kinases X, Y, and Z using ChEMBL literature data rather than PDB ligands. Moreover, it is important to note that the benchmark performance in retrospective data sets should not be construed as prospective prediction accuracy; nevertheless, benchmarking is still useful to study the relative performance of different methods, and we observed the same ranking of ligand-based methods using ChEMBL ligands versus PDB ligands. We noted a consistent trend across protein-centric and ligand-centric approaches. That is, while 2D-based approaches (e.g., protein sequence and chemical fingerprints) performed better than 3D-based approaches (e.g., ligand shape and pocket shape) on the entire benchmark data set, 3D-based approaches achieved superior performance for a subset of the data set involving structurally dissimilar ligands or unrelated proteins with similar binding pockets. Thus, 3D-based approaches have the potential of discovering nonobvious relationships among ligands/proteins, which complements the domain of applicability of 2D-based approaches. The finding that protein-centric approaches outperformed ligand-centric ones may be rationalized in terms of the relationship between pocket shape and ligand shape: we observed that similar ligands tend to bind to similar pockets (one-to-one relationship) while similar pockets tend to accept different-shaped ligands (one-to-many relationship). As discussed in the previous section, our data set has nearly 5 times more similar pocket pairs than similar ligand pairs. Thus, pocket comparison is, to a certain extent, ligand-agnostic and, for being less specific, offers more recognition power in identifying protein pairs sharing common (potentially unknown) ligands. However, in practice, pocket comparisons are

cross-target activity. However, the coverage or fraction of the data set with similar-shaped ligands (628) is much smaller than the fraction of the data set with similar-shaped pockets (3346). Figure 10B shows a plot of SiteHopper pocket similarity against the ROCS ligand shape similarity for all positive protein pairs. Nearly half of the positive protein pairs (50.4%) have similar pockets with a SiteHopper score > 1.3, but only 9.2% of the positive protein pairs have similar ligands with ROCS score > 1.3. Among those 9.2% of the positive protein pairs with similar ligands, 83.6% have similar pockets with a SiteHopper score > 1.3. Thus, similar ligands tend to bind to similar pockets. However, the reverse is not necessarily true. Among the 50.4% of positive protein pairs with similar pockets, only 15.3% have similar-shaped ligands. In summary, while similar ligands tend to bind to similar pockets, similar pockets may accept a range of different-shaped ligands.



CONCLUSION The performance of protein-centric and ligand-centric approaches in recognizing protein pairs with common active ligands was evaluated on a large benchmark data set assembled from the ChEMBL bioactivity database and the PDB. Sequence-based alignment of pocket residues achieved the best overall performance (identity AUC = 90.0%, similarity AUC = 90.4%, and normalized score AUC = 90.2%) with no significant difference in performance among the three scoring measures (Figures 6 and 7). However, SiteHopper pocket shape similarity significantly outperformed sequence-based approaches for unrelated proteins with pocket residue identity < 30% (AUC = 75.6% versus AUC = 53.7%; Figure 8). For the entire benchmark data set, SiteHopper achieved an AUC of 88.7% (Figure S1). Among ligand-centric approaches, Path fingerprints resulted in the best performance (AUC = 77.9%) when bioactive ligands from ChEMBL were used (Figure 3). However, ROCS ligand shape similarity outperformed Path fingerprints for structurally dissimilar ligands with a Tanimoto coefficient < 40% (AUC = 59.6% versus AUC = 51.5%). Overall, ROCS achieved an AUC of 72.9%. A significant drop in performance was noted for all of the ligand-centric J

DOI: 10.1021/acs.jcim.6b00118 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

(10) Kruger, F. A.; Overington, J. P. Global Analysis of Small Molecule Binding to Related Protein Targets. PLoS Comput. Biol. 2012, 8, e1002333. (11) Xie, L.; Bourne, P. E. Detecting Evolutionary Relationships across Existing Fold Space, Using Sequence Order-Independent Profile-Profile Alignments. Proc. Natl. Acad. Sci. U. S. A. 2008, 105, 5441−5446. (12) Kinnings, S. L.; Liu, N.; Buchmeier, N.; Tonge, P. J.; Xie, L.; Bourne, P. E. Drug Discovery Using Chemical Systems Biology: Repositioning the Safe Medicine Comtan to Treat Multi-Drug and Extensively Drug Resistant Tuberculosis. PLoS Comput. Biol. 2009, 5, e1000423. (13) Schmitt, S.; Kuhn, D.; Klebe, G. A New Method to Detect Related Function among Proteins Independent of Sequence and Fold Homology. J. Mol. Biol. 2002, 323, 387−406. (14) Najmanovich, R.; Kurbatova, N.; Thornton, J. Detection of 3d Atomic Similarities and Their Use in the Discrimination of Small Molecule Protein-Binding Sites. Bioinformatics 2008, 24, i105−111. (15) Gao, M.; Skolnick, J. Apoc: Large-Scale Identification of Similar Protein Pockets. Bioinformatics 2013, 29, 597−604. (16) Morris, R. J.; Najmanovich, R. J.; Kahraman, A.; Thornton, J. M. Real Spherical Harmonic Expansion Coefficients as 3d Shape Descriptors for Protein Binding Pocket and Ligand Comparisons. Bioinformatics 2005, 21, 2347−2355. (17) Chikhi, R.; Sael, L.; Kihara, D. Real-Time Ligand Binding Pocket Database Search Using Local Surface Descriptors. Proteins: Struct., Funct., Genet. 2010, 78, 2007−2028. (18) von Behren, M. M.; Volkamer, A.; Henzler, A. M.; Schomburg, K. T.; Urbaczek, S.; Rarey, M. Fast Protein Binding Site Comparison Via an Index-Based Screening Technology. J. Chem. Inf. Model. 2013, 53, 411−422. (19) Shulman-Peleg, A.; Nussinov, R.; Wolfson, H. J. Recognition of Functional Sites in Protein Structures. J. Mol. Biol. 2004, 339, 607− 633. (20) Kalinina, O. V.; Wichmann, O.; Apic, G.; Russell, R. B. Combinations of Protein-Chemical Complex Structures Reveal New Targets for Established Drugs. PLoS Comput. Biol. 2011, 7, e1002043. (21) Batista, J.; Hawkins, P. C. D.; Tolbert, R.; Geballe, M. T. Sitehopper - a Unique Tool for Binding Site Comparison. J. Cheminf. 2014, 6 (Suppl. 1), P57. (22) Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; Overington, J. P. Chembl: A Large-Scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2012, 40, D1100−1107. (23) Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E. The Protein Data Bank. Nucleic Acids Res. 2000, 28, 235−242. (24) Grant, J. A.; Gallardo, M. A.; Pickup, B. T. A Fast Method of Molecular Shape Comparison: A Simple Application of a Gaussian Description of Molecular Shape. J. Comput. Chem. 1996, 17, 1653− 1666. (25) Hawkins, P. C.; Skillman, A. G.; Nicholls, A. Comparison of Shape-Matching and Docking as Virtual Screening Tools. J. Med. Chem. 2007, 50, 74−82. (26) Wlodek, S.; Skillman, A. G.; Nicholls, A. Automated Ligand Placement and Refinement with a Combined Force Field and Shape Potential. Acta Crystallogr., Sect. D: Biol. Crystallogr. 2006, 62, 741− 749. (27) Kelley, B.; Brown, S.; Warren, G. L.; Muchmore, S. W. Posit: Flexible Shape-Guided Docking for Pose Prediction. J. Chem. Inf. Model. 2015, 55, 1771−1780. (28) OpenEye Scientific Software (Santa Fe, NM). Spicoli TK. http://www.eyesopen.com/spicoli-tk (accessed Aug 24, 2016). (29) OpenEye Scientific Software (Santa Fe, NM). Shape TK. http://www.eyesopen.com/shape-tk (accessed Aug 24, 2016). (30) Cock, P. J.; Antao, T.; Chang, J. T.; Chapman, B. A.; Cox, C. J.; Dalke, A.; Friedberg, I.; Hamelryck, T.; Kauff, F.; Wilczynski, B.; de Hoon, M. J. Biopython: Freely Available Python Tools for

limited by the availability of protein structures, and the results presented above suggest exploiting the complementarity between 2D and 3D approaches to maximize success.



ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.6b00118. Performance of protein-centric approaches in positive protein pair discrimination on the whole benchmark data set (PDF) List of positive and negative protein pairs used in the benchmark (protein_pairs.tsv) (ZIP)



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Present Addresses

∥ Y.-C.C.: Molsoft L.L.C., 11199 Sorrento Valley Road #209, San Diego, CA 92121, USA. ⊥ W.P.W.: Relay Therapeutics, 215 First Street, Cambridge, MA 02142, USA.

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS For their valuable comments and suggestions we thank the members of the Modeling & Informatics Group at Vertex Pharmaceuticals Incorporated, especially Emanuele Perola, Albert Pierce, and Michael Bower. We also thank Jonathan Weiss and Gary Smith for their computational support.



REFERENCES

(1) Keiser, M. J.; Setola, V.; Irwin, J. J.; Laggner, C.; Abbas, A. I.; Hufeisen, S. J.; Jensen, N. H.; Kuijer, M. B.; Matos, R. C.; Tran, T. B.; Whaley, R.; Glennon, R. A.; Hert, J.; Thomas, K. L.; Edwards, D. D.; Shoichet, B. K.; Roth, B. L. Predicting New Molecular Targets for Known Drugs. Nature 2009, 462, 175−181. (2) Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742−754. (3) Daylight Fingerprint. http://www.daylight.com (accessed Aug 24, 2016). (4) Gregori-Puigjané, E.; Setola, V.; Hert, J.; Crews, B. A.; Irwin, J. J.; Lounkine, E.; Marnett, L.; Roth, B. L.; Shoichet, B. K. Identifying Mechanism-of-Action Targets for Drugs and Probes. Proc. Natl. Acad. Sci. U. S. A. 2012, 109, 11178−11183. (5) Lounkine, E.; Keiser, M. J.; Whitebread, S.; Mikhailov, D.; Hamon, J.; Jenkins, J. L.; Lavan, P.; Weber, E.; Doak, A. K.; Cote, S.; Shoichet, B. K.; Urban, L. Large-Scale Prediction and Testing of Drug Activity on Side-Effect Targets. Nature 2012, 486, 361−367. (6) AbdulHameed, M. D.; Chaudhury, S.; Singh, N.; Sun, H.; Wallqvist, A.; Tawa, G. J. Exploring Polypharmacology Using a RocsBased Target Fishing Approach. J. Chem. Inf. Model. 2012, 52, 492− 505. (7) Rush, T. S., 3rd; Grant, J. A.; Mosyak, L.; Nicholls, A. A ShapeBased 3-D Scaffold Hopping Method and Its Application to a Bacterial Protein-Protein Interaction. J. Med. Chem. 2005, 48, 1489−1495. (8) Huang, N.; Shoichet, B. K.; Irwin, J. J. Benchmarking Sets for Molecular Docking. J. Med. Chem. 2006, 49, 6789−6801. (9) Lee, D.; Redfern, O.; Orengo, C. Predicting Protein Function from Sequence and Structure. Nat. Rev. Mol. Cell Biol. 2007, 8, 995− 1005. K

DOI: 10.1021/acs.jcim.6b00118 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling Computational Molecular Biology and Bioinformatics. Bioinformatics 2009, 25, 1422−1423. (31) Fawcett, T. An Introduction to ROC Analysis. Pattern Recogn. Lett. 2006, 27, 861−874. (32) Robin, X.; Turck, N.; Hainard, A.; Tiberti, N.; Lisacek, F.; Sanchez, J. C.; Muller, M. Proc: An Open-Source Package for R and S + to Analyze and Compare ROC Curves. BMC Bioinf. 2011, 12, 77. (33) Carpenter, J.; Bithell, J. Bootstrap Confidence Intervals: When, Which, What? A Practical Guide for Medical Statisticians. Stat. Med. 2000, 19, 1141−1164. (34) Barbieri, R. L.; Lee, H.; Ryan, K. J. Danazol Binding to Rat Androgen, Glucocorticoid, Progesterone, and Estrogen Receptors: Correlation with Biologic Activity. Fertil. Steril. 1979, 31, 182−186.

L

DOI: 10.1021/acs.jcim.6b00118 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX