ATPbind: accurate protein-ATP binding site prediction by combining sequence-profiling and structure-based comparisons Jun Hu, Yang Li, Yang Zhang, and Dong-Jun Yu J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.7b00397 • Publication Date (Web): 23 Jan 2018 Downloaded from http://pubs.acs.org on January 24, 2018
Journal of Chemical Information and Modeling
ATPbind: Accurate Protein-ATP Binding Site Prediction by Combining Sequence-Profiling and Structure-Based Comparisons
Jun Hu†,‡, Yang Li†,‡, Yang Zhang‡,* and Dong-Jun Yu†,*
† School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing, 210094, P. R. China
‡ Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw, Ann Arbor, MI 48109-2218, USA
* Address correspondence to Yang Zhang at [email protected] or Dong-Jun Yu at [email protected]
ABSTRACT: Protein-ATP interactions are ubiquitous in a wide variety of biological processes. Correctly locating ATP-binding sites from protein information is an important but challenging task for protein function annotation and drug discovery. However, there is no method that can optimally identify ATP-binding sites for different proteins. In this study, we report a new composite predictor, ATPbind, for ATP-binding sites by integrating the outputs of two template-based predictors (i.e., S-SITE and TM-SITE) and three discriminative sequence-driven features of proteins: position specific scoring matrix, predicted secondary structure, and predicted solvent accessibility. In ATPbind, we assembled multiple support vector machines (SVMs) based on a random under-sampling technique to cope with the serious imbalance phenomenon between the numbers of ATP-binding sites and of non-ATP-binding sites. We also constructed a new gold-standard benchmark dataset consisting of 429 ATP-binding proteins from the PDB database to evaluate and compare the proposed ATPbind with other existing predictors. Starting from a query sequence and predicted I-TASSER models, ATPbind can achieve an average accuracy of 72%, covering 62% of all ATP binding sites while achieving a Matthews Correlation Coefficient value that is significantly higher than that of other state-of-the-art predictors.
KEYWORDS: Protein-ATP binding site prediction; Structure-based comparison; Sequence-profiling
INTRODUCTION
Interactions between proteins and ligands are indispensable for biological activities and play important roles in a wide variety of biological processes 1-3. Hence, accurately locating the protein-ligand binding sites or pockets is of significant importance for both analyzing protein function and designing novel drugs 4-7. Tremendous wet-lab efforts have been made to uncover the intrinsic mechanisms of protein-ligand interactions, and thousands of protein-ligand interaction structure complexes have been deposited into the PDB 8. However, identifying protein-ligand binding sites via wet-lab experimental technologies is often cost-intensive and time-consuming. Due to the importance of protein-ligand interactions and the difficulty of experimentally identifying the binding sites, the development of efficient and automatic computational methods for the fast prediction of protein-ligand binding sites has become an increasingly important problem in bioinformatics, especially when faced with the large-scale protein sequences of the post-genomic era 9, 10.
Many computational methods have emerged for predicting protein-ligand binding sites during the past decades 9-12. These methods can be generally grouped into two categories: general-purpose methods and ligand-specific methods. In the early stage, general-purpose predictors, which predict ligand-binding sites (or pockets) regardless of the ligand types, dominated the field of protein-ligand binding site prediction, including (to name a few) LIGSITE 13, CASTp 14, SURFNET 15, POCKET 16, Fpocket 17, Q-SiteFinder 18, SITEHOUND 19, and 3DLigandSite 20.
Recently, another general-purpose predictor, COACH 9, a meta-server approach to protein-ligand binding site prediction, was designed. In COACH, two template-based predictors, i.e., TM-SITE and S-SITE, were first proposed for complementary binding site prediction based on binding-specific substructure comparison and sequence profile alignment, respectively; the binding models from TM-SITE and S-SITE were then combined with results from other general-purpose predictors, such as COFACTOR 21, FINDSITE 22, and ConCavity 11, to obtain the final ligand-binding site prediction 9.
Since different ligands tend to bind diverse types of residues with prominent specificities due to the specific roles, sizes, and distributions of protein-ligand interactions 23, the second type of predictors, ligand-specific predictors, which are designed to predict binding sites (or pockets) for specific ligand types, have been increasingly of interest. Such predictors include NsitePred 24, TargetS 10, TargetSOS 25, and TargetNUCs 26 for nucleotides; FINDSITE-metal 27 and CHED 28 for metal; DNABR 29 and MetaDBSite 30 for DNA; and IonCom 31 for metal and acid radical ion binding site predictions. These studies demonstrated that ligand-specific binding site predictors are often superior to general-purpose binding site predictors due to the added consideration of the physicochemical features of specific ligand-protein interactions.
Among the many ligands, adenosine-5’-triphosphate (ATP) is of particular interest. ATP is a nucleotide, also called a nucleoside triphosphate, which is a small molecule that is used in cells as a coenzyme and plays an essential role in membrane transport, cellular motility, muscle contraction, signaling, the transcription and replication of DNA, and various metabolic processes 32. It interacts with proteins through protein-ATP binding sites and provides chemical energy to proteins via the hydrolysis of ATP 33. The proteins can then perform various biological functions using the chemical energy. Additionally, ATP-binding sites are valuable drug targets for antibacterial and anti-cancer chemotherapy 34. Hence, accurately localizing the protein-ATP binding sites is of significant importance for both protein function annotation and drug discovery.
ATPint 32 is one of the first custom-designed computational predictors for identifying ATP-specific binding sites, and it was trained with a position specific scoring matrix (PSSM) and several other sequential descriptors, using a dataset consisting of 168 non-redundant ATP-interacting proteins. ATPsite 35 was later proposed by Kurgan and coworkers; this system combined PSSM and an SVM and was trained on a larger dataset with 227 non-redundant ATP-interacting proteins. However, the imbalanced learning problem 36 embedded in protein-ATP binding site prediction, i.e., that the number of non-ATP-binding sites is much larger than the number of ATP-binding sites, could decrease the final prediction performance and is ignored by ATPint and ATPsite. To solve the imbalanced learning problem and enhance the prediction accuracy, we recently proposed TargetATPsite 37 by integrating random under-sampling (RUS) and AdaBoost 36 ensemble algorithms. In addition to the above three ATP-specific predictors, many nucleotide-specific binding site predictors, e.g., NsitePred 24, TargetS 33, and TargetNUCs 26, which also contain an ATP-binding site prediction model, can be used to predict
the ATP-binding sites. These existing approaches have the advantage of generating predictions from sequence alone, but the MCC of the predictions is low (typically approximately 0.580 at 52% sensitivity) because sequence information cannot directly show protein function.
Despite the progress made in ATP-binding prediction, most predictors are based on protein sequence information, whereas protein structure information, which has demonstrated a significant advantage in other ligand binding studies 9, 21, has not been utilized. In this study, we aim to systematically examine the impact of structure-based features on ATP binding prediction by developing a new meta ATP-binding site predictor called ATPbind, which integrates the outputs of two template-based predictors, i.e., S-SITE and TM-SITE, with three sequence-based features, i.e., the position specific scoring matrix, the predicted secondary structure, and the predicted solvent accessibility. Here, S-SITE is a sequence-template-based predictor, and TM-SITE is a structure-template-based predictor. To ensure that the proposed ATPbind is an ATP-specific predictor, we extend S-SITE and TM-SITE into ATP-specific predictors and rename them S-SITEatp and TM-SITEatp, respectively. A new gold-standard benchmark dataset consisting of 429 non-redundant ATP-binding proteins was collected from the PDB database and is used to systematically examine the strengths and weaknesses of such a composite combination of sequence and structure information for ATP binding predictions. In particular, given the significant class imbalance of the ATP-binding data, we introduced a mean-ensemble-based method to integrate multiple support vector machines (SVMs) based on the random under-sampling technique.
MATERIALS AND METHODS
Benchmark datasets. We constructed a dataset of 2,144 ATP-binding protein chains, named PATP-2144, which had clear target annotations and had been deposited into the Protein Data
Bank (PDB) 8 before November 5, 2016. We further removed the redundant sequences by using CD-hit software
38
with sequence identity 30% to the query sequence are excluded in S-SITEatp. Again, a sliding window (W=17) is used to represent the S-SITEatp-based feature of each residue. The S-SITEatp-based feature dimensionality is 17.
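The sliding-window representation mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sliding_window_features`, the example scores, and especially the zero-padding at the sequence termini are assumptions, since the text does not state how terminal residues are handled.

```python
from typing import List


def sliding_window_features(scores: List[float], window: int = 17,
                            pad_value: float = 0.0) -> List[List[float]]:
    """Return one `window`-long feature vector per residue.

    Each vector holds the score of the residue itself (at the centre) and
    of its (window - 1) / 2 neighbours on either side. Positions beyond
    the termini are filled with `pad_value` (an assumption for this sketch).
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd")
    half = window // 2
    padded = [pad_value] * half + list(scores) + [pad_value] * half
    return [padded[i:i + window] for i in range(len(scores))]


# Hypothetical per-residue binding probabilities for a 5-residue fragment.
probs = [0.1, 0.8, 0.9, 0.2, 0.0]
features = sliding_window_features(probs, window=17)
```

With W=17, each residue is described by 17 values, which matches the stated feature dimensionality of 17 for a single per-residue score.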
The comparison results between S-SITEatp and S-SITE can be found in Text S1 in the Supporting Information (SI).
TM-SITEatp-based feature. TM-SITE is a structure-template-based method that is also designed to derive the general-purpose binding sites by structurally comparing the query protein with the template proteins 9. Similar to S-SITE, it is time-consuming to use TM-SITE to detect the ATP-specific binding sites. In this study, we extended TM-SITE to be ATP-specific, naming it TM-SITEatp, by replacing the BioLip library 43 with an ATP-specific database (PATP-2144 in this paper). For a given protein with a known structure, TM-SITEatp can directly predict the ATP-binding probability for each target residue. For a given protein sequence without a 3D structure, we first construct its structural model with I-TASSER 44. The ATP-binding probability of each residue is then predicted by TM-SITEatp. To avoid homologous contamination, homologous templates with a sequence identity >30% to the query have been excluded from both the I-TASSER and ATP-binding libraries. Finally, the TM-SITEatp-based feature (whose dimensionality is 17) is obtained using a sliding window of size 17. To show the performance of TM-SITEatp, we compare TM-SITEatp and the original TM-SITE in Text S2 in the SI.
Learning from imbalanced data. Protein-ATP binding site prediction is a typical imbalanced learning problem, where the size of the majority class (non-ATP-binding residues) is much larger than that of the minority class (ATP-binding residues) 37. From Table 1, we see that the imbalance ratio between the number of non-ATP-binding residues and the number of ATP-binding residues is larger than 21. Compared to the minority class, the majority class contains a large amount of redundant information in the original dataset, which can decrease the prediction performance and increase the training and testing time. To overcome this hurdle, random under-sampling (RUS) 36 and mean ensemble (ME) schemes are combined to solve the imbalanced learning problem.
RUS is used to reduce the number of majority-class samples in this study. However, RUS comes with the disadvantage of potentially losing useful information 36. To overcome this disadvantage, we employ the ME scheme to enhance the final prediction accuracy. More specifically, in the training stage, we randomly sample the majority class T times (in the present study, T=10) with RUS and thus obtain T majority training subsets. Each of the T majority training subsets is combined with the minority class (ATP-binding sites) training set to constitute T new training datasets. Then, a machine learning model (SVM in this study) is trained on each of the T new training datasets. In the prediction stage, for each residue in a given protein sequence, its probability of belonging to the ATP-binding site class is predicted by each of the T prediction models. Then, the T probabilities of the residue belonging to the ATP-binding site class are fused by ME. The details of ME are described as follows. Let $\{p_t\}_{t=1}^{T}$ be the T predicted probabilities, where $p_t$ is the probability output of the t-th prediction model. ME can then be represented as follows:
$$p_{ME} = \frac{1}{T} \sum_{t=1}^{T} p_t \tag{2}$$
where $p_{ME}$ is the average probability.
Support vector machine (SVM). An SVM 45 is utilized to construct the base classifiers. We use LIBSVM 46, which is freely available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/, to implement the SVM algorithm. Here, a radial basis function is chosen as the kernel function. The kernel width parameter σ and the regularization parameter γ, which are the two most important
parameters, are optimized over a five-fold cross-validation using a grid search strategy in the LIBSVM tool. The details of cross-validation can be found in Text S3 in the SI.
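The RUS and mean-ensemble procedure described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the base learner here is a toy nearest-centroid scorer standing in for the LIBSVM models, each under-sampled majority subset is assumed to have the same size as the minority class, and all names (`train_centroid_model`, `train_rus_ensemble`, `mean_ensemble`) and the toy data are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = Sequence[float]
Model = Callable[[Vector], float]  # maps a feature vector to P(binding)


def train_centroid_model(pos: List[Vector], neg: List[Vector]) -> Model:
    """Toy stand-in for the SVM base classifier (NOT the LIBSVM model)."""
    def centroid(rows: List[Vector]) -> List[float]:
        return [sum(col) / len(rows) for col in zip(*rows)]

    c_pos, c_neg = centroid(pos), centroid(neg)

    def predict(x: Vector) -> float:
        # Closer to the positive centroid gives a score above 0.5.
        return 1.0 / (1.0 + math.exp(math.dist(x, c_pos) - math.dist(x, c_neg)))

    return predict


def train_rus_ensemble(pos: List[Vector], neg: List[Vector], n_models: int = 10,
                       seed: int = 0,
                       train: Callable[..., Model] = train_centroid_model) -> List[Model]:
    """Train T models, each on all positives plus a random negative subset."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Random under-sampling: subset size equal to the minority class
        # (an assumption made for this sketch).
        neg_subset = rng.sample(neg, k=min(len(pos), len(neg)))
        models.append(train(pos, neg_subset))
    return models


def mean_ensemble(models: List[Model], x: Vector) -> float:
    """Eq. 2: average the T predicted probabilities."""
    return sum(m(x) for m in models) / len(models)


# Toy, strongly imbalanced data (22 negatives per positive).
data_rng = random.Random(1)
pos = [[1.0, 1.0], [0.9, 1.1]]
neg = [[data_rng.uniform(-0.3, 0.3), data_rng.uniform(-0.3, 0.3)] for _ in range(44)]
models = train_rus_ensemble(pos, neg, n_models=10)
p_binding = mean_ensemble(models, [1.0, 1.0])
p_nonbinding = mean_ensemble(models, [0.0, 0.0])
```

Any learner that outputs a per-residue probability can be passed as `train`; averaging over the T models lets every majority-class sample that was drawn in at least one subset contribute to the final score.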
Figure 1. Architecture of ATPbind.
Architecture of ATPbind. Figure 1 illustrates the architecture of the proposed ATPbind for ATP-binding site prediction. For a given protein, ATPbind extracts the above five different feature views for each target residue by calling the corresponding programs and applying the sliding window technique. In the training phase, after extracting the features of all proteins in PATP-388, we obtain an extremely imbalanced training sample set. We then employ RUS T (T=10 in this study) times to construct a set of multiple training subsets. One prediction model is trained for each subset using SVM. The ensemble prediction model is finally created using the ME method by integrating the outputs of the different models. In the prediction phase, for a protein to be predicted, the ensemble prediction model gives the probability of each residue being an ATP-binding residue. We also propose a sequence-based ATP-binding predictor, named ATPseq, that only utilizes information from the protein sequence. The only difference between ATPseq and ATPbind is that ATPseq does not use the TM-SITEatp-based feature.
Assessing predictive ability. Five evaluation indexes that are routinely used in this field, i.e., Sensitivity (Sen), Specificity (Spe), Accuracy (Acc), Precision (Pre), and the Matthews Correlation Coefficient (MCC), are utilized to evaluate predictive ability, as follows:
$$\mathrm{Sen} = \frac{TP}{TP + FN} \tag{3}$$
$$\mathrm{Spe} = \frac{TN}{TN + FP} \tag{4}$$
$$\mathrm{Acc} = \frac{TP + TN}{TP + FN + TN + FP} \tag{5}$$
$$\mathrm{Pre} = \frac{TP}{TP + FP} \tag{6}$$
$$\mathrm{MCC} = \frac{TP \cdot TN - FN \cdot FP}{\sqrt{(TP + FN) \cdot (TP + FP) \cdot (TN + FN) \cdot (TN + FP)}} \tag{7}$$
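Equations 3-7 translate directly into code. The sketch below takes the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) and returns the five indexes. The function name, the example counts, and the convention of returning 0 when a denominator is zero are assumptions made for illustration.

```python
import math
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 for an empty denominator (a convention assumed here)."""
    return numerator / denominator if denominator else 0.0


def evaluation_indexes(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute Sen, Spe, Acc, Pre, and MCC from confusion-matrix counts."""
    mcc_denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    return {
        "Sen": _ratio(tp, tp + fn),                   # Eq. 3
        "Spe": _ratio(tn, tn + fp),                   # Eq. 4
        "Acc": _ratio(tp + tn, tp + fn + tn + fp),    # Eq. 5
        "Pre": _ratio(tp, tp + fp),                   # Eq. 6
        "MCC": _ratio(tp * tn - fn * fp, mcc_denominator),  # Eq. 7
    }


# Hypothetical counts for an imbalanced residue set (100 binding, 1900 non-binding).
scores = evaluation_indexes(tp=60, tn=1875, fp=25, fn=40)
```

On imbalanced data such as this, Acc and Spe stay high even when many binding residues are missed, which is why MCC is the more informative single number.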
Here, TN, TP, FN, and FP denote true negatives, true positives, false negatives, and false positives, respectively. The MCC (ranging from -1 to 1) evaluates the overall predictive quality. A higher MCC value means a better prediction performance; an MCC of 0 corresponds to the case in which all residues are predicted as non-binding (or all as binding). The threshold that maximizes the MCC value is chosen to calculate the reported values of Sen, Spe, Acc, Pre, and MCC. Furthermore, the area under the receiver operating characteristic (ROC) curve (termed AUC), which increases with the overall prediction performance, is employed to assess the overall predictive ability.
RESULTS AND DISCUSSIONS
Assessment of the quality of the I-TASSER-modeled structures. Since the quality of the modeled structure of the protein has an impact on TM-SITEatp, we compare the structures predicted by I-TASSER 44 with the experimental structures on the PATP-TEST dataset, where the protein lengths range from 115 to 863, in terms of the TM-score and RMSD evaluation indexes 47. For each given protein, the standard I-TASSER program, which excludes all homologous template proteins with sequence identity >30% to the given sequence, generates its structural model from the query protein sequence with iterative fragment assembly simulations. We then calculate the TM-score and RMSD of the I-TASSER-modeled structures (ITAMSs) for the 41 testing proteins. The results are compiled in Figure 2. Figure 2 shows that the majority of the testing proteins (≈87.8%) can be modeled by I-TASSER with a correct fold, i.e., TM-score > 0.5, and 32 proteins (78.05% of PATP-TEST) have an RMSD below 4 Å. The overall average TM-score and RMSD for the testing proteins are 0.721 and 3.502 Å, respectively. These results indicate that the quality of the ITAMSs is acceptable for ATP-binding site prediction.
To further check the ITAMS quality, we compare the ITAMSs with the MODELLER-modeled structures (MODMSs), which are built by the MODELLER software 48, on PATP-TEST. The detailed comparison results can be found in Table S4 and Figure S1. The overall average TM-score and RMSD of the ITAMSs are 0.721 and 3.502 Å, respectively, which are approximately 0.116 and 0.695 Å better than those of the MODMSs. Concretely, there are 26 testing proteins (≈63.41%) for which the ITAMSs have a higher TM-score than the MODMSs. Meanwhile, for 58.54% of the proteins in PATP-TEST, the ITAMSs have a lower RMSD than the MODMSs. These results indicate that the ITAMSs are of higher quality than the MODMSs for ATP-binding site prediction.
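The percentages quoted for the 41 PATP-TEST proteins can be related to protein counts with simple arithmetic. In the sketch below, the counts 32 and 26 are taken from the text, whereas the counts 36 and 24 are inferred from the stated percentages (≈87.8% and 58.54%) and are not reported explicitly; the helper names are illustrative.

```python
N_TEST = 41  # number of proteins in PATP-TEST, as stated in the text


def percent(count: int, total: int = N_TEST) -> float:
    """Share of `count` in `total`, as a percentage rounded to two decimals."""
    return round(100.0 * count / total, 2)


def implied_count(pct: float, total: int = N_TEST) -> int:
    """Nearest whole number of proteins corresponding to a percentage."""
    return round(pct * total / 100.0)


rmsd_below_4 = percent(32)             # proteins with RMSD below 4 Å
higher_tm_score = percent(26)          # ITAMS beats MODMS on TM-score
correct_fold = implied_count(87.8)     # inferred count with TM-score > 0.5
lower_rmsd = implied_count(58.54)      # inferred count with lower RMSD than MODMS
```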
Figure 2. TM-score and RMSD distributions of the I-TASSER-modeled structures on PATP-TEST.
Do S-SITEatp-based and TM-SITEatp-based features help ATP-binding site prediction? In this section, the discriminative performances of three feature combinations, i.e., PSSM+PSS+PSA (PPP), PSSM+PSS+PSA+S-SITEatp (PPPS), and PSSM+PSS+PSA+S-SITEatp+TM-SITEatp (PPPST), are investigated to measure whether the S-SITEatp-based and TM-SITEatp-based features help ATP-binding site prediction. Here, the TM-SITEatp-based feature is extracted from the ground-truth 3D structure. Each feature combination is evaluated by a five-fold cross-validation test on the training dataset PATP-388 with a single SVM classifier. In each training phase of the cross-validation test, we first employ RUS to make the sample number of the majority class equal to that of the minority class, and we then train a single SVM model. Table 2 summarizes the discriminative performance of the three feature combinations on PATP-388 over five-fold cross-validation tests with the single SVM classifier.
Table 2. Performance comparison between PPP, PPPS, and PPPST features on PATP-388 over five-fold cross-validation tests with a single SVM classifier

Feature type   Sen (%)   Spe (%)   Acc (%)   Pre (%)   MCC     AUC
PPP            51.86     97.64     95.88     46.64     0.470   0.895
PPPS           58.05     98.11     96.58     55.04     0.547   0.919
PPPST          65.49     98.25     96.99     59.74     0.610   0.935
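When rows of Table 2 are compared, the gain of one feature set over another can be expressed as a relative improvement, (new − old) / old. The sketch below computes this for Sen, MCC, and AUC; reading the quoted percentage improvements as relative gains is an inference from the numbers, and the helper name `relative_gain` is an illustrative choice.

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent (two decimals)."""
    return round(100.0 * (new - old) / old, 2)


# Sen (%), MCC, and AUC values copied from Table 2.
table2 = {
    "PPP":   {"Sen": 51.86, "MCC": 0.470, "AUC": 0.895},
    "PPPS":  {"Sen": 58.05, "MCC": 0.547, "AUC": 0.919},
    "PPPST": {"Sen": 65.49, "MCC": 0.610, "AUC": 0.935},
}

# Gain of PPPS over PPP, and of PPPST over PPPS, for each index.
ppps_vs_ppp = {k: relative_gain(table2["PPPS"][k], table2["PPP"][k])
               for k in ("Sen", "MCC", "AUC")}
pppst_vs_ppps = {k: relative_gain(table2["PPPST"][k], table2["PPPS"][k])
                 for k in ("Sen", "MCC", "AUC")}
```

Relative gains make small absolute differences in MCC look larger than the corresponding differences in AUC, so the two should be read together.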
From Table 2, it is found that the PPPS and PPPST features consistently outperform the PPP feature on all six evaluation indexes. Comparing PPPS and PPP, the Sen, MCC, and AUC of PPPS are 58.05%, 0.547, and 0.919, which are relative improvements of approximately 11.94%, 16.38%, and 2.68% over those of PPP, respectively. The Sen, MCC, and AUC of PPPST are superior to those of the other two features, i.e., PPP and PPPS, with relative improvements of 12.82%, 11.52%, and 1.74%, respectively, over the second-best feature, PPPS. The P-values of Student’s t-test for the difference in MCC scores between PPPST and the other two features are both less than 10^-4. Figure S2 illustrates the corresponding ROC curves, which show that PPPST is the best feature and that PPPS is also better than PPP.
Does the mean ensemble strategy help ATP-binding site prediction? Table 3 lists the prediction performances of the single SVM classifier and the ensemble classifier, whose prediction model is obtained by using the mean ensemble strategy to integrate T (T=10) SVM models, with PPPS and PPPST features on PATP-388 over five-fold cross-validation tests.
Table 3. Performance comparison between the ensemble classifier and single SVM classifier with PPPS and PPPST features on PATP-388 over five-fold cross-validation tests

Feature type   Classifier    Sen (%)   Spe (%)   Acc (%)   Pre (%)   MCC     AUC
PPPS           Single a      58.05     98.11     96.58     55.04     0.547   0.919
PPPS           Ensembled b   57.52     98.86     97.27     66.69     0.605   0.913
PPPST          Single a      65.49     98.25     96.99     59.74     0.610   0.935
PPPST          Ensembled b   64.04     98.88     97.55     69.57     0.655   0.932

a ‘Single’ represents the single SVM classifier.
b ‘Ensembled’ represents the ensemble classifier.
From Table 3, we can find that the ensemble classifier consistently outperforms the single SVM classifier on both PPPS and PPPST features in the Spe, Acc, Pre, and MCC evaluation indexes, although the ensemble classifier has a slightly lower Sen and AUC. Taking the results with PPPST as an example, the Spe, Acc, Pre, and MCC of the ensemble classifier are 98.88%, 97.55%, 69.57%, and 0.655, which are approximately 0.6%, 0.6%, 16.5%, and 7.4% higher, in relative terms, than those of the single SVM classifier, respectively. The Sen and AUC (64.04% and 0.932) of the ensemble classifier are both slightly lower than those (65.49% and 0.935) of the single SVM classifier.
In addition to Table 3, Figures S3 and S4 show the ROC curves and the variation curves of MCC versus the false positive rate for the ensemble classifier and the single SVM classifier on the PPPS and PPPST features, respectively. Figures S3 and S4 show that the ensemble classifier outperforms the single SVM classifier in the low false positive rate (FPR=FP/(FP+TN)) regions, where the FPR is less than 11.32% on the PPPS feature and 14.60% on the PPPST feature, although the overall AUCs of the ensemble classifier are slightly lower than those of the single SVM classifier for both PPPS and PPPST features. Note that the low FPR region is more important than the high FPR region, especially in the imbalanced data learning problem. Since the maximum MCC values of the two classifiers all lie in the low FPR regions on PPPS and PPPST features, the reported MCCs (0.605 and 0.655) of the ensemble classifier are 10.6% and 7.4% higher than those of the single SVM classifier.
Comparing ATPseq and ATPbind with existing ATP-binding site predictors. In this section, we demonstrate the efficacy of ATPseq and ATPbind by comparing them with other existing ATP-binding site predictors over five-fold cross-validation tests on PATP-388 and independent validation tests on PATP-TEST.
Performance comparison over cross-validation tests. Table 4 lists the performance of ATPbind, ATPseq, TM-SITEatp, and S-SITEatp on PATP-388 over five-fold cross-validation tests. Table 4 shows that the proposed ATPbind and ATPseq are both better than TM-SITEatp and S-SITEatp, and that ATPbind is superior to the other three predictors. Compared to S-SITEatp, the MCC values of ATPseq and ATPbind for ATP-binding site prediction increase by 32.97% and 43.96%, respectively. Meanwhile, ATPseq and
ATPbind achieve improvements of 19.33% and 29.19%, respectively, in the MCC evaluation index compared with TM-SITEatp. It is noted that the results of ATPbind and TM-SITEatp are both obtained by employing the experimental 3D structure information.

Table 4. Performance comparison of ATPbind, ATPseq, TM-SITEatp, and S-SITEatp on PATP-388 over five-fold cross-validation tests

Predictor     Sen (%)   Spe (%)   Acc (%)   Pre (%)   MCC     AUC
S-SITEatp     69.88     94.47     93.53     33.47     0.455   N/A
TM-SITEatp    73.64     95.29     94.46     38.37     0.507   N/A
ATPseq        57.52     98.86     97.27     66.69     0.605   0.913
ATPbind       64.04     98.88     97.55     69.57     0.655   0.932

‘N/A’ means that the corresponding value could not be computed.
Performance comparison over independent validation tests. Table 5 lists the performance of ATPbind, ATPseq, and other existing protein-ATP binding site predictors, including three structure-based predictors (i.e., COACH 9, 3DLigandSite 20, TM-SITEatp) and six sequence-based predictors (i.e., TargetNUCs 26, TargetSOS 25, TargetS 33, TargetATPsite 37, NsitePred 24, S-SITEatp), on the independent test dataset (PATP-TEST). Figure 3 shows the ROC curves of ATPbind, ATPseq, and the abovementioned six sequence-based predictors. From Table 5 and Figure 3, it is clear that the proposed ATPbind achieves the best performance on PATP-TEST and that the proposed ATPseq outperforms the other six sequence-based predictors. The MCC and AUC values of ATPseq are consistently superior to those of all six other sequence-based predictors considered here, and improvements of 1.91% and 2.57%, respectively, are achieved compared with the second-best sequence-based performer, TargetNUCs 26. It has not escaped our notice that TargetNUCs achieves the highest Pre score,
86.81%, with similar Spe and Acc values; however, its Sen value (46.88%) is much lower than that of ATPseq (54.45%), indicating that more false negatives are incurred during prediction.

Table 5. Performance comparison of ATPbind, ATPseq, and other existing ATP-binding site predictors on the independent test dataset (PATP-TEST)

Structure   Predictor         Sen (%)   Spe (%)   Acc (%)   Pre (%)   MCC     AUC
NS          S-SITEatp         67.51     92.65     91.51     30.41     0.416   N/A
NS          NsitePred a       46.74     97.70     95.39     49.22     0.456   0.852
NS          TargetATPsite b   41.25     99.49     96.84     79.43     0.559   0.853
NS          TargetS c         51.63     98.89     96.74     68.91     0.580   0.872
NS          TargetSOS d       49.26     99.46     97.18     81.37     0.620   0.863
NS          TargetNUCs e      46.88     99.66     97.26     86.81     0.627   0.856
NS          ATPseq            54.45     99.27     97.24     78.09     0.639   0.878
ITA         TM-SITEatp        69.73     96.09     84.89     45.90     0.541   N/A
ITA         3DLigandSite f    48.81     98.58     96.32     62.08     0.532   N/A
ITA         COACH g           58.16     98.59     96.76     66.33     0.604   N/A
ITA         ATPbind           62.31     98.85     97.19     72.04     0.656   0.905
EXP         TM-SITEatp        78.78     96.27     95.48     50.14     0.607   N/A
EXP         3DLigandSite f    56.82     99.31     97.38     79.63     0.660   N/A
EXP         COACH g           63.20     98.73     97.11     70.30     0.652   N/A
EXP         ATPbind           63.06     99.03     97.40     75.62     0.677   0.915

a Results computed using the NsitePred server at http://biomine.cs.vcu.edu/servers/NsitePred.
b Results computed using the TargetATPsite server at http://www.csbio.sjtu.edu.cn/bioinf/TargetATPsite.
c Results computed using the TargetS server at http://www.csbio.sjtu.edu.cn/bioinf/TargetS.
d Results computed using the TargetSOS server at http://www.csbio.sjtu.edu.cn/bioinf/TargetSOS.
e Results computed using the TargetNUCs server at http://csbio.njust.edu.cn/bioinf/TargetNUCs/.
f Results computed using the 3DLigandSite server at http://www.sbg.bio.ic.ac.uk/~3dligandsite/ by submitting the I-TASSER-modeled or experimental protein structure.
g Results computed using the standalone program of COACH, which was downloaded at https://zhanglab.ccmb.med.umich.edu/COACH/.
‘NS’, ‘ITA’, and ‘EXP’ mean no structure, I-TASSER-modeled structure, and experimental structure, respectively. ‘N/A’ means that the corresponding value could not be computed.
ACS Paragon Plus Environment
19
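The percentage improvements quoted in the text can be checked against the values in Table 5. The following is a minimal sketch, assuming the percentages are relative changes, (new - old) / old × 100, rather than absolute differences; the helper name `relative_improvement` is illustrative and not part of ATPbind or ATPseq.

```python
def relative_improvement(new: float, old: float) -> float:
    """Relative change of `new` over `old`, in percent (assumed definition)."""
    return (new - old) / old * 100.0


# MCC and AUC values copied from Table 5 (PATP-TEST).
table5 = {
    "TargetNUCs": {"MCC": 0.627, "AUC": 0.856},
    "ATPseq": {"MCC": 0.639, "AUC": 0.878},
}

mcc_gain = relative_improvement(table5["ATPseq"]["MCC"], table5["TargetNUCs"]["MCC"])
auc_gain = relative_improvement(table5["ATPseq"]["AUC"], table5["TargetNUCs"]["AUC"])
print(f"MCC: {mcc_gain:.2f}%  AUC: {auc_gain:.2f}%")
```

Under this assumption, the sketch reproduces the 1.91% and 2.57% figures reported for ATPseq over TargetNUCs.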
Figure 3. ROC curves of ATPbind (ITA and EXP), ATPseq, TargetNUCs, TargetSOS, TargetS, TargetATPsite, and NsitePred on the independent test dataset (PATP-TEST). ‘ITA’ and ‘EXP’ mean the I-TASSER-modeled structure and the experimental structure, respectively.
The MCC and AUC values of ATPbind are 0.656 and 0.905, respectively, which are 2.66% and 3.08% higher (P-value < 0.02 in Student’s t-test for the difference in MCC score) than those of ATPseq, the best sequence-based predictor. Compared with the best general-purpose ligand-binding site predictor, COACH, ATPbind obtains Sen and MCC improvements of 7.14% and 8.61%, respectively. Furthermore, ATPbind achieves Sen and MCC improvements of 27.61% and 23.32%, respectively, over 3DLigandSite, the second-best general-purpose predictor. TM-SITEatp’s Sen (0.6973) is higher than that of ATPbind (0.6231), but its Pre (0.459) is significantly lower than that of ATPbind (0.7204), resulting in a low MCC value of 0.541. The differences in MCC between ATPbind and COACH, 3DLigandSite, and TM-SITEatp are all statistically significant, with P-values < 0.05, < 10⁻³, and < 10⁻², respectively, by Student’s t-tests.

Table 5 also lists the prediction results of the structure-dependent predictors when the experimental 3D structures of the query proteins are used. As expected, the performance of all the structure-based methods improves with the increased structural accuracy of the query proteins. ATPbind again outperforms the other three structure-based predictors in terms of MCC.

To further compare ATPbind with the existing structure-based ATP-binding site predictors, we first divide the independent test dataset, PATP-TEST, into two parts, denoted PATP-TEST-easy and PATP-TEST-hard, based on the quality of the I-TASSER-modeled structure of each protein. Concretely, PATP-TEST-easy contains 36 proteins with high-quality modeled structures, for each of which the TM-score between the I-TASSER-modeled structure and the experimental structure is higher than 0.5. PATP-TEST-hard includes 5 proteins with low-quality I-TASSER-modeled structures, for each of which the TM-score between the modeled and experimental structures is lower than 0.5. Note that protein structure pairs with a TM-score > 0.5 are mostly in the same fold⁴⁹; the TM-scores of the 41 PATP-TEST proteins can be found in Table S4.
Overall, the average TM-scores for PATP-TEST-easy and PATP-TEST-hard are 0.765 and 0.400, respectively. Next, we compare the prediction performance of ATPbind, COACH⁹, 3DLigandSite²⁰, and TM-SITEatp on the PATP-TEST-easy and PATP-TEST-hard datasets. Tables 6 and 7 present the comparison results on PATP-TEST-easy and PATP-TEST-hard, respectively.
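The easy/hard split described above amounts to thresholding the TM-score of each I-TASSER model at 0.5. Below is a minimal sketch, assuming a mapping from protein identifier to TM-score. The identifiers and scores are made-up placeholders, not values from Table S4, and `split_by_tm_score` is a hypothetical helper rather than code from the authors.

```python
from statistics import mean


def split_by_tm_score(tm_scores: dict[str, float], cutoff: float = 0.5):
    """Partition proteins into (easy, hard) subsets by model TM-score.

    Scores above `cutoff` go to the easy subset and the rest to the hard
    subset. The text does not say how a score of exactly 0.5 is handled,
    so placing it in the hard subset is an assumption of this sketch.
    """
    easy = {p: s for p, s in tm_scores.items() if s > cutoff}
    hard = {p: s for p, s in tm_scores.items() if s <= cutoff}
    return easy, hard


# Placeholder identifiers and scores, for illustration only.
example = {"protA": 0.80, "protB": 0.70, "protC": 0.45, "protD": 0.35}
easy, hard = split_by_tm_score(example)
print("easy:", sorted(easy), "mean TM-score:", round(mean(easy.values()), 3))
print("hard:", sorted(hard), "mean TM-score:", round(mean(hard.values()), 3))
```

Applied to the real TM-scores of the 41 PATP-TEST proteins, a split of this kind would be expected to give the 36/5 partition reported in the text.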
Table 6. Performance comparison of ATPbind and other existing structure-based ATP-binding site predictors on PATP-TEST-easy, for which the TM-score of the I-TASSER model is >0.5.

| Predictor | Sen (%) | Spe (%) | Acc (%) | Pre (%) | MCC |
|---|---|---|---|---|---|
| TM-SITEatp | 71.38 | 96.04 | 94.88 | 47.11 | 0.555 |
| 3DLigandSite | 49.66 | 98.67 | 96.36 | 64.84 | 0.549 |
| COACH | 60.27 | 98.49 | 96.69 | 66.30 | 0.615 |
| ATPbind | 64.14 | 98.83 | 97.20 | 72.99 | 0.670 |
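Tables 5 and 6 show that a predictor can have a higher Sen and still a lower MCC when its Pre is low, as with TM-SITEatp relative to ATPbind. The sketch below illustrates this trade-off using the standard confusion-matrix definitions of the five indices. The residue counts are invented for illustration and do not come from PATP-TEST, and `binding_site_metrics` is a hypothetical helper name.

```python
import math


def binding_site_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard per-residue evaluation indices from confusion-matrix counts."""
    sen = tp / (tp + fn)                   # sensitivity (recall)
    spe = tn / (tn + fp)                   # specificity
    acc = (tp + tn) / (tp + fp + tn + fn)  # accuracy
    pre = tp / (tp + fp)                   # precision
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"Sen": sen, "Spe": spe, "Acc": acc, "Pre": pre, "MCC": mcc}


# Hypothetical counts: 100 binding and 2000 non-binding residues in total.
# "permissive" finds more true sites but also many more false positives.
permissive = binding_site_metrics(tp=70, fp=80, tn=1920, fn=30)
conservative = binding_site_metrics(tp=62, fp=24, tn=1976, fn=38)

for name, m in [("permissive", permissive), ("conservative", conservative)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

With these invented counts, the permissive predictor has the higher Sen while the conservative one has the higher Pre and MCC. Because non-binding residues vastly outnumber binding residues, Spe and Acc stay high for both, which is why MCC is the more informative index here.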
Table 7. Performance comparison of ATPbind and other existing structure-based ATP-binding site predictors on PATP-TEST-hard for which the TM-score of the I-TASSER model is