
Article

ATPbind: accurate protein-ATP binding site prediction by combining sequence-profiling and structure-based comparisons Jun Hu, Yang Li, Yang Zhang, and Dong-Jun Yu J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.7b00397 • Publication Date (Web): 23 Jan 2018 Downloaded from http://pubs.acs.org on January 24, 2018


Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Chemical Information and Modeling

ATPbind: Accurate Protein-ATP Binding Site Prediction by Combining Sequence-Profiling and Structure-Based Comparisons

Jun Hu†,‡, Yang Li†,‡, Yang Zhang‡,* and Dong-Jun Yu†,*

† School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing, 210094, P. R. China

‡ Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw, Ann Arbor, MI 48109-2218, USA

* Address correspondence to Yang Zhang at [email protected] or Dong-Jun Yu at [email protected]

ABSTRACT: Protein-ATP interactions are ubiquitous in a wide variety of biological processes. Correctly locating ATP-binding sites from protein information is an important but challenging task for protein function annotation and drug discovery. However, there is no method that can optimally identify ATP-binding sites for different proteins. In this study, we report a new composite predictor, ATPbind, for ATP-binding sites by integrating the outputs of two template-based predictors (i.e., S-SITE and TM-SITE) and three discriminative sequence-driven features of proteins: position specific scoring matrix, predicted secondary structure, and predicted solvent accessibility. In ATPbind, we assembled multiple support vector machines (SVMs) based on a random under-sampling technique to cope with the severe imbalance between the numbers of ATP-binding sites and non-ATP-binding sites. We also constructed a new gold-standard benchmark dataset consisting of 429 ATP-binding proteins from the PDB database to evaluate and compare the proposed ATPbind with other existing predictors. Starting from a query sequence and predicted I-TASSER models, ATPbind can achieve an average accuracy of 72%, covering 62% of all ATP binding sites while achieving a Matthews Correlation Coefficient value that is significantly higher than that of other state-of-the-art predictors.

KEYWORDS: Protein-ATP binding site prediction; Structure-based comparison; Sequence-profiling

INTRODUCTION

Interactions between proteins and ligands are indispensable for biological activities and play important roles in a wide variety of biological processes 1-3. Hence, accurately locating the protein-ligand binding sites or pockets is of significant importance for both analyzing protein function and designing novel drugs 4-7. Tremendous wet-lab efforts have been made to uncover the intrinsic mechanisms of protein-ligand interactions, and thousands of protein-ligand interaction structure complexes have been deposited into the PDB 8. However, identifying protein-ligand binding sites via wet-lab experimental technologies is often cost-intensive and time-consuming. Due to the importance of protein-ligand interactions and the difficulty of experimentally identifying the binding sites, the development of efficient and automatic computational methods for the fast prediction of protein-ligand binding sites has become an increasingly important problem in bioinformatics, especially when faced with the large-scale protein sequences of the post-genomic era 9, 10.


Many computational methods have emerged for predicting protein-ligand binding sites during the past decades 9-12. These methods can be generally grouped into two categories: general-purpose methods and ligand-specific methods. In the early stage, general-purpose predictors, which predict ligand-binding sites (or pockets) regardless of the ligand types, dominated the field of protein-ligand binding site prediction, including (to name a few) LIGSITE 13, CASTp 14, SURFNET 15, POCKET 16, Fpocket 17, Q-SiteFinder 18, SITEHOUND 19, and 3DLigandSite 20.

Recently, another general-purpose predictor, COACH 9, a meta-server approach to protein-ligand binding site prediction, was designed. In COACH, two template-based predictors, i.e., TM-SITE and S-SITE, were first proposed for complementary binding site prediction based on binding-specific substructure comparison and sequence profile alignment, respectively; the binding models from TM-SITE and S-SITE were then combined with results from other general-purpose predictors, such as COFACTOR 21, FINDSITE 22, and ConCavity 11, to obtain the final ligand-binding site prediction 9.

Since different ligands tend to bind diverse types of residues with prominent specificities due to the specific roles, sizes, and distributions of protein-ligand interactions 23, the second type of ligand-specific predictors, which are designed to predict binding sites (or pockets) for specific ligand types, have been increasingly of interest. Such predictors include NsitePred 24, TargetS 10, TargetSOS 25, and TargetNUCs 26 for nucleotides; FINDSITE-metal 27 and CHED 28 for metal; DNABR 29 and MetaDBSite 30 for DNA; and IonCom 31 for metal and acid radical ion binding site predictions. These studies demonstrated that ligand-specific binding site predictors are often superior to general-purpose binding site predictors due to the added consideration of the physicochemical features of specific ligand-protein interactions.


Among the many ligands, adenosine-5’-triphosphate (ATP) is of particular interest. ATP is a nucleotide, also called a nucleoside triphosphate, which is a small molecule that is used in cells as a coenzyme and plays an essential role in membrane transport, cellular motility, muscle contraction, signaling, the transcription and replication of DNA, and various metabolic processes 32. It interacts with proteins through protein-ATP binding sites and provides chemical energy to proteins via the hydrolysis of ATP 33. The proteins can then perform various biological functions using the chemical energy. Additionally, ATP-binding sites are valuable drug targets for antibacterial and anti-cancer chemotherapy 34. Hence, accurately localizing the protein-ATP binding sites is of significant importance for both protein function annotation and drug discovery.

ATPint 32 is one of the first custom-designed computational predictors for identifying ATP-specific binding sites, and it was trained with a position specific scoring matrix (PSSM) and several other sequential descriptors, using a dataset consisting of 168 non-redundant ATP-interacting proteins. ATPsite 35 was later proposed by Kurgan and coworkers; this system combined PSSM and an SVM and was trained on a larger dataset with 227 non-redundant ATP-interacting proteins. However, both ATPint and ATPsite ignore the imbalanced learning problem 36 embedded in protein-ATP binding site prediction, i.e., that the number of non-ATP-binding sites is much larger than the number of ATP-binding sites, which can decrease the final prediction performance. To solve the imbalanced learning problem and enhance the prediction accuracy, we recently proposed TargetATPsite 37 by integrating random under-sampling (RUS) and AdaBoost 36 ensemble algorithms. In addition to the above three ATP-specific predictors, many nucleotide-specific binding site predictors, e.g., NsitePred 24, TargetS 33, and TargetNUCs 26, which also contain an ATP-binding site prediction model, can be used to predict
the ATP-binding sites. These existing approaches have the advantage of generating predictions from sequence alone, but the MCC of the predictions is low (typically approximately 0.580 at 52% sensitivity) because sequence information alone does not directly reveal protein function.

Despite the progress made in ATP-binding prediction, most predictors are based only on protein sequence information, whereas protein structure information, which has demonstrated a significant advantage in other ligand-binding studies 9, 21, has not been utilized. In this study, we aim to systematically examine the impact of structure-based features on ATP-binding prediction by developing a new meta ATP-binding site predictor called ATPbind, which integrates the outputs of two template-based predictors, i.e., S-SITE and TM-SITE, with three sequence-based features, i.e., the position specific scoring matrix, the predicted secondary structure, and the predicted solvent accessibility. Here, S-SITE is a sequence-template-based predictor, and TM-SITE is a structure-template-based predictor. To ensure that the proposed ATPbind is an ATP-specific predictor, we extend S-SITE and TM-SITE into ATP-specific predictors and rename them S-SITEatp and TM-SITEatp, respectively. A new gold-standard benchmark dataset consisting of 429 non-redundant ATP-binding proteins was collected from the PDB database and is used to systematically examine the strengths and weaknesses of such a composite combination of sequence and structure information for ATP-binding predictions. In particular, given the significant imbalance of the ATP-binding data, we introduced a mean-ensemble-based method to integrate multiple support vector machines (SVMs) based on the random under-sampling technique.

MATERIALS AND METHODS

Benchmark datasets. We constructed a dataset of 2,144 ATP-binding protein chains, named PATP-2144, which had clear target annotations and had been deposited into the Protein Data
Bank (PDB) 8 before November 5, 2016. We further removed the redundant sequences by using CD-hit software 38 with sequence identity […] 30% to the query sequence are excluded in S-SITEatp. Again, a sliding window (W=17) is used to represent the S-SITEatp-based feature of each residue. The S-SITEatp-based feature dimensionality is 17.
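The sliding-window representation can be illustrated with a minimal sketch. It assumes that the per-residue S-SITEatp output is a single probability and that positions beyond the chain termini are padded with zeros; the padding value and the function name `window_features` are illustrative assumptions, not details given in the text.

```python
from typing import List


def window_features(scores: List[float], window: int = 17, pad: float = 0.0) -> List[List[float]]:
    """Return one `window`-dimensional feature vector per residue.

    Each vector holds the scores of the residue and its (window - 1) / 2
    neighbours on either side. Zero padding at the termini is an assumption.
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd so that the residue is centred")
    half = window // 2
    padded = [pad] * half + list(scores) + [pad] * half
    return [padded[i:i + window] for i in range(len(scores))]


# Hypothetical per-residue binding probabilities for a 30-residue protein.
scores = [0.05 * (i % 10) for i in range(30)]
features = window_features(scores, window=17)
print(len(features), len(features[0]))  # one 17-dimensional vector per residue
```

With W=17, each residue is described by its own score plus those of eight neighbours on each side, which is why the feature dimensionality equals the window size.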


The comparison results between S-SITEatp and S-SITE can be found in Text S1 in the Supporting Information (SI).

TM-SITEatp-based feature. TM-SITE is a structure-template-based method that is also designed to derive general-purpose binding sites by structurally comparing the query protein with template proteins 9. As with S-SITE, it is time-consuming to use TM-SITE to detect ATP-specific binding sites. In this study, we extended TM-SITE to be ATP-specific, naming it TM-SITEatp, by replacing the BioLip library 43 with an ATP-specific database (PATP-2144 in this paper). For a given protein with a known structure, TM-SITEatp can directly predict the ATP-binding probability for each target residue. For a given protein sequence without a 3D structure, we first construct a structural model with I-TASSER 44. The ATP-binding probability of each residue is then predicted by TM-SITEatp. To avoid homologous contamination, homologous templates with a sequence identity >30% to the query have been excluded from both the I-TASSER and ATP-binding libraries. Finally, the TM-SITEatp-based feature (whose dimensionality is 17) is obtained using a sliding window of size 17. To show the performance of TM-SITEatp, we compare TM-SITEatp and the original TM-SITE in Text S2 in SI.

Learning from imbalanced data. Protein-ATP binding site prediction is a typical imbalanced learning problem, where the number of samples in the majority class (non-ATP-binding residues) is larger than that in the minority class (ATP-binding residues) 37. From Table 1, we see that the imbalance ratio between the number of non-ATP-binding residues and the number of ATP-binding residues is larger than 21. Compared to the minority class, the majority class contains a large amount of redundant information in the original dataset, which can decrease the prediction performance and increase the training and testing time. To overcome this hurdle, random under-sampling (RUS) 36 and a mean ensemble (ME) scheme are combined to solve the imbalanced learning problem.

RUS is used to reduce the number of majority-class samples in this study. However, RUS comes with the disadvantage of potentially losing useful information 36. To overcome this disadvantage, we employ the ME scheme to enhance the final prediction accuracy. More specifically, in the training stage, we randomly sample the majority class T times (in the present study, T=10) with RUS and thus obtain T majority training subsets. Each of the T majority training subsets is combined with the minority class (ATP-binding sites) training set to constitute T new training datasets. Then, a machine learning model (SVM in this study) is trained on each of the T new training datasets. In the prediction stage, for each residue in a given protein sequence, its probability of belonging to the ATP-binding site class is predicted by each of the T prediction models. Then, the T probabilities of the residue belonging to the ATP-binding site class are fused by ME. The details of ME are as follows. Let $\{p_t\}_{t=1}^{T}$ be the T predicted probabilities, where $p_t$ is the probability output of the t-th prediction model. The equation of ME can then be represented as follows:

$$p_{ME} = \frac{1}{T} \sum_{t=1}^{T} p_t \qquad (2)$$

where $p_{ME}$ is the average probability.

Support vector machine (SVM). An SVM 45 is utilized to construct the base classifiers. We use LIBSVM 46, which is freely available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/, to implement the SVM algorithm. Here, a radial basis function is chosen as the kernel function. The kernel width parameter σ and the regularization parameter γ, which are the two most important
parameters, are optimized over a five-fold cross-validation using a grid search strategy in the LIBSVM tool. The details of cross-validation can be found in Text S3 in the SI.
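The RUS and mean-ensemble procedure of Eq. 2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the base learner is a simple nearest-centroid scorer standing in for the LIBSVM RBF-kernel SVM so that the example needs only the Python standard library, each under-sampled majority subset is assumed to match the minority class in size, and all function names and toy data are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = Sequence[float]
Model = Callable[[Vector], float]  # maps a feature vector to P(ATP-binding)


def train_centroid_model(pos: List[Vector], neg: List[Vector]) -> Model:
    """Stand-in base learner (NOT an SVM): nearest-centroid score squashed to (0, 1)."""
    dim = len(pos[0])
    c_pos = [sum(v[d] for v in pos) / len(pos) for d in range(dim)]
    c_neg = [sum(v[d] for v in neg) / len(neg) for d in range(dim)]

    def predict(x: Vector) -> float:
        d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg))
        z = max(-50.0, min(50.0, d_pos - d_neg))  # clamp to keep exp() finite
        return 1.0 / (1.0 + math.exp(z))

    return predict


def train_rus_ensemble(pos: List[Vector], neg: List[Vector],
                       n_models: int = 10, seed: int = 0) -> List[Model]:
    """Train T models, each on all positives plus a fresh random majority subset."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        neg_subset = rng.sample(neg, len(pos))  # RUS; 1:1 class ratio is an assumption
        models.append(train_centroid_model(pos, neg_subset))
    return models


def predict_mean_ensemble(models: List[Model], x: Vector) -> float:
    """Eq. 2: the average of the T per-model probabilities."""
    return sum(m(x) for m in models) / len(models)


# Toy data (hypothetical): 3 binding residues versus 63 non-binding ones (1:21).
pos = [(1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
neg = [(0.1 * (i % 5), 0.1 * (i % 7)) for i in range(63)]
models = train_rus_ensemble(pos, neg, n_models=10)
print(predict_mean_ensemble(models, (1.0, 1.0)))  # query near the binding examples
print(predict_mean_ensemble(models, (0.0, 0.0)))  # query near the non-binding examples
```

Because every model sees all minority samples but a different majority subset, averaging the T outputs recovers some of the information that any single under-sampled training set discards.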

Figure 1. Architecture of ATPbind.

Architecture of ATPbind. Figure 1 illustrates the architecture of the proposed ATPbind for ATP-binding site prediction. For a given protein, ATPbind can extract the above five different
view features for each target residue by calling the corresponding programs and applying the sliding window technique. In the training phase, after extracting the features of all proteins in PATP-388, we obtain an extremely imbalanced training sample set. We then employ RUS T (T=10 in this study) times to construct multiple training subsets. One prediction model is trained on each subset using an SVM. The ensemble prediction model is finally created by using the ME method to integrate the outputs of the different models. In the prediction phase, for a protein to be predicted, the ensemble prediction model gives, for each residue, the probability of being an ATP-binding residue. We also propose a sequence-based ATP-binding predictor, named ATPseq, that only utilizes information from the protein sequence. The only difference between ATPseq and ATPbind is that ATPseq does not use the TM-SITEatp-based feature.

Assessing predictive ability. Five evaluation indexes that are routinely used in this field, i.e., Sensitivity (Sen), Specificity (Spe), Accuracy (Acc), Precision (Pre), and the Matthews Correlation Coefficient (MCC), are utilized to evaluate predictive ability, as follows:

$$\mathrm{Sen} = \frac{TP}{TP + FN} \qquad (3)$$

$$\mathrm{Spe} = \frac{TN}{TN + FP} \qquad (4)$$

$$\mathrm{Acc} = \frac{TP + TN}{TP + FN + TN + FP} \qquad (5)$$

$$\mathrm{Pre} = \frac{TP}{TP + FP} \qquad (6)$$

$$\mathrm{MCC} = \frac{TP \cdot TN - FN \cdot FP}{\sqrt{(TP + FN) \cdot (TP + FP) \cdot (TN + FN) \cdot (TN + FP)}} \qquad (7)$$

where TN, TP, FN, and FP are abbreviations for true negatives, true positives, false negatives, and false positives, respectively. The MCC (ranging from -1 to 1) evaluates the overall predictive
quality. A higher MCC value means better prediction performance, and a value of 0 corresponds to a prediction no better than chance, as when all residues are predicted as non-binding (or binding). The threshold that maximizes the MCC value is chosen and used to calculate the reported values of Sen, Spe, Acc, Pre, and MCC. Furthermore, the area under the receiver operating characteristic (ROC) curve (termed AUC), for which a larger value indicates better overall prediction performance, is employed to assess the overall predictive ability.

RESULTS AND DISCUSSIONS

Assessment of the quality of the I-TASSER-modeled structures. Since the quality of the modeled protein structure has an impact on TM-SITEatp, we compare the structures modeled by I-TASSER 44 with the experimental structures on the PATP-TEST dataset, where the protein lengths range from 115 to 863, in terms of the TM-score and RMSD evaluation indexes 47. For each given protein, the standard I-TASSER program, which excludes all homologous template proteins with sequence identity >30% to the given sequence, generates a structural model from the query protein sequence with iterative fragment assembly simulations. We then calculate the TM-score and RMSD of the I-TASSER-modeled structures (ITAMSs) for the 41 testing proteins. The results are compiled in Figure 2. Figure 2 shows that the majority of the testing proteins (≈87.8%) can be modeled by I-TASSER with a correct fold, i.e., TM-score > 0.5, and 32 proteins (78.05% of PATP-TEST) have an RMSD below 4 Å. The overall average TM-score and RMSD for the testing proteins are 0.721 and 3.502 Å, respectively. These results indicate that the quality of the ITAMSs is acceptable for ATP-binding site prediction.


To further check the ITAMS quality, we compare the ITAMSs with the MODELLER-modeled structures (MODMSs), which are built by the MODELLER software 48, on PATP-TEST. The detailed comparison results can be found in Table S4 and Figure S1. The overall average TM-score and RMSD of the ITAMSs are 0.721 and 3.502 Å, respectively, which are approximately 0.116 and 0.695 Å better than those of the MODMSs. Concretely, there are 26 testing proteins (≈63.41%) for which the ITAMSs have a higher TM-score than the MODMSs. Meanwhile, for 58.54% of the proteins in PATP-TEST, the ITAMSs have a lower RMSD than the MODMSs. These results indicate that the ITAMSs are of higher quality than the MODMSs for ATP-binding site prediction.

Figure 2. TM-score and RMSD distributions of the I-TASSER-modeled structures on PATP-TEST.

Do S-SITEatp-based and TM-SITEatp-based features help ATP-binding site prediction? In this section, the discriminative performances of three feature combinations, i.e., PSSM+PSS+PSA (PPP), PSSM+PSS+PSA+S-SITEatp (PPPS), and PSSM+PSS+PSA+S-SITEatp+TM-SITEatp (PPPST), are investigated to measure whether the S-SITEatp-based and TM-SITEatp-based features help ATP-binding site prediction. Here, the TM-SITEatp-based feature is extracted from the ground-truth 3D structure. Each feature combination is evaluated by a five-fold cross-validation test on the training dataset PATP-388 with a single SVM classifier. In each training phase of the cross-validation test, we first employ RUS to make the number of samples in the majority class equal to that in the minority class, and we then train a single SVM model. Table 2 summarizes the discriminative performance comparison of the three feature combinations on PATP-388 over five-fold cross-validation tests with the single SVM classifier.
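The evaluation protocol described above, with RUS applied only inside each training fold, can be sketched as follows. The split here is a random residue-level partition, which is an assumption; the actual cross-validation details are given in Text S3 of the SI and may differ (for example, by splitting at the protein level). Function names and the toy labels are hypothetical.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def k_fold_splits(n_samples: int, k: int = 5, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for a shuffled k-fold partition."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def balance_by_under_sampling(train_idx: Sequence[int], labels: Sequence[int],
                              rng: random.Random) -> List[int]:
    """Keep all minority (label 1) samples and an equal number of random majority samples."""
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    return pos + rng.sample(neg, min(len(pos), len(neg)))


# Toy labels (hypothetical): 20 binding residues among 440, i.e. a 1:21 ratio.
labels = [1 if i % 22 == 0 else 0 for i in range(440)]
rng = random.Random(1)
for train_idx, test_idx in k_fold_splits(len(labels), k=5):
    balanced = balance_by_under_sampling(train_idx, labels, rng)
    # A classifier would be trained on `balanced` and evaluated on the untouched
    # `test_idx`, so the test fold keeps the original class imbalance.
    print(len(balanced), len(test_idx))
```

Under-sampling only the training portion keeps the reported indexes representative of the real, imbalanced residue distribution.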

Table 2. Performance comparison between PPP, PPPS, and PPPST features on PATP-388 over five-fold cross-validation tests with a single SVM classifier

| Feature type | Sen (%) | Spe (%) | Acc (%) | Pre (%) | MCC | AUC |
|---|---|---|---|---|---|---|
| PPP | 51.86 | 97.64 | 95.88 | 46.64 | 0.470 | 0.895 |
| PPPS | 58.05 | 98.11 | 96.58 | 55.04 | 0.547 | 0.919 |
| PPPST | 65.49 | 98.25 | 96.99 | 59.74 | 0.610 | 0.935 |
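As a worked illustration of how the first five columns of Table 2 follow from Eqs. 3-7, the sketch below computes the indexes from confusion-matrix counts. The counts are hypothetical and chosen only to mimic a 1:21 class ratio; they are not taken from the paper. Returning 0 when a denominator vanishes is a common convention assumed here.

```python
import math
from typing import Dict


def evaluate(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute Sen, Spe, Acc, Pre, and MCC (Eqs. 3-7) from confusion-matrix counts."""
    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0  # assumed convention for empty denominators

    mcc_den = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    return {
        "Sen": ratio(tp, tp + fn),
        "Spe": ratio(tn, tn + fp),
        "Acc": ratio(tp + tn, tp + fn + tn + fp),
        "Pre": ratio(tp, tp + fp),
        "MCC": ratio(tp * tn - fn * fp, mcc_den),
    }


# Hypothetical counts: 100 binding residues and 2100 non-binding residues.
result = evaluate(tp=60, tn=2070, fp=30, fn=40)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With so many true negatives, Spe and Acc stay high even for a mediocre predictor, which is why MCC and Pre are the more informative columns for this imbalanced problem.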

Table 2 shows that the PPPS and PPPST features consistently outperform the PPP feature on all six evaluation indexes. Comparing PPPS and PPP, the Sen, MCC, and AUC of PPPS are 58.05%, 0.547, and 0.919, which are approximately 11.94%, 16.38%, and 2.68% higher, in relative terms, than those of PPP, respectively. The Sen, MCC, and AUC of PPPST are superior to those of the other two features, i.e., PPP and PPPS, with relative improvements of 12.82%, 11.52%, and 1.74%, respectively, over the second-best feature, PPPS. The P-values of Student’s t-test for the difference in MCC scores between PPPST and the other two
features are both less than $10^{-4}$. Figure S2 illustrates the corresponding ROC curves, which show that PPPST performs best and that PPPS is also better than PPP.

Does the mean ensemble strategy help ATP-binding site prediction? Table 3 lists the prediction performances of the single SVM classifier and the ensemble classifier, whose prediction model is obtained by using the mean ensemble strategy to integrate T (T=10) SVM models, with PPPS and PPPST features on PATP-388 over five-fold cross-validation tests.

Table 3. Performance comparison between the ensemble classifier and single SVM classifier with PPPS and PPPST features on PATP-388 over five-fold cross-validation tests

| Feature type | Classifier | Sen (%) | Spe (%) | Acc (%) | Pre (%) | MCC | AUC |
|---|---|---|---|---|---|---|---|
| PPPS | Single a | 58.05 | 98.11 | 96.58 | 55.04 | 0.547 | 0.919 |
| PPPS | Ensembled b | 57.52 | 98.86 | 97.27 | 66.69 | 0.605 | 0.913 |
| PPPST | Single a | 65.49 | 98.25 | 96.99 | 59.74 | 0.610 | 0.935 |
| PPPST | Ensembled b | 64.04 | 98.88 | 97.55 | 69.57 | 0.655 | 0.932 |

a ‘Single’ represents the single SVM classifier.
b ‘Ensembled’ represents the ensemble classifier.
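The threshold-dependent indexes in Table 3 are reported at the probability cutoff that maximizes MCC. A minimal sketch of that selection is given below; it assumes a residue is called binding when its probability is at least the threshold and that candidate thresholds are the observed probabilities. Both conventions, the function names, and the toy data are assumptions for illustration.

```python
import math
from typing import Sequence, Tuple


def mcc_at_threshold(probs: Sequence[float], labels: Sequence[int], thr: float) -> float:
    """MCC (Eq. 7) when residues with probability >= thr are predicted as binding."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        predicted = p >= thr
        if predicted and y == 1:
            tp += 1
        elif predicted and y == 0:
            fp += 1
        elif not predicted and y == 1:
            fn += 1
        else:
            tn += 1
    den = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    return (tp * tn - fn * fp) / den if den else 0.0


def best_mcc_threshold(probs: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, MCC) for the candidate threshold with the highest MCC."""
    candidates = sorted(set(probs))
    best = max(candidates, key=lambda t: mcc_at_threshold(probs, labels, t))
    return best, mcc_at_threshold(probs, labels, best)


# Toy predictions (hypothetical): 3 binding and 5 non-binding residues.
probs = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
threshold, mcc = best_mcc_threshold(probs, labels)
print(threshold, round(mcc, 3))  # threshold 0.4, MCC ≈ 0.775
```

Because each classifier is scored at its own MCC-optimal cutoff, a model can show a higher MCC while having a slightly lower threshold-free AUC, as the ensemble classifier does in Table 3.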

Table 3 shows that the ensemble classifier consistently outperforms the single SVM classifier on both PPPS and PPPST features in the Spe, Acc, Pre, and MCC evaluation indexes, although the ensemble classifier has a slightly lower Sen and AUC. Taking the results with PPPST as an example, the Spe, Acc, Pre, and MCC of the ensemble classifier are 98.88%, 97.55%, 69.57%, and 0.655, which are approximately 0.6%, 0.6%, 16.5%, and 7.4% higher, in relative terms, than those of the single SVM classifier, respectively. The Sen and AUC (64.04% and 0.932) of the
ensemble classifier are both slightly lower than those (65.49% and 0.935) of the single SVM classifier. In addition to Table 3, Figures S3 and S4 show the ROC curves and the curves of MCC versus the false positive rate for the ensemble classifier and the single SVM classifier on the PPPS and PPPST features, respectively. Figures S3 and S4 show that the ensemble classifier outperforms the single SVM classifier in the low false positive rate (FPR=FP/(FP+TN)) regions, where the FPR is less than 11.32% on the PPPS feature and 14.60% on the PPPST feature, although the overall AUCs of the ensemble classifier are slightly lower than those of the single SVM classifier for both PPPS and PPPST features. Note that the low FPR region is more important than the high FPR region, especially in imbalanced learning problems. Since the maximum MCC values of the two classifiers all lie in the low FPR regions on PPPS and PPPST features, the reported MCCs (0.605 and 0.655) of the ensemble classifier are 10.6% and 7.4% higher than those of the single SVM classifier.

Comparing ATPseq and ATPbind with existing ATP-binding site predictors. In this section, we demonstrate the efficacy of ATPseq and ATPbind by comparing them with other existing ATP-binding site predictors over five-fold cross-validation tests on PATP-388 and independent validation tests on PATP-TEST.

Performance comparison over cross-validation tests. Table 4 presents the performance comparison of ATPbind, ATPseq, TM-SITEatp, and S-SITEatp on PATP-388 over five-fold cross-validation tests. Table 4 shows that the proposed ATPbind and ATPseq are both better than TM-SITEatp and S-SITEatp, and that ATPbind is superior to the other three predictors. Compared to S-SITEatp, the MCC values of ATPseq and ATPbind for ATP-binding site prediction increase by 32.97% and 43.96%, respectively. Meanwhile, ATPseq and
ATPbind achieve improvements of 19.33% and 29.19%, respectively, in the MCC evaluation index compared with TM-SITEatp. Note that the results of ATPbind and TM-SITEatp are both obtained by employing the experimental 3D structure information.

Table 4. Performance comparison of ATPbind, ATPseq, TM-SITEatp, and S-SITEatp on PATP-388 over five-fold cross-validation tests

| Predictor | Sen (%) | Spe (%) | Acc (%) | Pre (%) | MCC | AUC |
|---|---|---|---|---|---|---|
| S-SITEatp | 69.88 | 94.47 | 93.53 | 33.47 | 0.455 | N/A |
| TM-SITEatp | 73.64 | 95.29 | 94.46 | 38.37 | 0.507 | N/A |
| ATPseq | 57.52 | 98.86 | 97.27 | 66.69 | 0.605 | 0.913 |
| ATPbind | 64.04 | 98.88 | 97.55 | 69.57 | 0.655 | 0.932 |

‘N/A’ means that the corresponding value could not be computed.
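The percentage improvements quoted for Table 4 are relative gains over the baseline MCC rather than absolute differences. The short check below reproduces them from the MCC column; the helper name `relative_gain` is introduced here for illustration only.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# MCC values copied from Table 4.
mcc = {"S-SITEatp": 0.455, "TM-SITEatp": 0.507, "ATPseq": 0.605, "ATPbind": 0.655}

for baseline in ("S-SITEatp", "TM-SITEatp"):
    for method in ("ATPseq", "ATPbind"):
        gain = relative_gain(mcc[method], mcc[baseline])
        print(f"{method} vs {baseline}: {gain:.2f}%")
# ATPseq vs S-SITEatp: 32.97%
# ATPbind vs S-SITEatp: 43.96%
# ATPseq vs TM-SITEatp: 19.33%
# ATPbind vs TM-SITEatp: 29.19%
```

The same formula accounts for the percentage differences quoted alongside Tables 2, 3, and 5.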

Performance comparison over independent validation tests. Table 5 presents the performance comparison of ATPbind, ATPseq, and other existing protein-ATP binding site predictors, including three structure-based predictors (i.e., COACH 9, 3DLigandSite 20, and TM-SITEatp) and six sequence-based predictors (i.e., TargetNUCs 26, TargetSOS 25, TargetS 33, TargetATPsite 37, NsitePred 24, and S-SITEatp), on the independent test dataset (PATP-TEST). Figure 3 shows the ROC curves of ATPbind, ATPseq, and the abovementioned six sequence-based predictors. From Table 5 and Figure 3, it is clear that the proposed ATPbind achieves the best performance on PATP-TEST and that the proposed ATPseq outperforms the other six sequence-based predictors. The MCC and AUC values of ATPseq are consistently superior to those of all six other sequence-based predictors considered here, and relative improvements of 1.91% and 2.57%, respectively, are achieved compared with the second-best sequence-based performer, TargetNUCs 26. It has not escaped our notice that TargetNUCs achieves the highest Pre score,
0.8681, and with similar Spe and Acc values; however, its Sen value (0.4688) is much lower than that of ATPseq (0.5445), indicating that more false negatives are incurred during prediction. Table 5. Performance comparison of ATPbind, ATPseq, and other existing ATP-binding site predictors on the independent test dataset (PATP-TEST)

Structure   Predictor          Sen (%)   Spe (%)   Acc (%)   Pre (%)   MCC     AUC
NS          S-SITEatp          67.51     92.65     91.51     30.41     0.416   N/A
NS          NsitePred a        46.74     97.70     95.39     49.22     0.456   0.852
NS          TargetATPsite b    41.25     99.49     96.84     79.43     0.559   0.853
NS          TargetS c          51.63     98.89     96.74     68.91     0.580   0.872
NS          TargetSOS d        49.26     99.46     97.18     81.37     0.620   0.863
NS          TargetNUCs e       46.88     99.66     97.26     86.81     0.627   0.856
NS          ATPseq             54.45     99.27     97.24     78.09     0.639   0.878
ITA         TM-SITEatp         69.73     96.09     84.89     45.90     0.541   N/A
ITA         3DLigandSite f     48.81     98.58     96.32     62.08     0.532   N/A
ITA         COACH g            58.16     98.59     96.76     66.33     0.604   N/A
ITA         ATPbind            62.31     98.85     97.19     72.04     0.656   0.905
EXP         TM-SITEatp         78.78     96.27     95.48     50.14     0.607   N/A
EXP         3DLigandSite f     56.82     99.31     97.38     79.63     0.660   N/A
EXP         COACH g            63.20     98.73     97.11     70.30     0.652   N/A
EXP         ATPbind            63.06     99.03     97.40     75.62     0.677   0.915

a Results computed using the NsitePred server at http://biomine.cs.vcu.edu/servers/NsitePred.
b Results computed using the TargetATPsite server at http://www.csbio.sjtu.edu.cn/bioinf/TargetATPsite.
c Results computed using the TargetS server at http://www.csbio.sjtu.edu.cn/bioinf/TargetS.
d Results computed using the TargetSOS server at http://www.csbio.sjtu.edu.cn/bioinf/TargetSOS.
e Results computed using the TargetNUCs server at http://csbio.njust.edu.cn/bioinf/TargetNUCs/.
f Results computed using the 3DLigandSite server at http://www.sbg.bio.ic.ac.uk/~3dligandsite/ by submitting the I-TASSER-modeled or experimental protein structure.
g Results computed using the standalone program of COACH, which was downloaded at https://zhanglab.ccmb.med.umich.edu/COACH/.
‘NS’, ‘ITA’, and ‘EXP’ mean no structure, I-TASSER-modeled structure, and experimental structure, respectively. ‘N/A’ means that the corresponding value could not be computed.
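The table above shows that a predictor can score high on Pre and low on Sen, or the reverse, and that MCC summarizes both. The following minimal sketch illustrates this trade-off, assuming the standard confusion-matrix definitions of the five indices. The residue counts are hypothetical and are not taken from PATP-TEST; the function name `binding_site_metrics` is likewise illustrative.

```python
import math


def binding_site_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute Sen, Spe, Acc, Pre, and MCC from per-residue confusion counts.

    Assumes every denominator is non-zero (both classes are present and at
    least one residue is predicted as binding).
    """
    sen = tp / (tp + fn)                      # fraction of binding residues found
    spe = tn / (tn + fp)                      # fraction of non-binding residues rejected
    acc = (tp + tn) / (tp + fp + tn + fn)     # overall fraction correct
    pre = tp / (tp + fp)                      # fraction of predictions that are correct
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom
    return {"Sen": sen, "Spe": spe, "Acc": acc, "Pre": pre, "MCC": mcc}


# Hypothetical counts: 1000 residues, 50 of which bind ATP.
conservative = binding_site_metrics(tp=25, fp=5, tn=945, fn=25)   # few, precise calls
liberal = binding_site_metrics(tp=35, fp=45, tn=905, fn=15)       # many, noisy calls

for name, m in (("conservative", conservative), ("liberal", liberal)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

In this toy case the liberal predictor recovers more binding residues but pays for it with many false positives, so its MCC falls below that of the conservative one. Because binding residues are a small minority, Spe and Acc remain high for both predictors and discriminate poorly between them.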


Figure 3. ROC curves of ATPbind (ITA and EXP), ATPseq, TargetNUCs, TargetSOS, TargetS, TargetATPsite, and NsitePred on the independent test dataset (PATP-TEST). ‘ITA’ and ‘EXP’ mean the I-TASSER-modeled structure and the experimental structure, respectively.
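The AUC values reported alongside the ROC curves require a real-valued score per residue rather than a binary label. As a minimal sketch, AUC can be computed as the probability that a randomly chosen binding residue receives a higher score than a randomly chosen non-binding residue. The scores and labels below are hypothetical, and the pairwise loop is chosen for clarity, not speed; it is not the procedure used by the authors.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical per-residue binding scores and true labels (1 = ATP-binding).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
print(round(auc, 3))
```

Here one binding residue (score 0.35) is ranked below one non-binding residue (score 0.6), so eight of the nine positive-negative pairs are ordered correctly.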

The MCC and AUC values of ATPbind are 0.656 and 0.905, respectively, which are 2.66% and 3.08% higher (P-value < 0.02 in Student’s t-test for the difference in MCC score) than those of ATPseq, the best sequence-based predictor. Compared with the best general-purpose ligand-binding site predictor, i.e., COACH, ATPbind obtains Sen and MCC improvements of 7.14% and 8.61%, respectively. Furthermore, ATPbind achieves Sen and MCC improvements of 27.61% and 23.32%, respectively, compared to 3DLigandSite, which is the second-best general-purpose predictor. TM-SITEatp’s Sen (0.6973) is higher than that of ATPbind (0.6231), but its Pre (0.459) is significantly lower than that of ATPbind (0.7204), resulting in a low MCC value of 0.541. The differences in MCC values between ATPbind and COACH, 3DLigandSite, and TM-SITEatp are all statistically significant, with P-values < 0.05, < 10⁻³, and < 10⁻², respectively, by Student’s t-tests.

In Table 5, we also list the prediction results of the structure-dependent predictors when the experimental 3D structures of the query proteins are used. As expected, the performance of all the structure-based methods improves owing to the higher structural accuracy of the query proteins. ATPbind again outperforms the other three structure-based predictors in the MCC evaluation index.

To further compare ATPbind with the existing structure-based ATP-binding site predictors, we first divide the independent test dataset, PATP-TEST, into two parts, denoted PATP-TEST-easy and PATP-TEST-hard, based on the quality of the I-TASSER-modeled structure of each protein. Concretely, PATP-TEST-easy contains 36 proteins with high-quality modeled structures, for which the TM-score between the I-TASSER-modeled structure and the experimental structure is higher than 0.5. PATP-TEST-hard includes 5 proteins with low-quality I-TASSER-modeled structures, for which the TM-score between the modeled and experimental structures is lower than 0.5. Note that protein structure pairs with a TM-score > 0.5 are mostly in the same fold 49; the TM-scores of the 41 PATP-TEST proteins can be found in Table S4.

Overall, the average TM-scores for PATP-TEST-easy and PATP-TEST-hard are 0.765 and 0.400, respectively. Next, we compare the prediction performances of ATPbind, COACH 9, 3DLigandSite 20, and TM-SITEatp on the PATP-TEST-easy and PATP-TEST-hard datasets. Tables 6 and 7 present the comparison results on PATP-TEST-easy and PATP-TEST-hard, respectively.
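The partition of the test set by model quality can be sketched as follows. This is a minimal illustration, not the authors' script: the protein identifiers and TM-scores are hypothetical (the real values are in Table S4), and assigning a score of exactly 0.5 to the hard subset is an assumption, since the text specifies only "higher than" and "lower than" 0.5.

```python
from statistics import mean

TM_SCORE_CUTOFF = 0.5  # threshold stated in the text


def split_by_model_quality(tm_scores, cutoff=TM_SCORE_CUTOFF):
    """Split proteins into (easy, hard) by the TM-score of their modeled structure.

    Scores strictly above the cutoff go to `easy`; all others go to `hard`.
    Placing a score exactly equal to the cutoff in `hard` is an assumption.
    """
    easy = {pid: s for pid, s in tm_scores.items() if s > cutoff}
    hard = {pid: s for pid, s in tm_scores.items() if s <= cutoff}
    return easy, hard


# Hypothetical TM-scores between modeled and experimental structures.
tm_scores = {"protA": 0.9, "protB": 0.6, "protC": 0.45, "protD": 0.3}

easy, hard = split_by_model_quality(tm_scores)
print("easy:", sorted(easy), "mean TM-score:", mean(easy.values()))
print("hard:", sorted(hard), "mean TM-score:", mean(hard.values()))
```

Each predictor would then be evaluated separately on the two subsets, which shows how its accuracy depends on the quality of the input model.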

Table 6. Performance comparison of ATPbind and other existing structure-based ATP-binding site predictors on PATP-TEST-easy for which the TM-score of the I-TASSER model is >0.5.

Predictor      Sen (%)   Spe (%)   Acc (%)   Pre (%)   MCC
TM-SITEatp     71.38     96.04     94.88     47.11     0.555
3DLigandSite   49.66     98.67     96.36     64.84     0.549
COACH          60.27     98.49     96.69     66.30     0.615
ATPbind        64.14     98.83     97.20     72.99     0.670

Table 7. Performance comparison of ATPbind and other existing structure-based ATP-binding site predictors on PATP-TEST-hard for which the TM-score of the I-TASSER model is