

Article pubs.acs.org/jcim

Cite This: J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Computational Prediction and Analysis for Tyrosine Post-Translational Modifications via Elastic Net

Man Cao,† Guodong Chen,† Lina Wang,‡ Pingping Wen,‡ and Shaoping Shi*,†

†Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, and ‡College of Chemistry, Nanchang University, Nanchang 330031, China



Supporting Information

ABSTRACT: Tyrosine residues undergo three major post-translational modifications (PTMs), namely nitration, sulfation, and phosphorylation, which are involved in different physiological and pathological processes. Multiple tyrosine residues in a protein may be modified concurrently, and the PTM of a single tyrosine may affect the modification of neighboring tyrosine residues. Hence, it is valuable to predict nitration, sulfation, and phosphorylation of tyrosine residues across the whole protein sequence. Here, we introduce elastic net to perform feature selection and develop a predictor named TyrPred, based on a support vector machine, for predicting nitrotyrosine, sulfotyrosine, and kinase-specific tyrosine phosphorylation sites. We critically evaluate the performance of TyrPred and compare it with existing tools. The results show that using elastic net to mine important features for training can considerably improve prediction performance. Feature optimization indicates that evolutionary information is significant and contributes to the prediction model. The online tool is available at http://computbiol.ncu.edu.cn/TyrPred. We anticipate that TyrPred can provide a useful complement to the existing approaches in this field.



INTRODUCTION

Tyrosine nitration, sulfation, and phosphorylation are three significant post-translational modifications (PTMs) that play important roles in different physiological and pathological processes. Tyrosine nitration, in which −NO2 is added at the 3′-position of the phenolic ring, is a sign of oxidant burden in human diseases and is reversible owing to an apparent enzymatic activity.1−5 Tyrosine sulfation is an irreversible modification mediated by tyrosyl protein sulfotransferase, which catalyzes the transfer of sulfate from 3′-phosphoadenosine-5′-phosphosulfate to tyrosine residues in the trans-Golgi network.2 Malfunction or dysregulation of tyrosine sulfation can trigger several serious diseases, including atherosclerosis,6 lung diseases,7 and HIV infection.8 Both tyrosine nitration and sulfation require a nearby negative charge and turn-inducing amino acids to be present, and disulfide bonds or other steric hindrances to be absent.2,9 They differ in that sulfation hinders the activity of chymotrypsin, whereas nitration does not.9 Tyrosine phosphorylation, a reversible and widespread PTM, is catalyzed by specific kinases and has a significant regulatory effect on protein activity, function, and intracellular transport.10 Kinase-specific phosphorylation information is fundamental for the reconfiguration of signal transduction networks and the identification of potential drug targets.11 Thus, the validation of PTM sites and the analysis of substrate proteins can provide useful information for understanding modification mechanisms and for drug design for the related diseases.

Indeed, many conventional experimental approaches have been used to identify protein nitration,12,13 sulfation,14,15 and phosphorylation16 sites. However, these approaches are often arduous, time-consuming, and costly. As an alternative, convenient computational prediction tools have been designed for the PTMs of tyrosine. For example, in 2008 Xue et al. presented the group-based prediction system (GPS), a tool that predicts protein phosphorylation sites hierarchically.17 In 2010, Gao et al. proposed Musite for predicting general phosphorylation sites in 6 organisms and kinase-specific sites for 13 kinases or kinase families.18 Subsequently, Liu et al. combined sequence information to predict nitrated tyrosine sites.19 In 2012, Huang et al. developed PredSulSite to identify sulfotyrosine sites on the basis of secondary structure and encoding based on grouped weight with an autocorrelation function.20 In 2014, Suo et al. constructed PSEA, a model for detecting phosphorylation sites specific to a kinase, kinase family, or kinase group; PSEA improved prediction performance by making full use of the idea of gene set enrichment analysis (GSEA).21 Xu et al. then developed iNitro-Tyr, which identifies nitrotyrosine sites based on position-specific dipeptide propensity with general pseudo amino acid composition.22 Jia et al. predicted protein sulfotyrosine sites by using four designed strategies.23

Received: December 2, 2017

DOI: 10.1021/acs.jcim.7b00688 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Journal of Chemical Information and Modeling

Table 1. Summary of Currently Available Prediction Tools of Tyrosine PTM Sitesᵃ

predictorᵇ

typeᶜ

featureᵈ

GPS-NO219 iNitro-Tyr22 Sulfinator27 Sulfotyrosine28 PredSulSite20 GPS-TSP25 SulfoTyrP23 DFA-PTSs24 NetPhosK29 PredPhospho30

nitration nitration sulfation sulfation sulfation sulfation sulfation sulfation phosphorylation phosphorylation

GPS PseAAC HMM PSSM EBGW, ACF, SS GPS BPBAA, CMV SS, disorder, PP neural network HMM

PPSP 1.031 KinasePhos32 PhoScan33 CRPhos 0.834 SiteSeek35

phosphorylation phosphorylation phosphorylation phosphorylation phosphorylation

PostMod36 GPS 2.137 Musite18 PKIS38 PhosK3D39 PhosphoPICK40 PSEA21 Phos_pred26 GPS 3.0 PhosPred-RF41

phosphorylation phosphorylation phosphorylation phosphorylation phosphorylation phosphorylation phosphorylation phosphorylation phosphorylation phosphorylation

BDT PCP, PS AAC CRF evolutionary, hydrophobicity PP, motif GPS disorder, AAC CMS motifs, AAC BN PSEA PS, GO, KEGG GPS evolutionary, disorder

window size

regular service

website

15 19 25 9 9 15 11 9 21 15

9 25 9

7 15 13 13 9 15 15 21

http://yno2.biocuckoo.org/ http://app.aporc.org/iNitro-Tyr/ http://www.expasy.org/tools/sulfinator/ http://ecsb.ex.ac.uk/sulfotyrosine http://www.bioinfo.ncu.edu.cn/inquiries_PredSulSite.aspx http://tsp.biocuckoo.org/index.php

yes yes yes yes yes yes

http://biolabxynu.zicp.net:9090/DFAPTSs/ http://www.cbs.dtu.dk/services/NetPhosK/ http://www.nih.go.kr/phosphovariant/html/seq_input_ predphospho2.htm http://ppsp.biocuckoo.org/ http://kinasephos2.mbc.nctu.edu.tw/index.html http://bioinfo.au.tsinghua.edu.cn/phoscan/ http://www.ptools.ua.ac.be/CRPhos/

N/A yes yes

http://biodb.kaist.ac.kr/PTM/index.html http://gps.biocuckoo.org/index.php http://musite.sourceforge.net/ http://bioinformatics.ustc.edu.cn/pkis/ http://csb.cse.yzu.edu.tw/PhosK3D/ http://bioinf.scmb.uq.edu.au/phosphopick/phosphopick http://bioinfo.ncu.edu.cn/PKPred_Home.aspx http://bioinformatics.ustc.edu.cn/phos_pred/ http://gps.biocuckoo.org/index.php http://server.malab.cn/PhosPred-RF

yes yes yes yes N/A yes yes yes yes yes

yes yes N/A yes

ᵃGPS = Group-based Prediction System; PseAAC = Pseudo Amino Acid Composition; HMM = Hidden Markov Model; PSSM = Position Specific Scoring Matrix; EBGW = Encoding Based on Grouped Weight; ACF = Auto Correlation Function; BPBAA = Biprofile Bayesian Amino Acid; CMV = Composition Moment Vector; SS = Secondary Structure; PP = Physicochemical Property; BDT = Bayesian Decision Theory; PCP = Protein Coupling Pattern; PS = Primary Sequence; AAC = Amino Acid Composition; CRF = Conditional Random Field; CMS = Composition of Monomer Spectrum; PSEA = Phosphorylation Set Enrichment Analysis; BN = Bayesian Network; N/A = Not Available. ᵇThe reference for the predictor. ᶜThe type of tyrosine PTM. ᵈThe feature used for the prediction.

Recently, Guo et al. described DFA-PTSs, which predicts tyrosine sulfation sites by using a discrete firefly algorithm (DFA) and a support vector machine (SVM) with a second round of feature selection.24

Although there has been considerable success in developing predictors for tyrosine modifications (Table 1), some limitations remain. (i) A survey of all available tools shows that each current tyrosine PTM predictor has been designed to identify only nitration, sulfation, or phosphorylation sites; none predicts all three within a single system. Multiple tyrosine residues in a protein may be modified concurrently, and the PTM of a single tyrosine may affect the modification of neighboring tyrosine residues.25 Hence, it is valuable to predict nitration, sulfation, and phosphorylation of tyrosine residues in the whole protein. (ii) As shown in Table 1, some predictors rely on a single feature, such as GPS17 and PSEA,21 which concentrate only on sequence information. Other predictors, including PredSulSite20 and Phos_pred,26 combine multiple features but do not discuss the importance and contribution of the different features to the prediction of tyrosine PTM sites. (iii) Moreover, the performance of the existing prediction models is not very satisfactory. Only two computational methods have been reported for identifying tyrosine nitration so far.

To address these limitations, it is necessary to improve the prediction quality for tyrosine PTM sites by developing a new method. By incorporating multiple features via the elastic net feature selection method, we propose a novel tool called TyrPred, which can predict nitration, sulfation, and phosphorylation of tyrosine residues in a single system. Subsequent analyses demonstrate that nitrotyrosine, sulfotyrosine, and phosphotyrosine differ significantly in their sequence profiles and that evolutionary information exerts the greatest influence on these models. Comparisons with existing tools show that TyrPred exhibits competitive performance in predicting tyrosine PTM sites. Finally, we have implemented our method as an online tool, which can be freely accessed for academic research at http://computbiol.ncu.edu.cn/TyrPred.



METHODS

Data Collection and Preprocessing. We collected data from several databases, including UniProtKB/Swiss-Prot (November 15, 2016),42 SysPTM,43 dbPTM,44 and PhosphoSitePlus (April 1, 2017),45 as well as the relevant literature. As previously described,20 homologous protein sequences were removed with CD-HIT at a threshold of 30% to avoid overestimating the prediction accuracy.46 In total, the nonredundant data sets included 1038 experimentally verified nitrotyrosine sites as positive data for training and 76 nitrotyrosine sites as positive data for testing. The 90 nonredundant sulfated proteins were divided into two parts: 75 proteins with 132 sulfated sites were used for model training, and 15 proteins with 23 sulfated sites were used for the independent test. The 2520 kinase-specific tyrosine phosphorylation site entries were divided into nine single kinases (Abl, EGFR, FYN, InsR, JAK2, Lck, LYN, Src, Syk), six kinase families (Abl, EGFR, InsR, JakA, Src, Syk), and one kinase group (TK) (see details in the Supporting Information).47 All negative samples were tyrosine residues in these proteins, excluding all experimentally verified nitrated, sulfated, and phosphorylated sites. The window length of the peptide was set to 15 for nitrotyrosine19 (−7 to +7 relative to the central annotated nitrated or non-nitrated tyrosine), 9 for sulfotyrosine20,24,27,48 (−4 to +4 relative to the central annotated sulfated or nonsulfated tyrosine), and 15 for kinase-specific tyrosine phosphorylation.18,21,37 The statistics of the nitrated, sulfated, and phosphorylated proteins and sites are shown in Tables S1−3 in the Supporting Information.

Feature Extraction. Three types of original features, namely sequence-derived information, evolutionary information, and physicochemical property information, were selected in the first stage for building the prediction model.

• Sequence information feature. (i) The amino acid composition (AAC) feature, which reflects the frequency of amino acid occurrence, can be beneficial for predicting PTM sites.49,50 (ii) Binary encoding (BE) transforms each amino acid into a 20-dimensional binary vector. (iii) K-spaced amino acid pair composition (K-spaced) reveals the traits of the residues surrounding modification sites and is also a useful feature in the prediction of phosphorylation sites.51 (iv) Position weight amino acid composition (PWAA) captures the sequence order information of amino acid residues around nitrotyrosine sites, sulfated tyrosine sites, and kinase-specific tyrosine phosphorylation sites. Detailed information about AAC, BE, K-spaced, and PWAA is given in the Supporting Information.

• Evolutionary information feature. The K nearest neighbors (KNN) algorithm, which was first used in text classification,50,52 has since shown good predictive performance.21,50,53 To take full advantage of the cluster information of local sequence fragments for predicting tyrosine PTMs, we used this algorithm to extract features from both the positive and negative data sets. For two local sequence fragments s1 and s2, the distance D(s1, s2) is defined as

$$D(s_1, s_2) = 1 - \frac{\sum_{i=-L}^{L} \mathrm{sim}(s_1(i), s_2(i))}{2L + 1} \tag{1}$$

$$\mathrm{sim}(a, b) = \frac{M(a, b) - \min(M)}{\max(M) - \min(M)} \tag{2}$$

where a and b represent two amino acids; sim is derived from the BLOSUM62 matrix;54 M is the substitution matrix; L denotes the number of upstream or downstream amino acids flanking the target tyrosine residue; and min(M) and max(M) are the smallest and largest numbers in the matrix, respectively.

• Physicochemical properties feature. As mentioned above, tyrosine nitration and sulfation need the presence of a nearby negative charge. In our previous work, we found that a prediction model of tyrosine sulfation achieved good performance using only the encoding based on grouped weight (EBGW) feature vector.20 Accordingly, we adopted this encoding scheme to represent physicochemical property information from protein sequences. EBGW divides the 20 amino acid residues into four classes based on charge and hydrophobicity.55,56 We then divided the amino acid residues into three disjoint groups; a residue is encoded as one when it belongs to the group under consideration and as zero otherwise (see details in the Supporting Information).

Feature Optimization. Because the feature vectors are numerous and heterogeneous, feature optimization is necessary. We therefore introduced elastic net to select important feature vectors and then constructed the optimized model by using 10-fold cross-validation. The least absolute shrinkage and selection operator (lasso) was proposed by Tibshirani.57,58 The lasso is based on an L1 penalty, which can shrink regression coefficients to exactly 0. Owing to the nature of the L1 penalty, the lasso performs continuous shrinkage and automatic variable selection simultaneously. Given a linear regression with standardized predictors x_ij and centered response values y_i for i = 1, 2, ..., N and j = 1, 2, ..., p, the lasso solves the L1-penalized problem of finding β = {β_j} to minimize

$$\sum_{i=1}^{N} \Bigl( y_i - \sum_{j} x_{ij}\beta_j \Bigr)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \tag{3}$$

This is equivalent to minimizing the sum of squares subject to the constraint ∑|β_j| ≤ λ2. Owing to the form of the L1 penalty, the lasso can be applied to model selection for high-dimensional data. Although the lasso has been successful in many cases, it has disadvantages. For instance, when the number of predictor variables p is greater than the number of observations n, it selects at most n variables before it saturates. In addition, if there is a group of variables with very high pairwise correlations, the lasso tends to select one variable from the group at random. In this paper we therefore use the elastic net, proposed by Zou and Hastie,59 which addresses both problems. Like the lasso, the elastic net performs automatic variable selection and continuous shrinkage simultaneously, and it can select groups of correlated variables. Real data examples and simulation studies indicate that the elastic net is often superior to the lasso in prediction performance. The elastic net criterion is defined as follows:

$$L(\mu_1, \lambda_1, \beta) = |y - X\beta|^2 + \lambda_1 |\beta|^2 + \mu_1 |\beta|_1 \tag{4}$$

where μ1 and λ1 are any fixed non-negative numbers, $|\beta|^2 = \sum_{j=1}^{p} \beta_j^2$, and $|\beta|_1 = \sum_{j=1}^{p} |\beta_j|$. The elastic net estimator β̂ is the minimizer of eq 4:

$$\hat{\beta} = \arg\min_{\beta} \{ L(\mu_1, \lambda_1, \beta) \} \tag{5}$$
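To make the criterion in eq 4 concrete, the following is a minimal sketch of how it could be minimized by cyclic coordinate descent, with the nonzero coefficients read off as the selected features. The paper does not state which solver was used, so the solver choice, the function names (`soft_threshold`, `elastic_net_cd`, `selected_features`), and the toy data are all assumptions made for illustration. The sketch also assumes that no column of X is identically zero when λ1 = 0.

```python
def soft_threshold(rho, t):
    """Soft-thresholding operator: sign(rho) * max(|rho| - t, 0)."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def elastic_net_cd(X, y, mu1, lam1, n_iter=100):
    """Minimize |y - X b|^2 + lam1 * |b|^2 + mu1 * |b|_1 (eq 4).

    Illustrative cyclic coordinate descent; X is a list of N rows with
    p entries each, y is a list of N responses.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual that leaves out the contribution of feature j.
            r = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n))
            # Closed-form minimizer of the one-dimensional subproblem:
            # (z + lam1) * b^2 - 2 * rho * b + mu1 * |b|.
            beta[j] = soft_threshold(rho, mu1 / 2.0) / (z + lam1)
    return beta


def selected_features(beta, tol=1e-12):
    """Indices of features whose coefficient was not shrunk to zero."""
    return [j for j, b in enumerate(beta) if abs(b) > tol]


# Toy data (hypothetical): feature 0 drives y, feature 1 is orthogonal noise.
X = [[-1.5, 0.5], [-0.5, -0.5], [0.5, -0.5], [1.5, 0.5]]
y = [-3.0, -1.0, 1.0, 3.0]
beta = elastic_net_cd(X, y, mu1=2.0, lam1=1.0)
print(beta, selected_features(beta))
```

In this toy case the uninformative second feature receives a coefficient of exactly zero and is dropped, while the informative first feature is kept with a shrunken coefficient. Setting `lam1=0` gives the lasso and setting `mu1=0` gives ridge regression.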


The minimization in eq 5 can be regarded as a penalized least-squares method. Let $\alpha = \lambda_1 / (\mu_1 + \lambda_1)$; then solving for β̂ is equivalent to the optimization problem

$$\hat{\beta} = \arg\min_{\beta} |y - X\beta|^2 \quad \text{subject to} \quad (1 - \alpha)|\beta|_1 + \alpha|\beta|^2 \le \lambda_2 \ \text{for some } \lambda_2 \tag{6}$$

The function $(1 - \alpha)|\beta|_1 + \alpha|\beta|^2$ is defined as the elastic net penalty, for which we consider only α < 1. When α = 1, the elastic net reduces to simple ridge regression; when α = 0, it becomes the lasso. In this work, we used (λ1, λ2) as the tuning parameters of the elastic net and selected the optimal feature vectors based on the optimized parameters. Algorithm classification describes the detailed procedure of our feature selection method based on elastic net.

Model Evaluation. An SVM60 was used to evaluate the effects of the different types of features. We chose the radial basis function kernel to map the input samples into a higher dimensional space, and the SVM was trained to distinguish positive from negative samples in TyrPred, with parameters chosen by the grid search strategy in LIBSVM. For the implementation, we used the LIBSVM package (https://www.csie.ntu.edu.tw/~cjlin/libsvm). A standard SVM predicts only a class label without probability information. Chang and Lin60 discussed the LIBSVM implementation for extending SVM to give probability estimates, and Wu et al.61 obtained probability estimates by combining all pairwise class probabilities (see details in the Supporting Information). Following this approach, we kept the parameters obtained by training on our training data sets with the LIBSVM function svmtrain and then used the function svmpredict to predict the testing data with these parameters. The estimated probability computed by LIBSVM was taken as the SVM probability. In addition, four measures, namely accuracy (Acc), sensitivity (Sn), specificity (Sp), and the Matthews correlation coefficient (MCC), were employed to evaluate the prediction performance. They are defined as follows:

$$\mathrm{Sn} = \frac{TP}{TP + FN} \tag{7}$$

$$\mathrm{Sp} = \frac{TN}{TN + FP} \tag{8}$$

$$\mathrm{Acc} = \frac{TP + TN}{TP + FP + TN + FN} \tag{9}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}} \tag{10}$$

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. The receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC) were also computed.
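As a worked illustration of eqs 7−10, the sketch below computes the four measures from confusion-matrix counts. The function name `confusion_metrics` and the example counts are hypothetical, chosen only to exercise the formulas. Returning an MCC of 0 when the denominator vanishes is a common convention assumed here, not something stated in the paper. The sketch also assumes that both classes are present, so that TP + FN and TN + FP are nonzero.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Return Sn, Sp, Acc, and MCC (eqs 7-10) as fractions in [0, 1] or [-1, 1]."""
    sn = tp / (tp + fn)                      # eq 7: sensitivity
    sp = tn / (tn + fp)                      # eq 8: specificity
    acc = (tp + tn) / (tp + fp + tn + fn)    # eq 9: accuracy
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Assumed convention: MCC is reported as 0 when the denominator is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0   # eq 10
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Hypothetical counts for a balanced test set of 23 positives and 23 negatives.
metrics = confusion_metrics(tp=22, tn=21, fp=2, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Multiplying the returned fractions by 100 gives the percentage form used in the tables of this article.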



RESULTS AND DISCUSSION

Feature Optimization Results via Elastic Net. Our previous experiments showed that combining all six features (AAC, BE, K-spaced, PWAA, KNN, and EBGW) yields a better predictor than encoding with any single feature. However, if we combine all six

Table 2. Comparison of Model Performance before and after Dimension Reduction in Tyrosine Nitration, Sulfation, and Kinase-Specific Phosphorylationᵃ

| modification | before: dim | before: Acc (%) | before: Sn (%) | before: Sp (%) | before: MCC (%) | after: dim | after: Acc (%) | after: Sn (%) | after: Sp (%) | after: MCC (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| nitration | 2582 | 64.90 | 64.66 | 65.14 | 29.90 | 470 | 79.67 | 79.76 | 79.57 | 59.40 |
| sulfation | 2455 | 80.33 | 75.18 | 85.49 | 62.70 | 144 | 94.82 | 94.15 | 95.49 | 90.12 |
| EGFR | 2581 | 77.41 | 68.57 | 86.25 | 56.36 | 72 | 97.50 | 97.50 | 97.50 | 95.28 |
| Src | 2581 | 74.66 | 73.52 | 75.80 | 49.72 | 497 | 85.78 | 86.46 | 85.10 | 71.66 |
| TK | 2582 | 75.44 | 76.55 | 74.34 | 50.90 | 396 | 82.86 | 82.86 | 82.86 | 65.86 |

ᵃdim refers to the dimension of feature vectors.


Figure 1. Accuracies of elastic net compared with other feature selection methods for nitration, sulfation, and Src.

Figure 2. Comparison of the feature ratios for tyrosine nitration, sulfation, and kinase-specific phosphorylation. The vertical axis represents the log2 ratio of selected to total feature dimensions for each single feature in the nitration, sulfation, and kinase-specific phosphorylation models. The horizontal axis represents the six features (K-spaced, KNN, AAC, BE, EBGW, and PWAA).

Figure 3. Sequence logos illustrating tyrosine PTM sequence information, generated by Two Sample Logo.

features, the feature vectors have more than 2000 dimensions in total. Such high-dimensional features make model training time-consuming, and redundancy among the feature vectors may bias the prediction performance. Moreover, not all features are equally essential for model performance. It is therefore necessary to reduce the dimensionality so that only the important features are retained. We built 18 models: 1 nitration model, 1 sulfation model, 9 single-kinase models, 6 kinase family models, and 1 kinase group model. The prediction performance with all six features combined and with the selected features for the three tyrosine PTM types is shown in Table 2 (dimensional information for the other kinases is given in Tables S4−6 in the Supporting Information). The selected feature vectors give better prediction performance than the six combined features. For instance, the SVM accuracy of the kinase family Src model with all six features (AAC, BE, K-spaced, KNN, EBGW, and PWAA) is 74.66%, whereas the accuracy of the Src model with the selected features is 85.78%. This improvement of 11.1 percentage points indicates that feature optimization can build a high-performance model.

Comparison of Elastic Net with Other Feature Selection Methods. In previous prediction models of protein PTM sites, information gain (IG),62 F-score,63 and maximum relevance minimum redundancy-incremental feature selection (mRMR-IFS)64 were commonly used for feature selection. We therefore also carried out feature optimization with IG, F-score, and mRMR-IFS and compared their results with those of elastic net. Figure 1 and Table S7 show that, for feature vectors of the same dimension, the model built with elastic net outperforms those built with IG and F-score. For instance, the accuracy improves by 9%, 12%, and 7% for nitration, sulfation, and Src, respectively, in comparison with IG, and by 7%, 10%, and 6%, respectively, in comparison with F-score. With IG and F-score, the higher the ranking of a feature vector, the greater its impact on the tyrosine PTM sites. Elastic net, in contrast, depends on (λ1, λ2) to select feature vectors and chooses the optimal λ2 corresponding to each λ1. The λ2 values of elastic net differ considerably among dimensions. For example, we used elastic net to optimize the dimensionality of the feature vectors when generating the predictive model of nitration. First, we set λ1 = 0.5, applied elastic net to all feature vectors of nitration, and obtained a λ2 value for each feature vector; all feature vectors were then ranked by their λ2 values from small to large. The lower the λ2 value, the more valuable the corresponding feature vector. Then, to optimize λ2, we trained on the feature vectors with λ2 values less than 0.08, 0.12, 0.16, 0.20, and 0.24, respectively, using 10-fold cross-validation, and chose the optimal λ2 value for λ1 = 0.5 according to the maximum Acc. Subsequently, we applied the same procedure to λ1 = 0.1, 0.2,


Figure 4. (A) Heat map of nitration (the others are shown in Figure S1). (B) KNN scores for nitration, sulfation, and kinase-specific phosphorylation sequences versus non-nitration, nonsulfation, and nonphosphorylation sequences. The vertical axis denotes the average KNN scores. The horizontal axis denotes the number of nearest neighbors.

Figure 5. Comparison of EBGW between nitration, sulfation, and kinase-specific phosphorylation sequences and non-nitration, nonsulfation, and nonphosphorylation sequences. The vertical axis denotes the log2 ratio of the average EBGW values between the modified and unmodified sequences. The horizontal axis denotes the three binary sequences.

Figure 6. ROC curves and AUC values for 10-fold cross-validations (CV) of the training sets for three modifications.

0.3, 0.4. Finally, according to the optimal parameters λ1 = 0.1 and λ2 = 0.12, we selected the top 470 of the 2582 feature dimensions to form the feature set for nitrotyrosine prediction (the optimized parameters for the other tyrosine PTMs are shown in Supplementary Table S6). The mRMR-IFS method quantifies both relevance and redundancy through mutual information and uses incremental feature selection (IFS) to train the model. However, for high-dimensional data, constructing the optimal feature set that yields the highest overall accuracy can be time-consuming. Figure 1 shows that the accuracy improves by 8%, 5%, and 6% for the three types of tyrosine PTM, respectively, in comparison with mRMR. Overall, the results show that feature selection by elastic net significantly outperforms selection by IG, F-score, and mRMR in the prediction of tyrosine PTM sites.

Analysis of Feature Importance and Contribution. As discussed above, elastic net has comparative advantages in feature selection. We next analyzed which feature vectors are important to the prediction model, based on the optimized features selected by elastic net. Taking tyrosine nitration as an example (Figure 2), the new 470-dimensional feature set was drawn from the six features above, and the ratios of selected to total dimensions for the six features are 0.1 (2/20), 0.12 (39/315), 0.19 (420/2205), 0, 1 (6/6), and 0.2 (3/15), respectively. The proportion for the KNN feature is notably higher than those for the other five features, indicating that KNN features have a vital effect on model performance and contribute to the prediction of tyrosine PTM sites. The KNN features capture conserved residues and local sequence similarity, and they exhibit the best performance in these models. The EBGW feature has a higher proportion than the remaining four features, implying that physicochemical properties are also comparatively important for predicting tyrosine modification sites. In contrast, the proportions of selected dimensions for the PWAA and BE features are relatively small, which implies that most of their dimensions have little influence on the model and that these two features are not as important as the KNN feature.

Furthermore, the feature optimization results for the different tyrosine PTMs reveal that the elastic net method achieves higher prediction accuracy by fully considering the importance and contribution of each feature dimension.

Feature Analysis of Different Types of Tyrosine PTMs. We further investigated the differences among nitrotyrosine, sulfotyrosine, and phosphotyrosine in terms of their features. We applied the Two Sample Logo tool to determine statistically significant residues and to present the compositional biases surrounding the different modification sites (Figure 3).65 For this analysis, we chose EGFR, Src, and TK as representatives of the single kinases, kinase families, and kinase group, respectively. In Figure 3, amino acid residues that are prominently enriched or depleted in tyrosine PTM sequences are easily identified. In tyrosine sulfation, we find that aspartate and glutamate, which are negatively charged, are enriched in the whole fragment, while


Table 3. Comparison of Prediction Performance between Our Method (Tyrosine Nitration, Sulfation, and Kinase-Specific Phosphorylation Models) and Other Tools

| modification | method | stringency | Acc (%) | Sn (%) | Sp (%) | MCC (%) | AUC (%) |
|---|---|---|---|---|---|---|---|
| nitration | GPS | high | 85.52 | 81.39 | 90.91 | 71.67 | 84.91 |
| | | medium | 85.52 | 83.75 | 87.50 | 71.15 | 82.67 |
| | | low | 86.18 | 86.67 | 85.71 | 72.37 | 89.54 |
| | iNitro-Tyr | | 76.97 | 69.90 | 91.84 | 64.07 | 80.74 |
| | our work | | 84.87 | 88.16 | 86.58 | 69.89 | 89.35 |
| sulfation | GPS | high | 78.26 | 70.96 | 93.33 | 60.28 | 81.94 |
| | | medium | 84.78 | 80.76 | 90.00 | 70.16 | 85.53 |
| | | low | 84.78 | 83.33 | 86.36 | 69.63 | 84.97 |
| | The Sulfinator | | 82.61 | 85.71 | 80.00 | 62.94 | 84.84 |
| | Sulfotyrosine | | 82.61 | 89.47 | 77.78 | 66.23 | 85.20 |
| | Predsulsite | | 80.43 | 60.87 | 1.00 | 66.14 | 85.97 |
| | our work | | 93.48 | 95.65 | 91.30 | 87.04 | 93.58 |
| EGFR | GPS | high | 75.00 | 68.42 | 88.89 | 53.53 | 78.51 |
| | | medium | 78.57 | 75.00 | 83.33 | 57.74 | 79.17 |
| | | low | 78.57 | 75.00 | 83.33 | 57.74 | 79.17 |
| | our work | | 91.18 | 94.12 | 88.24 | 82.50 | 92.86 |
| Src | PSEA | high | 75.38 | 76.15 | 74.62 | 50.78 | 71.87 |
| | | medium | 75.38 | 77.69 | 73.08 | 50.82 | 73.17 |
| | | low | 74.23 | 81.54 | 66.92 | 50.00 | 72.37 |
| | GPS | high | 77.69 | 71.95 | 87.50 | 57.38 | 77.73 |
| | | medium | 80.00 | 76.35 | 84.82 | 60.58 | 75.48 |
| | | low | 80.77 | 78.57 | 83.33 | 61.72 | 76.91 |
| | Musite | high | 50.77 | 61.18 | 94.12 | 58.56 | 74.67 |
| | | medium | 53.46 | 74.12 | 89.41 | 64.29 | 75.11 |
| | | low | 55.77 | 90.59 | 80.00 | 70.99 | 77.01 |
| | our work | | 82.69 | 83.08 | 82.31 | 65.39 | 83.69 |
| TK | PSEA | high | 67.70 | 66.39 | 69.23 | 35.51 | 70.07 |
| | | medium | 68.58 | 66.54 | 71.21 | 38.22 | 70.63 |
| | | low | 68.36 | 65.31 | 72.93 | 37.48 | 71.31 |
| | GPS | high | 63.50 | 58.79 | 79.05 | 31.96 | 71.24 |
| | | medium | 66.37 | 61.94 | 76.06 | 35.27 | 72.28 |
| | | low | 68.36 | 66.40 | 70.85 | 36.99 | 72.63 |
| | our work | | 78.91 | 77.34 | 80.47 | 57.84 | 80.39 |

positively charged amino acids such as arginine and lysine and hydrophobic amino acids including alanine, leucine, and valine are depleted. This result is consistent with that described earlier. The heat maps of BE (Figure 4A and Figure S1) visualize the position-specific distribution of residues around the tyrosine PTM sites. We find that tryptophan tends to be enriched at position +4, threonine at position −3, and glutamate at position −2 in tyrosine nitration, whereas isoleucine and valine tend to be enriched at position −1 and phenylalanine, leucine, and proline at position +3 in the tyrosine kinase family Src. Furthermore, in Figure S2, aspartate and glutamate tend to be enriched in the positive fragments in tyrosine sulfation, which is consistent with the sequence logo result. Meanwhile, glutamate is enriched in the sequences of all three modifications, whereas polar amino acids including cysteine and tryptophan are depleted to different degrees in the three modifications. In addition, we compared the KNN scores and the EBGW ratios among the three types of tyrosine PTM. In Figure 4B, the KNN scores differ significantly between tyrosine sulfation and nonsulfation sequences across the different values of k. For sulfation, the average KNN scores are within 0.7−0.9, which may contribute to its higher accuracy compared with the other modifications. Meanwhile, the gap between the positive data sets and negative data sets in nitration is slight, which may explain why its accuracy is lower than that of sulfation. In Figure 5, the average log2 ratio of the EBGW scores of H1 (1−5) between the nitration, sulfation, and kinase-specific phosphorylation sequences and the non-nitration, nonsulfation, and nonphosphorylation sequences is below zero, implying that more charged groups exist around the nitration, sulfation, and kinase-specific phosphorylation sites. The average ratio of the EBGW scores of H2 (6−10) is less than zero, with the exception of nitration, which indicates that there are more negatively charged residues surrounding sulfation and kinase-specific phosphorylation sites. This result is also consistent with the above analysis. In short, there are significant differences among the modifications, which could be a helpful feature for predicting the type of tyrosine modification.

Model Performance Evaluation. To evaluate the robustness and performance of TyrPred, leave-one-out (LOO) validation and k-fold cross-validation (k = 2, 4, 6, 8, 10) were performed on each data set (Figure S3). In 10-fold cross-validation, the nitration, sulfation, and kinase-specific phosphorylation (EGFR, Src, and TK) models achieved AUCs of 0.8, 0.952, 0.977, 0.856, and 0.838, respectively (Figure 6). For EGFR, the AUC values of k-fold cross-validation (k = 2, 4, 6, 8, 10) were 0.945, 0.967, 0.952, 0.976,


and 0.977. In general, the results of the k-fold cross-validations (k = 2, 4, 6, 8, 10) and the LOO method were highly consistent, especially for nitration and TK, which might be owing to the larger data sets available for training, implying that TyrPred is a robust and stable predictor with satisfactory performance.

Comparison with Other Existing Tools. To evaluate the prediction performance of TyrPred objectively, we compared it with other existing tools, including GPS-YNO2,19 iNitro-Tyr,22 Sulfinator,27 Sulfotyrosine,23 PredSulSite,20 GPS 3.0, PSEA,21 and Musite18 (Tables 3, S8, and S9). Because each of these tools can predict only nitration, sulfation, or phosphorylation sites in proteins, we directly submitted each testing data set for prediction, and the SVM results of TyrPred were used for comparison (Figure S4). For the general prediction of nitrotyrosine, the AUC value of TyrPred is 0.894 (Table 3). Compared with iNitro-Tyr, the AUC value of our model is improved by 9%. Meanwhile, the AUC value of GPS-YNO2 (Low) is 0.895, comparable with that of TyrPred. For Src, the GPS, Musite, and PSEA tools have to sacrifice Sn to achieve high Sp, which leads to smaller values of MCC and AUC. For sulfation, Sulfinator has a proper balance between Sp and Sn, but its values of MCC and AUC are not satisfactory. In contrast, TyrPred not only keeps a proper balance between Sn and Sp but also attains large values of AUC and MCC. For example, for tyrosine sulfation the TyrPred model achieves Sn of 95.65%, Sp of 91.30%, Acc of 93.48%, MCC of 0.870, and AUC of 0.936, while the other four methods have relatively low values. This suggests that TyrPred performs better than the general tools. The following reasons may account for such a big difference. First, we consider comprehensive information when encoding the protein sequence, including sequence-derived information, evolutionary information, and physicochemical property information, combining multiple features to understand the protein PTM mechanism from different perspectives. Furthermore, the difference may be attributed to the evolutionary information, which is regarded as the important feature for tyrosine PTM in the analysis of all feature importance.

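The k-fold and LOO protocol from the Model Performance Evaluation paragraph can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `kfold_splits`, `auc`, and `cross_validated_auc` are hypothetical, a toy midpoint scorer stands in for the SVM, and pooling the out-of-fold scores into a single AUC (rather than averaging per-fold AUCs) is an assumption, since the text does not say which was used.

```python
import random

def kfold_splits(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; k == n_samples gives leave-one-out."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def auc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def cross_validated_auc(X, y, k, fit, predict):
    """Score every sample with a model that never saw it, then compute one AUC."""
    scores = [0.0] * len(y)
    for train, test in kfold_splits(len(y), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in test:
            scores[i] = predict(model, X[i])
    return auc(y, scores)

# Toy stand-in for the SVM: the "model" is the midpoint between the class means.
def fit(X, y):
    mean = lambda v: sum(v) / len(v)
    pos_mean = mean([x for x, t in zip(X, y) if t == 1])
    neg_mean = mean([x for x, t in zip(X, y) if t == 0])
    return (pos_mean + neg_mean) / 2

def predict(model, x):
    return x - model  # larger score means more likely a modified site

X = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]   # made-up one-dimensional features
y = [0, 0, 0, 0, 1, 1, 1, 1]
for k in (4, len(y)):                            # len(y) folds = leave-one-out
    print(k, cross_validated_auc(X, y, k, fit, predict))
```

Treating LOO as the special case k = n keeps a single code path for every setting compared in the text. In practice a stratified split would be preferable so that each training fold is guaranteed to contain both classes.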
CONCLUSIONS

Using multiple features selected by elastic net, we have presented a novel predictor, TyrPred, for identifying potential tyrosine nitration, sulfation, and kinase-specific phosphorylation sites based on primary protein sequences. The corresponding analyses and the comparison with existing tools demonstrate that TyrPred is stable and delivers satisfactory prediction performance. Feature analysis shows that nitrotyrosine, sulfotyrosine, and phosphotyrosine differ significantly from nonnitrotyrosine, nonsulfotyrosine, and nonphosphotyrosine in sequence-derived information, evolutionary information, and physicochemical property information. Meanwhile, feature optimization indicates that the KNN feature is significant and exerts a great influence on the prediction model. Additionally, an online service with a user-friendly interface to run locally is provided for TyrPred; it works in Google Chrome, Mozilla Firefox, and Internet Explorer to provide a robust service. Given a standard FASTA sequence, TyrPred efficiently returns the prediction results, including the protein name, the position of the site, the flanking amino acids, and the SVM probability. We anticipate that TyrPred could afford useful information on the identification of tyrosine nitration, sulfation, and kinase-specific phosphorylation sites.

ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.7b00688.

Supplemental illustration; heat map (Figure S1); average AAC value for tyrosine PTM sites (Figure S2); ROC curve of tyrosine PTMs (Figure S3); the statistics of nitration, sulfation, and kinase-specific tyrosine phosphorylation data sets (Tables S1−3); comparison of model performance before and after dimension reduction in kinase-specific tyrosine phosphorylation (Tables S4−5); tyrosine PTM optimization parameter with elastic net (Table S6); comparison of elastic net with other feature selection methods (Table S7); comparison of the prediction performance of independent test in tyrosine phosphorylation data sets (Tables S8−9) (PDF)

AUTHOR INFORMATION

Corresponding Author

*Tel.: +86-13879153564. Fax: 86-791-86355377. E-mail: [email protected]

ORCID

Shaoping Shi: 0000-0003-0045-868X

Notes

The authors declare no competing financial interest.

ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China (21665016 and 21305062) and the Natural Science Foundation of Jiangxi Province (20151BAB203022).

ABBREVIATIONS

PTMs, post-translational modifications; GPS, group-based prediction system; GSEA, gene set enrichment analysis; AAC, amino acid composition; BE, binary encoded; K-spaced, k-spaced amino acid pair composition; PWAA, position weight amino acid compositions; KNN, k nearest neighbors; EBGW, encoding based on grouped weight; lasso, least absolute shrinkage and selection operator; Acc, accuracy; Sn, sensitivity; Sp, specificity; MCC, Matthews correlation coefficient; SVM, support vector machine; LOO, leave-one-out; ROC, receiver operating characteristic; AUC, area under ROC; IG, information gain; mRMR-IFS, maximum relevance minimum redundancy-incremental feature selection



REFERENCES

(1) Abello, N.; Kerstjens, H. A.; Postma, D. S.; Bischoff, R. Protein Tyrosine Nitration: Selectivity, Physicochemical and Biological Consequences, Denitration, and Proteomics Methods for the Identification of Tyrosine-nitrated Proteins. J. Proteome Res. 2009, 8, 3222.
(2) Huttner, W. B. Protein Tyrosine Sulfation. Trends Biochem. Sci. 1987, 12, 361−363.
(3) Gow, A. J.; Duran, D.; Malcolm, S.; Ischiropoulos, H. Effects of Peroxynitrite-induced Protein Modifications on Tyrosine Phosphorylation and Degradation. FEBS Lett. 1996, 385, 63.
(4) Kamisaki, Y.; Wada, K.; Bian, K.; Balabanli, B.; Davis, K.; Martin, E.; Behbod, F.; Lee, Y. C.; Murad, F. An Activity in Rat Tissues that Modifies Nitrotyrosine-Containing Proteins. Proc. Natl. Acad. Sci. U. S. A. 1998, 95, 11584−11589.
(5) Monigatti, F.; Hekking, B.; Steen, H. Protein Sulfation Analysis–A Primer. Biochim. Biophys. Acta, Proteins Proteomics 2006, 1764, 1904−1913.
(6) Koltsova, E.; Ley, K. Tyrosine Sulfation of Leukocyte Adhesion Molecules and Chemokine Receptors Promotes Atherosclerosis. Arterioscler., Thromb., Vasc. Biol. 2009, 29, 1709−1711.
(7) Liu, J.; Louie, S.; Hsu, W.; Yu, K. M.; Nicholas, J. H.; Rosenquist, G. L. Tyrosine Sulfation is Prevalent in Human Chemokine Receptors Important in Lung Disease. Am. J. Respir. Cell Mol. Biol. 2008, 38, 738−743.
(8) Farzan, M.; Babcock, G. J.; Vasilieva, N.; Wright, P. L.; Kiprilov, E.; Mirzabekov, T.; Choe, H. The Role of Post-translational Modifications of the CXCR4 Amino Terminus in Stromal-derived Factor 1 Alpha Association and HIV-1 Entry. J. Biol. Chem. 2002, 277, 29484−29489.
(9) Souza, J. M.; Daikhin, E.; Yudkoff, M.; Raman, C. S.; Ischiropoulos, H. Factors Determining the Selectivity of Protein Tyrosine Nitration. Arch. Biochem. Biophys. 1999, 371, 169−178.
(10) Pawson, T. Specificity in Signal Transduction: From Phosphotyrosine-SH2 Domain Interactions to Complex Cellular Systems. Cell 2004, 116, 191−203.
(11) Sobolev, B.; Filimonov, D.; Lagunin, A.; Zakharov, A.; Koborova, O.; Kel, A.; Poroikov, V. Functional Classification of Proteins Based on Projection of Amino Acid Sequences: Application for Prediction of Protein Kinase Substrates. BMC Bioinf. 2010, 11, 313.
(12) Zaragozá, R.; Torres, L.; García, C.; Eroles, P.; Corrales, F.; Bosch, A.; Lluch, A.; García-Trevijano, E. R.; Viña, J. R. Nitration of Cathepsin D Enhances its Proteolytic Activity during Mammary Gland Remodelling after Lactation. Biochem. J. 2009, 419, 279−288.
(13) Kers, J. A.; Wach, M. J.; Krasnoff, S. B.; Widom, J.; Cameron, K. D.; Bukhalid, R. A.; Gibson, D. M.; Crane, B. R.; Loria, R. Nitration of a Peptide Phytotoxin by Bacterial Nitric Oxide Synthase. Nature 2004, 429, 79.
(14) Heathfield, T. F.; Onnerfjord, P. L.; Heinegard, D.; Dahlberg, L. Cleavage of Fibromodulin in Cartilage Explants Involves Removal of the N-terminal Tyrosine Sulfate-rich Region by Proteolysis at a Site that is Sensitive to Matrix Metalloproteinase-13. J. Biol. Chem. 2004, 279, 6286.
(15) Yu, Y.; Hoffhines, A. J.; Moore, K. L.; Leary, J. A. Determination of the Sites of Tyrosine O-sulfation in Peptides and Proteins. Nat. Methods 2007, 4, 583−588.
(16) Salek, M.; Alonso, A.; Pipkorn, R.; Lehmann, W. D. Analysis of Protein Tyrosine Phosphorylation by Nanoelectrospray Ionization High-resolution Tandem Mass Spectrometry and Tyrosine-targeted Product Ion Scanning. Anal. Chem. 2003, 75, 2724.
(17) Xue, Y.; Ren, J.; Gao, X.; Jin, C.; Wen, L.; Yao, X. GPS 2.0, A Tool to Predict Kinase-specific Phosphorylation Sites in Hierarchy. Mol. Cell. Proteomics 2008, 7, 1598.
(18) Gao, J.; Thelen, J. J.; Dunker, A. K.; Xu, D. Musite, A Tool for Global Prediction of General and Kinase-specific Phosphorylation Sites. Mol. Cell. Proteomics 2010, 9, 2586−2600.
(19) Liu, Z.; Cao, J.; Ma, Q.; Gao, X.; Ren, J.; Xue, Y. GPS-YNO2: Computational Prediction of Tyrosine Nitration Sites in Proteins. Mol. BioSyst. 2011, 7, 1197.
(20) Huang, S. Y.; Shi, S. P.; Qiu, J. D.; Sun, X. Y.; Suo, S. B.; Liang, R. P. PredSulSite: Prediction of Protein Tyrosine Sulfation Sites with Multiple Features and Analysis. Anal. Biochem. 2012, 428, 16−23.
(21) Suo, S. B.; Qiu, J. D.; Shi, S. P.; Chen, X.; Liang, R. P. PSEA: Kinase-specific Prediction and Analysis of Human Phosphorylation Substrates. Sci. Rep. 2015, 4, 4524.
(22) Xu, Y.; Wen, X.; Wen, L. S.; Wu, L. Y.; Deng, N. Y.; Chou, K. C. iNitro-Tyr: Prediction of Nitrotyrosine Sites in Proteins with General Pseudo Amino Acid Composition. PLoS One 2014, 9, e105018.
(23) Jia, C.; Zhang, Y.; Wang, Z. SulfoTyrP: A High Accuracy Predictor of Protein Sulfotyrosine Sites. Match-Commun. Math. Co. 2014, 71, 227−240.
(24) Guo, S.; Liu, C.; Zhou, P.; Li, Y. A Multifeatures Fusion and Discrete Firefly Optimization Method for Prediction of Protein Tyrosine Sulfation Residues. BioMed Res. Int. 2016, 2016, 1−8.
(25) Pan, Z. C.; Liu, Z. X.; Cheng, H.; Wang, Y. B.; Gao, T. S.; Ullah, S.; Ren, J.; Xue, Y. Systematic Analysis of the In Situ Crosstalk of Tyrosine Modifications Reveals No Additional Natural Selection on Multiply Modified Residues. Sci. Rep. 2015, 4, 11.
(26) Fan, W.; Xu, X.; Shen, Y.; Feng, H.; Li, A.; Wang, M. Prediction of Protein Kinase-specific Phosphorylation Sites in Hierarchical Structure Using Functional Information and Random Forest. Amino Acids 2014, 46, 1069−1078.
(27) Monigatti, F.; Gasteiger, E.; Bairoch, A.; Jung, E. The Sulfinator: Predicting Tyrosine Sulfation Sites in Protein Sequences. Bioinformatics 2002, 18, 769.
(28) Yang, Z. R. Predicting Sulfotyrosine Sites Using the Random Forest Algorithm with Significantly Improved Prediction Accuracy. BMC Bioinf. 2009, 10, 361.
(29) Blom, N.; Sicheritz-Pontén, T.; Gupta, R.; Gammeltoft, S.; Brunak, S. Prediction of Post-translational Glycosylation and Phosphorylation of Proteins from the Amino Acid Sequence. Proteomics 2004, 4, 1633−1649.
(30) Kim, J. H.; Lee, J.; Oh, B.; Kimm, K.; Koh, I. Prediction of Phosphorylation Sites Using SVMs. Bioinformatics 2004, 20, 3179−3184.
(31) Xue, Y.; Li, A.; Wang, L.; Feng, H.; Yao, X. PPSP: Prediction of PK-specific Phosphorylation Site with Bayesian Decision Theory. BMC Bioinf. 2006, 7, 163.
(32) Wong, Y. H.; Lee, T. Y.; Liang, H. K.; Huang, C. M.; Wang, T. Y.; Yang, Y. H.; Chu, C. H.; Huang, H. D.; Ko, M. T.; Hwang, J. K. KinasePhos 2.0: A Web Server for Identifying Protein Kinase-specific Phosphorylation Sites Based on Sequences and Coupling Patterns. Nucleic Acids Res. 2007, 35, 588−594.
(33) Li, T.; Li, F. X.; Zhang, X. Prediction of Kinase-specific Phosphorylation Sites with Sequence Features by a Log-odds Ratio Approach. Proteins: Struct., Funct., Genet. 2008, 70, 404−414.
(34) Dang, T. H.; Van Leemput, K.; Verschoren, A.; Laukens, K. Prediction of Kinase-specific Phosphorylation Sites Using Conditional Random Fields. Bioinformatics 2008, 24, 2857−2864.
(35) Yoo, P. D.; Ho, Y. S.; Zhou, B. B.; Zomaya, A. Y. SiteSeek: Post-translational Modification Analysis Using Adaptive Locality-effective Kernel Methods and New Profiles. BMC Bioinf. 2008, 9, 272.
(36) Jung, I.; Matsuyama, A.; Yoshida, M.; Kim, D. PostMod: Sequence Based Prediction of Kinase-specific Phosphorylation Sites with Indirect Relationship. BMC Bioinf. 2010, 11, S10.
(37) Zou, L.; Wang, M.; Shen, Y.; Liao, J.; Wang, M.; Li, A. PKIS: Computational Identification of Protein Kinases for Experimentally Discovered Protein Phosphorylation Sites. BMC Bioinf. 2013, 14, 247.
(38) Su, M. G.; Lee, T.-Y. Incorporating Substrate Sequence Motifs and Spatial Amino Acid Composition to Identify Kinase-specific Phosphorylation Sites on Protein Three-dimensional Structures. BMC Bioinf. 2013, 14, S2−S2.
(39) Patrick, R.; Lê Cao, K.-A.; Kobe, B.; Bodén, M. PhosphoPICK: Modelling Cellular Context to Map Kinase-substrate Phosphorylation Events. Bioinformatics 2015, 31, 382−389.
(40) Xue, Y.; Liu, Z.; Cao, J.; Ma, Q.; Gao, X.; Wang, Q.; Jin, C.; Zhou, Y.; Wen, L.; Ren, J. GPS 2.1: Enhanced Prediction of Kinase-specific Phosphorylation Sites with an Algorithm of Motif Length Selection. Protein Eng., Des. Sel. 2011, 24, 255−260.
(41) Wei, L.; Xing, P.; Tang, J.; Zou, Q. PhosPred-RF: A Novel Sequence-based Predictor for Phosphorylation Sites Using Sequential Information Only. IEEE Trans. Nanobioscience 2017, 16, 240.
(42) Chernorudskiy, A. L.; Garcia, A.; Eremin, E. V.; Shorina, A. S.; Kondratieva, E. V.; Gainullin, M. R. UbiProt: A Database of Ubiquitylated Proteins. BMC Bioinf. 2007, 8, 126.
(43) Li, J.; Jia, J.; Li, H.; Yu, J.; Sun, H.; He, Y.; Lv, D.; Yang, X.; Glocker, M. O.; Ma, L.; et al. SysPTM 2.0: An Updated Systematic Resource for Post-translational Modification. Database 2014, 2014, bau025.
(44) Lee, T. Y.; Huang, H. D.; Hung, J. H.; Huang, H. Y.; Yang, Y. S.; Wang, T. H. dbPTM: An Information Repository of Protein Post-translational Modification. Nucleic Acids Res. 2006, 34, D622−D627.
(45) Hornbeck, P. V.; Chabra, I.; Kornhauser, J. M.; Skrzypek, E.; Zhang, B. PhosphoSite: A Bioinformatics Resource Dedicated to Physiological Protein Phosphorylation. Proteomics 2004, 4, 1551−1561.
(46) Huang, Y.; Niu, B.; Gao, Y.; Fu, L.; Li, W. CD-HIT Suite: A Web Server for Clustering and Comparing Biological Sequences. Bioinformatics 2010, 26, 680−682.
(47) Manning, G.; Whyte, D. B.; Martinez, R.; Hunter, T.; Sudarsanam, S. The Protein Kinase Complement of the Human Genome. Science 2002, 298, 1912.
(48) Niu, S.; Huang, T.; Feng, K.; Cai, Y.; Li, Y. Prediction of Tyrosine Sulfation with mRMR Feature Selection and Analysis. J. Proteome Res. 2010, 9, 6490−6497.
(49) Wang, L. N.; Shi, S. P.; Xu, H. D.; Wen, P. P.; Qiu, J. D. Computational Prediction of Species-specific Malonylation Sites via Enhanced Characteristic Strategy. Bioinformatics 2016, btw755.
(50) Weng, S. L.; Huang, K. Y.; Kaunang, F. J.; Huang, C. H.; Kao, H. J.; Chang, T. H.; Wang, H. Y.; Lu, J. J.; Lee, T. Y. Investigation and Identification of Protein Carbonylation Sites Based on Position-specific Amino Acid Composition and Physicochemical Features. BMC Bioinf. 2017, 18, 66.
(51) Zhao, X.; Zhang, W.; Xu, X.; Ma, Z.; Yin, M. Prediction of Protein Phosphorylation Sites by Using the Composition of K-spaced Amino Acid Pairs. PLoS One 2012, 7, e46302.
(52) Tan, S. An Effective Refinement Strategy for KNN Text Classifier. Expert Syst. Appl. 2006, 30, 290−298.
(53) Wang, T.; Zheng, W.; Wuyun, Q.; Wu, Z.; Ruan, J.; Hu, G.; Gao, J. PrAS: Prediction of Amidation Sites Using Multiple Feature Extraction. Comput. Biol. Chem. 2017, 66, 57−62.
(54) Henikoff, S.; Henikoff, J. G. Amino Acid Substitution Matrices From Protein Blocks. Proc. Natl. Acad. Sci. U. S. A. 1992, 89, 10915.
(55) Nanni, L.; Lumini, A. An Ensemble of Reduced Alphabets with Protein Encoding Based on Grouped Weight for Predicting DNA-binding Proteins. Amino Acids 2009, 36, 167−175.
(56) Zhang, Z. H.; Wang, Z. H.; Zhang, Z. R.; Wang, Y. X. A Novel Method for Apoptosis Protein Subcellular Localization Prediction Combining Encoding Based on Grouped Weight and Support Vector Machine. FEBS Lett. 2006, 580, 6169−6174.
(57) Tibshirani, R. Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. 1996, 58, 267−288.
(58) Tibshirani, R. Regression Shrinkage and Selection via the lasso: A Retrospective. J. R. Stat. Soc. 2011, 73, 273−282.
(59) Zou, H.; Hastie, T. Regularization and Variable Selection via the Elastic Net. J. R. Stat. Soc. 2005, 67, 301−320.
(60) Chang, C. C.; Lin, C. J. LIBSVM: A Library for Support Vector Machines. Acm. T. Intel. Syst. Tec. 2011, 2, 1.
(61) Wu, T. F.; Lin, C. J.; Weng, R. C. Probability Estimates for Multi-class Classification by Pairwise Coupling. J. Mach. Learn. Res. 2004, 5, 975−1005.
(62) Wen, P. P.; Shi, S. P.; Xu, H. D.; Wang, L. N.; Qiu, J. D. Accurate In Silico Prediction of Species-specific Methylation Sites Based on Information Gain Feature Optimization. Bioinformatics 2016, 32, 3107.
(63) Wang, L. N.; Shi, S. P.; Wen, P. P.; Zhou, Z. Y.; Qiu, J. D. Computing Prediction and Functional Analysis of Prokaryotic Propionylation. J. Chem. Inf. Model. 2017, 57, 2896.
(64) Peng, H.; Long, F.; Ding, C. Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226.
(65) Vacic, V.; Iakoucheva, L. M.; Radivojac, P. Two Sample Logo: A Graphical Representation of the Differences Between Two Sets of Sequence Alignments. Bioinformatics 2006, 22, 1536−1537.
