Article pubs.acs.org/jpr
Cite This: J. Proteome Res. XXXX, XXX, XXX−XXX
MsDBP: Exploring DNA-Binding Proteins by Integrating Multiscale Sequence Information via Chou’s Five-Step Rule
Xiuquan Du,*,† Yanyu Diao,† Heng Liu,‡ and Shuo Li§
† The School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
‡ Department of Gastroenterology, The First Affiliated Hospital of Anhui Medical University, Hefei, Anhui, China
§ Department of Medical Imaging, Western University, London, ON N6A 3K7, Canada
S Supporting Information *
ABSTRACT: DNA-binding proteins are crucial to alternative splicing, methylation, and the structural composition of DNA. Existing experimental methods for identifying DNA-binding proteins are expensive and time-consuming; thus, it is necessary to develop a fast and accurate computational method to address the problem. In this Article, we report a novel predictor, MsDBP, a DNA-binding protein prediction method that feeds multiscale sequence features into a deep neural network. First, instead of developing a structure-based method with narrow applicability, we commit to a sequence-based predictor. Second, instead of characterizing the whole protein directly, we divide the protein into subsequences of different lengths and then encode each of them into a vector based on composition information. In this way, the multiscale sequence feature is obtained. Finally, a branch of dense layers is applied to learn multilevel abstract features that discriminate DNA-binding proteins. When MsDBP is tested on the independent data set PDB2272, it achieves an overall accuracy of 66.99% with a sensitivity (SE) of 70.69%. In addition, we perform extensive experiments to compare the proposed method with other existing methods. The results indicate that MsDBP would be a useful tool for the identification of DNA-binding proteins. MsDBP is freely available as a web server at http://47.100.203.218/MsDBP/.

KEYWORDS: DNA-binding proteins, multiscale features, dense layers
Journal of Proteome Research. DOI: 10.1021/acs.jproteome.9b00226. Received: April 8, 2019. Published: July 3, 2019. © XXXX American Chemical Society

1. INTRODUCTION

DNA-binding proteins (DBPs) play vital roles in many biological processes, such as specific nucleotide sequence recognition, transcription, and DNA replication.1,2 In addition, DBPs have many important applications in drug development for treating genetic diseases3 and in biological studies of DNA.4 Because of the importance of DBPs, it is highly desirable to develop effective methods to identify them. At present, experimental techniques such as filter binding assays,5 X-ray crystallography,6 and genetic analysis7 are used to identify DBPs. However, experimental methods are both costly and time-consuming. Meanwhile, the number of available protein sequences has exploded with efficient next-generation sequencing techniques. Therefore, developing fast and effective computational methods to handle such large-scale protein sequence data is an important research topic.

Recently, various computational methods based on machine learning algorithms have been presented to predict whether a target protein is a DBP. These methods can be roughly categorized into structure-based and sequence-based methods. Among the structure-based predictors, Gao et al.8 proposed the DNA-binding Domain Hunter (DBD-Hunter) method, which combined structural comparison with a statistical potential to judge whether the query sequence is a DBP. Nimrod et al.9 adopted various properties, such as the amino acid conservation patterns of the predicted functional regions and the average surface electrostatic potentials, and fed these features into a random forest (RF) for DBP identification. Zhao et al.10 developed a template-based method to predict DBPs by utilizing structural similarity and binding affinity. However, structure-based methods depend heavily on structural information, and they are inapplicable when the structure of a query sequence is unknown.

To solve this problem, many sequence-based methods that are free of structural information have been built. For example, by incorporating some features into the form of the pseudo amino acid composition (PseAAC), Lin et al.11 proposed a method called iDNA-Prot for identifying DBPs. Zhao et al.12 identified DBPs by combining RF with sequence features produced by PseAAC based on six physicochemical characteristics of amino acids. To get an optimal model for DBP prediction, Zou et al.13 investigated four different categories of protein features and three different coding methods (Global method, Nonlocal method, and Local method). Lou et al.14 distinguished DBPs using features comprising properties from the primary sequence, PSSM, predicted structures, etc. The size of the feature vector was reduced by performing both feature ranking with RF and wrapper-based feature selection. Song et al.15 and Xu et al.16 both employed ensemble learning techniques and hybrid features extracted only from the protein sequence to predict DBPs. By integrating evolutionary information into the classical PseAAC vector, Liu et al.17 and Xu et al.18 both developed approaches for DBP prediction based on the support vector machine (SVM). Dong et al.19 proposed an SVM-based method that converted protein sequences into fixed-length vectors by using autocross covariance transformation and k-mer composition. Waris et al.20 extracted numerical values from protein sequences by using dipeptide composition, PSSM, and split amino acid composition methods. To overcome the issue of the unbalanced data set, they used the synthetic minority oversampling technique (SMOTE). In addition, they selected the best classifier among K-nearest neighbor (KNN), probabilistic neural network (PNN), SVM, and RF for identifying DBPs. Wei et al.21 presented a machine learning-based method (called Local-DPP), which predicted DBPs with an RF classifier. The feature representation method they proposed can extract local features from the PSSM. Chowdhury et al.22 presented iDNAProt-ES, a DNA-binding protein prediction method that utilized 14 different modes of the general PseAAC to identify DBPs. They also employed recursive feature elimination to choose an optimal set of features and used SVM to train the prediction model. Rahman et al.23 applied three different feature extraction methods to extract information directly from the protein sequences and then applied the RF method to find a reliable ranking of the features. Afterward, an SVM with a linear kernel was trained to generate the final predictor.

Although the above sequence-based approaches can correctly predict DBPs to some extent, most of them employ complex features, such as physicochemical properties,13,19 structural and functional information,13,14,22 and evolutionary information.13,14,17,18,20−22 Also, traditional machine learning methods require more human intervention in feature selection, and many of them struggle to capture complex nonlinear relationships among input data. With the power of automatically learning useful and more abstract features without hand-crafted features or rules, deep learning algorithms have been successfully applied to protein prediction tasks, such as protein−protein interaction prediction24 and protein structure prediction.25 Deep learning techniques have also been successfully used to identify DBPs. For example, Qu et al.26 proposed a deep learning based method that contains two layers of the convolutional neural network (CNN) and long short-term memory (LSTM) to identify DBPs from the primary sequence. The method relies only on raw data, without the need to manually extract features. However, it truncates any protein sequence longer than 1000 amino acids, so some sequence information is lost for such proteins.

To overcome the shortcomings of the above methods, we report a deep learning based approach (named MsDBP) for predicting DBPs with variable sequence length without truncating the protein sequence. For a given protein, we first divide the entire protein sequence into four subsequences to extract multiscale features, and then we regard the feature vector as the input of the network and apply a branch of dense layers to automatically learn diverse hierarchical features. Finally, we use a neural network with two hidden layers to connect their outputs for DBP prediction. Our contributions are as follows: (1) A new method for predicting DBPs with the multiscale feature. Instead of extracting features from the whole sequence directly, our model divides the entire protein sequence into four subsequences of varying length and then encodes each subsequence with amino acid composition and dipeptide composition to transform the protein sequence into a feature vector. This fully mines the composition information of multiscale continuous amino acid segments and is simple and easy to implement. (2) A new deep learning-based method (MsDBP). It consists of multiple stacked neural network layers, where the outputs of each layer are the inputs of the next layer. This layer-by-layer learning helps to map unorganized low-level features to high-level abstract features, which can reduce the impact of noise in the original input. (3) Fusion at the abstract level, which enables MsDBP to handle different features in a simple way. The top-level shared layer at the fusion level can help discover the shared properties among different modules, which effectively and deeply integrates contextual information at different levels.

As demonstrated by a series of recent publications27−46 and summarized in a comprehensive review,47 to develop a very useful predictor for a biological system, one needs to follow Chou’s five-step rule and go through the following five steps: (a) introduce or construct a high-quality benchmark data set to train and evaluate the predictor; (b) represent the samples with an effective formulation that can truly reflect the intrinsic correlation between the samples and the targets to be predicted; (c) select or develop a powerful algorithm to make the prediction; (d) properly conduct cross-validation tests to objectively evaluate the anticipated prediction accuracy; (e) establish a user-friendly and publicly accessible web server. Papers that develop a new sequence-analyzing method or statistical predictor by observing the guidelines of Chou’s five-step rule have the following notable merits: (1) crystal clear in logic development, (2) completely transparent in operation, (3) easy for other investigators to repeat the reported results, (4) high potential for stimulating other sequence-analyzing methods, and (5) very convenient for use by the majority of experimental scientists. Below, we elaborate on how we address these five steps.
2. MATERIALS AND METHODS

2.1. Data Sets

The benchmark data set, which we name PDB14189, is taken from Ma et al.48 Ma et al. collected DBPs (positive samples) from the UniProt database49 by searching for the keyword “DNA binding”. To build a nonredundant and high-quality data set, they first removed all protein sequences with fewer than 50 or more than 6000 amino acids and then removed all protein sequences that contained irregular amino acid characters (“X” and “Z”). Finally, they removed redundant sequences with more than 40% sequence similarity using the BLAST package.50 In this way, they obtained 7131 DBPs. By employing the procedure proposed by Yu et al.,51 Ma et al. obtained a set of 7131 non-DBPs (negative samples) from the UniProt database. Some proteins have since been modified or removed owing to revisions of the UniProt database. As a result, the benchmark data set used in this study consists of 7129 DBPs and 7060 non-DBPs (14189 sequences in total).

To compare the performance of the proposed method with that of previous methods, we use the independent data set PDB2272 collected from reported papers.13,52 The original data set consists of 1153 DBPs and 1153 non-DBPs obtained from Swiss-Prot. No two proteins share more than 25% sequence similarity, and sequences with irregular characters (“X” or “Z”) are removed. Only 1119 of the 1153 non-DBP sequences are still available. Consequently, the PDB2272 data set consists of 1153 DBPs and 1119 non-DBPs.

2.2. Framework of the Proposed Method MsDBP

Figure 1 presents the overall framework of MsDBP for DBP prediction. Most existing methods directly convert the whole protein sequence into a vector for DBP prediction. Unlike these methods, we first divide a protein sequence into four subsequences of different lengths (the first 25%, the first 50%, the first 75%, and the first 100% of the entire sequence) and use composition information to transform each subsequence into a vector. The process is shown in Figure 2. We thus obtain four vectors (the multiscale feature) and merge them into one feature vector that represents the protein sequence. Finally, we feed the extracted multiscale feature into a branch of dense layers to learn diverse abstract features for DBP prediction.

Figure 1. Architecture of the proposed MsDBP. MsDBP first transforms protein sequence into a feature vector. Then the network takes the multiscale feature as input and applies a branch of dense layers to calculate probability.

Figure 2. Generation process of the multiscale feature. a, b, c, and d stand for the first 25%, the first 50%, the first 75%, and the first 100% of the entire sequence, respectively.

Table 1. Division of Amino Acids into Seven Groups

| group 1 | group 2 | group 3 | group 4 | group 5 | group 6 | group 7 |
| --- | --- | --- | --- | --- | --- | --- |
| A, G, V | I, L, F, P | Y, M, T, S | H, N, Q, W | R, K | D, E | C |

Figure 3. Parameter settings for each layer of the proposed architecture.

2.2.1. Exploring Multiscale Feature from Primary Protein Sequence. With the explosive growth of biological sequences in the postgenomic era, one of the most important but also most difficult problems in computational biology is how to express a biological sequence with a discrete model or a vector, yet still keep considerable sequence-order information or key pattern characteristics. This is because all the existing machine-learning algorithms (such as the “Optimization” algorithm,53 the “Covariance Discriminant” or “CD” algorithm,54,55 the “Nearest Neighbor” or “NN” algorithm,56 and the “Support Vector Machine” or “SVM” algorithm56,57) can only handle vectors, as elaborated in a comprehensive review.58 However, a vector defined in a discrete model may completely lose all the
sequence-pattern information. To avoid completely losing the sequence-pattern information for proteins, the pseudo amino acid composition59 or PseAAC60 was proposed. Ever since the concept of Chou’s PseAAC was proposed, it has been widely used in nearly all areas of computational proteomics61−71 (see also the long list of references cited in Chou et al.72). Because it has been widely and increasingly used, four powerful open-access software packages, called “PseAAC”,73 “PseAAC-Builder”,74 “propy”,75 and “PseAAC-General”,76 were established: the former three generate various modes of Chou’s special PseAAC,77 while the fourth generates those of Chou’s general PseAAC,47 including not only all the special modes of feature vectors for proteins but also higher-level feature vectors such as the “Functional Domain” mode (see eqs 9 and 10 of Chou et al.47), the “Gene Ontology” mode (see eqs 11 and 12 of Chou et al.47), and the “Sequential Evolution” or “PSSM” mode (see eqs 13 and 14 of Chou et al.47). Encouraged by the successes of using PseAAC to deal with protein/peptide sequences, the concept of PseKNC (Pseudo K-tuple Nucleotide Composition)78 was developed for generating various feature vectors for DNA/RNA sequences,79−81 which have proved very useful as well. More recently, a very powerful web server called “Pse-in-One”82 and its updated version “Pse-in-One2.0”83 have been established; they can generate any desired feature vectors for protein/peptide and DNA/RNA sequences according to users’ needs.

In this Article, we use “Pse-in-One2.0” to generate protein composition features, which previous methods13,20 have shown to be effective in predicting DBPs. Inspired by the work of You et al.,84 we first divide the sequence into four subsequences (shown in Figure 2) and then use the “k-mer” mode of “Pse-in-One2.0” to create amino acid composition features and dipeptide composition features for each subsequence, so as to fully mine the composition information of the primary protein sequence from multiscale continuous subsequences.

a). Amino Acid Composition Feature (AAC). Transforming each protein sequence into an amino acid composition feature vector encapsulates the properties of the protein in the vector. For each subsequence, we first use the “k-mer” mode (k = 1) of “Pse-in-One2.0” to calculate the respective frequencies of the 20 standard amino acids. We then divide the 20 standard amino acids into seven groups (shown in Table 1) based on the dipoles and volumes of the side chains, which yields another vector with a length of 7. The AAC is formulated as

$$\mathrm{AAC}(i) = \frac{R_i}{L} \tag{1}$$

where AAC(i) denotes the occurrence frequency of amino acids of type i, $R_i$ denotes the number of amino acids of type i, and $L$ represents the length of the amino acid sequence. Thus, for each subsequence, we get a vector of length 27 (20 + 7) for AAC.

b). Dipeptide Composition Feature (DPC). In addition to AAC, we also employ DPC for DBP prediction. DPC calculates the occurrence frequency of each pair of consecutive amino acids; its advantage is that it incorporates neighborhood information and the local order of amino acid sequences. First, each subsequence is put into the “k-mer” mode (k = 2) of “Pse-in-One2.0” to generate a vector of length 400. To prevent overfitting, the 20 standard amino acids are divided into seven groups according to Table 1, and the vector of length 400 is reduced to a vector of length 49. The DPC is defined as

$$\mathrm{DPC}(i) = \frac{D_i}{L - 1} \tag{2}$$

where DPC(i) represents the dipeptide composition of type i and $D_i$ represents the number of dipeptides of type i.

2.2.2. Deep Neural Network. As shown in Figure 1, MsDBP employs a branch of dense layers to extract diverse hierarchical features from different levels. After the feature vector passes through the first dense layer, the subsequent dense layers (with different numbers of neurons) extract diverse abstract features to improve the predictive performance. Then we use a neural network with two hidden layers to connect their outputs. Each layer is implemented as a fully connected layer according to

$$o_j^{l} = \sigma\left(\sum_{i=1}^{H} w_{i,j}^{l-1} o_i^{l-1} + b_{i,j}^{l-1}\right) \tag{3}$$

where $H$ is the number of neurons in the (l − 1)th layer, $w_{i,j}^{l-1}$ and $b_{i,j}^{l-1}$ are the weight and bias associated with the jth neuron, respectively, and $\sigma(\cdot)$ denotes the activation function of this neuron. Finally, the last fully connected layer with one neuron is followed by a sigmoid function, which ensures that the output lies between 0 and 1 and allows us to interpret it as the predicted probability of being a DBP. In this study, if the probability is greater than 0.5, we consider the query sequence a DBP. The prediction probability of DBP is calculated as

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-o^{L}}} \tag{4}$$

where $o^{L}$ represents the output of the last fully connected layer. The probability, together with the true label, is then used in the loss function, which is minimized via stochastic gradient descent with back-propagation. After the network is trained, we apply it to a given protein sequence and obtain the probability used to judge whether the query sequence is a DBP. All of the experiments performed in this Article use the same parameters. The input and output parameters of each layer are shown in Figure 3. Here, we employ binary cross-entropy as the loss function, which measures how well the model fits the empirical data. Among a total of $N$ protein sequences, let $t_i$ denote the target label of the ith sequence and $p_i$ the prediction probability of the ith sequence. The binary cross-entropy is calculated as

$$\mathrm{loss}(t, p) = -\sum_{i=1}^{N} \left\{ t_i \log p_i + (1 - t_i) \log(1 - p_i) \right\} \tag{5}$$
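As a rough illustration of how eqs 1−5 fit together, the following NumPy sketch encodes a sequence into the multiscale feature and runs one forward pass. Several details are assumptions made only for this sketch and are not taken from the Article: prefix lengths are rounded down, the 400-dimensional dipeptide vector is replaced (not supplemented) by its 49-dimensional grouped form so that each subsequence yields 27 + 49 = 76 values, the hidden activation is ReLU, each neuron has a single bias, and the layer sizes and random weights are placeholders rather than the settings of Figure 3. The features are computed directly instead of through “Pse-in-One2.0”.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Seven groups of Table 1 (dipoles and volumes of the side chains).
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
GROUP_OF = {aa: g for g, members in enumerate(GROUPS) for aa in members}


def aac(sub):
    """Eq 1: 20 residue frequencies followed by 7 group frequencies."""
    single, grouped = np.zeros(20), np.zeros(7)
    for aa in sub:
        single[AA_INDEX[aa]] += 1
        grouped[GROUP_OF[aa]] += 1
    return np.concatenate([single, grouped]) / len(sub)


def dpc(sub):
    """Eq 2 on grouped residues: 7 x 7 = 49 dipeptide frequencies."""
    counts = np.zeros((7, 7))
    for left, right in zip(sub, sub[1:]):
        counts[GROUP_OF[left], GROUP_OF[right]] += 1
    return counts.ravel() / max(len(sub) - 1, 1)


def multiscale_feature(seq, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Concatenate AAC and DPC of the first 25%, 50%, 75%, and 100%."""
    parts = []
    for frac in fractions:
        sub = seq[: max(1, int(len(seq) * frac))]  # assumption: round down
        parts.append(np.concatenate([aac(sub), dpc(sub)]))
    return np.concatenate(parts)


def relu(z):
    return np.maximum(z, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # eq 4


def forward(x, params):
    """Eq 3 per layer; ReLU on hidden layers (assumed), sigmoid on the last."""
    h = x
    for w, b in params[:-1]:
        h = relu(h @ w + b)
    w, b = params[-1]
    return sigmoid(h @ w + b)


def binary_cross_entropy(t, p, eps=1e-12):
    """Eq 5, with clipping to avoid log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.sum(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)))


# Toy usage with a made-up 100-residue sequence and placeholder weights.
rng = np.random.default_rng(0)
x = multiscale_feature("MKRAGDECWYHNQLIFPSTV" * 5)
params = [(rng.normal(0.0, 0.1, (x.size, 16)), np.zeros(16)),
          (rng.normal(0.0, 0.1, (16, 1)), np.zeros(1))]
p = forward(x, params)
print(x.shape, "DBP" if p[0] > 0.5 else "non-DBP")
```

With untrained weights the predicted label is meaningless; the sketch only shows the shapes involved and where each equation enters.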
We set the batch size to 64 and train MsDBP by stochastic gradient descent (SGD) with a momentum of 0.9 and a learning rate of 0.01 to minimize the loss function; SGD with momentum can effectively train deep learning models.85 To improve the generalization performance of the proposed method, we also apply dropout (rate 0.5) in the hidden layers.

2.3. Performance Evaluation

To comprehensively measure the prediction quality of the classifier, we calculate the most widely used measurements: accuracy (ACC), precision (PRE), sensitivity (SE, also called recall), specificity (SP), and the Matthews correlation coefficient (MCC), which are formulated as follows
Table 2. Performance Comparison of MsDBP with Different Scale Feature on Benchmark Dataset Using Five-Fold Cross-Validation

| methods | ACC (%) | PRE (%) | SE (%) | SP (%) | MCC (%) | AUC (%) |
| --- | --- | --- | --- | --- | --- | --- |
| first subsequence (a) | 72.03 ± 0.30 | 72.43 ± 1.13 | 71.65 ± 1.55 | 72.41 ± 2.14 | 44.09 ± 0.64 | 80.42 ± 0.48 |
| second subsequence (b) | 75.19 ± 1.16 | 75.06 ± 1.19 | 75.89 ± 3.60 | 74.49 ± 2.35 | 50.46 ± 2.32 | 83.86 ± 0.92 |
| third subsequence (c) | 77.81 ± 0.59 | 77.74 ± 0.95 | 78.26 ± 0.73 | 77.35 ± 1.33 | 55.62 ± 1.18 | 86.58 ± 0.98 |
| fourth subsequence (d) | 79.34 ± 0.54 | 79.30 ± 0.91 | 79.72 ± 1.80 | 78.95 ± 1.48 | 58.70 ± 1.09 | 87.97 ± 0.45 |
| a + b + c + d (MsDBP) | 80.29 ± 0.65 | 80.12 ± 0.76 | 80.87 ± 1.68 | 79.72 ± 1.17 | 60.61 ± 1.32 | 88.31 ± 0.62 |
Figure 4. ROC and PR curves of different scale feature on the benchmark data set. (a) ROC curves of different scale feature. (b) PR curves of different scale feature.
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{6}$$

$$\mathrm{SE} = \frac{TP}{TP + FN} \tag{7}$$

$$\mathrm{SP} = \frac{TN}{TN + FP} \tag{8}$$

$$\mathrm{PRE} = \frac{TP}{TP + FP} \tag{9}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{10}$$

where TP, TN, FP, and FN represent the number of DBPs correctly classified as DBPs, non-DBPs correctly classified as non-DBPs, non-DBPs incorrectly classified as DBPs, and DBPs incorrectly classified as non-DBPs, respectively. In addition to the above metrics, we also compute the area under the Receiver Operating Characteristic curve (AUC), which gives a threshold-independent measure of classifier performance. A higher AUC represents a better predictor.

Although the conventional metrics (eqs 6−10) are often used in the literature to measure the quality of a prediction method, they lack intuitiveness and are not easy for most biologists to understand, particularly the MCC, which is a very important metric for reflecting the stability of a prediction method. Fortunately, based on Chou’s symbols introduced for studying protein signal peptides,86−88 a set of four intuitive metrics was derived,27,89 as given in eq 14 of Chen et al.27 or in eq 19 of Xu et al.89 This set of intuitive metrics has been adopted in a series of recent publications.27,29,38,42−45,80,90−99 According to eq 14 of Chen et al.27 and Chou’s symbols,87 eqs 6−10 can be converted into the following form:

$$
\begin{cases}
\mathrm{ACC} = 1 - \dfrac{N_{-}^{+} + N_{+}^{-}}{N^{+} + N^{-}} \\[2ex]
\mathrm{SE} = 1 - \dfrac{N_{-}^{+}}{N^{+}} \\[2ex]
\mathrm{SP} = 1 - \dfrac{N_{+}^{-}}{N^{-}} \\[2ex]
\mathrm{PRE} = \dfrac{1 - \dfrac{N_{-}^{+}}{N^{+}}}{1 + \dfrac{N_{+}^{-} - N_{-}^{+}}{N^{+}}} \\[4ex]
\mathrm{MCC} = \dfrac{1 - \left(\dfrac{N_{-}^{+}}{N^{+}} + \dfrac{N_{+}^{-}}{N^{-}}\right)}{\sqrt{\left(1 + \dfrac{N_{+}^{-} - N_{-}^{+}}{N^{+}}\right)\left(1 + \dfrac{N_{-}^{+} - N_{+}^{-}}{N^{-}}\right)}}
\end{cases}
\tag{11}
$$

where $N^{+}$ represents the total number of DBPs and $N_{-}^{+}$ is the number of DBPs incorrectly classified as non-DBPs; $N^{-}$ represents the total number of non-DBPs and $N_{+}^{-}$ is the number of non-DBPs incorrectly classified as DBPs. Equation 11 follows from eqs 6−10 by substituting $TP = N^{+} - N_{-}^{+}$, $FN = N_{-}^{+}$, $TN = N^{-} - N_{+}^{-}$, and $FP = N_{+}^{-}$. From eq 11, it can be observed that SE = 1 when $N_{-}^{+} = 0$, which means that none of the DBPs (positive samples) is incorrectly classified as a non-DBP (negative sample). We get SE = 0 and PRE = 0 if $N_{-}^{+} = N^{+}$, which implies that all DBPs are incorrectly classified as non-DBPs. Similarly, $N_{+}^{-} = 0$ indicates that none of the non-DBPs is classified as a DBP, which yields SP = 1 and PRE = 1; SP is 0 if $N_{+}^{-} = N^{-}$, which means that all the non-DBPs are incorrectly predicted to be DBPs. When $N_{+}^{-} = N_{-}^{+} = 0$, both non-DBPs and DBPs are all correctly predicted, and we get ACC = 1 and MCC = 1; when $N_{-}^{+} = N^{+}$ and $N_{+}^{-} = N^{-}$, none of the DBPs and none of the non-DBPs are correctly classified, and we get ACC = 0 and MCC = −1; we get ACC = 0.5 and MCC = 0 if $N_{-}^{+} = N^{+}/2$ and $N_{+}^{-} = N^{-}/2$, which means that the predictive power of the model is no better than that of random prediction.

Based on the discussion of eq 11, we can see that SE, PRE, SP, ACC, and MCC in this form are easier to understand than the conventional metrics. However, it is instructive to point out that both the conventional metrics and the intuitive metrics derived from Chou’s symbols86−88 are valid only for single-label systems (where each sample belongs to only one class). For multilabel systems (where a sample may simultaneously belong to several classes), which have become more frequent in systems biology,100−106 systems medicine,107,108 and biomedicine,109 completely different metrics, as defined in Chou et al.,110 are needed.
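The equivalence between eqs 6−10 and eq 11 can be checked numerically. The sketch below is illustrative only: the function and argument names are hypothetical, and the counts in the usage line are made up rather than taken from any experiment in this Article.

```python
import math


def conventional_metrics(tp, tn, fp, fn):
    """Eqs 6-10, computed from confusion-matrix counts."""
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SE": tp / (tp + fn),
        "SP": tn / (tn + fp),
        "PRE": tp / (tp + fp),
        "MCC": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


def intuitive_metrics(n_pos, n_neg, pos_missed, neg_missed):
    """Eq 11. n_pos = N+, n_neg = N-, pos_missed = DBPs predicted as
    non-DBPs, neg_missed = non-DBPs predicted as DBPs."""
    a = pos_missed / n_pos
    b = neg_missed / n_neg
    pos_term = 1 + (neg_missed - pos_missed) / n_pos
    neg_term = 1 + (pos_missed - neg_missed) / n_neg
    return {
        "ACC": 1 - (pos_missed + neg_missed) / (n_pos + n_neg),
        "SE": 1 - a,
        "SP": 1 - b,
        "PRE": (1 - a) / pos_term,
        "MCC": (1 - (a + b)) / math.sqrt(pos_term * neg_term),
    }


# Hypothetical counts: 100 DBPs (30 missed) and 100 non-DBPs (40 missed).
conv = conventional_metrics(tp=70, tn=60, fp=40, fn=30)
intu = intuitive_metrics(n_pos=100, n_neg=100, pos_missed=30, neg_missed=40)
for name in conv:
    print(name, round(conv[name], 4), round(intu[name], 4))
```

Both forms are undefined when a denominator vanishes (for example, when no sample is predicted positive), so a production implementation would need to guard those cases.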
Figure 6. Performance of MsDBP on PDB2272 with different ratios of negative to positive samples.
3. RESULTS AND DISCUSSION

In this section, we show the results of all experiments performed in this study; bold-faced values indicate the best result among all the methods.
Table 3. Comparison of MsDBP with Previous Methods on Independent Set PDB2272

| methods | ACC (%) | PRE (%) | SE (%) | SP (%) | MCC (%) | AUC (%)a |
| --- | --- | --- | --- | --- | --- | --- |
| Qu et al. | 48.33 | 49.07 | 48.31 | 48.35 | −3.34 | 47.76 |
| Local-DPP | 50.57 | 58.72 | 8.76 | 93.66 | 4.56 | --- |
| PseDNA-Pro | 61.88 | 59.90 | 75.28 | 48.08 | 24.30 | --- |
| DPP-PseAAC | 58.10 | 59.10 | 56.63 | 59.61 | 16.25 | 61.00 |
| MsDBP | 66.99 | 66.42 | 70.69 | 63.18 | 33.97 | 73.83 |

a “---” represents the value is not available.

Table 4. Comparison of MsDBP with Other Methods on PDB186

| methods | ACC (%) | PRE (%) | SE (%) | SP (%) | MCC (%) | AUC (%)a |
| --- | --- | --- | --- | --- | --- | --- |
| DPP-PseAAC | 77.42 | 74.29 | 83.87 | 70.97 | 55.30 | 79.86 |
| PseDNA-Pro | 71.51 | 67.54 | 82.80 | 60.22 | 44.15 | --- |
| Local-DPP | 79.03 | 72.88 | 92.47 | 65.59 | 60.28 | --- |
| MsDBP | 80.11 | 76.92 | 86.02 | 74.19 | 60.64 | 87.50 |

a “---” represents the value is not available.
3.1. Comparison of the Different Scale Features with Cross-Validation
In this part, we evaluate whether or not the multiscale feature can improve the performance of DBPs prediction when compared with the single scale feature. In order to ensure the validation of the experimental results, benchmark data set is randomly divided into the training set and the testing set via the 5-fold cross-validation. Each of five subsets is used as a holdout testing data set and the rest for training the model in
Figure 7. FPRs of the proposed method MsDBP and other methods.
Figure 5. ROC and PR curves of different methods on the independent data set PDB2272. (a) ROC curves of different methods. (b) PR curves of different methods. G
DOI: 10.1021/acs.jproteome.9b00226 J. Proteome Res. XXXX, XXX, XXX−XXX
Article
Journal of Proteome Research
MsDBP with other methods, the ROC curves of these methods are drawn in Figure 5(a). The curves highlight the strength of the proposed method in yielding high TPR of ≥70% at a cost of lower FPR of ≥40%. In addition to the ROC curves, PR curves are also drawn as shown in Figure 5(b). In summary, MsDBP outperforms other existing methods on PDB2272, which demonstrates the superiority of MsDBP. In the real world, the number of non-DBPs is much more than that of the DBPs. To simulate real-world applications, MsDBP is evaluated on PDB2272 with different ratios of negative to positive samples. From Figure 6, we can observe that the value of ACC increases slowly as the ratio decreases, which indicates that MsDBP has a stable performance on the imbalanced testing data set and is suitable for DBPs prediction.
Table 5. Performance of MsDBP on Five Species number of predicted DBPs
species human A. thaliana mouse S. cerevisiae fruit fly summary
protein with complete DNA binding annotations 1049 929 424 314 143(142) 2859(2858)
ACC (%)
iDbp134
MsDBP
iDbp
MsDBP
613 489 232 191
617 505 263 203
58.44 52.64 54.72 60.83
58.82 54.36 62.03 64.65
84 1609
86 1674
58.74 56.28
60.56 58.57
3.3. Comparison with Other Existing Methods on PDB186
turn. The advantage of the cross-validation method is that it can reduce the impact of data dependency. As shown in Table 2, our method that uses multiscale feature achieves the best performance, and the corresponding values of ACC, PRE, SE, SP, and MCC are 80.29%, 80.12%, 80.87%, 79.72%, and 60.61%, respectively. From Table 2, we can see that MsDBP obtains larger ACC, PRE, SP, and AUC than that of the fourth subsequence (d), especially the SE and MCC of MsDBP improve by 1.15% and 1.91%, respectively, which reflects that MsDBP has better recognition ability for DBPs and more stability. Using graphic methods to study biological and medical systems can provide an intuitive vision and useful insights for helping analyze complicated relations therein, as indicated by a series of previous studies on many important biological topics,111−124 particularly what happened is for the topics of enzyme kinetics, protein folding rates,119,125−127 and low-frequency internal motion.128,129 So we plot the Receiver Operating Characteristic (ROC)130 and precision-recall (PR) curves (shown in Figure 4) to evaluate the predictive ability of our method. Figure 4(a) highlights the strength of the proposed method MsDBP in yielding high TPR of ≥75% at a cost of lower FPR of ≥20%. From Table 2 and Figure 4, we conclude that our method, which combines the all-scales feature, is superior to those using the single-scale feature.
For a fair comparison with other methods, we also use PDB1075 and PDB186 as training and test sets, respectively, just like DPP-PseAAC23 and Local-DPP.21 PDB1075 contains 525 DBPs and 550 non-DBPs. In the data set PDB186, there are 93 DBPs and 93 non-DBPs. None of the protein sequences in the PDB186 has more than 25% pairwise sequence identity with any other. To avoid homology bias between the training data set PDB1075 and the data set PDB186, those protein sequences in the PDB1075 that have more than 25% sequence identity to any sequence in the PDB186 data set are removed. We rebuild the proposed method MsDBP using the reduced PDB1075, which contains 487 DBPs and 548 non-DBPs. Compared with PDB14189, the reduced PDB1075 has a much smaller sample size, so we reduce the number of neurons in some layers (provided in the Supporting Information) without changing the model architecture. Table 4 shows the predictive results of MsDBP and three other state-of-the-art predictors, including DPP-PseAAC, PseDNA-Pro, and Local-DPP. We find that there is an error in the MCC value of Local-DPP provided by Wei et al.,21 which is also discovered by Rahman et al.23 In this Article, we use the correct MCC value we calculated for comparison. Because the model of Qu et al.131 was trained using PDB186, our method is not compared with their method. From Table 4 we can see that the proposed method MsDBP has the best ACC, PRE, SP, MCC, and AUC. Local-DPP has the best SE, but its SP is only 65.59%, which is 8.60% lower than that of MsDBP. The above comparisons further confirm the better performance of the proposed MsDBP.
3.2. Comparison with Other Existing Methods on PDB2272
To check the robustness of MsDBP, we evaluate it on the independent data set (PDB2272) and compare its experimental results with those of Qu et al.’s method,131 Local-DPP,21 PseDNA-Pro,132 and DPP-PseAAC.23 To avoid homology bias between the benchmark data set (PDB14189) and the PDB2272 data set, we use the CD-HIT program133 to remove those proteins in the PDB14189 data set with more than 40% sequence similarity to proteins in PDB2272 and rebuild MsDBP on the remaining PDB14189 data set. The reduced PDB14189 contains 5791 DBPs and 6889 non-DBPs. The experimental results are shown in Table 3. From Table 3, we can observe that MsDBP achieves the highest ACC of 66.99%, PRE of 66.42%, and MCC of 33.97% among all the evaluated methods. More specifically, Local-DPP achieves the lowest SE of 8.76% among all methods together with a high SP of 93.66%, which indicates that this method is strongly biased toward the negative class (non-DBPs). Additionally, as Qu et al.’s method and DPP-PseAAC provide real-valued outputs, we calculate their AUC values. As shown in Table 3, MsDBP yields the highest AUC of 73.83% compared with Qu et al.’s method and DPP-PseAAC. To provide a graphic illustration for the comparisons of
3.4. Comparison of False Positive Rates with Other Methods
In order to investigate the false positive rate (FPR) of the proposed work, a newly compiled data set (called NDBP3723), which contains 3723 non-DBPs, is introduced in this work. These protein sequences are obtained from the Protein Data Bank (http://www.rcsb.org) with “Does not contain: DNA binding protein” as the mmCIF keyword and a release date “between 2018-03-01 and 2019-03-01”. Further, we discard the sequences with fewer than 50 or more than 6000 amino acids. Sequences that contain unknown residues (such as ‘X’) are also removed. None of these proteins share more than 40% sequence identity with each other or with the benchmark data set PDB14189. Then MsDBP is built on the PDB14189 data set. Figure 7 illustrates the FPRs of the proposed method MsDBP, Qu et al.’s method,131 PseDNA-Pro,132 DNAbinder,52 and DPP-PseAAC23 performed on non-DBPs data set
Table 6. Function of the Top Ten Proteins with the Highest Predictive Probability of Human Species
DNAbinder, and is 1.13% higher than DPP-PseAAC. We find that there are 133 protein sequences in the benchmark data set PDB1075 of DPP-PseAAC that share over 40% identity with sequences in the NDBP3723 data set. In addition, the result of our method in Figure 6 is close to the (100.00%-SP) value
NDBP3723. We do not include the result of Local-DPP in the figure since its prediction is unbalanced, which can be seen in Table 3. As shown in Figure 7, our method yields an FPR of 36.05% (i.e., the SP is 63.95%), which is 4.64%, 7.06%, and 0.72% lower than those of Qu et al.’s method, PseDNA-Pro, and
in Figure 8. You can get a brief introduction about MsDBP by clicking on the ReadMe button.
Step 2. Directly paste the query sequences into the text-box provided in the web server. The query protein sequences must be in the FASTA format. Click on the FASTA format button to get example sequences.
Step 3. Click on the Run button to predict the input sequences. The predicted results consist of the query sequence ID, predicted probability, and the predicted label.
Step 4. Click on the Download button to download the data sets used in this study.
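Because the query sequences must be in FASTA format, it can help to check a query locally before pasting it into the text-box. The following is a minimal sketch of such a check, not part of the web server itself. The helper names `parse_fasta` and `is_standard` and the example sequences are hypothetical, and whether the server accepts residues outside the 20 standard amino acids is an assumption that should be verified against the ReadMe page.

```python
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (sequence_id, sequence) tuples."""
    records = []
    seq_id, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            header = line[1:].strip()
            # Use the first whitespace-delimited token of the header as the ID.
            seq_id = header.split()[0] if header else ""
            chunks = []
        elif seq_id is None:
            raise ValueError("sequence data found before the first '>' header line")
        else:
            chunks.append(line.upper())
    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records


def is_standard(sequence):
    """Return True if the sequence is non-empty and uses only the 20 standard residues."""
    return bool(sequence) and set(sequence) <= STANDARD_RESIDUES


# Hypothetical query, for illustration only.
query = """>query1 hypothetical example
MKTAYIAKQR
QISFVKSHFS
>query2
MXLLA
"""
records = parse_fasta(query)
for seq_id, seq in records:
    print(seq_id, len(seq), "ok" if is_standard(seq) else "contains non-standard residues")
```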
4. CONCLUSION
In this Article, we establish a deep-learning-based predictor, named MsDBP, for predicting DBPs using only primary sequences. It first encodes protein sequences into multiscale features, and then a deep-learning-based method automatically learns rich hierarchical features directly from the sequence features, which requires less human intervention in the feature selection procedure than traditional machine learning methods. On the independent data sets, the proposed method outperforms other existing state-of-the-art methods, which demonstrates the effectiveness of our predictor. Furthermore, the promising performance on large-scale DBPs indicates that the proposed method has good generalization ability.
Figure 8. Index page of the web server.
predicted by our method in Table 3, which indicates that our method is stable.
3.5. Application of MsDBP to Large-Scale DBPs
To validate the generalization of MsDBP, we test it on large-scale DBPs collected by Zhang et al.134 By analyzing the 2859 protein IDs collected from Zhang et al., we find that two different protein IDs in the fruit fly species correspond to one protein sequence. In this case, we keep just one of the IDs. Thus, we obtain a large-scale set of 2858 DBPs (called DBP2858), which contains 1049 human DBPs, 929 A. thaliana DBPs, 424 mouse DBPs, 314 S. cerevisiae DBPs, and 142 fruit fly DBPs. The benchmark data set PDB14189 is used to construct the model. As shown in Table 5, among all the species, our method achieves the best predictive performance, successfully recognizing 617 (58.82%) of human DBPs, 505 (54.36%) of A. thaliana DBPs, 263 (62.03%) of mouse DBPs, 203 (64.65%) of S. cerevisiae DBPs, and 86 (60.56%) of fruit fly DBPs. In summary, among the 2858 DBPs, 1674 (58.57%) proteins (provided in the Supporting Information) are correctly recognized by our method. For the human species, 617 DBPs are correctly predicted. From these 617 proteins, we select the ten proteins with the highest predictive probability for analysis (Table 6). After investigation, we find that these ten proteins are closely related to important human life activities; their functions are listed in Table 6. From the above results, we conclude that our method is reliable for predicting DBPs in large-scale applications.
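The per-species recognition rates and the overall rate quoted above follow directly from the reported counts. The short sketch below recomputes them as a consistency check. The counts are those stated in the text; the variable and function names are illustrative only.

```python
# (recognized, total) DBPs per species, as reported in the text for DBP2858.
counts = {
    "human": (617, 1049),
    "A. thaliana": (505, 929),
    "mouse": (263, 424),
    "S. cerevisiae": (203, 314),
    "fruit fly": (86, 142),
}


def recognition_rate(recognized, total):
    """Percentage of DBPs that are correctly recognized."""
    return 100.0 * recognized / total


rates = {species: round(recognition_rate(r, t), 2) for species, (r, t) in counts.items()}
total_recognized = sum(r for r, _ in counts.values())
total_proteins = sum(t for _, t in counts.values())
overall = round(recognition_rate(total_recognized, total_proteins), 2)

for species, rate in rates.items():
    print(f"{species}: {rate:.2f}%")
print(f"overall: {total_recognized}/{total_proteins} = {overall:.2f}%")
```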
■
ASSOCIATED CONTENT
Supporting Information
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jproteome.9b00226.
Predictive results of MsDBP on DBP2858 data set and parameter settings for the model trained on reduced PDB1075 (PDF)
Read me_Source code and data set (TXT)
■
AUTHOR INFORMATION
Corresponding Author
*E-mail: [email protected]. Phone: +086-13721058041.
ORCID
Xiuquan Du: 0000-0001-7913-7605
Author Contributions
X.Q.D. conceived the study. Y.Y.D. performed the experiments and analyzed the data. X.Q.D. and Y.Y.D. drafted the paper. H.L. and S.L. provided some suggestions for paper writing.
3.6. Web Server for MsDBP
As pointed out in Chou et al.146 and demonstrated in a series of recent publications,38,41,94−96,98−107,147−154 web servers that are user-friendly and accessible to the public represent the future direction for developing practically more useful prediction methods and computational tools. Actually, many practically useful web servers have significantly increased the impacts of bioinformatics on medical science,58 driving medicinal chemistry into an unprecedented revolution.72 Therefore, we have also provided a web server for the prediction method presented in this Article. The detailed steps for using the web server are as follows:
Step 1. Click the link at http://47.100.203.218/MsDBP/, and you will see the index page of the web server as shown
Funding
This work is supported by grants from the Anhui Provincial Natural Science Foundation (1708085QF143).
Notes
The authors declare no competing financial interest.
■
ACKNOWLEDGMENTS
We thank all the anonymous reviewers for their valuable comments and constructive suggestions, which were helpful for improving the quality of the paper. We also thank Anhui University for the use of mass premises and equipment.
■
(21) Wei, L. Y.; Tang, J. J.; Zou, Q. Local-DPP: An improved DNAbinding protein prediction method by exploring local evolutionary information. Inf. Sci. 2017, 384, 135−144. (22) Chowdhury, S. Y.; Shatabda, S.; Dehzangi, A. iDNAProt-ES: Identification of DNA-binding Proteins Using Evolutionary and Structural Features. Sci. Rep. 2017, 7 (1), 14938. (23) Rahman, M. S.; Shatabda, S.; Saha, S.; Kaykobad, M.; Rahman, M. S. DPP-PseAAC: A DNA-binding protein prediction model using Chou’s general PseAAC. J. Theor. Biol. 2018, 452, 22−34. (24) Sun, T. L.; Zhou, B.; Lai, L. H.; Pei, J. F. Sequence-based prediction of protein protein interaction using a deep-learning algorithm. BMC Bioinf. 2017, 18 (1), 277. (25) Bai, L.; Yang, L. A Unified Deep Learning Model for Protein Structure Prediction. IEEE International Conference on Cybernetics 2017, 1−6. (26) Qu, Y. H.; Yu, H.; Gong, X. J.; Xu, J. H.; Lee, H. S. On the prediction of DNA-binding proteins only from primary sequences: A deep learning approach. PLoS One 2017, 12 (12), e0188129. (27) Chen, W.; Feng, P. M.; Lin, H.; Chou, K. C. iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res. 2013, 41 (6), e68. (28) Feng, P. M.; Chen, W.; Lin, H.; Chou, K. C. iHSP-PseRAAAC: Identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. Anal. Biochem. 2013, 442 (1), 118− 125. (29) Lin, H.; Deng, E. Z.; Ding, H.; Chen, W.; Chou, K. C. iPro54PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res. 2014, 42 (21), 12961−12972. (30) Chen, W.; Feng, P. M.; Deng, E. Z.; Lin, H.; Chou, K. C. iTISPseTNC: a sequence-based predictor for identifying translation initiation site in human genes using pseudo trinucleotide composition. Anal. Biochem. 2014, 462, 76−83. (31) Ding, H.; Deng, E. Z.; Yuan, L. F.; Liu, L.; Lin, H.; Chen, W.; Chou, K. C. 
iCTX-Type: A Sequence-Based Predictor for Identifying the Types of Conotoxins in Targeting Ion Channels. BioMed Res. Int. 2014, 2014 (4), 286419. (32) Liu, B.; Fang, L. Y.; Wang, S. Y.; Wang, X. L.; Li, H. T.; Chou, K. C. Identification of microRNA precursor with the degenerate Ktuple or Kmer strategy. J. Theor. Biol. 2015, 385, 153−159. (33) Liu, Z.; Xiao, X.; Qiu, W. R.; Chou, K. C. iDNA-Methyl: Identifying DNA methylation sites via pseudo trinucleotide composition. Anal. Biochem. 2015, 474, 69−77. (34) Xiao, X.; Min, J. L.; Lin, W. Z.; Lin, Z.; Cheng, X.; Chou, K. C. iDrug-Target: predicting the interactions between drug compounds and target proteins in cellular networking via benchmark dataset optimization approach. J. Biomol. Struct. Dyn. 2015, 33 (10), 2221− 2233. (35) Jia, J. H.; Liu, Z.; Xiao, X.; Liu, B. X.; Chou, K. C. iSuc-PseOpt: Identifying lysine succinylation sites in proteins by incorporating sequence-coupling effects into pseudo components and optimizing imbalanced training dataset. Anal. Biochem. 2016, 497, 48−56. (36) Jia, J. H.; Zhang, L. X.; Liu, Z.; Xiao, X.; Chou, K. C. pSumoCD: predicting sumoylation sites in proteins with covariance discriminant algorithm by incorporating sequence-coupled effects into general PseAAC. Bioinformatics 2016, 32 (20), 3133−3141. (37) Liu, B.; Fang, L. Y.; Long, R.; Lan, X.; Chou, K. C. iEnhancer2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition. Bioinformatics 2016, 32 (3), 362−369. (38) Chen, W.; Feng, P. M.; Yang, H.; Ding, H.; Lin, H.; Chou, K. C. iRNA-AI: identifying the adenosine to inosine editing sites in RNA sequences. Oncotarget 2017, 8 (3), 4208−4217. (39) Chen, W.; Ding, H.; Zhou, X.; Lin, H.; Chou, K. C. iRNA(m6A)-PseDNC: Identifying N6-methyladenosine sites using pseudo dinucleotide composition. Anal. Biochem. 2018, 561−562, 59−65.
REFERENCES
(1) Langlois, R. E.; Lu, H. Boosting the prediction and understanding of DNA-binding domains from sequence. Nucleic Acids Res. 2010, 38 (10), 3149−3158. (2) Sarai, A.; Kono, H. Protein-DNA recognition patterns and predictions. Annu. Rev. Biophys. Biomol. Struct. 2005, 34 (1), 379− 398. (3) Leung, C. H.; Chan, D. S. H.; Ma, V. P. Y.; Ma, D. L. DNAbinding small molecules as inhibitors of transcription factors. Med. Res. Rev. 2013, 33 (4), 823−846. (4) Zimmer, C.; Wähnert, U. Nonintercalating DNA-binding ligands: Specificity of the interaction and their use as tools in biophysical, biochemical and biological investigations of the genetic material. Prog. Biophys. Mol. Biol. 1986, 47 (1), 31−112. (5) Parola, M.; Bellomo, G.; Robino, G.; Barrera, G.; Dianzani, M. U. 4-Hydroxynonenal as a biological signal: molecular basis and pathophysiological implications. Antioxid. Redox Signaling 1999, 1 (3), 255−84. (6) Chou, C. C.; Lin, T. W.; Chen, C. Y.; Wang, A. H. J. Crystal structure of the hyperthermophilic archaeal DNA-binding protein Sso10b2 at a resolution of 1.85 Angstroms. J. Bacteriol. 2003, 185 (14), 4066−73. (7) Freeman, K.; Gwadz, M.; Shore, D. Molecular and genetic analysis of the toxic effect of RAP1 overexpression in yeast. Genetics 1995, 141 (4), 1253−1262. (8) Gao, M.; Skolnick, J. DBD-Hunter: a knowledge-based method for the prediction of DNA-protein interactions. Nucleic Acids Res. 2008, 36 (12), 3978−3992. (9) Nimrod, G.; Szilágyi, A.; Leslie, C.; Ben-Tal, N. Identification of DNA-binding proteins using structural, electrostatic and evolutionary features. J. Mol. Biol. 2009, 387 (4), 1040−1053. (10) Zhao, H. Y.; Yang, Y. D.; Zhou, Y. Q. Structure-based prediction of DNA-binding proteins by structural alignment and a volume-fraction corrected DFIRE-based energy function. Bioinformatics 2010, 26 (15), 1857−1863. (11) Lin, W. Z.; Fang, J. A.; Xiao, X.; Chou, K. C. iDNA-Prot: identification of DNA binding proteins using random forest with grey model. 
PLoS One 2011, 6 (9), e24756. (12) Zhao, X. W.; Li, X. T.; Ma, Z. Q.; Yin, M. H. Identify DNA-binding proteins with optimal Chou’s amino acid composition. Protein Pept. Lett. 2012, 19 (4), 398−405. (13) Zou, C. X.; Gong, J. Y.; Li, H. L. An improved sequence based prediction protocol for DNA-binding proteins using SVM and comprehensive feature analysis. BMC Bioinf. 2013, 14 (1), 90. (14) Lou, W. C.; Wang, X. Q.; Chen, F.; Chen, Y. X.; Jiang, B.; Zhang, H. Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naïve Bayes. PLoS One 2014, 9 (1), e86703. (15) Song, L.; Li, D. P.; Zeng, X. X.; Wu, Y. F.; Guo, L.; Zou, Q. nDNA-prot: identification of DNA-binding proteins based on unbalanced classification. BMC Bioinf. 2014, 15 (1), 298. (16) Xu, R. F.; Zhou, J. Y.; Liu, B.; Yao, L.; He, Y. L.; Zou, Q.; Wang, X. L. enDNA-Prot: identification of DNA-binding proteins by applying ensemble learning. BioMed Res. Int. 2014, 2014 (1), 294279. (17) Liu, B.; Wang, S. Y.; Wang, X. L. DNA binding protein identification by combining pseudo amino acid composition and profile-based protein representation. Sci. Rep. 2015, 5, 15479. (18) Xu, R. F.; Zhou, J. Y.; Liu, B.; He, Y. L.; Zou, Q.; Wang, X. L.; Chou, K. C. Identification of DNA-binding proteins by incorporating evolutionary information into pseudo amino acid composition via the top-n-gram approach. J. Biomol. Struct. Dyn. 2015, 33 (8), 1720−1730. (19) Dong, Q. W.; Wang, S. Y.; Wang, K.; Liu, X.; Liu, B. Identification of DNA-binding proteins by auto-cross covariance transformation. IEEE International Conference on Bioinformatics & Biomedicine 2015, 470−475. (20) Waris, M.; Ahmad, K.; Kabir, M.; Hayat, M. Identification of DNA binding proteins using evolutionary profiles position specific scoring matrix. Neurocomput. 2016, 199 (C), 154−162.
(40) Chen, W.; Feng, P. M.; Yang, H.; Ding, H.; Lin, H.; Chou, K. C. iRNA-3typeA: identifying 3-types of modification at RNA’s adenosine sites. Mol. Ther.–Nucleic Acids 2018, 11 (C), 468−474. (41) Qiu, W. R.; Sun, B. Q.; Xiao, X.; Xu, Z. C.; Jia, J. H.; Chou, K. C. iKcr-PseEns: Identify lysine crotonylation sites in histone proteins with pseudo components and ensemble classifier. Genomics 2018, 110 (5), 239−246. (42) Feng, P.; Yang, H.; Ding, H.; Lin, H.; Chen, W.; Chou, K. C. iDNA6mA-PseKNC: Identifying DNA N6-methyladenosine sites by incorporating nucleotide physicochemical properties into PseKNC. Genomics 2019, 111 (1), 96−102. (43) Hussain, W.; Khan, Y. D.; Rasool, N.; Khan, S. A.; Chou, K. C. SPalmitoylC-PseAAC: A sequence-based model developed via Chou’s 5-steps rule and general PseAAC for identifying S-palmitoylation sites in proteins. Anal. Biochem. 2019, 568, 14−23. (44) Hussain, W.; Khan, Y. D.; Rasool, N.; Khan, S. A.; Chou, K. C. SPrenylC-PseAAC: A sequence-based model developed via Chou’s 5-steps rule and general PseAAC for identifying S-prenylation sites in proteins. J. Theor. Biol. 2019, 468, 1−11. (45) Jia, J. H.; Li, X. Y.; Qiu, W. R.; Xiao, X.; Chou, K. C. iPPI-PseAAC(CGR): Identify protein-protein interactions by incorporating chaos game representation into PseAAC. J. Theor. Biol. 2019, 460, 195−203. (46) Khan, Y. D.; Jamil, M.; Hussain, W.; Rasool, N.; Khan, S. A.; Chou, K. C. pSSbond-PseAAC: Prediction of disulfide bonding sites by integration of PseAAC and statistical moments. J. Theor. Biol. 2019, 463, 47−55. (47) Chou, K. C. Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol. 2011, 273 (1), 236−247. (48) Ma, X.; Guo, J.; Sun, X. DNABP: Identification of DNA-Binding Proteins Based on Feature Selection Using a Random Forest and Predicting Binding Residues. PLoS One 2016, 11 (12), e0167345. (49) Consortium, U. P. 
Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012, 40 (Database issue), D71−D75. (50) Altschul, S. F.; Madden, T. L.; Schäffer, A. A.; Zhang, J. H.; Zhang, Z.; Miller, W.; Lipman, D. J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17), 3389−3402. (51) Yu, X. J.; Cao, J. P.; Cai, Y. D.; Shi, T. L.; Li, Y. X. Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines. J. Theor. Biol. 2006, 240 (2), 175−184. (52) Kumar, M.; Gromiha, M. M.; Raghava, G. P. Identification of DNA-binding proteins using support vector machines and evolutionary profiles. BMC Bioinf. 2007, 8 (1), 463. (53) Zhang, C. T.; Chou, K. C. An optimization approach to predicting protein structural class from amino acid composition. Protein Sci. 1992, 1 (3), 401−408. (54) Chou, K. C.; Elrod, D. W. Bioinformatical analysis of Gprotein-coupled receptors. J. Proteome Res. 2002, 1 (5), 429−433. (55) Chou, K. C.; Cai, Y. D. Prediction and classification of protein subcellular location-sequence-order effect and pseudo amino acid composition. J. Cell. Biochem. 2003, 90 (6), 1250−1260. (56) Hu, L. L.; Huang, T.; Shi, X. H.; Lu, W. C.; Cai, Y. D.; Chou, K. C. Predicting Functions of Proteins in Mouse Based on Weighted Protein-Protein Interaction Network and Protein Hybrid Properties. PLoS One 2011, 6 (1), e14556. (57) Cai, Y. D.; Feng, K. Y.; Lu, W. C.; Chou, K. C. Using LogitBoost classifier to predict protein structural classes. J. Theor. Biol. 2006, 238 (1), 172−176. (58) Chou, K. C. Impacts of bioinformatics to medicinal chemistry. Med. Chem. 2015, 11 (3), 218−234. (59) Chou, K. C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Struct., Funct., Genet. 2001, 43 (3), 246−255. (60) Chou, K. C. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 2005, 21 (1), 10− 19.
(61) Dehzangi, A.; Heffernan, R.; Sharma, A.; Lyons, J.; Paliwal, K.; Sattar, A. Gram-positive and Gram-negative protein subcellular localization by incorporating evolutionary-based descriptors into Chou’s general PseAAC. J. Theor. Biol. 2015, 364, 284−294. (62) Behbahani, M.; Mohabatkar, H.; Nosrati, M. Analysis and comparison of lignin peroxidases between fungi and bacteria using three different modes of Chou’s general pseudo amino acid composition. J. Theor. Biol. 2016, 411, 1−5. (63) Kabir, M.; Hayat, M. iRSpot-GAEnsC: identifying recombination spots via ensemble classifier and extending the concept of Chou’s PseAAC to formulate DNA samples. Mol. Genet. Genomics 2016, 291 (1), 285−296. (64) Meher, P. K.; Sahu, T. K.; Saini, V.; Rao, A. R. Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chou’s general PseAAC. Sci. Rep. 2017, 7, 42362. (65) Ju, Z.; He, J. J. Prediction of lysine propionylation sites using biased SVM and incorporating four different sequence features into Chou’s PseAAC. J. Mol. Graphics Modell. 2017, 76, 356−363. (66) Yu, B.; Li, S.; Qiu, W. Y.; Chen, C.; Chen, R. X.; Wang, L.; Wang, M. H.; Zhang, Y. Accurate prediction of subcellular location of apoptosis proteins combining Chou’s PseAAC and PsePSSM based on wavelet denoising. Oncotarget 2017, 8 (64), 107640−107665. (67) Ahmad, J.; Hayat, M. MFSC: Multi-voting based feature selection for classification of Golgi proteins by adopting the general form of Chou’s PseAAC components. J. Theor. Biol. 2019, 463, 99− 109. (68) Akbar, S.; Hayat, M. iMethyl-STTNC: Identification of N6methyladenosine sites by extending the Idea of SAAC into Chou’s PseAAC to formulate RNA sequences. J. Theor. Biol. 2018, 455, 205− 211. (69) Contreras-Torres, E. Predicting Structural Classes of Proteins by Incorporating their Global and Local Physicochemical and Conformational Properties into General Chou’s PseAAC. J. Theor. Biol. 
2018, 454, 139−145. (70) Zhang, S. L.; Liang, Y. Y. Predicting apoptosis protein subcellular localization by integrating auto-cross correlation and PSSM into Chou’s PseAAC. J. Theor. Biol. 2018, 457, 163−169. (71) Tahir, M.; Hayat, M.; Khan, S. A. iNuc-ext-PseTNC: an efficient ensemble model for identification of nucleosome positioning by extending the concept of Chou’s PseAAC to pseudo-tri-nucleotide composition. Mol. Genet. Genomics 2019, 294 (1), 199−210. (72) Chou, K. C. An Unprecedented Revolution in Medicinal Chemistry Driven by the Progress of Biological Science. Curr. Top. Med. Chem. 2017, 17 (21), 2337−2358. (73) Shen, H. B.; Chou, K. C. PseAAC: A flexible web server for generating various kinds of protein pseudo amino acid composition. Anal. Biochem. 2008, 373 (2), 386−388. (74) Du, P. F.; Wang, X.; Xu, C.; Gao, Y. PseAAC-Builder: A cross-platform stand-alone program for generating various special Chou’s pseudo-amino acid compositions. Anal. Biochem. 2012, 425 (2), 117−119. (75) Cao, D. S.; Xu, Q. S.; Liang, Y. Z. propy: a tool to generate various modes of Chou’s PseAAC. Bioinformatics 2013, 29 (7), 960−962. (76) Du, P. F.; Gu, S. W.; Jiao, Y. S. PseAAC-General: fast building various modes of general form of Chou’s pseudo-amino acid composition for large-scale protein datasets. Int. J. Mol. Sci. 2014, 15 (3), 3495−3506. (77) Chou, K. C. Pseudo Amino Acid Composition and its Applications in Bioinformatics, Proteomics and System Biology. Curr. Proteomics 2009, 6 (4), 262−274. (78) Chen, W.; Lei, T. Y.; Jin, D. C.; Lin, H.; Chou, K. C. PseKNC: A flexible web server for generating pseudo K-tuple nucleotide composition. Anal. Biochem. 2014, 456, 53−60. (79) Chen, W.; Lin, H.; Chou, K. C. Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences. Mol. BioSyst. 2015, 11 (10), 2620−2634.
(101) Xiao, X.; Cheng, X.; Chen, G. Q.; Mao, Q.; Chou, K. C. pLoc_bal-mGpos: Predict subcellular localization of Gram-positive bacterial proteins by quasi-balancing training dataset and PseAAC. Genomics 2018, DOI: 10.1016/j.ygeno.2018.05.017. (102) Cheng, X.; Xiao, X.; Chou, K. C. pLoc-mVirus: Predict subcellular localization of multi-location virus proteins via incorporating the optimal GO information into general PseAAC. Gene 2017, 628, 315−321. (103) Cheng, X.; Zhao, S. G.; Lin, W. Z.; Xiao, X.; Chou, K. C. pLoc-mAnimal: predict subcellular localization of animal proteins with both single and multiple sites. Bioinformatics 2017, 33 (22), 3524−3531. (104) Xiao, X.; Cheng, X.; Su, S. C.; Mao, Q.; Chou, K. C. Nat. Sci. 2017, 9, 330−349. (105) Cheng, X.; Xiao, X.; Chou, K. C. pLoc-mGneg: Predict subcellular localization of Gram-negative bacterial proteins by deep gene ontology learning via general PseAAC. Genomics 2018, 110 (4), 231−239. (106) Cheng, X.; Xiao, X.; Chou, K. C. pLoc-mEuk: Predict subcellular localization of multi-label eukaryotic proteins by extracting the key GO information into general PseAAC. Genomics 2018, 110 (1), 50−58. (107) Cheng, X.; Zhao, S. G.; Xiao, X.; Chou, K. C. iATC-mISF: a multi-label classifier for predicting the classes of anatomical therapeutic chemicals. Bioinformatics 2016, 33 (3), 341−346. (108) Cheng, X.; Zhao, S. G.; Xiao, X.; Chou, K. C. iATC-mHyb: a hybrid multi-label classifier for predicting the classification of anatomical therapeutic chemicals. Oncotarget 2017, 8 (35), 58494− 58503. (109) Qiu, W. R.; Sun, B. Q.; Xiao, X.; Xu, Z. C.; Chou, K. C. iPTMmLys: identifying multiple lysine PTM sites and their different types. Bioinformatics 2016, 32 (20), 3116−3123. (110) Chou, K. C. Some remarks on predicting multi-label attributes in molecular biosystems. Mol. BioSyst. 2013, 9 (6), 1092−1100. (111) Chou, K. C.; Jiang, S. P.; Liu, W. M.; Fee, C. H. Graph Theory of enzyme kinetics: 1. Stedy-State reaction system. 
Scientia Sinica 1979, 22 (3), 341−358. (112) Chou, K. C.; Forsén, S. Graphical rules for enzyme-catalysed rate laws. Biochem. J. 1980, 187 (3), 829−835. (113) Chou, K. C.; Forsén, S.; Zhou, G. Q. Three Schematic Rules for deriving apparent rate constants. Chemica Scripta 1980, 16, 109− 113. (114) Chou, K. C.; Carter, R. E.; Forsén, S. New graphical method for deriving rate equations for complicated mechanisms. Chemica Scripta 1981, 18, 82−86. (115) Kuo-Chen, C.; Forsen, S. Graphical rules of steady-state reaction systems. Can. J. Chem. 1981, 59 (4), 737−755. (116) Zhou, G. P.; Deng, M. H. An extension of Chou’s graphic rules for deriving enzyme kinetic equations to systems involving parallel reaction pathways. Biochem. J. 1984, 222 (1), 169−176. (117) Chou, K. C. Graphic rules in steady and non-steady state enzyme kinetics. J. Biol. Chem. 1989, 264 (20), 12074−12079. (118) Althaus, I. W.; Chou, J. J.; Gonzales, A. J.; Deibel, M. R.; Chou, K. C.; Kezdy, F. J.; Romero, D. L.; Palmer, J. R.; Thomas, R. C.; Aristoff, P. A.; Tarpley, W. G.; Reusser, F. Kinetic studies with the non-nucleoside HIV-1 reverse transcriptase inhibitor U-88204E. Biochemistry 1993, 32 (26), 6548−6554. (119) Chou, K. C. Applications of graph theory to enzyme kinetics and protein folding kinetics: Steady and non-steady-state systems. Biophys. Chem. 1990, 35 (1), 1−24. (120) Althaus, I. W.; Gonzales, A. J.; Chou, J. J.; Romero, D. L.; Deibel, M. R.; Chou, K. C.; Kezdy, F. J.; Resnick, L.; Busso, M. E.; So, A. G. The quinoline U-78036 is a potent inhibitor of HIV-1 reverse transcriptase. J. Biol. Chem. 1993, 268 (20), 14875−14880. (121) Chou, K. C. Graphic rule for drug metabolism systems. Curr. Drug Metab. 2010, 11 (4), 369−378. (122) Zhou, G. P. The disposition of the LZCC protein residues in wenxiang diagram provides new insights into the protein−protein interaction mechanism. J. Theor. Biol. 2011, 284 (1), 142−148.
(80) Liu, B.; Yang, F.; Huang, D. S.; Chou, K. C. iPromoter-2L: a two-layer predictor for identifying promoters and their types by multiwindow-based PseKNC. Bioinformatics 2018, 34 (1), 33−40. (81) Tahir, M.; Tayara, H.; Chong, K. T. iRNA-PseKNC(2methyl): Identify RNA 2′-O-methylation sites by convolution neural network and Chou’s pseudo components. J. Theor. Biol. 2019, 465, 1−6. (82) Liu, B.; Liu, F. L.; Wang, X. L.; Chen, J. J.; Fang, L. Y.; Chou, K. C. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 2015, 43 (W1), W65−W71. (83) Liu, B.; Wu, H.; Chou, K. C. Pse-in-One 2.0: An Improved Package of Web Servers for Generating Various Modes of Pseudo Components of DNA, RNA, and Protein Sequences. Nat. Sci. 2017, 9, 67−91. (84) You, Z. H.; Chan, K. C. C.; Hu, P. W. Predicting proteinprotein interactions from primary protein sequences using a novel multi-scale local feature representation scheme and the random forest. PLoS One 2015, 10 (5), e0125811. (85) Sutskever, I.; Martens, J.; Dahl, G.; Hinton, G. On the importance of initialization and momentum in deep learning. PMLR 2013, 28, 1139−1147. (86) Chou, K. C. Prediction of protein signal sequences and their cleavage sites. Proteins: Struct., Funct., Genet. 2001, 42 (1), 136−139. (87) Chou, K. C. Using subsite coupling to predict signal peptides. Protein Eng., Des. Sel. 2001, 14 (2), 75−79. (88) Chou, K. C. Prediction of signal peptides using scaled window. Peptides 2001, 22 (12), 1973−1979. (89) Xu, Y.; Shao, X. J.; Wu, L. Y.; Deng, N. Y.; Chou, K. C. iSNOAAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins. PeerJ 2013, 1, e171. (90) Xu, Y.; Wen, X.; Wen, L. S.; Wu, L. Y.; Deng, N. Y.; Chou, K. C. iNitro-Tyr: prediction of nitrotyrosine sites in proteins with general pseudo amino acid composition. PLoS One 2014, 9 (8), e105018. (91) Jia, J. 
H.; Liu, Z.; Xiao, X.; Liu, B. X.; Chou, K. C. pSuc-Lys: Predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach. J. Theor. Biol. 2016, 394, 223−230. (92) Zhang, C. J.; Tang, H.; Li, W. C.; Lin, H.; Chen, W.; Chou, K. C. iOri-Human: identify human origin of replication by incorporating dinucleotide physicochemical properties into pseudo nucleotide composition. Oncotarget 2016, 7 (43), 69783−69793. (93) Chen, W.; Ding, H.; Feng, P. M.; Lin, H.; Chou, K. C. iACP: a sequence-based tool for identifying anticancer peptides. Oncotarget 2016, 7 (13), 16895−16909. (94) Liu, B.; Yang, F.; Chou, K. C. 2L-piRNA: A Two-Layer Ensemble Classifier for Identifying Piwi-Interacting RNAs and Their Function. Mol. Ther.–Nucleic Acids 2017, 7, 267−277. (95) Liu, B.; Wang, S. Y.; Long, R.; Chou, K. C. iRSpot-EL: identify recombination spots with an ensemble learning approach. Bioinformatics 2017, 33 (1), 35−41. (96) Feng, P. M.; Ding, H.; Yang, H.; Chen, W.; Lin, H.; Chou, K. C. iRNA-PseColl: Identifying the Occurrence Sites of Different RNA Modifications by Incorporating Collective Effects of Nucleotides into PseKNC. Mol. Ther.–Nucleic Acids 2017, 7, 155−163. (97) Ehsan, A.; Mahmood, K.; Khan, Y. D.; Khan, S. A.; Chou, K. C. A Novel Modeling in Mathematical Biology for Classification of Signal Peptides. Sci. Rep. 2018, 8 (1), 1039. (98) Chou, K. C.; Cheng, X.; Xiao, X. pLoc_bal-mEuk: predict subcellular localization of eukaryotic proteins by general PseAAC and quasi-balancing training dataset. Med. Chem. 2019, 15, 1−14. (99) Cheng, X.; Lin, W. Z.; Xiao, X.; Chou, K. C. pLoc_bal-mAnimal: predict subcellular localization of animal proteins by balancing training dataset and PseAAC. Bioinformatics 2019, 35 (3), 398−406. (100) Cheng, X.; Xiao, X.; Chou, K. C. PLoc-mPlant: Predict subcellular localization of multi-location plant proteins by incorporating the optimal GO information into general PseAAC. Mol. BioSyst. 2017, 13 (9), 1722−1727.
(123) Chou, K. C.; Lin, W. Z.; Xiao, X. Wenxiang: a web-server for drawing wenxiang diagrams. Nat. Sci. 2011, 3, 862−865.
(124) Althaus, I. W.; Chou, J. J.; Gonzales, A. J.; Deibel, M. R.; Chou, K. C.; Kezdy, F. J.; Romero, D. L.; Aristoff, P. A.; Tarpley, W. G.; Reusser, F. Steady-state kinetic studies with the non-nucleoside HIV-1 reverse transcriptase inhibitor U-87201E. J. Biol. Chem. 1993, 268 (9), 6119−6124.
(125) Chou, K. C.; Forsén, S. Diffusion-controlled effects in reversible enzymatic fast reaction systems - critical spherical shell and proximity rate constant. Biophys. Chem. 1980, 12 (3−4), 255−263.
(126) Chou, K. C.; Li, T. T.; Forsén, S. The critical spherical shell in enzymatic fast reaction systems. Biophys. Chem. 1980, 12 (3−4), 265−269.
(127) Shen, H. B.; Song, J. N.; Chou, K. C. Prediction of protein folding rates from primary sequence by fusing multiple sequential features. J. Biomed. Sci. Eng. 2009, 2 (3), 136−143.
(128) Chou, K. C.; Chen, N. Y.; Forsén, S. The biological functions of low-frequency phonons: 2. cooperative effects. Chemica Scripta 1981, 18, 126−132.
(129) Chou, K. C. Low-frequency collective motion in biomacromolecules and its biological functions. Biophys. Chem. 1988, 30 (1), 3−48.
(130) Sonego, P.; Kocsor, A.; Pongor, S. ROC analysis: applications to the classification of biological sequences and 3D structures. Briefings Bioinf. 2008, 9 (3), 198−209.
(131) Qu, K.; Han, K.; Wu, S.; Wang, G. H.; Wei, L. Y. Identification of DNA-Binding Proteins Using Mixed Feature Representation Methods. Molecules 2017, 22 (10), 1602.
(132) Liu, B.; Xu, J. G.; Fan, S. X.; Xu, R. F.; Zhou, J. Y.; Wang, X. L. PseDNA-Pro: DNA-Binding Protein Identification by Combining Chou’s PseAAC and Physicochemical Distance Transformation. Mol. Inf. 2015, 34 (1), 8−17.
(133) Huang, Y.; Niu, B. F.; Gao, Y.; Fu, L. M.; Li, W. Z. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 2010, 26 (5), 680−682.
(134) Zhang, J.; Gao, B.; Chai, H. T.; Ma, Z. Q.; Yang, G. F. Identification of DNA-binding proteins using multi-features fusion and binary firefly optimization algorithm. BMC Bioinf. 2016, 17 (1), 323.
(135) Hao, Y. W.; Chun, A.; Cheung, K.; Rashidi, B.; Yang, X. L. Tumor suppressor LATS1 is a negative regulator of oncogene YAP. J. Biol. Chem. 2008, 283 (9), 5496−5509.
(136) Hisaoka, M.; Tanaka, A.; Hashimoto, H. Molecular alterations of h-warts/LATS1 tumor suppressor in human soft tissue sarcoma. Lab. Invest. 2002, 82 (10), 1427−1435.
(137) Ariazi, E. A.; Clark, G. M.; Mertz, J. E. Estrogen-related receptor alpha and estrogen-related receptor gamma associate with unfavorable and favorable biomarkers, respectively, in human breast cancer. Cancer Res. 2002, 62 (22), 6510−6518.
(138) Huang, Y. S.; Litvinov, I. V.; Wang, Y.; Su, M. W.; Tu, P.; Jiang, X. Y.; Kupper, T. S.; Dutz, J. P.; Sasseville, D.; Zhou, Y. W. Thymocyte selection-associated high mobility group box gene (TOX) is aberrantly over-expressed in mycosis fungoides and correlates with poor prognosis. Oncotarget 2014, 5 (12), 4418−4425.
(139) Swift, G. H.; Liu, Y.; Rose, S. D.; Bischof, L. J.; Steelman, S.; Buchberg, A. M.; Wright, C. V. E.; Macdonald, R. J. An endocrine-exocrine switch in the activity of the pancreatic homeodomain protein PDX1 through formation of a trimeric complex with PBX1b and MRG1 (MEIS2). Mol. Cell. Biol. 1998, 18 (9), 5109−5120.
(140) Caubit, X.; et al. TSHZ3 deletion causes an autism syndrome and defects in cortical projection neurons. Nat. Genet. 2016, 48 (11), 1359−1369.
(141) Duncan, P. I.; Stojdl, D. F.; Marius, R. M.; Scheit, K. H.; Bell, J. C. The Clk2 and Clk3 Dual-Specificity Protein Kinases Regulate the Intranuclear Distribution of SR Proteins and Influence Pre-mRNA Splicing. Exp. Cell Res. 1998, 241 (2), 300−308.
(142) Eisenreich, A.; Bogdanov, V. Y.; Zakrzewicz, A.; Pries, A.; Antoniak, S.; Poller, W.; Schultheiss, H. P.; Rauch, U. Cdc2-like kinases and DNA topoisomerase I regulate alternative splicing of tissue factor in human endothelial cells. Circ. Res. 2009, 104 (5), 589−599.
(143) Zhang, P.; Liu, Q. L.; Yan, S. F.; Yuan, G.; Shen, J.; Li, G. Homeobox-containing protein 1 loss is associated with clinicopathological performance in glioma. Mol. Med. Rep. 2017, 16 (4), 4101−4106.
(144) Henderson, D. M.; Conner, S. D. A novel AAK1 splice variant functions at multiple steps of the endocytic pathway. Mol. Biol. Cell 2007, 18 (7), 2698−2706.
(145) Kostich, W.; et al. Inhibition of AAK1 Kinase as a Novel Therapeutic Approach to Treat Neuropathic Pain. J. Pharmacol. Exp. Ther. 2016, 358 (3), 371−386.
(146) Chou, K. C.; Shen, H. B. REVIEW: Recent advances in developing web-servers for predicting protein attributes. Nat. Sci. 2009, 1 (2), 63−92.
(147) Cheng, X.; Xiao, X.; Chou, K. C. pLoc-mHum: predict subcellular localization of multi-location human proteins via general PseAAC to winnow out the crucial GO information. Bioinformatics 2018, 34 (9), 1448−1456.
(148) Qiu, W. R.; Jiang, S. Y.; Xu, Z. C.; Xiao, X.; Chou, K. C. iRNAm5C-PseDNC: identifying RNA 5-methylcytosine sites by incorporating physical-chemical properties into pseudo dinucleotide composition. Oncotarget 2017, 8 (25), 41178−41188.
(149) Qiu, W. R.; Sun, B. Q.; Xiao, X.; Xu, D.; Chou, K. C. iPhosPseEvo: Identifying Human Phosphorylated Proteins by Incorporating Evolutionary Information into General PseAAC via Grey System Theory. Mol. Inf. 2017, 36 (5−6), 1600010.
(150) Chen, Z.; Liu, X.; Li, F.; Li, C.; Marquez-Lago, T.; Leier, A.; Akutsu, T.; Webb, G. I.; Xu, D.; Smith, A. I.; Li, L.; Chou, K. C.; Song, J. Large-scale comparative assessment of computational predictors for lysine post-translational modification sites. Briefings Bioinf. 2018, DOI: 10.1093/bib/bby089.
(151) Cheng, X.; Xiao, X.; Chou, K. C. pLoc_bal-mGneg: Predict subcellular localization of Gram-negative bacterial proteins by quasi-balancing training dataset and general PseAAC. J. Theor. Biol. 2018, 458, 92−102.
(152) Cheng, X.; Xiao, X.; Chou, K. C. pLoc_bal-mPlant: Predict Subcellular Localization of Plant Proteins by General PseAAC and Balancing Training Dataset. Curr. Pharm. Des. 2019, 24 (34), 4013−4022.
(153) Chou, K. C.; Cheng, X.; Xiao, X. pLoc_bal-mHum: Predict subcellular localization of human proteins by PseAAC and quasi-balancing training dataset. Genomics 2018, DOI: 10.1016/j.ygeno.2018.08.007.
(154) Xiao, X.; Cheng, X.; Chen, G.; Mao, Q.; Chou, K. C. pLoc_bal-mVirus: predict subcellular localization of multi-label virus proteins by PseAAC and IHTS treatment to balance training dataset. Med. Chem. 2019, 15, 1−14.
DOI: 10.1021/acs.jproteome.9b00226 J. Proteome Res. XXXX, XXX, XXX−XXX