J. Proteome Res. 2019, 18, 2931−2939 | pubs.acs.org/jpr
Incorporating Distance-Based Top-n-gram and Random Forest To Identify Electron Transport Proteins

Xiaoqing Ru,†,‡ Lihong Li,‡ and Quan Zou*,†,§

† Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
‡ School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China
§ Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
ABSTRACT: Cellular respiration provides the direct energy substances for living organisms, and electron storage and transportation during cellular respiration are completed through electron transport chains. Thus, identifying electron transport proteins is an important research task. In protein identification, the choice of feature extraction method and classification algorithm has a direct bearing on the classification result. The distance-based Top-n-gram (DT) method, which is based on the frequency profile and considers evolutionary information, was used in this study for feature extraction. The Max-Relevance-Max-Distance (MRMD) algorithm was adopted for feature selection, and the first 4D features that most influenced the classification result were selected to form the feature data set. Finally, the random forest algorithm was used to identify electron transport proteins. Under 10-fold cross-validation of the model constructed in this study, the sensitivity, specificity, and accuracy rates surpassed 85%, 80%, and 82%, respectively. On the testing set, the F-measure, AUC value, and accuracy exceeded 74%, 95%, and 86%, respectively. These experimental results indicate that the classification model built in this study is an effective tool for identifying electron transport proteins.

KEYWORDS: electron transport proteins, protein identification, feature extraction, distance-based Top-n-gram method, feature selection, Max-Relevance-Max-Distance, random forest, F-measure, AUC value, ACC
1. INTRODUCTION

As a major type of membrane protein, transport proteins mediate the exchange of chemical substances and signals across biological membranes.1 Cells are isolated from their surroundings by the hydrophobic barriers formed by lipid bilayers. As most hydrophilic compounds, such as saccharides, amino acids, and drugs, cannot freely move in and out of cells, specific transport proteins are needed to pass through these hydrophobic barriers. Many types of transport proteins carry ions, proteins, mRNA, and electrons,2,3 which exert strong effects on biological growth, life,4 and the treatment of disease.5,6 This study investigated the electron transport chain, which plays a significant role among transport proteins.7 According to molecular function, an electron transport chain consists of five protein complexes, namely, Complexes I−IV and ATP synthase.8 These complexes are embedded in the mitochondrial inner membrane, chloroplast thylakoids, or other biological membranes, and electrons are transported through them. The electron transport chain is the pathway for electron storage and transport in the mitochondria during cellular respiration, the process of extracting electrons from natural compounds and finally generating products, including the ATP needed in various life activities.8,9 Cellular respiration thereby provides the direct energy source (ATP) for the activities of living organisms. As the quantities of finally generated ATP vary, the electron transport chains in the mitochondria have primary and secondary orders. On the main transport chain, electrons from NADH are transported to oxygen after passing through Complex I, coenzyme Q, Complex III, cytochrome c, and Complex IV, and products, including 2.5 ATPs, are finally released. On the secondary transport chain, electrons from FADH2 first pass through Complex II and then coenzyme Q, Complex III, cytochrome c, and Complex IV, and products, including 1.5 ATPs, are finally generated.

With the advent of the postgenomic era, the quantity of protein sequences with unknown structures and functions has grown substantially. Thus, accurately and rapidly completing protein classification is an important task. Traditional biological experimental modes are time consuming because they are characterized by long cycles, leaving researchers unable to meet the demand of the postgenomic era. Therefore, researchers have applied computer technologies to protein classification and achieved good classification results. For instance, Zou et al.10 identified cytokines based on a new type of ensemble classifier and achieved 93.3% classification accuracy. Zhang et al.11 used four different feature extraction
methods to derive bacteriophage features, determined the optimal feature subset through an incremental feature selection algorithm, and finally obtained 85% accuracy. Xu et al.12 proposed SeqSVM, a classifier for antioxidant proteins based on extracted sequence features and the support vector machine (SVM) algorithm, whose accuracy reached 89.46%. Similarly, other researchers have applied machine learning to identify electron transport proteins. Chen et al.1 divided electron transport proteins into four types, which were then classified using PSSM profiles and biochemical properties. Le et al.7 combined a convolutional neural network in deep learning with the position-specific scoring matrix to identify electron transport proteins. Le et al.8 used the radial basis function network and biochemical properties to identify the molecular functions of electron transport proteins. A machine learning approach was also adopted in this paper to complete the classification of electron transport proteins.

The mathematical expression of the protein sequence and the selected classification algorithm have a direct bearing on the classification effect. Many feature extraction methods are available, such as methods based on g-gap dipeptide composition,13−16 the eight physicochemical properties of amino acids,17 n-grams,18 and profiles.19−22 Because of the diversity of feature subspaces, multiple feature subspaces can also be fused to boost learning performance.23−26 This study selected two feature extraction methods based on the frequency profile, namely, the distance-based Top-n-gram (DT) method and profile-based autocross covariance (ACC-PSSM). The former expands the original eigenvectors based on Top-n-grams and considers the relative position information of Top-n-gram pairs in the protein sequence to ensure a comprehensive expression of protein sequence information. ACC-PSSM21 is a multivariate modeling method that combines the relevance between the same property of two residues with that between different properties. The feature dimensionalities extracted through these two methods are high, and redundancy exists between features. In protein classification studies, researchers should not only emphasize the final classification accuracy but also avoid the curse of dimensionality caused by extremely high dimensionality: extremely high feature dimensionality consumes a large amount of computing resources and delays experimental progress, and overfitting may occur, which then causes deviation in the final experimental results. Therefore, the extracted features have to be screened to find the features that have a great bearing on the classification result and to constitute a feature data set with low dimensionality. The MRMD algorithm27 was used in this study to pick out a feature subset with strong relevance between feature and category as well as low redundancy between features.

Algorithms that have mostly been used in protein classification studies with favorable effects include SVM and random forest. As an ensemble machine learning algorithm, random forest28−33 builds and combines multiple decision trees to complete accurate classification, whereas the SVM algorithm takes a longer time to classify a large data set. Therefore, the random forest algorithm was used in this study to complete the classification of electron transport proteins.
2. METHODS

The main identification work on electron transport proteins can be divided into three parts: data set processing, feature set acquisition, and selection of the classification algorithm. The flowchart of this study is shown in Figure 1.

Figure 1. Outline flowchart of this study.

2.1. Data Set Acquisition
Original Data Set Acquisition. The identification of electron transport proteins is a binary classification problem: judging whether a protein is an electron or a nonelectron transport protein. Electron transport proteins are taken as positive examples and nonelectron transport proteins as negative examples in this study. The original positive sample set included 8556 protein sequences downloaded from the Web site provided by Le et al.7 Their data were retrieved from UniProt,34 which has high credibility. After the positive example data were obtained, the protein family database Pfam35 was taken as the data source, the families containing the above-mentioned positive examples were excluded, and the longest protein sequence was taken from each remaining family to constitute the negative example set; generally, long protein sequences are believed to contain comprehensive information. A total of 10 714 negative example sequences were obtained.

Acquisition of a High-Quality Data Set. The original positive-example and negative-example data sets were obtained through the above steps. To ensure the accuracy of the experimental results and low redundancy between samples, the data set was processed as follows: (1) protein sequences containing invalid letters such as B, J, O, U, X, and Z were deleted, (2) protein sequences with lengths smaller than 50 were deleted, and (3) redundancy was eliminated with cd-hit.36−40 As the positive-example samples were retrieved from UniProt, low data redundancy was guaranteed, and the quantity of positive-example samples was small. To reduce the difference between positive and negative examples in the number of protein sequences, the positive-example threshold was taken as 0.8 and the negative-example threshold was taken
as 0.4. Subsequently, a high-quality positive- and negative-example data set was obtained, including 2678 positive examples (the proportions of the five complexes were approximately 70:3:6:12:9) and 9630 negative examples.

Training Set Acquisition. The ratio of positive to negative examples was approximately 1:3 according to the preceding discussion, which indicated unbalanced data; model accuracy could be overestimated if this data set were used directly. To avoid inaccurate estimation of the trained model, data sets with a balanced ratio of positive to negative examples were established for the follow-up study. The balanced training sets were constructed as follows: (1) acquisition of positive examples: 2000 of the 2678 sequences were extracted to form the positive-example set according to the proportions of the five complexes (the quantities of the five complexes were 1379, 66, 113, 255, and 187); (2) acquisition of negative examples: because of the large number of negative examples, 2000 sequences were randomly drawn from the 9630 negative examples, and this extraction was repeated 10 times to obtain 10 negative-example sets and to guarantee the accuracy of the final classification result. This process ensured traversal from the first to the last protein sequence. In summary, 10 balanced data sets, each with 2000 positive examples and 2000 negative examples, were obtained; a sketch of this procedure is given below.
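The filtering and balanced-sampling steps can be summarized in a short script. The following is a minimal sketch, assuming Biopython for FASTA parsing and cd-hit run externally; the file names are illustrative placeholders, not the original study's files.

```python
# Minimal sketch of the data set filtering and balanced-sampling steps.
# Assumes FASTA files that were already made nonredundant with cd-hit,
# run externally, e.g.
#   cd-hit -i pos_raw.fasta -o pos_nr.fasta -c 0.8
#   cd-hit -i neg_raw.fasta -o neg_nr.fasta -c 0.4 -n 2
# (thresholds near 0.4 need the smaller word size -n 2).
import random
from Bio import SeqIO  # Biopython

INVALID = set("BJOUXZ")  # invalid/ambiguous amino acid letters

def load_clean(fasta_path):
    """Keep sequences with a valid alphabet and length >= 50."""
    seqs = []
    for rec in SeqIO.parse(fasta_path, "fasta"):
        s = str(rec.seq).upper()
        if len(s) >= 50 and not (set(s) & INVALID):
            seqs.append(s)
    return seqs

positives = load_clean("pos_nr.fasta")  # 2678 sequences in the study
negatives = load_clean("neg_nr.fasta")  # 9630 sequences in the study

# One fixed positive set of 2000 sequences (the study sampled these in
# proportion to the five complexes) plus 10 random negative draws.
random.seed(0)
pos_train = random.sample(positives, 2000)
balanced_sets = [pos_train + random.sample(negatives, 2000) for _ in range(10)]
```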
2.2. Feature Extraction Method

The DT and ACC-PSSM algorithms were used to convert protein sequences of different lengths into eigenvectors of fixed length. Both algorithms are based on the frequency profile and take evolutionary information into account.

ACC-PSSM Algorithm. ACC expresses the property relevance between two residues. Property relevance is divided into relevance between the same property (AC, auto covariance) and relevance between different properties (CC, cross covariance);21 ACC is the combination of the two. The ACC-PSSM algorithm calculates the property relevance between two residues (same or different) based on the expression of evolutionary information on protein sequences in the form of a position-specific scoring matrix. Under a distance threshold, the calculation methods of AC-PSSM and CC-PSSM are as follows:21

\mathrm{AC}(i, d) = \frac{\sum_{j=1}^{L-d} (S_{i,j} - \bar{S}_i)(S_{i,j+d} - \bar{S}_i)}{L - d} \quad (1)

where i is one of the 20 common amino acids, d is the distance between two residues in the protein sequence, S_{i,j} is one element of the position-specific scoring matrix, obtained by calculating the probability of amino acid i appearing at position j, L is the length of the protein sequence, and \bar{S}_i is the average probability of amino acid i over the entire protein sequence. Similarly,21

\mathrm{CC}(i_1, i_2, d) = \frac{\sum_{j=1}^{L-d} (S_{i_1,j} - \bar{S}_{i_1})(S_{i_2,j+d} - \bar{S}_{i_2})}{L - d} \quad (2)

where i_1 and i_2 are two different common amino acids and \bar{S}_{i_1} and \bar{S}_{i_2} are the average probabilities of amino acids i_1 and i_2 in the protein sequence, respectively. In summary, the number of ACC variables is 20 × d_max + 380 × d_max, namely, 400 × d_max. The maximum distance value was set to 2 to explore the influence of the relevance between two residues.
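Eqs 1 and 2 translate directly into array operations. The sketch below assumes pssm is an L × 20 NumPy array holding the probability-transformed position-specific scoring matrix of one protein; it illustrates the formulas and is not the original implementation.

```python
# Array translation of eqs 1 and 2 for an L x 20 probability PSSM.
import numpy as np

def ac_pssm(pssm, d):
    """AC(i, d) of eq 1 for all 20 columns i at distance d (20 values)."""
    L = pssm.shape[0]
    dev = pssm - pssm.mean(axis=0)              # S_{i,j} minus column mean
    return (dev[:L - d] * dev[d:]).sum(axis=0) / (L - d)

def cc_pssm(pssm, d):
    """CC(i1, i2, d) of eq 2 for all ordered pairs i1 != i2 (380 values)."""
    L = pssm.shape[0]
    dev = pssm - pssm.mean(axis=0)
    cov = dev[:L - d].T @ dev[d:] / (L - d)     # 20 x 20 covariance at lag d
    return cov[~np.eye(20, dtype=bool)]         # drop the AC diagonal

def acc_pssm(pssm, d_max=2):
    """Concatenate AC and CC for d = 1..d_max: 400 * d_max features."""
    return np.concatenate([np.concatenate([ac_pssm(pssm, d), cc_pssm(pssm, d)])
                           for d in range(1, d_max + 1)])
```

With d_max = 2, this yields the 800-dimensional ACC-PSSM feature vector used later in the paper.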
DT Algorithm. In 2008, Liu et al.19 proposed a new feature extraction method that calculates the frequency distribution of the 20 common amino acids in the given protein sequence (its frequency profile), arranges the amino acids in descending order of frequency, and selects the n most frequent amino acids according to the combination of frequency values. This combination of n amino acids is called a Top-n-gram, and the protein sequence is represented by the frequency of occurrence of each Top-n-gram; this building block of the protein also contains evolutionary information. On the basis of multiple experimental data sets, that study verified that this feature extraction method performs well in multiple fields of bioinformatics. In 2014, Liu et al.20 improved the Top-n-gram-based method by considering the sequence order (the relative positions of Top-n-gram pairs) in addition to the Top-n-grams themselves. Features extracted in this way express protein sequence information more accurately. The acquisition of Top-n-grams follows Liu et al.;20 in consideration of the distance d between Top-n-grams, the feature extraction method is as follows. When the distance is 0,

D^{d=0}(S') = \{T^0_{i_1}(S'), T^0_{i_2}(S'), \ldots, T^0_{i_{20}}(S')\} \quad (3)

When the distance d lies between 1 and the given maximum distance d_max (1 ≤ d ≤ d_max),

D^{1 \le d \le d_{\max}}(S') = \{T^d_{i_1 i_1}(S'), T^d_{i_1 i_2}(S'), \ldots, T^d_{i_{20} i_{20}}(S')\} \quad (4)

where i_1, i_2, ..., i_{20} are the 20 amino acids with frequencies of occurrence in descending order, S' is the sequence consisting of Top-1-grams, and T^0_{i_1} is the frequency of occurrence of the Top-1-gram consisting of i_1 at distance 0 in S'. To prevent excessive dimensionality from influencing the experiment, n was set to 1 in this study. The feature extraction method is identical for every distance greater than 1; thus, the maximum d value should be set greater than 2 to accurately explore the influence of this feature extraction method, and d was therefore set to 3. In this way, 20 + 20 × 20 × 3 = 1220-dimensional features were obtained; a sketch of this computation follows.
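The following is a compact sketch of the DT extraction for n = 1 and d_max = 3, assuming profile is an L × 20 frequency profile (e.g., derived from PSI-BLAST). Normalizing the counts by sequence length is one simple choice here; the published method may differ in such details.

```python
# Sketch of the DT features for n = 1, d_max = 3: 20 Top-1-gram counts
# (d = 0) plus 20 x 20 pair counts for each distance d = 1..3, i.e.
# 20 + 400 x 3 = 1220 dimensions.
import numpy as np

def dt_features(profile, d_max=3):
    top1 = profile.argmax(axis=1)       # Top-1-gram index at each position
    L = len(top1)
    feats = np.zeros(20 + 400 * d_max)
    for aa in top1:                     # eq 3: d = 0, single Top-1-grams
        feats[aa] += 1
    for d in range(1, d_max + 1):       # eq 4: Top-1-gram pairs at distance d
        for j in range(L - d):
            feats[20 + 400 * (d - 1) + 20 * top1[j] + top1[j + d]] += 1
    return feats / L                    # frequencies rather than raw counts
```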
2.3. Feature Selection

Strong relevance between feature and category can be used to obtain several features with strong predictive abilities. However, this single criterion cannot ensure that the obtained feature set is the optimal feature subset: if some features were closely associated, feature redundancy would certainly exist, which wastes computing resources and influences the final classification effect.41−44 The MRMD algorithm proposed by Zou et al.27 was used in this study to determine the optimal feature subset. Pearson's correlation coefficient, an important basis for measuring the correlation between two vectors,45 was used to calculate the similarity between feature and category. Three distance formulas, namely, the Euclidean distance,46 cosine distance,47 and Tanimoto coefficient,48 were used to determine low-redundancy features: the greater the distance, the higher the degree of freedom between features and the lower the redundancy. Therefore, the feature selection criterion is

\max(\mathrm{MR}_i + \mathrm{MD}_i) \quad (5)

In the feature subset, the features are sorted according to this score: the greater the value, the closer the feature is to the front of the feature subset. In this study, 40 (10 × 4) training sets were processed: 188-dimensional features were extracted based on the physicochemical properties, 400-dimensional features based on the adaptive k-skip-2-g algorithm, 800-dimensional features based on the ACC-PSSM algorithm, and 1220-dimensional features based on the DT algorithm. These feature sets were then subjected to MRMD processing. For the feature sets extracted with the same feature extraction method, the features whose front-most ranking orders (namely, max(MR + MD)) were consistent across the 10 feature sets were selected as the new feature set. For instance, in the 10 188-dimensional feature sets extracted based on the amino acid physicochemical properties, features 120 and 157 ranked in the first two positions in each set, whereas the feature ranked third differed among the 10 data sets. Concrete results are shown in Table 1, and a simplified sketch of the MRMD ranking is given below.
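This simplified version uses Pearson's correlation for MR and an equally weighted average of the three distances for MD; the rescaling step and other implementation details may differ from the published MRMD tool.

```python
# Simplified MRMD-style feature ranking (eq 5). X: n_samples x n_features,
# y: binary labels. Illustration only.
import numpy as np

def _cosine_dist(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def _tanimoto_dist(a, b):
    dot = a @ b
    return 1.0 - dot / (a @ a + b @ b - dot + 1e-12)

def mrmd_rank(X, y):
    """Return feature indices sorted by MR_i + MD_i, best first."""
    n, m = X.shape
    Xs = (X - X.mean(0)) / (X.std(0) + 1e-12)
    ys = (y - y.mean()) / (y.std() + 1e-12)
    mr = np.abs(Xs.T @ ys) / n          # |Pearson r| between feature and class
    md = np.zeros(m)
    for i in range(m):                  # O(m^2) loop: fine for a sketch
        d = 0.0
        for j in range(m):
            if i != j:
                a, b = X[:, i], X[:, j]
                d += (np.linalg.norm(a - b)       # Euclidean distance
                      + _cosine_dist(a, b)        # cosine distance
                      + _tanimoto_dist(a, b))     # Tanimoto distance
        md[i] = d / (3 * (m - 1))
    md /= md.max() + 1e-12              # rescale so MR and MD are comparable
    return np.argsort(-(mr + md))
```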
Table 1. Feature Ranking under Each Method

no.   physicochemical   Top-n-gram    ACC-PSSM      DT
1     feature 120       feature 211   feature 13    feature 1
2     feature 157                     feature 413   feature 21
3                                     feature 19    feature 421
4                                                   feature 821

3. RESULTS AND DISCUSSION

3.1. Measurement

The confusion matrix directly represents the performance of the constructed classification model. The numbers of correctly and falsely predicted samples in the positive and negative classes can be read intuitively from the confusion matrix, which is given in Table 2.

Table 2. Confusion Matrix

                            positive examples predicted   negative examples predicted
actual positive examples    TP                            FN
actual negative examples    FP                            TN

TP (true positive) denotes a positive example predicted as positive; FN (false negative), a positive example predicted as negative; FP (false positive), a negative example predicted as positive; and TN (true negative), a negative example predicted as negative.

Evaluation Criteria for Performance. The three most common performance evaluation criteria can be obtained from the confusion matrix, namely, accuracy (ACC), sensitivity (SN), and specificity (SP),49−60 which are expressed as follows:

\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \quad (6)

\mathrm{SN} = \frac{TP}{TP + FN} \quad (7)

\mathrm{SP} = \frac{TN}{TN + FP} \quad (8)
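Eqs 6−8 follow directly from the four confusion-matrix counts; the numbers in the usage example below are invented purely to illustrate the calculation.

```python
# Direct translation of eqs 6-8 from the confusion-matrix counts.
def basic_metrics(tp, fn, fp, tn):
    acc = (tp + tn) / (tp + tn + fp + fn)   # eq 6, accuracy
    sn = tp / (tp + fn)                     # eq 7, sensitivity (recall)
    sp = tn / (tn + fp)                     # eq 8, specificity
    return acc, sn, sp

print(basic_metrics(tp=1700, fn=300, fp=380, tn=1620))  # (0.83, 0.85, 0.81)
```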
In the protein classification process, when the accuracy, sensitivity, and specificity results are largely identical under different classification algorithms, it is impossible to determine which algorithm is the most applicable. Furthermore, the data set used in a study is not necessarily balanced; great differences may exist between the numbers of positive and negative samples, and the basic performance evaluation results are then not sufficiently persuasive. Therefore, measuring criteria with greater discriminability, the F-measure and the AUC value, were also used in identifying electron transport proteins. The F-measure61 is computed as

F\text{-measure} = \frac{2PR}{P + R} \quad (9)

P = \frac{TP}{TP + FP} \quad (10)

where P is the precision and R is the recall rate; the formula of R is identical to eq 7. The AUC value62 is computed as

\mathrm{AUC} = \frac{S_P - n_P(n_P + 1)/2}{n_P n_F} \quad (11)

where n_P and n_F are the numbers of positive and negative samples, respectively, S_P = \sum r_i, and r_i is the position of the ith positive example in the precedence table (which includes both positive and negative examples).
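The following sketch implements eqs 9−11; the rank-based AUC ignores tied scores for brevity.

```python
# Eqs 9-11: F-measure from precision/recall and AUC from positive ranks.
import numpy as np

def f_measure(tp, fn, fp):
    p = tp / (tp + fp)            # eq 10, precision
    r = tp / (tp + fn)            # recall, identical to eq 7
    return 2 * p * r / (p + r)    # eq 9

def auc_from_ranks(scores, labels):
    """AUC = (S_P - n_P(n_P + 1)/2) / (n_P * n_F), eq 11 (ties ignored)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)  # r_i, 1-based
    n_p = labels.sum()
    n_f = len(labels) - n_p
    s_p = ranks[labels == 1].sum()  # sum of positive-example positions
    return (s_p - n_p * (n_p + 1) / 2) / (n_p * n_f)
```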
3.2. Experiment

Experiment 1: Performance Comparison of Various Feature Extraction Methods. Figure 2 plots classification accuracy against feature dimensionality for the 10 training sets. It shows that the feature extraction methods with favorable effects on the identification of electron transport proteins are DT and ACC-PSSM: relative to the feature extraction algorithm based on the physicochemical properties of amino acids and the Top-n-gram-based algorithm, they achieve high accuracy with few dimensions. When the feature dimensionality in the feature subset is 10, the classification accuracy of the DT-based classifier is approximately 93% and that of the ACC-PSSM-based classifier is approximately 90%. In particular, DT achieves above 80% accuracy with a very small number of dimensions. Therefore, DT is a good feature extraction method for identifying electron transport proteins.

Figure 2. Relationship between dimension and accuracy under the four methods.

Experiment 2: Comparison of Different Classification Algorithms. Selecting the classification algorithm is the final step in constructing a classification model, and the performance of the classification algorithm has a direct bearing on the performance of the model.63 This experiment
aims to verify that the random forest algorithm is more suitable than the SVM for this study. To prove that the random forest algorithm can achieve a better classification result after profile-based feature extraction, the experiment was conducted on the DT- and ACC-PSSM-based feature sets. Meanwhile, to reduce the experimental time, the feature data sets comprised only the front-ranked dimensions whose ranking order was identical after MRMD processing of the 10 training sets: the feature set under the DT method contains 4D features, and the feature set under the ACC-PSSM method contains 3D features. The concrete dimensions are shown in Table 1.

According to the data in Tables 3 and 4, based on the DT feature extraction algorithm, SN, SP, and ACC surpassed 83%, 80%, and 83%, respectively, with only the first four dimensionalities, for both the random forest and support vector machine algorithms. This result again indicates that DT is a good feature extraction method for identifying electron transport proteins. However, under both the DT and the ACC-PSSM algorithms, the SN, SP, and ACC data of the random forest and SVM algorithms could not intuitively support the conclusion that the random forest algorithm is superior in this study.

Table 3. Basic Performance Evaluation Results of the DT Method

                 random forest                support vector machine
                 SN (%)   SP (%)   ACC (%)    SN (%)   SP (%)   ACC (%)
partETtrain01    86.5     81.5     84.0       84.1     85.3     84.7
partETtrain02    85.7     81.6     83.6       84.9     85.5     85.2
partETtrain03    87.0     81.4     84.2       84.7     85.2     84.9
partETtrain04    85.0     80.2     82.6       84.0     84.1     84.1
partETtrain05    86.0     80.7     83.3       84.0     84.1     84.0
partETtrain06    85.8     80.6     83.2       83.8     85.5     84.6
partETtrain07    85.6     81.1     83.3       83.9     84.3     84.1
partETtrain08    85.8     81.7     83.7       84.0     85.2     84.6
partETtrain09    85.3     80.3     82.8       83.8     84.8     84.3
partETtrain10    85.3     81.1     83.2       83.3     83.9     83.6

Table 4. Basic Performance Evaluation Results of the ACC-PSSM Method

                 random forest                support vector machine
                 SN (%)   SP (%)   ACC (%)    SN (%)   SP (%)   ACC (%)
partETtrain01    75.8     78.5     77.1       69.4     84.6     77.0
partETtrain02    74.1     78.3     76.2       68.5     84.0     76.2
partETtrain03    75.7     78.3     77.0       68.9     84.7     76.8
partETtrain04    75.2     78.6     76.9       70.7     85.2     77.9
partETtrain05    76.3     78.0     77.1       70.4     84.9     77.6
partETtrain06    75.5     77.9     76.7       69.3     83.9     76.6
partETtrain07    75.7     78.5     77.1       69.3     84.5     76.9
partETtrain08    76.2     78.8     77.5       69.9     84.9     77.4
partETtrain09    74.8     76.6     75.7       69.1     84.9     76.2
partETtrain10    74.7     78.2     76.4       69.4     83.6     76.5
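A sketch of this experiment 2 comparison with scikit-learn is shown below, assuming X holds the selected low-dimensional features of one balanced training set and y its labels. The exact classifier settings of the study are not reported, so the hyperparameters here are illustrative.

```python
# Sketch of the experiment 2 comparison: 10-fold cross-validation of a
# random forest and an SVM on one balanced training set.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def compare_classifiers(X, y):
    models = {
        "random forest": RandomForestClassifier(n_estimators=100,
                                                random_state=1),
        "SVM": SVC(kernel="rbf", gamma="scale"),
    }
    for name, clf in models.items():
        acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
        sn = cross_val_score(clf, X, y, cv=10, scoring="recall").mean()
        auc = cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean()
        print(f"{name}: ACC={acc:.3f}  SN={sn:.3f}  AUC={auc:.3f}")
```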
Performance evaluation criteria with greater discriminability, namely, the F-measure and the AUC value, are therefore needed to evaluate the model; concrete results are presented in Tables 5 and 6. According to these data, under both the DT and the ACC-PSSM algorithms, the AUC value of the random forest algorithm is higher than that of the SVM algorithm by approximately 7%. Therefore, the random forest algorithm is more applicable to this study.

Table 5. F-Measure and AUC Value under the DT Method

                 random forest                support vector machine
                 F-measure (%)   AUC (%)      F-measure (%)   AUC (%)
partETtrain01    84.4            92.1         84.6            84.7
partETtrain02    84.0            91.7         85.2            85.3
partETtrain03    84.7            91.7         84.9            85.0
partETtrain04    83.1            91.2         84.1            84.1
partETtrain05    83.8            91.0         84.1            84.1
partETtrain06    83.6            91.1         84.5            84.7
partETtrain07    83.7            91.3         84.1            84.1
partETtrain08    84.1            91.7         84.0            84.6
partETtrain09    83.2            91.2         84.3            84.3
partETtrain10    83.6            91.3         83.6            83.6
Table 6. F-Measure and AUC Value under the ACC-PSSM Method

                 random forest                support vector machine
                 F-measure (%)   AUC (%)      F-measure (%)   AUC (%)
partETtrain01    76.8            83.9         75.1            77.0
partETtrain02    75.7            83.8         74.3            76.3
partETtrain03    76.7            84.3         74.8            76.8
partETtrain04    76.5            84.1         76.3            78.0
partETtrain05    77.0            84.1         75.9            77.7
partETtrain06    76.4            84.2         74.8            76.6
partETtrain07    76.8            84.6         75.0            76.9
partETtrain08    77.2            84.2         75.6            77.5
partETtrain09    75.5            83.4         74.4            76.3
partETtrain10    76.0            83.9         74.7            76.5

Experiment 3: Testing Set with Many Samples. All of the preceding experiments used 10-fold cross-validation to train the classification model. To verify the generalizability of the model constructed in this study, a testing set with many samples was used in experiment 3. As the positive and negative examples in the selected testing set are unbalanced, the F-measure and AUC value are used to evaluate the performance of the classification model: a high F-measure indicates the effectiveness of the selected method, and the higher the AUC value, the better the performance of the constructed classification model. The testing set was used to test all 10 training sets, and the results are presented in Table 7. All F-measures reached approximately 75%, all AUC values exceeded 95%, and accuracy surpassed 86%, proving that the selected method and constructed model are effective for the identification of electron transport proteins.
Table 7. Test Performance Results with the Testing Set

                 F-measure (%)   AUC (%)   ACC (%)
partETtrain01    75.1            95.2      86.3
partETtrain02    75.1            95.5      86.3
partETtrain03    74.8            95.3      86.0
partETtrain04    74.6            95.4      86.0
partETtrain05    75.5            95.6      86.5
partETtrain06    75.4            95.6      86.4
partETtrain07    75.2            95.6      86.3
partETtrain08    74.9            95.4      86.1
partETtrain09    74.9            95.4      86.1
partETtrain10    76.1            95.7      86.9
Experiment 4: Comparison with the Latest Research. At present, only a few studies deal with the identification of electron transport proteins. In 2017, Le et al.7 incorporated a position-specific scoring matrix and neural networks to classify electron transport proteins. Under 5-fold cross-validation, their accuracy reached 89.4%, SN was 51.1%, and SP was 96.1%; on a testing set, the three values reached 92.3%, 80.3%, and 94.4%, respectively. However, that study only identified electron transport proteins from among transport proteins, and thus our research scope is broader.

The first several dimensions of features, which have a large correlation coefficient between feature and category and a high degree of freedom between features, were selected using the MRMD algorithm, and according to the results of experiments 1, 2, and 3, these selected features greatly influenced the classification result. This part of the experiment shows the distribution of the selected features over positive and negative examples using box plots and analyzes the influence of this distribution on the classification result. As shown in Figure 3, the distribution difference between positive- and negative-example data is large for the features extracted through the DT algorithm. On the basis of experiment 2, the classification performance achieved with the DT-based features is better than that with the ACC-PSSM features. Thus, the following conclusion can be drawn: the more distinct the positive- and negative-example distributions of the extracted features, the better the influence on the final classification result.

Figure 3. Positive and negative data distribution of effective features of ACC-PSSM and DT.

In addition, according to experiment 1, among the four feature extraction algorithms, the 188-dimensional feature set extracted from the eight physicochemical properties of amino acids has the poorest effect, followed by the 400-dimensional features extracted with the Top-n-gram method. Figure 4 shows that the positive- and negative-example distributions of the features extracted from the eight physicochemical properties are not obvious, thereby validating the effectiveness of the obtained conclusion. A box-plot sketch is given below.

Figure 4. Positive and negative data distribution of effective features of 188 and 400.
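The box-plot analysis of Figures 3 and 4 can be reproduced for any selected feature with a few lines of matplotlib; the function below is a generic sketch, not the original plotting code.

```python
# Generic sketch of the Figure 3/4-style box plots: compare the value
# distribution of one selected feature between positive and negative examples.
import matplotlib.pyplot as plt
import numpy as np

def feature_boxplot(X, y, idx, name="feature"):
    X, y = np.asarray(X), np.asarray(y)
    groups = [X[y == 1, idx], X[y == 0, idx]]   # positive vs negative values
    plt.boxplot(groups, labels=["positive", "negative"])
    plt.title(f"Distribution of {name}")
    plt.ylabel("feature value")
    plt.show()
```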
4. CONCLUSIONS

Electron transport proteins exert significant effects on the growth and life activities of living organisms. Therefore, research that accurately distinguishes electron from nonelectron transport proteins is essential. Given this scenario, the classification model in this study was built as follows: (1) invalid sequences and redundancy were eliminated from the original data set to obtain a relatively high-quality data set; (2) feature extraction was conducted with the DT algorithm so that the extracted features include evolutionary information; (3) because the feature dimensionality was extremely high and redundancy occurred between features, the MRMD algorithm was used to determine features with strong correlation between feature and category and low redundancy between features, forming a comparatively excellent feature subset; and (4) in view of its random data extraction and random selection of candidate features, the random forest algorithm was used for the final classification. In multiple experiments, the model constructed in this study achieved good results in both 10-fold cross-validation and the test on the testing set. In addition, under different feature extraction modes, the greater the distribution difference between positive and negative examples, the better the influence on the final classification effect. According to the experimental conclusions of this study, this classification model is an effective tool to distinguish electron from nonelectron transport proteins. A more profound study of the electron transport chain could be conducted in the future to develop an effective model that can classify the five complexes in the electron transport chain by using computational intelligence such as neural networks64−67 or multiobjective optimization.68,69

AUTHOR INFORMATION

Corresponding Author
*E-mail: [email protected].

ORCID
Xiaoqing Ru: 0000-0002-2968-6435
Quan Zou: 0000-0001-6406-1142

Notes
The authors declare no competing financial interest.

ACKNOWLEDGMENTS

The work was supported by the National Key R&D Program of China (2018YFC0910405) and the Natural Science Foundation of China (No. 61771331).

REFERENCES

(1) Chen, S. A.; Ou, Y. Y.; Lee, T. Y.; Gromiha, M. M. Prediction of transporter targets using efficient RBF networks with PSSM profiles and biochemical properties. Bioinformatics 2011, 27 (15), 2062−2067. (2) Saier, M. H.; Tran, C. V.; Barabote, R. D. TCDB: the Transporter Classification Database for membrane transport protein analyses and information. Nucleic Acids Res. 2006, 34 (90001), D181−D186. (3) Ren, Q.; Kang, K. H.; Paulsen, I. T. TransportDB: a relational database of cellular membrane transport systems. Nucleic Acids Res. 2004, 32 (90001), D284−D288. (4) Le, N. Q. K.; Ou, Y. Y. Prediction of FAD binding sites in electron transport proteins according to efficient radial basis function networks and significant amino acid pairs. BMC Bioinf. 2016, 17 (1), 298. (5) Yu, L.; Huang, J. B.; Ma, Z. X.; Zhang, J.; Zou, Y. P.; Gao, L. Inferring drug-disease associations based on known protein complexes. BMC Med. Genomics 2015, 8, 13. (6) Yu, L.; Ma, X.; Zhang, L.; Zhang, J.; Gao, L. Prediction of new drug indications based on clinical data and network modularity. Sci. Rep. 2016, 6, 32530. (7) Le, N. Q.; Ho, Q. T.; Ou, Y. Y. Incorporating deep learning with convolutional neural networks and position specific scoring matrices for identifying electron transport proteins. J. Comput. Chem. 2017, 38 (23), 2000−2006. (8) Le, N. Q. K.; Nguyen, T. T. D.; Ou, Y. Y. Identifying the molecular functions of electron transport proteins using radial basis function networks and biochemical properties. J. Mol. Graphics Modell. 2017, 73, 166−178. (9) Feng, P. M.; Chen, W.; Lin, H.; Chou, K.-C. iHSP-PseRAAAC: Identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. Anal. Biochem. 2013, 442 (1), 118−125. (10) Zou, Q.; Wang, Z.; Guan, X.; Liu, B.; Wu, Y.; Lin, Z. An approach for identifying cytokines based on a novel ensemble classifier. BioMed Res. Int. 2013, 2013 (4), 686090. (11) Zhang, L.; Zhang, C.; Gao, R.; Yang, R. An Ensemble Method to Distinguish Bacteriophage Virion from Non-Virion Proteins Based
on Protein Sequence Characteristics. Int. J. Mol. Sci. 2015, 16 (9), 21734−21758. (12) Xu, L.; Liang, G.; Shi, S.; Liao, C. SeqSVM: A Sequence-Based Support Vector Machine Method for Identifying Antioxidant Proteins. Int. J. Mol. Sci. 2018, 19 (6), 1773. (13) Chou, K. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Struct., Funct., Genet. 2001, 43 (3), 246−255. (14) Ding, Y.; Tang, J.; Guo, F. Identification of Protein−Protein Interactions via a Novel Matrix-Based Sequence Representation Model with Amino Acid Contact Information. Int. J. Mol. Sci. 2016, 17 (10), 1623. (15) Feng, P. M.; Ding, H.; Chen, W.; Lin, H. Naive Bayes classifier with feature selection to identify phage virion proteins. Computational and Mathematical Methods in Medicine 2013, 2013 (2), 530696. (16) Liu, B.; Liu, F.; Wang, X.; Chen, J.; Fang, L.; Chou, K.-C. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 2015, 43 (W1), W65−W71. (17) Cai, C. Z.; Han, L. Y.; Ji, Z. L.; Chen, X.; Chen, Y. Z. SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res. 2003, 31 (13), 3692−3697. (18) Xu, L.; Liang, G.; Wang, L.; Liao, C. A Novel Hybrid Sequence-Based Model for Identifying Anticancer Peptides. Genes 2018, 9 (3), 158. (19) Liu, B.; Wang, X.; Lin, L.; Dong, Q.; Wang, X. A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis. BMC Bioinf. 2008, 9 (1), 510. (20) Liu, B.; Xu, J.; Zou, Q.; Xu, R.; Wang, X.; Chen, Q. Using distances between Top-n-gram and residue pairs for protein remote homology detection. BMC Bioinf. 2014, 15 (Suppl 2), S3. (21) Dong, Q.; Zhou, S.; Guan, J. A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation. Bioinformatics 2009, 25 (20), 2655−2662. (22) Wei, L.; Bowen, Z.; Zhiyong, C.; Gao, X.; Liao, M. Exploring Local Discriminative Information from Evolutionary Profiles for Cytokine-Receptor Interaction Prediction. Neurocomputing 2016, 217, 37−45. (23) Zhu, P. F.; Hu, Q.; Hu, Q. H.; Zhang, C. Q.; Feng, Z. Z. Multiview label embedding. Pattern Recognition 2018, 84, 126−135. (24) Zhu, P. F.; Hu, Q. H.; Han, Y. H.; Zhang, C. Q.; Du, Y. Combining neighborhood separable subspaces for classification via sparsity regularized optimization. Inf. Sci. 2016, 370−371, 270−287. (25) Liu, Y.; Wang, X.; Liu, B. A comprehensive review and comparison of existing computational methods for intrinsically disordered protein and region prediction. Briefings Bioinf. 2019, 20 (1), 330−346. (26) Liu, B.; Jiang, S.; Zou, Q. HITS-PR-HHblits: Protein Remote Homology Detection by Combining PageRank and Hyperlink-Induced Topic Search. Briefings Bioinf. 2018, DOI: 10.1093/bib/bby104. (27) Zou, Q.; Zeng, J.; Cao, L.; Ji, R. A novel features ranking metric with application to scalable visual and bioinformatics data classification. Neurocomputing 2016, 173, 346−354. (28) Yao, Y.; Li, X.; Liao, B.; Huang, L.; He, P.; Wang, F.; Yang, J.; Sun, H.; Zhao, Y.; Yang, J. Predicting influenza antigenicity from Hemagglutintin sequence data based on a joint random forest method. Sci. Rep. 2017, 7 (1), 1545. (29) Cutler, A.; Cutler, D. R.; Stevens, J. R. Random Forests. Machine Learning 2004, 45 (1), 157−176. (30) Wei, L.; Tang, J.; Zou, Q.
SkipCPP-Pred: an improved and promising sequence-based predictor for predicting cell-penetrating peptides. BMC Genomics 2017, 18 (S7), 1. (31) Ding, Y.; Tang, J.; Guo, F. Predicting protein-protein interactions via multivariate mutual information of protein sequences. BMC Bioinf. 2016, 17 (1), 398.
(32) Zeng, X.; Liao, Y.; Liu, Y.; Zou, Q. Prediction and Validation of Disease Genes Using HeteSim Scores. IEEE/ACM Trans. Comput. Biol. Bioinf. 2017, 14 (3), 687−695. (33) Liu, B.; Yang, F.; Huang, D-s; Chou, K.-C. iPromoter-2L: a two-layer predictor for identifying promoters and their types by multiwindow-based PseKNC. Bioinformatics 2018, 34 (1), 33−40. (34) Consortium, U. P. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012, 40 (D1), D71. (35) Finn, R. D.; Tate, J.; Mistry, J.; Coggill, P. C.; Sammut, S. J.; Hotz, H.-R.; Ceric, G.; Forslund, K.; Eddy, S. R.; Sonnhammer, E. L. L. The Pfam protein families database. Nucleic Acids Res. 2007, 36 (Database), D281−D288. (36) Li, W.; Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22 (13), 1658. (37) Li, W.; Jaroszewski, L.; Godzik, A. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics 2001, 17 (3), 282−283. (38) Huang, Y.; Niu, B; Gao, Y.; Fu, L.; Li, W. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 2010, 26 (5), 680−682. (39) Chen, W.; Lv, H.; Nie, F.; Lin, H. i6mA-Pred: Identifying DNA N6-methyladenine sites in the rice genome. Bioinformatics 2019, DOI: 10.1093/bioinformatics/btz015. (40) Liu, Y.; Wang, X.; Liu, B. IDP−CRF: Intrinsically Disordered Protein/Region Identification Based on Conditional Random Fields. Int. J. Mol. Sci. 2018, 19, 2483. (41) Zhu, P. F.; Xu, Q.; Hu, Q. H.; Zhang, C. Q. Co-regularized unsupervised feature selection. Neurocomputing 2018, 275, 2855− 2863. (42) Zhu, P. F.; Xu, Q.; Hu, Q. H.; Zhang, C. Q.; Zhao, H. Multilabel feature selection with missing labels. Pattern Recognition 2018, 74, 488−502. (43) Zhu, P. F.; Zhu, W. C.; Hu, Q. H.; Zhang, C. Q.; Zuo, W. M. Subspace clustering guided unsupervised feature selection. Pattern Recognition 2017, 66, 364−374. (44) Yan, K.; Xu, Y.; Fang, X.; Zheng, C.; Liu, B. Protein fold recognition based on sparse representation based classification. Artificial Intelligence in Medicine 2017, 79, 1−8. (45) Pearson, K. Determination of the Coefficient of Correlation. Science 1909, 30 (757), 23−25. (46) Gosling, C. Encyclopedia of Distances. Reference Reviews 2010, 24 (6), 34−34. (47) Senoussaoui, M.; Kenny, P.; Stafylakis, T.; Dumouchel, P. A Study of the Cosine Distance-Based Mean Shift for Telephone Speech Diarization. IEEE/ACM Transactions on Audio Speech & Language Processing 2014, 22 (1), 217−227. (48) Rogers, D. J.; Tanimoto, T. T. A Computer Program for Classifying Plants. Science 1960, 132 (3434), 1115−1118. (49) Liu, B. BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches. Briefings Bioinf. 2017, DOI: 10.1093/bib/bbx165. (50) Wei, L.; Chen, H.; Su, R. M6APred-EL: A Sequence-Based Predictor for Identifying N6-methyladenosine Sites Using Ensemble Learning. Mol. Ther.–Nucleic Acids 2018, 12, 635−644. (51) Tang, H.; Zhao, Y. W.; Zou, P.; Zhang, C. M.; Chen, R.; Huang, P.; Lin, H. HBPred: a tool to identify growth hormonebinding proteins. Int. J. Biol. Sci. 2018, 14 (8), 957−964. (52) Xiong, Y.; Liu, J.; Wei, D.-Q. An accurate feature-based method for identifying DNA-binding residues on protein surfaces. Proteins: Struct., Funct., Genet. 2011, 79 (2), 509−517. (53) Wei, L.; Xing, P.; Zeng, J.; Chen, J.; Su, R.; Guo, F. 
Improved prediction of protein−protein interactions using novel negative samples, features, and an ensemble classifier. Artificial Intelligence in Medicine 2017, 83, 67−74. (54) Wei, L.; Wan, S.; Guo, J.; Wong, K. K. A novel hierarchical selective ensemble classifier with bioinformatics application. Artificial Intelligence in Medicine 2017, 83, 82−90.
(55) Yang, H.; Lv, H.; Ding, H.; Chen, W.; Lin, H. iRNA-2OM: A Sequence-Based Predictor for Identifying 2′-O-Methylation Sites in Homo sapiens. Journal of Computational Biology 2018, 25 (11), 1266−1277. (56) Chen, W.; Yang, H.; Feng, P.; Ding, H.; Lin, H. iDNA4mC: identifying DNA N4-methylcytosine sites based on nucleotide chemical properties. Bioinformatics 2017, 33 (22), 3518−3523. (57) Zeng, X.; Lin, W.; Guo, M.; Zou, Q. A comprehensive overview and evaluation of circular RNA detection tools. PLoS Comput. Biol. 2017, 13 (6), e1005420. (58) Ding, Y.; Tang, J.; Guo, F. Identification of drug-side effect association via multiple information integration with centered kernel alignment. Neurocomputing 2019, 325, 211−224. (59) Ding, Y.; Tang, J.; Guo, F. Identification of drug-target interactions via multiple information integration. Inf. Sci. 2017, 418−419, 546−560. (60) Liu, B.; Li, K.; Huang, D.-S.; Chou, K.-C. iEnhancer-EL: Identifying enhancers and their strength with ensemble learning approach. Bioinformatics 2018, 34 (22), 3835−3842. (61) Pillai, I.; Fumera, G.; Roli, F. F-measure optimization in multilabel classifiers. International Conference on Pattern Recognition; 2012; pp 2424−2427. (62) Ling, C. X.; Huang, J.; Zhang, H. AUC: a Statistically Consistent and more Discriminating Measure than Accuracy. International Joint Conference on Artificial Intelligence; Morgan Kaufmann Publishers Inc., 2003. (63) Zhang, C.; Liu, C.; Zhang, X.; Almpanidis, G. An Up-to-Date Comparison of State-of-the-Art Classification Algorithms. Expert Systems With Applications 2017, 82, 128−150. (64) Cabarle, F. G. C.; Adorna, H. N.; Jiang, M.; Zeng, X. Spiking Neural P Systems With Scheduled Synapses. IEEE Transactions on Nanobioscience 2017, 16 (8), 792−801. (65) Song, T.; Rodriguez-Paton, A.; Zheng, P.; Zeng, X. Spiking Neural P Systems with Colored Spikes. IEEE Transactions on Cognitive and Developmental Systems 2018, 10 (4), 1106−1115. (66) Zhang, X.; Pan, L.; Paun, A. On the universality of axon P systems. IEEE Transactions on Neural Networks and Learning Systems 2015, 26 (11), 2816−2829. (67) Zou, Q.; Xing, P.; Wei, L.; Liu, B. Gene2vec: Gene Subsequence Embedding for Prediction of Mammalian N6-Methyladenosine Sites from mRNA. RNA 2019, 25 (2), 205−218. (68) Xu, H.; Zeng, W.; Zhang, D.; Zeng, X. MOEA/HD: A Multiobjective Evolutionary Algorithm Based on Hierarchical Decomposition. IEEE Transactions on Cybernetics 2019, 49 (2), 517−526. (69) Zhang, X.; Tian, Y.; Jin, Y. A knee point-driven evolutionary algorithm for many-objective optimization. IEEE Transactions on Evolutionary Computation 2015, 19 (6), 761−776.