Article pubs.acs.org/jcim
Cite This: J. Chem. Inf. Model. 2018, 58, 1169−1181
Multiclassification Prediction of Enzymatic Reactions for Oxidoreductases and Hydrolases Using Reaction Fingerprints and Machine Learning Methods Yingchun Cai, Hongbin Yang, Weihua Li, Guixia Liu, Philip W. Lee, and Yun Tang* Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
* Supporting Information
ABSTRACT: Drug metabolism is a complex process in the human body involving a series of enzymatically catalyzed reactions. However, investigating drug metabolism experimentally is costly and time consuming; computational methods have therefore been developed to predict drug metabolism and have shown great advantages. As a first step toward metabolism prediction, classification of metabolic reactions and enzymes is highly desirable. In this study, we developed multiclassification models for predicting the reaction types catalyzed by oxidoreductases and hydrolases, in which three reaction fingerprints were used to describe the reactions and seven machine learning algorithms were employed for model building. Data retrieved from KEGG, containing 1055 hydrolysis and 2510 redox reactions, were used to build the models. The external validation data consisted of 213 hydrolysis and 512 redox reactions extracted from the Rhea database. The best models were built by neural network or logistic regression with a 2048-bit transformation reaction fingerprint. The predictive accuracies of the main-class, subclass, and superclass classification models on the external validation sets were all above 90%. This study should be very helpful for enzymatic reaction annotation and for further study of metabolism prediction.
INTRODUCTION
Metabolism of xenobiotics is one of the major concerns in drug discovery and development. Drug metabolism can produce metabolites that might have physicochemical and pharmacological profiles different from those of the parent drugs and consequently affect drug safety and efficacy.1,2 Drug metabolism in vivo involves enzyme-catalyzed reactions; therefore, rapid and accurate identification of metabolic reactions and metabolites is highly desirable. However, it is costly and time consuming to investigate drug metabolism experimentally. Computational methods have demonstrated great advantages in the prediction of drug pharmacokinetic properties and hence have been developed to discriminate metabolic reactions and further predict drug metabolism.3 Classification of enzymatic reactions is the first step of drug metabolism prediction. It is well known that protein catalytic functions are officially classified by the Enzyme Commission (EC) classification system, which is defined by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB) and widely adopted in the chemical and biological communities (http://www.chem.qmul.ac.uk/iubmb/enzyme/). Meanwhile, a large amount of data is available in public databases, which makes the classification of enzymatic reactions feasible. EC numbers not only represent enzymatic reactions (chemical information) but are also employed as identifiers of enzymatic
genes (genomic information). This duality of EC numbers links the genomic repertoire of enzymatic genes to the chemical repertoire of metabolic pathways, which is called metabolic reconstruction. The EC system manually annotates overall reactions with a hierarchical four-number code, EC a.b.c.d, where a, b, c, and d are the numbers.4 The first number defines the chemical type of reaction catalyzed by the enzyme. The second (subclass) and third (subsubclass) numbers depend on the first number; they define the chemistry in more detail and describe the bond type(s) and the nature of the substrate(s). The last number is a serial number indicating substrate specificity. According to the first number, from 1 to 6, enzymes are classified into six primary types: oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases, in that order. Although widely accepted, the EC system still has some limitations.5 It is often deficient in situations where a reaction proceeds in reverse, where an enzyme catalyzes more than one reaction, or where the same reaction is catalyzed by more than one enzyme.6−9 Therefore, different methods have been developed to automatically annotate and compare metabolic reactions, including those with incomplete EC numbers or without EC numbers.8 Received: November 12, 2017 Published: May 7, 2018
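The four-level hierarchy is easy to manipulate programmatically. The small helper below is hypothetical (not part of any official EC tooling); it splits an EC number into the levels described above and maps the first digit to the six primary enzyme types:

```python
# Hypothetical helper: split an EC number into its four hierarchy levels.
def parse_ec(ec):
    """Return a dict mapping hierarchy level names to the EC digits.

    Incomplete numbers such as '3.1.-.-' keep '-' for unknown levels.
    """
    parts = ec.replace("EC", "").strip().split(".")
    if len(parts) != 4:
        raise ValueError("expected four levels, got %r" % ec)
    levels = ("main class", "subclass", "subsubclass", "serial number")
    return dict(zip(levels, parts))

# First digit -> primary enzyme type (EC main classes 1-6).
EC_MAIN_CLASSES = {
    "1": "oxidoreductase", "2": "transferase", "3": "hydrolase",
    "4": "lyase", "5": "isomerase", "6": "ligase",
}
```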
DOI: 10.1021/acs.jcim.7b00656 J. Chem. Inf. Model. 2018, 58, 1169−1181
Figure 1. General workflow of model building. Note that PF does not support TRF, Morgan2 does not support SRF, and Morgan2 and PF do not support RDF.
In 2004, Kotera et al. developed a method to match a substrate onto a product to identify a reaction center and then assign a reaction classification number within the EC system.8 In 2007, O'Boyle et al. developed a method that uses the reaction mechanism to measure enzyme similarity.10 In 2009, Latino et al. calculated physicochemical descriptors of the reactive bonds and generated a reaction MOLMAP to encode the reaction center.11 The MOLMAP information was also employed for automatic classification of enzymatic reactions with self-organizing maps.12 Subsequently, the Gasteiger group used physicochemical properties of the bonds or atoms at the reaction center to perform reaction classification.13,14 In 2012, Nath and Mitchell investigated the relationships between the EC number, the associated reaction, and the reaction mechanism with three machine learning methods and five encoding descriptors.15 In 2013, Matsuta et al. presented another EC number predictor using mutual information and a support vector machine.16 Although these reaction description approaches can achieve good results in the classification of enzymatic reactions, they depend heavily on the performance of atom mapping and on investigation of the reaction center to identify the atoms or bonds changed in a reaction. A newer strategy for representing a chemical reaction is to encode the transformation as a reaction fingerprint,17−19 first proposed by Daylight (http://www.daylight.com/). A reaction fingerprint describes a chemical reaction by a "difference of Daylight fingerprints", i.e., the reactant fingerprints subtracted from the product fingerprints, which does not require atom−atom mapping to identify a reaction center. Schneider et al. employed reaction fingerprints to classify large-scale chemical reactions with various machine learning methods and obtained excellent results.20 Nevertheless, their method still needs atom−atom mapping to distinguish agents from nonagents.
In this study, we explored several types of reaction fingerprints for enzymatic reactions that are truly independent of atom−atom mapping. With these reaction fingerprints, we built multiclassification models for hydrolysis (EC 3.b.c.d) and oxidation−reduction (redox, EC 1.b.c.d) reactions via machine learning methods. We also built subclass and superclass classification models for the two types of reactions. Then, the performance of the different models was analyzed. The best classification models were further validated on external data sets to ensure their robustness. This study should be very helpful for correcting wrong or inconsistent classifications of enzymes and for classifying numbers of similar reactions.
RESULTS
In this study, a total of 4290 reactions collected from the KEGG21,22 and Rhea23 databases were used to build multiclassification models for the prediction of enzyme-catalyzed hydrolysis and redox reactions. Three types of reaction fingerprints were applied to describe the reactions, namely, the reaction difference fingerprint (RDF),24 the structural reaction fingerprint (SRF),24 and the transformation reaction fingerprint (TRF),20 which were generated from four types of molecular fingerprints, i.e., atom pairs (AP),25 Morgan2,26 topological torsions (TT),27 and the pattern fingerprint (PF).24 Meanwhile, seven machine learning methods were employed to build the multiclassification models,28 including decision tree (DT), k-nearest neighbors (k-NN), logistic regression (LR), naive Bayes (NB), neural network (NN), random forest (RF), and support vector machine (SVM). In addition, three metrics, P (precision), R (recall), and F (F1-score), were used to evaluate the performance of each model. The whole workflow of model building for reaction classification is shown in Figure 1 (detailed information about model construction is given in the Materials and Methods section). We also constructed subclass and superclass models for further classification of those reactions.
Data Set Analysis. Data from KEGG were used to build the classification models, whereas those from Rhea were employed for validation of the models. Originally, 1429 hydrolysis reactions were acquired from KEGG and 1000 from Rhea, while 3236 redox reactions came from KEGG and 2555 from Rhea. After preprocessing, 1055 hydrolysis and 2510 redox reactions remained from KEGG, and 213 hydrolysis and 512 redox reactions remained from Rhea. The details of all these main-class reactions are shown in Figure 2. Meanwhile, the details of the selected subclass hydrolysis and redox reactions are shown in Figures S1 and S2, respectively.
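The difference fingerprints used here can be illustrated without a cheminformatics toolkit. The sketch below is a conceptual stand-in (the real RDF/TRF vectors are built from hashed AP/Morgan2/TT features, e.g., with RDKit): each molecule is reduced to a bag of substructure tokens that are hashed into a fixed-length count vector, and the reactant counts are subtracted from the product counts, so features untouched by the reaction cancel without any atom−atom mapping.

```python
import zlib

def fold(tokens, nbits=2048):
    """Hash substructure tokens into a fixed-length count vector."""
    vec = [0] * nbits
    for tok in tokens:
        vec[zlib.crc32(tok.encode()) % nbits] += 1  # deterministic hash
    return vec

def difference_fingerprint(reactant_tokens, product_tokens, nbits=2048):
    """Product counts minus reactant counts: the unchanged parts of the
    molecules cancel out, leaving only the transformation itself."""
    r = fold(reactant_tokens, nbits)
    p = fold(product_tokens, nbits)
    return [pi - ri for pi, ri in zip(p, r)]
```

Tokens shared by both sides of the reaction map to the same bucket and cancel to zero, which is why no reaction-center detection is needed.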
Figure 2. Data sets of enzyme-catalyzed hydrolysis and redox reactions from the KEGG and Rhea databases. (A, C) Training set from KEGG. (B, D) External validation set from Rhea. X axis: subtypes of reactions. Y axis: count for each reaction subtype.
From Figure 2A, we can see that two subclasses, EC 3.1.c.d and EC 3.5.c.d, contained most of the hydrolysis reactions. Similarly, EC 1.1.c.d and EC 1.14.c.d were the two major subclasses of redox reactions, containing 734 and 697 reactions, respectively (Figure 2C and D). For both classes of reactions, there are also several subclasses containing very few reactions; these were still employed in model building in order to test the performance of the classification models.
Baseline Models on Main Class. The hydrolysis and redox reactions contain 13 and 21 subclasses, respectively. We first conducted a series of explorations to identify the most effective classification models without parameter optimization, called baseline models. Here, reaction fingerprints were generated from different molecular fingerprints (AP, Morgan2, TT, and PF) at different lengths (from 8 to 8192 bits). In total, 88 reaction fingerprints were obtained to describe the reactions, and 616 baseline models were then constructed for the classification of hydrolysis and redox reactions, separately, with the seven machine learning algorithms (DT, k-NN, LR, NB, NN, RF, and SVM). In this step, all parameters of the algorithms were set to their default values. The measurement metrics of these baseline models were averaged and are shown in Figure S3 for hydrolysis reactions and Figure S4 for redox reactions. As shown in Figure S3A, accuracy rose with fingerprint length from 8 to 1024 bits; the same trend is observed in Figure S4A for redox reactions. Figures S3B and S4B show the contributions of the different machine learning methods to the classification of reactions. Obviously, NB performed the worst, followed by SVM. However, from our previous work,29−32 we learned that SVM is an excellent classification algorithm that can produce satisfactory results. In this study, SVM did not perform well, probably because its parameters were not optimized. Therefore, we carried the SVM classifier into the next step to evaluate its performance again. The three types of reaction fingerprints were also compared in this procedure, and SRF was markedly worse than RDF and TRF (Figures S3C and S4C). In Figures S3D and S4D, PF performed worse than the other three molecular fingerprints.
Parameter-Optimized Models on Main Class. Based on the above results, fingerprint lengths from 8 to 64 bits, the machine learning method NB, the reaction fingerprint SRF, and the molecular fingerprint PF were discarded first. All the others were employed in a further study, and their parameters were optimized by a grid search algorithm. The results are shown in Figure 3 for hydrolysis reactions and Figure 4 for redox reactions.
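Grid search simply scores every candidate parameter value on held-out data and keeps the best. Below is a minimal single-parameter sketch for a toy 1D k-NN (illustrative only; a real pipeline would use, e.g., scikit-learn's GridSearchCV with cross-validation over several parameters at once):

```python
def knn_predict(train, k, x):
    """Toy 1D k-NN: train is a list of (feature, label) pairs."""
    nearest = sorted(train, key=lambda fl: abs(fl[0] - x))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)  # majority vote

def grid_search_k(train, valid, ks):
    """Return the k from ks with the best validation accuracy."""
    def accuracy(k):
        hits = sum(knn_predict(train, k, x) == y for x, y in valid)
        return hits / len(valid)
    return max(ks, key=accuracy)
```

On ties, `max` keeps the first best value of k; in practice one would prefer cross-validated scores over a single holdout split.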
Figure 3. Parameter-optimized models for hydrolysis reactions. X axis: (A) fingerprint length, (B) machine learning methods, (C) reaction fingerprint types, and (D) molecular fingerprint to calculate reaction fingerprint. Y axis: P (Precision), R (Recall), and F (F1-score). The black bar indicates the standard deviation.
Hydrolysis Reactions. As shown in Figure 3A, a longer reaction fingerprint could enhance model performance: all three metrics ranged from 0.91 to 0.94 as the fingerprint length increased. The 8192-bit models performed slightly better, with all three metrics over 0.94 (±std. = 0.005−0.008). In Figure 3B, we observed an obviously enhanced performance for the SVM method (F = 0.92−0.948). The k-NN and NN methods were comparable to SVM in classification, with all three measurement metrics higher than 0.93 (±std. = 0.004−0.019). The LR and RF methods performed slightly worse, with P, R, and F all higher than 0.92 (±std. = 0.005−0.026). The DT classifier performed poorly compared with the others. In Figure 3C, we compared two types of reaction fingerprints, RDF and TRF; TRF clearly performed better than RDF. For the molecular fingerprints (Figure 3D), we found that Morgan2 performed almost the same as AP, which could recall 93% of the samples with a low misclassification ratio (P = 0.92−0.936). Based on the above findings, we selected the better models built by LR, k-NN, NN, RF, and SVM, using TRF
calculated from AP and Morgan2. These models were then validated on the external data set. All the results are shown in Figure 5.
From Figure 5A, we see that the fingerprint length could be increased from 128 to 4096 bits without decreasing the performance of the classification models. On further increasing the length to 8192 bits, the accuracy decreased slightly (P = 0.625−0.911, R = 0.87−0.90, and F = 0.85−0.882). Comparing the results at 2048 and 4096 bits, it is easy to see that nothing was gained as the fingerprint length increased. From Figure 5B, it is obvious that k-NN produced the most robust classification models, followed by LR and NN; RF and SVM exhibited slightly worse performance in distinguishing hydrolysis reactions. For the molecular fingerprints, we obtained a result opposite to that of the internal test (Figure 5C): AP performed better than Morgan2 (with 0.96 for P, R, and F). On the basis of these results, the combinations of these methods (k-NN, LR, and NN) and these bit sizes (2048 and 4096) generated the most effective classification models (Table S1).
Figure 5. External validation of classification models for hydrolysis reactions (A, B, and C) and redox reactions (D, E, and F). X axis: (A, D) fingerprint length, (B, E) machine learning methods, and (C, F) molecular fingerprint used to calculate the reaction fingerprint. Y axis: P, R, and F. The black bar indicates the standard deviation.
Redox Reactions. From Figure 4, we find that the models performed almost the same as the length of the reaction fingerprint increased from 1024 to 8192 bits (Figure 4A), quite similar to the hydrolysis reactions. At 8192 bits, all three metrics were higher than 0.86 (±std. = 0.009−0.023) but lower than those for the hydrolysis reactions. As for the machine learning methods (Figure 4B), NN performed the best (F = 0.844−0.879), followed by RF and SVM; the parameter-optimization procedure boosted the classification capacity of SVM. DT performed the worst (F = 0.824−0.861). k-NN generated a high P value of 0.88 but a low R value of 0.83, making it a worse classifier here as well. In Figure 4C, classification models built with TRF surpassed those built with RDF. As for the molecular fingerprints, Morgan2 remained superior to AP (Figure 4D).
Figure 4. Parameter-optimized models for redox reactions. X axis: (A) fingerprint length, (B) machine learning methods, (C) reaction fingerprint types, and (D) molecular fingerprint used to calculate the reaction fingerprint. Y axis: P, R, and F. The black bar indicates the standard deviation.
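The P, R, and F values reported throughout are per-class scores averaged over all classes of the multiclass task. A standard-library sketch of that computation (equivalent in spirit to scikit-learn's macro-averaged precision_recall_fscore_support):

```python
from collections import Counter

def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 for multiclass labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but true label was t
            fn[t] += 1  # true label t was missed
    classes = sorted(set(y_true) | set(y_pred))
    precs, recs, f1s = [], [], []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec); recs.append(rec); f1s.append(f1)
    n = len(classes)
    return sum(precs) / n, sum(recs) / n, sum(f1s) / n
```

Macro-averaging weights every subclass equally, which is why the rare subclasses retained in the training data still influence the reported scores.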
Table 1. Some Examples of Correctly Classified Reactions Shown in Confusion Matrix (Figures S5 and S6) for Main Class of Hydrolases and Oxidoreductases
According to the above results, the NN, LR, RF, and SVM methods were selected to build the better models, together with AP and Morgan2 to create TRFs from 512 to 8192 bits. These new models were further validated on the external data set, and all the results are shown in Figure 5. From Figure 5D, it is easy to see that different fingerprint lengths did not significantly affect the performance of the models. Comparing the four machine learning methods (Figure 5E), SVM performed slightly better than LR and NN (P = 0.827−0.846, R = 0.779−0.801, F = 0.785−0.805). RF exhibited slightly worse classification capacity (P = 0.811−0.834, R = 0.718−0.746, and F = 0.737−0.768). As for the molecular fingerprint (Figure 5F), the same result as for the hydrolysis reactions was obtained, with AP slightly better than Morgan2 (P = 0.788−0.816, R = 0.733−0.783, and F = 0.752−0.788). Based on these results, the combinations of the machine learning methods (LR, NN, and SVM) and bit sizes (2048 and 4096) produced the most effective classification models, shown in Table S2. In Table 1, we list some examples demonstrating the classification capability of the best models, which could correctly identify even minority subclasses: EC 3.9.c.d acts on phosphorus−nitrogen bonds, while EC 3.13.c.d acts on carbon−sulfur bonds. There were only four, three, and one reactions belonging to the three subclasses, respectively, but our models correctly identified all of them. For the redox reactions, we took four reactions as instances for examination and obtained similar
Table 2. Classification Models for Three Subclasses of Hydrolases (EC 3.b.c.d)

                                 Internal Test Set             External Validation Set
EC Number    Bits  Method    P     R     F     Support^a    P     R     F     Support^a
EC 3.1.c.d   2048  k-NN      0.96  0.98  0.97  81           0.96  0.96  0.96  49
EC 3.1.c.d   4096  k-NN      0.96  0.96  0.96  81           0.96  0.96  0.96  49
EC 3.1.c.d   2048  LR        0.98  0.99  0.98  81           0.98  0.98  0.98  49
EC 3.1.c.d   4096  LR        0.98  0.99  0.98  81           0.98  0.98  0.98  49
EC 3.1.c.d   2048  NN        0.98  0.99  0.98  81           0.98  0.98  0.98  49
EC 3.1.c.d   4096  NN        0.98  0.99  0.98  81           0.98  0.98  0.98  49
EC 3.2.c.d   2048  k-NN      1.00  1.00  1.00  34           1.00  1.00  1.00  34
EC 3.2.c.d   4096  k-NN      1.00  1.00  1.00  34           1.00  1.00  1.00  34
EC 3.2.c.d   2048  LR        1.00  1.00  1.00  34           1.00  1.00  1.00  34
EC 3.2.c.d   4096  LR        1.00  1.00  1.00  34           1.00  1.00  1.00  34
EC 3.2.c.d   2048  NN        1.00  1.00  1.00  34           1.00  1.00  1.00  34
EC 3.2.c.d   4096  NN        1.00  1.00  1.00  34           1.00  1.00  1.00  34
EC 3.5.c.d   2048  k-NN      0.81  0.80  0.78  55           0.80  0.69  0.63  35
EC 3.5.c.d   4096  k-NN      0.81  0.80  0.78  55           0.80  0.69  0.63  35
EC 3.5.c.d   2048  LR        0.88  0.86  0.86  55           0.95  0.94  0.94  35
EC 3.5.c.d   4096  LR        0.88  0.86  0.86  55           0.95  0.94  0.94  35
EC 3.5.c.d   2048  NN        0.91  0.89  0.90  55           0.94  0.92  0.93  35
EC 3.5.c.d   4096  NN        0.93  0.92  0.92  55           0.94  0.93  0.93  35

^a Support means the number of samples used for validation.
results. These examples illustrate the excellent classification performance of our models.
Subclass Prediction Models. Three subclasses of the hydrolysis reaction data (b = 1, 2, and 5 in EC 3.b.c.d) and eight subclasses of the redox reaction data (b = 1, 2, 3, 4, 5, 8, 13, and 14 in EC 1.b.c.d) were extracted for further studies to build subclass classification models using the combinations validated above (TRF; 2048 and 4096 bits; k-NN, LR, and NN for EC 3.b.c.d; LR, NN, and SVM for EC 1.b.c.d; AP). We again used 80% of the data as the training set and the remaining 20% as the test set, with an external set for validation of the models. These new models are shown in Table 2 for hydrolysis reactions and Table 3 for redox reactions, as well as in Figure S5 for the standard deviation frequency distribution of the three metrics of all models.
Hydrolysis Reactions. EC 3.1.c.d contains 450 reactions divided into eight subsubclasses. As shown in Table 2, excellent performance was obtained for the six models. As for EC 3.2.c.d, all models achieved complete accuracy: all 34 reactions in the test set and all 34 in the external validation set catalyzed by EC 3.2.c.d were correctly distinguished. The classification results for EC 3.5.c.d showed a relatively lower performance for each model; the LR models performed better than the others (P, R, and F higher than 0.90) on an external validation set of 35 samples. From Table 2, we also found that a 2048-bit fingerprint was enough to generate excellent models in comparison with the 4096-bit one.
Redox Reactions. In Table 3, all of these models showed results very similar to those for the hydrolysis reactions. The 2048-bit fingerprint performed equally to the 4096-bit one, except for EC 1.8.c.d, where the prediction model built by NN with a 2048-bit fingerprint yielded only P = 0.76, R = 0.79, and F = 0.75 on the internal data set. Comparing the performance of all the models, we found that the models for EC 1.4.c.d, EC 1.8.c.d, and EC 1.14.c.d performed worse than the others.
Superclass Prediction Models. To further validate the performance of the classification models, we selected the AP-based TRF at 2048 bits and built four models to classify EC 1.b.c.d and EC 3.b.c.d with four methods (k-NN, LR, NN, and SVM). The prediction results are shown in Table 4. Compared with the other three methods, SVM gave excellent predictions for both data sets, with all metrics up to 0.99 on the external validation set. Both LR and NN performed slightly worse on the internal test set (F = 0.98) but still well on the external data set. k-NN exhibited moderate classification capability on both internal and external data sets. Based on the above investigation of the classification models for the main class, subclass, and superclass of both hydrolysis and redox reactions, the best models were those built by the NN or LR method with a 2048-bit TRF from AP. To further confirm the effectiveness and robustness of these two models, we conducted a leave-one-cluster-out cross validation on the training set to assess their generalization ability. To evaluate how the models would behave on truly new data, we also conducted a similarity analysis between the training set, internal test set, and external validation set to confirm that the three data sets shared low similarity. As shown in Table 5, the results were above 0.8 for hydrolysis reactions, whereas they were only around 0.7 for redox reactions. However, for the superclass prediction model, much higher results were obtained (up to 0.95 for all three metrics). Comparing LR and NN, we see that they showed almost equal classification ability. After analyzing the results for the 11 subclass models in Tables S3 and S4, we reached a similar conclusion. These findings demonstrate the broad application domain of our selected models for predicting future data. The low similarity shown in Figure S6 also supports the usefulness of these models.
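The 80/20 split above is usually performed per class so that rare subclasses appear in both partitions. A standard-library sketch of such a stratified split (scikit-learn's train_test_split with stratify=y behaves similarly):

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """Split sample indices so each class keeps ~the same train/test ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_frac))  # keep >= 1 test sample
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test
```

Splitting per class rather than globally is what keeps subclasses with only a handful of reactions represented in the test set.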
DISCUSSION
In this study, we built multiclassification models for the prediction of enzymatic reaction types using machine learning methods
Table 3. Classification Models for Eight Subclasses of Oxidoreductases (EC 1.b.c.d)

                                 Internal Test Set             External Validation Set
EC Number    Bits  Method    P     R     F     Support^a    P     R     F     Support^a
EC 1.1.c.d   2048  LR        0.97  0.98  0.98  147          0.98  0.99  0.99  99
EC 1.1.c.d   4096  LR        0.97  0.98  0.98  147          0.98  0.99  0.99  99
EC 1.1.c.d   2048  NN        0.97  0.98  0.98  147          0.98  0.99  0.99  99
EC 1.1.c.d   4096  NN        0.97  0.98  0.98  147          0.98  0.99  0.99  99
EC 1.1.c.d   2048  SVM       0.98  0.98  0.98  147          0.99  0.99  0.99  99
EC 1.1.c.d   4096  SVM       0.98  0.98  0.98  147          0.99  0.99  0.99  99
EC 1.2.c.d   2048  LR        0.96  0.95  0.95  43           0.92  0.91  0.91  21
EC 1.2.c.d   4096  LR        0.98  0.98  0.98  43           0.92  0.95  0.93  21
EC 1.2.c.d   2048  NN        0.97  0.97  0.97  43           0.92  0.95  0.93  21
EC 1.2.c.d   4096  NN        0.97  0.97  0.97  43           0.92  0.95  0.93  21
EC 1.2.c.d   2048  SVM       0.93  0.93  0.93  43           0.87  0.86  0.85  21
EC 1.2.c.d   4096  SVM       0.93  0.93  0.93  43           0.87  0.86  0.85  21
EC 1.3.c.d   2048  LR        0.96  0.9   0.95  50           0.99  0.97  0.98  39
EC 1.3.c.d   4096  LR        0.96  0.94  0.95  50           0.99  0.97  0.98  39
EC 1.3.c.d   2048  NN        0.95  0.96  0.95  50           0.94  0.94  0.94  39
EC 1.3.c.d   4096  NN        0.95  0.97  0.96  50           0.97  0.96  0.96  39
EC 1.3.c.d   2048  SVM       0.90  0.90  0.90  50           0.93  0.87  0.88  39
EC 1.3.c.d   4096  SVM       0.90  0.90  0.90  50           0.93  0.87  0.88  39
EC 1.4.c.d   2048  LR        0.87  0.91  0.89  22           0.67  0.82  0.74  11
EC 1.4.c.d   4096  LR        0.87  0.91  0.89  22           0.67  0.82  0.74  11
EC 1.4.c.d   2048  NN        0.86  0.91  0.88  22           0.67  0.82  0.74  11
EC 1.4.c.d   4096  NN        0.86  0.91  0.88  22           0.67  0.82  0.74  11
EC 1.4.c.d   2048  SVM       0.76  0.86  0.80  22           0.67  0.82  0.74  11
EC 1.4.c.d   4096  SVM       0.76  0.86  0.80  22           0.67  0.82  0.74  11
EC 1.5.c.d   2048  LR        0.90  0.94  0.92  18           0.91  0.95  0.93  21
EC 1.5.c.d   4096  LR        0.90  0.94  0.92  18           0.91  0.95  0.93  21
EC 1.5.c.d   2048  NN        0.90  0.94  0.92  18           0.91  0.95  0.93  21
EC 1.5.c.d   4096  NN        0.90  0.94  0.92  18           0.91  0.95  0.93  21
EC 1.5.c.d   2048  SVM       0.90  0.94  0.92  18           0.91  0.95  0.93  21
EC 1.5.c.d   4096  SVM       0.90  0.94  0.92  18           0.91  0.95  0.93  21
EC 1.8.c.d   2048  LR        0.80  0.86  0.83  14           0.83  0.83  0.83  12
EC 1.8.c.d   4096  LR        0.80  0.86  0.83  14           0.83  0.83  0.83  12
EC 1.8.c.d   2048  NN        0.76  0.79  0.75  14           0.83  0.83  0.83  12
EC 1.8.c.d   4096  NN        0.78  0.80  0.77  14           0.83  0.83  0.83  12
EC 1.8.c.d   2048  SVM       0.75  0.79  0.74  14           0.83  0.75  0.78  12
EC 1.8.c.d   4096  SVM       0.75  0.79  0.74  14           0.83  0.75  0.78  12
EC 1.13.c.d  2048  LR        0.90  0.93  0.91  28           0.91  0.90  0.90  10
EC 1.13.c.d  4096  LR        0.90  0.93  0.91  28           0.91  0.90  0.90  10
EC 1.13.c.d  2048  NN        0.90  0.93  0.91  28           0.91  0.90  0.90  10
EC 1.13.c.d  4096  NN        0.90  0.93  0.91  28           0.91  0.90  0.90  10
EC 1.13.c.d  2048  SVM       0.93  0.96  0.95  28           1.00  1.00  1.00  10
EC 1.13.c.d  4096  SVM       0.93  0.96  0.95  28           1.00  1.00  1.00  10
EC 1.14.c.d  2048  LR        0.92  0.91  0.91  140          0.82  0.65  0.63  250
EC 1.14.c.d  4096  LR        0.92  0.91  0.91  140          0.82  0.66  0.63  250
EC 1.14.c.d  2048  NN        0.89  0.89  0.88  140          0.78  0.66  0.61  250
EC 1.14.c.d  4096  NN        0.89  0.89  0.88  140          0.79  0.66  0.61  250
EC 1.14.c.d  2048  SVM       0.93  0.93  0.93  140          0.79  0.67  0.61  250
EC 1.14.c.d  4096  SVM       0.93  0.93  0.93  140          0.79  0.67  0.61  250

^a Support means the number of samples used for validation.
together with reaction fingerprints. Please note that our study was not intended to predict the enzymatic reaction type for a given substrate but rather to find the relation between a whole reaction and a certain enzyme, in order to correct the wrong EC
Table 4. Superclass Classification Models for Hydrolases and Oxidoreductases

Internal Test Set (Support^b = 713)
Method   P (± Std.^a)           R (± Std.^a)           F (± Std.^a)
k-NN     0.97 ± 0.00            0.97 ± 0.00            0.97 ± 2.22 × 10^−16
LR       0.98 ± 1.11 × 10^−16   0.98 ± 1.11 × 10^−16   0.98 ± 1.11 × 10^−16
NN       0.98 ± 1.67 × 10^−03   0.98 ± 1.63 × 10^−03   0.98 ± 1.65 × 10^−03
SVM      0.99 ± 0.00            0.99 ± 1.11 × 10^−16   0.99 ± 0.00

External Validation Set (Support^b = 725)
Method   P (± Std.^a)           R (± Std.^a)           F (± Std.^a)
k-NN     0.96 ± 1.11 × 10^−16   0.96 ± 1.11 × 10^−16   0.96 ± 1.11 × 10^−16
LR       0.99 ± 2.22 × 10^−16   0.99 ± 1.11 × 10^−16   0.99 ± 2.22 × 10^−16
NN       0.99 ± 1.40 × 10^−03   0.99 ± 1.41 × 10^−03   0.99 ± 1.40 × 10^−03
SVM      0.99 ± 1.11 × 10^−16   0.99 ± 0.00            0.99 ± 0.00

^a Std. means standard deviation. ^b Support means the number of samples used for validation.
Table 5. Leave-One-Cluster-Out Cross Validation on the Training Set for the Main Class and Superclass of Both Hydrolysis and Redox Reactions

Class                     Method   P (± Std.^a)           R (± Std.^a)           F (± Std.^a)
EC 3.b.c.d                LR       0.82 ± 1.11 × 10^−16   0.86 ± 0.00            0.84 ± 0.00
EC 3.b.c.d                NN       0.80 ± 4.15 × 10^−03   0.84 ± 3.40 × 10^−03   0.82 ± 2.58 × 10^−03
EC 1.b.c.d                LR       0.70 ± 0.00            0.71 ± 1.11 × 10^−16   0.70 ± 0.00
EC 1.b.c.d                NN       0.71 ± 8.64 × 10^−03   0.73 ± 7.01 × 10^−03   0.71 ± 7.18 × 10^−03
EC 3.b.c.d + EC 1.b.c.d   LR       0.95 ± 2.22 × 10^−16   0.95 ± 1.11 × 10^−16   0.95 ± 2.22 × 10^−16
EC 3.b.c.d + EC 1.b.c.d   NN       0.95 ± 2.29 × 10^−03   0.95 ± 2.24 × 10^−03   0.95 ± 2.27 × 10^−03

^a Std. means standard deviation.
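Leave-one-cluster-out cross validation holds out one structure-based cluster at a time, so every fold scores the model on reactions dissimilar to its training data. The fold generator can be sketched as follows (scikit-learn's LeaveOneGroupOut implements the same loop; the cluster assignments themselves would come from a similarity clustering of the reactions):

```python
def leave_one_cluster_out(cluster_ids):
    """Yield (train_indices, test_indices), holding out one cluster per fold."""
    for held_out in sorted(set(cluster_ids)):
        test = [i for i, c in enumerate(cluster_ids) if c == held_out]
        train = [i for i, c in enumerate(cluster_ids) if c != held_out]
        yield train, test
```

Because no member of the held-out cluster is ever seen during training, the averaged fold scores in Table 5 are a stricter estimate of generalization than a random split.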
numbers of some reactions and to assign experimentally investigated reactions to given enzymes.
Factors That Affect the Accuracy of the Models. On the basis of the above studies, several excellent prediction models were obtained for hydrolysis and redox reactions separately. From this study, we learned that several factors might affect the accuracy of the classification models. First, how chemical reactions are described is important to the results of model building. In this study, we used three types of reaction fingerprints to describe the chemical reactions, namely, SRF, RDF, and TRF. Among them, SRF was filtered out first because it simply puts the molecular fingerprints of the reactants and products together. Most reactions involve only a small part of the reactants, and simple addition of the two molecular fingerprints brings in duplicated bits of chemical features, which may account for the low accuracy of the SRF method (Figures S3C and S4C). We then compared RDF with TRF and found that TRF performed better (Figures 3C and 4C). RDF was calculated by a built-in method in the RDKit software package, while TRF was calculated by our in-house method, partly based on the supporting code from Schneider's work.20 A good reaction fingerprint depends on its length and on the molecular fingerprint used to generate it; the many parameters to be selected for our TRF (three types of molecular fingerprints and 11 fingerprint lengths) supplied a larger parameter-search space in which to find the best models. As to the length of the fingerprint, we found that 2048 bits was enough to describe the reaction fingerprint in comparison with all 11 lengths (Figures 3A and 4A and Figures S3A and S4A). We believe that 2048 bits contained the critical information needed to describe the chemical reactions: increasing the fingerprint length to 4096 bits or longer introduced redundant information that did not further improve the performance of the models, whereas decreasing the fingerprint length to 1024 bits or shorter could not adequately represent a reaction. Regarding the molecular fingerprints, AP performed better than the others. The reason might be that AP contains atom−atom mapping information, which makes generation of a reaction fingerprint much easier. Morgan2 performed slightly worse than AP when the models were tested on an external validation set.
Second, the machine learning methods are key to the quality of the models. NN and LR were outstanding in the classification of both hydrolysis and redox reactions at all three levels. From Figures 3B and 4B, we also see that k-NN performed well for the hydrolysis reactions, and SVM was good enough for the redox reactions. NB performed the worst here, possibly because NB is a simple algorithm based on Bayes' theorem with the naive assumption of independence between every pair of features, whereas no correlation analysis was conducted in the calculation of the reaction fingerprints. DT and RF are two tree-structured classifiers, RF being the advanced version of DT. In our baseline models, DT and RF performed equally to LR and NN (Figures S3B and S4B). After parameter optimization, DT exhibited lower prediction results, while RF remained comparable to NN on the internal test set (Figures 3B and 4B). On the external validation set, the RF model was just a little worse than NN in classification capacity (Figure 5B and E).
Meanwhile, one obvious phenomenon observed in this study is that the classification models for hydrolysis reactions performed better than those for redox reactions. By the definition of IUBMB, hydrolases catalyze 13 classes of reactions and hundreds of subclasses of reactions. The hydrolysis reaction
DOI: 10.1021/acs.jcim.7b00656 J. Chem. Inf. Model. 2018, 58, 1169−1181
Table 6. Misclassified Reactions Shown in Confusion Matrix (Figures S7 and S8) for Main Class of Hydrolases and Oxidoreductases as well as Superclass in Table 4
is an easily understood kind of reaction, proceeding by addition of a proton and a hydroxyl group from water to the reactant to form acid−base products. For example, a carboxylic ester substrate can be hydrolyzed into an alcohol and a carboxylate. This simple catalytic mechanism makes hydrolysis reactions easy to classify and predict. A redox reaction, however, is more complicated than hydrolysis. A wide variety of reactants can be mediated by oxidoreductases, including substances that contain −CH− or −CH2, −CH−OH, aldehyde or oxo, −CH−CH−, −CH−NH2, or −CH−NH− groups, among others. Oxidoreductases also act on NADH or NADPH, cofactors used in anabolic reactions such as lipid and nucleic acid synthesis. Moreover, a variety of reaction types can be summarized from thousands of redox reactions, such as elimination, addition, ring opening, ring closing, and rearrangement, which add to the complexity of the reaction mechanisms. The complex and multifarious nature of redox reactions makes them difficult to identify and classify. Analysis of Misclassified Reactions. In spite of their good predictive performance, our models still have some limitations and gave incorrect results for some reactions. The confusion matrices of the three models are shown in Figures S7 and S8. Most of the counts lie on the diagonal, indicating that the models have high performance, but some confusion points lie off the diagonal. We further performed outlier detection on the external validation set to explore whether these misclassified reactions were outliers (Figures S9−S11). An outlier33 is defined as an observation point distant from the major distribution area of all observations, whereas a misclassification is an observation located within the major distribution area of all observations but predicted wrongly. The methods used to identify outliers differ from those used to identify misclassifications, and one reaction may be an outlier, a misclassification, or both. From Figures S9−S11, we can see that our models identified most of the outliers and that some misclassifications were indeed outliers. We then carefully analyzed the misclassified cases to further understand the advantages and disadvantages of our models (Table 6). Ten hydrolysis reactions were misclassified by the three machine learning methods (Figure S7). One of the reactions (Table 6, entry 1) was labeled as EC 3.4.c.d but incorrectly predicted as EC 3.1.c.d by all three classifiers. EC 3.4.c.d acts on peptide bonds (peptidases), whereas EC 3.1.c.d acts on ester bonds; the reaction contains a thioester substructure, which could be recognized as a substrate of EC 3.1.c.d and hence lead to the wrong prediction. For the second entry of Table 6, an EC 3.5.c.d reaction was wrongly classified as EC 3.8.c.d; a shared product feature might be the reason for the incorrect prediction. The third misclassified reaction (Table 6, entry 3), catalyzed by EC 3.1.7.10 (geranylgeranyl-diphosphate diphosphohydrolase), was predicted as a substrate of EC 3.7.c.d (acting on carbon−carbon bonds) by the LR and NN classifiers. We supposed that it might be a two-step reaction catalyzed by EC 3.7.c.d followed by EC 3.1.c.d: the substrate first forms a ring from its long chain with EC 3.7.c.d as the catalyst, and the ring structure is then hydrolyzed by EC 3.1.c.d to eliminate diphosphate from the alcohol.34 For redox reactions, the reaction in entry 4 of Table 6 was predicted as EC 1.2.c.d (acting on the aldehyde or oxo group of donors) but was actually metabolized by EC 1.1.c.d (acting on the CH−OH group of donors). The reason might be that
the aldehyde group could be identified by EC 1.2.c.d. Another reaction (Table 6, entry 5) from EC 1.5.c.d (acting on the CH−NH group of donors) was predicted to be catalyzed by EC 1.4.c.d (acting on the CH−NH2 group of donors) because of a similar reactive group. EC 1.21.c.d can catalyze reactions with NAD+ or NADP+ as the acceptor, which is the reverse of EC 1.14.c.d (with NADH or NADPH as one donor and incorporation of one atom of oxygen), so the reaction shown in entry 6 of Table 6 could easily be regarded as EC 1.14.c.d by the classifiers. As to the superclass prediction models, several samples were confused by certain classifiers. One reaction (Table 6, entry 7) labeled as EC 1.2.1.95 was not predicted correctly by any of the k-NN, LR, NN, and SVM methods; the thioester and ester functional groups included in the first two reactions could be identified as substrates of EC 3.b.c.d. Two reactions (Table 6, entry 8) labeled as EC 3.5.99.10 and EC 3.5.99.7 were not correctly identified by any of the three methods. These misclassifications might be attributed to limitations of the machine learning methods and insufficiency of the reaction description.
■
MATERIALS AND METHODS
Data Collection and Preparation. All the enzymatic reactions were retrieved from the KEGG21,22 and Rhea23 databases together with their EC numbers as category labels, which resulted in four data sets with molecular structures in the SMILES format. All the data sets were processed as follows. If a polymeric compound was involved in a reaction and could be identified in its monomeric form, the polymer was replaced by the monomer; otherwise, the reaction was discarded. Reactions including compounds with no SMILES descriptions were removed. For the EC 3.b.c.d reactions, in some cases the direction of a reaction was reversed in order to obtain a unique hydrolysis reaction rather than a hydration one. Unbalanced reactions were discarded or, where the situation was clear, manually balanced. Generalized R groups in some reactions were replaced by methyl, and general fragment symbols such as "X" were substituted by a chlorine atom. Following the above procedure, reactions in Rhea that duplicated those in KEGG were removed. Definition of Reaction Fingerprints. In this study, three types of reaction fingerprints were used to describe the reactions (http://www.daylight.com/dayhtml/doc/theory/index.html). RDF is a type of fingerprint specifically tailored for reaction processing, in which the difference between the fingerprints of the reactants and the products reflects the bond changes during the reaction. SRF is the combination of the normal structural fingerprints of the reactant and product molecules within the reaction. TRF was obtained by subtracting the reactant fingerprints from the product fingerprints. Four types of molecular fingerprints, AP, Morgan2, TT, and PF, were used to generate the reaction fingerprints. It should be noted that not every molecular fingerprint can support all three reaction fingerprints: PF does not support TRF, Morgan2 does not support SRF, and neither Morgan2 nor PF supports RDF.
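As a rough illustration of how SRF and TRF differ (not the RDKit-based implementation used in this work), the combination versus difference of per-molecule feature counts can be sketched as follows; the character-count "fingerprint" is merely a stand-in for a real molecular fingerprint such as AP or Morgan2:

```python
from collections import Counter

def mol_fp(smiles):
    # Toy stand-in for a molecular fingerprint (AP, Morgan2, TT, ...):
    # here we just count SMILES characters for illustration.
    return Counter(smiles)

def srf(reactants, products):
    # Structural reaction fingerprint: fingerprints of all molecules
    # are simply combined, so unchanged features appear on both sides.
    fp = Counter()
    for smi in reactants + products:
        fp += mol_fp(smi)
    return dict(fp)

def trf(reactants, products):
    # Transformation reaction fingerprint: product features minus
    # reactant features, so parts untouched by the reaction cancel.
    fp = Counter()
    for smi in products:
        fp.update(mol_fp(smi))
    for smi in reactants:
        fp.subtract(mol_fp(smi))
    return {k: v for k, v in fp.items() if v != 0}

# Ethanol -> acetaldehyde: only the new C=O double bond survives in TRF.
print(trf(["CCO"], ["CC=O"]))  # {'=': 1}
```

The cancellation in `trf` is why duplicated bits for unchanged substructures, which inflate SRF, drop out of TRF.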
All the fingerprints were generated using the open-source cheminformatics toolkit RDKit.24 Machine Learning Methods. In this study, seven machine learning algorithms (DT, k-NN, LR, NB, NN, RF, and SVM) were employed to build the multiclassification models.28 The parameters of each classifier were optimized with a grid search, which exhaustively considers all parameter combinations. All the classifiers were implemented with the open-source machine learning toolkit scikit-learn (version 0.18).35 The default parameters and the grid of parameter values we set for each classifier are shown in Table S5. Decision Tree. DT is an algorithm that builds a tree of decisions as a predictive model.36 Each interior node of the tree corresponds to one of the input variables. At each node, DT splits its set of samples into subsets enriched in one class or the other on the attribute with the highest normalized information gain, and the algorithm then recurses on the resulting subsets. k-Nearest Neighbors. k-NN is a nonparametric method that classifies an object by the majority vote of its k nearest neighbors in the feature space.37 The object is assigned to the class most common among those neighbors, with the Hamming distance used as the nearness metric. Logistic Regression. LR is a regression model in which the dependent variable is categorical, developed by statistician
CONCLUSIONS
In this study, multiclassification models were constructed to predict enzymatic reactions catalyzed by hydrolases (EC 3.b.c.d) and oxidoreductases (EC 1.b.c.d) using machine learning methods and reaction fingerprints. Weighted metrics were used to evaluate the models under class imbalance. Although the machine learning methods themselves are commonly used, this is a comprehensive study of predicting the EC number of each enzyme-catalyzed reaction with reaction fingerprints, and excellent prediction models were obtained from a simple workflow. After comparing different models, we demonstrated that an AP-based TRF of 2048 bits can sufficiently describe a reaction and allows correct classification of reactions by machine learning methods. Furthermore, we found that NN and LR performed very well in assigning EC numbers to hydrolysis and redox reactions. The selected prediction models were tested on external validation sets to demonstrate their robustness in multiple classifications. In spite of their good performance, our models still gave wrong predictions in some instances. We therefore analyzed the misclassified reactions with an outlier detection method, considered their structural characteristics, and found some limitations of the machine learning methods. The advantage of our models is that the input SMILES requires neither atom−atom mapping nor identification of the reaction center, both of which are difficult to obtain. This is a big step toward classification of enzymatic reactions with reaction fingerprints. However, there are still two disadvantages to our method. The first is that the input SMILES used to calculate the reaction fingerprints must obey the law of conservation of mass (i.e., every atom on the left side of a reaction must appear on the right side after the chemical change). Meeting this requirement takes considerable labor and time to clean the reactions from the KEGG and Rhea databases.
The other is that our models cannot assign all four EC numbers at one time. Next, we will develop new descriptive methods to generate reaction fingerprints and new model-building methods to assign EC numbers consecutively.
David Cox in 1958.38 The logistic model estimates the probability of a response based on predictor variables. Naïve Bayes. NB probabilistically classifies samples based on the Bayes rule with strong naïve independence assumptions.39 This statistical strategy categorizes instances on the basis of the equal and independent contributions of their attributes. Neural Network. NN is a computational approach modeled on the way a biological brain solves problems and has been shown to be an effective tool for nonlinear classification problems.40 Topologically, an NN consists of input, hidden, and output layers of neurons connected by weights. Each input layer node corresponds to a single independent variable, with the exception of the bias node, and the output layer nodes correspond to the labels or classes predicted by the model. Random Forest. RF is an ensemble of tree predictors in which each tree is grown on a random selection of a small set of input features to split at each node.41 Each tree acts as a predictor and casts a vote; the final output category of an instance is the mode of the classes output by all of the trees. Support Vector Machine. SVM is a kernel-based tool for data classification that aims to minimize the structural risk in the framework of Vapnik−Chervonenkis theory.42 SVM training seeks a hyperplane that discriminates samples from different categories; a kernel function maps the input vectors from a low-dimensional space into a high-dimensional one. Commonly used kernels include the linear, polynomial, Gaussian radial basis function, and sigmoid kernels. The penalty parameter C and the kernel coefficient γ are two of the most important variables to be optimized. Model Building. The general workflow of model building for reaction classification is shown in Figure 1.
As shown in the left side of Figure 1, reaction fingerprints were first generated for each reaction from the molecular fingerprints of its reactants and products. Then, baseline models were constructed for the data sets with the machine learning methods. To save computational resources, no parameters were optimized in the baseline models. Baseline models with good performance were then subjected to parameter optimization, and the better models were further assessed on external validation sets to ensure their predictive capability. Each model was repeated 10 times to calculate the mean and standard deviation of its metrics. Thirteen populated types of hydrolysis reactions and 21 populated types of redox reactions, all from the KEGG database, were used to build the main-class reaction classification models separately. The best combination of machine learning method and reaction fingerprint was then determined. For subclass reaction classification, three subclasses of hydrolysis reactions (EC 3.1.c.d, EC 3.2.c.d, and EC 3.5.c.d) and eight subclasses of redox reactions (with 1, 2, 3, 4, 5, 8, 13, or 14 as the second EC number) were employed in model building separately. Furthermore, superclass models were constructed to predict whether a reaction is a hydrolysis or a redox reaction. Each of the above data sets was randomly split into a training set and a test set in a ratio of 8:2. The reaction SMILES was stored with an assigned reaction label for each of the reactions.
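The 8:2 split and repeated evaluation described above might look like the following sketch. This is a generic illustration, not the authors' code; `accuracy_of` is a placeholder for training and scoring a real classifier, and the reaction data are hypothetical:

```python
import random
import statistics

def split_8_2(dataset, seed):
    # Shuffle and split the reactions 80% training / 20% test,
    # as described for each data set above.
    data = list(dataset)
    random.Random(seed).shuffle(data)
    cut = int(0.8 * len(data))
    return data[:cut], data[cut:]

# Hypothetical labeled reactions: (reaction SMILES, EC main-class label).
reactions = [("CCO>>CC=O", "1.1"), ("CC(=O)OC>>CC(=O)O.CO", "3.1")] * 25

def accuracy_of(train, test):
    # Placeholder for fitting a classifier on `train` and scoring on `test`.
    return 0.9

# Each model was repeated 10 times to report the mean and standard
# deviation of its metrics.
scores = []
for seed in range(10):
    train, test = split_8_2(reactions, seed)
    scores.append(accuracy_of(train, test))
print(statistics.mean(scores), statistics.pstdev(scores))
```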
Performance Assessment of Models. The 10-fold cross validation, leave-one-cluster-out cross validation,43 and external validation were used to evaluate the performance of all models. Among them, leave-one-cluster-out cross validation is a variation of n-fold cross validation. It ensures that the same group does not appear simultaneously in the test set and the training set, making it possible to detect some overfitting situations. To split a data set into groups, the k-means44 algorithm was used for unsupervised clustering of the data; ten clusters were specified in this work. Additionally, three weighted metrics (R, P, and F) were calculated for each model from the counts of true positives (TP), false positives (FP), and false negatives (FN).20 "Weighted" means that class imbalance is accounted for by averaging the binary metrics with each class's score weighted by its presence in the true labels. R (recall) is the fraction of all positives that are predicted correctly, P (precision) is the fraction of positive predictions that are correct, and F is a measure that combines R and P. These three metrics are calculated with the following equations:

R = TP / (TP + FN)  (1)

P = TP / (TP + FP)  (2)

F = 2 × TP / (2 × TP + FP + FN)  (3)
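A minimal pure-Python sketch of the weighted metrics follows, applying eqs 1−3 per class and then weighting each class by its share of the true labels (scikit-learn's `precision_recall_fscore_support` with `average='weighted'` computes the same quantities); the EC labels are toy data:

```python
def weighted_metrics(y_true, y_pred):
    # Per-class TP/FP/FN from paired true and predicted labels.
    classes = sorted(set(y_true))
    n = len(y_true)
    R = P = F = 0.0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        r = tp / (tp + fn) if tp + fn else 0.0                    # eq 1
        prec = tp / (tp + fp) if tp + fp else 0.0                 # eq 2
        f = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0  # eq 3
        w = y_true.count(c) / n  # weight: class share of true labels
        R += w * r
        P += w * prec
        F += w * f
    return R, P, F

# Toy example with EC main-class labels.
y_true = ["1.1", "1.1", "3.1", "3.1"]
y_pred = ["1.1", "3.1", "3.1", "3.1"]
print(weighted_metrics(y_true, y_pred))
```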
Outlier Analysis. In this study, we also applied the isolation forest (IF) algorithm45 to detect outliers for the analysis of misclassified cases. The algorithm isolates observations by randomly selecting a feature and then a split value between the maximum and minimum values of that feature; because outliers lie far from the bulk of the data, they tend to be isolated in fewer splits. With this method, some novelties could be recognized. To analyze the outliers visually, we reduced the dimensionality of the features of the best model with the AutoEncoder (AE) method46 and projected them onto a two-dimensional plot.
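The isolation idea can be sketched in a few lines of pure Python. This is a toy, seeded version for illustration only (not scikit-learn's `IsolationForest` or the implementation of ref 45): a point that is isolated by fewer random splits, on average, is more likely an outlier.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=10):
    # Isolate `point` by recursive random splits; outliers need few splits.
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feat = rng.randrange(len(point))
    lo = min(row[feat] for row in data)
    hi = max(row[feat] for row in data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the rows that fall on the same side of the split as `point`.
    side = [row for row in data if (row[feat] < split) == (point[feat] < split)]
    return path_length(point, side, rng, depth + 1, max_depth)

def anomaly_score(point, data, n_trees=100, seed=0):
    # Shorter average path length means more anomalous.
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees

# 1-D toy data: a tight cluster plus one far-away point.
data = [[x / 100.0] for x in range(50)] + [[10.0]]
print(anomaly_score([10.0], data), anomaly_score([0.25], data))
```

With the seeded runs above, the distant point is isolated in far fewer splits than a point inside the cluster.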
■
ASSOCIATED CONTENT
*S Supporting Information
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.7b00656. Tables S1−S4: selected most effective models and leave-one-cluster-out cross validation results for both hydrolysis and redox reactions. Table S5: parameter grid. Figures S1−S11: subclass data sets, baseline models, standard deviation distribution, similarity plots, confusion matrices, and outlier analysis for both hydrolysis and redox reactions. (PDF)
■
AUTHOR INFORMATION
Corresponding Author
*E-mail: [email protected].
ORCID
Yingchun Cai: 0000-0001-5058-5308
Hongbin Yang: 0000-0001-6740-1632
Weihua Li: 0000-0001-7055-9836
Yun Tang: 0000-0003-2340-1109
Notes
The authors declare no competing financial interest.
■
ACKNOWLEDGMENTS
This work was supported by the National Key Research and Development Program of China (Grant 2016YFA0502304), the National Natural Science Foundation of China (Grants 81373329 and 81673356), and the 111 Project (Grant B07023).
■
REFERENCES
(1) Kirchmair, J.; Williamson, M. J.; Tyzack, J. D.; Tan, L.; Bond, P. J.; Bender, A.; Glen, R. C. Computational Prediction of Metabolism: Sites, Products, SAR, P450 Enzyme Dynamics, and Mechanisms. J. Chem. Inf. Model. 2012, 52, 617−648.
(2) Kirchmair, J.; Göller, A. H.; Lang, D.; Kunze, J.; Testa, B.; Wilson, I. D.; Glen, R. C.; Schneider, G. Predicting Drug Metabolism: Experiment and/or Computation? Nat. Rev. Drug Discovery 2015, 14, 387−404.
(3) Sharma, M.; Garg, P. Computational Approaches for Enzyme Functional Class Prediction: A Review. Curr. Proteomics 2014, 11, 17−22.
(4) Thompson, R. Classification and Nomenclature of Enzymes. Science 1962, 137, 405−408.
(5) Babbitt, P. C. Definitions of Enzyme Function for the Structural Genomics Era. Curr. Opin. Chem. Biol. 2003, 7, 230−237.
(6) Todd, A. E.; Orengo, C. A.; Thornton, J. M. Evolution of Function in Protein Superfamilies, from a Structural Perspective. J. Mol. Biol. 2001, 307, 1113−1143.
(7) Green, M.; Karp, P. Genome Annotation Errors in Pathway Databases Due to Semantic Ambiguity in Partial EC Numbers. Nucleic Acids Res. 2005, 33, 4035−4039.
(8) Kotera, M.; Okuno, Y.; Hattori, M.; Goto, S.; Kanehisa, M. Computational Assignment of the EC Numbers for Genomic-Scale Analysis of Enzymatic Reactions. J. Am. Chem. Soc. 2004, 126, 16487−16498.
(9) Dönertaş, H. M.; Martínez Cuesta, S.; Rahman, S. A.; Thornton, J. M. Characterising Complex Enzyme Reaction Data. PLoS One 2016, 11, e0147952.
(10) O'Boyle, N. M.; Holliday, G. L.; Almonacid, D. E.; Mitchell, J. B. Using Reaction Mechanism to Measure Enzyme Similarity. J. Mol. Biol. 2007, 368, 1484−1499.
(11) Latino, D. A.; Aires-de-Sousa, J. Assignment of EC Numbers to Enzymatic Reactions with MOLMAP Reaction Descriptors and Random Forests. J. Chem. Inf. Model. 2009, 49, 1839−1846.
(12) Latino, D. A.; Zhang, Q.-Y.; Aires-de-Sousa, J. Genome-Scale Classification of Metabolic Reactions and Assignment of EC Numbers with Self-Organizing Maps. Bioinformatics 2008, 24, 2236−2244.
(13) Sacher, O.; Reitz, M.; Gasteiger, J. Investigations of Enzyme-Catalyzed Reactions Based on Physicochemical Descriptors Applied to Hydrolases. J. Chem. Inf. Model. 2009, 49, 1525−1534.
(14) Hu, X.; Yan, A.; Tan, T.; Sacher, O.; Gasteiger, J. Similarity Perception of Reactions Catalyzed by Oxidoreductases and Hydrolases Using Different Classification Methods. J. Chem. Inf. Model. 2010, 50, 1089−1100.
(15) Nath, N.; Mitchell, J. B. Is EC Class Predictable from Reaction Mechanism? BMC Bioinf. 2012, 13, 60.
(16) Matsuta, Y.; Ito, M.; Tohsato, Y. ECOH: An Enzyme Commission Number Predictor Using Mutual Information and a Support Vector Machine. Bioinformatics 2013, 29, 365−372.
(17) Ridder, L.; Wagener, M. SyGMa: Combining Expert Knowledge and Empirical Scoring in the Prediction of Metabolites. ChemMedChem 2008, 3, 821−832.
(18) Patel, H.; Bodkin, M. J.; Chen, B.; Gillet, V. J. Knowledge-Based Approach to De Novo Design Using Reaction Vectors. J. Chem. Inf. Model. 2009, 49, 1163−1184.
(19) Hu, Q.-N.; Zhu, H.; Li, X.; Zhang, M.; Deng, Z.; Yang, X.; Deng, Z. Assignment of EC Numbers to Enzymatic Reactions with Reaction Difference Fingerprints. PLoS One 2012, 7, e52901.
(20) Schneider, N.; Lowe, D. M.; Sayle, R. A.; Landrum, G. A. Development of a Novel Fingerprint for Chemical Reactions and Its Application to Large-Scale Reaction Classification and Similarity. J. Chem. Inf. Model. 2015, 55, 39−53.
(21) Kanehisa, M.; Goto, S.; Sato, Y.; Kawashima, M.; Furumichi, M.; Tanabe, M. Data, Information, Knowledge and Principle: Back to Metabolism in KEGG. Nucleic Acids Res. 2014, 42, D199−D205.
(22) Kanehisa, M.; Furumichi, M.; Tanabe, M.; Sato, Y.; Morishima, K. KEGG: New Perspectives on Genomes, Pathways, Diseases and Drugs. Nucleic Acids Res. 2017, 45, D353−D361.
(23) Alcántara, R.; Axelsen, K. B.; Morgat, A.; Belda, E.; Coudert, E.; Bridge, A.; Cao, H.; De Matos, P.; Ennis, M.; Turner, S.; et al. Rhea: A Manually Curated Resource of Biochemical Reactions. Nucleic Acids Res. 2012, 40, D754−D760.
(24) RDKit: Open-Source Cheminformatics. http://www.rdkit.org (accessed February 8, 2017).
(25) Carhart, R. E.; Smith, D. H.; Venkataraghavan, R. Atom Pairs as Molecular Features in Structure-Activity Studies: Definition and Applications. J. Chem. Inf. Model. 1985, 25, 64−73.
(26) Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742−754.
(27) Nilakantan, R.; Bauman, N.; Dixon, J. S.; Venkataraghavan, R. Topological Torsion: A New Molecular Descriptor for SAR Applications. Comparison with Other Descriptors. J. Chem. Inf. Model. 1987, 27, 82−85.
(28) Tewari, A.; Bartlett, P. L. On the Consistency of Multiclass Classification Methods. Springer: Berlin Heidelberg, 2007; pp 143−157.
(29) Chen, Y.; Cheng, F.; Sun, L.; Li, W.; Liu, G.; Tang, Y. Computational Models to Predict Endocrine-Disrupting Chemical Binding with Androgen or Oestrogen Receptors. Ecotoxicol. Environ. Saf. 2014, 110, 280−287.
(30) Zhang, C.; Zhou, Y.; Gu, S.; Wu, Z.; Wu, W.; Liu, C.; Wang, K.; Liu, G.; Li, W.; Lee, P. W.; Tang, Y. In Silico Prediction of hERG Potassium Channel Blockage by Chemical Category Approaches. Toxicol. Res. 2016, 5, 570−582.
(31) Li, X.; Chen, L.; Cheng, F.; Wu, Z.; Bian, H.; Xu, C.; Li, W.; Liu, G.; Shen, X.; Tang, Y. In Silico Prediction of Chemical Acute Oral Toxicity Using Multi-Classification Methods. J. Chem. Inf. Model. 2014, 54, 1061−1069.
(32) Cheng, F.; Yu, Y.; Shen, J.; Yang, L.; Li, W.; Liu, G.; Lee, P. W.; Tang, Y. Classification of Cytochrome P450 Inhibitors and Noninhibitors Using Combined Classifiers. J. Chem. Inf. Model. 2011, 51, 996−1011.
(33) Outlier. Wikipedia. https://en.wikipedia.org/wiki/Outlier (accessed February 8, 2017).
(34) Mafu, S.; Hillwig, M. L.; Peters, R. J. A Novel Labda-7,13E-dien-15-ol-Producing Bifunctional Diterpene Synthase from Selaginella moellendorffii. ChemBioChem 2011, 12, 1984−1987.
(35) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, E. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825−2830.
(36) Podgorelec, V.; Kokol, P.; Stiglic, B.; Rozman, I. Decision Trees: An Overview and Their Use in Medicine. J. Med. Syst. 2002, 26, 445−463.
(37) Cover, T.; Hart, P. Nearest Neighbor Pattern Classification. IEEE Trans. Inf. Theory 1967, 13, 21−27.
(38) Mood, C. Logistic Regression: Why We Cannot Do What We Think We Can Do, and What We Can Do About It. Eur. Sociol. Rev. 2010, 26, 67−82.
(39) Watson, P. Naive Bayes Classification Using 2D Pharmacophore Feature Triplet Vectors. J. Chem. Inf. Model. 2008, 48, 166−178.
(40) Myint, K.-Z.; Wang, L.; Tong, Q.; Xie, X.-Q. Molecular Fingerprint-Based Artificial Neural Networks QSAR for Ligand Biological Activity Predictions. Mol. Pharmaceutics 2012, 9, 2912−2923.
(41) Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J. C.; Sheridan, R. P.; Feuston, B. P. Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947−1958.
(42) Czarnecki, W. M.; Podlewska, S.; Bojarski, A. J. Robust Optimization of SVM Hyperparameters in the Classification of Bioactive Compounds. J. Cheminf. 2015, 7, 38.
(43) Kramer, C.; Gedeck, P. Leave-Cluster-Out Cross-Validation Is Appropriate for Scoring Functions Derived from Diverse Protein Data Sets. J. Chem. Inf. Model. 2010, 50, 1961−1969.
(44) Jain, A. K. Data Clustering: 50 Years beyond K-means. Pattern Recognit. Lett. 2010, 31, 651−666.
(45) Liu, F. T.; Ting, K. M.; Zhou, Z. H. Isolation Forest. In 2008 Eighth IEEE International Conference on Data Mining, 15−19 December 2008; pp 413−422.
(46) Wang, Y.; Yao, H.; Zhao, S. Auto-Encoder Based Dimensionality Reduction. Neurocomputing 2016, 184, 232−242.