J. Chem. Inf. Model., DOI: 10.1021/acs.jcim.7b00656 (Just Accepted Manuscript; published online May 7, 2018).


Multi-Classification Prediction of Enzymatic Reactions for Oxidoreductases and Hydrolases Using Reaction Fingerprints and Machine Learning Methods

Yingchun Cai, Hongbin Yang, Weihua Li, Guixia Liu, Philip W. Lee, Yun Tang*

Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China

*Corresponding author, E-mail: [email protected]

Key words: Metabolism prediction; Enzymatic reactions; Multi-classification models; Machine learning; Reaction fingerprint; Hydrolases; Oxidoreductases; Drug metabolism; EC numbers


Abstract
Drug metabolism is a complex process in the human body that comprises a series of enzymatically catalyzed reactions. However, it is costly and time-consuming to investigate drug metabolism experimentally; computational methods have therefore been developed to predict drug metabolism and have shown great advantages. As a first step, classification of metabolic reactions and enzymes is highly desirable for drug metabolism prediction. In this study, we developed multi-classification models for the prediction of reaction types catalyzed by oxidoreductases and hydrolases, in which three reaction fingerprints were used to describe the reactions and seven machine learning algorithms were employed for model building. Data retrieved from KEGG, containing 1055 hydrolysis and 2510 redox reactions, were used to build the models. The external validation data consisted of 213 hydrolysis and 512 redox reactions extracted from the Rhea database. The best models were built by neural network or logistic regression with a 2048-bit transformation reaction fingerprint. The predictive accuracies of the main class, subclass, and superclass classification models on the external validation sets were all above 90%. This study should be very helpful for enzymatic reaction annotation and for further studies on metabolism prediction.


Introduction
Metabolism of xenobiotics is one of the major concerns in drug discovery and development. Drug metabolism can produce metabolites whose physicochemical and pharmacological profiles differ from those of the parent drugs, and consequently affect drug safety and efficacy.1, 2 Drug metabolism in vivo involves enzyme-catalyzed reactions. Therefore, rapid and accurate identification of metabolic reactions and metabolites is highly desirable. However, it is costly and time-consuming to investigate drug metabolism experimentally. Computational methods have demonstrated great advantages in the prediction of drug pharmacokinetic properties and hence have been developed to discriminate metabolic reactions and further predict drug metabolism.3 Classification of enzymatic reactions is the first step of drug metabolism prediction. It is well known that protein catalytic functions are officially classified by the Enzyme Commission (EC) classification system, which is defined by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB) and widely adopted in the chemical and biological communities (http://www.chem.qmul.ac.uk/iubmb/enzyme/). Meanwhile, a large amount of data is available in public databases, which makes the classification of enzymatic reactions feasible. EC numbers not only represent enzymatic reactions (chemical information) but are also employed as identifiers of enzymatic genes (genomic information). This duality links the genomic repertoire of enzymatic genes to the chemical repertoire of metabolic pathways, a linkage known as metabolic


reconstruction. The EC system manually annotates overall reactions with a hierarchical set of four numbers, EC a.b.c.d.4 The first number defines the chemical type of reaction catalyzed by the enzyme. The second (subclass) and third (sub-subclass) numbers depend on the first and define the chemistry in more detail, describing the bond type(s) and the nature of the substrate(s). The last number is a serial number indicating substrate specificity. According to the first number, from 1 to 6, enzymes are classified into six primary types: oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases. Although widely accepted, the EC system still has some limitations.5 It is often deficient in situations where a reaction proceeds in the reverse direction, where an enzyme catalyzes more than one reaction, or where the same reaction is catalyzed by more than one enzyme.6-9 Therefore, different methods have been developed to automatically annotate and compare metabolic reactions, including those with incomplete EC numbers or without EC numbers.8 In 2004, Kotera et al. developed a method that matches substrates onto products to identify the reaction center and then assigns a reaction classification number within the EC system.8 In 2007, O'Boyle et al. developed a method that uses the reaction mechanism to measure enzyme similarity.10 In 2009, Latino et al. calculated physicochemical descriptors of the reactive bonds and generated a reaction MOLMAP to encode the reaction center.11 The MOLMAP information was also employed for automatic classification of enzyme reactions with self-organizing maps.12 Subsequently,


the Gasteiger group used physicochemical properties of the bonds or atoms at the reaction center to perform reaction classification.13, 14 In 2012, Nath and Mitchell investigated the relationships between the EC number, the associated reaction, and the reaction mechanism with three machine learning methods and five encoding descriptors.15 In 2013, Matsuta et al. presented another EC number predictor using mutual information and support vector machines.16 Although these reaction description approaches can achieve good results in the classification of enzyme reactions, they depend heavily on the performance of atom mapping and on identification of the reaction center to find the atoms or bonds changed in a reaction. A newer strategy for representing a chemical reaction is to encode the transformation as a reaction fingerprint,17-19 first proposed by Daylight (http://www.daylight.com/). A reaction fingerprint describes a chemical reaction by "difference Daylight fingerprints" – reactant fingerprints subtracted from product fingerprints – which do not require atom-atom mapping to identify the reaction center. Schneider et al. employed reaction fingerprints to classify large-scale chemical reactions with various machine learning methods and obtained excellent results.20 Nevertheless, their method still needs to identify agents and nonagents by atom-atom mapping. In this study, we explored several types of reaction fingerprints for enzymatic reactions that are truly independent of atom-atom mapping. With these reaction fingerprints, we built multi-classification models for hydrolysis (EC 3.b.c.d) and oxidation-reduction (redox) reactions (EC 1.b.c.d), respectively, via machine learning methods. We also built subclass and superclass classification models for the two types


of reactions. The performance of the different models was then analyzed, and the best classification models were further validated on external data sets to ensure their robustness. This study should be very helpful for correcting wrong or inconsistent enzyme classifications and for classifying large numbers of similar reactions.

Results
In this study, a total of 4290 reactions collected from the KEGG21, 22 and Rhea23 databases were used to build multi-classification models for the prediction of enzyme-catalyzed hydrolysis and redox reactions. Three types of reaction fingerprints were applied to describe the reactions, namely the reaction difference fingerprint (RDF)24, the structural reaction fingerprint (SRF)24, and the transformation reaction fingerprint (TRF)20, which were generated from four types of molecular fingerprints, i.e., atom pairs (AP)25, Morgan2,26 topological torsions (TT)27, and the pattern fingerprint (PF)24. Meanwhile, seven machine learning methods were employed to build the multi-classification models,28 including decision tree (DT), k-nearest neighbors (k-NN), logistic regression (LR), naive Bayes (NB), neural network (NN), random forest (RF), and support vector machine (SVM). In addition, three metrics, P (precision), R (recall), and F (F1-score), were used to evaluate the performance of each model. The whole workflow of model building for reaction classification is shown in Figure 1 (detailed information about model construction is given in the Materials and Methods section).


We also constructed subclass and superclass models for further classification of those reactions.

Figure 1. General workflow of model building. Note that PF does not support TRF, Morgan2 does not support SRF, while Morgan2 and PF do not support RDF.
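The reactions enter this workflow as reaction SMILES, conventionally written as "reactants>>products". As a minimal illustration (our own sketch, not the authors' code), the RDKit snippet below parses such a string; the ester-hydrolysis SMILES used here is an invented example rather than one of the curated KEGG/Rhea records.

```python
from rdkit.Chem import AllChem

# Invented ester-hydrolysis reaction written as "reactants>>products";
# an illustrative SMILES, not one of the curated KEGG/Rhea records.
rxn_smiles = "CCOC(C)=O.O>>CCO.CC(=O)O"
rxn = AllChem.ReactionFromSmarts(rxn_smiles, useSmiles=True)

print(rxn.GetNumReactantTemplates(), "reactant(s) and",
      rxn.GetNumProductTemplates(), "product(s) parsed")
```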

Data Set Analysis
Data from KEGG were used to build the classification models, whereas those from Rhea were employed for validation of the models. Originally, 1429 hydrolysis reactions were acquired from KEGG and 1000 from Rhea, while 3236 redox reactions came from KEGG and 2555 from Rhea. After the preprocessing procedure, 1055 hydrolysis and 2510 redox reactions remained from KEGG, and 213 hydrolysis and 512 redox reactions remained from Rhea. The details of these main-class reactions are shown in Figure 2. Meanwhile, the details of the selected subclass hydrolysis and redox reactions are shown in Figures S1 and S2, respectively.
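Throughout, the class labels are read directly from the EC annotations of the curated reactions. A small helper of the following kind (a hypothetical function of ours, shown only for illustration) derives the superclass, main-class, and subclass labels used below from an EC number string.

```python
def ec_labels(ec_number: str):
    """Split an EC number such as '3.5.1.4' into the labels used here:
    superclass (hydrolase vs oxidoreductase), main class, and subclass."""
    a, b, _c, _d = ec_number.split(".")
    superclass = {"1": "oxidoreductase", "3": "hydrolase"}.get(a, "other")
    return superclass, f"EC {a}.b.c.d", f"EC {a}.{b}.c.d"

print(ec_labels("3.5.1.4"))   # ('hydrolase', 'EC 3.b.c.d', 'EC 3.5.c.d')
```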



Figure 2. Data sets of enzyme-catalyzed hydrolysis and redox reactions from KEGG and Rhea databases. (A) and (C): training set from KEGG, (B) and (D): external validation set from Rhea. X axis: subtypes of reactions; Y axis: count for each reaction subtype.

From Figure 2A, it can be seen that two subclasses, EC 3.1.c.d and EC 3.5.c.d, contained most of the hydrolysis reactions. Similarly, EC 1.1.c.d and EC 1.14.c.d were the two major subclasses of redox reactions, containing 734 and 697 reactions, respectively (Figures 2C and 2D). For the two classes of reactions, there are also several


subclasses containing very few reactions; these were still included in model building in order to test the performance of the classification models.
Baseline Models on Main Class
The hydrolysis and redox reactions contain 13 and 21 subclasses, respectively. We first conducted a series of explorations to identify the most effective classification models without parameter optimization, called baseline models. Here, reaction fingerprints were generated from the different molecular fingerprints (AP, Morgan2, TT, and PF) at different lengths (from 8 to 8192 bits), giving 88 reaction fingerprints in total to describe the reactions. With the seven machine learning algorithms (DT, k-NN, LR, NB, NN, RF, and SVM), 616 baseline models were then constructed for the classification of hydrolysis and redox reactions, respectively. In this step, all parameters of the algorithms were set to their default values. The averaged measurement metrics of these baseline models are shown in Figure S3 for hydrolysis reactions and Figure S4 for redox reactions. As shown in Figure S3A, accuracy tended to rise as the fingerprint length increased from 8 to 1024 bits; the same trend was observed in Figure S4A for redox reactions. Figures S3B and S4B show the contributions of the different machine learning methods to the classification of reactions. NB clearly performed the worst, followed by SVM. From our previous work,29-32 however, we had learned that SVM is an excellent classification algorithm and can produce satisfactory results; the probable reason it performed poorly here is that its parameters were not optimized. Therefore, the SVM classifier was carried into the next step to evaluate its performance again.
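For reference, the baseline step can be reproduced with a loop of the following kind. This is a simplified sketch using scikit-learn defaults, in which X stands for the matrix of reaction fingerprints and y for the EC subclass labels (both names are ours); BernoulliNB is assumed as the NB variant for bit vectors, since the exact variant is not stated in the text.

```python
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_fscore_support
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# The seven classifiers, all with default parameters (no optimization).
classifiers = {
    "DT": DecisionTreeClassifier(), "k-NN": KNeighborsClassifier(),
    "LR": LogisticRegression(), "NB": BernoulliNB(),
    "NN": MLPClassifier(), "RF": RandomForestClassifier(), "SVM": SVC(),
}

def baseline_scores(X, y, cv=10):
    """Weighted P/R/F for each classifier on one reaction-fingerprint set."""
    scores = {}
    for name, clf in classifiers.items():
        pred = cross_val_predict(clf, X, y, cv=cv)
        p, r, f, _ = precision_recall_fscore_support(y, pred, average="weighted")
        scores[name] = (p, r, f)
    return scores
```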


Three types of reaction fingerprints were compared in this procedure, and SRF was markedly worse than RDF and TRF (Figures S3C and S4C). In Figures S3D and S4D, PF gave worse performance than the other three molecular fingerprints.
Parameter-Optimized Models on Main Class
Based on the above results, fingerprint lengths from 8 to 64 bits, the machine learning method NB, the reaction fingerprint SRF, and the molecular fingerprint PF were discarded. All the other combinations were carried into the next stage, and their parameters were optimized by a grid search algorithm. The results are shown in Figure 3 for hydrolysis reactions and Figure 4 for redox reactions.
Hydrolysis Reactions. As shown in Figure 3A, a longer reaction fingerprint could enhance model performance. The measurement metrics ranged from 0.91 to 0.94 for all three parameters as the length of the reaction fingerprint increased, and the 8192-bit models performed slightly better, with all three metrics over 0.94 (±std. = 0.005 ~ 0.008). In Figure 3B, we observed an obviously enhanced performance for the SVM method (F = 0.92 ~ 0.948). The k-NN and NN methods were comparable to SVM in classification, with all three measurement metrics higher than 0.93 (±std. = 0.004 ~ 0.019). The LR and RF methods performed slightly worse, with P, R, and F all higher than 0.92 (±std. = 0.005 ~ 0.026). The DT classifier performed poorly compared to the others. In Figure 3C, we compared two types of reaction fingerprints, RDF and TRF; TRF clearly performed better than RDF. For the molecular fingerprints, in Figure 3D we found that Morgan2 performed almost the


same as AP, recalling 93% of the samples with a low misclassification ratio (P = 0.92 ~ 0.936).

Figure 3. Parameter-optimized models for hydrolysis reactions. X axis: (A) fingerprint length; (B) machine learning methods; (C) reaction fingerprint types; (D) molecular fingerprint to calculate reaction fingerprint. Y axis: P (Precision), R (Recall) and F (F1-score). The black bar indicated the standard deviation.

Based on the above results, we selected the better models built by LR,


k-NN, NN, RF, and SVM, using TRF calculated from AP and Morgan2. These models were then validated on the external data set; all the results are shown in Figure 5. From Figure 5A, we found that the fingerprint length could be increased from 128 to 4096 bits without decreasing the performance of the classification models. On further increasing the length to 8192 bits, the accuracy decreased slightly (P = 0.625 ~ 0.911, R = 0.87 ~ 0.90, and F = 0.85 ~ 0.882). Comparing the results at 2048 and 4096 bits, it is easy to see that no gain was obtained from increasing the fingerprint length. From Figure 5B, it is obvious that k-NN produced the most robust classification models, followed by LR and NN, whereas RF and SVM exhibited slightly worse performance for distinguishing hydrolysis reactions. For the molecular fingerprints, we obtained the opposite of the internal test result (Figure 5C): AP performed better than Morgan2 (with 0.96 for P, R, and F, respectively). On the basis of these results, the combinations of these methods (k-NN, LR, and NN) and these bit sizes (2048 and 4096) generated the most effective classification models (Table S1).
Redox Reactions. From Figure 4, we found that the models performed almost the same when the length of the reaction fingerprint increased from 1024 to 8192 bits (Figure 4A), quite similar to the hydrolysis reactions. At 8192 bits, all three metrics were higher than 0.86 (±std. = 0.009 ~ 0.023), but lower than those for the hydrolysis reactions. As to the machine learning methods (Figure 4B), NN performed the best (F = 0.844 ~ 0.879), followed by RF and SVM; the parameter optimization procedure boosted the classification capacity of SVM. DT performed the worst (F = 0.824 ~ 0.861). k-NN gave a high P value of 0.88 but a low R value of 0.83, and was also a


worse classifier here. In Figure 4C, the classification models built with TRF exceeded those built with RDF. As to the molecular fingerprints, Morgan2 was still superior to AP (Figure 4D).

Figure 4. Parameter-optimized models for redox reactions. X axis: (A) fingerprint length; (B) machine learning methods; (C) reaction fingerprint types; (D) molecular fingerprint to calculate reaction fingerprint. Y axis: P, R and F. The black bar indicated the standard deviation.


According to the above results, the NN, LR, RF, and SVM methods were selected to build the better models, together with AP and Morgan2 to create TRF from 512 to 8192 bits. These new models were further validated on the external data set, and all the results are shown in Figure 5.
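A minimal sketch of this external-validation step follows; it is our own illustration, the function and argument names are hypothetical, and the NN hyperparameters shown are not the optimized values from Tables S1 and S2.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import precision_recall_fscore_support

def external_validation(X_train, y_train, X_external, y_external):
    """Fit a selected model on the KEGG training fingerprints and report
    weighted P/R/F on the Rhea external set (hyperparameters illustrative)."""
    model = MLPClassifier(max_iter=500, random_state=0)
    model.fit(X_train, y_train)
    pred = model.predict(X_external)
    return precision_recall_fscore_support(y_external, pred, average="weighted")[:3]
```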

Figure 5. External validation of classification models for hydrolysis reactions (A, B, and C) and redox reactions (D, E, and F). X axis: (A, D) fingerprint length; (B, E) machine learning methods; (C, F) molecular fingerprint to calculate reaction fingerprint. Y axis: P, R and F. The black bar indicated the standard deviation.

From Figure 5D, it is easy to see that the different fingerprint lengths did not affect the


performance of the models significantly. Comparing the four machine learning methods (Figure 5E), SVM performed slightly better than LR and NN (P = 0.827 ~ 0.846, R = 0.779 ~ 0.801, F = 0.785 ~ 0.805), while RF exhibited slightly worse classification capacity (P = 0.811 ~ 0.834, R = 0.718 ~ 0.746, and F = 0.737 ~ 0.768). As for the molecular fingerprints (Figure 5F), the same result as for the hydrolysis reactions was obtained: AP was slightly better than Morgan2 (P = 0.788 ~ 0.816, R = 0.733 ~ 0.783, and F = 0.752 ~ 0.788). Based on these results, the combinations of the machine learning methods (LR, NN, and SVM) and the bit sizes 2048 and 4096 produced the most effective classification models, shown in Table S2.
In Table 1, we list some examples that demonstrate the classification capability of the best models, which could correctly separate a minority of reactions from the others. For the hydrolysis reactions, there are three small categories containing eight reactions in total: EC 3.5.c.d acts on carbon-nitrogen bonds other than peptide bonds, EC 3.9.c.d acts on phosphorus-nitrogen bonds, and EC 3.13.c.d acts on carbon-sulfur bonds. Only four, three, and one reactions belonged to these three subclasses, respectively, yet our models identified all of them correctly. For the redox reactions, we examined four reactions as instances and obtained similar results. These examples illustrate that the classification performance of our models is excellent.
Table 1. Some examples of correctly classified reactions shown in the confusion matrices (Figures S5 and S6) for the main class of hydrolases and oxidoreductases.


[Table 1 lists seven example reactions (Entries 1-7) with their EC subclasses — EC 3.5.c.d, EC 3.9.c.d, and EC 3.13.c.d for hydrolases and EC 1.10.c.d, EC 1.17.c.d, EC 1.20.c.d, and EC 1.21.c.d for oxidoreductases — together with their reaction diagrams, which could not be recovered from the extracted text.]

Subclass Prediction Models


Three subclasses of the hydrolysis reaction data (b = 1, 2, and 5 in EC 3.b.c.d) and eight subclasses of the redox reaction data (b = 1, 2, 3, 4, 5, 8, 13, and 14 in EC 1.b.c.d) were extracted for further study, to build subclass classification models using the combinations validated above (TRF; 2048 and 4096 bits; k-NN, LR, and NN for EC 3.b.c.d; LR, NN, and SVM for EC 1.b.c.d; AP). We again used 80% of the data as the training set and the remaining 20% as the test set, with an external set for validation of the models. The new models are summarized in Table 2 for hydrolysis reactions and Table 3 for redox reactions, and the standard deviation frequency distributions of the three metrics of all models are shown in Figure S5.
Table 2. Classification models for three subclasses of hydrolases (EC 3.b.c.d).

EC Number    Bits  Method  Internal Test Set P / R / F (Support a)   External Validation Set P / R / F (Support a)
EC 3.1.c.d   2048  k-NN    0.96 / 0.98 / 0.97 (81)                   0.96 / 0.96 / 0.96 (49)
EC 3.1.c.d   4096  k-NN    0.96 / 0.96 / 0.96 (81)                   0.96 / 0.96 / 0.96 (49)
EC 3.1.c.d   2048  LR      0.98 / 0.99 / 0.98 (81)                   0.98 / 0.98 / 0.98 (49)
EC 3.1.c.d   4096  LR      0.98 / 0.99 / 0.98 (81)                   0.98 / 0.98 / 0.98 (49)
EC 3.1.c.d   2048  NN      0.98 / 0.99 / 0.98 (81)                   0.98 / 0.98 / 0.98 (49)
EC 3.1.c.d   4096  NN      0.98 / 0.99 / 0.98 (81)                   0.98 / 0.98 / 0.98 (49)
EC 3.2.c.d   2048  k-NN    1.00 / 1.00 / 1.00 (34)                   1.00 / 1.00 / 1.00 (34)
EC 3.2.c.d   4096  k-NN    1.00 / 1.00 / 1.00 (34)                   1.00 / 1.00 / 1.00 (34)
EC 3.2.c.d   2048  LR      1.00 / 1.00 / 1.00 (34)                   1.00 / 1.00 / 1.00 (34)
EC 3.2.c.d   4096  LR      1.00 / 1.00 / 1.00 (34)                   1.00 / 1.00 / 1.00 (34)
EC 3.2.c.d   2048  NN      1.00 / 1.00 / 1.00 (34)                   1.00 / 1.00 / 1.00 (34)
EC 3.2.c.d   4096  NN      1.00 / 1.00 / 1.00 (34)                   1.00 / 1.00 / 1.00 (34)
EC 3.5.c.d   2048  k-NN    0.81 / 0.80 / 0.78 (55)                   0.80 / 0.69 / 0.63 (35)
EC 3.5.c.d   4096  k-NN    0.81 / 0.80 / 0.78 (55)                   0.80 / 0.69 / 0.63 (35)
EC 3.5.c.d   2048  LR      0.88 / 0.86 / 0.86 (55)                   0.95 / 0.94 / 0.94 (35)
EC 3.5.c.d   4096  LR      0.88 / 0.86 / 0.86 (55)                   0.95 / 0.94 / 0.94 (35)
EC 3.5.c.d   2048  NN      0.91 / 0.89 / 0.90 (55)                   0.94 / 0.92 / 0.93 (35)
EC 3.5.c.d   4096  NN      0.93 / 0.92 / 0.92 (55)                   0.94 / 0.93 / 0.93 (35)
a Support means the number of samples used for validation.


Hydrolysis Reactions. EC 3.1.c.d contains 450 reactions divided into eight sub-subclasses. As shown in Table 2, excellent performance was obtained for the six corresponding models on the 49 external reactions. As to EC 3.2.c.d, all models achieved complete accuracy: all 34 reactions in the test set and all 34 in the external validation set catalyzed by EC 3.2.c.d were correctly distinguished. The classification results for EC 3.5.c.d showed a relatively lower performance for each model; the LR models performed better than the others (P, R, and F higher than 0.90) on the external validation set of 35 samples. From Table 2, we also found that the 2048-bit fingerprint was sufficient to generate excellent models in comparison with the 4096-bit one.
Redox Reactions. In Table 3, all the models showed results very similar to those for the hydrolysis reactions. The 2048-bit fingerprint performed equally to the 4096-bit one, except for EC 1.8.c.d, where the prediction model built by NN with the 2048-bit fingerprint yielded only P = 0.76, R = 0.79, and F = 0.75 on the internal data set. Comparing the performance of all the models, we found that the models for EC 1.4.c.d, EC 1.8.c.d, and EC 1.14.c.d performed worse than the others.
Table 3. Classification models for eight subclasses of oxidoreductases (EC 1.b.c.d).

EC Number     Bits  Method  Internal Test Set P / R / F (Support a)   External Validation Set P / R / F (Support a)
EC 1.1.c.d    2048  LR      0.97 / 0.98 / 0.98 (147)                  0.98 / 0.99 / 0.99 (99)
EC 1.1.c.d    4096  LR      0.97 / 0.98 / 0.98 (147)                  0.98 / 0.99 / 0.99 (99)
EC 1.1.c.d    2048  NN      0.97 / 0.98 / 0.98 (147)                  0.98 / 0.99 / 0.99 (99)
EC 1.1.c.d    4096  NN      0.97 / 0.98 / 0.98 (147)                  0.98 / 0.99 / 0.99 (99)
EC 1.1.c.d    2048  SVM     0.98 / 0.98 / 0.98 (147)                  0.99 / 0.99 / 0.99 (99)
EC 1.1.c.d    4096  SVM     0.98 / 0.98 / 0.98 (147)                  0.99 / 0.99 / 0.99 (99)
EC 1.2.c.d    2048  LR      0.96 / 0.95 / 0.95 (43)                   0.92 / 0.91 / 0.91 (21)
EC 1.2.c.d    4096  LR      0.98 / 0.98 / 0.98 (43)                   0.92 / 0.95 / 0.93 (21)
EC 1.2.c.d    2048  NN      0.97 / 0.97 / 0.97 (43)                   0.92 / 0.95 / 0.93 (21)
EC 1.2.c.d    4096  NN      0.97 / 0.97 / 0.97 (43)                   0.92 / 0.95 / 0.93 (21)
EC 1.2.c.d    2048  SVM     0.93 / 0.93 / 0.93 (43)                   0.87 / 0.86 / 0.85 (21)
EC 1.2.c.d    4096  SVM     0.93 / 0.93 / 0.93 (43)                   0.87 / 0.86 / 0.85 (21)
EC 1.3.c.d    2048  LR      0.96 / 0.9 / 0.95 (50)                    0.99 / 0.97 / 0.98 (39)
EC 1.3.c.d    4096  LR      0.96 / 0.94 / 0.95 (50)                   0.99 / 0.97 / 0.98 (39)
EC 1.3.c.d    2048  NN      0.95 / 0.96 / 0.95 (50)                   0.94 / 0.94 / 0.94 (39)
EC 1.3.c.d    4096  NN      0.95 / 0.97 / 0.96 (50)                   0.97 / 0.96 / 0.96 (39)
EC 1.3.c.d    2048  SVM     0.90 / 0.90 / 0.90 (50)                   0.93 / 0.87 / 0.88 (39)
EC 1.3.c.d    4096  SVM     0.90 / 0.90 / 0.90 (50)                   0.93 / 0.87 / 0.88 (39)
EC 1.4.c.d    2048  LR      0.87 / 0.91 / 0.89 (22)                   0.67 / 0.82 / 0.74 (11)
EC 1.4.c.d    4096  LR      0.87 / 0.91 / 0.89 (22)                   0.67 / 0.82 / 0.74 (11)
EC 1.4.c.d    2048  NN      0.86 / 0.91 / 0.88 (22)                   0.67 / 0.82 / 0.74 (11)
EC 1.4.c.d    4096  NN      0.86 / 0.91 / 0.88 (22)                   0.67 / 0.82 / 0.74 (11)
EC 1.4.c.d    2048  SVM     0.76 / 0.86 / 0.80 (22)                   0.67 / 0.82 / 0.74 (11)
EC 1.4.c.d    4096  SVM     0.76 / 0.86 / 0.80 (22)                   0.67 / 0.82 / 0.74 (11)
EC 1.5.c.d    2048  LR      0.90 / 0.94 / 0.92 (18)                   0.91 / 0.95 / 0.93 (21)
EC 1.5.c.d    4096  LR      0.90 / 0.94 / 0.92 (18)                   0.91 / 0.95 / 0.93 (21)
EC 1.5.c.d    2048  NN      0.90 / 0.94 / 0.92 (18)                   0.91 / 0.95 / 0.93 (21)
EC 1.5.c.d    4096  NN      0.90 / 0.94 / 0.92 (18)                   0.91 / 0.95 / 0.93 (21)
EC 1.5.c.d    2048  SVM     0.90 / 0.94 / 0.92 (18)                   0.91 / 0.95 / 0.93 (21)
EC 1.5.c.d    4096  SVM     0.90 / 0.94 / 0.92 (18)                   0.91 / 0.95 / 0.93 (21)
EC 1.8.c.d    2048  LR      0.80 / 0.86 / 0.83 (14)                   0.83 / 0.83 / 0.83 (12)
EC 1.8.c.d    4096  LR      0.80 / 0.86 / 0.83 (14)                   0.83 / 0.83 / 0.83 (12)
EC 1.8.c.d    2048  NN      0.76 / 0.79 / 0.75 (14)                   0.83 / 0.83 / 0.83 (12)
EC 1.8.c.d    4096  NN      0.78 / 0.80 / 0.77 (14)                   0.83 / 0.83 / 0.83 (12)
EC 1.8.c.d    2048  SVM     0.75 / 0.79 / 0.74 (14)                   0.83 / 0.75 / 0.78 (12)
EC 1.8.c.d    4096  SVM     0.75 / 0.79 / 0.74 (14)                   0.83 / 0.75 / 0.78 (12)
EC 1.13.c.d   2048  LR      0.90 / 0.93 / 0.91 (28)                   0.91 / 0.90 / 0.90 (10)
EC 1.13.c.d   4096  LR      0.90 / 0.93 / 0.91 (28)                   0.91 / 0.90 / 0.90 (10)
EC 1.13.c.d   2048  NN      0.90 / 0.93 / 0.91 (28)                   0.91 / 0.90 / 0.90 (10)
EC 1.13.c.d   4096  NN      0.90 / 0.93 / 0.91 (28)                   0.91 / 0.90 / 0.90 (10)
EC 1.13.c.d   2048  SVM     0.93 / 0.96 / 0.95 (28)                   1.00 / 1.00 / 1.00 (10)
EC 1.13.c.d   4096  SVM     0.93 / 0.96 / 0.95 (28)                   1.00 / 1.00 / 1.00 (10)
EC 1.14.c.d   2048  LR      0.92 / 0.91 / 0.91 (140)                  0.82 / 0.65 / 0.63 (250)
EC 1.14.c.d   4096  LR      0.92 / 0.91 / 0.91 (140)                  0.82 / 0.66 / 0.63 (250)
EC 1.14.c.d   2048  NN      0.89 / 0.89 / 0.88 (140)                  0.78 / 0.66 / 0.61 (250)
EC 1.14.c.d   4096  NN      0.89 / 0.89 / 0.88 (140)                  0.79 / 0.66 / 0.61 (250)
EC 1.14.c.d   2048  SVM     0.93 / 0.93 / 0.93 (140)                  0.79 / 0.67 / 0.61 (250)
EC 1.14.c.d   4096  SVM     0.93 / 0.93 / 0.93 (140)                  0.79 / 0.67 / 0.61 (250)
a Support means the number of samples used for validation.


Superclass Prediction Models
To validate the performance of the classification models, we selected the AP-based TRF at 2048 bits to generate the reaction fingerprints and built four models to classify reactions as EC 1.b.c.d or EC 3.b.c.d with four methods (k-NN, LR, NN, and SVM). The prediction results are shown in Table 4.
Table 4. Superclass classification models for hydrolases and oxidoreductases.

Methods  Internal Test Set P / R / F (Std. a), Support b                 External Validation Set P / R / F (Std. a), Support b
k-NN     0.97 (±0.00E+00) / 0.97 (±0.00E+00) / 0.97 (±2.22E-16), 713     0.96 (±1.11E-16) / 0.96 (±1.11E-16) / 0.96 (±1.11E-16), 725
LR       0.98 (±1.11E-16) / 0.98 (±1.11E-16) / 0.98 (±1.11E-16), 713     0.99 (±2.22E-16) / 0.99 (±1.11E-16) / 0.99 (±2.22E-16), 725
NN       0.98 (±1.67E-03) / 0.98 (±1.63E-03) / 0.98 (±1.65E-03), 713     0.99 (±1.40E-03) / 0.99 (±1.41E-03) / 0.99 (±1.40E-03), 725
SVM      0.99 (±0.00E+00) / 0.99 (±1.11E-16) / 0.99 (±0.00E+00), 713     0.99 (±1.11E-16) / 0.99 (±0.00E+00) / 0.99 (±0.00E+00), 725
a Std. means standard deviation. b Support means the number of samples used for validation.

Compared with the other three methods, SVM gave excellent predictions on both data sets, with all metrics up to 0.99 on the external validation set. Both LR and NN performed slightly worse on the internal test set (F = 0.98) but still well on the external data set, while k-NN exhibited moderate classification capability on both the internal and external data sets. Based on the above investigation of the classification models for the main class, subclass, and superclass of both hydrolysis and redox reactions, we obtained the best models, built by the NN or LR method with the 2048-bit TRF from AP. To further confirm the effectiveness and robustness of these two models, we conducted leave-one-cluster-out cross-validation on the training set to assess their generalization ability.


To ensure that the models were evaluated on truly new data, we also conducted a similarity analysis between the training set, internal test set, and external validation set, which confirmed that the three data sets shared low similarity. As shown in Table 5, the leave-one-cluster-out results were above 0.8 for the hydrolysis reactions, whereas they were only around 0.7 for the redox reactions; for the superclass prediction model, however, much higher results were obtained (up to 0.95 for all three metrics). Comparing LR and NN, we could see that they showed almost equal classification ability. Analysis of the corresponding results for the 11 subclass models in Tables S3 and S4 led to a similar conclusion. These findings demonstrate the broader application domain of our selected models for predicting future data, and the low similarity shown in Figure S6 also supports the usefulness of these models.
Table 5. Leave-one-cluster-out cross-validation on the training set for the main class and superclass of both hydrolysis and redox reactions.

Data Set                  Method  P (Std. a)          R (Std. a)          F (Std. a)
EC 3.b.c.d                LR      0.82 (±1.11E-16)    0.86 (±0.00E+00)    0.84 (±0.00E+00)
EC 3.b.c.d                NN      0.80 (±4.15E-03)    0.84 (±3.40E-03)    0.82 (±2.58E-03)
EC 1.b.c.d                LR      0.70 (±0.00E+00)    0.71 (±1.11E-16)    0.70 (±0.00E+00)
EC 1.b.c.d                NN      0.71 (±8.64E-03)    0.73 (±7.01E-03)    0.71 (±7.18E-03)
EC 3.b.c.d + EC 1.b.c.d   LR      0.95 (±2.22E-16)    0.95 (±1.11E-16)    0.95 (±2.22E-16)
EC 3.b.c.d + EC 1.b.c.d   NN      0.95 (±2.29E-03)    0.95 (±2.24E-03)    0.95 (±2.27E-03)
a Std. means standard deviation.

Discussion
In this study, we built multi-classification models for prediction of enzymatic


reaction types using machine learning methods together with reaction fingerprints. Please note that our study was not intended to predict the enzymatic reaction type for a given substrate; rather, it attempted to relate a whole reaction to a certain enzyme class, to correct wrong EC numbers for some reactions, and to assign experimentally investigated reactions to given enzymes.
Factors Affecting the Accuracy of the Models
On the basis of the above studies, several excellent prediction models were obtained for hydrolysis and redox reactions, respectively. From this work, we have learned that several factors can affect the accuracy of the classification models. First, how chemical reactions are described matters greatly to the outcome of model building. In this study, we used three types of reaction fingerprints to describe the chemical reactions, namely SRF, RDF, and TRF. Among them, SRF was filtered out first because it merely puts the molecular fingerprints of the reactants and products together. Most reactions involve only a small part of the reactants, and simple addition of the two molecular fingerprints introduces duplicated bits of chemical features, which may account for the low accuracy of the SRF method (Figures S3C and S4C). We then compared RDF with TRF and found that TRF performed better than RDF (Figures 3C and 4C). RDF was calculated by the built-in method in the RDKit software package, while TRF was calculated by our in-house method, partly based on the supporting code of Schneider's work.20 A good reaction fingerprint depends on its length and on the molecular fingerprint used to generate it. The larger number of options available for our TRF (three types of molecular fingerprints and eleven types of


fingerprint length) provided a larger parameter-search space in which to build the best models. As to the length of the fingerprint, we found that 2048 bits are enough to describe a reaction, in comparison with all eleven lengths tested (Figures 3A, 4A, S3A, and S4A). We believe that 2048 bits contain the critical information needed to describe these chemical reactions: increasing the fingerprint length to 4096 bits or longer introduces redundant information that did not improve the performance of the models, and, vice versa, decreasing the fingerprint length to 1024 bits or shorter does not represent a reaction sufficiently. Regarding the molecular fingerprint, AP performed better than the others. The reason might be that AP contains atom-atom mapping information, which makes generation of the reaction fingerprint much easier; Morgan2 performed slightly worse than AP when the models were tested on the external validation set. Secondly, the machine learning methods are key to the quality of the models. NN and LR were outstanding in the classification of both hydrolysis and redox reactions at all three levels. From Figures 3B and 4B, we also see that k-NN performed well for hydrolysis reactions and that SVM was good enough for redox reactions. NB performed the worst here. The reason might be that NB is a simple algorithm based on Bayes' theorem with the naive assumption of independence between every pair of features, whereas no correlation analysis was conducted in the calculation of the reaction fingerprints. DT and RF are two tree-structured classifiers, RF being the advanced version of DT. In our baseline models, DT and RF performed equally to LR and NN (Figures S3B and S4B). After parameter optimization, DT exhibited lower prediction


results, while RF remained comparable to NN on the internal test set (Figures 3B and 4B); on the external validation set, the RF model was just a little worse than NN in classification capacity (Figures 5B and 5E). Meanwhile, one obvious phenomenon observed in this study is that the classification models for hydrolysis reactions performed better than those for redox reactions. By the definition of the IUBMB, hydrolases catalyze thirteen classes and hundreds of subclasses of reactions. Hydrolysis is an easily understood reaction that proceeds by addition of a proton and a hydroxyl group from water to the reactants to form the products; for example, a carboxylic ester substrate can be hydrolyzed into an alcohol and a carboxylate. This simple catalytic mechanism makes hydrolysis reactions easy to classify and predict. Redox reactions, however, are more complicated than hydrolysis. A wide variety of substrates can be mediated by oxidoreductases, including substances containing -CH- or -CH2-, -CH-OH, aldehyde or oxo, -CH-CH-, -CH-NH2, and -CH-NH- groups, among others. Oxidoreductases can also act on NADH or NADPH, a cofactor used in anabolic reactions such as lipid and nucleic acid synthesis. Moreover, a variety of reaction types can be found among the thousands of redox reactions, such as elimination, addition, ring-opening, ring-closing, and rearrangement, which add to the complexity of the reaction mechanisms. The complex and multifarious nature of redox reactions makes them difficult to identify and classify.
Analysis of Misclassified Reactions
In spite of their good prediction performance, our models still have some


limitations and gave incorrect results for some reactions. The confusion matrices of the three models are shown in Figures S7 and S8. Most of the counts lie on the diagonal, indicating that the models have high performance; however, some confusion points lie outside the diagonal area. We further performed outlier detection on the external validation set to explore whether these misclassified reactions were outliers or not (Figures S9-S11). An outlier33 is defined as an observation point that is distant from the major distribution area of all observations, whereas a misclassification is an observation that lies within the major distribution area but is predicted wrongly. The methods used to identify outliers differ from those used to identify misclassifications, and one reaction may be an outlier, a misclassification, or both. From Figures S9-S11, we can see that our models could identify most of the outliers and that some misclassifications were indeed outliers. We then carefully analyzed these misclassified cases to further understand the advantages and disadvantages of our models (Table 6).
Table 6. Misclassified reactions shown in the confusion matrices (Figures S7 and S8) for the main class of hydrolases and oxidoreductases as well as the superclass models in Table 4.

Entry  EC Number     Predicted EC Number
1      EC 3.4.c.d    EC 3.1.c.d
2      EC 3.5.c.d    EC 3.8.c.d
3      EC 3.1.c.d    EC 3.7.c.d
4      EC 1.1.c.d    EC 1.2.c.d
5      EC 1.5.c.d    EC 1.4.c.d
6      EC 1.21.c.d   EC 1.14.c.d
7      EC 1.b.c.d    EC 3.b.c.d
8      EC 3.b.c.d    EC 1.b.c.d
(The reaction diagrams shown in the original table could not be recovered from the extracted text.)


Ten hydrolysis reactions were misclassified by the three machine learning methods (Figure S7). One of the reactions (Table 6, Entry 1) was labelled as EC 3.4.c.d but incorrectly predicted as EC 3.1.c.d by all three classifiers. EC 3.4.c.d acts on peptide bonds (peptidases), whereas EC 3.1.c.d acts on ester bonds; the reaction contains a thioester substructure, which could be recognized as a substrate feature of EC 3.1.c.d and hence lead to the wrong prediction. For the second entry of Table 6, the EC 3.5.c.d reaction was wrongly classified as EC 3.8.c.d; a shared product feature might be the reason why the prediction was not correct. The third misclassified reaction (Table 6, Entry 3), catalyzed by EC 3.1.7.10 (geranylgeranyl-diphosphate diphosphohydrolase), was predicted as a substrate of EC 3.7.c.d (acting on carbon-carbon bonds) by the LR and NN classifiers. We supposed that it might be a two-step reaction catalyzed by EC 3.7.c.d


followed by EC 3.1.c.d: first, the substrate forms a ring structure from its long chain with EC 3.7.c.d as the catalyst, and then the ring structure is hydrolyzed by EC 3.1.c.d to eliminate diphosphate and give the alcohol.34 For the redox reactions, the reaction in Entry 4 of Table 6 was predicted as EC 1.2.c.d (acting on the aldehyde or oxo group of donors), although it is actually metabolized by EC 1.1.c.d (acting on the CH-OH group of donors); the reason might be that the aldehyde group in the reaction could be recognized by EC 1.2.c.d. Another reaction (Table 6, Entry 5) from EC 1.5.c.d (acting on the CH-NH group of donors) was predicted to be catalyzed by EC 1.4.c.d (acting on the CH-NH2 group of donors) because of the similar reactive group. EC 1.21.c.d catalyzes reactions with NAD+ or NADP+ as the acceptor, which is the reverse of EC 1.14.c.d (with NADH or NADPH as one donor and incorporation of one atom of oxygen), so the reaction shown in Table 6, Entry 6 could easily be regarded as EC 1.14.c.d by the classifiers. As to the superclass prediction models, confusion was observed for several samples with certain classifiers. One reaction (Entry 7 in Table 6), labeled as EC 1.2.1.95, was not predicted correctly by any of the k-NN, LR, NN, and SVM methods; the thioester and ester functional groups included in the first two reactions could be identified as substrates of EC 3.b.c.d. Two reactions (Entry 8 in Table 6), labeled as EC 3.5.99.10 and EC 3.5.99.7, were also not correctly identified by any of the three methods. These misclassifications might be attributed to limitations of the machine learning methods and to insufficiency of the reaction description.


Conclusions
In this study, multi-classification models were constructed to predict enzymatic reactions catalyzed by hydrolases (EC 3.b.c.d) and oxidoreductases (EC 1.b.c.d) using machine learning methods and reaction fingerprints. Weighted metrics were used to evaluate the models under class imbalance. Although the machine learning methods themselves are commonly used, this is a comprehensive study of predicting the EC number of each enzyme-catalyzed reaction from reaction fingerprints, and excellent prediction models were obtained with our simple workflow. After comparing different models, we demonstrated that the AP-based TRF at 2048 bits can describe a reaction sufficiently and allows correct classification of reactions by machine learning methods. Furthermore, we found that NN and LR performed very well in assigning EC numbers to hydrolysis and redox reactions. The selected prediction models were tested on external validation sets to demonstrate their robustness in multi-classification. In spite of the good performance, our models still gave wrong predictions for some instances. We therefore analyzed the misclassified reactions with an outlier detection method and, by considering their structural characteristics, identified some limitations of the machine learning methods. The advantage of our models is that the input SMILES requires neither atom-atom mapping nor identification of the reaction center, both of which are difficult to obtain. This is a significant step toward the classification of enzymatic reactions with reaction fingerprints.


However, our method still has two disadvantages. The first is that the input SMILES used to calculate the reaction fingerprints must obey the law of conservation of mass (i.e., every atom on the left side of a reaction must appear on the right side after the chemical change); to satisfy this requirement, considerable labor and time had to be spent cleaning the reactions from the KEGG and Rhea databases. The other is that our models cannot assign all four EC numbers at once. Next, we will develop new descriptive methods to generate reaction fingerprints and new model-building methods to assign the EC numbers consecutively.

Materials and Methods
Data Collection and Preparation
All the enzymatic reactions were retrieved from the KEGG21, 22 and Rhea23 databases together with their EC numbers as category labels, which resulted in four data sets with molecular structures in SMILES format. All the data sets were processed as follows. If a polymeric compound was involved in a reaction and could be identified in its monomeric form, the polymer was replaced by the monomer; otherwise the reaction was abandoned. Reactions including compounds with no SMILES description were removed. For the EC 3.b.c.d reactions, in some cases the direction of a reaction was reversed in order to obtain a unique hydrolysis reaction rather than a hydration one. Unbalanced reactions were either discarded or manually balanced in clear-cut cases. Some reactions contained


a generalized group R; these R groups were replaced by a methyl group, and general fragment symbols such as 'X' were substituted by a chlorine atom. Following the above procedure, Rhea reactions duplicating KEGG entries were removed.
Definition of Reaction Fingerprints
In this study, three types of reaction fingerprints were used to describe the reactions (http://www.daylight.com/dayhtml/doc/theory/index.html). RDF is a type of fingerprint specifically tailored for reaction processing, in which the difference between the fingerprints of the reactants and the products reflects the bond changes during the reaction. SRF is the combination of the normal structural fingerprints of the reactant and product molecules within the reaction. TRF is obtained by subtracting the reactant fingerprints from the product fingerprints. Four types of molecular fingerprints, AP, Morgan2, TT, and PF, were used to generate the reaction fingerprints. It should be noted that not every molecular fingerprint supports all three reaction fingerprints: PF does not support TRF, Morgan2 does not support SRF, and Morgan2 and PF do not support RDF. All the fingerprints were generated with the open-source cheminformatics toolkit RDKit.24
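The RDF and SRF were generated with RDKit's built-in reaction-fingerprint functions, whereas TRF was produced by in-house code (see Discussion). The sketch below illustrates the general idea of such a difference-style fingerprint — hashed Morgan count vectors of the products minus those of the reactants — and is not the authors' exact implementation; the example reaction SMILES is invented.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def transformation_fp(rxn_smiles, n_bits=2048):
    """Sketch of a difference-style reaction fingerprint: hashed Morgan
    (radius 2) count vectors of the products minus those of the reactants."""
    reactant_smi, product_smi = rxn_smiles.split(">>")
    fp = np.zeros(n_bits, dtype=int)
    for smi, sign in ((product_smi, +1), (reactant_smi, -1)):
        for part in smi.split("."):
            mol = Chem.MolFromSmiles(part)
            counts = AllChem.GetHashedMorganFingerprint(mol, 2, nBits=n_bits)
            for bit, count in counts.GetNonzeroElements().items():
                fp[bit] += sign * count
    return fp

print(transformation_fp("CCOC(C)=O.O>>CCO.CC(=O)O")[:10])
```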


Machine Learning Methods
In this study, seven machine learning algorithms (DT, k-NN, LR, NB, NN, RF, and SVM) were employed to build the multi-classification models.28 The parameters of each classifier were optimized using a grid search algorithm, which exhaustively considers all parameter combinations. All the classifiers were implemented with the open-source machine-learning toolkit scikit-learn (version 0.18).35 The default parameters and the grids of parameter values set by us for each classifier are shown in Table S5.
Decision tree. DT is an algorithm that generates a decision tree as a predictive model.36 Each interior node of the tree corresponds to one of the input variables. DT effectively splits its set of samples into subsets enriched in one class or the other, based on the attribute with the highest normalized information gain, and the algorithm then recurses on the smaller sublists.
K-nearest neighbors. k-NN is a nonparametric method that discriminates objects based on the majority vote of the k nearest neighbors in feature space.37 An object is assigned to the class most common among its k nearest neighbors, with the Hamming distance used to measure nearness.
Logistic regression. LR is a regression model in which the dependent variable is categorical, developed by the statistician David Cox in 1958.38 The logistic model estimates the probability of a response based on predictor variables.
Naive Bayes. NB probabilistically classifies samples based on Bayes' rule with strong naive independence assumptions.39 This statistical strategy categorizes instances based on the equal and independent contributions of their attributes.
Neural network. NN is a computational approach modeled on the way a biological brain solves problems and has been shown to be an effective tool for solving nonlinear classification problems.40 Topologically, an NN consists of input, hidden, and output layers of neurons connected by weights. Each input layer node corresponds to a single independent variable, with the exception of the bias node, and each output layer node corresponds to a label or class predicted by the model.


Random forest. RF is an ensemble of tree predictors in which each tree is built by first randomly selecting a small set of input features to split on at each node.41 Each tree acts as a predictor and gives a label, and the final output category of an instance is the mode of the classes output by all the ensemble trees.
Support vector machine. The SVM algorithm is a kernel-based tool for data classification that aims at minimizing the structural risk within the framework of Vapnik–Chervonenkis theory.42 The purpose of SVM training is to find a hyperplane that discriminates samples from different categories. A kernel function is used to map the input vectors from a low-dimensional space into a high-dimensional space; commonly used kernel functions include the linear, polynomial, Gaussian radial basis function, and sigmoid kernels. The penalty parameter C and the kernel coefficient γ are the two most important variables to be optimized.
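For instance, the search over C and γ can be written with scikit-learn's GridSearchCV; the grid below is illustrative only, as the grids actually used are those listed in Table S5.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative grid only; the grids actually used are given in Table S5.
param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [1e-4, 1e-3, 1e-2, 0.1],
              "kernel": ["rbf"]}

def tune_svm(X, y):
    """Exhaustive grid search over C and gamma with 10-fold cross-validation."""
    search = GridSearchCV(SVC(), param_grid, cv=10, scoring="f1_weighted")
    search.fit(X, y)
    return search.best_estimator_, search.best_params_
```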

Model Building
The general workflow of model building for reaction classification is shown in Figure 1. As shown on the left side of Figure 1, reaction fingerprints were first generated for each reaction from the molecular fingerprints of its reactants and products. Baseline models were then constructed for the data sets with the machine learning methods. To save computational resources, no parameters were optimized in the baseline models. Baseline models with good performance were subjected to parameter optimization, and the better models were further assessed on the external validation sets to ensure their predictive capability. Each model was repeated ten times to calculate the mean and standard deviation of its metrics. Thirteen populated types of hydrolysis reactions and 21 populated types of redox reactions, all from the KEGG database, were used to build the main-class reaction classification models separately, and the best combinations of machine learning methods and reaction fingerprints were then determined. For subclass classification, three subclasses of hydrolysis reactions (EC 3.1.c.d, EC 3.2.c.d and EC 3.5.c.d) and eight subclasses of redox reactions (with second EC number 1, 2, 3, 4, 5, 8, 13 or 14) were employed in model building separately. Furthermore, superclass models were constructed to predict whether a given reaction is a hydrolysis or a redox reaction. Each of the above data sets was randomly split into a training set and a test set at a ratio of 8:2. For each reaction, the reaction SMILES was stored together with its assigned reaction label.
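As an illustration of the fingerprint generation step, the sketch below builds a simple difference-type reaction fingerprint from the Morgan fingerprints of reactants and products using RDKit. This is only one plausible construction under assumed settings (Morgan radius 2, a 2048-bit length chosen arbitrarily here, signed product-minus-reactant bit counts) and is not necessarily identical to the reaction fingerprints used in this study.

```python
# Hedged sketch: a difference-style reaction fingerprint built as the
# product-minus-reactant difference of hashed Morgan fingerprints.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def reaction_fingerprint(rxn_smiles, n_bits=2048, radius=2):
    reactants, _, products = rxn_smiles.split(">")   # "reactants>agents>products"
    fp = np.zeros(n_bits, dtype=float)
    for side, sign in ((reactants, -1.0), (products, +1.0)):
        for smi in side.split("."):
            mol = Chem.MolFromSmiles(smi)
            if mol is None:
                continue
            bv = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
            arr = np.zeros(n_bits, dtype=float)
            DataStructs.ConvertToNumpyArray(bv, arr)
            fp += sign * arr
    return fp

# Illustrative ester hydrolysis written as reaction SMILES
x = reaction_fingerprint("CCOC(C)=O.O>>CCO.CC(O)=O")
```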

Performance Assessment of Models
Ten-fold cross-validation, leave-one-cluster-out cross-validation43 and external validation were used to evaluate the performance of all models. Leave-one-cluster-out cross-validation is a variation of n-fold cross-validation; it ensures that members of the same group do not appear simultaneously in the test set and the training set, which makes it possible to detect some overfitting situations. To split a data set into groups, the k-means algorithm44 was used for unsupervised clustering of the data, and ten clusters were specified in this work.
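The leave-one-cluster-out procedure can be pieced together from k-means cluster labels as sketched below. This is a minimal illustration assuming a fingerprint matrix X and label vector y, and it uses scikit-learn's LeaveOneGroupOut as one possible implementation rather than the exact code of this work.

```python
# Minimal sketch of leave-one-cluster-out cross-validation: k-means clusters
# (10 in this work) serve as the groups, and each fold leaves one cluster out.
from sklearn.cluster import KMeans
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.linear_model import LogisticRegression

def leave_one_cluster_out_scores(X, y, n_clusters=10):
    groups = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(X)
    model = LogisticRegression(max_iter=1000)   # stand-in for any of the classifiers
    return cross_val_score(model, X, y, groups=groups,
                           cv=LeaveOneGroupOut(), scoring="f1_weighted")
```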


Additionally, three weighted parameters (R, P and F) were calculated for each model based on the counts of true positives (TP), false positives (FP) and false negatives (FN).20 Weighting accounts for class imbalance by averaging the binary metrics with each class's score weighted by its prevalence among the true labels. R (recall) is the fraction of actual positives that are correctly predicted, P (precision) is the fraction of positive predictions that are correct, and F is a measure that combines R and P. These three parameters were calculated with the following equations:

R = TP / (TP + FN)    (1)

P = TP / (TP + FP)    (2)

F = 2 × TP / (2 × TP + FP + FN)    (3)
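The weighted R, P and F values defined above can also be computed directly with scikit-learn's metrics module; the sketch below assumes arrays of true and predicted class labels.

```python
# Weighted recall (R), precision (P) and F-measure (F): each class's score is
# weighted by its support in the true labels, accounting for class imbalance.
from sklearn.metrics import precision_recall_fscore_support

def weighted_rpf(y_true, y_pred):
    p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
    return r, p, f
```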

Outlier Analysis
In this study, we also applied the isolation forest (IF) algorithm45 to detect outliers for the analysis of misclassified cases. The algorithm isolates observations by randomly selecting a feature and a split value, where the split value lies between the maximum and minimum values of the selected feature. With this method, anomalous samples can be recognized. To analyze the outliers visually, we reduced the dimensionality of the features of the best model with the autoencoder (AE) method46 and projected them onto a two-dimensional plot.
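A rough sketch of this outlier-analysis step is given below. IsolationForest is scikit-learn's implementation of IF, while the two-dimensional projection uses a small MLP trained to reconstruct its input as a stand-in for the autoencoder-based reduction; the contamination level and layer sizes are illustrative assumptions rather than the settings used here.

```python
# Hedged sketch: flag potential outliers with an isolation forest, then embed
# the model features in 2D via a tiny autoencoder-style network for plotting.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neural_network import MLPRegressor

def detect_and_embed(X, contamination=0.05):
    # fit_predict returns -1 for predicted outliers and +1 for inliers
    flags = IsolationForest(contamination=contamination,
                            random_state=0).fit_predict(X)

    # Train the network to reconstruct X, then read the 2-unit bottleneck
    # activations as a two-dimensional embedding.
    ae = MLPRegressor(hidden_layer_sizes=(32, 2, 32), activation="tanh",
                      max_iter=2000, random_state=0).fit(X, X)
    z = np.tanh(X @ ae.coefs_[0] + ae.intercepts_[0])
    z = np.tanh(z @ ae.coefs_[1] + ae.intercepts_[1])
    return flags, z          # z has shape (n_samples, 2)
```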


ASSOCIATED CONTENT
Supporting Information
Supporting Information Available: Tables S1-S4, selected most effective models and leave-one-cluster-out cross-validation results for both hydrolysis and redox reactions; Table S5, parameter grid; Figures S1-S11, subclass data sets, baseline models, standard deviation distribution, similarity plots, confusion matrices, and outlier analysis for both hydrolysis and redox reactions.

AUTHOR INFORMATION
Corresponding Author
E-mail: [email protected] (Y.T.)

ORCID
Yingchun Cai: 0000-0001-5058-5308
Yun Tang: 0000-0003-2340-1109
Weihua Li: 0000-0001-7055-9836

Notes
The authors declare no competing financial interest.

ACKNOWLEDGEMENTS


This work was supported by the National Key Research and Development Program of China (Grant 2016YFA0502304), the National Natural Science Foundation of China (Grants 81373329 and 81673356) and the 111 Project (Grant B07023).


References
1. Kirchmair, J.; Williamson, M. J.; Tyzack, J. D.; Tan, L.; Bond, P. J.; Bender, A.; Glen, R. C. Computational Prediction of Metabolism: Sites, Products, SAR, P450 Enzyme Dynamics, and Mechanisms. J. Chem. Inf. Model. 2012, 52, 617-648.
2. Kirchmair, J.; Göller, A. H.; Lang, D.; Kunze, J.; Testa, B.; Wilson, I. D.; Glen, R. C.; Schneider, G. Predicting Drug Metabolism: Experiment and/or Computation? Nat. Rev. Drug Discovery 2015, 14, 387-404.
3. Sharma, M.; Garg, P. Computational Approaches for Enzyme Functional Class Prediction: A Review. Curr. Proteomics 2014, 11, 17-22.
4. Thompson, R. Classification and Nomenclature of Enzymes. Science 1962, 137, 405-408.
5. Babbitt, P. C. Definitions of Enzyme Function for the Structural Genomics Era. Curr. Opin. Chem. Biol. 2003, 7, 230-237.
6. Todd, A. E.; Orengo, C. A.; Thornton, J. M. Evolution of Function in Protein Superfamilies, from A Structural Perspective. J. Mol. Biol. 2001, 307, 1113-1143.
7. Green, M.; Karp, P. Genome Annotation Errors in Pathway Databases Due to Semantic Ambiguity in Partial EC Numbers. Nucleic Acids Res. 2005, 33, 4035-4039.
8. Kotera, M.; Okuno, Y.; Hattori, M.; Goto, S.; Kanehisa, M. Computational Assignment of the EC Numbers for Genomic-Scale Analysis of Enzymatic Reactions. J. Am. Chem. Soc. 2004, 126, 16487-16498.
9. Dönertaş, H. M.; Martínez Cuesta, S.; Rahman, S. A.; Thornton, J. M. Characterising Complex Enzyme Reaction Data. PLoS One 2016, 11, e0147952.
10. O'Boyle, N. M.; Holliday, G. L.; Almonacid, D. E.; Mitchell, J. B. Using Reaction Mechanism to Measure Enzyme Similarity. J. Mol. Biol. 2007, 368, 1484-1499.
11. Latino, D. A.; Aires-de-Sousa, J. Assignment of EC Numbers to Enzymatic Reactions with MOLMAP Reaction Descriptors and Random Forests. J. Chem. Inf. Model. 2009, 49, 1839-1846.
12. Latino, D. A.; Zhang, Q.-Y.; Aires-de-Sousa, J. Genome-Scale Classification of Metabolic Reactions and Assignment of EC Numbers with Self-Organizing Maps. Bioinformatics 2008, 24, 2236-2244.
13. Sacher, O.; Reitz, M.; Gasteiger, J. Investigations of Enzyme-Catalyzed Reactions Based on Physicochemical Descriptors Applied to Hydrolases. J. Chem. Inf. Model. 2009, 49, 1525-1534.
14. Hu, X.; Yan, A.; Tan, T.; Sacher, O.; Gasteiger, J. Similarity Perception of Reactions Catalyzed by Oxidoreductases and Hydrolases Using Different Classification Methods. J. Chem. Inf. Model. 2010, 50, 1089-1100.
15. Nath, N.; Mitchell, J. B. Is EC Class Predictable from Reaction Mechanism? BMC Bioinf. 2012, 13, 60.
16. Matsuta, Y.; Ito, M.; Tohsato, Y. ECOH: An Enzyme Commission Number Predictor Using Mutual Information and A Support Vector Machine. Bioinformatics 2013, 29, 365-372.
17. Ridder, L.; Wagener, M. SyGMa: Combining Expert Knowledge and Empirical Scoring in the Prediction of Metabolites. ChemMedChem 2008, 3, 821-832.
18. Patel, H.; Bodkin, M. J.; Chen, B.; Gillet, V. J. Knowledge-Based Approach to De Novo Design Using Reaction Vectors. J. Chem. Inf. Model. 2009, 49, 1163-1184.
19. Hu, Q.-N.; Zhu, H.; Li, X.; Zhang, M.; Deng, Z.; Yang, X.; Deng, Z. Assignment of EC Numbers to Enzymatic Reactions with Reaction Difference Fingerprints. PLoS One 2012, 7, e52901.


20. Schneider, N.; Lowe, D. M.; Sayle, R. A.; Landrum, G. A. Development of A Novel Fingerprint for Chemical Reactions and Its Application to Large-Scale Reaction Classification and Similarity. J. Chem. Inf. Model. 2015, 55, 39-53.
21. Kanehisa, M.; Goto, S.; Sato, Y.; Kawashima, M.; Furumichi, M.; Tanabe, M. Data, Information, Knowledge and Principle: Back to Metabolism in KEGG. Nucleic Acids Res. 2014, 42, D199-D205.
22. Kanehisa, M.; Furumichi, M.; Tanabe, M.; Sato, Y.; Morishima, K. KEGG: New Perspectives on Genomes, Pathways, Diseases and Drugs. Nucleic Acids Res. 2017, 45, D353-D361.
23. Alcántara, R.; Axelsen, K. B.; Morgat, A.; Belda, E.; Coudert, E.; Bridge, A.; Cao, H.; De Matos, P.; Ennis, M.; Turner, S. Rhea—A Manually Curated Resource of Biochemical Reactions. Nucleic Acids Res. 2012, 40, D754-D760.
24. RDKit: Open-Source Cheminformatics. http://www.rdkit.org (accessed February 8, 2017).
25. Carhart, R. E.; Smith, D. H.; Venkataraghavan, R. Atom Pairs as Molecular Features in Structure-Activity Studies: Definition and Applications. J. Chem. Inf. Comput. Sci. 1985, 25, 64-73.
26. Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742-754.
27. Nilakantan, R.; Baumann, N.; Dixon, J. S.; Venkataraghavan, R. Topological Torsion: A New Molecular Descriptor for SAR Applications. Comparison with Other Descriptors. J. Chem. Inf. Comput. Sci. 1987, 27, 82-85.
28. Tewari, A.; Bartlett, P. L. On the Consistency of Multiclass Classification Methods. Springer Berlin Heidelberg, 2007; pp 143-157.
29. Chen, Y.; Cheng, F.; Sun, L.; Li, W.; Liu, G.; Tang, Y. Computational Models to Predict Endocrine-Disrupting Chemical Binding with Androgen or Oestrogen Receptors. Ecotoxicol. Environ. Saf. 2014, 110, 280-287.
30. Zhang, C.; Zhou, Y.; Gu, S.; Wu, Z.; Wu, W.; Liu, C.; Wang, K.; Liu, G.; Li, W.; Lee, P. W.; Tang, Y. In Silico Prediction of hERG Potassium Channel Blockage by Chemical Category Approaches. Toxicol. Res. 2016, 5, 570-582.
31. Li, X.; Chen, L.; Cheng, F.; Wu, Z.; Bian, H.; Xu, C.; Li, W.; Liu, G.; Shen, X.; Tang, Y. In Silico Prediction of Chemical Acute Oral Toxicity Using Multi-Classification Methods. J. Chem. Inf. Model. 2014, 54, 1061-1069.
32. Cheng, F.; Yu, Y.; Shen, J.; Yang, L.; Li, W.; Liu, G.; Lee, P. W.; Tang, Y. Classification of Cytochrome P450 Inhibitors and Noninhibitors Using Combined Classifiers. J. Chem. Inf. Model. 2011, 51, 996-1011.
33. WIKIPEDIA: The Free Encyclopedia. https://en.wikipedia.org/wiki/Outlier (accessed February 8, 2017).
34. Mafu, S.; Hillwig, M. L.; Peters, R. J. A Novel Labda-7,13E-dien-15-ol-Producing Bifunctional Diterpene Synthase from Selaginella Moellendorffii. ChemBioChem 2011, 12, 1984-1987.
35. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, E. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825-2830.
36. Podgorelec, V.; Kokol, P.; Stiglic, B.; Rozman, I. Decision Trees: An Overview and Their Use in Medicine. J. Med. Syst. 2002, 26, 445-463.
37. Cover, T.; Hart, P. Nearest Neighbor Pattern Classification. IEEE Trans. Inf. Theory 1967, 13, 21-27.


38. Mood, C. Logistic Regression: Why We Cannot Do What We Think We Can Do, and What We Can Do About It. European Sociological Review 2010, 26, 67-82.
39. Watson, P. Naive Bayes Classification Using 2D Pharmacophore Feature Triplet Vectors. J. Chem. Inf. Model. 2008, 48, 166-178.
40. Myint, K.-Z.; Wang, L.; Tong, Q.; Xie, X.-Q. Molecular Fingerprint-Based Artificial Neural Networks QSAR for Ligand Biological Activity Predictions. Mol. Pharmaceutics 2012, 9, 2912-2923.
41. Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J. C.; Sheridan, R. P.; Feuston, B. P. Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947-1958.
42. Czarnecki, W. M.; Podlewska, S.; Bojarski, A. J. Robust Optimization of SVM Hyperparameters in the Classification of Bioactive Compounds. J. Cheminf. 2015, 7, 38.
43. Kramer, C.; Gedeck, P. Leave-Cluster-Out Cross-Validation Is Appropriate for Scoring Functions Derived from Diverse Protein Data Sets. J. Chem. Inf. Model. 2010, 50, 1961-1969.
44. Jain, A. K. Data Clustering: 50 Years beyond K-means. Pattern Recognition Letters 2010, 31, 651-666.
45. Liu, F. T.; Ting, K. M.; Zhou, Z. H. Isolation Forest. In 2008 Eighth IEEE International Conference on Data Mining, 15-19 Dec. 2008; pp 413-422.
46. Wang, Y.; Yao, H.; Zhao, S. Auto-Encoder Based Dimensionality Reduction. Neurocomputing 2016, 184, 232-242.


For Table of contents Only
