Research Article pubs.acs.org/synthbio

ATLAS of Biochemistry: A Repository of All Possible Biochemical Reactions for Synthetic Biology and Metabolic Engineering Studies

Noushin Hadadi, Jasmin Hafner, Adrian Shajkofci, Aikaterini Zisaki, and Vassily Hatzimanikatis*

Laboratory of Computational Systems Biotechnology (LCSB), Swiss Federal Institute of Technology (EPFL), CH-1015 Lausanne, Switzerland

Supporting Information

ABSTRACT: Because the complexity of metabolism cannot be intuitively understood or analyzed, computational methods are indispensable for studying biochemistry and deepening our understanding of cellular metabolism to promote new discoveries. We used the computational framework BNICE.ch along with cheminformatic tools to assemble the whole theoretical reactome from the known metabolome through expansion of the known biochemistry presented in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. We constructed the ATLAS of Biochemistry, a database of all theoretical biochemical reactions based on known biochemical principles and compounds. ATLAS includes more than 130 000 hypothetical enzymatic reactions that connect two or more KEGG metabolites through novel enzymatic reactions that have never been reported to occur in living organisms. Moreover, ATLAS reactions integrate 42% of KEGG metabolites that are not currently present in any KEGG reaction into one or more novel enzymatic reactions. The generated repository of information is organized in a Web-based database (http://lcsb-databases.epfl.ch/atlas/) that allows the user to search for all possible routes from any substrate compound to any product. The resulting pathways involve known and novel enzymatic steps that may indicate unidentified enzymatic activities and provide potential targets for protein engineering. Our approach of introducing novel biochemistry into pathway design and associated databases will be important for synthetic biology and metabolic engineering. KEYWORDS: generalized enzyme reaction rules, novel reaction prediction, metabolite integration, automated network reconstruction, metabolic engineering

Special Issue: Synthetic Biology in Europe
Received: February 8, 2016
Published: July 12, 2016
DOI: 10.1021/acssynbio.6b00054
ACS Synth. Biol. 2016, 5, 1155−1166
© 2016 American Chemical Society

The revolutionary ongoing sequencing of whole genomes is yielding tremendous amounts of biological data1 for revealing new biochemical reactions and cellular processes. In addition, with increasing progress in analytical and instrumental techniques in metabolomics studies, many new compounds and metabolites are being identified in biological systems.2 However, because the origins and biological functions of the majority of characterized metabolites remain unknown, we must design tools for functional interpretation to map them within the context of metabolic pathways. Another key challenge is to determine the roles of newly identified genes and their associated proteins in an organism.3−5 To take full advantage of the wealth of generated information, it is of importance to annotate and categorize predicted proteins, identify their biochemical functionalities, and integrate them into their corresponding pathways.6

To address these challenges, we propose a computational approach with several design elements to gain a systems-level comprehension of the enzymatic reactions across all species and identify all of the theoretically possible enzymatic reactions based on known biochemistry. The Kyoto Encyclopedia of Genes and Genomes (KEGG) database,7 which we consider to be the reference for known biochemical reactions and metabolites, is one of the most complete repositories of metabolic data. This manually curated knowledge base systematically organizes and visualizes the experimental knowledge of biological systems in an interactive and computable form.8,9 The KEGG database comprises 15 main databases that categorize biological data at three different levels: genomic information, systems information, and biochemical information.10 The KEGG database was created in 1995, and increasing numbers of genes, metabolites, and enzymatic reactions are included in each annual release (Figure 1). The rapid growth of the KEGG database and the results of our study illustrate the abundance of as-of-yet uncharacterized enzymatic reactions in nature, indicating remarkable potential for the discovery of new metabolic functionalities. This knowledge gap calls for the design of computational tools that harness the enzymatic capabilities of known enzymes, aiming to explore all of the theoretically possible biochemical reactions between known metabolites. Another interesting observation is the divergence between the numbers of newly added reactions and compounds in recent years. One possible explanation is that the recent advances in analytical techniques in metabolomic and lipidomic experiments have characterized many metabolites in biological samples. Another challenge raised by the advances in metabolomics is identifying the biological functions of novel compounds and integrating them into biological pathways.

Figure 1. Growth of the KEGG compounds and reactions databases over the last 16 years.

We believe that the best way to predict or design a novel enzymatic biotransformation is to learn from known biochemical reactions and, accordingly, to derive the biotransformation rules that govern known enzyme chemistry in nature.11 Several computational tools exist for predicting novel biological reactions that rely on the concept of “generalized biochemical rules”.12−16 These tools differ in their scopes and numbers of generalized rules and curating procedures, and their performances can be compared on the basis of the numbers of known enzymatic reactions that they can reconstruct. Distinctions in capabilities and applications of these methods have recently been extensively reviewed and evaluated by Hadadi and Hatzimanikatis.17 The generalized biochemical rules are formulated on the basis of known biochemistry; they simulate the biotransformations that occur in enzymatic reactions and are designed in a generalized manner to act on a broader range of substrates rather than the native ones. Therefore, applying these rules to known metabolites can computationally reconstruct known enzymatic reactions and predict enzymatic reactions that have never been observed in nature but are likely feasible because their reaction mechanisms are based on known biochemistry.

The Biochemical Network Integrated Computational Explorer (BNICE.ch) has been under development for the last 15 years and is one of the first methods for discovering and characterizing new enzymatic reactions based on known biochemistry.12,18 The latest version of BNICE.ch has distilled the cataloged biochemistry from the KEGG database into 361 bidirectional generalized reaction rules that recapitulate the biochemical actions of more than 6500 KEGG reactions. The BNICE.ch framework has previously been effectively applied in several studies to explore retrobiosynthetic routes for the biosynthesis of different chemicals,19 to identify novel biodegradation pathways for xenobiotics,20,21 to investigate lipid metabolism,22 and to design novel enzymatic reactions, pathways, and intermediate metabolites.17,18,23,24

In the present work, we applied BNICE.ch for the first time to all of the biological compounds reported in the KEGG database to determine how the known biochemistry of established enzymatic reactions evolves if we use the generalized enzyme reaction rules of BNICE.ch over all of the KEGG compounds, thus exploring the possible space of metabolic reactions that potentially can be found in nature. In doing so, we reconstructed all of the known and reproducible KEGG reactions (as of March 2014) and discovered more than 130 000 novel enzymatic reactions between two or more known KEGG compounds. We have organized these results into a Web-based database named ATLAS of Biochemistry (shortened to “ATLAS” in the rest of this article) that comprises all of the known and hypothetically possible enzymatic reactions between any two (or more) KEGG compounds. We have performed several further analyses on the generated de novo reactions, such as thermodynamic feasibility studies and comparative structural analyses between the newly discovered and previously known KEGG reactions.

To the best of our knowledge, there is no available database that accounts for all of the theoretically possible enzymatic reactions that connect known biological compounds on the basis of known biochemistry. Nevertheless, such information is of great interest not only to fill the knowledge gaps in metabolic networks but also for synthetic biology and metabolic engineering, in which the discovery of novel reactions for designing de novo biosynthesis pathways is an important mission. Furthermore, we discovered that approximately 65% of the newly added KEGG reactions in 2015 already exist in ATLAS as novel reactions. This finding validates the consistency of our generalized reaction rules with acknowledged biochemistry and demonstrates that our proposed hypothetical enzymatic reactions are an important complement to known cataloged biochemistry. Every newly discovered reaction that is already part of our ATLAS database confirms the biochemical validity and importance of our method.



RESULTS AND DISCUSSION

We have generated a repository of all possible enzymatic reactions between KEGG compounds, indicating that there is a huge potential for biological compounds to be interconverted by relatively few reaction mechanisms. This collection is a valuable source of information for biological and biochemical studies, and its characteristics and significance will be presented and discussed in the following sections. ATLAS can be consulted at the Web site http://lcsb-databases.epfl.ch/atlas/ and is free for academic use upon subscription.

ATLAS of Biochemistry: Known and Novel Reactions. Network Generation and Exact Reconstruction of KEGG Reactions. Database preprocessing allows us to assess the quality of the information in the KEGG database and ensures that KEGG compounds and reactions fulfill the minimum requirements for BNICE.ch (more details can be found in Table 2 in Methods). In the first step of our analysis, we applied the 361 bidirectional generalized reaction rules to the 16 798 compounds with identified two-dimensional (2D) structures in KEGG 2015, resulting in 137 877 generated reactions between KEGG compounds. Next, we screened the generated reactions against the KEGG 2015 reaction database to distinguish between “known” and “novel” KEGG reactions. We identified 5270 reconstructed KEGG reactions that appeared exactly as described in KEGG; the remaining 132 607 enzymatic reactions, the large majority, interconnected KEGG compounds through novel reactions.

Pathway Search and Biotransformation Reconstruction. Of the 8959 preprocessed KEGG 2015 reaction database entries (Table 1), 5270 were reconstructed exactly as they appeared in the KEGG database. We revisited the 3689 reactions that were not reconstructed and ran the pathway reconstruction algorithm from their substrate(s) to their product(s) within ATLAS. As a benchmark, we searched for pathways of lengths 1, 2, and 3 to investigate whether the biotransformation of these reactions can be reconstructed through an alternative reaction mechanism. This study resulted in reconstruction of the biotransformation (as opposed to exact mechanism reconstruction) of 1381 KEGG reactions (37% of the 3689 reactions); 916 KEGG reactions were captured in one step, 387 KEGG reactions were generated in two sequential reaction steps, and 78 KEGG reactions required three sequential reaction steps. The user can perform the same analysis on the Web site, choosing any pathway length to connect any two desired compounds. Table 1 provides an overview of the KEGG reaction statistics and summarizes the results of the exact and biotransformation reconstruction approaches. The lists of curated KEGG reactions in each approach are provided in Table A in the Supporting Information.

Table 1. Statistics of Reactions for Three Different Sections: (i) Preprocessing Step; (ii) KEGG Reaction Reconstruction (Exact Reconstruction or Biotransformation Reconstruction); (iii) Generation of ATLAS Reactions^a

^a The reference database for creating the generalized reaction rules was the KEGG 2014 reaction database, whereas all of the reconstructions were done using the KEGG 2015 database (compounds and reactions). The last column shows the analysis of the reactions that appeared in KEGG 2015. Remarkably, 65% of the qualified reactions that were introduced in 2015 (665 reactions) were already part of ATLAS based on KEGG 2014 reactions.

The following examples demonstrate the importance of the results obtained from the pathway search and how these results can clarify the reaction mechanisms of KEGG reactions that are not (yet) fully characterized.

Example 1: One-Step Reconstructed Biotransformation.
Out of the 916 KEGG reactions that were reconstructed in one biotransformation step, 43 are annotated in KEGG with the comment “enzyme not yet characterized.” As Figure 2 illustrates, BNICE.ch also proposes a three-level Enzyme Commission (EC) number for such reactions, which can guide the identification of the exact mechanisms underlying these reactions. The complete list of this class of reconstructed KEGG reactions (biotransformation reconstruction in one step) is available on the Web site. Furthermore, the user can perform this analysis and retrieve the demonstrated results.

Figure 2. (A) Example of a KEGG reaction (R06580) with missing enzyme characterization. (B) Three alternative ATLAS reactions (suggesting different pairs of cofactors) to recover the biotransformation of R06580. The ATLAS reactions provide the EC classification numbers up to the third level (1.14.13.- or 1.14.16.-), which are essential for the enzyme characterization.

Example 2: Multistep Reconstructed Biotransformation. The biotransformations of 387 KEGG reactions are reconstructed in two-step reaction mechanisms using ATLAS reactions, and for 78 KEGG reactions the biotransformation is captured by three consecutive ATLAS reactions. In both cases, most of these reactions are categorized as “multistep” reactions in the KEGG database, with missing information regarding their mechanisms (e.g., the individual reaction steps). Figures 3 and 4 present examples of two- and three-step cases, respectively, and the complete information on this class of reconstructed KEGG reactions is provided on the Web site.

Figure 3. Reconstruction of a biotransformation of an example KEGG reaction (R03359) with an unknown reaction mechanism as two consecutive ATLAS reactions. (A) Available information for R03359 in the KEGG database, indicating a multistep mechanism without further characterization of the reaction steps and intermediates. (B) Proposed sets of two consecutive ATLAS reactions that allow the uncharacterized mechanism of R03359 to be captured via two different KEGG intermediates. In the first proposed two-step mechanism (upper part), our results suggest several alternative reaction mechanisms for each step (different pairs of cofactors). For the second proposed two-step mechanism (lower part), we found several alternative reaction mechanisms for the first reaction step and a single reaction mechanism for the second reaction step.

Figure 4. (A) Available information for R07949 in the KEGG database, indicating a multistep mechanism without further characterization of the reaction steps. (B) For reconstructing the biotransformation of R07949, ATLAS proposes three consecutive reaction steps involving two intermediate KEGG compounds. Interestingly, the second step is a KEGG reaction, and also, our results propose two different alternative reactions for the last step, one KEGG reaction and one ATLAS reaction.

Validation of the Generalized Reaction Rules. The reference reactions that were used to formulate the generalized reaction rules of BNICE.ch were from the KEGG 2014 reaction database (see Methods). Thus, comparing the generated novel ATLAS reactions to the reactions that were added in 2015 to the KEGG database helps us investigate the consistency of the generated hypothetical reactions with known biochemistry. Moreover, through this analysis we can assess the predictive characteristics of our method for creating novel enzymatic reactions that may become known reactions in the future. KEGG introduced 776 additional reactions in 2015 relative to the 2014 version (Table 1). We performed the same preprocessing procedures as explained in Table 2 for the 776 reactions and after the automatic preprocessing determined that 665 of these reactions met the aforementioned criteria. Remarkably, 354 of these reactions were predicted in ATLAS as novel reactions before they were incorporated into the KEGG database. Additionally, through the pathway search algorithm, we reconstructed the biotransformations of 61 of the remaining reactions in one-step mechanisms, 17 in two-step mechanisms, and four in three-step mechanisms (Table 1). The list of 776 reactions and information regarding their reconstructions are provided in Table B in the Supporting Information.

Integration of KEGG Compounds in de Novo Reactions. Fifty-six percent of the 16 798 KEGG compounds with defined structures (9371 compounds) do not participate in any KEGG reaction. As such, these compounds are not connected to any other compound in the KEGG database (disconnected metabolites). An important outcome of this study is the integration of 3945 of the 9371 KEGG compounds (42%) that are not part of any reaction in the KEGG database into at least one novel enzymatic reaction. The participation of these disconnected KEGG metabolites in ATLAS reactions ranges from 1 to 708 reactions (Figure 5). This finding validates the potential of our methodology for future applications to explain the origin of metabolomics data and their integration into metabolic pathways. The list of KEGG compounds that have been integrated into novel reactions together with the number of reactions for each compound is provided in Table C in the Supporting Information.

Figure 5. Disconnected KEGG compounds and their appearance in novel reactions of ATLAS. The two extreme cases at the right and left ends of the graph show that 460 compounds that are not in any KEGG reaction participated in at least 1 novel reaction in ATLAS and that 190 compounds were integrated in numerous novel reactions ranging from 101 to 708 novel reactions.

Reconsideration of Biochemistry in the Context of Chemical Knowledge. After reconstruction of the exact mechanisms or the biotransformations of 6651 KEGG reactions through first- and second-level analyses, 2308 out of the 8959 KEGG 2015 reactions remained non-reconstructed (Table 1). For these reactions, we investigated whether their biotransformations could be reconstructed by BNICE.ch when we allowed chemical compounds from PubChem to be used as intermediates in the network generation and reaction reconstruction. We first identified the substrate(s) of each reaction and ran BNICE.ch using all of the generalized reaction rules for two iterations. In each iteration, we allowed both KEGG and PubChem compounds in the generated network. We next performed pathway searches as described in the previous section and reconstructed the biotransformations of 100 additional KEGG reactions that include at least one PubChem intermediate between the original substrates and products (Figure 6 and Table A in the Supporting Information). These results further suggest that the reason why the mechanisms of several biological reactions are not yet characterized is a lack of information at the metabolite level (unknown intermediates).

Figure 6. (A) Example of a not well characterized KEGG reaction (R05540) that can be reconstructed in two consecutive BNICE.ch reactions; no EC classification or other biochemical and biological information is provided in the database. (B) BNICE.ch reconstruction of the uncharacterized reaction using a PubChem compound as an intermediate metabolite.
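The pathway searches described above (one-, two-, and three-step reconstructions between the substrate and product of a KEGG reaction) can be illustrated with a small sketch. This is not the ATLAS implementation: it assumes that each reaction is reduced to a hypothetical (substrate, product, reaction ID) triple with cofactors omitted, and it enumerates simple pathways up to a maximum length with a depth-limited search. All compound and reaction IDs below are made up for illustration.

```python
from collections import defaultdict


def find_pathways(reactions, source, target, max_steps=3):
    """Enumerate simple pathways from source to target using at most max_steps reactions.

    reactions: iterable of (substrate, product, reaction_id) triples
               (a simplified, hypothetical format; cofactors are ignored).
    Returns pathways sorted by length; each pathway is a list of
    (reaction_id, product) steps.
    """
    graph = defaultdict(list)
    for substrate, product, rxn_id in reactions:
        graph[substrate].append((product, rxn_id))

    pathways = []

    def extend(compound, visited, steps):
        if compound == target and steps:
            pathways.append(list(steps))
            return
        if len(steps) == max_steps:
            return
        for product, rxn_id in graph.get(compound, []):
            if product in visited:  # keep pathways simple (no repeated compound)
                continue
            steps.append((rxn_id, product))
            extend(product, visited | {product}, steps)
            steps.pop()

    extend(source, {source}, [])
    return sorted(pathways, key=len)


# Toy network with made-up IDs: one "known" reaction and four "novel" ones.
toy_reactions = [
    ("A", "B", "novel_1"),
    ("B", "C", "known_1"),
    ("A", "C", "novel_2"),
    ("C", "D", "novel_3"),
    ("B", "D", "novel_4"),
]
for pathway in find_pathways(toy_reactions, "A", "D", max_steps=3):
    print(len(pathway), pathway)
```

Sorting by length mirrors the way the Web site is described as listing pathways; a real search would also have to track cofactors and reaction directionality, which this sketch leaves out.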


Figure 7. The KEGG compound aspartame, without any information regarding its corresponding metabolic reactions, reacts in silico with several generalized reaction rules using different pairs of cofactors. The reactions result in four combinations of one KEGG product and one PubChem product.

We further investigated in detail the remaining 2208 nonreconstructed reactions. These reactions are either from 2014 or appeared only in KEGG 2015. In the case of the 2014 reactions, the primary reasons why these reactions were not used for the formulation of the generalized reaction rules and were therefore not reconstructed by BNICE.ch are presented in Table 2 in Methods. The 2015 reactions will require further analysis to investigate whether new generalized reaction rules can be formulated to capture their mechanisms (Table 2 and Table B in the Supporting Information). Furthermore, we investigated the 5385 KEGG compounds that were not integrated into any ATLAS reaction and therefore remained disconnected from any other KEGG metabolite. Specifically, we examined whether each of these compounds could be connected to any other metabolite by the use of PubChem compounds in the BNICE.ch network generation step. Notably, 3641 of these compounds (68%) were integrated into at least one reaction that involves both KEGG and PubChem compounds (Figure 7 and Table C in the Supporting Information). The reconstruction of KEGG reactions and the integration of KEGG compounds into enzymatic reactions can be increased by considering compounds from chemical databases such as PubChem. Thus, many compounds that are currently only known to chemical databases can potentially participate in metabolism and therefore could be included in biological databases. Moreover, the existence of these intermediates in the PubChem database suggests that they can be chemically detected and that their absence from biological databases could be due to several reasons, e.g., their instability in a biological environment.

EC Number Analysis. BNICE.ch either exactly or partially reconstructed the mechanisms of 6528 KEGG reactions along with a repository of 132 607 novel reactions and proposed potential third-level EC numbers for all of the reactions that it generates. Each reconstructed KEGG reaction is therefore annotated with one (or several alternative) EC numbers that are either consistent with or different from the EC classification that KEGG proposes. Several KEGG reactions lack information regarding EC classification and can be categorized on the basis of their available EC information, e.g., full (fourth-level) EC number, partial EC number (can be first-, second-, or third-level), or no EC number at all (Figure 8). Notably, for 60% of these KEGG reactions lacking complete EC information, we propose third-level EC annotation. The information regarding the EC annotation for reconstructed KEGG and novel reactions is available on the Web site.
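The EC categories used above (full, partial, or no EC number) and the third-level form of an EC number can be made concrete with a short sketch. The helper names and the string handling are assumptions made for illustration and are not part of BNICE.ch or KEGG; EC numbers are treated as dot-separated strings in which "-" marks an unassigned level, as in the 1.14.13.- notation of Figure 2.

```python
def ec_level(ec):
    """Return how many leading EC levels are assigned (0 to 4)."""
    if not ec:
        return 0
    level = 0
    for part in ec.strip().split("."):
        if part in ("", "-"):
            break
        level += 1
    return min(level, 4)


def ec_category(ec):
    """Classify an EC annotation as 'full', 'partial', or 'none'."""
    level = ec_level(ec)
    if level == 4:
        return "full"
    if level == 0:
        return "none"
    return "partial"


def third_level(ec):
    """Truncate an EC number to its third-level form, or None if fewer than three levels are known."""
    if ec_level(ec) < 3:
        return None
    return ".".join(ec.strip().split(".")[:3]) + ".-"


# Illustrative annotations only; they are not taken from the ATLAS tables.
for annotation in ["1.14.13.25", "1.14.13.-", "1.14.-.-", ""]:
    print(repr(annotation), ec_category(annotation), third_level(annotation))
```

In this picture, a reaction whose KEGG annotation is "partial" or "none" gains information whenever a generated reaction supplies a third-level number for it.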


Figure 8. In KEGG, the information regarding EC classification ranges from no EC assignment to complete EC classification. Since BNICE.ch proposes EC numbers up to the third level for reconstructed KEGG reactions, we could improve the EC information for 732 KEGG reactions with missing or incomplete EC classification.

Thermodynamic Analysis. The standard Gibbs free energy of reaction (ΔrG′°) is estimated for all of the reactions generated by BNICE.ch using the group contribution method (GCM).25 GCM decomposes the compounds into several predefined groups with their corresponding Gibbs free energies of formation (ΔfG′°) and estimates the ΔfG′° values for all of the compounds on the basis of the group values. In some cases, the decomposition of compounds results in groups without a corresponding free energy in the GCM database, in which case GCM cannot estimate the ΔfG′° value. ΔrG′° is further calculated from the estimated ΔfG′° values for the compounds. In this study, GCM reported ΔrG′° for 76% of the KEGG reactions and 39% of the novel ATLAS reactions. To gain more insight into the novel reactions, we performed comparative analyses of the KEGG and novel ATLAS reactions with respect to the EC number distribution and the ranges of the estimated ΔrG′° values. When we compared the distributions of estimated ΔrG′° for the known and novel reactions, we observed that the estimated ΔrG′° values for the novel reactions fall within the same range as those of the KEGG reactions for the same EC class (Figure 9). Moreover, in the reaction distribution along the six EC classes, we observed that a large portion of novel reactions are transferases, which might indicate an underrepresentation of transferase activity in KEGG.

Figure 9. Distributions of the Gibbs free energy of reaction for each type of first-level EC classification (classes 1 to 6). Red bars represent KEGG reactions, and blue bars represent ATLAS reactions.

BridgIT Analysis of Novel Reactions. We performed BridgIT analysis to compare the structural similarity of the 132 607 ATLAS novel reactions predicted by BNICE.ch with known KEGG reactions. The results are available on the Web site (http://lcsb-databases.epfl.ch/atlas/) and include for each novel reaction the structurally most similar KEGG reaction with its Tanimoto similarity score. The information regarding the closest KEGG reaction is provided as a link to the KEGG database, which directs users to the KEGG Web site for the given KEGG reaction. This design allows users to obtain other useful information such as genes, organisms, and pathways that are provided by KEGG for that reaction. The association of novel reactions with gene sequences is crucial for metabolic engineering purposes and gap filling in metabolic networks.

Description of the Online Database. The ATLAS information on the Web site is organized in two tables: “BNICE.ch curated KEGG reactions” and “BNICE.ch ATLAS reactions.” The first table, “BNICE.ch curated KEGG reactions”, lists each BNICE.ch-curated KEGG reaction together with its KEGG reaction ID, reaction equation, enzyme name, and EC number. Furthermore, each KEGG reaction is annotated with information regarding the corresponding BNICE.ch reaction rule (third-level EC number from BNICE.ch), the Gibbs free energy of reaction, and information regarding how the reaction is reconstructed in ATLAS (either the exact KEGG reaction reconstruction or the biotransformation reconstruction). The second table, “BNICE.ch ATLAS reactions”, indexes the ensemble of reactions generated by BNICE.ch, comprising KEGG and novel reactions. The information in this table is organized in the same way as in the previous table but presents BridgIT results for all of the novel reactions.

The user-friendly interface allows users to search for keywords, sort the entries of columns, and export the tables in the desired format. The user can also replicate all of the reported examples of biotransformations via the integrated “pathway search tool”. Furthermore, one can query for all of the possible pathways (one, two, or three steps or longer) between the substrate(s) and product(s) of any specified KEGG reaction and obtain all of the alternative biotransformations that ATLAS proposes for that KEGG reaction. One possible application of this tool is gap filling in metabolic networks, in which the user enters the two disconnected metabolites and the maximum desired pathway length in the pathway search tool. The output is a list of all possible pathways between the two metabolites sorted by pathway length, with information regarding the intermediate metabolites and the reaction IDs (KEGG or novel reactions) catalyzing each reaction step. A detailed visualization of the generated pathways is available via the “Graph” button, including information regarding cofactor utilization and the Gibbs free energy of the reaction. Additionally, for novel reactions we provide the BridgIT results. The second integrated tool on the Web site is the “connectivity map”, which loads all of the reactions in ATLAS for any given KEGG compound. The user enters the KEGG compound ID or the compound name and a maximal number of desired enzymatic steps, and the underlying algorithm generates a map of all of the enzymatically connected compounds and reactions.

METHODS

To explore the biochemistry of enzymatic reactions, we (i) used the KEGG compound and reaction databases as a reference, (ii) preprocessed the KEGG compound and reaction information, (iii) applied the BNICE.ch framework to generate all of the theoretically possible biochemical reactions between the KEGG compounds, and (iv) performed several other complementary analyses to assess the closeness of the generated information to known biochemistry. In the following sections, we discuss the different elements of our approach. An outline of the workflow is presented in Figure 10.

Figure 10. Workflow for generating ATLAS, which is divided into three major blocks. Block I includes the process of database preparation and data generation. Block II contains the different analyses of the generated data. Block III corresponds to the publicly available database “ATLAS of Biochemistry”, with two sections “BNICE.ch curated KEGG reactions” and “BNICE.ch ATLAS reactions”. The numbers (from 1 to 8) indicate the corresponding subsections in Methods.

1. Preprocessing of the KEGG Compound and Reaction Databases. We used the KEGG reaction database (2014 and earlier) as a reference for developing the generalized reaction rules. Prior to this step, we preprocessed the KEGG reactions on the basis of several criteria to assess the integrity of the cataloged reactions to be used as a reference for the formulation of reaction rules. For certain reactions, the information provided in KEGG is not complete; hence, those reactions are not reconstructable using BNICE.ch and had to be excluded from further analyses. Our reaction preprocessing step was accomplished in two levels (Table 2):

DOI: 10.1021/acssynbio.6b00054 ACS Synth. Biol. 2016, 5, 1155−1166

Research Article

ACS Synthetic Biology

Table 2. Two-Level Preprocessing of KEGG Reactions To Exclude Reactions That Did Not Fulfill the Minimum Requirements for Formulating the Generalized Reaction Rules

Table 3. In KEGG 2015, 655 out of 17 453 Compounds Did Not Have a Defined Structure and Were Therefore Excluded from Further Analyses in This Work; On the Other Hand, We Identified 1729 Entries for Compounds Having More than One Compound, with Their Only Difference Being Stereochemistry

(i) Automatic preprocessing. We introduced a list of predefined criteria and filtered out the reactions that do not fulfill them. In this first step, we excluded reactions involving compounds with undefined molecular structures, e.g., compounds without a structural file (molfile). Furthermore, we excluded reactions that change only the stereochemistry of the molecules (e.g., racemase and epimerase) because BNICE.ch does not constrain the stereochemistry of molecules. However, BNICE.ch can reformat the results to introduce and analyze the molecules on the basis of stereochemistry.24

(ii) Manual quality control step. We manually analyzed the output of the first preprocessing step reaction by reaction. In this second level of preprocessing, we performed manual quality control of the remaining reactions to identify well-characterized enzymatic reactions with detailed information regarding their reaction mechanisms that are eligible to be used as references for developing the generalized reaction rules. Examples of the types of reactions that were excluded include reactions without EC numbers and incomplete or unbalanced reactions.

We considered the outcome of the first step to be a database of biochemical reactions that we could use to test the functionality of our generalized reaction rules and the outcome of the second step to be a reference set of well-explained enzymatic reactions for developing the generalized reaction rules.

To generate ATLAS, we applied the generalized reaction rules to the KEGG 2015 compound database. Because we used the KEGG 2014 reaction database as a reference for developing generalized reaction rules, applying these rules to the KEGG 2015 compound database helped us investigate the predictive property of our proposed method. We could therefore examine whether the generalized reaction rules could predict, as novel on the basis of the KEGG 2014 database, a reaction that became known in KEGG 2015.

Before applying the reaction rules to the KEGG compounds, we performed a preprocessing step to evaluate the quality of the compound information. Table 3 summarizes the criteria that we considered for preprocessing and excluding KEGG compounds and the ensuing results. After preprocessing of the KEGG compound and reaction databases with the mentioned criteria, we applied the 361 bidirectional generalized reaction rules incorporated into BNICE.ch to the 16 798 KEGG compounds with defined 2D structures. The list of KEGG reactions in the reference database and the list of preprocessed KEGG compounds can be found in Table D in the Supporting Information.

BNICE.ch. The BNICE.ch framework involves two main components: (i) a database of generalized enzymatic reaction rules and (ii) a network generation algorithm. The output of BNICE.ch is subject to screening against biological and chemical databases, which are chosen on the basis of the type and objectives of the study. Furthermore, the group contribution method25 for the estimation of the Gibbs free energies of formation and reaction is integrated into the BNICE.ch framework, and the energetic properties are calculated and reported for all of the generated compounds and reactions. These methods allow us to evaluate the thermodynamic feasibility of the generated information.

2. BNICE.ch: Generalized Enzymatic Reaction Rules. The enzymatic reaction rules constitute the backbone of the BNICE.ch framework. These rules are based on the following concept: a given enzyme is expected to recognize substrates other than the native ones that share the same reactive site(s) and a similar neighboring structure and catalyze (or evolve to catalyze) the same biotransformation for the non-native substrate. BNICE.ch reaction rules are named and classified on the basis of the Enzyme Commission (EC) system.26 BNICE.ch rules describe the biochemistry of the reaction and the reactive site(s) of a substrate rather than the specific structure of the entire molecule and therefore group enzymatic reactions that involve similar chemistry. Because the rules can act on different substrates, they not only reconstruct the specific native reactions for which they have been designed but also postulate many more novel reactions. Every novel reaction generated with BNICE.ch is associated with a third-level EC number and is attributed to a biochemically relevant reaction mechanism.

If a specific KEGG reaction can be replicated using a generalized reaction rule, we denote the KEGG reaction as being “reconstructed” by BNICE.ch. The percentage of KEGG reactions that can be reconstructed with BNICE.ch is called the “coverage”, which is an important indicator of the performance of our method and distinguishes it from other similar tools.
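To make the definition of coverage concrete, the following minimal Python sketch computes it as the percentage of known reactions that also appear among the reconstructed ones. The function name `coverage` and the reaction identifiers are hypothetical placeholders chosen for illustration; they are not part of BNICE.ch or KEGG.

```python
def coverage(reconstructed, known):
    """Return the percentage of known reactions that were reconstructed.

    Both arguments are iterables of reaction identifiers (hypothetical
    placeholders here, not real database IDs).
    """
    known_set = set(known)
    if not known_set:
        raise ValueError("the set of known reactions must not be empty")
    # Only reactions present in the known set count toward coverage;
    # novel reactions in `reconstructed` are ignored.
    hits = known_set & set(reconstructed)
    return 100.0 * len(hits) / len(known_set)


# Toy data: three of four known reactions are reconstructed.
known_reactions = {"K1", "K2", "K3", "K4"}
generated_reactions = {"K1", "K2", "K3", "novel_1"}
print(coverage(generated_reactions, known_reactions))
```

Because the denominator is the number of known reactions, the same set of generated reactions yields a different coverage whenever the reference database changes.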
It should be noted that the coverage is a moving target because it depends on the actual number of “known KEGG reactions”, which increases every year. BNICE.ch currently incorporates 361 bidirectional generalized reaction rules (or 722 forward and reverse rules). More details regarding the procedure for developing the generalized reaction rules are provided in previous publications on the BNICE.ch framework.11,12,19,22,27

3. BNICE.ch: Network Generation Algorithm. BNICE.ch employs an automated network generation algorithm that works in an iterative manner. Starting with a set of input molecule(s), cofactors, and generalized reaction rules, the algorithm proceeds as follows:

(1) Every molecule is checked for reactivity, i.e., it is evaluated to find whether it has the appropriate reactive sites (functionalities) to undergo the reactions corresponding to the specified list of reaction rules.
(2) Upon acting on a molecule, the generalized reaction rules recognize the reactive sites of the molecule and apply the biotransformation in which the atoms and bonds are rearranged to form the product.
(3) Next, all of the reactants are placed in a “reacted” list, and all of the products from these reactants are placed in an “unreacted” list if they are molecules that have not been specified or generated previously. This completes the first step and is defined as “iteration 1”.
(4) Each molecule in the unreacted list is checked for its reactivity, the reaction rules are applied, and new reacted and unreacted lists are created for “iteration 2”.
(5) The procedure is repeated iteratively, and an iteration count is maintained as new molecules are created, keeping track of the iteration number of each species, which corresponds to the number of steps required to create a given product from the original reactant(s).

To manage the exponentially growing number of products that are created because of the combinatorial nature of the network generation process, we have defined several input parameters that govern the size of the generated reaction network and should be specified by the user prior to running the algorithm:17

(1) number of generalized reaction rules
(2) number of iterations
(3) databases to be included in the process of screening and eventually filtering the generated compounds and reactions.

In this study, because we aimed to explore known biochemistry, we used all of the 361 bidirectional reaction rules and applied them in one iteration to all of the KEGG compounds. With respect to database integration, we allowed only KEGG compounds to be included in the generated reaction network without any constraints on the generated reactions.

4. Known and Novel Reactions: Biological and Chemical Databases Integrated. Another important component of BNICE.ch is its interaction with the integrated biological and chemical databases, which allows us to screen our results against available information on compounds and reactions and investigate which portion of the obtained information (compounds and reactions) is already known and which portion is novel. The most important integrated sources of metabolic data in BNICE.ch are the KEGG,7 SEED,28 MetaCyc,29 and ChEBI databases30 and PubChem,31 the most comprehensive available database for compound structures.
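As an illustration of the iterative procedure described in section 3 above, the following Python sketch tracks the unreacted molecules per iteration, records the iteration number of each species, and optionally restricts products to an allowed compound set (analogous to allowing only KEGG compounds). It is a toy model under strong assumptions: molecules are plain strings, each rule is a hypothetical function that returns the products it can form (an empty list if the reactive site is absent), and rules act in one direction only. It is not the BNICE.ch implementation.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

Rule = Callable[[str], List[str]]  # molecule -> list of product molecules


def generate_network(
    seeds: Iterable[str],
    rules: Dict[str, Rule],
    n_iterations: int,
    allowed: Optional[Set[str]] = None,
) -> Tuple[Dict[str, int], Set[Tuple[str, str, str]]]:
    """Apply every rule to every unreacted molecule for n_iterations rounds."""
    iteration_of = {mol: 0 for mol in seeds}  # species -> iteration number
    reactions: Set[Tuple[str, str, str]] = set()
    unreacted = set(iteration_of)
    for iteration in range(1, n_iterations + 1):
        next_unreacted: Set[str] = set()
        for mol in sorted(unreacted):
            for rule_name, rule in rules.items():
                for product in rule(mol):  # empty if no reactive site
                    if allowed is not None and product not in allowed:
                        continue  # screened out by the compound database
                    reactions.add((mol, rule_name, product))
                    if product not in iteration_of:  # not seen before
                        iteration_of[product] = iteration
                        next_unreacted.add(product)
        unreacted = next_unreacted
        if not unreacted:
            break
    return iteration_of, reactions


# Hypothetical toy rules acting on a string suffix as the "reactive site".
def oxidize_alcohol(mol: str) -> List[str]:
    return [mol[:-5] + "CHO"] if mol.endswith("CH2OH") else []


def oxidize_aldehyde(mol: str) -> List[str]:
    return [mol[:-3] + "COOH"] if mol.endswith("CHO") else []


toy_rules = {"alcohol_oxidation": oxidize_alcohol,
             "aldehyde_oxidation": oxidize_aldehyde}
species, network = generate_network(["R-CH2OH"], toy_rules, n_iterations=2)
print(species)
print(sorted(network))
```

The number of rules, the number of iterations, and the `allowed` set correspond to the three input parameters listed above that bound the size of the generated network.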
In this study, KEGG was used as our reference database for compounds (we allowed only KEGG compounds in our results) and reactions (for differentiating between known and novel reactions). If a reaction could be found in the KEGG reaction database, we designated it as known; otherwise, we designated it as novel.

Reconstruction of KEGG Reactions. We reconstructed thousands of known enzymatic KEGG reactions. We present this collection of reconstructed KEGG reactions as “BNICE.ch curated KEGG reactions” because we not only reconstructed them as reported in KEGG but also annotated them with a third-level EC number (coming from BNICE.ch reaction rules). We also provide ΔrG′° values, which are not available in the KEGG database. The curation of known reactions is performed on two levels:

5. Exact Reconstruction of KEGG Reactions. After generating ATLAS, we screened all of the ATLAS reactions against the KEGG reaction database. When we identified an exact match for a reaction, we categorized it as “exactly reconstructed”.

6. Biotransformation Reconstruction. In a second step, for all of the reactions that were not exactly reconstructed as presented in KEGG, we performed a pathway search analysis to investigate whether “biotransformation” of these reactions could be reconstructed irrespective of their mechanisms and


synthetic biology studies directed toward finding novel biosynthesis or biodegradation pathways. Moreover, ATLAS integrates 42% of the KEGG compounds that are not currently part of any KEGG reaction into de novo enzymatic reactions. This high number of plausible reactions interconnecting KEGG compounds is a significant outcome, particularly because, using BNICE.ch with the 2014 version of KEGG, we correctly predicted 452 enzymatic reactions that were identified as known reactions in the 2015 version. The newly introduced biocatalytic reactions can be used in synthetic biology to drive discoveries in biotechnology, medicine, and green chemistry. This study also illustrates that the current knowledge concerning metabolic reactions can be expanded dramatically and that further experimental and computational effort is indispensable to complete our understanding of cellular metabolism.

involved cofactors. Our algorithm identifies the substrate(s) and product(s) of each reaction and then performs a pathway search within ATLAS to identify pathways of one, two, or three enzymatic reaction steps that can replicate the biotransformation of the reaction by connecting its substrate(s) to its product(s). If a pathway between the substrate and the product of the reaction is found, the corresponding generalized reaction rule or a combination of two or three consecutive generalized reaction rules can supposedly catalyze the KEGG reaction.

7. BridgIT Analysis. To further analyze the generated hypothetical reactions, we utilized the computational tool BridgIT32 to assess the structural similarity of the generated novel reactions to KEGG reactions. BridgIT has recently been developed in our group as a complementary tool to BNICE.ch that enables the quantification of the similarity of a novel reaction to a known KEGG reaction with respect to the structures of their substrates and products. BridgIT translates the structural definition of the molecules that participate in a reaction into a mathematical notation using compound fingerprints, which are then used to create a so-called “reaction vector.” BridgIT has an integrated database of KEGG reaction vectors that allows it to compare the reaction vector of a given reaction (for instance, a de novo reaction generated by BNICE.ch) with all of the KEGG reaction vectors using the Tanimoto distance.33 Through assessment of the structural similarity between novel and known reactions, each novel reaction is assigned a Tanimoto similarity score that quantifies its similarity to the existing reactions. The Tanimoto score varies between 0 and 1, where 1 indicates the highest similarity and 0 indicates no similarity.
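The Tanimoto score on binary fingerprints can be illustrated with a short Python sketch. Here a fingerprint is represented as the set of its "on" bit positions, and the score is the size of the intersection divided by the size of the union. The representation, the function names, and the toy vectors are assumptions made for illustration; the actual construction of BridgIT reaction vectors is not reproduced here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two binary fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(a & b) / len(union)


def best_match(query, known_vectors):
    """Return (reaction_id, score) of the most similar known reaction vector."""
    scored = ((rid, tanimoto(query, fp)) for rid, fp in known_vectors.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical reaction vectors keyed by placeholder identifiers.
known = {"known_A": {1, 2, 3, 4}, "known_B": {7, 8}}
novel = {1, 2, 3}
print(tanimoto({1, 2, 3}, {2, 3, 4}))
print(best_match(novel, known))
```

Identical fingerprints score 1 and fingerprints with no shared on-bits score 0, matching the range described above.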
Using this score, we can further assign gene and protein sequences to novel reactions, which can be useful in evolutionary protein engineering and computational protein design for the experimental implementation of the de novo reactions. We ran BridgIT on all of the novel reactions in our database, and the results are available on the Web site.

8. Reaction Characterization. The EC number is a classification scheme for enzyme-catalyzed reactions that defines four levels of information, with the first level describing the most general classification and the fourth level providing information on the substrates participating in the reaction. This globally accepted enzyme nomenclature provides valuable information regarding the type of the enzyme-catalyzed reaction. All of the novel reactions of ATLAS are assigned a third-level EC number, derived from their corresponding generalized reaction rules, that indicates their potential enzymatic group. Furthermore, we estimate the ΔrG′° values for the ATLAS reactions, facilitating the evaluation of the thermodynamic feasibility of the reactions under standard conditions. If the value of ΔrG′° is negative, the reaction is feasible in the designated direction. We have performed several comparative analyses using these two pieces of information to compare the known KEGG reactions and the novel ATLAS reactions with respect to the distributions of their EC numbers and ΔrG′° values.
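The two descriptors used for reaction characterization can be sketched in a few lines of Python. The sketch assumes that the standard Gibbs free energy of reaction is obtained as the stoichiometry-weighted sum of formation energies (negative coefficients for substrates, positive for products) and that a third-level EC number is the first three fields of a four-level one. The function names, compound labels, and energy values are hypothetical and carry no physical meaning; the group contribution estimation itself is not shown.

```python
def reaction_gibbs(stoichiometry, formation_energy):
    """Sum of coefficient * formation energy over all participating compounds.

    stoichiometry: compound -> coefficient (negative = substrate, positive = product)
    formation_energy: compound -> estimated Gibbs free energy of formation
    """
    return sum(coeff * formation_energy[compound]
               for compound, coeff in stoichiometry.items())


def is_feasible(delta_r_g):
    """A negative value means the reaction is feasible in the written direction."""
    return delta_r_g < 0


def third_level_ec(ec_number):
    """Truncate an EC number such as '1.1.1.1' to its first three levels."""
    parts = ec_number.split(".")
    if len(parts) < 3:
        raise ValueError("expected at least three EC levels")
    return ".".join(parts[:3])


# Hypothetical values in arbitrary energy units.
energies = {"A": -100.0, "B": -120.0}
forward = reaction_gibbs({"A": -1, "B": 1}, energies)
print(forward, is_feasible(forward))
print(third_level_ec("1.1.1.1"))
```

Reversing the signs of the stoichiometric coefficients gives the value for the reverse direction, which is why each bidirectional rule yields one direction with a negative and one with a positive estimate.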



ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acssynbio.6b00054. Tables A−D (XLSX)



AUTHOR INFORMATION

Corresponding Author

*Tel: +41 21 693 9870. Fax: +41 21 693 9875. E-mail: vassily.hatzimanikatis@epfl.ch.

Author Contributions

V.H., N.H., A.Z., and J.H. designed the study; N.H. and J.H. performed the experiments; V.H., N.H., and J.H. analyzed the data and wrote the manuscript. A.S. developed the Web site and its tools.

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

This work was supported by the Swiss National Science Foundation (SNF) and SystemsX.ch, the Swiss Initiative in Systems Biology.



REFERENCES

(1) Pareek, C. S., Smoczynski, R., and Tretyn, A. (2011) Sequencing technologies and genome sequencing. J. Appl. Genet. 52, 413−435.
(2) Sugimoto, M., Kawakami, M., Robert, M., Soga, T., and Tomita, M. (2012) Bioinformatics Tools for Mass Spectroscopy-Based Metabolomic Data Processing and Analysis. Curr. Bioinf. 7, 96−108.
(3) Eren, K., Deveci, M., Kucuktunc, O., and Catalyurek, U. V. (2013) A comparative analysis of biclustering algorithms for gene expression data. Briefings Bioinf. 14, 279−292.
(4) Brown, T. A. (2010) Gene Cloning and DNA Analysis: An Introduction, 6th ed., Wiley-Blackwell, Oxford, U.K.
(5) Friedberg, I. (2006) Automated protein function prediction - the genomic challenge. Briefings Bioinf. 7, 225−242.
(6) Radivojac, P., Clark, W. T., Oron, T. R., Schnoes, A. M., Wittkop, T., Sokolov, A., Graim, K., Funk, C., Verspoor, K., Ben-Hur, A., Pandey, G., Yunes, J. M., Talwalkar, A. S., Repo, S., Souza, M. L., Piovesan, D., Casadio, R., Wang, Z., Cheng, J. L., Fang, H., Gough, J., Koskinen, P., Toronen, P., Nokso-Koivisto, J., Holm, L., Cozzetto, D., Buchan, D. W. A., Bryson, K., Jones, D. T., Limaye, B., Inamdar, H., Datta, A., Manjari, S. K., Joshi, R., Chitale, M., Kihara, D., Lisewski, A. M., Erdin, S., Venner, E., Lichtarge, O., Rentzsch, R., Yang, H. X., Romero, A. E., Bhat, P., Paccanaro, A., Hamp, T., Kassner, R., Seemayer, S., Vicedo, E., Schaefer, C., Achten, D., Auer, F., Boehm, A., Braun, T., Hecht, M., Heron, M., Honigschmid, P., Hopf, T. A.,



CONCLUSIONS

We have introduced here the “ATLAS of Biochemistry”, a large collection of novel reactions along with their EC identifiers up to the third level and candidate enzymes that can potentially catalyze these de novo reactions. ATLAS provides a valuable resource of information for those who build and analyze metabolic models and for metabolic engineering projects and


Kaufmann, S., Kiening, M., Krompass, D., Landerer, C., Mahlich, Y., Roos, M., Bjorne, J., Salakoski, T., Wong, A., Shatkay, H., Gatzmann, F., Sommer, I., Wass, M. N., Sternberg, M. J. E., Skunca, N., Supek, F., Bosnjak, M., Panov, P., Dzeroski, S., Smuc, T., Kourmpetis, Y. A. I., van Dijk, A. D. J., ter Braak, C. J. F., Zhou, Y. P., Gong, Q. T., Dong, X. R., Tian, W. D., Falda, M., Fontana, P., Lavezzo, E., Di Camillo, B., Toppo, S., Lan, L., Djuric, N., Guo, Y. H., Vucetic, S., Bairoch, A., Linial, M., Babbitt, P. C., Brenner, S. E., Orengo, C., Rost, B., Mooney, S. D., and Friedberg, I. (2013) A large-scale evaluation of computational protein function prediction. Nat. Methods 10, 221−227.
(7) Kanehisa, M., and Goto, S. (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27−30.
(8) Moriya, Y., Itoh, M., Okuda, S., Yoshizawa, A. C., and Kanehisa, M. (2007) KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res. 35, W182−185.
(9) Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M., and Tanabe, M. (2016) KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457−D462.
(10) Kanehisa, M., Goto, S., Sato, Y., Kawashima, M., Furumichi, M., and Tanabe, M. (2014) Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res. 42, D199−205.
(11) Hatzimanikatis, V., Li, C. H., Ionita, J. A., and Broadbelt, L. J. (2004) Metabolic networks: enzyme function and metabolite structure. Curr. Opin. Struct. Biol. 14, 300−306.
(12) Hatzimanikatis, V., Li, C. H., Ionita, J. A., Henry, C. S., Jankowski, M. D., and Broadbelt, L. J. (2005) Exploring the diversity of complex metabolic networks. Bioinformatics 21, 1603−1609.
(13) Moriya, Y., Shigemizu, D., Hattori, M., Tokimatsu, T., Kotera, M., Goto, S., and Kanehisa, M. (2010) PathPred: an enzyme-catalyzed metabolic pathway prediction server. Nucleic Acids Res. 38, W138−143.
(14) Cho, A., Yun, H., Park, J. H., Lee, S. Y., and Park, S. (2010) Prediction of novel synthetic pathways for the production of desired chemicals. BMC Syst. Biol. 4, 35.
(15) Hou, B. K., Ellis, L. B., and Wackett, L. P. (2004) Encoding microbial metabolic logic: predicting biodegradation. J. Ind. Microbiol. Biotechnol. 31, 261−272.
(16) Ellis, L. B., Gao, J., Fenner, K., and Wackett, L. P. (2008) The University of Minnesota pathway prediction system: predicting metabolic logic. Nucleic Acids Res. 36, W427−432.
(17) Hadadi, N., and Hatzimanikatis, V. (2015) Design of computational retrobiosynthesis tools for the design of de novo synthetic pathways. Curr. Opin. Chem. Biol. 28, 99−104.
(18) Soh, K. C., and Hatzimanikatis, V. (2010) DREAMS of metabolism. Trends Biotechnol. 28, 501−508.
(19) Henry, C. S., Broadbelt, L. J., and Hatzimanikatis, V. (2010) Discovery and Analysis of Novel Metabolic Pathways for the Biosynthesis of Industrial Chemicals: 3-Hydroxypropanoate. Biotechnol. Bioeng. 106, 462−473.
(20) Finley, S. D., Broadbelt, L. J., and Hatzimanikatis, V. (2010) In silico feasibility of novel biodegradation pathways for 1,2,4-trichlorobenzene. BMC Syst. Biol. 4, 7.
(21) Finley, S. D., Broadbelt, L. J., and Hatzimanikatis, V. (2009) Thermodynamic analysis of biodegradation pathways. Biotechnol. Bioeng. 103, 532−541.
(22) Hadadi, N., Soh, K. C., Seijo, M., Zisaki, A., Guan, X. L., Wenk, M. R., and Hatzimanikatis, V. (2014) A computational framework for integration of lipidomics data into metabolic pathways. Metab. Eng. 23, 1−8.
(23) Gonzalez-Lergier, J., Broadbelt, L. J., and Hatzimanikatis, V. (2006) Analysis of the maximum theoretical yield for the synthesis of erythromycin precursors in Escherichia coli. Biotechnol. Bioeng. 95, 638−644.
(24) Gonzalez-Lergier, J., Broadbelt, L. J., and Hatzimanikatis, V. (2005) Theoretical considerations and computational analysis of the complexity in polyketide synthesis pathways. J. Am. Chem. Soc. 127, 9930−9938.
(25) Jankowski, M. D., Henry, C. S., Broadbelt, L. J., and Hatzimanikatis, V. (2008) Group contribution method for thermodynamic analysis of complex metabolic networks. Biophys. J. 95, 1487−1499.
(26) Lilley, D. M., Clegg, R. M., Diekmann, S., Seeman, N. C., von Kitzing, E., and Hagerman, P. (1995) Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NCIUBMB). A nomenclature of junctions and branchpoints in nucleic acids. Recommendations 1994. Eur. J. Biochem. 230, 1−2.
(27) Soh, K. C., and Hatzimanikatis, V. (2010) Dreams of Metabolism. Trends Biotechnol. 28, 501−508.
(28) Overbeek, R., Begley, T., Butler, R. M., Choudhuri, J. V., Chuang, H.-Y., Cohoon, M., de Crecy-Lagard, V., Diaz, N., Disz, T., Edwards, R., Fonstein, M., Frank, E. D., Gerdes, S., Glass, E. M., Goesmann, A., Hanson, A., Iwata-Reuyl, D., Jensen, R., Jamshidi, N., Krause, L., Kubal, M., Larsen, N., Linke, B., McHardy, A. C., Meyer, F., Neuweger, H., Olsen, G., Olson, R., Osterman, A., Portnoy, V., Pusch, G. D., Rodionov, D. A., Ruckert, C., Steiner, J., Stevens, R., Thiele, I., Vassieva, O., Ye, Y., Zagnitko, O., and Vonstein, V. (2005) The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 33, 5691−5702.
(29) Karp, P. D., Riley, M., Saier, M., Paulsen, I. T., Paley, S. M., and Pellegrini-Toole, A. (2000) The EcoCyc and MetaCyc databases. Nucleic Acids Res. 28, 56−59.
(30) Hastings, J., de Matos, P., Dekker, A., Ennis, M., Harsha, B., Kale, N., Muthukrishnan, V., Owen, G., Turner, S., Williams, M., and Steinbeck, C. (2013) The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res. 41, D456−D463.
(31) Wang, Y., Xiao, J., Suzek, T. O., Zhang, J., Wang, J., and Bryant, S. H. (2009) PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 37, W623−633.
(32) Seijo, M., Hadadi, N., Soh, K. C., Miskovic, L., and Hatzimanikatis, V. (2016) A Method for Evaluating Similarity of Biochemical Reactions and Its Uses for Mapping Orphan and Novel Reactions to Gene Sequences.
(33) Godden, J., Xue, L., and Bajorath, J. (2000) Combinatorial preferences affect molecular similarity/diversity calculations using binary fingerprints and Tanimoto coefficients. J. Chem. Inf. Comput. Sci. 40, 163−166.
