Identification of Enzyme Genes Using Chemical Structure Alignments

Jan 29, 2016 - Detailed results of new enzyme gene prediction are available at http://web.kuicr.kyoto-u.ac.jp/supp/moriya/2014_ezyme2/. Figure 5. (A) ...
5 downloads 8 Views 1MB Size
Article pubs.acs.org/jcim

Identification of Enzyme Genes Using Chemical Structure Alignments of Substrate−Product Pairs Yuki Moriya,†,|| Takuji Yamada,‡ Shujiro Okuda,§ Zenichi Nakagawa,† Masaaki Kotera,‡ Toshiaki Tokimatsu,†,|| Minoru Kanehisa,† and Susumu Goto*,† †

Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo, 152-8550, Japan § Graduate School of Medical and Dental Sciences, Niigata University, 1-757 Asahimachi-dori, Chuo-ku, Niigata 951-8510, Japan ‡

ABSTRACT: Although there are several databases that contain data on many metabolites and reactions in biochemical pathways, there is still a big gap in the numbers between experimentally identified enzymes and metabolites. It is supposed that many catalytic enzyme genes are still unknown. Although there are previous studies that estimate the number of candidate enzyme genes, these studies required some additional information aside from the structures of metabolites such as gene expression and order in the genome. In this study, we developed a novel method to identify a candidate enzyme gene of a reaction using the chemical structures of the substrate−product pair (reactant pair). The proposed method is based on a search for similar reactant pairs in a reference database and offers ortholog groups that possibly mediate the given reaction. We applied the proposed method to two experimentally validated reactions. As a result, we confirmed that the histidine transaminase was correctly identified. Although our method could not directly identify the asparagine oxo-acid transaminase, we successfully found the paralog gene most similar to the correct enzyme gene. We also applied our method to infer candidate enzyme genes in the mesaconate pathway. The advantage of our method lies in the prediction of possible genes for orphan enzyme reactions where any associated gene sequences are not determined yet. We believe that this approach will facilitate experimental identification of genes for orphan enzymes.



INTRODUCTION

In addition, several computational methods have been developed to predict reaction paths for unknown substrate or product, including interactive Web servers such as PathPred10 and UM-PPS11 and a method based on a supervised approach.12 Although they use known chemical transformations occurring in metabolic pathways to predict the unknown metabolism of natural products, environmental substances, and pharmaceuticals, the predicted reactions cannot be associated with its catalytic enzyme gene sequences by these methods. These reactions without any associated gene sequences are called orphan enzyme reactions.13 A previous study showed that about 40% of enzymes newly reported even in the past decade are orphan.14 There are many computational methods to identify genes for orphan enzymes in a pathway of interest based on the gene order in the genome,14,15 phylogenetic profiles,15,16 and gene expression profiles.17 In addition, statistical algorithms have been developed based on models, such as the Bayesian model,18 and a supervised approach. 19 Although they successfully facilitate the experimental design, it is important

It is crucial to reconstruct a metabolic pathway from its basic components, i.e., enzymatic reactions and/or metabolites, to understand various biological phenomena occurring in nature and for industrial applications of synthetic biology. The knowledge on these basic components has been currently accumulated in pathway databases such as the KEGG PATHWAY,1 MetaCyc,2 Reactome,3 and UM-BBD4 and in reaction databases such as BRENDA5 and Rhea.6 For example, the metabolism category of the KEGG PATHWAY database contains approximately 6,000 metabolites and 6,200 reactions, whereas the International Union of Biochemistry and Molecular Biology (IUBMB) officially recognizes approximately 6,000 enzymes.7 However, the estimated number of human metabolites is >40,000,8 and the overall number of plant metabolites is estimated to be at least 1,060,000.9 If we assume the same order of numbers of reactions as the metabolites, approximately 1 million reactions are supposed to exist in plants, and the above databases cover only a small portion of the existing metabolism. Thus, it is very likely that many more metabolites are synthesized in nature, including final products with biological functions of interest. © 2016 American Chemical Society

Received: April 16, 2015 Published: January 29, 2016 510

DOI: 10.1021/acs.jcim.5b00216 J. Chem. Inf. Model. 2016, 56, 510−516

Article

Journal of Chemical Information and Modeling

The RDM pattern was constructed from a reaction center (R atom), difference atoms (D atoms), and matched atoms (M atoms) of a chemical transformation (Figure 1A, B).30 The R

to note that these methods require functional associations to other identified enzyme genes in the context of pathway or genome information. When the derived substrate−product relationship does not contain any known information on the catalytic enzyme, the enzyme has to be predicted using the chemical structures of the substrate and product. We developed a tool named “E-zyme” in 2004 to compute a suggested list of EC subsubclasses (the first three digits of the IUBMB Enzyme Commission numbers) of chemical reactions by template matching of a biochemical transformation pattern between the metabolites.20 This strategy uses the fact that the transformation patterns of reactions in the same EC subsubclass are similar to each other. However, it is not easy to identify enzyme genes from the predicted EC subsubclasses because the definitions of many EC subsubclasses are too general to determine the relationship between the enzyme and substrate specificity. Therefore, a method to identify the plausible enzyme gene for a reaction of interest is required. Meanwhile, other methods have been developed to search for similar reactions based on the fingerprint or profile that represent compounds and bond changes21 or the difference between substrates and products.22,23 Based on the similar concept, EC-BLAST24 was developed to find similar reactions to the query from the KEGG REACTION database, based on bond changes, reaction centers, and structural similarity. A number of studies using machine learning approaches were also published on EC number predictions by random forest, self-organizing neural network, support vector machine, and k-nearest neighbor.25−28 However, it should be noted that none of these methods suggest catalytic enzyme gene sequences. In this study, we extended the E-zyme strategy to identify enzyme genes for a query orphan enzyme reaction. We extract candidate enzyme genes, which catalyze the similar reactions to the query orphan reaction, based on the whole-structure comparison of substrate−product pairs. Our method successfully identified candidate enzyme genes that are more likely to catalyze the reaction of interest.

Figure 1. Construction of the RDM pattern and substructure profile. The KEGG database applies KEGG atom types, i.e., three-letter codes, to every atom of a chemical compound, which comprises the KCF (KEGG chemical function) format. The first letter of the KEGG type represents an atom species (e.g., C for a carbon atom and O for an oxygen atom), and the first two letters in the KEGG atom type are referred to as the atom class. (A) An example of a reactant pair. Atoms are classified into an R atom (red), D atom (blue), M atom (yellow), and other aligned atoms (green) according to the structure alignment. (B) An RDM pattern constructed from R-, D-, and M atoms. (C) Aligned substructure notations of three atoms surrounded by a gray line in (A). The symbol “A” in the atom class substructure shows the atom of an aromatic ring, and the symbol “R” indicates other ring atoms. (D) An example of a substructure profile converted from a reactant pair.



MATERIALS Reactions and the Related Data. We used the KEGG database (release 67.0+) for reactions and the related data.1 KEGG REACTION stores 9,398 enzymatic reactions, most of which consist of multiple substrates and products. To make the chemical structure transformations clearer for the cheminformatics application, such reaction equations were represented as the combinations of substrate−product pairs or “reactant pairs,” where each pair shares at least one atom other than hydrogen. KEGG RPAIR stores 14,218 reactant pairs that are categorized into five types, i.e., (i) “main” pairs, describing changes of main compounds; (ii) “cofac” pairs, describing changes of cofactors for oxidoreductases; (iii) “trans” pairs, focused on transferred groups for transferases; (iv) “ligase” pairs, describing the consumption of nucleoside triphosphates for ligases; and (v) “leave” pairs, describing the separation or addition of inorganic compounds for such enzymes as lyases and hydrolases.29 The RPAIR entries contain the atom-to-atom mapping information that represents the conservation of the atoms during reactions, which are referred to as “chemical structure alignment” by analogy with sequence alignment in genomic analyses. The chemical structure alignment reveals chemical transformation patterns, referred to as “RDM” patterns that represent the chemical changes occurring during the reactions.

atom is defined by the following three conditions. 1) The bond of the atom is either created or broken. 2) The oxidation state of the atom changes. 3) The atom is centered in the R/S isomerization of a chiral form or the E/Z isomerization of a double bond. The neighboring atoms of the R atom in the aligned structure are defined as the M atoms, and other neighboring atoms that are not in the aligned structure are defined as the D atoms. The KEGG RCLASS database contains 2,831 RDM patterns for the main-type reactant pairs, and we used the corresponding 8,846 main-type reactant pairs in this study. Ortholog Data. KEGG has two sets of ortholog databases: KEGG Orthology (KO) as a collection of manually curated ortholog groups based on sequence similarity and KEGG Ortholog Cluster (OC) as a computationally defined comprehensive ortholog database.31 The KO groups are associated with the functional information such as the KEGG pathway maps and BRITE functional hierarchies and used for 511

DOI: 10.1021/acs.jcim.5b00216 J. Chem. Inf. Model. 2016, 56, 510−516

Article

Journal of Chemical Information and Modeling

RCLASS database were calculated (the details are mentioned below), and the known RDM patterns whose score was smaller than a given threshold were discarded. Reactant pairs with the remaining RDM patterns were retrieved from the KEGG RPAIR database. 3) The prediction score of the query pair to each KO entry was calculated by using the selected reactant pairs. In the second and the third steps, the RDM patterns and reactant pairs were converted to the substructure profiles that describe the structural changes as integer vectors (Figure 1D), and the Tanimoto coefficient between the profiles was calculated as a similarity between the RDM patterns or between the reactant pairs. Construction of Substructure Profiles for Similarity Measures. To define the similarity between reactant pairs, each reactant pair was converted to the substructure profile, an integer vector that describes the aligned substructure and its frequency (Figure 1C, D), in a similar manner to the chemical descriptors dealing with the collection of substructures such as KCF-S33 and Daylight fingerprints.34 The substructure profiles were designed so that each substructure contains the atoms from both the substrate and product. In this study, substructures in the profiles were obtained by collecting the paths from the traversal of the molecular graph. We constructed the paths with a length of 2−3 atoms, described by the KEGG atom types, which are referred to as the “atom type chains,” and the paths with a length of 2−8 atoms, described by the KEGG atom classes, which are referred to as the “atom class chains,” so that they contained at least one R- or D atom and were defined as the substructure of a reactant pair. By designing the substructure profiles as explained above, the atoms that were closer to the reaction center were given higher priority compared with the others. In addition, in the atom class chains, the aromatic ring atoms were labeled by the shared symbol “A,” and other ring atoms were labeled with “R” instead of using the original atom classes when considering their biochemical features. The Tanimoto coefficient was used as a measure of similarity between the profile of the query pair and profiles of reference reactant pairs, where 1 means identical and 0 means no similarity at all. After calculating similarities between reference reactant pairs and the query pair, the reactant pairs with the similarity