Semisupervised Gaussian Process for Automated ... - ACS Publications

Mar 23, 2016 - Manchester Institute of Biotechnology, University of Manchester, Manchester M13 9PL, U.K.. §. iSSB, Institute of Systems and Synthetic...
1 downloads 0 Views 2MB Size
Research Article pubs.acs.org/synthbio

Semisupervised Gaussian Process for Automated Enzyme Search Joseph Mellor,†,‡ Ioana Grigoras,§ Pablo Carbonell,⊥ and Jean-Loup Faulon*,†,⊥,§,∥ †

School of Chemistry, University of Manchester, Manchester M13 9PL, U.K. Manchester Institute of Biotechnology, University of Manchester, Manchester M13 9PL, U.K. § iSSB, Institute of Systems and Synthetic Biology, CNRS, University of Évry-Val-d’Essonne, 91000 Évry, France ⊥ SYNBIOCHEM Centre, Manchester Institute of Biotechnology, University of Manchester, Manchester M13 9PL, U.K. ∥ MICALIS Institute, INRA, 78352 Jouy en Jossas, France ‡

S Supporting Information *

ABSTRACT: Synthetic biology is today harnessing the design of novel and greener biosynthesis routes for the production of added-value chemicals and natural products. The design of novel pathways often requires a detailed selection of enzyme sequences to import into the chassis at each of the reaction steps. To address such design requirements in an automated way, we present here a tool for exploring the space of enzymatic reactions. Given a reaction and an enzyme the tool provides a probability estimate that the enzyme catalyzes the reaction. Our tool first considers the similarity of a reaction to known biochemical reactions with respect to signatures around their reaction centers. Signatures are defined based on chemical transformation rules by using extended connectivity fingerprint descriptors. A semisupervised Gaussian process model associated with the similar known reactions then provides the probability estimate. The Gaussian process model uses information about both the reaction and the enzyme in providing the estimate. These estimates were validated experimentally by the application of the Gaussian process model to a newly identified metabolite in Escherichia coli in order to search for the enzymes catalyzing its associated reactions. Furthermore, we show with several pathway design examples how such ability to assign probability estimates to enzymatic reactions provides the potential to assist in bioengineering applications, providing experimental validation to our proposed approach. To the best of our knowledge, the proposed approach is the first application of Gaussian processes dealing with biological sequences and chemicals, the use of a semisupervised Gaussian process framework is also novel in the context of machine learning applied to bioinformatics. However, the ability of an enzyme to catalyze a reaction depends on the affinity between the substrates of the reaction and the enzyme. This affinity is generally quantified by the Michaelis constant KM. Therefore, we also demonstrate using Gaussian process regression to predict KM given a substrate-enzyme pair. KEYWORDS: semisupervised Gaussian process, Gaussian process regression, metabolic engineering, enzyme screening, enzyme kinetics, reaction fingerprint

T

metabolic engineering projects, the community will soon exhaust the list of drugs that can be bioproduced if projects continue to be limited to known metabolic transformations. The same observation as for drugs could be made for terpenes or flavonoids, which are precursors of many added value chemicals. When it comes to chemical synthesis or transformation, synthetic biology and metabolic engineering offer two definite advantages: reduced costs and the possibility of synthesizing compounds that are not easily accessible through traditional organic synthesis. There is thus a critical need for biotechnology to go beyond the set of known metabolic

he USDA and the World Economic Forum anticipate that biobased production will increase from 10% up to 28% within the chemical market by 2025,1 which represents an overall investment of 614 billion US dollars in many fields such as commodity chemicals or fine and specialty chemicals (including pharmaceutical products). The biosynthesis routes for most of the chemicals that have so far been bioproduced are well documented. For instance, the 70 small molecule drugs2 that have to date been produced biologically are all found in metabolic databases. Despite the fact that the DrugBank repository comprises almost 7000 entries, pathway enumeration on metabolic databases indicate that no more than 150 of these entries could potentially be engineered in Escherichia coli,3 other entries in DrugBank corresponding to molecules not found in metabolic databases. Since the discovery and validation of enzymes catalyzing novel biochemical reactions lags far behind the requirements set by © 2016 American Chemical Society

Special Issue: IWBDA 2015 Received: December 21, 2015 Published: March 23, 2016 518

DOI: 10.1021/acssynbio.5b00294 ACS Synth. Biol. 2016, 5, 518−528

Research Article

ACS Synthetic Biology

In this paper, we make use of Gaussian processes, which give a predictive variance estimate around the predictions and thus provides a clear probabilistic interpretation. Additionally in our implementation, we make use of a semisupervised framework, where we provide only positive measurement (i.e., experimentally verified enzyme reaction pairs). Indeed, it is inappropriate to generate data sets of negative pairs by randomly shuffling enzymes and reactions of positive pairs since we do not know to which extent enzymes are promiscuous. Additionally, data on negative enzyme-reaction pairs are scarce in the literature, as negative results are generally not reported. More pragmatically, the Gaussian process we have developed aids with answering the two following questions,

pathways in order to make possible the biosynthesis and biotransformation of non-natural compounds. To address the need of searching pathways producing nonnatural compounds, retrosynthesis algorithms have been developed.4−6 These algorithms first compile rules from known metabolic reactions and then apply the rules either in a forward fashion from a set of available metabolites (e.g., the metabolites of a growth medium or the endogenous metabolites a chassis strain) to a desired product, or in a reverse fashion from the desired product to a set of available metabolites. During the forward or reverse process the compounds generated are not necessarily known metabolites because the rules are not substrate or product specific but instead code for the chemical transformations occurring around the reaction center. As an example, the rule taken from Table S1 in Henry et al.,4 Glutamate + R1C(O)R2 → Oxoglutarate + R1C(NH2)R2 (where R1, R2 = C or H), codes for all reactions within the enzyme class 2.6.1. That rule alone can be applied to 8182 out of the 14 080 compounds stored in MetaCyc version 17.07 generating at least as many products as there are substrates (for some substrates the rule many be applied at several different locations). Yet, MetaCyc comprises only 75 substrate product pairs in EC 2.6.1; thus, many of the compounds generated by the rule are compounds unknown in MetaCyc and other metabolic databases. There is some legitimacy to compile reactions rules that are not substrate specific since enzyme promiscuity is more prevalent than originally thought. For instance, Nam et al.8 have recently shown that about 37% of E. coli enzymes might have promiscuous activities. Additionally, underground metabolites and reactions resulting from the promiscuous properties of some enzymes have been identifies and validated experimentally.9 Yet, not all enzymes are promiscuous and one may wonder how many enzyme sequences can be found to metabolize the 8182 substrates of the rule 2.6.1 mentioned above? Searching for enzyme sequences catalyzing reactions generated by retrosynthesis algorithms is thus a critical step in the process of bioproducing non-natural compounds. Only few studies have addressed this issue, some are based on molecular simulations, density functional theory (DFT) or partitioned quantum mechanics and molecular mechanics (QM/MM) (see review10) while others are based on machine learning. Molecular simulations require a 3D structure for the enzyme and are generally limited to specific enzyme classes. Molecular simulations are time-consuming and since in retrosynthesis the calculations must be carried out for thousands of reactions, these methods are not amendable to large scale search for enzymes and pathways producing nonnatural compounds. Aside from machine learning studies targeted to specific enzyme families (e.g., P450 see review in Kirchmair et al.11), only few large scale studies are making use of machine learning to predict enzyme sequences catalyzing reactions.12 These method are using nearest neighbor or support vector machine (SVM) classifiers. Enzyme reaction pairs are learnt using fingerprints for the reactions (e.g., functional groups, fragments, molecular signatures13), and various features for the sequence (e.g., physicochemical properties, amino acid composition, k-mers). While these methods exhibit good accuracies (>80% in cross validation) they suffer from a lack of assessment of uncertainties in the predictions.

1. Given an enzyme sequence E and a reaction R, how likely is it that E catalyzes R? 2. Given an enzyme sequence E and a reaction R, can the affinity between the substrates of R and E be quantified? The first question can be answered using classification, we employ a semisupervised Gaussian process framework, which can be trained with positive examples and unlabeled data to provide a probability estimate under a given model. A Gaussian process model provides a principled Bayesian framework in which to provide these estimates. The estimate can be used as a score for decision making in the design of novel pathways. We use data from BRENDA14 as our training set. Gaussian processes scale poorly with the size of the data. Some solutions to this problem include sparse Gaussian processes based on a small number of inducing points. We instead first cluster the data and build models for the individual clusters to reduce the computational burden. In the training phase we first preprocess the enzyme-reaction pairs using CD-HIT15 to remove redundant enzymes. We used sequence identity threshold of 0.5 and a word length of 3. Each enzyme in the training set is replaced by the representative enzyme in its cluster as defined by the CD-HIT tool. Duplicate enzyme-reaction pairs were then removed. For each remaining pair we then create a feature vector. The feature space is then reduced via feature selection and the data is clustered. We try two approaches to clustering, the first was based on E.C. number and the second using the DBSCAN clustering algorithm.16 The data associated with each cluster is then used to train a semisupervised null-category noise model Gaussian process. There is one such model for each cluster. A new enzyme-reaction pair can then be given a score by first determining the nearest cluster to it and then using the null-category noise model Gaussian process associated with that cluster to determine the score. Details of the above procedure is left to the Methods section. Our training pipeline and its integration into the synthetic biology design/ build/test cycle is shown in Figure 1. The ability to search for enzymes capable of catalyzing a reaction is of course valuable. However, the classification approach to this problem ignores the effectiveness of such an enzyme, and thus is unable to provide an answer to our second question on affinity quantification. In a pathway design workflow a synthetic biologist will want to choose enzymes that maximize the product of the desired chemical. The rate at which an enzymatic reaction happens is dependent on the affinity between the substrates of the reaction and the enzyme and is generally quantified by the Michaelis constant KM. A system to predict KM would be a further improvement oversimply classifying whether an enzyme can catalyze a 519

DOI: 10.1021/acssynbio.5b00294 ACS Synth. Biol. 2016, 5, 518−528

Research Article

ACS Synthetic Biology

Figure 2. Mean receiver operating characteristic curve for the overall classifier based on E.C. clusters. The results are from a crossfold validation where unlabeled data, generated as described in the Methods section, is assumed to be negative in the test fold.

that enzymes appear often annotated on multiple E.C. numbers because of enzyme promiscuity, i.e., the ability of enzymes to catalyze multiple reactions or to process multiple substrates. Therefore, we considered an alternative grouping of reactions based on the clustering of their reaction signature features. Since promiscuous reactions in enzymes often display a higher degree of similarity, we expected an improved overall score when clustering the reactions based on features rather than on the E.C. level classification. In such second test, we created clusters using the DBSCAN clustering algorithm.16 In order to evaluate the performance of the classifiers, we computed the receiver operating characteristic (ROC) curve, which is a plot of the true positive rate against the false positive rate at various thresholds. The ROC curves for the individual clusters based on the DBSCAN clustering are shown in Figure S2. For the combined classifier we chose a combination of linear and squared exponential kernels based on the best results. We compared the predictor to a support vector machine (SVM)based predictor where an SVM was trained for each of the clusters. The SVM-based classifiers is the kernel used was a product of linear kernels. The linear-kernel SVM is the same as used by Faulon et al.12 The ROC curve for the combined classifier and the SVM comparison set is given in Figure 3. The overall score for an example using the DBSCAN-based classifier was the score of the subclassifier associated with the nearest cluster to the example. The DBSCAN-based classifier gave

Figure 1. Training architecture for the Gaussian process and its integration as a tool for enzyme selection in an automated pipeline of pathway design.

reaction. Predicting KM is a regression problem. Gaussian processes are very natural probabilistic models for regression problems where inference is more straightforward than in the classification case. Like Gaussian process for classification, the models provide a measure of uncertainty in a given prediction. Therefore, we also demonstrate using Gaussian process regression prediction of KM given a substrate-enzyme pair.



RESULTS AND DISCUSSION We validated and challenged our system with external test sets to evaluate its ability for automated enzyme sequence selection. Cross-validation was based on the BRENDA enzyme information database. Moreover, the predictor was challenged with several external sets corresponding to relevant synthetic biology applications. As such, our enzyme-reaction prediction tool finds application for enzyme design and genome functional annotation. We tested its performance through a series of independent validation tests that considered several application cases in the design/build/test cycle: prediction of enzyme/ reaction pairs; prediction of substrate promiscuity leading to unnatural products; and pathway ranking. Our first test of the system is a cross-validation using the BRENDA data set. For this validation and in order to compute false-positives and true-negatives we made the assumption that unlabeled data in the test split were negative. This assumption is reasonable given how the data was created (see the Methods section). ROC curves for individual clusters which were clustered using the top level E.C. number are shown in Figure S1. We only show the squared exponential kernel results for this clustering which gave the best performance. Despite the promising results individually we found that when combining the classifiers the performance diminished. The ROC curve for this classifier is shown in Figure 2. The overall score of an example for the E.C. based classifier was given as the maximum of the scores for all subclassifiers trained for a specific E.C. level. The E.C. based classifier had an AUC = 0.82 (area under the ROC curve) on the crossfold validation data. Such decrease in performance in the overall classifier may be caused by the fact

Figure 3. Mean receiver operating characteristic curve for the overall classifier based on DBSCAN generated clusters. The results are from a crossfold validation where unlabeled data, generated as described in the Methods section, is assumed to be negative in the test fold. The results are compared with those obtained using SVM following a procedure described in the literature.12 520

DOI: 10.1021/acssynbio.5b00294 ACS Synth. Biol. 2016, 5, 518−528

Research Article

ACS Synthetic Biology

Figure 4. Production of N-Acetyl-L-Leucine by native E. coli BL21 enzymes. (A) GP score vs N-Acetyl-L-Leucine fold increase measured by LC−MS analysis. MS measurements and quantifications were performed after culturing for 4h wild type and engineered BL21 strains. Shaded area corresponds to GP variance estimate. (B) MS/MS spectrum of E. coli BL21 cell extract corresponding to N-Acetyl-L-Leucine (see text and Methods section for details).

result is not surprising considering that in our training set there is only one reaction corresponding to 2.3.1.1, while we have several reactions for 3.5.1.16 and 2.6.1.42. We might thus anticipate enzymes in class 2.3.1.1 to be more specific than enzymes belonging to the other classes, and in turn we should be less confident for enzyme in 2.3.1.1 to catalyze reaction 1. We next validated experimentally the GP results. We first confirmed that N-acetyl-L-Leucine was indeed an E. coli BL21 metabolite. Figure 4B shows the MS/MS spectrum we obtained for a wild type cell extract grown on minimum medium to which we added L-Leucine and N-Acetyl-L-Glutamate (see Methods section for further-details on our experimental protocols in particular the paragraphs on construction of expression plasmids and E. coli culturing, extraction of metabolites, and mass spectrometry analysis). Next, we overexpressed the three above-mentioned enzymes in E. coli BL21 and repeated the MS measurements. Figure 4A shows that the increase of N-Acetyl-L-Leucine compared to wild type is almost insignificant for ECBD_0907 (fold increase 1.1) while it is larger for the other enzymes (1.5 for ECBD_4067 and 2.9 for ECBD_4269). This data set thus demonstrates the advantage GP brings by providing variances to scores. Such variances cannot be computed using other machine learning techniques. In the present case, all enzymes exhibited a positive score but the enzyme with the lowest score and largest score variance turns out to be a nonproducer. N-Acetyl-L-Leucine is a compound used in clinical practice in some countries to treat acute vertigo,18 with recent clinical trials showing its efficacy in the treatment of cerebellar ataxia.19 Beyond validating our GP procedure, the N-Acetyl-L-Leucine example is therefore interesting as it provides a way of producing an added value compound in E. coli using native enzymes. Pathway Ranking. In this section, we tested the ability of the tool in the applied context of a protocol for ranking synthetic metabolic pathways for automated biodesign. One of the challenges at earlier stages of the synthetic biology design process for chemical production is deciding what are the best pathways to engineer in order to produce a target compound in a host organism. Although there are no single criteria about the way pathways should be ranked, several factors are recognized as part of any pathway ranking strategy. We have previously proposed and validated a pathway ranking function that

AUC = 0.91 on the crossfold validation data. The Gaussian process predictor exhibits performance slightly better than those obtained by the SVM-based classifier (AUC = 0.87). Using Score Variance to Search for Enzymes Catalyzing Reactions. Carbonell et al.17 annotated an E. coli BL21 cell extract MS spectrum with compounds that are not necessarily known to belong to the E. coli metabolome (according to EcoCyc). Instead, these compounds were found in an E. coli extended metabolic space (EMS), which was computed using reaction signatures (see Figure 8 for an example of reaction signature and Carbonell et al.17 for details on the computational procedures). N-Acetyl-L-Leucine was found among the EMS compounds matching the spectrum masses. That compound is not known to be a metabolite of E. coli, but was recovered in the EMS as the product of any of the three following reactions: Acetyl‐CoA + L‐Leucine → CoA + N ‐Acetyl‐L‐Leucine (1)

Acetate + L‐Leucine → N ‐Acetyl‐L‐Leucine + H 2O (2)

4‐Methyl‐2‐oxovaleric acid + N ‐Acetyl‐L‐Glutamate → N ‐Acetyl‐L‐Leucine + 2‐oxoglutarate

(3)

The above reactions have the same reaction signatures than MetaCyc references N-ACETYLTRANSFER-RXN (EC 2.3.1.1), RXN-7933 (EC 3.5.1.16), and BRANCHED/CHAINAMINOTRANSFERLEU/RXN (EC 2.6.1.42). These reference reactions are respectively catalyzed by BL21 enzymes ECBD_0907, ECBD_4067, and ECBD_4269, and our assumption is therefore that the three enzymes are also responsible for the synthesis of N-Acetyl-L-Leucine. To verify the above assumption, we created a training set composed of all reactions and sequences extracted from MetaCyc and Rhea and belonging to the third EC level classes 2.3.1, 3.5.1, and 2.6.1. We then used the trained Gaussian process (GP) predictor to compute the score of each of the three BL21 enzymes for catalyzing reactions 1, 2 and 3. The results in Figure 4A show that the three enzymes have a positive score and are thus predicted to synthesize N-Acetyl-LLeucine. However, the associated variance score for ECBD_0907 (EC 2.3.1.1) is larger than for the other enzymes and the confidence for that prediction is therefore lower. This 521

DOI: 10.1021/acssynbio.5b00294 ACS Synth. Biol. 2016, 5, 518−528

Research Article

ACS Synthetic Biology considers three factors: metabolic fluxes, metabolite toxicity and enzyme performance.20 Such ranking strategy considering those aspects was implemented in the RetroPath/XTMS tool for metabolic pathway design.17 In a previous work, the proposed strategy was applied in order to guide the pathway design for the production of the flavonoid pinocembrin.21 In that study, the best pathway among the 11 possible choices for producing pinocembrin was predicted and expressed in E. coli. Next, in order to boost the availability of the cosubstrate malonyl-CoA, 4 different pathways were identified and ranked: malonyl-CoA synthase (EC 6.2.1.), acetyl-CoA carboxylase (EC 6.4.1.2), malonate-CoA transferase (EC 2.8.3.3), and malonyl-CoA reductase (EC 1.2.1.75). The differences between the ranking in the pathways were given by the predicted changes in fluxes and by the selection of the enzymatic step producing malonylCoA. The selected sequences for each pathway were mat, acc, atoDE, and mms/cagg, respectively as described in Carbonell et al.17). In order to assess the performance of RetroPath, all 4 pathways were expressed and the amount of pinocembrin production was measured, arriving at a maximum of 24.1 mg L−1. This example provides an excellent test set for validation of the tool presented here. The goal of the test is to evaluate the use of our GP predictor for estimating the effect of enzyme performance within a general pathway ranking framework. Enzyme performance is commonly assessed through Michaelis−Menten kinetics parameters kcat (or Vmax) and KM. KM values are selected with the goal of tuning the affinity of the enzyme to the substrate.22 As shown in the next section, our GP predictor can provide guidance for that purpose in the sequence selection. Turnover numbers, kcat, in turn, are selected to balance rates of conversion for each of the enzymatic steps in the pathway, avoiding bottlenecks.23 Such parameters are often determined in vitro and they can substantially differ from their in vivo values.24 In order to overcome those limitations, we applied the GP tool predictor in order to estimate the overall performance of enzymes in the pathway. More precisely, the 4 alternative reactions producing malonyl-CoA were encoded following the reaction signature mapping descriptor described in Methods and illustrated in Figure 8. Reaction signatures were then paired with their corresponding engineered enzyme sequences in order to build 4 pairs reaction-sequence, each one associated with one of the pathways. Using the Gaussian process (GP) tool, these pairs reaction-sequence were scored (Table 1). The RetroPath tool can be then applied in order to combine the GP scores for reaction-sequence with the scores estimated for other factors varying in the pathway. The factors considered in Fehér et al.21 were growth-coupled production, enzyme performance and reaction thermodynamics, as shown

in Table 1. The score for growth-coupled production of each pathway is estimated by the application of flux balance analysis, while the contribution of each enzymatic step is obtained by combining reaction Gibbs free energy and the performance of the enzyme for the given reaction, which will be here computed by our predictive GP tool. The way these different scores are combined was determined based on parametric optimization, as described in Carbonell et al.20 The new set up using the GP tool provided an updated RetroPath score, which we used to evaluate the quality of GP as an enzyme performance predictor. Notably, as shown in Figure 5, the updated score was able to closely follow the measured

Figure 5. Pathway ranking in comparison with pinocembrin production levels.

production levels. Our previous enzyme ranking approach in Carbonell et al.17 showed some degree of discrepancy between the score and the observed production levels, especially for the reaction-sequence pair malonate-CoA transferase (EC 2.8.3.3), atoDE. In our updated score, all predictions appeared consistent with measurements, suggesting that our GP approach provides a more reliable estimate of enzyme performance that is compatible with the RetroPath pathway ranking strategy. This result shows that the GP tool can be used to select high performance pairs of reaction-sequences by combining reaction and enzyme information, an essential feature for choosing a pathway in metabolic engineering and synthetic biology applications. In the described example a single enzyme was replaced in each pathway, resulting in controlled conditions where the effects were reduced to flux variation, reaction energies and enzyme performance. The integration of this tool into a general automated pipeline for metabolic pathway design that considers multiple factors and design constraints17 should enable the implementation of design of experiments strategies allowing efficient exploration of the chemical/sequence space associated with the pathway. Predicting KM. Besides the ability of our tool for classification, we tested its performance as a quantitative predictor of enzyme-reaction activity, as part of an automated pipeline for enzyme selection that will provide an estimate of enzyme catalytic performance toward some designated substrate. We considered as our test parameter the Michaelis constant KM, which provides an inverse measure of affinity of the enzyme to the substrate. To achieve this we used Gaussian process regression models following our proposed strategy. These models were trained using 3 data sets from different sources: (1) The first was curated from the BRENDA database; (2) the second was taken from the study of kinetic parameters by Bar-Even et al.25 performed on several thousand enzymes

Table 1. RetroPath Scores Using Flux Balance Analysis (FBA), Free Gibbs Energies and Scores from the Semisupervised Gaussian process (GP), and Measured Pinocembrin Production Levels for Pathways Engineered in Fehér et al.21 pathway matCB acc atoDE mms/ cagg

prod (mmol/ gDW·h)

GP

FBA

Gibbs (kcal/mol)

RetroPath

0.0743 0.0364 0.0122 0.0499

0.644 0.622 0.0645 0.412

2.03 1.84 2.03 2.12

−14.6 −4.37 −3.40 −10.5

4.13 2.90 2.43 3.58

522

DOI: 10.1021/acssynbio.5b00294 ACS Synth. Biol. 2016, 5, 518−528

Research Article

ACS Synthetic Biology

Table 2. Q2 Scores for the 3 Datasets Using Leave-One-Out Cross-Validationa

collected from the literature; (3) the third was taken from Bastard et al.,26 a systematic screening of enzymatic activities for an enzyme family. Data Set 1. The BRENDA database provides an extensive collection of experimentally determined KM values from the literature. We grouped data KM values, given in mM units, into sets based on their third-level E.C. number and performed leave-one-out crossfold validation (LOO) for these data sets, which is a cross-validation method where every single observation is repeatedly used for validation and the rest of the data set as training set. We used a tensor-product kernel as described in the Methods section, taking the reaction signature rather than a substrate signature in the tensor-product. We performed the same experiment for both a tensor-product of linear kernels and a tensor-product of squared-exponential (RBF) kernels. As already stated Gaussian processes have the advantage of providing a measure of uncertainty with each prediction. In an engineering application we can use this to filter our choices to only include those predictions for which we have some level of confidence. On the basis of this principle, coefficient of determination of a leave-one-out crossfold validation (Q2) results for this study are shown in Table 2 when we filtered out by keeping only confident predictions. Data Set 2. We performed another test similar to the previous one with an alternative data set to again probe the ability of the predictor to retrieve KM values. This data set was compiled by Bar-Even et al.25 Table 2 shows Q2 scores for confident predictions for both linear and RBF tensor-products, with notable cases such as methyltransferases (E.C. 2.5.1) or hydrolases (E.C. 3.1.1) achieving Q2 values of 0.78. Some examples of such study for specific E.C. numbers are plotted in Figure 7. Figure 6 shows predictions across all E.C. numbers for both the data sets 1 and 2. The plots vary the variance parameter of the predictions, with lower variance corresponding in general to better predictions. Data Set 3. The third set of regression tests was performed using data curated by Bastard et al.26 In that study, a conserved protein domain of unknown function (DUF849) was systematically tested against multiple enzymatic activities. This data set, thus, contains examples of the same enzyme family with information on the active site available. In order to analyze this data set, we experimented with 4 different covariance functions (kernels) . These 4 kernels differed only with respect to the sequence space. First, sequences were represented as K-mers like in all previous experiments as described in the Methods section. Second, sequences were represented so that we kept only the section of the sequence that corresponded to the active site. The smaller subsequence was then transformed to a K-mer representation. In other words, given the full sequence s = sbeforesactivesafter where sbefore, sactive, and safter are subsequences, only sactive was kept and represented as K-mers. Since the examples were all from the same family it also made sense to use an alternative to the K-mer representation. We will refer to this representation as an aligned representation. For the aligned representation we first aligned all sequences (or active site subsequences) . This means that each amino acid in the sequence could be given a position. The representation was then a binary vector denoted in the presence of each positionamino acid pair. We tried the aligned representation for both the full sequence and the active site subsequence. Given one of the 4 enzyme representations; K-mer full sequence, K-mer active site, aligned full sequence, and aligned active site; we ran

data set Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data

set set set set set set set set set set set set set set set set set set set set set set set set set set set set set set set set set set set set set set set set set set set

2 2 2 2 2 2 1 2 2 2 1 1 2 2 2 2 1 1 3 1 2 1 1 2 2 2 1 1 1 3 3 1 2 3 2 2 2 1 1 2 1 1 2

kernel

E.C.

Q2

data size

Linear Linear RBF Linear RBF RBF RBF RBF Linear RBF Linear RBF RBF Linear RBF RBF Linear RBF aligned full sequence Linear RBF RBF RBF RBF RBF RBF Linear RBF Linear K-mer active site aligned active site RBF RBF K-mer full sequence Linear Linear RBF Linear RBF RBF Linear Linear RBF

2.5.1 3.1.1 5.3.1 1.1.1 3.1.1 3.1.2 1.1.1 1.1.1 5.3.1 2.5.1 1.1.1 5.3.1 3.5.2 6.2.1 6.2.1 3.6.1 5.3.1 2.5.1 − 2.5.1 4.1.1 2.7.2 1.3.1 5.5.1 1.4.3 5.3.3 1.3.1 2.7.1 2.7.1 − − 2.4.1 4.2.3 − 1.4.3 4.1.1 1.14.18 2.7.2 4.3.1 5.4.99 2.4.1 4.3.1 1.5.1

0.7829218 0.781617 0.701818 0.691839 0.6883683 0.6739155 0.6715151 0.657464 0.6492067 0.6454528 0.6449325 0.6309146 0.627767 0.6178345 0.6159273 0.6067722 0.5980584 0.5917988 0.580379 0.5725317 0.5640928 0.5571894 0.5565383 0.5558261 0.5506132 0.5443981 0.5429638 0.5428989 0.5390469 0.534551 0.533824 0.5217184 0.5214784 0.520937 0.5146148 0.5106441 0.5090601 0.5088667 0.5085799 0.507783 0.5062585 0.5052285 0.5049698

47 69 94 647 69 39 316 647 94 47 316 56 50 73 73 58 56 50 3273 50 162 50 46 41 77 21 46 159 159 3273 3273 21 36 3273 77 162 27 50 45 23 21 45 57

a

The scores were calculated only for E.C. classes with variance less than 4 (σ2 < 4) .

the same Leave-one-out (LOO) experiment using a linear tensor-product. Table 2 includes LOO cross-validation results for a linear tensor-product for these 4 representations of enzymes.



CONCLUSION In this work, we have presented a systematic approach to the problem of enzyme selection for a given desired reaction based on semisupervised Gaussian process. Through the application of our proposed technique, we have successfully addressed the issue of definition of the negative set for training. This approach provides a unified pipeline in synthetic biology for the selection of enzymes for pathway design for chemical production in 523

DOI: 10.1021/acssynbio.5b00294 ACS Synth. Biol. 2016, 5, 518−528

Research Article

ACS Synthetic Biology

Figure 6. Plots of predictions of KM for leave-one-out crossfold-validation across all E.C. numbers for both RBF tensor-product kernel Data sets 1 and 2 with units in log mM. The three plots show varying levels of confidence in the predictions. As the confidence increases (variance decreases) the predictions tend to improve.

for sequence selection when challenged with enzymes catalyzing multiple substrates. In that way, the remarkable performance of the tool for regression of enzyme kinetic parameters for some E.C. classes was shown through different independent test sets. To the best of our knowledge our KM predictor, albeit not perfect, is the first of its kind as we are not aware of any other studies using machine learning to predict kinetics parameters at a genome scale level. Production of high-value chemicals is perhaps the most successful application of synthetic biology. As the demand for engineering the production of novel chemicals in chassis organism in efficient and greener ways increases, more powerful in silico design tools informing the synthetic biologists are necessary to select optimal parts in the design/build/test cycles. By using the predictive tools presented here, the process of rationally designing novel and more efficient enzymes in yet unforeseen reactions for new specialty chemicals becomes streamlined and, at large, anticipatedly feasible within design automated pipelines.



Figure 7. Plot of predicted log KM (μM) from leave-one-out crossvalidation where the predicted variance was less than 4 for E.C. 3.1.1 trained with a linear tensor-product.

METHODS In this section we detail the data set, data representation and machine learning models used in our system. Data Representation. An example enzyme-reaction pair is represented by a feature vector xi = [eTi rTi ]T where ei is a feature vector describing the enzyme and ri is a feature vector describing the reaction. The enzyme was converted from its amino acid sequence representation to a K-mer vector representation. The reaction was represented by its signature, which describes how atom connectivity to its neighbors up to a designated diameter.13 Each signature-based feature is located

chassis organism. Results were experimentally evaluated on several external test sets associated with design applications. Notably, the system was able to correctly assign scores to enzymes expected to promiscuously produce the metabolite NAcetyl-L-Leucine in E. coli according to the extended metabolic space. Furthermore, we showed the ability of the system to predict best enzymatic steps for pathway ranking for flavonoids production. In general, the system provides a robust technique

Figure 8. Example of signature representation. Each line represents a given molecular transformation in the reaction. Here 3 bonds in total are broken; a double bond between an oxygen and carbon atom is reduced to a single bond and 3 bonds are created. 524

DOI: 10.1021/acssynbio.5b00294 ACS Synth. Biol. 2016, 5, 518−528

Research Article

ACS Synthetic Biology

provides an appropriate probabilistic framework which meets this condition.28 Moreover, the null-category noise model Gaussian process27 is a semisupervised Gaussian process model (SGP) . It is a probabilistic approach to classification that models P(yn|xn). In the Gaussian process framework latent variables f are introduced such that

at reaction-centers where bond changes happen. The type of bond change, the size of the change and the signatures of molecules involved (center at the bond change) identify a feature. The reaction representation differentiates between 4 types of bond change; a deletion, a change of order (change from single to double bond or vice versa), a bond creation between atoms within the sample molecule, and a bond creation between atoms in different molecules. The value of a specific reaction feature is 1 if the feature is present in the reaction and 0 otherwise. Figure 8 shows the diameter 0 features present in an example reaction. The 4 types of bond change are color coded and the present features shown as a vector with the corresponding colors. Training Data. We constructed a data set derived from the BRENDA database.14 Signatures from depth 0 to 6 were constructed for reactions in the database. Unbalanced reactions, reactions with an incorrect atom-mapping or with more than 999 atoms were discarded. The final data set contained 7318 reactions and 9001 enzymes. A training example consisted of an enzyme-reaction pair. The total number of positive training examples (an example where the enzyme is known to catalyze the reaction) was 16 811. A set of unlabeled pairs was also constructed where it was not known if the enzyme catalyzed the reaction. The unlabeled pairs were constructed so as to make them likely to be negative (the enzyme did not catalyze the reaction) . For each positive-labeled training point xi an associated training point xj was found. The associated point xj was such that the Jaccard distance J(ri, rj) is minimized under the constraint that the enzymes represented by ei and ej have a different top-level E.C. number. An unlabeled point xk = [eTi rTj ]T was generated using the enzyme from data point xi and the reaction from data point xj. Semisupervised Learning. The BRENDA data set contains pairs of enzymes and reactions where the enzyme is known, from experimental data, to catalyze the reaction. That is, the data set contains only positively labeled pairs. It does not contain information about when an enzyme does not catalyze a reaction. As described in previous section, we can however generate enzyme-reaction pairs that are not experimentally validated where we would expect that the enzyme did not catalyze the reaction. These generated pairs are not labeled but they potentially give us valuable information about the distribution of the enzymatic space. This added information may allow a semisupervised learning algorithm to find a more appropriate decision boundary for the data. More precisely, the semisupervised learning problem is to learn a classifier from partially labeled data such as in our data set of enzyme-reaction pairs. Here, we will concern ourselves with a 2-class semisupervised problem.27 We are presented with a collection of N training input examples X = [x1...xN]T, where xn has dimension d, and associated labels z = [z1...zN]T, zn ∈ {−1, 0, 1}. Here −1 denotes the negative class, 1 denotes the positive class, and 0 denotes an example being unlabeled. There is also assumed to be some associated latent targets y = [y1...yN]T, yn ∈ {−1, 1} such yn = zn ⇔ zn ≠ 0. The goal is to learn a mapping between inputs and targets that gives high predictive accuracy. Null-Category Noise Model Gaussian process. In order to aid better automated decision making for novel pathway design we are not just concerned with the classification of enzymes that catalyze a particular reaction. The ability to provide a score associated with our confidence of the classification is also important. The Gaussian process model

P(yn |x n, X, y) =

∫f P(yn |fn )P(fn |X, y) dfn n

(4)

The latent variables f are then modeled as a Gaussian process with covariance function K. It is important to note it is the latent variable not P(yn) that is modeled by a Gaussian process. However, given P(yn|f n) we can then calculate P(yn). The function P(yn|f n) is known as a link-function and “squashes” the latent variable into the unit interval. In the fully supervised case this is commonly the Probit or Heaviside link-function. For the null-category noise model, a different link function is used. For example the link-function in this model analogous to the Probit link-function is given by ⎧ Φ( −(f + w/2)) for zn = −1 n ⎪ ⎪ − ⎪ γ Φ( −(f + w/2)) for zn = 0 n P(yn |fn ) = ⎨ + ⎪ + γ Φ(fn − w/2) ⎪ ⎪ Φ(fn − w/2) for zn = 1 ⎩

(5)

x

where Φ(x) = ∫ 5(z) dz , P(zn = 1|yn = 0) = γ+, and P(zn = −∞ −1|yn = 0) = γ−. The model makes the cluster assumption, that data-density is low near decision boundaries. This low density area is modeled by a null region around a decision boundary whose margin in the case of eq 5 is given by w. The inferred quantity P(yn) is a probability estimate that an enzyme catalyzes a reaction. The Gaussian process modeling the latent variables also provides a variance measure of the estimate of the latent variables. This can be used to gauge the confidence one has in the quantity P(yn). We implemented the null-category noise model using a NCNM-Probit link function by extending the GPy Gaussian process Python library.29 We infer the Gaussian process model for the latent variable using the implementation of expectationpropogation given in the GPy library. When training each model with the training data we optimized the model using the L-BFGS-B optimization procedure.30 We did the optimization with restarts and selected the best solution with respect to the objective function of the optimization. Clustering Data. To reduce the computation load we clustered the data into smaller clusters. Two approaches to clustering were tried. First the data was clustered based on the top level E.C. number. Second the data was clustered using DBSCAN16 based on the Jaccard distance of the reaction features ri. Nearest-neighbor models are trained to classify new points to the clusters. We use the Scikit-learn Python library implementation of both nearest-neighbor and DBSCAN.31 Feature Selection. The feature space of K-mers and signatures is very large. For example the number of distinct signatures of depth 0 to 6 seen across the data set was 21550 and the number of distinct 2-mers and 3-mers was 4667. To initially reduce the large feature space we use a filter feature selection method called CMIM. CMIM uses a conditional mutual information-based score to select features which is fast to compute. CMIM is described in Algorithm 1. We used the 525

DOI: 10.1021/acssynbio.5b00294 ACS Synth. Biol. 2016, 5, 518−528

Research Article

ACS Synthetic Biology FEAST implementation of this algorithm.32 The score CMIM uses requires label information to partition the data set into different classes, it then finds features that best discriminate between these classes. We chose the top level E.C. number to be this label. The top level E.C. number has a high correlation with the chemistry of the enzymes and so features that discriminate this property are conjectured to be informative in discriminating between valid and invalid enzyme-reaction pairs. The 21550 × 4667 K-mersignature product space was prohibitively large to compute and so we selected features independently for the K-mer feature space (enzymes) and the signature feature space (reactions) .

Var[y ] = K (x , x ) + k T(K + σ 2I )−1k * * * * *

In regression the variance of the prediction gives a direct measure of uncertainty in the estimate yn since the dependent variable is modeled directly. Support Vector Machines. Support vector machines(s) are another widely used machine learning method and uses a kernel/covariance function like Gaussian processes. Faulon et al.12 showed that SVMs can be successfully used to classify enzymatic reactions. SVMs learn a decision function f (x n) =

(x i , x n) + b ∑ αiyK i i

[eTi

(6)

rTi

where xi = ]. The choice of Ke and Kr remains. In this paper we give results for the case where both are a linear kernel and the case where both are the squared exponential (or radial basis function) kernel. A linear kernel is defined as K (x i , x j) = x Ti x j

(7)

A squared exponential kernel is defined as ⎛ (x − x )2 ⎞ i j ⎟ K (x i , x j) = σ 2exp⎜⎜ − ⎟ 2 2 l ⎝ ⎠

(11)

where b is a threshold and the αi are parameters chosen by solving a quadratic programming problem. An example is said to be positive if f(xn) is greater than 0 and negative otherwise. Construction of Expression Plasmids and E. coli culturing. E. coli genes ECBD_0907, ECBD_4067 and ECBD_4269 were amplified by PCR from E. coli genomic DNA and cloned into the plasmid pQlinkN.33 These expression plasmids were then introduced into E. coli BL21-AI (Thermo Fisher Scientific) . Transformed cells were grown at 37 °C in mineral salts medium34 containing 2 mg/mL glucose and 50 μg/mL ampicillin. When cells reached early/middle exponential growth phase (OD600nm = 0.6), protein expression was induced with 1 mM isopropyl β-D-thiogalactopyranoside (IPTG) and media supplemented with 1 mM L-Leucine in case of ECBD_0907 and ECBD_4067 and with 1 mM N-Acetyl-LGlutamic acid, 1 mM 4-Methyl-2-oxovaleric acid, 0.1 mM ZnSO4 and 0.1 mM CoCl2 in case of ECBD_4269. After 4 and 24 h of culture at 37 °C, cells from 5 mL culture were harvested by centrifugation at 5000g for 10 min at 4 °C. Cell pellet was stored at −80 °C. Extraction of Metabolites. Intracellular metabolites were extracted using a protocol adapted from the literature.35 Briefly, cell pellet was resuspended in 500 μL of 50% methanol, 25 mM ammonium formate pH = 3−3.75 solution. After two cycles of freezing in liquid nitrogen, unfreezing and votexing, the lysed cells were centrifuged at 14 000g for 5 min at 4 °C. The supernatant containing the metabolites was recovered, filtrated through a Pall Nanosep centrifugal device with Omega membrane (MWCO 3 kDa) by centrifugation at 14 000g for 10 min at 4 °C and vacuum centrifuged for 4 h at mid temperature. Dried pellets were stored at −80 °C until analysis. Mass Spectrometry Analysis. Reagents. LC−MS grade methanol (MeOH) was from VWR, acetonitrile (ACN) and formic acid were from Sigma-Aldrich (Sigma Chemical Co., St Louis, MO, USA) and ultrapure water (H2O) was obtained from a Maxima II (USF Elga) . Liquid Chromatography. Sample extracts were analyzed using a Transcend ultra high performance liquid chromatography device (UHPLC, Thermo Fisher Scientific, San Jose, CA, USA) and a kinetex C8 150 × 2.1 mm, 2.6 μm column (Phenomenex, Sydney, NSW, Australia). Mobile phases were composed of (A) ultrapure water (H2O) and (B) high performance liquid chromatography (HPLC) grade acetonitrile (ACN) both containing 0.1% formic acid. The gradient program was as follows: solvent B was maintained for 2 min at 5%, from 2 to 12 min, it was increased to 95% B, from 12 to 15 min it is maintained at 95% B, and from 15 to 16 min, 5% B was maintained, for column re-equilibration. The flow rate was 400 μL/min and the column temperature was set to 60 °C.

Covariance Function. The covariance function captures similarity between inputs. When an enzyme catalyzes a reaction there is an interaction between some molecular structure in the enzyme and the reaction. To reflect this in the model we use a tensor-product. This is a product of a kernel defined over the enzyme feature space ei (K-mers) and then a kernel defined over the reaction feature space ri (signatures) . The tensorproduct kernel is given by K (x i , x j) = Ke(ei , ej)K r(ri, rj)

(10)

(8)

where the variance σ and length scale l are parameters of the kernel. These parameters are optimized to maximized the loglikelihood of the Gaussian process model during training. Regression. In regression we are presented with a collection of N training input examples X = [x1...xN]T, where xn has dimension d, and associated labels y = [y1 ... yN ]T , yn ∈ . The goal is then to predict a new point y* given x* such that the error in the prediction is small. The Gaussian process framework for regression is much simpler than the classification case. This is because the Gaussian process directly models the dependent variable yn. This is in contrast to classification where the Gaussian process models the latent variable f n as described above. The mean prediction of a new example x* is given by, 2

[y ] = k T(K + σ 2I )−1y (9) * * where K denotes the covariance matrix between all the training examples X, k* denotes the covariance between the new example x* and the training examples and σ2 is a small constant. The variance of the prediction is given by, 526

DOI: 10.1021/acssynbio.5b00294 ACS Synth. Biol. 2016, 5, 518−528

Research Article

ACS Synthetic Biology Mass Spectrometry. After injection of 10 μL of sample, the column effluent was directly introduced into the electrospray source of a Q-Exactive mass spectrometer (Thermo Scientific, San Jose, CA) and analysis was performed in negative ionization mode. The HESI source parameters were as follows: the spray voltage was set to −2.5 kV, the heater temperature was adjusted to 380 °C, the heated capillary temperature to 200 °C and the sheath and auxiliary gas flow were set to 80 and 20 (arbitrary units), respectively. For quantitative LC−MS analysis, the full scan mode was used with a m/z range of 50 to 500. The mass spectrometer acquired scans at a resolution of 140 000 full width at half-maximum (fwhm) at m/z 200. External mass calibration was done before analysis. Data processing was done with the Xcalibur software (version 2.2, Thermo Fischer Scientific, San Jose, CA, USA) and specific ions used for quantification were selected with a mass tolerance of 5 ppm. Quantification by the Standards Addition Method. The method of standard addition36 was performed by adding increasing concentrations, i.e., 0, 0.45 ng, 0.9 ng, 2.25 ng and 4.5 ng of pure acetyl leucine in a pool made of all samples (QC) . These calibration curves were analyzed in triplicate and placed at the beginning and at the end of the analytical sequence. The endogenous acetyl leucine levels in sample extracts were determined by the x-intercept (on the negative x-axis) from the extrapolation of the experimental curves. Tandem mass spectrometry experiments were performed on a QC sample and compared to MS/MS spectra of the pure standard for formal identification. Precursor ion corresponding to the deprotonated form of acetyl leucine detected at m/z 172.0979 (at 5.3 min retention time) were selected in the quadrupole with a 0.4 Da window and subsequently fragmented in the HCD cell with normalized collision energy of 14.



(3) Carbonell, P., Fichera, D., Pandit, S., and Faulon, J.-L. (2012) Enumerating metabolic pathways for the production of heterologous target chemicals in chassis organisms. BMC Syst. Biol. 6, 10. (4) Henry, C. S., Broadbelt, L. J., and Hatzimanikatis, V. (2010) Discovery and analysis of novel metabolic pathways for the biosynthesis of industrial chemicals: 3-hydroxypropanoate. Biotechnol. Bioeng. 106, 462. (5) Yim, H., et al. (2011) Metabolic engineering of Escherichia coli for direct production of 1,4-butanediol. Nat. Chem. Biol. 7, 445−452. (6) Campodonico, M. A., Andrews, B. A., Asenjo, J. A., Palsson, B. O., and Feist, A. M. (2014) Generation of an atlas for commodity chemical production in Escherichia coli and a novel pathway prediction algorithm, GEM-Path. Metab. Eng. 25, 140−158. (7) Caspi, R., et al. (2012) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res. 40, 742−753. (8) Nam, H., Lewis, N., Lerman, J., Lee, D., Chang, R., Kim, D., and Palsson, B. (2012) Network context and selection in the evolution to enzyme specificity. Science 337, 1101−4. (9) Notebaart, R. A., Szappanos, B., Kintses, B., Pál, F., Györkei, Á , Bogos, B., Lázár, V., Spohn, R., Csörgő , B., Wagner, A., Ruppin, E., Pál, C., and Papp, B. (2014) Network-level architecture and the evolutionary potential of underground metabolism. Proc. Natl. Acad. Sci. U. S. A. 111, 11762−11767. (10) Alderson, R. G., De Ferrari, L., Mavridis, L., McDonagh, J. L., Mitchell, J. B., and Nath, N. (2012) Enzyme informatics. Curr. Top. Med. Chem. 12, 1911. (11) Kirchmair, J., Williamson, M. J., Tyzack, J. D., Tan, L., Bond, P. J., Bender, A., and Glen, R. C. (2012) Computational Prediction of Metabolism: Sites, Products, SAR, P450 Enzyme Dynamics, and Mechanisms. J. Chem. Inf. Model. 52, 617−648. (12) Faulon, J.-L., Misra, M., Martin, S., Sale, K., and Sapra, R. (2008) Genome scale enzyme-metabolite and drug-target interaction predictions using the signature molecular descriptor. Bioinformatics 24, 225−233. (13) Carbonell, P., Carlsson, L., and Faulon, J. (2013) Stereo Signature Molecular Descriptor. J. Chem. Inf. Model. 53, 887−897. (14) Schomburg, I., Chang, A., and Schomburg, D. (2002) BRENDA, enzyme data and metabolic information. Nucleic Acids Res. 30, 47−49. (15) Fu, L., Niu, B., Zhu, Z., Wu, S., and Li, W. (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150−3152. (16) Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996) A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proc. of 2nd International Conference on Knowledge Discovery, pp 226−231. (17) Carbonell, P., Parutto, P., Baudier, C., Junot, C., and Faulon, J.L. L. (2014) Retropath: Automated pipeline for embedded metabolic circuits. ACS Synth. Biol. 3, 565−577. (18) Neuzil, E., Ravaine, S., and Cousse, H. (2002) La N-acetyl-DLleucine, medicament symptomatique des etats vertigineux. Bull. Soc. Pharm. Bordeaux 141, 15−38. (19) Strupp, M., Teufel, J., Habs, M., Feuerecker, R., Muth, C., van de Warrenburg, B. P., Klopstock, T., and Feil, K. (2013) Effects of acetylDL-leucine in patients with cerebellar ataxia: a case series. J. Neurol. 260, 2556−2561. (20) Carbonell, P., Planson, A.-G. G., Fichera, D., and Faulon, J.-L. L. (2011) A retrosynthetic biology approach to metabolic pathway design for therapeutic production. BMC Syst. Biol. 5, 122. (21) Fehér, T., Planson, A.-G. G., Carbonell, P., Fernández-Castané, A., Grigoras, I., Dariy, E., Perret, A., and Faulon, J.-L. L. (2014) Validation of RetroPath, a computer-aided design tool for metabolic pathway engineering. Biotechnol. J. 9, 1446−1457. (22) Bennett, B. D., Kimball, E. H., Gao, M., Osterhout, R., Van Dien, S. J., and Rabinowitz, J. D. (2009) Absolute metabolite concentrations and implied enzyme active site occupancy in Escherichia coli. Nat. Chem. Biol. 5, 593−599.

ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acssynbio.5b00294. Figure S1. Mean receiver operating characteristic curves for the 6 individual clusters formed using the top level E.C. number. Figure S2. Mean receiver operating characteristic curves for the 12 individual clusters formed using DBSCAN. (PDF)



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS This work was supported by BBSRC/EPSRC grant BB/ M017702/1, “Centre for synthetic biology of fine and speciality chemicals”.



REFERENCES

(1) Williamson, M. A. (2010) U.S. Biobased Products Market Potential and Projections through 2025, p 272, NOVA Science Publishers, New York. (2) Pauthenier, C., Faulon, J.-L. (2013) Ingénierie métabolique et biologie de synthése. Concepts, Équipements et Réglementations des Biotechnologies TIB164DUO, Techniques de l’Ingénieur. 527

DOI: 10.1021/acssynbio.5b00294 ACS Synth. Biol. 2016, 5, 518−528

Research Article

ACS Synthetic Biology (23) Biggs, B. W., De Paepe, B., Santos, C. N. S., De Mey, M., and Ajikumar, P. K. (2014) Multivariate modular metabolic engineering for pathway and strain optimization. Curr. Opin. Biotechnol. 29, 156−162. (24) Weaver, L. J., Sousa, M. M., Wang, G., Baidoo, E., Petzold, C. J., and Keasling, J. D. (2015) A kinetic-based approach to understanding heterologous mevalonate pathway function in E. coli. Biotechnol. Bioeng. 112, 111−119. (25) Bar-Even, A., Noor, E., Savir, Y., Liebermeister, W., Davidi, D., Tawfik, D. S., and Milo, R. (2011) The Moderately Efficient Enzyme: Evolutionary and Physicochemical Trends Shaping Enzyme Parameters. Biochemistry 5, 04402−4410. PMID: 21506553. (26) Bastard, K., et al. (2014) Revealing the hidden functional diversity of an enzyme family. Nat. Chem. Biol. 10, 42−49. (27) Lawrence, N. D., and Jordan, M. I. (2004) Semi-supervised Learning via Gaussian processes, NIPS. (28) Rasmussen, C. E., and Williams, C. K. I. (2005) Gaussian processes for Machine Learning (Adaptive Computation and Machine Learning), The MIT Press. (29) GPy: A Gaussian process framework in python, http://github. com/SheffieldML/GPy, 2012−2014. (30) Byrd, R. H., Lu, P., Nocedal, J., and Zhu, C. (1995) A Limited Memory Algorithm for Bound Constrained Optimization. SIAM J. Sci. Comput. 16, 1190−1208. (31) Pedregosa, F., et al. (2011) Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825−2830. (32) Brown, G., Pocock, A., Zhao, M.-J., and Luján, M. (2012) Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection. J. Mach. Learn. Res. 13, 27− 66. (33) Scheich, C., Kümmel, D., Soumailakakis, D., Heinemann, U., and Büssow, K. (2007) Vectors for co-expression of an unrestricted number of proteins. Nucleic Acids Res. 35, e43. (34) Hall, B. G. (1998) Activation of the bgl operon by adaptive mutation. Mol. Biol. Evol. 15, 1−5. (35) Narainsamy, K., Cassier-Chauvat, C., Junot, C., and Chauvat, F. (2013) High performance analysis of the cyanobacterial metabolism via liquid chromatography coupled to a LTQ-Orbitrap mass spectrometer: evidence that glucose reprograms the whole carbon metabolism and triggers oxidative stress. Metabolomics 9, 21−32. (36) Harris, D. C. (2003) Quantitative Chemical Analysis, 6th ed., W.H. Freeman, New York.

528

DOI: 10.1021/acssynbio.5b00294 ACS Synth. Biol. 2016, 5, 518−528