A Semiautomated Structure-Based Method To ... - ACS Publications

The source code used will be released under the GNU General Public License (GPLv3) and will be free to download. We believe that the present method wi...
20 downloads 0 Views 5MB Size
Article pubs.acs.org/jcim

A Semiautomated Structure-Based Method To Predict Substrates of Enzymes via Molecular Docking: A Case Study with Candida antarctica Lipase B Zhiqiang Yao,† Lujia Zhang,*,† Bei Gao,† Dongbing Cui,† Fengqing Wang,† Xiao He,‡,§ John Z. H. Zhang,‡,§ and Dongzhi Wei*,† †

State Key Laboratory of Bioreactor Engineering, New World Institute of Biotechnology, East China University of Science and Technology, Shanghai 200237, China ‡ State Key Laboratory of Precision Spectroscopy, Institute of Theoretical and Computational Science, East China Normal University, Shanghai 200062, China § NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China S Supporting Information *

ABSTRACT: The discovery of unique substrates is important for developing potential applications of enzymes. However, the experimental procedures for substrate identification are laborious, time-consuming, and expensive. Although in silico structurebased approaches show great promise, recent extensive studies have shown that these approaches remain a formidable challenge for current biocomputational methodologies. Here we present an open-source, extensible, and flexible software platform for predicting enzyme substrates called THEMIS, which performs in silico virtual screening for potential catalytic targets of an enzyme on the basis of the enzyme’s catalysis mechanism. On the basis of a generalized transition state theory of enzyme catalysis, we introduce a modified docking procedure called “mechanism-based restricted docking” (MBRD) for novel substrate recognition from molecular docking. Comprising a series of utilities written in C/Python, THEMIS automatically executes parallel-computing MBRD tasks and evaluates the results with various molecular mechanics (MM) criteria such as energy, distance, angle, and dihedral angle to help identify desired substrates. Exhaustive sampling and statistical measures were used to improve the robustness and reproducibility of the method. We used Candida antarctica lipase B (CALB) as a test system to demonstrate the effectiveness of our computational prediction of (non)substrates. A novel MM score function for CALB substrate identification derived from the near-attack conformation was used to evaluate the possibility of chemical transformation. A highly positive rate of 93.4% was achieved from a CALB substrate library with 61 known substrates and 35 nonsubstrates, and the screening rate has reached 103 compounds/day (96 CPU cores, 100 samples/compound). The performance shows that the present method is perhaps the first reported scheme to meet the requirement for practical applicability to enzyme studies. An additional study was performed to validate the universality of our method. In this verification we employed two distinct enzymes, nitrilase Nit6803 and SDR Gox2181, where the correct rates of both enzymes exceeded 90%. The source code used will be released under the GNU General Public License (GPLv3) and will be free to download. We believe that the present method will provide new insights into enzyme research and accelerate the development of novel enzyme applications.



INTRODUCTION Enzymes catalyze a wide variety of reactions in aqueous solution under ambient conditions with exquisite selectivity and stereospecificity1,2 and are among the most proficient catalysts known.3 The fascinating catalytic ability of enzymes has tremendous © 2016 American Chemical Society

potential for the development of novel synthetic biochemical pathways.4,5However, in vitro methodological progress is slow in Received: September 24, 2015 Published: August 16, 2016 1979

DOI: 10.1021/acs.jcim.5b00585 J. Chem. Inf. Model. 2016, 56, 1979−1994

Article

Journal of Chemical Information and Modeling

Because each enzyme follows a different catalytic mechanism, we used a serine hydrolase, Candida antarctica lipase B (CALB/EC 3.1.1.3), which catalyzes ester hydrolysis, as a target for effective computational prediction of (non)substrates. CALB was chosen for a number of reasons, including the availability of a high-resolution crystal structure,23 a well-characterized catalytic mechanism,24 rich experimental data,25−38 wide use in academic and industrial laboratories, a great number of registered patents and various applications, and its appropriateness as a candidate in pharmaceutical, chemical, and food industries.39 It is generally accepted that in serine protease-like enzymes, the formation of the tetrahedral intermediate (TI) or the near-attack conformation (NAC), a ground-state conformation that can convert directly to the transition state, is rate-determining.40,41 We therefore assumed in our sampling that a higher probability of the NAC correlates to a lower barrier for this reaction and thus higher overall activity of the enzyme for the substrate. By defining key interactions between substrates and CALB in the NAC, we designed a novel substrate prediction score function (SP-CALB) to evaluate the possibility of chemical transformation of CALB substrates by combining the contributions of each key interaction. Second, the contribution of the transmission coefficient γ(T) to the reaction rate is relatively small (typically no more than a factor of 103) in comparison to the effect of the reaction barrier (as much as a factor of 1019).20 The transmission coefficient is also important for predicting enzyme catalysis and is sensitive to substrate−enzyme and substrate−water interactions: it can either accelerate or decelerate reactions. From a biochemical perspective, the “lock and key” model42 is the most commonly accepted hypothesis for enzyme catalysis. In this hypothesis, the binding of a substrate molecule to the active site of the enzyme results in activation of the substrate (a reactive conformation). Subsequently, a modified version in which the “key does not quite fit the lock perfectly but exercises a certain strain on it” (ground-state destabilization) was proposed.43 Along the reaction coordinate for enzyme catalysis, the energy barrier for the reversible binding of substrate to form the enzyme−substrate complex, which then undergoes catalytic conversion, is critical for discriminating substrate from nonsubstrate. This binding free energy can be calculated with various methodologies; frequently used MM methods such as MM-PBSA and molecular docking can predict enzyme selectivity between substrates with a certain degree of accuracy.44,45 To achieve a high screening rate, sampling was performed in the present study using molecular docking, which is a high-efficiency binding free energy-based method that is widely used to investigate enzyme−substrate interactions.46 However, MM-based methods alone are not sufficient for sampling of the NAC because the energy function used cannot accurately describe the electronic structures that promote enzyme catalysis. Therefore, we introduced three enhanced hydrogen-bonding terms that mimic the stabilization of the substrate by an oxyanion hole in CALB to constrain the sample space within the subspace around the NAC. A modified docking procedure called “mechanism-based restricted docking” (MBRD) is introduced to achieve substrate recognition for molecular docking. For robustness and reproducibility, exhaustive sampling was applied along with MBRD in this work. Here we present an open-source, extensible, and flexible software platform called THEMIS that performs in silico screening of enzyme substrates to drastically reduce experimental effort. THEMIS is composed of a series of utilities written in C/Python to automatically execute parallel-computing MBRD tasks and evaluate the results with various MM criteria such as

comparison with the evolution of new enzyme discovery. As of May 16, 2016, RefSeq (version 76) contained 63 971 766 protein sequences, and the list is rapidly increasing as a result of highthroughput sequencing technology.6 Because functional characterization of the substrates of uncharacterized enzymes by experiment is laborious and time-consuming, the progress in utilizing more efficient enzymes has been slow.7 On the other hand, computational methods are increasingly being applied in the design of biocatalysts to provide rational guidelines, direct experimental planning, and avoid expensive and time-consuming experiments.8 Nonetheless, the in silico quantitative prediction of enzyme activity and selectivity is still a formidable task.9 Although bioinformatics can suggest clues regarding activities from sequence similarity, it provides no direct evidence because the activities of enzymes are based on their three-dimensional structures.10 When the target enzyme has little relationship to orthologous enzymes of known activity, any inference could be unreliable,11 making structure-based prediction inevitable. Structure-based computational methods are able to describe the biocatalytic machinery in detail and thus can provide more information that is otherwise difficult to obtain.1 From the standpoint of applicability, a viable substrate screening method would have to match or exceed a screening rate of 102 to 103 compounds per day in order to be competitive in terms of cost with common experimental methods.12 In the present study, we used a molecular mechanics (MM)-based method to perform in silico screening against substrates. This approach has the advantage of requiring modest computational costs, and it has been widely used in enzyme engineering and mechanistic investigations.13−16 Meeting speed requirements introduces the second and most critical performance criterion of computational methods, namely, the ability to reliably discriminate between active and inactive molecules in a virtual library.17 Generally, enzymes are large and heterogeneous systems containing thousands of atoms, in which not only active residues but also other residues and the solvent environment determine the activities of enzymes.18,19 In generalized transition state theory,20 the rate constant k(T) for a reaction at temperature T can be conveniently expressed as follows: k(T ) = γ(T )kTST(T ) = γ(T )

1 −β ΔG⧧(T ) e βh

(1)

in which γ(T) is the transmission coefficient, k is the transition state theory rate constant, β = (kBT)−1, where kB is Boltzmann’s constant, h is Planck’s constant, and ΔG⧧(T) is the quasithermodynamic free energy of activation. Computational studies have shown that the dominant factor responsible for the rate enhancement by enzymes is the lowering of ΔG⧧(T) compared with that of the uncatalyzed reaction in water.1,2 Nevertheless, the transmission coefficient γ(T), which can either accelerate or decelerate reactions, is also critical for understanding enzymatic reactions. Thus, to accurately predict the substrates for enzymes, the two key factors must be carefully examined for each candidate substrate. First, the reduction of the quasi-thermodynamic free energy of activation for the chemical transformation is a key step in enzyme catalysis from a physicochemical perspective.21 However, such computations require expensive, high-level quantum-mechanical (QM) simulations and are too slow to allow sufficient screenings of a wide range of substrate candidates even when combined with MM.22 Thus, we tried to use simple MM molecular geometry elements to replace such heavy computation in the present study. TST

1980

DOI: 10.1021/acs.jcim.5b00585 J. Chem. Inf. Model. 2016, 56, 1979−1994

Article

Journal of Chemical Information and Modeling

of which differed over a wide range.25−−38 All of the candidates gathered from the literature were experimentally verified substrates or nonsubstrates of CALB. In order to create a baseline test set, the investigated candidates varied in physicochemical properties. As a result of the high enantioselectivity, the majority of data came from investigations of CALB chiral resolution. For clarity, only substrates with notably high reactivity rates were employed, and they were marked as active. In contrast, candidates with no detectable or extremely low hydrolytic rates were labeled as inactive. Among them, (R)-5-allyl 1-ethyl 2-aminopentanedioate and (R)-1-allyl 5-ethyl 2-aminopentanedioate each had two active ester groups that can be hydrolyzed. All of the other candidates were monoesters. In typical applications of in silico screening, ligand databases generally consist of thousands to millions of compound structures. However, not all of the structures in these databases contain chemical groups such as esters that can be hydrolyzed by the enzyme of interest. Furthermore, not only does MBRD require information about the substrate carbonyl oxygen, but also, assessment by the score function requires information about the ester group atoms. Thus, a proper prescreening procedure is essential. To overcome the drawbacks of manually established configurations for each unique candidate substrate, a utility program was developed to automatically analyze chemical groups from candidate substrate structures, extract key atomic information, and generate docking configuration files accordingly. A recursive algorithm was used to match any predefined chemical group (Tripos mol2 format), such as esters in the given compound structure, according to atom type and topological connection. Additional information for the chemical group is written into the docking configuration files as comments, which are used in substrate assessment by the score function. Details of inputs for prescreening can be found in the Supporting Information (Text S5). MBRD. In principle, standard simulation approaches using a classical force field (MM) can only simulate a stable structure corresponding to a local minimum on the potential energy surface, whereas the reactant transition state is always associated with a first-order saddle point on the potential energy surface during a reaction process. Hence, molecular docking using a classical force field cannot directly simulate a transition state. Furthermore, MM cannot simulate the processes of covalent bond breaking or formation or their transition states because it cannot accurately describe electronic structures. Conversely, a NAC can be well-described; it is an approximation of the groundstate conformation of the reactant that can directly convert to the transition state. In CALB, the NAC takes the form of a tetrahedral intermediate (TI) (Figure 2). Hence, effective sampling of the NAC in the complex of CALB and the candidate substrate is the pivot point of our substratescreening platform. Because molecular docking programs are designed to efficiently sample the overall minimum of the complex between the protein receptor and the ligand, the docking score function prefers strong binding for inhibitors and target receptors in drug development. However, the NAC of the substrate is not at an overall minimum in most circumstances. Thus, we introduced the MBRD procedure to constrain the sample space within the subspace around the NAC to improve the sampling efficiency. It should be pointed out that the only purpose of performing such types of MBRD simulation is to rule out irrelevant binding poses and focus on sampling of the local minima of docking for candidate substrates in the subspace around the NAC of the enzymatic reaction. Subsequently,

energy, distance, angle, and dihedral angle to identify substrates. Against a library of 61 known substrates and 35 nonsubstrates of CALB, a considerably high positive rate of 93.4% was achieved, and the screening speed reached 103 compounds/day (96 CPU cores, 100 samples/compound). Additional verification with a distinct enzyme, nitrilase Nit6803, was performed to validate the universality of our method. For 12 substrates and 12 nonsubstrates that were experimentally confirmed, the positive rate reached 91.7%.



RESULTS AND DISCUSSION Workflow of the Substrate-Screening Platform. The goal of our design was to establish a general screening platform for common use in enzyme studies. The workflow of the in silico substrate-screening system is shown in Figure 1. The entire

Figure 1. Workflow of the substrate prediction platform, THEMIS. The workflow starts with either a crystal structure or homology model of the target enzyme. The catalytic mechanism of the enzyme, including the active residues, binding pocket, and catalysis pathway, must be known before any screening. The ligand database is composed of the compound structures to investigate. The prescreening procedure analyzes the reactive groups in the ligands and extracts essential information for substrate prediction. The ligand docking procedure is processed in parallel with the message-passing interface protocol. The results are automatically analyzed by utilities in THEMIS.

blueprint was derived from the relationship between the substrate specificity of the enzyme and the enzyme structure; in other words, the function is determined by the structure. It starts with a suitable structure of an enzyme obtained either by download from the Protein Data Bank (PDB) or by homology modeling, followed by rigorous analysis of the enzyme mechanism. Important information is extracted and formed into proper settings; CALB will be used to demonstrate details of this process later in this paper. To identify substrates, the score function is composed of MM criteria derived from the NAC, such as energy, distances, angles, and dihedral angles. From then on, THEMIS takes over all of the miscellaneous tasks, automatically conducting highperformance computational investigations of substrate screening. Substrate Candidates and Prescreening. Ninety-four ester compounds were investigated in this study, two substituents 1981

DOI: 10.1021/acs.jcim.5b00585 J. Chem. Inf. Model. 2016, 56, 1979−1994

Article

Journal of Chemical Information and Modeling

number (102) of parallel docking results. Simulations of enzymes are commonly only used as a theoretical explanation of experimental data, where the reproducibility of the results is not of concern. However, we consider that the reproducibility of the results is the most important feature of our system. Thus, we enlarged the sampling number to control the contingency of MBRD, just as Klepsch and colleagues did in their investigations of inhibitors of P-glycoprotein.51 The other important thing is that the vast amount of data are automatically processed by Python scripts rather than manually, which not only minimizes the artificial impact on the raw data but also improves the screening speed.52 At the end of MBRD, an appropriate number of candidate substrate conformations should be acquired. Obviously, if not a single pose can be found, the potential for being a substrate is very limited. However, all of the candidate substrates successfully generated 100 poses as expected. To distinguish active candidates, further evaluation was needed. Assessment Criteria. To accurately evaluate the possibility of chemical transformation for CALB substrates, the score function has to integrate effective MM geometry assessment criteria for the NAC. First, distances between key atoms in the substrate and enzyme catalytic residues were reported to be reasonable criteria for predicting the enantioselectivity or stereoselectivity of enzymes.53 Schulz et al.54 established a simple model for Pseudomonas cepacia lipase in which the difference between the distances D(HNε−Oα) between the catalytic histamine side chain and the alcohol oxygen in the preferred and nonpreferred enantiomers was correlated with the enantioselectivity. Another simple quantitative model of Candida rugosa lipase revealed the same phenomenon.55 Thus, the first assessment criterion adopted is the distance D1 between the imidazole hydrogen atom (HNε) of His224 and the oxygen atom (Oα) of the alcohol part of the candidate substrate ester (Figure 2). This distance represents the abstraction of the eventful hydrogen bond between His224 and the substrate Oα. Therefore, the smaller D1 is, the greater the possibility that a hydrolytic reaction will take place. In addition to D1, a second distance criterion, the distance D2 between the Oγ atom of Ser105 and the Cα atom of the carbonyl group of the candidate substrate, was also employed (Figure 2). This distance is a measure of the feasibility of the nucleophilic attack on the planar carbonyl group by the active Ser105 Oγ atom that is the trigger for hydrolytic reaction. As with D1, a smaller D2 indicates a greater possibility that a hydrolytic reaction will take place. The third criterion used in this study is the torsion angle (T) defined in Figure 3. On the basis of Kazlauskas’s rule, which states that the size of substrate substituents at the stereocenter affects

Figure 2. Key residues of CALB, which include the catalytic triad (Asp187-His224-Ser105) and the oxyanion hole (Thr40 and Gln106). The red dashed lines represent three hydrogen bonds between the substrate and the enzyme oxyanion hole that stabilize the substrate during the reaction. The proton transfer between His224 and Ser105 is represented with a pink arrow. The green arrow shows the attack on the carbonyl group of the substrate initiated by Ser105 Oγ (the nucleophile). The nucleophilic attack prompts the breaking of the carbon−oxygen π bond, resulting in the formation of a tetrahedral intermediate, which promptly collapses and releases as product.

SP-CALB is designed to evaluate the potential activities of candidate substrates using the geometric parameters of their corresponding conformations. Besides the NAC model used in the present study, some investigators have implemented a number of explorations of and attempts at in silico prediction of substrates, in which covalent docking is a promising research direction. Since the formation of a covalent bond with the target to achieve irreversible inhibition is essential for a number of successful drugs, covalent docking has been used to successfully describe covalent interactions between inhibitors and biological targets,47 and this technique is also available for the prediction of substrates. Recently, some fruitful investigations of substrate prediction using covalent docking have been reported.48−50 Different from these efficient investigations, our study adopted the NAC model and tried to predict substrates according to the specific recognition between the substrates and enzymes. We tried to explore the ability of molecular docking to predict more broad and common substrates that do not possess a clear reaction path with covalent bonds. A comparative study with the current method is discussed in a later section. In CALB, three hydrogen bonds between the carbonyl oxygen of the substrate and an oxyanion hole consisting of enzyme residues Gln106 and Thr40 are believed to form before the substrate can be hydrolyzed (Figure 2). Thus, these hydrogen bonds can form a natural constraint condition to sample the NAC. The hydrogen bonds were simplified down to three groups of pseudoharmonic potential energies, while the bond angles were ignored in the docking procedure. The potential energy takes the form E = kx2, where x is the difference between the distance and the closest constraint bound and k is the force constant. Each hydrogen bond was restricted to a lower bound of 2.5 Å and an upper bound of 3.5 Å with k = 1000 in the following simulation. Because of the triangular pyramid consisting of the oxyanion hole and the carbonyl oxygen, the bond length is able to restrict the substrate to the proper position. Therefore, the bond angles of these hydrogen bonds were ignored in this study. Another important improvement to enhance the accuracy in this study is that we used statistical data based on a sufficient

Figure 3. Definition of the torsion angle T in projection mode. This torsion angle was defined to distinguish different orientations of ligands. The two planes share a common axis (green) composed of the carboxylic oxygen and Cα. The angle can be measured by the angle between the reference atom (Cβ of Gln106, black) and the hydroxyl oxygen (gray). 1982

DOI: 10.1021/acs.jcim.5b00585 J. Chem. Inf. Model. 2016, 56, 1979−1994

Article

Journal of Chemical Information and Modeling the selectivity of the lipase,56 we derived T as a quantifiable criterion that can be easily measured. The CALB pocket can be divided into two smaller pockets, and the substituents of the substrate candidate interact with each of them (Figure 4).

was made among standard molecular docking, MBRD, and covalent docking by docking of 10 compounds with known activities into CALB (Figure 5). In standard molecular docking and MBRD, for each candidate, 100 rounds of molecular docking toward the CALB receptor were performed using identical parameters except for the three distance restrictions between the carbonyl oxygen of the substrate and the oxyanion hole consisting of enzyme residues Gln106 and Thr40 as presented in Figure 2. For covalent docking, the docking configuration was the same as for standard molecular docking except that the covalent bond between the oxygen atom of Ser105 (Oγ) and the substrate carbonyl carbon atom was fixed during the docking with the GOLD program. This covalent docking protocol followed the work of Liu et al.57 The results showed that the docking score for MBRD is slightly smaller than that for standard docking (the bigger the better), while the distances in MBRD are much smaller, especially in the substrates. The origin of this difference is that the standard docking procedure performs global optimization of the free energy of binding between the ligand and enzyme, which produces a conformational energy minimum regardless of the interactions with the substrate required for catalysis. Therefore, no consistent pattern between active and nonactive compounds could be detected in the standard docking results. On the other hand, MBRD restricts the docking program to focus on sampling conformations around the subspace of the NAC, where the geometric demands for enzyme catalysis could be satisfied. By comparing the data for active candidates to those for the nonactive ones in MBRD, clear patterns could be easily found, where the distances D1 and D2 have a good correlation with the activity. This phenomenon demonstrates that MBRD is better at mimicking the NAC than standard molecular docking. Moreover, MBRD produces equal degrees of docking scores according to the Boltzmann equation, which means that in equilibrium substrate binding these conformations have probabilities similar to those of standard docking. However, all of the covalent docking results showed negative scores (the bigger the better), which correspond to poor binding of the candidates. Moreover, the distance D1 in covalent docking did not present a clear correlation with the activity either (D2 is the length of the covalent bond, which was fixed here and thus is not included in Table 1). The reason for the dissatisfying results may be that the constraint of the covalent bond is too strong for the current

Figure 4. The CALB catalytic pocket. The blue surface represents the pocket of CALB, which is able to bind and orient the substrate. The substrate, (R)-3-methylbutan-2-yl isobutyrate, is shown in the binding pocket in stick representation, with the atoms of the substrate colored according to their atom types (hydrogen, white; oxygen, red; carbon, cyan). The red mesh shows the atom surface of the substrate. The commonly accepted explanation for the lipase catalytic selectivity is that there must be a match between the enzyme pocket and the substrate’s two substituents.

To evaluate the proper orientation of the ester group in the NAC, T is defined by the dihedral angle between the ester group and a reference atom, Cβ of Gln106. In this way, the orientation difference determined by two substituents is effectively detectable by T. Performance of MBRD. To verify the performance improvement due to the application of MBRD, a comparison

Figure 5. Structures of the 10 compounds used to verify the improvement of MBRD compared with the standard docking procedure. The data for these candidates were taken from ref 25. 1983

DOI: 10.1021/acs.jcim.5b00585 J. Chem. Inf. Model. 2016, 56, 1979−1994

Article

Journal of Chemical Information and Modeling Table 1. Validation of MBRD regular docking

MBRD

covalent docking

compd no.

active?

Gold score

D1

D2

Gold score

D1

D2

Gold score

D1

1 2 3 4 5 6 7 8 9 10

yes no yes no yes no yes no yes no

31.61 34.25 31.97 32.71 28.56 29.80 36.62 43.43 36.62 37.07

3.844 3.696 3.68 3.905 3.6 3.379 5.144 6.909 5.144 4.777

3.133 2.846 3.917 3.176 3.377 2.621 4.664 6.707 4.664 4.163

26.33 31.65 28.08 29.92 24.50 29.16 35.66 35.93 29.01 35.70

3.228 3.562 2.942 3.575 2.948 3.450 3.018 3.249 2.980 3.667

2.381 2.540 2.234 2.461 2.247 2.479 2.291 2.351 2.213 2.485

−33.12 −30.19 −20.14 −16.53 −26.27 −18.92 −7.94 −5.67 −35.56 −21.79

3.197 2.979 2.434 2.536 3.240 2.717 2.746 2.778 2.541 2.730

Thus, it was ignored as an option for an assessment criterion in the remainder of the study. The ROC curves (Figure 7) were also used to evaluate the abilities of these criteria to independently predict substrates. As already indicated, the performances of the criteria were found to be in the following order: torsion angle T (AUC = 0.92), distance D2 (AUC = 0.92), distance D1 (AUC = 0.80), and docking score (AUC = 0.50). In fact, the prediction efficiency of the docking score was virtually random. The stability and repeatability were verified by histogram and probability plots of standard deviations as shown in Figure 8. The majority of standard deviations (90%) for D1 were below 0.04 Å, which is less than the average value of D1 by 2 orders of magnitude (Figure 8A). Meanwhile, D2 shows an even smaller standard deviation, 90% of which is below 0.015 Å, which is 2 orders of magnitude less than the average value of D2 (Figure 8B). Compared with D1 and D2, the torsion angle T is less stable (Figure 8C), the standard deviation of which is less than 5° (90%). Since the difference between the torsion angles of substrates and nonsubstrates (Figure 6C) is much larger (more than 10° for 95% of the candidates), indicating the high discriminatory performance of T, it is sufficiently stable in most circumstances. Furthermore, exhaustive sampling of the robustness of all of the criteria will be improved to meet the requirements of practical use. In order to integrate the merits of the three criteria, their combined use was investigated (Figure 9). The xy scatter plot of D1 and D2 shows a near-linear distribution (Figure 9A). As expected, most of the substrates could be successfully distinguished by D1 and D2, but some failed because of the indistinct border between substrates and nonsubstrates. In contrast, the scatter plot of torsion angle versus either distance showed better distinguishability (Figure 9B,C). Finally, the 3D scatter plot shows that combining the three criteria resulted in a quite high efficiency of activity prediction (Figure 9D). It should be pointed out that although the torsion angle T was the most efficient criterion in this study, D1 and D2 were still employed in the final score function because the above performance comparison is data-dependent and D1 and D2 have more intuitive relevance to the catalytic mechanism than T. Moreover, the torsion angle itself does not require that atoms are close to each other. Thus, when the substrate is not in the enzyme active site, on the face of it, the value of T may still seem reasonable. However, it says nothing about the interaction between the enzyme and substrate in that situation. On the other hand, the distances are quite straightforward in their physical meaning, which makes them much more reliable. When they diverge from their normal value ranges, something has very likely

system, which leads to poor binding and unsuitable conformations. In summary, by measurement of the key interactions between the candidate compound and enzyme in the complex produced by MBRD, the possibility of an enzyme-catalyzed reaction could be successfully estimated. The torsion angle T is not included in Table 1 because it is dependent on the geometric location of the ligand, which means that it is not comparable without distance restrictions. Performance, Stability, and Repeatability of Criteria. Before the form of the score function was established, independent testing and cross-testing had to be performed to confirm the effectiveness of these three criteria. To circumvent the uncertainty of sampling by molecular docking, we exhaustively sampled the pose space by generating 100 docking poses for each ligand. To verify the robustness, five parallel trials were run to assess the stability and repeatability of our method. Statistical data for the assessment criteria from all poses were analyzed upon exhaustive sampling. The average values of parallel trials were used at the end of our study, where each final prediction is based on the results for 500 samples. The discriminatory abilities of the three criteria and the docking score (Gold score) are presented as box charts in Figure 6 and receiver operating characteristic (ROC) curves in Figure 7. From Figure 6A it can be seen that D1 ranged from 2.7 to 4.3 Å, which is slightly greater than the sum of the atomic van der Waals radii of H (1.10 Å) and N (1.55 Å) .58 The reactive and nonreactive candidates showed different distributions, with the distances for the nonreactive candidates being greater than those for the substrates, as expected. However, the probability region between 50% and 95% for substrates overlapped with the main body of nonsubstrates. As a result, only substrates with D1 less than 3.0 Å can be confirmed independently by D1. A similar situation can be seen for D2 in Figure 6B, in that only D2 values smaller than 2.3 Å can be independently confirmed as substrates. Although the majorities of the two groups had small overlap regions, only a very few regions were injective. Compared with the distance assessment criteria, the torsion angle T produced a distribution very close to that of one-to-one mapping, providing a significant improvement in the distinguishability. As shown in Figure 6C, substrates tend to have smaller angles, most of which are below 90° and 75% of which are below 60°. In contrast, the majority of nonsubstrates are distributed above 90°, and 75% are above 130°. The overlapping region of the torsion angle distributions of the substrate and nonsubstrate was less than 5%. Furthermore, the docking score (Gold score) was not valid for distinguishing substrates in this study, as no direct difference between the distributions could be detected (Figure 6D). 1984

DOI: 10.1021/acs.jcim.5b00585 J. Chem. Inf. Model. 2016, 56, 1979−1994

Article

Journal of Chemical Information and Modeling

Figure 6. Box plots showing the discriminatory abilities of the criteria: (A) D1 (HNε−Oα distance); (B) D2 (Oγ−Cα distance); (C) torsion angle T; (D) docking score (Gold score). The boxes are determined by the 25th, 50th, and 75th percentiles, and the whiskers are determined by the fifth and 95th percentiles. Minimum and maximum values are marked as crosses, and average values are presented as small squares. The boxes are colored by the activities of the candidate substrates: red for nonreactive candidates and blue for active candidates.

D2, and T and provide a quantitative standard for the prediction of substrates and nonsubstrates. It is derived from the Euclidean space norm (distance) and takes the form ⎫ ⎧ ⎪ ⎪ D12 D2 2 [1 − cos(T )]2 ⎬ × w5 w Score = ⎨ + + − 4 ⎪ ⎪ w2 w3 ⎭ ⎩ w1 (2)

The torsion angle is mapped to a monotonically increasing value over the interval from 0 to 2 with a triangle function. The ideal torsion angle for a substrate is considered to be below 90°, corresponding to the interval from 0 to 1. Three constants (w1, w2, w3) are used to normalize the inputs. The default values w1 = 16 and w2 = 9 are equal to the rounded-off squares of the sums of van der Waals radii for the atom pairs in D1 and D2 (H, 1.10 Å; O, 1.52 Å; N, 1.55 Å) .58 The default value w3 = 4 is derived from the maximum value of the cosine function. The constant w4 = 1.3 is designed to adjust the range of final scores to be distributed asymmetrically so that its meaning can be easily understood by different signs, and w5 = 100 is a scalar to amplify score differences for perceptibility. With these default values, substrates can be indicated by negative scores. The performance of the SP-CALB score is presented in Figure 10. A majority of candidates (89 out of 96) were successfully predicted, and only seven failed. As shown in Tables 2 and 3,

Figure 7. ROC curves (sensitivity vs 1-specificity) for the three criteria and the docking score. The areas under the curves (AUCs) are 0.8016 for D1, 0.9274 for D2, 0.9522 for T, and 0.5016 for the Gold score.

gone wrong. Thus, both distances were used in the following score function. Score Function and Prediction Result. The SP-CALB score is designed to integrate the merits of the three criteria D1, 1985

DOI: 10.1021/acs.jcim.5b00585 J. Chem. Inf. Model. 2016, 56, 1979−1994

Article

Journal of Chemical Information and Modeling

Figure 8. Stability and repeatability of the criteria. The standard deviations for (A) D1, (B) D2, and (C) the torsion angle T are represented by histogram and probability plots.

three cases out of 96 were false-positive hits (3.1%) and four cases were false-negative hits (4.1%). The high accuracy rate (92.7%) and high true-positive rate (93.4%) guarantee not only the feasibility but also the availability of the SP-CALB score. Prediction Error. Before the analysis of the prediction errors is discussed, it is worth noting that the flexibility of CALB was ignored in this study. Although the flexibility of the enzyme can significantly affect the accuracy of prediction, CALB has been confirmed to undergo negligible structural changes upon binding of the substrate (PDB entry 1LBS). In fact, we tried to introduce molecular dynamics (MD) into our protocol to fully investigate the flexibility of CALB. However, this weakened rather than improved every aspect of the prediction. Thus, we considered that although MD brought flexibility to the enzyme active pocket,

it also brought dislocation of important residues, which reduced the prediction accuracy. It should be pointed out that although for CALB extensive measures to model the flexibility could be skipped, this may not be the case with other enzymes. Although several improvements have been made to the prediction protocol, there are still seven failed cases (Figure 11). According to the molecular formula, four of them contain analogues of pyrrole (3-pyrroline or pyrrolidine) directly connected to the ester bond. It is likely that pyrrole ring systems are not handled properly in the current protocol. The problem with the pyrrole ring may come from unfavorable interactions between the atoms of the pyrrole ring and the side-chain atoms of the enzyme pocket residues. Ideally, a cocrystal structure of a pyrrole-ring substrate and CALB could provide a good 1986

DOI: 10.1021/acs.jcim.5b00585 J. Chem. Inf. Model. 2016, 56, 1979−1994

Article

Journal of Chemical Information and Modeling

Figure 9. Scatter plots showing the discriminatory abilities of combinations of the three criteria: (A) D1 and D2; (B) D1 and T; (C) D2 and T; (D) all three criteria. Each candidate substrate is represented by a colored dot (red for nonactive, blue for active).

explanation of this problem. Otherwise, some MM methods such as flexible docking or MD may improve the accuracy of prediction by providing the ability to rearrange the side chains of the enzyme pocket residues to fit the pyrrole ring of the substrate. Vinyl propionate is the only non-ring-containing molecule among the seven failures. After investigating the details of the docking conformations, we have concluded that the positive SP-CALB score is mainly caused by an improper torsion angle. Further analysis shows that vinyl propionate appears to be quite free to rotate because of the structural symmetry and similarities of interactions in both directions. In contrast, vinyl butanoate performs well, although it differs from vinyl propionate by only one carbon atom. The difference might be due to the break in symmetry resulting from replacing butyric acid with propionic acid. Thus, when the hydrolysis of vinyl propionate is catalyzed by CALB, there might be a balance between the two major conformations that could lead to a transformation, which could not be detected by our system. To improve the prediction of similar molecules, one possible solution would be to use the clustering center value instead of the average value, because the latter produces a meaningless value between the two peaks in the current circumstance. Verification with Nitrilase Nit6803. After successfully testing and verifying our protocol with a lipase, we developed a

Figure 10. Prediction results for the SP-CALB score. Ninety-six candidate substrates were predicted by the SP-CALB score. The experimental activities are shown by different colors (red for nonactive, green for active). The columns are generated by the SP-CALB score values. The candidates with negative scores are predicted to be substrates and those with positive scores to be nonsubstrates. 1987

DOI: 10.1021/acs.jcim.5b00585 J. Chem. Inf. Model. 2016, 56, 1979−1994

Article

Journal of Chemical Information and Modeling Table 2. Confusion Matrix for Validation of the Prediction Methoda substrate nonsubstrate total correct

substrate TP FP TP + FP precision = TP/(TP + FP)

nonsubstrate FN TN FN + TN

total TP + FN FP + TN TP + TN + FP + FN

correct sensitivity = TP/(TP + FN) specificity = TN/(TN + FP) accuracy = (TP + TN)/(TP + TN + FP + FN)

a

The performance of the proposed method of this study was fully validated using the following measurements: precision, sensitivity, specificity, and accuracy. The table shows the confusion matrix used, where TP = true positive, FP = false positive, TN = true negative, and FN = false negative.

Table 3. Confusion Matrix for the Prediction Method substrate nonsubstrate total correct

substrate 57 4 61 93.4%

nonsubstrate 3 32 35

total 60 36 96

correct 95.0% 88.9% 92.7%

program called THEMIS to feature utilities that implement the automated workflow of substrate screening (MBRD). Scripts written in Python implement coarse-grained parallelism via the message-passing interface, which can accelerate the speed of calculation by distributing computation-intensive docking tasks (Gold v4.1259) within clusters of computers. In the test with CALB, a speed of 103 compounds/day was achieved with a cluster of 96 CPU cores (Xeon X5650). To verify the utility of our method with different enzymes, another substrate-screening experiment using a distinct enzyme, nitrilase, was undertaken using THEMIS. Nitrilases catalyze the hydrolysis of organic nitriles to the corresponding carboxylic acids, some of which are important intermediates in the chemical synthesis of various products.60 A nitrilase from Cyanobacterium syechocystis sp. PCC6803 (Nit6803) was used in the current study. The crystal structure of Nit6803 that was determined in our previous work61 is the world’s second crystal structure of nitrilase, and there are only two nitrilase structures in the PDB to date. In contrast with the first nitrilase that was crystallized,62 Nit6803 has a very broad substrate spectrum and much higher catalytic activities, making it more feasible for industrial applications and theoretical investigations. Following the same routine as for CALB, a thorough analysis of the mechanism was initially performed.63 The form of the transition state in the hydrolytic reaction with nitrile substrates catalyzed by Nit6803 is presented in Figure 12. On the basis of key interactions between the enzyme and the substrate, constraint conditions and evaluation factors could be abstracted according to the NAC theory. In the current study, the distance between the cyano nitrogen of the substrate and the ε-amino nitrogen of Lys135 was used as a

Figure 12. Mechanism of Nit6803-catalyzed hydrolysis of nitrile substrates. The catalytic triad of Nit6803 consists of Glu53, Lys135, and Cys169. The figure presents the first step of the hydrolytic reaction according to ref 63. The reaction takes place via a thioimidate intermediate (Cys substrate), and a water molecule plays an important role in the reaction. During the reaction, Lys135 stabilizes the substrate by providing a hydrogen bond, and Glu53 stabilizes the transition state by interacting with the positive charge on the cyano nitrogen of the substrate.

restriction that filters positive NAC-like conformations during docking. This distance restriction mimics the essential hydrogen bond that stabilizes the substrate at the beginning of the reaction. Following stabilization of the substrate, a water-bridged hydrogen bond is formed to promote the chemical transformation. Since it is relatively difficult to mimic the water-bridged three-body (substrate, water, and enzyme) hydrogen-bonding network in docking simulations, the water is assumed to be able to freely participate in the hydrogen-bonding interaction in the substrate− enzyme complex as long as there is a suitable gap for it to implant. This very gap was the criterion used in the study, and it was represented by the distance between the cyano nitrogen of the substrate and the carboxylic acid carbon of Glu53. If the gap is too big, then the water will not be able to connect the substrate and enzyme, preventing the enzymatic reaction from taking place. This gap’s maximum length could be estimated by the sum of the bond length of O−H in water, the length of the hydrogen bond between the water and the substrate, and length of the hydrogen

Figure 11. Prediction errors. Seven compounds failed in the prediction. According to the experiment results, compounds 1, 2, and 3 are substrates (false negatives) and compounds 4, 5, 6, and 7 are nonsubstrates (false positives). 1988

DOI: 10.1021/acs.jcim.5b00585 J. Chem. Inf. Model. 2016, 56, 1979−1994

Article

Journal of Chemical Information and Modeling

database could be deduced on the basis of this difference. For example, a binary classification by threshold (4.5 Å) could accomplish a correct rate of 91.7% (22/24) in the current data set. Since the threshold of 4.5 Å is only an optimal value estimated from a relatively small group of candidates, further investigations using other substrates of Nit6803 may be needed to adjust this threshold depending on the state. Verification with SDR Gox2181. Short-chain dehydrogenase/reductases (SDRs) consist of a large superfamily of NAD(P)H-dependent oxidoreductases, with over 47 000 members available in the sequence database. However, most of these are distantly related, with typically 20−30% residue identity in pairwise comparisons.68 Thus, it is difficult for sequence-based methods to predict the detailed activities of this superfamily. Meanwhile, there are more than 300 crystal structures of SDRs deposited in the PDB, which makes SDR family enzymes more suitable for structure-based activity predictions. Here the SDR family enzyme Gox2181 (PDB entry 3AWD), whose structure had been determined by our colleagues,69 was used to further validate the generality of our method. Gox2181 was demonstrated to be active toward a wide range of substrates, including sugar alcohols, secondary alcohols, ketones, and ketoses. Therefore, it has good potential for industrial applications.69 As with Nit6803, the mechanism of Gox2181-catalyzed oxidation and reduction reactions was determined initially (Figure 16),70 and then constraint conditions and evaluation factors were abstracted on the basis of key interactions between the enzyme and substrate according to the NAC theory. Since the oxidation and reduction catalyzed by the same enzyme of Gox2181 are reverse reactions, where the core difference is the different protonation state of the coenzyme, the oxidation and reduction substrates are treated identically in the docking setup step, except for the special treatment of the protonation for the coenzyme NAD+ and NADH. In the present study, the distance between the hydroxy/keto/ aldehyde oxygen atom of the substrate and the phenolic oxygen atom of Tyr162, which mimics the essential hydrogen bond that stabilizes the substrate at the beginning of the reaction and initiates the proton transfer between the substrate and Tyr162, was used as a distance restriction that filters positive NAC-like conformations during docking. The evaluation of the potential activities was based on the distance between the nicotinamidering C4 atom of the coenzyme and the hydroxy/keto/aldehyde oxygen atom of the substrate. The smaller the distance is, the better is the activity that the compound might possess. Thirteen substrates and seven nonsubstrates for Gox2181 were obtained from the literature (Figures 17 and 18).69 As with Nit6803, each compound was docked 1000 times against Gox2181. The distances between the nicotinamide-ring C4 atom of the coenzyme and the hydroxy/keto/aldehyde oxygen atom of the substrate were extracted from 20 000 conformations and are plotted with the KDE in Figure 19. The KDE plots revealed a notable difference between the distributions of substrates and nonsubstrates, where the substrates tend to have a centralized distribution within the range of about 3 to 5 Å while the control group (nonsubstrates) mainly dominate the space of longer distances (about 4.5 to 6.5 Å). On the basis of this difference, an effective model of binary classification to separate these substrates could be deduced. For instance, a threshold of 3.2 Å, which is the sum of the van der Waals radii of oxygen (1.7 Å) and carbon (1.5 Å) atoms, could accomplish a correct rate of 95% (19/20) in the current data set. As with Nit6803, the threshold is only an optimal value estimated from a relatively small group of

bond between the water and the enzyme, which is about 6 Å (0.96 Å for the O−H bond in water and 2.5 Å for each hydrogen bond). Since the four atoms of the hydrogen-bonding network are not perfectly collinear in three-dimensional space, the practical distance of the gap is probably smaller to some extent. In general, similar to the case for CALB, the smaller this distance is, the higher is the possibility that the hydrogen-bonding network will be established, and therefore, the better is the activity that the compound might possess. Twelve known nitrile substrates for Nit6803 were obtained from the literature64 (Figure 13), and 12 nonhydrolyzable nitrile

Figure 13. Substrates of Nit6803 from the literature.64

Figure 14. Control group compounds for Nit6803 (nonsubstrates).

compounds that were experimentally determined using identical methods as for ref 64 were used as controls (Figure 14). Each compound was docked 1000 times against enzyme Nit6803. This number of dockings was intentionally increased here to make the kernel density estimation (KDE)65 plot smoother and reduce pseudopeaks in the plot. KDE is better at analyzing a small number of samples because it can represent the overall distribution better than the box plot that was used in previous study of CALB. However, the KDE requires much more computation, limiting the speed of predictions, and thus, it is not as convenient as classical statistical methods in the circumstance of large sample size. The distances between the compounds and the carboxylic acid carbon of Glu53 of the enzyme were extracted from 24 000 conformations and are plotted with the KDEs in Figure 15. The KDE plots reveal that a notable difference between the distributions of substrates and nonsubstrates exists. All of the substrates have a centralized distribution within the region containing distances less than 4.5 Å; however, the control group shows evident distribution only in the zone above 4.5 Å. Thus, an effective model to separate substrates from a chemical 1989

DOI: 10.1021/acs.jcim.5b00585 J. Chem. Inf. Model. 2016, 56, 1979−1994

Article

Journal of Chemical Information and Modeling

Figure 15. Plots of kernel density estimation for (A) substrates and (B) nonsubstrates of Nit6803. The plots were generated by pandas v17.066 using a Gaussian kernel density estimation algorithm.67 KDE is a data analysis technique in statistics. Just like histograms, KDE reveals the distribution pattern of a finite data sample, but it possesses properties such as smoothness or continuity that histograms lacks.65 In the plots, the x axis represents the distance between the cyano nitrogen of the candidate substrate and the carboxylic acid carbon of Glu53 in the 1000 conformations in each compound’s docking results, while the y axis reflects the corresponding probability of the distance based on its kernel density. A notable difference between the distribution of substrates and nonsubstrates exists in the zone below 4.5 Å (marked by the red vertical line). Only two nonsubstrates, (S)-2-(2-chlorophenyl)-2hydroxyacetonitrile and (S)-2-(4-chlorophenyl)-2-hydroxyacetonitrile, have minor distributions there.

Figure 17. Substrates of Gox2181. Figure 16. Mechanism of Gox2181-catalyzed reduction of ketone substrates. Gox2181 has the typical conserved catalytic tetrad of Asn119-Ser147-Tyr162-Lys166 and the NAD-binding pocket of the SDR superfamily. The figure presents only the essential elements required by this study. Catalysis is initiated by the hydrogen bonding, and a proton is transferred between the phenolic oxygen atom of Tyr162 and the keto group of the substrate, followed by a hydride transfer from the nicotinamide-ring C4 atom of NADH to the substrate. Figure 18. Control-group compounds for Gox2181 (nonsubstrates).

candidates, and further investigations on other substrates of Gox2181 will also be needed for adjustment as the case may be.



We have tested the ability of this method to differentiate between substrates and nonsubstrates of a lipase, CALB, using consistently measured experimental data and have carefully discussed all aspects of the substrate prediction. Furthermore, its extensive applicability was successfully demonstrated using two other enzymes, Nit6803 and Gox2181. We expect that our method will

CONCLUSIONS In summary, we have presented an in silico method that can predict enzyme substrates. On the basis of this method, we developed an open-source software platform called THEMIS that could meet the requirements for applicability in enzyme research. 1990

DOI: 10.1021/acs.jcim.5b00585 J. Chem. Inf. Model. 2016, 56, 1979−1994

Article

Journal of Chemical Information and Modeling

Figure 19. Plots of kernel density estimation for (A) substrates and (B) nonsubstrates of Gox2181. The figures were generated by pandas v17.066 using a Gaussian kernel density estimation algorithm.67 In the plots, the x axis represents the distance between the nicotinamide-ring C4 atom of the coenzyme and the hydroxy/keto/aldehyde oxygen atom of the substrate in the 1000 conformations of each compound’s docking results, while the y axis reflects the corresponding probability of the distance based on its kernel density. The red vertical line marks the 3.2 Å threshold. In substrates 4 and 5, only the secondary alcohols were investigated and plotted in the figure. Analogously, in substrates 10 and 11 and nonsubstrate 4, only the keto group was investigated. Only one plot was kept in the figure for the compounds possessing symmetrical alcohol/keto/aldehyde groups.

help accelerate enzyme development by reducing the heavy costs of experimental screening. In this study, molecular docking is the core of the overall method and occupies the majority of the computational time. Although the GOLD docking suite was used in this study, our protocol is quite generalizable. The protocol could easily be reprogrammed to operate with any other docking program, as long as it provides constraint functions. The main advantage of molecular docking is the high efficiency of sampling of different ligand and enzyme complex conformations. With modern computer capability, a standard docking takes only several minutes or less. Among the existing computational methods, it is currently the only available solution for high-throughput screening of substrates. Compared with other methods such as MD simulation, the accuracy of molecular docking may be lower because of the simplified score function and the sampling algorithm. However, the heavy demand for computational resources makes MD as well as QM simulations relatively difficult to use in practical experimental design. The complexity of MD/QM simulations makes them unsuitable to be fully accelerated by high-performance computing platforms because of restrictions in network communication. In contrast, molecular docking is capable of performing large-scale parallel computation against millions of compounds on thousands of CPU cores. In this study, we considered everyday experimental usability as the first priority. Thus, MBRD combined with exhaustive sampling and an MM score function was used. To further improve the usability of our method, the whole workflow has been coded into utilities and scripts written in Python/C. Except for the mechanistic analysis, all of the predictions with massive sets of candidate compounds were able to be accomplished automatically.

Overall, our results suggest that predicting enzyme substrates is promising in computational biology and opens a new door to practical in silico screening of enzyme substrates. On the basis of our method, further tests can be applied with a larger range of enzyme targets that have known structures and mechanisms. For example, other lipases can be studied with similar score functions and very few changes in settings. For other enzymes, the definitions of restrictions and assessment criteria from the abstraction of a totally different enzyme mechanism is an inconvenient process. Although our results demonstrate that in silico screening of substrates could be accomplished by a carefully designed workflow and score function, there is still a significant gap between the activities of our prediction and experimental data. Narrowing this gap presents an exciting prospect for future work.



MATERIALS AND METHODS Preparation of the Free Enzyme. The crystal structure of CALB was obtained from the PDB. PDB entry 1TCA, which has no significant rearrangement but higher resolution than other structures, was used as the starting structure in this study. Hydrogen atoms were added, and two N-acetyl-D-glucosamine moieties along with all of the water molecules were removed. The structures of Nit6803 (PDB entry 3WUY) and Gox2181 (PDB entry 3AWD) were prepared with identical operations. PDB entry 1LBS, which is a complex structure of CALB and a substrate analogue, was used to confirm catalytic residues and the binding pocket. The catalytic amino acid of CALB, His224, was defined as protonated. Preparation of the Potential Substrates. A total of 94 candidate substrates for CALB and 24 candidate substrates for Nit6803 were gathered from the literature, which contains both 1991

DOI: 10.1021/acs.jcim.5b00585 J. Chem. Inf. Model. 2016, 56, 1979−1994

Journal of Chemical Information and Modeling



substrates and nonreactive compounds that were experimentally proved. Three-dimensional structures were built with ChemOffice suite v12. The structures were energy-minimized and stored separately (Mol2 file format). The detailed information and structures of all of the candidate substrates are supplied in Table S1 and File S3, respectively, in the Supporting Information. The substrates of Nit6803 and Gox2181 were prepared using the same method. MBRD. CCDC GOLD v4.12 was used for molecular docking. The enzyme structures were prepared with GOLD HERMES. Residues within 12 Å of the hydroxyl oxygen atom of Ser105 of CALB were defined as the binding site, and residues within the same distance of the ε-amino nitrogen of Lys135 were used in nitrilase Nit6803. The configuration template for docking is supplied in the Supporting Information (Text S6). Early termination was turned off, and therefore, all conformations were saved at the end of docking. The Gold score was used as the score function for docking. The GA parameter was set to very flexible. The script file prepare.py (File S4) was applied in the prescreening, and docking configuration files were automatically generated on the basis of the template. The script file do_docking_mpi.py (File S4) was used to execute docking computations over cluster nodes. Data Collection and Processing. With the help of utilities of THEMIS, geometric information was automatically extracted from the docking results. All of the data mining tasks were achieved by the script file score.py (File S4), which does a traversal visit in the docking results. In this study, Origin v9.0 and Matplotlib were used to analyze and plot data.



REFERENCES

(1) Benkovic, S. J.; Hammes-Schiffer, S. A Perspective on Enzyme Catalysis. Science 2003, 301, 1196−1202. (2) Garcia-Viloca, M.; Gao, J.; Karplus, M.; Truhlar, D. G. How Enzymes Work: Analysis by Modern Rate Theory and Computer Simulations. Science 2004, 303, 186−195. (3) Wolfenden, R.; Snider, M. J. The Depth of Chemical Time and the Power of Enzymes as Catalysts. Acc. Chem. Res. 2001, 34, 938−945. (4) Bolon, D. N.; Voigt, C. A.; Mayo, S. L. De Novo Design of Biocatalysts. Curr. Opin. Chem. Biol. 2002, 6, 125−129. (5) Hilvert, D. Critical Analysis of Antibody Catalysis. Annu. Rev. Biochem. 2000, 69, 751−793. (6) Release Note for Refseq v76. http://www.ncbi.nlm.nih.gov/ refseq/ (accessed May 16, 2016). (7) Braiuca, P.; Ebert, C.; Basso, A.; Linda, P.; Gardossi, L. Computational Methods to Rationalize Experimental Strategies in Biocatalysis. Trends Biotechnol. 2006, 24, 419−425. (8) Dwyer, M. A.; Looger, L. L.; Hellinga, H. W. Computational Design of a Biologically Active Enzyme. Science 2004, 304, 1967−1971. (9) Kazlauskas, R. J. Molecular Modeling and Biocatalysis: Explanations, Predictions, Limitations, and Opportunities. Curr. Opin. Chem. Biol. 2000, 4, 81−88. (10) Shoichet, B. K. Virtual Screening of Chemical Libraries. Nature 2004, 432, 862−866. (11) Hermann, J. C.; Marti-Arbona, R.; Fedorov, A. a.; Fedorov, E.; Almo, S. C.; Shoichet, B. K.; Raushel, F. M. Structure-Based Activity Prediction for an Enzyme of Unknown Function. Nature 2007, 448, 775−779. (12) Zhu, H.; Snyder, M. Protein Chip Technology. Curr. Opin. Chem. Biol. 2003, 7, 55−63. (13) Korkegian, A.; Black, M. E.; Baker, D.; Stoddard, B. L. Computational Thermostabilization of an Enzyme. Science 2005, 308, 857−860. (14) Jiang, L.; Althoff, E. A.; Clemente, F. R.; Doyle, L.; Röthlisberger, D.; Zanghellini, A.; Gallaher, J. L.; Betker, J. L.; Tanaka, F.; Barbas, C. F.; Hilvert, D.; Houk, K. N.; Stoddard, B. L.; Baker, D. De Novo Computational Design of Retro-Aldol Enzymes. Science 2008, 319, 1387−1391. (15) Lutz, S. Reengineering Enzymes. Science 2010, 329, 285−287. (16) Tyagi, S.; Pleiss, J. Biochemical Profiling in Silico Predicting Substrate Specificities of Large Enzyme Families. J. Biotechnol. 2006, 124, 108−116. (17) Clark, M.; Meshkat, S.; Talbot, G. T.; Carnevali, P.; Wiseman, J. S. Fragment-Based Computation of Binding Free Energies by Systematic Sampling. J. Chem. Inf. Model. 2009, 49, 1901−1913. (18) Bornscheuer, U. T. Methods to Increase Enantioselectivity of Lipases and Esterases. Curr. Opin. Biotechnol. 2002, 13, 543−547. (19) Berglund, P.; Christiernin, M.; Hedenström, E. Enantiorecognition of Chiral Acids by Candida Rugosa Lipase: Two Substrate Binding Modes Evidenced in an Organic Medium. ACS Symp. Ser. 2001, 776, 263−273. (20) Gao, J.; Ma, S.; Major, D. T.; Nam, K.; Pu, J.; Truhlar, D. G. Mechanisms and Free Energies of Enzymatic Reactions. Chem. Rev. 2006, 106, 3188−3209. (21) Kraut, J. How Do Enzymes Work? Science 1988, 242, 533−540. (22) König, G.; Hudson, P. S.; Boresch, S.; Woodcock, H. L. Multiscale Free Energy Simulations: An Efficient Method for Connecting Classical Md Simulations to Qm or Qm/Mm Free Energies Using NonBoltzmann Bennett Reweighting Schemes. J. Chem. Theory Comput. 2014, 10, 1406−1419. (23) Uppenberg, J.; Hansen, M. T.; Patkar, S.; Jones, T. A. The Sequence, Crystal Structure Determination and Refinement of Two Crystal Forms of Lipase B from Candida Antarctica. Structure 1994, 2, 293−308. (24) Uppenberg, J.; Ö hrner, N.; Norin, M.; Hult, K.; Kleywegt, G. J.; Patkar, S.; Waagen, V.; Anthonsen, T.; Jones, T. A. Crystallographic and Molecular-Modeling Studies of Lipase B from Candida Antarctica Reveal a Stereospecificity Pocket for Secondary Alcohols. Biochemistry 1995, 34, 16838−16851.

ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.5b00585. Compounds used for substrate screening of CALB and their original references (Table S1); complete table for the validation of MBRD (Table S2); input file for preprocessing of ester, cyano, hydroxy, and keto groups (Text S5); and template for Gold docking configuration (Text S6) (PDF) Structures of all compounds used for substrate screening, in both 2D and 3D formats (File S3) (ZIP) THEMIS software with demo (File S4) (ZIP)



Article

AUTHOR INFORMATION

Corresponding Authors

*E-mail: [email protected]. *E-mail: [email protected]. Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS Z.Y., L.Z., B.G., D.C., F.W., and D.W. received funding from the National Natural Science Foundation of China (31571786), the National Basic Research Program of China (973 Program) (2012CB721003), the National High Technology Research and Development Program of China (2012AA020403), the Natural Science Foundation of Shanghai (16ZR1449500), the Open Funding Project of the State Key Laboratory of Bioreactor Engineering, and the Special Program for Applied Research on Super Computation of the NSFC−Guangdong Joint Fund (the second phase). 1992

DOI: 10.1021/acs.jcim.5b00585 J. Chem. Inf. Model. 2016, 56, 1979−1994

Article

Journal of Chemical Information and Modeling (25) Raza, S.; Fransson, L.; Hult, K. Enantioselectivity in Candida Antarctica Lipase B: A Molecular Dynamics Study. Protein Sci. 2001, 10, 329−338. (26) Léonard, V.; Fransson, L.; Lamare, S.; Hult, K.; Graber, M. A Water Molecule in the Stereospecificity Pocket of Candida Antarctica Lipase B Enhances Enantioselectivity Towards Pentan-2-Ol. ChemBioChem 2007, 8, 662−667. (27) Conde, S.; López-Serrano, P.; Martínez, A. Candida Antarctica Lipase B Catalysed Amidation of Pyroglutamic Acid Derivatives. A Reaction Survey. J. Mol. Catal. B: Enzym. 1999, 7, 299−306. (28) Hansen, T. V.; Waagen, V.; Partali, V.; Anthonsen, H. W.; Anthonsen, T. Co-Solvent Enhancement of Enantioselectivity in LipaseCatalysed Hydrolysis of Racemic Esters. A Process for Production of Homochiral C-3 Building Blocks Using Lipase B from Candida Antarctica. Tetrahedron: Asymmetry 1995, 6, 499−504. (29) Lou, W.-Y.; Zong, M.-H.; Liu, Y.-Y.; Wang, J.-F. Efficient Enantioselective Hydrolysis of D,L-Phenylglycine Methyl Ester Catalyzed by Immobilized Candida Antarctica Lipase B in Ionic Liquid Containing Systems. J. Biotechnol. 2006, 125, 64−74. (30) Garcia-Urdiales, E.; Rebolledo, F.; Gotor, V. Kinetic Resolution of (±)-1-Phenylbutan-1-Ol by Means of Calb-Catalyzed Aminolyses: A Study on the Role of the Amine in the Alcohol Resolution. Adv. Synth. Catal. 2001, 343, 646−654. (31) Cuiper, A. D.; Kouwijzer, M. L.; Grootenhuis, P. D.; Kellogg, R. M.; Feringa, B. L. Kinetic Resolutions and Enantioselective Transformations of 5-(Acyloxy) Pyrrolinones Using Candida Antarctica Lipase B: Synthetic and Structural Aspects. J. Org. Chem. 1999, 64, 9529−9537. (32) García-Urdiales, E.; Rebolledo, F.; Gotor, V. Enzymatic One-Pot Resolution of Two Nucleophiles: Alcohol and Amine. Tetrahedron: Asymmetry 2000, 11, 1459−1463. (33) Jacobsen, E. E.; Hoff, B. H.; Anthonsen, T. Enantiopure Derivatives of 1,2-Alkanediols: Substrate Requirements of Lipase B from Candida Antarctica. Chirality 2000, 12, 654−662. (34) Ottosson, J.; Hult, K. Influence of Acyl Chain Length on the Enantioselectivity of Candida Antarctica Lipase B and Its Thermodynamic Components in Kinetic Resolution of Sec-Alcohols. J. Mol. Catal. B: Enzym. 2001, 11, 1025−1028. (35) Ö hrner, N.; Orrenius, C.; Mattson, A.; Norin, T.; Hult, K. Kinetic Resolutions of Amine and Thiol Analogues of Secondary Alcohols Catalyzed by the Candida Antarctica Lipase B. Enzyme Microb. Technol. 1996, 19, 328−331. (36) Palomo, J. M.; Mateo, C.; Fernández-Lorente, G.; Solares, L. F.; Diaz, M.; Sánchez, V. M.; Bayod, M.; Gotor, V.; Guisan, M.; FernandezLafuente, R. Resolution of (±)-5-Substituted-6-(5-Chloropyridin-2-Yl)7-Oxo-5,6-Dihydropyrrolo[3,4b]Pyrazine Derivatives-Precursors of (S)-(+)-Zopiclone, Catalyzed by Immobilized Candida Antarctica B Lipase in Aqueous Media. Tetrahedron: Asymmetry 2003, 14, 429−438. (37) Lee, Y. S.; Hong, J. H.; Jeon, N. Y.; Won, K.; Kim, B. T. Highly Enantioselective Acylation of Rac-Alkyl Lactates Using Candida Antarctica Lipase B. Org. Process Res. Dev. 2004, 8, 948−951. (38) Thodi, K.; Barbayianni, E.; Fotakopoulou, I.; Bornscheuer, U. T.; Constantinou-Kokotou, V.; Moutevelis-Minakakis, P.; Kokotos, G. Study of the Removal of Allyl Esters by Candida Antarctica Lipase B (Cal-B) and Pig Liver Esterase (Ple). J. Mol. Catal. B: Enzym. 2009, 61, 241−246. (39) Chen, B.; Hu, J.; Miller, E. M.; Xie, W.; Cai, M.; Gross, R. a. Candida Antarctica Lipase B Chemically Immobilized on EpoxyActivated Micro- and Nanobeads: Catalysts for Polyester Synthesis. Biomacromolecules 2008, 9, 463−533. (40) Hedstrom, L. Serine Protease Mechanism and Specificity. Chem. Rev. 2002, 102, 4501−4524. (41) Ishida, T.; Kato, S. Theoretical Perspectives on the Reaction Mechanism of Serine Proteases: The Reaction Free Energy Profiles of the Acylation Process. J. Am. Chem. Soc. 2003, 125, 12035−12048. (42) Fischer, E. Synthesen in Der Zuckergruppe Ii. Ber. Dtsch. Chem. Ges. 1894, 27, 3189−3232. (43) Armstrong, E. F. Enzymes. J. Soc. Chem. Ind., London 1930, 49, 919−920.

(44) Díaz, N.; Suárez, D.; Suárez, E. Kinetic and Binding Effects in Peptide Substrate Selectivity of Matrix Metalloproteinase-2: Molecular Dynamics and Qm/Mm Calculations. Proteins: Struct., Funct., Genet. 2010, 78, 1−11. (45) Balaji, B.; Ramanathan, M. Prediction of Estrogen Receptor Beta Ligands Potency and Selectivity by Docking and Mm-Gbsa Scoring Methods Using Three Different Scaffolds. J. Enzyme Inhib. Med. Chem. 2012, 27, 832−875. (46) Kitchen, D. B.; Decornez, H.; Furr, J. R.; Bajorath, J. Docking and Scoring in Virtual Screening for Drug Discovery: Methods and Applications. Nat. Rev. Drug Discovery 2004, 3, 935−949. (47) Scholz, C.; Knorr, S.; Hamacher, K.; Schmidt, B. Docktitea Highly Versatile Step-by-Step Workflow for Covalent Docking and Virtual Screening in the Molecular Operating Environment. J. Chem. Inf. Model. 2015, 55, 398−406. (48) Dong, G. Q.; Calhoun, S.; Fan, H.; Kalyanaraman, C.; Branch, M. C.; Mashiyama, S. T.; London, N.; Jacobson, M. P.; Babbitt, P. C.; Shoichet, B. K.; Armstrong, R. N.; Sali, A. Prediction of Substrates for Glutathione Transferases by Covalent Docking. J. Chem. Inf. Model. 2014, 54, 1687−1785. (49) London, N.; Farelli, J. D.; Brown, S. D.; Liu, C.; Huang, H.; Korczynska, M.; Al-Obaidi, N. F.; Babbitt, P. C.; Almo, S. C.; Allen, K. N.; Shoichet, B. K. Covalent Docking Predicts Substrates for Haloalkanoate Dehalogenase Superfamily Phosphatases. Biochemistry 2015, 54, 528−537. (50) Olsen, L.; Oostenbrink, C.; Jorgensen, F. S. Prediction of Cytochrome P450 Mediated Metabolism. Adv. Drug Delivery Rev. 2015, 86, 61−71. (51) Klepsch, F.; Chiba, P.; Ecker, G. F. Exhaustive Sampling of Docking Poses Reveals Binding Hypotheses for Propafenone Type Inhibitors of P-Glycoprotein. PLoS Comput. Biol. 2011, 7, e1002036. (52) Kumalo, H. M.; Bhakat, S.; Soliman, M. E. Theory and Applications of Covalent Docking in Drug Discovery: Merits and Pitfalls. Molecules 2015, 20, 1984−2000. (53) Tomić, S.; Bertoša, B.; Kojić-Prodić, B.; Kolosvary, I. Stereoselectivity of Burkholderia Cepacia Lipase Towards Secondary Alcohols: Molecular Modelling and 3d Qsar Approach. Tetrahedron: Asymmetry 2004, 15, 1163−1172. (54) Schulz, T.; Pleiss, J.; Schmid, R. D. Stereoselectivity of Pseudomonas Cepacia Lipase toward Secondary Alcohols: A Quantitative Model. Protein Sci. 2000, 9, 1053−1062. (55) Kahlow, U. H. M.; Schmid, R. D.; Pleiss, J. A Model of the Pressure Dependence of the Enantioselectivity of Candida Rugosa Lipase Towards (±)-Menthol. Protein Sci. 2001, 10, 1942−1952. (56) Kazlauskas, R. J.; Weissfloch, A. N.; Rappaport, A. T.; Cuccia, L. A. Enantiomer of a Secondary Alcohol Reacts Faster in Reactions Catalyzed by Cholesterol Esterase, Lipase from Pseudomonas Cepacia, and Lipase from Candida Rugosa. J. Org. Chem. 1991, 56, 2656−2665. (57) Liu, J.; Tang, X.; Yu, H. Prediction of the Enantioselectivity of Lipases and Esterases by Molecular Docking Method with Modified Force Field Parameters. Biotechnol. Bioeng. 2010, 105, 687−696. (58) Mantina, M.; Chamberlin, A. C.; Valero, R.; Cramer, C. J.; Truhlar, D. G. Consistent Van Der Waals Radii for the Whole Main Group. J. Phys. Chem. A 2009, 113, 5806−5817. (59) Verdonk, M. L.; Cole, J. C.; Hartshorn, M. J.; Murray, C. W.; Taylor, R. D. Improved Protein-Ligand Docking Using Gold. Proteins: Struct., Funct., Genet. 2003, 52, 609−631. (60) Bunch, A. W. Nitriles. In Biotechnology Set; Wiley-VCH: Weinheim, Germany, 2008; pp 277−325. (61) Zhang, L.; Yin, B.; Wang, C.; Jiang, S.; Wang, H.; Yuan, Y. A.; Wei, D. Structural Insights into Enzymatic Activity and Substrate Specificity Determination by a Single Amino Acid in Nitrilase from Syechocystis Sp. Pcc6803. J. Struct. Biol. 2014, 188, 93−101. (62) Raczynska, J. E.; Vorgias, C. E.; Antranikian, G.; Rypniewski, W. Crystallographic Analysis of a Thermoactive Nitrilase. J. Struct. Biol. 2011, 173, 294−302. (63) Fernandes, B. C. M.; Mateo, C.; Kiziak, C.; Chmura, A.; Wacker, J.; van Rantwijk, F.; Stolz, A.; Sheldon, R. A. Nitrile Hydratase Activity of a Recombinant Nitrilase. Adv. Synth. Catal. 2006, 348, 2597−2603. 1993

DOI: 10.1021/acs.jcim.5b00585 J. Chem. Inf. Model. 2016, 56, 1979−1994

Article

Journal of Chemical Information and Modeling (64) Heinemann, U.; Engels, D.; Bürger, S.; Kiziak, C.; Mattes, R.; Stolz, A. Cloning of a Nitrilase Gene from the Cyanobacterium Synechocystis Sp. Strain Pcc6803 and Heterologous Expression and Characterization of the Encoded Protein. Appl. Environ. Microbiol. 2003, 69, 4359−4366. (65) Scott, D. W. Multivariate Density Estimation: Theory, Practice, and Visualization; John Wiley & Sons: Hoboken, NJ, 2015. (66) McKinney, W. Python for Data Analysis: Data Wrangling with Pandas, Numpy, and Ipython. O’Reilly Media, Inc.: Sebastopol, CA, 2012. (67) Bashtannyk, D. M.; Hyndman, R. J. Bandwidth Selection for Kernel Conditional Density Estimation. Computational Statistics & Data Analysis 2001, 36, 279−298. (68) Kallberg, Y.; Oppermann, U.; Persson, B. Classification of the Short-Chain Dehydrogenase/Reductase Superfamily Using Hidden Markov Models. FEBS J. 2010, 277, 2375−2386. (69) Liu, X.; Yuan, Z.; Yuan, Y. A.; Lin, J.; Wei, D. Biochemical and Structural Analysis of Gox2181, a New Member of the Sdr Superfamily from Gluconobacter Oxydans. Biochem. Biophys. Res. Commun. 2011, 415, 410−415. (70) Carius, Y.; Christian, H.; Faust, A.; Zander, U.; Klink, B. U.; Kornberger, P.; Kohring, G.-W.; Giffhorn, F.; Scheidig, A. J. Structural Insight into Substrate Differentiation of the Sugar-Metabolizing Enzyme Galactitol Dehydrogenase from Rhodobacter Sphaeroides D. J. Biol. Chem. 2010, 285, 20006−20014.

1994

DOI: 10.1021/acs.jcim.5b00585 J. Chem. Inf. Model. 2016, 56, 1979−1994