A Semiautomated Structure-Based Method To Predict Substrates of

The discovery of unique substrates is important for developing potential applications of enzymes. However, the experimental procedures for substrate i...
0 downloads 12 Views 3MB Size
Subscriber access provided by Northern Illinois University

Article

A Semi-automated Structure-Based Method to Predict Substrate of Enzyme via Molecular Docking: A Case Study with Candida Antarctica Lipase B Zhiqiang Yao, Lujia Zhang, Bei Gao, Dongbing Cui, Fengqing Wang, Xiao He, John Zenghui Zhang, and Dong-Zhi Wei J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.5b00585 • Publication Date (Web): 16 Aug 2016 Downloaded from http://pubs.acs.org on August 23, 2016

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

4

A Semi-automated Structure-Based Method to Predict Substrate of Enzyme via Molecular Docking: A Case Study with Candida Antarctica Lipase B

5

Zhiqiang Yao1, Lujia Zhang1*, Bei Gao1, Dongbing cui1, Fengqing Wang1, Xiao He2,3,

6

John Z.H. Zhang2,3 and Dongzhi Wei1*

7

1State Key Laboratory of Bioreactor Engineering, New World Institute of

8

Biotechnology, East China University of Science and Technology, Shanghai 200237,

9

China

10

2State Key Laboratory of Precision Spectroscopy, Institute of Theoretical and

11

Computational Science, East China Normal University, Shanghai, 200062, China

12

3NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai,

13

200062, China

1 2 3

14

*E-mail: [email protected], [email protected]

15 16

Abstract

17

The discovery of unique substrate is important for developing potential applications of

18

enzymes. However, the experimental procedures for substrate identification are

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

laborious, time-consuming and expensive. Although in silico structure-based

2

approaches show great promise, recent extensive studies show that these approaches

3

remain a formidable challenge for current bio-computational methodologies. Here, we

4

present an open-source, extensible and flexible software platform for predicting

5

enzyme substrates called THEMIS, which performs in silico virtual screening for

6

potential catalytic targets of enzyme based on enzymes catalysis mechanism. Based on

7

a generalized transition state theory of enzyme catalysis, we introduced a modified

8

docking procedure called “Mechanism-Based Restricted Docking” (MBRD) for novel

9

substrate recognition from molecular docking. Comprising a series of utilities written

10

in C/Python, THEMIS automatically executes parallel computing MBRD tasks and

11

evaluates results with various MM (molecular mechanics) criteria such as energy,

12

distance, angle and dihedral angle to help identify desired substrates. Exhaustive

13

sampling and statistical measures were used to improve the robustness and

14

reproducibility of the method. We used Candida antarctica lipase B (CALB) as a test

15

system to demonstrate the effectiveness of our computational prediction of

16

(non-)substrates. A novel MM score function for CALB substrate identification derived

17

from near attack conformation (NAC) was used to evaluate the possibility of chemical

18

transformation. A highly positive rate of 93.4% was achieved from a CALB substrate

19

library with 61 known substrates and 35 non-substrates, and the screening rate has

20

reached 103 compounds/day (96 CPU cores, 100 samples/compound). The

21

performance shows that the present method is perhaps the first reported scheme to meet

ACS Paragon Plus Environment

Page 2 of 55

Page 3 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

the requirement for practical applicability to enzyme studies. An additional study was

2

performed to validate the universality of our method. In this verification we employed

3

two distinct enzymes, nitrilase Nit6803 and SDR Gox2181, where the correct rates of

4

both enzymes exceeded 90%. The source code used will be released under the GNU

5

General Public License (GPLv3) and be free to download. We believe that the present

6

method will provide new insights into enzyme research and accelerate the development

7

of novel enzyme applications.

8

Introduction

9

Enzymes catalyze a wide variety of reactions in aqueous solution under ambient

10

conditions with exquisite selectivity and stereospecificity1, 2 and are among the most

11

proficient catalysts known.3 The fascinating catalytic ability of enzymes has

12

tremendous potential for developing novel synthetic biochemical pathways.4,

13

5

14

new enzyme discovery. As of May 16, 2016, RefSeq (Version 76) contained

15

63,971,766 protein sequences and the list is rapidly increasing owing to

16

high-throughput sequencing technology.6 Because functional characterization of the

17

substrates of uncharacterized enzymes by experiment is laborious and time consuming,

18

the progress in utilizing more efficient enzymes has been slow.7 On the other hand,

19

computational methods are increasingly being applied in the design of biocatalysts to

20

provide rational guidelines, direct experimental planning, and avoid expensive and

21

time-consuming experiments.8

However, in vitro methodological progress is slow in comparison to the evolution of

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

Nonetheless, the in silico quantitative prediction of enzyme activity and selectivity is

2

still a formidable task. 9 Although bioinformatics can suggest clues of activities from

3

sequence similarity, it provides no direct evidence because the enzymatic activities are

4

based on their 3-dimensional structure.10 When the target enzyme has little relationship

5

to orthologous enzymes of known activity, any inference could be unreliable,11 making

6

structure-based prediction inevitable. Structure-based computational methods are able

7

to describe the biocatalytic machinery in detail, and thus can provide more information

8

that is otherwise difficult to obtain.1

9

From the standpoint of applicability, a viable substrate screening method would have

10

to match or exceed a screening rate of 102 to 103 compounds per day in order to be

11

competitive in terms of cost with common experimental methods.12 In the present

12

study, we used a molecular mechanics (MM) based method to perform in silico

13

screening against substrates. This approach has the advantage of requiring modest

14

computational costs, and it has been widely used in enzyme engineering and

15

mechanistic investigations.13-16 Meeting speed requirements will introduce the second

16

and most critical performance criterion of computational methods, namely the ability to

17

reliably discriminate between active and inactive molecules in a virtual library.17

18

Generally, enzymes are large and heterogeneous systems containing thousands of

19

atoms, in which not only active residues but also other residues and the solvent

20

environment determine enzymes’ activities.18, 19 In generalized transition state theory,20

ACS Paragon Plus Environment

Page 4 of 55

Page 5 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

the rate constant for a reaction at temperature T can be conveniently expressed as

2

follows:

3

Eq. 1

4

Where T is the temperature, k(T) is the reaction rate, γ (T) is the transmission

5

coefficient, kTST is the transition state theory rate constant, kB is the Boltzmann’s

6

constant, h is the Planck’s constant, and G‡ (T) is the quasi thermodynamic free energy

7

of activation. Computational studies show that the dominant factor responsible for the

8

rate enhancement by enzymes is the lowering of the free energy of activation G‡ (T) as

9

compared to that of the uncatalyzed reaction in water. 1, 2 Nevertheless the transmission

10

coefficient γ (T) is also critical for understanding enzymatic reactions, which can either

11

accelerate or decelerate reactions. Thus, to accurately predict the substrates for

12

enzymes, the two key factors must be carefully examined for each candidate substrate.

13

Firstly, the reduction of the quasi thermodynamic free energy of activation for the

14

chemical transformation is a key step in enzyme catalysis from a physicochemical

15

perspective.21 However, such computation requires expensive, high-level quantum

16

mechanical (QM) simulation and is too slow to allow sufficient screenings of

17

wide-ranging substrate candidates even when combined with MM.22 Thus, we tried to

18

use simple MM molecule geometry elements to replace such heavy computation in the

19

present study. Because each enzyme follows a different catalytic mechanism, we used a

20

serine hydrolase, Candida antarctica lipase B (CALB/EC 3.1.1.3), which catalyze ester

21

hydrolysis, as a target for effective computational prediction of (non-)substrates. CALB

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

was chosen for a number of reasons: the availability of a high-resolution crystal

2

structure;23 a well-characterized catalytic mechanism;24 rich experimental data;25-38

3

wide use in academic and industrial laboratories; a great number of registered patents

4

and various applications; and it’s appropriateness a candidate in pharmaceutical,

5

chemical, and food industries.39 It is generally accepted that in serine protease-like

6

enzymes, the formation of the tetrahedral intermediate (TI) or near-attack

7

conformation (NAC), a ground-state conformation that can convert directly to the

8

transition state, is rate determining.40, 41 We therefore assumed in our sampling that a

9

higher probability of an NAC correlates to a lower barrier for this reaction, and thus

10

higher overall activity of the enzyme for the substrate. By defining key interactions

11

between substrates and CALB in the NAC, we designed a novel substrate prediction

12

score function (SP-CALB) to evaluate the possibility of chemical transformation of

13

CALB substrates by combining the contributions of each key interaction.

14

Secondly, the contribution of the transmission coefficient γ (T) to the reaction rate is

15

relatively small (typically no more than a factor of 103) in comparison to the effect of

16

reaction barrier (as much as a factor of 1019). 20 The transmission coefficient is also

17

important for predicting enzyme catalysis and is sensitive to substrate-enzyme and

18

substrate-water interactions: it can either accelerate or decelerate reactions. From a

19

biochemical perspective, the “lock and key” model42 is the most commonly accepted

20

hypothesis for enzyme catalysis. In this hypothesis, the binding of a substrate molecule

21

to the active site on the enzyme results in activation of the substrate (a reactive

ACS Paragon Plus Environment

Page 6 of 55

Page 7 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

conformation). Subsequently, a modified version in which the “key does not quite fit

2

the lock perfectly but exercises a certain strain on it” (ground-state destabilization) was

3

proposed.43 Along the reaction coordinate of enzyme catalysis, the energy barrier for

4

the reversible binding of substrate, forming an enzyme-substrate complex, which then

5

undergoes catalytic conversion, is critical for discriminating substrate from

6

non-substrate. This binding free energy can be calculated with various methodologies;

7

frequently used MM methods such as MM-PBSA and molecular docking can predict

8

enzyme selectivity between substrates with a certain degree of accuracy.44,

9

achieve a high screening rate, molecular docking was used for sampling in the present

10

study, which is a high-efficiency, binding free energy-based method widely used for

11

investigating enzyme-substrate interactions.46 However, MM-based methods alone are

12

not sufficient for sampling of the NAC because the energy function used cannot

13

accurately describe the electron structures that promote enzyme catalysis. Therefore,

14

we introduced three enhanced hydrogen bond terms, which mimic the stabilization of

15

substrate by an oxyanion hole in CALB to constrain the sample space within the

16

subspace around the NAC. A modified docking procedure called “Mechanism-Based

17

Restricted Docking” (MBRD) is introduced to achieve substrate recognition for

18

molecular docking. For robustness and reproducibility, exhaustive sampling was

19

applied along with MBRD in this work.

45

To

20

Here, we present an open source, extensible and flexible software platform called

21

THEMIS, which performs in silico screening of enzyme substrates to drastically reduce

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

experimental effort. THEMIS is composed of a series of utilities written in C/Python to

2

automatically execute parallel computing MBRD tasks and evaluate results with

3

various MM criteria such as energy, distance, angle and dihedral angle to identify

4

substrates. Against a library of 61 known substrates and 35 non-substrates of CALB, a

5

considerably high positive rate of 93.4% was achieved and the screening speed has

6

reached 103 compounds/day (96 CPU cores, 100 samples per compound). Additional

7

verification with a distinct enzyme, nitriles Nit6803, was performed to validate the

8

universality of our method. For 12 substrates and 12 non-substrates that were

9

experimentally confirmed, the positive rate reached 91.7%.

10 11

Results and Discussion

12

Workflow of the Substrate-screening Platform

13

The goal of our design was to establish a general screening platform for common use

14

in enzyme studies. The workflow of the in silico substrate-screening system is shown in

15

Figure 1. The entire blueprint was derived from the relationship between the substrate

16

specificity of the enzyme and the enzyme structure; in other words, the function is

17

determined by the structure. It starts with a suitable structure of an enzyme either by

18

downloading from the PDB database or homology modeling, followed by rigorous

19

analysis of the enzyme mechanism. Important information should be extracted and

20

formed into proper settings; CALB will be used to demonstrate details of this process

21

later in this paper. To identify substrates, the score function should be composed of

ACS Paragon Plus Environment

Page 8 of 55

Page 9 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

MM criteria derived from the NAC such as energy, distance, angle and dihedral angle.

2

From then on, THEMIS will take over all the miscellaneous tasks, automatically

3

conducting high-performance computational investigations of substrate screening.

4 5

Figure 1. Workflow of substrate prediction platform (THEMIS). The workflow

6

starts with either crystal structure or homo-modeling structure of target enzyme.

7

Catalytic mechanism of enzyme, such as active residue, binding pocket and catalysis

8

pathway, is necessary known before any screening. The ligand database is composed

9

of the compound structures to investigate. The pre-screening procedure analyzes the

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 55

1

reactive group in ligands and extract essential information for substrate prediction.

2

Docking procedure of ligands could be processed in parallel with message Passing

3

Interface (MPI) protocol. The results are automatically analyzed by utilities in

4

THEMIS.

5 6

Substrate Candidates and Pre-screening

7

Ninety-four ester compounds were investigated in this study, two substituents of

8

which differed over a wide range.25-29, 31-38, 47 All of the candidates gathered from the

9

literature are experimentally verified substrates or non-substrates of CALB. In order to

10

create a baseline test set, the investigated candidates varied in physicochemical

11

properties. As a result of the high enantioselectivity, the majority of data came from

12

investigations of CALB chiral resolution. For clarity, only substrates with notably high

13

reactivity rates were employed and they are marked as active. In contrast, candidates

14

with no detectable or extremely low hydrolytic rates were labeled as inactive. Among

15

them,

16

2-aminopentanedioate each has two active ester groups that can be hydrolyzed. All the

17

other candidates were monoesters.

(R)-5-allyl

1-ethyl

2-aminopentanedioate

and

(R)-1-allyl

5-ethyl

18

In typical applications of in silico screening, ligand databases generally consist of

19

thousands to millions of compound structures. However, not all of the structures in

20

these databases contain chemical groups such as esters that hydrolyzed by the enzyme

21

of interest. Furthermore, not only does MBRD require information about the substrate

ACS Paragon Plus Environment

Page 11 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

carbonyl oxygen but also assessment by the score function requires information about

2

the ester group atoms. Thus, a proper pre-screening procedure is essential. To

3

overcome the drawbacks of manually established configurations for each unique

4

candidate substrate, a utility program was developed to automatically analyze chemical

5

groups from candidate substrate structures, extract key atomic information and

6

generate docking configure files accordingly. A recursive algorithm was used to match

7

any predefined chemical group such as esters in the given compound structure (Tripos

8

mol2 format) according to atom type and topological connection. Additional

9

information for the chemical group will be written into docking configure files as

10

comments, which will be used in substrate assessment by score function. Details of

11

inputs for pre-screening can be found in the supporting information (S5-7).

12

MBRD

13

In principle, standard simulation approaches using a classical force field (MM) can

14

only simulate a stable structure corresponding to a local minimum on the potential

15

energy surface, whereas the reactant transition state is always associated with a

16

first-order saddle point on the potential energy surface during a reaction process.

17

Hence, molecular docking using a classical force field cannot directly simulate a

18

transition state. Furthermore, MM can-not simulate the process of covalent bond

19

breaking or formation, or the transition state, because it can-not accurately describe

20

electronic structures. Conversely, an NAC can be well described; it is an approximation

21

of the ground-state conformation of the reactant transition state, which can directly

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

convert to the transition state. In CALB, the NAC takes the form of a tetrahedral

2

intermediate (TI, Figure 2).

3 4

Figure 2. The key residues of CALB, which consist of the catalytic triad

5

(Asp187-His224-Ser105) and the oxyanion hole (Thr40 and Gln106). The red dashed

6

bonds represent three hydrogen bonds that stabilize the substrate during reaction,

7

between the substrate and the enzyme oxyanion hole. The proton transfers between

8

His224 and Ser105 is represented with a pink arrow. The green arrow shows the attack

9

to the carbonyl group of substrate initiated by Ser105 Oγ (nucleophile). The

10

nucleophilic attack prompts the breaking of the carbon-oxygen pi bond, resulting in the

11

formation of a tetrahedral intermediate, which promptly collapses and releases as

12

product.

13 14

Hence, effective sampling of the NAC in the complex of CALB and the candidate

15

substrate is the pivot point of our substrate-screening platform. Because molecular

ACS Paragon Plus Environment

Page 12 of 55

Page 13 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

docking programs are designed to efficiently sample the overall minima in the complex

2

between protein receptor and ligand, the docking score function prefers strong binding

3

for inhibitors and target receptors in drug development. However, the NAC of the

4

substrate is not at an overall minimum in most circumstances. Thus, we introduced the

5

MBRD procedure to constrain the sample space within the subspace around the NAC to

6

improve sampling efficiency. It should be pointed out that the only purpose of

7

performing such types of MBRD simulation is to rule out irrelevant binding poses and

8

focus on sampling of the local minima of docking for candidate substrates in the

9

subspace around the NAC of the enzymatic reaction. Subsequently, the SP-CALB is

10

designed to evaluate the potential activities of candidate substrates using the geometric

11

parameters of their corresponding conformations.

12

Besides the NAC model used in present study, some investigators have

13

implemented a number of explorations and attempts to the in sillico prediction of

14

substrates, in which covalent docking is a promising research direction. Since the

15

formation of a covalent bond with the target to achieve irreversible inhibition is

16

essential for a number of successful drugs, covalent docking has been used to

17

successfully describe covalent interactions between inhibitors and biological targets48

18

and this technique is also available for the prediction of substrates. Recently, some

19

resultful investigations of substrate prediction using covalent docking have been

20

reported.49-51 Different from these efficient investigations, our study adopted the NAC

21

model and tried to predict substrates according to the specific recognition between the

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

substrates and enzymes. We tried to explore the ability of molecular docking to predict

2

more broad and common substrates, who do not possess a clear reaction path with

3

covalent bonds. A comparative study with the current method has been made in the

4

following section.

5

In CALB, three hydrogen bonds between carbonyl oxygen of the substrate and an

6

oxyanion hole consisting of enzyme residues Gln106 and Thr40 were believed to form

7

before the substrate could be hydrolyzed (Figure 2). Thus, these hydrogen bonds can

8

form a nature constraint condition to sample the NAC. The hydrogen bonds were

9

simplified down to three group of pseudo harmonic potential energy, while the bond

10

angles have been ignored in the docking procedure. The potential energy takes the form

11

of E = kx2, where x is the difference between the distance and the closest constraint

12

bound and k is the spring constant. Each hydrogen bond was restricted to a lower bound

13

of 2.5 Å and an upper bound of 3.5 Å with k=1000 in the following simulation. Owing

14

to the triangular pyramid consisting of the oxyanion hole and carbonyl oxygen, the

15

bond length is able to restrict the substrate to the proper position. Therefore, the bond

16

angles of these hydrogen bonds were ignored in this study.

17

Another important improvement to enhance accuracy in this study is that we used

18

statistical data based on a sufficient amount (102) of parallel docking results.

19

Simulations of enzymes are commonly only used as a theoretical explanation of

20

experimental data where the reproducibility of the results is out of concern. However,

21

we consider that the reproducibility of the results is the most important feature of our

ACS Paragon Plus Environment

Page 14 of 55

Page 15 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

system. Thus, we enlarged the sampling number to control the contingency of MBRD,

2

just like Klepsch and his colleagues did in investigations of inhibitors for

3

P-glycoprotein.52 The other important thing is that the vast amount of data are

4

automatically operated on by Python scripts rather than manually, which not only

5

minimizes the artificial impact on raw data but also improves the screening speed.53

6

At the end of MBRD, an appropriate number of candidate substrate conformations

7

should be acquired. Obviously, if not a single pose could be found, the potential for

8

being a substrate is very limited. However, all of the candidate substrates successfully

9

generated 100 poses as expected. To distinguish active candidates, further evaluation

10

was needed.

11

Assessment Criteria

12

To accurately evaluate the possibility of chemical transformation for CALB

13

substrates, the score function has to integrate effective MM geometry assessment

14

criteria for the NAC.

15

Firstly, distances between key atoms in the substrate and enzyme catalytic residues

16

were reported to be reasonable criteria for predicting the enantioselectivity or

17

stereo-selectivity of enzymes.54 Kahlow et al. established a simple model in

18

Pseudomonas cepacia lipase, in which the difference in the distance D(HNɛ-Oα)

19

between the catalytic histamine side chain and the alcohol oxygen between the

20

preferred

and

the

non-preferred

enantiomer

was

ACS Paragon Plus Environment

correlated

with

the

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

enantioselectivity.55 Another simple quantitative model of Candida rugosa lipase

2

revealed the same phenomenon.56

3

Thus, the first assessment criterion adopted is distance (D1) between the imidazole

4

hydrogen atom (HNɛ) of His224 and the oxygen atom (Oα) of the alcohol part of the

5

candidate substrate ester (Figure 2). This distance represents the abstraction of the

6

eventful hydrogen bond between His224 and substrate Oα. Therefore, the smaller it is,

7

the greater the possibility that a hydrolytic reaction will take place.

8

In addition to D1, a second distance criterion (D2) between the Oγ atom of Ser105 and

9

the Cα atom of the carbonyl group of the candidate substrate was also employed

10

(Figure 2). This distance is a measure of the feasibility of nucleophilic attack on the

11

planar carbonyl group by the active Ser105 Oγ atom that is the trigger for hydrolytic

12

reaction. As with D1, the smaller D2 is the higher the possibility that a hydrolytic

13

reaction will take place.

14

The third criterion used in this study is the torsion angle (T) defined in Figure 3.

15

Based on Kazlauskas’s rule, which states that the size of substrate substituents at the

16

stereocenter affects the selectivity of lipase,57 we derived a quantifiable criterion, T,

17

which can be easily measured. The CALB pocket can be divided into two smaller

18

pockets, and two the substituents of substrate candidate interact with each of them

19

(Figure 4). To evaluate proper orientation of the ester group in the NAC, T is defined

20

by the dihedral angle between the ester group and a reference atom, Cβ of Gln106. In

ACS Paragon Plus Environment

Page 16 of 55

Page 17 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

this way, the orientation difference determined by two substituents is effectively

2

detectable by T.

3 4

Figure 3. The definition of torsion angle in projection mode. To distinguish

5

different orientations of ligands, torsion angle (T) was defined. The two planes share a

6

common axis (Green) composed of carboxylic oxygen and Cα. The angle can be

7

measured by the angle between the reference atom (Cβ of Gln106, Black) and hydroxyl

8

oxygen (Gray).

9

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

Figure 4. The CALB catalytic pocket. The blue surface represents the pocket of

2

CALB, which is able to bind and orient the substrates. The substrate

3

((R)-3-methylbutan-2-yl isobutyrate) is showed in the binding pocket with stick model,

4

where atoms of the substrate are colored according to their atom types (hydrogen/white,

5

oxygen/red, and carbon-cyan/blue). The red mesh shows the atom surface of the

6

substrate. The commonly accepted explanation for the lipase catalytic selectivity is that

7

the match between the enzyme pocket and the substrate’s two substituents.

8 9

Performance of MBRD

10

To verify the performance improvement due to the application of MBRD, a

11

comparison was made between standard molecular docking,MBRD and covalent

12

docking by docking ten compounds of known activities into CALB (Figure 5). In

13

standard molecular docking and MBRD, for each candidate, 100 rounds of molecular

14

docking towards the receptor CALB was performed using identical parameters except

15

for three distance restrictions between a carbonyl oxygen of the substrate and the

16

oxyanion hole consisting of enzyme residues Gln106 and Thr40 presented in Figure 1.

17

The docking configuration was the same with the standard molecular docking except

18

that the covalent bond between the oxygen atom of Ser105 (OG) and substrate carbonyl

19

carbon atom was fixed during the docking by GOLD program. This covalent docking

20

protocol was following the work of liu et al58. The results show that the docking score

21

for MBRD is slightly smaller than that for standard docking (the bigger the better),

ACS Paragon Plus Environment

Page 18 of 55

Page 19 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

while the distances in MBRD are much smaller, especially in the substrates. The origin

2

of this difference is that the standard docking procedure performs global optimization

3

of the free binding energy between ligand and enzyme, which produces a

4

conformational energy minimum regardless of interactions with substrate required for

5

catalysis. Therefore, no consistent pattern could be detected between active and

6

non-active compounds in the standard docking results. On the other hand, MBRD

7

restricts the docking program to focus on sampling conformations around the subspace

8

of the NAC, where the geometric demands for enzyme catalysis could be satisfied. By

9

comparing the data of active candidates to the non-active ones in MBRD, clear patterns

10

could be easily found, where distances (D1 and D2) have a good correlation with

11

activities. This phenomenon demonstrates that MBRD is better at mimicking the NAC

12

than standard molecular docking. Moreover, MBRD produces equal degrees of

13

docking scores, according to the Boltzmann equation, which means that in equilibrium

14

substrate binding, these conformations have probabilities similar to those of standard

15

docking. However, all covalent docking results showed negative scores (the bigger the

16

better), which correspond to poor binding of the candidates. Moreover, the distance 1 in

17

covalent docking did not present a clear correlation with the activities either (distance2

18

is covalent bond length here which is fixed, thus it is not included in table 1). The

19

reason for the dissatisfied results may be that the constrain of covalent bond is too

20

strong for current system, which leads to poor binding and unsuitable conformations. In

21

summary, by measuring the key interactions between candidate compound and enzyme

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 55

1

in the complex produced by MBRD, the possibility of an enzyme-catalyzed reaction

2

could be successfully estimated. The torsion angle T is not included in Table 1, because

3

it is dependent upon the geometric location of the ligand, which means it is not

4

comparable without distance restrictions.

5

6 7

Figure 5. Ten compounds used to verify the performance of MBRD. Ten

8

compounds were picked to verify the improvement of MBRD comparing to standard

9

docking procedure. The data of these candidates were from reference.25.

10 11 No.

Table 1. Validation of MBRD Activity

Regular docking Goldscore

1.

Yes

31.61

MBRD

Distance

Distance

1

2

3.844

3.133

Goldscore

26.33

Covalent docking Distance

Distance

1

2

3.228

2.381

ACS Paragon Plus Environment

Goldscore

Distance1

-33.12

3.197

Page 21 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

2.

No

34.25

3.696

2.846

31.65

3.562

2.540

-30.19

2.979

3.

Yes

31.97

3.68

3.917

28.08

2.942

2.234

-20.14

2.434

4.

No

32.71

3.905

3.176

29.92

3.575

2.461

-16.53

2.536

5.

Yes

28.56

3.6

3.377

24.50

2.948

2.247

-26.27

3.240

6.

No

29.80

3.379

2.621

29.16

3.450

2.479

-18.92

2.717

7.

Yes

36.62

5.144

4.664

35.66

3.018

2.291

-7.94

2.746

8.

No

43.43

6.909

6.707

35.93

3.249

2.351

-5.67

2.778

9.

Yes

36.62

5.144

4.664

29.01

2.980

2.213

-35.56

2.541

10.

No

37.07

4.777

4.163

35.70

3.667

2.485

-21.79

2.730

1 2

Performance, Stability and Repeatability of Criteria

3

Before establishing the form of the score function, independent and cross testing had

4

been performed to confirm the effectiveness of these three criteria. To circumvent the

5

uncertainty of sampling by molecular docking, we exhaustively sampled the pose space

6

by generating 100 docking poses for each ligand. For the verification of the robustness,

7

five parallel trials were run to assess the stability and repeatability of our method.

8

Statistical data for the assessment criteria from all poses were analyzed upon

9

exhaustive sampling. The average values of parallel trials were used at the end of our

10

study, where each final prediction is based on the result of 500 samples.

11

The discriminatory ability of the three criteria along with docking scores (gold-score)

12

is presented as box charts and receiver operating characteristic (ROC) curves in Figure

13

6. From Figure 6A, it can be seen that D1 ranged from 2.7 Å to 4.3 Å which is slightly

14

greater than the summation of atomic van der Waals radii of H (1.10 Å) and N (1.55 Å)

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

.59 The reactive and non-reactive candidates showed different distribution where

2

distances of non-reactive candidates are greater than substrates as expected. However,

3

the probability region between 50% and 95% of substrates overlaps with the main body

4

of non-substrate. As a result, only substrates with distance 1 lower than 3.0 Å can be

5

confirmed independently by D1. A similar situation could be found with D2 in Figure

6

6B, in that only a distance smaller than 2.3 Å could be independently confirmed as

7

substrate. Although the majorities of the two groups had less overlaps region, only a

8

very few regions were injective.

9 10

Figure 6. The discriminatory ability of criteria. From A to D in order represents the

11

Box diagram of D1 (HNɛ-Oα), D2 (Oγ-Cα), torsion angle (T) and docking score

12

(gold-score). The boxes are determined by the 25th, 50th and 75th percentiles and the

ACS Paragon Plus Environment

Page 22 of 55

Page 23 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

whiskers are determined by the 5th and 95th percentiles. Minimum and maximum

2

values are marked as crosses. The average value is presented as small squares. The

3

boxes are colored by activities of the candidate substrates: red for non-reactive

4

candidates and blue for active candidates.

5

Compared to the distance assessment criteria, torsion angle (T) provide a distribution

6

very close to that of one-one mapping, which provides a significant improvement to the

7

distinguishability. As shown in Figure 6C, substrates tend to have smaller angles, most

8

of which are below 90° and 75% of which are below 60°. In contrast, the majority of

9

non-substrates are distributed above 90° and 75% are above 130°. The overlapping

10

region of torsion angle distribution between substrate and non-substrate is less than 5%.

11

Furthermore, the docking score (gold-score) is not valid for distinguishing substrates

12

in this study where no direct difference between the distributions could be detected

13

(Figure 6D). Thus, this is ignored as an option of assessment criteria for the remainder

14

of the study. The ROC curve (Figure 7) was also used to evaluate the ability of these

15

criteria to independently predict substrates. As already indicated, the performances of

16

the criteria are in the following order: torsion angle T (AUC = 0.92), distance D2 (AUC

17

= 0.92), distance D1 (AUC = 0.80), and docking score (AUC = 0.50). In fact, prediction

18

efficiency of the docking score is virtually random.

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1 2

Figure 7. ROC curves for the three criteria and docking score (sensitivity versus

3

1-specificity). The AUCs (Area under the curve) are 0.8016(D1), 0.9274(D2),

4

0.9522(T) and 0.5016 (gold-score).

5 6

The stability and repeatability were verified by histogram and probability plots of

7

standard deviation as shown in Figure 8. The majority of standard deviations (90%) for

8

D1 are below 0.04 Å, which is below of the average value of D1 by two orders of

9

magnitude (Figure 8A). Meanwhile, D2 shows an even smaller standard deviation,

10

90% of which is below 0.015 Å and is two orders of magnitude less than the average

11

value of D2 (Figure 8B). Comparing D1 and D2, the torsion angle T is less stable

12

(Figure 8C), the standard deviation of which is less than 5° (90%). Considering that the

13

difference in the torsion angle between substrates and non-substrates (Figure 6C) is

14

much larger (more than 10° for 95% of candidates and the high discriminatory

ACS Paragon Plus Environment

Page 24 of 55

Page 25 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

performance of T), it is sufficiently stable in the most circumstances. Furthermore,

2

exhaustive sampling the robustness of all criteria will be improved to meet the

3

requirements of practical use.

4

5

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

Figure 8. The stability and repeatability of criteria. The standard deviations are

2

presented by histogram and probabilities plots: (A) the standard deviation of D1, (B) the

3

standard deviation of D2, (C) the standard deviation of torsion angle T.

4 5

In order to integrate the merits of the three criteria, their combinational use was

6

investigated (Figure 9). The x-y scatter plot of D1 and D2 shows a near-linear

7

distribution (Figure 9A). As expected, most of the substrates can be successfully

8

distinguished by D1 and D2, but some failed because of the indistinct border between

9

substrates and non-substrates. In contrast, the scatter plot of torsion angle vs either

10

distance shows better distinguishability (Figure 9B and C). Finally, the 3D scatter plot

11

shows that by combining the three criteria, a quite high efficiency of activity prediction

12

could be obtained (Figure 9D).

13

It should be pointed out that although the torsion angle (T) is the most efficient

14

criteria in this study, D1 and D2 were still employed in the final score function, because

15

the above performance comparison is data-dependent and D1 and D2 had more

16

intuitionistic relevancy to the catalytic mechanism than T. Moreover, torsion angle

17

itself does not require that atoms are close to each other. However, when the substrate is

18

not in the enzyme active site, on the face of it, the angle of T still seems reasonable.

19

However, it says nothing about the interaction between enzyme and substrate in that

20

situation. On the other hand, the distances are quite straightforward in their physical

21

meaning, which makes them much more reliable. When they diverge from their normal

ACS Paragon Plus Environment

Page 26 of 55

Page 27 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

value range, something has very likely gone wrong. Thus, both distances are used in the

2

following score function.

3

4 5

Figure 9. The discriminatory ability of combination of the three criteria. Each

6

candidate substrate is presented by a colored dot (red for non-active, blue for active):

7

(A) the combination use of D1 and D2 ; (B) the combination use of D1 and T; (C) the

8

combination use of D2 and T; (D) the 3D scatter plot of the three criteria.

9 10

Score Function and Prediction Result

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 28 of 55

1

The SP-CALB score is designed to integrate the merits of the three criteria and

2

provide a quantitative standard for the prediction of substrates and non-substrates. It is

3

derived from the Euclidean space norm (distance) and takes the form of

4

Eq. 2

5

Where D1 and D2 stands for distance 1 and 2 and T stands for torsion angle. The

6

torsion angle is mapped to a monotone increasing in the interval from 0 to 2 with a

7

triangle function. The ideal torsion angle for a substrate is considered to be below 90°,

8

corresponding to the interval from 0 to 1. Three constants (w1, w2, w3) are used to

9

normalize inputs. The default values of w1 = 16 and w2 = 9 are given by the round-off

10

square number of the sum of van der Waals radii between atom pairs in D1 and D2 (H

11

1.10 Å, O 1.52 Å and N 1.55 Å) .59 The default value of w3 = 4 is derived from the

12

maximum value of the cosine function. The constant w4 is designed to adjust the range

13

of final scores to be distributed asymmetrically, so that its meaning can be easily

14

understood by different signs and w5 is a scalar to amplify score differences for

15

perceptibility (w4 = 1.3, w5 = 100). With these default values, substrates can be

16

indicated by negative scores.

17

The performance of the SP-CALB score is presented in Figure 10. A majority (89

18

out of 96) of candidates were successfully predicted and only 7 failed. As shown in

19

Tables 2 and 3, three cases out of 96 were false-positive hits (3.1%) and four cases

20

were false-negative hits (4.1%). The high accuracy rate (92.7%) and high true-positive

ACS Paragon Plus Environment

Page 29 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

rate (93.4%) guarantees not only the feasibility but also the availability of the

2

SP-CALB score.

3 4

Figure 10. The prediction result of the SP-CALB score. Ninety-six candidate

5

substrates were predicted by the SP-CALB score. The experimental activities are

6

shown by different colors: red for non-active, green for active. The columns are

7

generated by the SP-CALB score values. The candidates, whose score are below zero,

8

are predicted as substrates and the opposite are non-substrates.

9

Table 2. Confusion matrices for the validation of prediction method. Substrate

Non-subst

Total

Correct

rate Substrate

TP

FN

TP+FN

Sensitivity=TP/(TP+FN)

Non-subst

FP

TN

FP+TN

Specificity=TN/(TN+FP)

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 30 of 55

rate Total

TP+FP

FN+TN

TP+TN+FP+FN

Correct

Precision=

Accuracy=

TP/(TP+FP)

(TP+TN)/(TP+TN+FP+FN)

1

The performance of the proposed method of this study was fully validated using the

2

following measurements: precision, sensitivity, specificity and accuracy. The table

3

shows the confusion matrices used, where TP = true positive, FP = false positive, TN =

4

true negative, FN = false negative.

5

Table 3. Confusion matrices of prediction method. Substrate

Non-substrate

Total

Correct

Substrate

57

3

60

95.0%

Non-substrate

4

32

36

88.9%

Total

61

35

96

Correct

93.4%

91.4%

92.7%

6 7

Prediction Error

8

Before starting analysis of the prediction errors, it is worth noting that the flexibility

9

of CALB is ignored in this study. Although the flexibility of the enzyme can

10

significantly affect the accuracy of prediction, CALB has been confirmed to undergo

11

negligible structural changes upon the binding of substrate (PDB: 1LBS). In fact, we

12

tried to introduce molecular dynamics (MD) into our protocol to fully investigate the

13

flexibility of CALB. However, it weakened rather than improved every aspect of the

ACS Paragon Plus Environment

Page 31 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

prediction. Thus, we considered that although MD brought flexibility to the enzyme

2

active pocket, it also brought dislocation of important residues, which reduced the

3

prediction accuracy. It should be pointed out that for CALB, extensive measure to

4

model flexibility could be skipped, but this may not be the case with other enzymes.

5

Although several improvements have been made to the prediction protocol, there are

6

still seven failed cases (Figure 11). According to the molecular formula, four of them

7

contain analogs of pyrrole (3-pyrroline or pyrrolidine) directly connected to the ester

8

bond. It is likely that pyrrole ring systems are not handled properly in the current

9

protocol. The problem with pyrrole ring may come from the unfavorable interactions

10

between the pyrrole ring atoms and the side chain atoms of the enzyme pocket residues.

11

Ideally, a co-crystal structure of the pyrrole ring substrates and CALB could make a

12

good explanation of this problem. Otherwise, some MM methods such as flexible

13

docking or MD may improve the accuracy of prediction by providing abilities to

14

rearrange the side chain of the enzyme pocket residues to fit the pyrrole ring of

15

substrates. Vinyl propionate is the only non-ring-containing molecule out of seven.

16

After investigating the details of the docking conformations, we conclude that the

17

positive SP-CALB score is mainly caused by improper torsion angle. Further analysis

18

shows that vinyl propionate appears to be quite free to rotate because of the structural

19

symmetry and similarities of interactions both directions. In contrast, vinyl butanoate

20

performs well, which differs from vinyl propionate by only one carbon atom. The

21

difference might be due to the break in symmetry by replacing butyric acid with

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

propionic acid. Thus when the hydrolysis of vinyl propionate is catalyzed by CALB,

2

there might be a balance between the two major conformations that could lead to a

3

transformation, which could not be detected by our system. To improve the prediction

4

of similar molecules, one possible solutions is the use of clustering center value instead

5

of the average value, because the later will produce a meaningless value between the

6

two peaks in current circumstance.

7 8

Figure 11. The prediction errors. Seven compounds failed in the prediction.

9

According to experiment result, compound 1, 2 and 3 are substrates (false negative) and

10

compound 4, 5, 6 and 7 are non-substrates (false positive).

11 12

Verification with Nitrilase Nit6803

13

After successfully testing and verifying our protocol with a lipase, the program called

14

THEMIS has been developed to feature utilities that implement the automated

15

workflow of substrate screening (MBRD). Scripts written in Python implement

16

coarse-grained parallelism via message passing interface, which could accelerate the

ACS Paragon Plus Environment

Page 32 of 55

Page 33 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

speed of calculation by distributing computation-intensive docking tasks (Gold

2

v4.1260) within clusters of computers. In the test with CALB, a speed of 103

3

compounds/day was achieved with a cluster of 96 CPU cores (Xeon X5650).

4

To verify the utility of our method with different enzymes, another

5

substrate-screening experiment using a distinct enzyme, nitrilase, was undertaken using

6

THEMIS. Nitrilases catalyze the hydrolysis of organic nitriles to the corresponding

7

carboxylic acids, some of which are important intermediates in the chemical synthesis

8

of various products.61 A nitrilase from Cyanobacterium syechocystis sp. PCC6803

9

(Nit6803) was used in the current study. The crystal structure of Nit6803 that was

10

determined in our previous work62 is the world’s second crystal structure of nitrilase

11

and there are only two nitrilase structures in the pdb database until now. In contrast with

12

the first nitrilase that was crystalized63, Nit6803 has a very broad substrate spectrum

13

and much higher catalytical activities, which make it more feasible for industrial

14

applications and theoretical investigations. Following the same routine as for CALB, a

15

thorough analysis of the mechanism was initially performed.64 The form of the

16

transition state in the hydrolytic reaction with nitrile substrates catalyzed by Nit6803 is

17

presented in Figure 12. Based on key interactions between the enzyme and substrate,

18

constraint conditions and evaluation factors could be abstracted according to the NAC

19

theory. In the current study, the distance between the cyano nitrogen of the substrate

20

and the ε-amino nitrogen of Lys135 was used as a restriction that filters positive

21

NAC-like conformations during docking. This distance restriction mimics the essential

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

hydrogen bond that stabilizes the substrate at the beginning of the reaction. Following

2

the stabilization of substrate, a water-bridged hydrogen bond is formed to promote the

3

chemical transformation. Since it’s relatively difficult to mimic the water bridged three

4

body (substrate, water and enzyme) hydrogen bond network in the docking simulations.

5

The water is assumed to be able to freely participate in the hydrogen bond interaction

6

between substrate-enzyme complex as long as there is a suitable gap for it to implant.

7

This very gap is the criteria used in the study, which is represented by the distance

8

between the cyano nitrogen of the substrate and carboxylic acid carbon of Glu53. If the

9

gap is too big, then the water will not be able to connect the substrate and enzyme,

10

making the enzymatic reaction unable to take place. This gap’s maximum length could

11

be estimated by the sum of the bond length of O-H in water, the hydrogen bond’s length

12

between the water and the substrate and the hydrogen bond’s length between the water

13

and enzyme, which is about 6 Å (0.96 Å for O-H bond in water, 2.5 Å for hydrogen

14

bond). Consider that the four atoms of the hydrogen bond network are not perfectly

15

collinear in three-dimensional space, the practical distance of the gap will probably

16

reduce to some extent. In general similar to the CALB, the smaller the distance is,

17

meaning that the high possibility the hydrogen bond network will establish, therefore,

18

the better activity the compound might possess.

ACS Paragon Plus Environment

Page 34 of 55

Page 35 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1 2

Figure 12. Mechanism of Nit6803 catalyzed hydrolysis of nitrile substrates. The

3

catalytic triad of Nit6803 consists of Glu53, Lys135 and Cys169. The figure presents

4

the first step of hydrolytic reaction according to reference.64 The reaction takes place

5

via a thioimidate intermediate (Cys-substrate) and a water molecule plays an important

6

role in the reaction. During the reaction, Lys135 stabilizes the substrate by providing a

7

hydrogen bond and Glu53 stabilizes the transition state by interacting with the positive

8

charge on the cyano nitrogen of the substrate.

9

10

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

Figure 13. Substrates of Nit6803 from literatures.65

2 3

Figure 14. Control group compounds of Nit6803 (Non-substrates).

4

Twelve known nitrile substrates for Nit6803 were obtained from the literature65

5

(Figure 13), and twelve non-hydrolyzable nitrile compounds that were experimentally

6

determined using identical methods as for the reference65 were used as controls (Figure

7

14). Each compound was docked 1000 times against enzyme Nit6803. This number of

8

docking is intentionally increased here to make the kernel density estimation (KDE)66

9

plot smoother and reduce pseudo-peaks in the plot. The KDE is better at analyzing a

10

small number of samples because it can represent the overall distribution better than the

11

box plot that was used in previous study of CALB. However, the KDE requires much

12

more computations, which will limit the speed of predictions, thus it is not as

13

convenient as classical statically methods in the circumstance of large sample size. The

14

distances between the compounds and the carboxylic acid carbon of Glu53 of the

ACS Paragon Plus Environment

Page 36 of 55

Page 37 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

enzyme was extracted from 24000 conformations and plotted with KDE in Figure 15.

2

The KDE plots reveal that a notable difference between the distribution of substrates

3

and non-substrates exists. All substrates have a centralized distribution within the

4

region containing distances less than 4.5 Å; however, the control group shows evident

5

distribution only in the zone above 4.5 Å. Thus, an effective model to separate

6

substrates from a chemical database could be deduced based on this difference. For

7

example, a binary classification by threshold (4.5 Å) could accomplish a correct rate of

8

91.7% (22/24) in the current data set. Since the threshold of 4.5 Å is only an optima

9

estimated from a relatively small group of candidates, further investigations upon other

10

substrates of Nit6803 may need to adjust this threshold depending on the state.

11

12

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 38 of 55

1

Figure 15. The plots of kernel density estimation for substrates and

2

non-substrates of Nit6803. The chart represents the KDE plots of substrates (A) and

3

non-substrates (B) in nitrilase Nit6803. The figures were generated by pandas v17.067

4

using Gaussian kernel density estimation algorithm.68 KDE is a data analysis technique

5

in statistics. Just like histograms, KDE reveals the distribution pattern of a finite data

6

sample, while possessing properties such as smoothness or continuity that histograms

7

lacks.66 In above figures, the x-axis represents the distances between the cyano nitrogen

8

of the candidate substrates and the carboxylic acid carbon of Glu53 in the 1000

9

conformations in each compound’s docking results, while the y-axis reflects the

10

corresponding probability of the distances by their kernel density. A notable difference

11

between the distribution of substrates and non-substrates exists in the zone below 4.5 Å

12

(marked

13

(S)-2-(2-chlorophenyl)-2-hydroxyacetonitrile

14

(S)-2-(4-chlorophenyl)-2-hydroxyacetonitrile have minor distributions here.

by

the

red

vertical

line).

Only

two

non-substrates and

15 16

Verification with SDR Gox2181

17

Short-chain dehydrogenase/reductases (SDRs) consist of a large superfamily of

18

NAD(P)H-dependent oxidoreductases, with over 47,000 members available in the

19

sequence database. However, most of which are distantly related, with typically

20

20-30% residue identity in pairwise comparisons. 69 Thus, it’s difficult for the sequence

21

based methods to predict the detailed activities of this superfamily. Meanwhile, there

ACS Paragon Plus Environment

Page 39 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

are more than 300 crystal structures of SDRs deposited in the Protein Data Bank, which

2

makes the SDR family enzymes more suitable for structure-based activity predictions.

3

Here, a SDR family enzyme,Gox2181 (PDB ID:3AWD), whose structure had been

4

determined by our colleagues 70, is used to further validate the generality of our method.

5

Gox2181 was demonstrated to be active towards a wide range of substrates, including

6

sugar alcohols, secondary alcohols, ketones, and ketoses. Therefore, it has a good

7

potential in industrial applications.70 As with Nit6803, the mechanism of Gox2181

8

catalyzed oxidation and reduction reactions were determined in the first place (Figure

9

16)71 and then constraint conditions and evaluation factors are abstracted based on key

10

interactions between the enzyme and substrate according to the NAC theory. Since the

11

oxidation and reduction catalyzed by the same enzyme of Gox2181 are reverse

12

reactions, where the core difference is the different protonation state of co-enzyme, the

13

oxidation and reduction substrates are treated identically in the docking setting up step,

14

except the special treatment of the protonation for co-enzyme NAD+ and NADH. In

15

present study, the distance between the hydroxy/keto/aldehyde oxygen atom of the

16

substrate and the phenolic oxygen atom of Tyr162, which mimics the essential

17

hydrogen bond that stabilizes the substrate at the beginning of the reaction and initiates

18

the proton transference between the substrate and Tyr162, was used as a distance

19

restriction that filters positive NAC-like conformations during docking. The evaluation

20

of the potential activities is taken by the distance between the nicotinamide ring C4

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

atom of co-enzyme and the hydroxy/keto/aldehyde oxygen atom of the substrate. The

2

smaller the distance is; the better activity the compound might possess.

3 4

Figure 16. Mechanism of Gox2181 catalyzed reduction of ketone substrates.

5

Gox2181 has the typical conserved catalytic tetrad of Asn119-Ser147-Tyr162-Lys166,

6

and the NAD-binding pocket of SDR superfamily. The figure only presents the

7

essential elements required by this study. Catalysis is initiated by the hydrogen bond

8

and a proton transfers between the phenolic oxygen atom of Tyr162 and the keto group

9

of the substrate followed by a hydride transfer from the nicotinamide ring C4 atom of

10

NADH to the substrate.

11

ACS Paragon Plus Environment

Page 40 of 55

Page 41 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1 2

Figure 17. Substrates of Gox2181.

3 4

Figure 18. Control group compounds of Gox2181 (Non-substrates).

5 6

Thirteen substrates and seven non-substrates for Gox2181 were obtained from

7

literature (Figure 17 and 18).70 As with Nit6803, each compound was docked 1000

8

times against enzyme Gox2181. The distances between the nicotinamide ring C4 atom

9

of co-enzyme and the hydroxy/keto/aldehyde oxygen atom of the substrates were

10

extracted from 20000 conformations and plotted with KDE in Figure 19. The KDE

11

plots revealed a notable difference between the distribution of substrates and

12

non-substrates, where the substrates tend to have a centralized distribution within the

13

range of about 3Å to 5 Å, meanwhile the control group (non-substrates) mainly

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

dominate the space of longer distances (about 4.5 Å to 6.5 Å). Based on this difference,

2

an effective model of binary classification to separate these substrates could be

3

deduced. For instance, a threshold of 3.2 Å, which is the sum of the vdw radius of

4

oxygen (1.7 Å) and carbon (1.5 Å) atoms, could accomplish a correct rate of 95%

5

(19/20) in the current data set. As same with Nit6803, the threshold is only an optima

6

estimated from a relatively small group of candidates, further investigations upon other

7

substrates of Gox2181 will also need adjustment as the case may be.

8

9 10

Figure 19. The plots of kernel density estimation for substrates and

11

non-substrates of Gox2181. The chart represents the KDE plots of substrates (A) and

12

non-substrates (B) in Gox2181. The figures were generated by pandas v17.067 using

ACS Paragon Plus Environment

Page 42 of 55

Page 43 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

Gaussian kernel density estimation algorithm.68 In above figures, the x-axis represents

2

the distances between the nicotinamide ring C4 atom of co-enzyme and the

3

hydroxy/keto/aldehyde oxygen atom of the substrate in the 1000 conformations of

4

each compound’s docking results, while the y-axis reflects the corresponding

5

probability of the distances by their kernel density. The red vertical line marks the 3.2 Å

6

threshold. In substrate 4 and 5, only the secondary alcohols were investigated and

7

plotted in the figure. Analogously, in substrate 10 and 11 and non-substrate 4 only the

8

keto group were investigated. Only one plot was kept in the figure for the compounds

9

who possess symmetrical alcohol/keto/aldehyde groups.

10

Conclusions

11

In summary, we have presented an in silico method that could predict enzyme

12

substrates. Based on this method, we developed an open-source software platform

13

called THEMIS, which could meet the requirements for applicability in enzyme

14

research. We tested the ability of this method to differentiate between substrates and

15

non-substrates of a lipase, CALB, using consistently measured experimental data and

16

carefully discussed all aspects of the substrate prediction. Furthermore, its extensive

17

applicability was successfully demonstrated by two other enzymes, Nit6803 and

18

Gox2181. We expect that our method will help accelerate enzyme development by

19

reducing the heavy costs of experimental screening.

20

In this study, molecular docking is the core of the overall method and occupies the

21

majority of the computational time. Although the GOLD docking suite was used in this

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

study, our protocol is quite generalizable. The protocol could easily be reprogramed to

2

operate with any other docking program, as long as they provide constraint functions.

3

The main advantage of molecular docking is the high efficiency of sampling for

4

different ligand and enzyme complex conformations. With modern computer

5

capability, a standard docking only takes several minutes or less. Among the existing

6

computational methods, it is currently the only available solution for high-throughput

7

screening of substrates. Compared to other methods such as MD simulation, the

8

accuracy of molecular docking may be lower due to the simplified score function and

9

the sampling algorithm. However, the heavy demand for computational resources

10

makes MD, as well as QM simulation, relatively difficult to use in practical

11

experimental design. The complexity of MD/QM simulation makes them unsuitable to

12

be fully accelerated by high performance computing platforms because of restrictions

13

in network communication. In contrast, molecular docking is capable of performing

14

large-scale parallel computation against millions of compounds on thousands of CPU

15

cores. In this study, we consider everyday experimental usability for as the first

16

priority. Thus, MBRD combined with exhaustive sampling and an MM score function

17

was used. To further improve the usability of our method, the whole workflow has been

18

coded into utilities and scripts written in Python/C. Except for the mechanistic analysis,

19

all the predictions with massive candidate compounds were able to be accomplished

20

automatically.

ACS Paragon Plus Environment

Page 44 of 55

Page 45 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

Overall, our results suggest that predicting enzyme substrates is promising in

2

computational biology and opens a new door to practical in silico screening of enzyme

3

substrates. Based on our method, further tests can be applied with a larger range of

4

enzyme targets that have known structures and mechanisms. For example, other lipases

5

can be studied with similar score functions and very few changes in settings. For other

6

enzymes, the definitions of restrictions and assessment criteria from the abstraction of a

7

totally different enzyme mechanism is an inconvenience process.

8

Although our results demonstrate that in silico screening of substrates could be

9

accomplished by a carefully designed workflow and score function, there is still a

10

significant gap between the activities of our prediction and experimental data.

11

Narrowing this gap presents an exciting prospect for future work.

12 13

Materials and methods

14

Preparation of the Free Enzyme

15

The crystal structure of CALB was obtained from PDB database. PDB code 1TCA

16

was used as the starting structure in this study, which has no significant rearrangement

17

but higher resolution than other structures. Hydrogen atoms were added and two

18

N-acetyl-D-glucosamine (NAG) moieties along with all water molecules were

19

removed. The structure of Nit6803 (PDB ID: 3WUY) and Gox2181 (PDB ID: 3AWD)

20

were prepared with identical operations. PDB code 1LBS, which is a complex structure

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

of CALB and a substrate analogue, was used to confirm catalytic residues and binding

2

pocket. The catalytic amino acid of CALB, His224, was defined as protonated.

3

Preparation of the Potential Substrates

4

A total number of 94 candidate substrates for CALB and 24 candidate substrates for

5

Nit6803 were gathered from literatures, which contain both substrates and non-reactive

6

compounds that were experimentally proved. Three-dimension structures were built

7

with ChemOffice suite v12. The structures were energy minimized and stored

8

separately (Mol2 file format). The detailed information and structures of all candidate

9

substrates are supplied in the supporting information (Table S1 and File S3). The

10

substrates of Nit6803 and Gox2181 were prepared using the same method.

11

MBRD

12

CCDC GOLD v4.12 was used for molecular docking. The enzyme structures were

13

prepared with GOLD HERMES. Residues within 12 Å range of the hydroxyl oxygen

14

atom of Ser105 of CALB were defined as the binding site and residues within the same

15

range of ε-amino nitrogen of Lys135 were used in nitrilase Nit6803. The configuration

16

template for docking are supplied in the supporting information (Text S7). Early

17

termination was turned off, thereby all conformations were saved at the end of docking.

18

Gold-score was used as the score function of docking. GA parameter was set to very

19

flexible. The script file prepare.py (File S4) was applied in the pre-screening, where

20

docking configuration files were automatically generated based on the template. The

ACS Paragon Plus Environment

Page 46 of 55

Page 47 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

script file do_docking_mpi.py (File S4) was used to execute docking computations

2

over cluster nodes.

3

Data Collection and Processing

4

With the help of utilities of THEMIS, geometric information were automatically

5

extracted from docking results. All data mining tasks were achieved by the script file

6

score.py (File S4), which does traversal visit in the docking results. In this study, Origin

7

v 9.0 and Matplotlib were used to analyzed and plot data.

8

Supporting Information.

9

All compounds used for substrate screening of CALB in both 2D and 3D format and

10

their original references (Table S1 and File S3), complete table for the validation of

11

MBRD (Table S2). THEMIS software with demo (File S4). Input file for pre-process

12

of ester, cyano group, hydroxy and keto groups (Text S5). Template for Gold docking

13

configuration (Text S6). This material is available free of charge via the Internet at

14

http://pubs.acs.org.

15 16 17 18 19 20 21 22 23

Acknowledgments Authors Zhiqiang Yao, Lujia Zhang, Bei Gao, Dongbing cui, Fengqing Wang and Dongzhi Wei received funding from National Natural Science Foundation of China (No. 31571786), National Basic Research Program of China (973) (No. 2012CB721003), National High Technology Research and Development Program of China (No. 2012AA020403), Natural Science Foundation of Shanghai( No.16ZR1449500) , Open Funding Project of the State Key Laboratory of Bioreactor Engineering and Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase).

24 25

References:

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

1. Benkovic, S. J.; Hammes-Schiffer, S., A Perspective on Enzyme Catalysis.

2

Science 2003, 301, 1196-1202.

3

2. Garcia-Viloca, M.; Gao, J.; Karplus, M.; Truhlar, D. G., How Enzymes Work:

4 5

Analysis by Modern Rate Theory and Computer Simulations. Science 2004, 303, 186-195.

6

3. Wolfenden, R.; Snider, M. J., The Depth of Chemical Time and the Power of

7

Enzymes as Catalysts. Acc. Chem. Res. 2001, 34, 938-945.

8

4. Bolon, D. N.; Voigt, C. A.; Mayo, S. L., De Novo Design of Biocatalysts. Curr.

9

Opin. Chem. Biol. 2002, 6, 125-129.

10

5. Hilvert, D., Critical Analysis of Antibody Catalysis. Annu. Rev. Biochem. 2000,

11

69, 751-793.

12

6. Release Note of Refseq V76. http://www.ncbi.nlm.nih.gov/refseq/ (accessed May

13

16, 2016).

14

7. Braiuca, P.; Ebert, C.; Basso, A.; Linda, P.; Gardossi, L., Computational Methods

15 16

to Rationalize Experimental Strategies in Biocatalysis. Trends Biotechnol. 2006, 24, 419-425.

17

8. Dwyer, M. A.; Looger, L. L.; Hellinga, H. W., Computational Design of a

18

Biologically Active Enzyme. Science 2004, 304, 1967-1971.

19

9. Kazlauskas, R. J., Molecular Modeling and Biocatalysis: Explanations,

20

Predictions, Limitations, and Opportunities. Curr. Opin. Chem. Biol. 2000, 4, 81-88.

21

10. Shoichet, B. K., Virtual Screening of Chemical Libraries. Nature 2004, 432,

22

862-866.

23

11. Hermann, J. C.; Marti-Arbona, R.; Fedorov, A. a.; Fedorov, E.; Almo, S. C.;

24 25

Shoichet, B. K.; Raushel, F. M., Structure-Based Activity Prediction for an Enzyme of Unknown Function. Nature 2007, 448, 775-779.

26

12. Zhu, H.; Snyder, M., Protein Chip Technology. Curr. Opin. Chem. Biol. 2003, 7,

27

55-63.

28

13. Korkegian, A.; Black, M. E.; Baker, D.; Stoddard, B. L., Computational

29

Thermostabilization of an Enzyme. Science 2005, 308, 857-860.

ACS Paragon Plus Environment

Page 48 of 55

Page 49 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

14. Jiang, L.; Althoff, E. A.; Clemente, F. R.; Doyle, L.; Röthlisberger, D.;

2 3

Zanghellini, A.; Gallaher, J. L.; Betker, J. L.; Tanaka, F.; Barbas, C. F., De Novo Computational Design of Retro-Aldol Enzymes. Science 2008, 319, 1387-1391.

4

15. Lutz, S., Reengineering Enzymes. Science 2010, 329, 285-287.

5

16. Tyagi, S.; Pleiss, J., Biochemical Profiling in Silico Predicting Substrate

6

Specificities of Large Enzyme Families. J. Biotechnol. 2006, 124, 108-116.

7

17. Clark, M.; Meshkat, S.; Talbot, G. T.; Carnevali, P.; Wiseman, J. S.,

8 9

Fragment-Based Computation of Binding Free Energies by Systematic Sampling. J. Chem. Inf. Model. 2009, 49, 1901-1913.

10

18. Bornscheuer, U. T., Methods to Increase Enantioselectivity of Lipases and

11

Esterases. Curr. Opin. Biotechnol. 2002, 13, 543-547.

12

19. Berglund, P.; Christiernin, M.; Hedenström, E. Enantiorecognition of Chiral

13

Acids by Candida Rugosa Lipase: Two Substrate Binding Modes Evidenced in

14

an Organic Medium. In ACS Symp. Ser., 2001; 2001; Vol. 776; pp 263-273.

15

20. Gao, J.; Ma, S.; Major, D. T.; Nam, K.; Pu, J.; Truhlar, D. G., Mechanisms and

16

Free Energies of Enzymatic Reactions. Chem. Rev. 2006, 106, 3188-3209.

17

21. Kraut, J., How Do Enzymes Work? Science 1988, 242, 533-540.

18

22. König, G.; Hudson, P. S.; Boresch, S.; Woodcock, H. L., Multiscale Free Energy

19 20 21

Simulations: An Efficient Method for Connecting Classical Md Simulations to Qm or Qm/Mm Free Energies Using Non-Boltzmann Bennett Reweighting Schemes. J. Chem. Theory Comput. 2014, 10, 1406-1419.

22

23. Uppenberg, J.; Hansen, M. T.; Patkar, S.; Jones, T. A., The Sequence, Crystal

23 24

Structure Determination and Refinement of Two Crystal Forms of Lipase B from Candida Antarctica. Structure 1994, 2, 293-308.

25

24. Uppenberg, J.; Ohrner, N.; Norin, M.; Hult, K.; Kleywegt, G. J.; Patkar, S.;

26 27 28

Waagen, V.; Anthonsen, T.; Jones, T. a., Crystallographic and Molecular-Modeling Studies of Lipase B from Candida Antarctica Reveal a Stereospecificity Pocket for Secondary Alcohols. Biochemistry 1995, 34, 16838-16888.

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

25. Raza, S.; Fransson, L.; Hult, K., Enantioselectivity in Candida Antarctica Lipase

2

B: A Molecular Dynamics Study. Protein Sci. 2001, 10, 329-366.

3

26. Léonard, V.; Fransson, L.; Lamare, S.; Hult, K.; Graber, M., A Water Molecule

4 5

in the Stereospecificity Pocket of Candida Antarctica Lipase B Enhances Enantioselectivity Towards Pentan-2-Ol. ChemBioChem 2007, 8, 662-668.

6

27. Conde, S.; López-Serrano, P.; Martı́nez, A., Candida Antarctica Lipase B

7 8

Catalysed Amidation of Pyroglutamic Acid Derivatives. A Reaction Survey. J. Mol. Catal. B: Enzym. 1999, 7, 299-306.

9

28. Hansen, T. V.; Waagen, V.; Partali, V.; Anthonsen, H. W.; Anthonsen, T.,

10 11 12

Co-Solvent Enhancement of Enantioselectivity in Lipase-Catalysed Hydrolysis of Racemic Esters. A Process for Production of Homochiral C-3 Building Blocks Using Lipase B from Candida Antarctica. Tetrahedron: Asymmetry 1995, 6, 499-504.

13

29. Lou, W.-Y.; Zong, M.-H.; Liu, Y.-Y.; Wang, J.-F., Efficient Enantioselective

14 15 16

Hydrolysis of D,L-Phenylglycine Methyl Ester Catalyzed by Immobilized Candida Antarctica Lipase B in Ionic Liquid Containing Systems. J. Biotechnol. 2006, 125, 64-74.

17

30. Garcia-Urdiales, E.; Rebolledo, F.; Gotor, V., Kinetic Resolution of (±)‐1‐

18 19

Phenylbutan‐1‐Ol by Means of Calb‐Catalyzed Aminolyses: A Study on the Role of the Amine in the Alcohol Resolution. Adv. Synth. Catal. 2001, 343, 646-654.

20

31. Cuiper, A. D.; Kouwijzer, M. L.; Grootenhuis, P. D.; Kellogg, R. M.; Feringa, B.

21 22 23

L., Kinetic Resolutions and Enantioselective Transformations of 5-(Acyloxy) Pyrrolinones Using Candida Antarctica Lipase B: Synthetic and Structural Aspects. J. Org. Chem. 1999, 9529-9537.

24

32. Garcı́a-Urdiales, E.; Rebolledo, F.; Gotor, V., Enzymatic One-Pot Resolution of

25 26

Two Nucleophiles: Alcohol and Amine. Tetrahedron: Asymmetry 2000, 11, 1459-1463.

27

33. Jacobsen, E. E.; Hoff, B. H.; Anthonsen, T., Enantiopure Derivatives of

28 29

1,2-Alkanediols: Substrate Requirements of Lipase B from Candida Antarctica. Chirality 2000, 12, 654-662.

30

34. Ottosson, J.; Hult, K., Influence of Acyl Chain Length on the Enantioselectivity

31 32

of Candida Antarctica Lipase B and Its Thermodynamic Components in Kinetic Resolution of Sec-Alcohols. J. Mol. Catal. B: Enzym. 2001, 11, 1025-1028.

ACS Paragon Plus Environment

Page 50 of 55

Page 51 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

35. Öhrner, N.; Orrenius, C.; Mattson, A.; Norin, T.; Hult, K., Kinetic Resolutions of

2 3

Amine and Thiol Analogues of Secondary Alcohols Catalyzed by the Candida Antarctica Lipase B. Enzyme Microb. Technol. 1996, 19, 328-331.

4

36. Palomo, J. M.; Mateo, C.; Ferna, G.; Solares, L. F.; Diaz, M.; Bayod, M.; Gotor,

5 6 7 8

V.; Guisan, M., Resolution of (±)-5-Substituted-6-(5-Chloropyridin-2-Yl)-7-Oxo-5,6-Dihydropyrrolo[3,4b]Pyrazine Derivatives-Precursors of (S)-(+)-Zopiclone, Catalyzed by Immobilized Candida Antarctica B Lipase in Aqueous Media. Tetrahedron: Asymmetry 2003, 14, 429-438.

9

37. Lee, Y.; Hong, J.; Jeon, N., Highly Enantioselective Acylation of Rac-Alkyl

10 11

Lactates Using Candida Antarctica Lipase B. Org. Process Res. Dev. 2004, 8, 948-951.

12

38. Thodi,

13 14 15

Constantinou-Kokotou, V.; Moutevelis-Minakakis, P.; Kokotos, G., Study of the Removal of Allyl Esters by Candida Antarctica Lipase B (Cal-B) and Pig Liver Esterase (Ple). J. Mol. Catal. B: Enzym. 2009, 61, 241-246.

16

39. Chen, B.; Hu, J.; Miller, E. M.; Xie, W.; Cai, M.; Gross, R. a., Candida

17 18

Antarctica Lipase B Chemically Immobilized on Epoxy-Activated Micro- and Nanobeads: Catalysts for Polyester Synthesis. Biomacromolecules 2008, 9, 463-533.

19

40. Hedstrom, L., Serine Protease Mechanism and Specificity. Biochem. Rev. 2002,

20

102, 4501-4524.

21

41. Ishida, T.; Kato, S., Theoretical Perspectives on the Reaction Mechanism of

22 23

Serine Proteases:  The Reaction Free Energy Profiles of the Acylation Process. J. Am. Chem. Soc. 2003, 125, 12035-12048.

24

42. Fischer, E., Synthesen in Der Zuckergruppe Ii. Ber. Dtsch. Chem. Ges. 1894, 27,

25

3189–3232.

26

43. Armstrong, E. F., Enzymes. J. Chem. Technol. Biotechnol. 1930, 49, 919–920.

27

44. Díaz, N.; Suárez, D.; Suárez, E., Kinetic and Binding Effects in Peptide Substrate

28 29

Selectivity of Matrix Metalloproteinase-2: Molecular Dynamics and Qm/Mm Calculations. Proteins: Struct., Funct., Bioinf. 2010, 78, 1-11.

K.;

Barbayianni,

E.;

Fotakopoulou,

I.;

ACS Paragon Plus Environment

Bornscheuer,

U.

T.;

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

45. Balaji, B.; Ramanathan, M., Prediction of Estrogen Receptor Beta Ligands

2 3

Potency and Selectivity by Docking and Mm-Gbsa Scoring Methods Using Three Different Scaffolds. J. Enzyme Inhib. Med. Chem. 2012, 27, 832-875.

4

46. Kitchen, D. B.; Decornez, H.; Furr, J. R.; Bajorath, J., Docking and Scoring in

5 6

Virtual Screening for Drug Discovery: Methods and Applications. Nat. Rev. Drug Discovery 2004, 3, 935-949.

7

47. García‐Urdiales, E., Kinetic Resolution of (±)‐1‐Phenylbutan‐1‐Ol by Means of

8 9

Calb‐Catalyzed Aminolyses: A Study on the Role of the Amine in the Alcohol Resolution. Adv. Synth. Catal. 2001, 343, 646-654.

10

48. Scholz, C.; Knorr, S.; Hamacher, K.; Schmidt, B., Docktite—a Highly Versatile

11 12

Step-by-Step Workflow for Covalent Docking and Virtual Screening in the Molecular Operating Environment. J. Chem. Inf. Model. 2015, 55, 398-406.

13

49. Dong, G. Q.; Calhoun, S.; Fan, H.; Kalyanaraman, C.; Branch, M. C.;

14 15 16

Mashiyama, S. T.; London, N.; Jacobson, M. P.; Babbitt, P. C.; Shoichet, B. K.; Armstrong, R. N.; Sali, A., Prediction of Substrates for Glutathione Transferases by Covalent Docking. J. Chem. Inf. Model. 2014, 54, 1687-1785.

17

50. London, N.; Farelli, J. D.; Brown, S. D.; Liu, C.; Huang, H.; Korczynska, M.;

18 19 20

Al-Obaidi, N. F.; Babbitt, P. C.; Almo, S. C.; Allen, K. N.; Shoichet, B. K., Covalent Docking Predicts Substrates for Haloalkanoate Dehalogenase Superfamily Phosphatases. Biochemistry 2015, 54, 528-537.

21

51. Olsen, L.; Oostenbrink, C.; Jorgensen, F. S., Prediction of Cytochrome P450

22

Mediated Metabolism. Adv. Drug Delivery Rev. 2015, 86, 61-71.

23

52. Klepsch, F.; Chiba, P.; Ecker, G. F., Exhaustive Sampling of Docking Poses

24 25

Reveals Binding Hypotheses for Propafenone Type Inhibitors of P-Glycoprotein. PLoS Comput. Biol. 2011, 7, 1-13.

26

53. Kumalo, H. M.; Bhakat, S.; Soliman, M. E., Theory and Applications of Covalent

27

Docking in Drug Discovery: Merits and Pitfalls. Molecules 2015, 20, 1984-2000.

28

54. Tomić, S.; Bertoša, B.; Kojić-Prodić, B.; Kolosvary, I., Stereoselectivity of

29 30

Burkholderia Cepacia Lipase Towards Secondary Alcohols: Molecular Modelling and 3d Qsar Approach. Tetrahedron: Asymmetry 2004, 15, 1163-1172.

ACS Paragon Plus Environment

Page 52 of 55

Page 53 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

55. Schulz, T.; Pleiss, J.; Schmid, R. D., Stereoselectivity of Pseudomonas Cepacia

2 3

Lipase toward Secondary Alcohols: A Quantitative Model. Protein Sci. 2000, 9, 1053-1062.

4

56. Kahlow, U. H. M.; Schmid, R. D.; Pleiss, J., A Model of the Pressure

5 6

Dependence of the Enantioselectivity of Candida Rugosa Lipase Towards (±)-Menthol. Protein Sci. 2001, 10, 1942-1952.

7

57. Kazlauskas, R. J.; Weissfloch, A. N.; Rappaport, A. T.; Cuccia, L. A.,

8 9 10

Enantiomer of a Secondary Alcohol Reacts Faster in Reactions Catalyzed by Cholesterol Esterase, Lipase from Pseudomonas Cepacia, and Lipase from Candida Rugosa. J. Org. Chem. 1991, 2656-2665.

11

58. Ji, L.; Xiaoling, T.; Hongwei, Y., Prediction of the Enantioselectivity of Lipases

12 13

and Esterases by Molecular Docking Method with Modified Force Field Parameters. Biotechnol Bioeng 2010, 105, 687-696.

14

59. Mantina, M.; Chamberlin, A. C.; Valero, R.; Cramer, C. J.; Truhlar, D. G.,

15 16

Consistent Van Der Waals Radii for the Whole Main Group. J. Phys. Chem. A 2009, 113, 5806-5817.

17

60. Verdonk, M. L.; Cole, J. C.; Hartshorn, M. J.; Murray, C. W.; Taylor, R. D.,

18

Improved Protein-Ligand Docking Using Gold. Proteins 2003, 52, 609-631.

19

61. Bunch, A. W. Nitriles. In Biotechnology Set; Wiley-VCH Verlag GmbH:

20

2008, pp 277-325.

21

62. Zhang, L.; Yin, B.; Wang, C.; Jiang, S.; Wang, H.; Yuan, Y. A.; Wei, D.,

22 23 24

Structural Insights into Enzymatic Activity and Substrate Specificity Determination by a Single Amino Acid in Nitrilase from Syechocystis Sp. Pcc6803. J. Struct. Biol. 2014, 188, 93-101.

25

63. Raczynska, J. E.; Vorgias, C. E.; Antranikian, G.; Rypniewski, W.,

26 27

Crystallographic Analysis of a Thermoactive Nitrilase. J Struct Biol 2011, 173, 294-302.

28

64. Fernandes, B. C. M.; Mateo, C.; Kiziak, C.; Chmura, A.; Wacker, J.; van

29 30

Rantwijk, F.; Stolz, A.; Sheldon, R. A., Nitrile Hydratase Activity of a Recombinant Nitrilase. Adv. Synth. Catal. 2006, 348, 2597-2603.

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

65. Heinemann, U.; Engels, D.; Bürger, S.; Kiziak, C.; Mattes, R.; Stolz, A., Cloning

2 3 4

of a Nitrilase Gene from the Cyanobacterium Synechocystis Sp. Strain Pcc6803 and Heterologous Expression and Characterization of the Encoded Protein. Appl. Environ. Microbiol. 2003, 69, 4359-4366.

5

66. Scott, D. W., Multivariate Density Estimation: Theory, Practice, and

6

Visualization. John Wiley & Sons: 2015.

7

67. McKinney, W., Python for Data Analysis: Data Wrangling with Pandas, Numpy,

8

and Ipython. O'Reilly Media, Inc.: 2012.

9

68. Bashtannyk, D. M.; Hyndman, R. J., Bandwidth Selection for Kernel Conditional

10

Density Estimation. Computational Statistics & Data Analysis 2001, 36, 279-298.

11

69. Kallberg, Y.; Oppermann, U.; Persson, B., Classification of the Short‐Chain

12 13

Dehydrogenase/Reductase Superfamily Using Hidden Markov Models. FEBS J. 2010, 277, 2375-2386.

14

70. Liu, X.; Yuan, Z.; Yuan, Y. A.; Lin, J.; Wei, D., Biochemical and Structural

15 16

Analysis of Gox2181, a New Member of the Sdr Superfamily from Gluconobacter Oxydans. Biochem. Biophys. Res. Commun. 2011, 415, 410-415.

17

71. Carius, Y.; Christian, H.; Faust, A.; Zander, U.; Klink, B. U.; Kornberger, P.;

18 19 20

Kohring, G.-W.; Giffhorn, F.; Scheidig, A. J., Structural Insight into Substrate Differentiation of the Sugar-Metabolizing Enzyme Galactitol Dehydrogenase from Rhodobacter Sphaeroides D. J. Biol. Chem. 2010, 285, 20006-20014.

21

ACS Paragon Plus Environment

Page 54 of 55

Page 55 of 55

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

2

ACS Paragon Plus Environment