ADMET Evaluation in Drug Discovery. 16 ... - ACS Publications

Jul 5, 2016 - prediction of hERG liability in the early stages of drug design is quite ... reliable theoretical models for the prediction of potential...
2 downloads 0 Views 2MB Size
Subscriber access provided by La Trobe University Library

Article

Predicting hERG blockers by combining multiple pharmacophores and machine learning approaches Shuangquan Wang, Huiyong Sun, Hui Liu, Dan Li, Youyong Li, and Tingjun Hou Mol. Pharmaceutics, Just Accepted Manuscript • DOI: 10.1021/acs.molpharmaceut.6b00471 • Publication Date (Web): 05 Jul 2016 Downloaded from http://pubs.acs.org on July 7, 2016

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Molecular Pharmaceutics is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 30

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Molecular Pharmaceutics

1

Predicting hERG blockers by combining multiple

2

pharmacophores and machine learning approaches

3

Shuangquan Wang1, Huiyong Sun1, Hui Liu1, Dan Li1, Youyong Li3, Tingjun Hou1,2* 1

4

College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China

5 6

2

State Key Lab of CAD&CG, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China

7 8

3

Institute of Functional Nano & Soft Materials (FUNSOM), Soochow University, Suzhou, Jiangsu 215123, China

9 10 11 12

Corresponding author:

13

Tingjun Hou

14

E-mail: [email protected]

15

Tel: +86-571-88208412

16 17

Keywords: hERG; ADMET; Pharmacophore; Machine learning; Support vector

18

machine; Recursive partitioning; SVM

19

1

ACS Paragon Plus Environment

Molecular Pharmaceutics

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

20 21

Page 2 of 30

For Table of Contents Use Only

SVM

NBC 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 2

ACS Paragon Plus Environment

Page 3 of 30

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Molecular Pharmaceutics

46

Abstract

47

Blockade of human ether-à-go-go-related gene (hERG) channel by compounds may

48

lead to drug-induced QT prolongation, arrhythmia and Torsades de Pointes (TdP), and

49

therefore reliable prediction of hERG liability in the early stages of drug design is

50

quite important to reduce the risk of cardiotoxicity-related attritions in the later

51

development stages. In this study, pharmacophore modeling and machine learning

52

approaches were combined to construct classification models to distinguish hERG

53

active from inactive compounds based on a diverse dataset. First, an optimal ensemble

54

of pharmacophore hypotheses that had good capability to differentiate hERG active

55

from inactive compounds was identified by the recursive partitioning (RP) approach.

56

Then, the naïve Bayesian classification (NBC) and support vector machine (SVM)

57

approaches were employed to construct classification models by integrating multiple

58

important pharmacophore hypotheses. The integrated classification models showed

59

improved predictive capability than any single pharmacophore hypothesis, suggesting

60

that the broad binding poly-specificity of hERG can only be well characterized by

61

multiple pharmacophores. The best SVM model achieved the prediction accuracies of

62

84.7% for the training set and 82.1% for the external test set. Notably, the accuracies

63

for the hERG blockers and nonblockers in the test set reached 83.6% and 78.2%,

64

respectively. Analysis of significant pharmacophores helps to understand the

65

multi-mechanisms of actions of hERG blockers. We believe that the combination of

66

pharmacophore modeling and SVM is a powerful strategy to develop reliable

67

theoretical models for the prediction of potential hERG liability.

68 69 70 71 72 73 74 75 76

3

ACS Paragon Plus Environment

Molecular Pharmaceutics

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 4 of 30

77 78

Introduction

79

During cardiac depolarization and repolarization, a voltage-gated potassium channel

80

encoded by the human ether-à-go-go related gene (hERG or Kv11.1) plays a major

81

role in the regulation of the exchange of cardiac action potential and resting potential.1,

82

2

83

transmembrane helices (S1-S6). Some critical residues for the binding of hERG

84

blockers, such as Tyr652, Phe656 and V659 located in S6, have been confirmed by a

85

series of mutagenesis studies,3-5 but unfortunately, the whole crystal structure of the

86

hERG channel has not been solved currently.

The structure of the hERG channel is a homo-tetramer, and each subunit contains six

87

The hERG blockade may cause long QT syndrome (LQTS), arrhythmia and

88

Torsade de Pointes (TdP), which lead to palpitations, fainting or even sudden death.6-8

89

However, up to date, several important drugs, such as terfenadine, astemizole,

90

cisapride, vardenafil and ziprasidone, have been withdrawn from the market or

91

severely

restricted 5, 9, 10

in

availability

due

to

their

92

cardiotoxicity.

93

a important step in drug design/discovery pipeline.6, 11

undesirable

hERG-related

Therefore, assessment of hERG-related cardiotoxicity has become

94

As the fact that hERG assays and QT animal studies are time-consuming and

95

expensive, development of reliable and robust in silico models to predict potential

96

hERG liability become quite important. In the past decade, a wide range of

97

quantitative structure-activity relationship (QSAR) models for hERG liability have

98

been reported using various machine learning approaches, such as k-nearest neighbor

99

algorithm (kNN), artificial neural networks (ANN), random forest (RF), support

100

vector machine (SVM), self-organizing mapping (SOM), recursive partitioning (RP),

101

genetic algorithm (GA), naïve Bayesian classification (NBC), etc.5, 12-17 A majority of

102

the QSAR models are classifiers while only a few of them are quantitative regression

103

models. Although most QSAR models showed satisfactory predictions for the training

104

sets, these models were usually developed based on relatively small datasets, and thus

105

their chemical coverage and extrapolability are limited. Meanwhile, because hERG

106

blockade can be measured by various experimental techniques, it is possible that some

107

available datasets are not reliable when the data from different experimental sources

108

or protocols are mixed together. Moreover, most available QSAR models cannot be

109

interpreted easily due to the complexity of machine learning approaches. Alternatively,

110

structure-based approaches, including homology modeling, molecular docking, 4

ACS Paragon Plus Environment

Page 5 of 30

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Molecular Pharmaceutics

111

molecular dynamics (MD) simulation and free energy calculation, have been

112

employed to study hERG-drug interactions.18 All structure-based studies were

113

performed on homology model because the hERG channel has not yet been

114

crystallized. Unfortunately, the sequence similarity between hERG and the templates

115

(KscA, KvaP, MthK, Kir2.2 and Kv1.2) is quite low,5 and therefore the homology

116

model of hERG may not be reliable and thus the receptor-based modeling for hERG

117

blockade is interpretative rather than predictive. To date, pharmacophore modeling has been employed to predict hERG blockade.5,

118 119

16

120

blockers, which contained four hydrophobic features and one positive ionizable

121

moiety.19 Aronov proposed two five-point pharmacophore hypotheses based on a

122

dataset of 194 uncharged molecules, and the two hypotheses could correctly classify

123

78% and 69% of the hERG blockers in the dataset.20 Leong constructed three

124

pharmacophore hypotheses based on 26 molecules, and then developed a SVM

125

regression model by integrating the three hypotheses of the model to predict hERG

126

liability, which exhibited an r2 value of 0.94 for the 13 tested molecules.21 Garg and

127

coworkers developed a number of pharmacophore hypotheses based on a training set

128

of 44 molecules, but the best hypothesis with three features did not show acceptable

129

predictive capability to the 12 tested molecules.22 Durdagi et al. created 44

130

pharmacophore hypotheses based on 31 hERG blockers with diverse binding affinities,

131

and the best model showed an average deviation of 0.29 (pIC50) for the 14 hERG

132

blockers in the test set.23 Tan et al. proposed 47 pharmacophore models based on a

133

training set of 53 molecules, whereas the five most representative models with high

134

accuracy to the training set (r2=0.914~0.941) did not show acceptable performance to

135

the test set (q2=0.280~0.370).24 Kratz et al. developed seven complementary

136

pharmacophore models by using Catalyst and LigandScout. The best Catalyst model

137

could identify 14 out of 18 blockers in the training set and 8 out of 19 blockers in the

138

test set, and six LigandScout models could identify 84.9% of the true blockers

139

(ranging from 8 to 30% for individual models) with few false positives.25 In summary,

140

the pharmacophore hypotheses reported in the previous studies were usually

141

generated by a limited number of molecules. Moreover, compared with traditional

142

QSAR models, the pharmacophore models for hERG blockade usually exhibit worse

143

predictive power. It appears to be hard to establish a common pharmacophore

144

hypothesis to distinguish hERG blockers from nonblockers successfully.

For example, Ekins et al. established a pharmacophore hypothesis for 15 hERG

5

ACS Paragon Plus Environment

Molecular Pharmaceutics

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

145

It is well known that, many proteins relevant to ADMET (absorption, distribution,

146

metabolism, excretion and toxicity), such as hERG, usually have large and flexible

147

binding pockets, and they can recognize and bind a variety of structural diverse

148

compounds.26, 27 Such complicated binding patterns between ADMET-related proteins

149

and their ligands cannot be well characterized by a single pharmacophore hypothesis,

150

and then the combination of multiple pharmacophore hypotheses, also referred to as

151

pharmacophore ensemble, may be an effective strategy to give reliable prediction for

152

some ADMET properties.28-31

153

In this study, in order to characterize the broad binding poly-specificity of hERG, a

154

large number of pharmacophore hypotheses were established, and the important

155

hypotheses were identified by Information Gain in the RP approach. Subsequently, the

156

NBC and SVM approaches were employed to develop the classifiers by integrating

157

multiple important pharmacophore hypotheses (Figure 1). Furthermore, the

158

complicated interaction mechanisms of hERG blockage were discussed by the

159

analysis of the important pharmacophores. By combining multiple pharmacophore

160

hypotheses, we tried to develop a reliable and fast platform for the prediction of

161

hERG liability and gain profound insights into the multi-mechanisms of action of

162

hERG blockade.

163 164

Materials and Methods

165

Dataset preparation

166

The total dataset for model construction and validation contains 587 molecules,

167

including 527 molecules with experimental hERG blocking bioactivities (IC50) and 60

168

hERG nonblockers without IC50 derived from our previous study.32 The IC50 activities

169

were mainly determined based on mammalian cell lines, such as HEK, CHO, and

170

COS. IC50 data derived from a non-mammalian cell line, XO (Xenopus laevis

171

oocytes), was included into the dataset when no mammalian cell line data was

172

available. If a molecule has multiple different experimental values, it would be

173

compared with the entry found in the PubChem’s BioAssay database and the

174

consistent experimental value was adopted.33 Moreover, a threshold of 40 µM was

175

used to define hERG blockers and nonblockers, where molecules with IC50 < 40 µM

176

were regarded as blockers and others were regarded as nonblockers.34, 35 The purpose

177

of this study is to develop qualitative classifiers rather than quantitative regression 6

ACS Paragon Plus Environment

Page 6 of 30

Page 7 of 30

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Molecular Pharmaceutics

178

models, and therefore some noise of the experimental data may be tolerated.

179

The whole dataset was divided into training set (392) and test set (195) with the

180

ratio of 2:1. The 352 molecules with IC50 values and 40 nonblockers without IC50 in

181

the training set were extracted from the whole dataset by using the Find Diverse

182

Molecules protocol in Discovery Studio 2.5 (DS 2.5) respectively,36 which guarantees

183

that the selected molecules in the training set have the largest diversity evaluated by

184

the Tanimoto coefficients based on the LCFP_8 fingerprints and logP.32,

185

pipeline of the dataset partition is shown in Figure S1 in the Supporting Information.

186

To give a comparison, 352 bioactive molecules and 40 nonblockers were randomly

187

selected from the whole dataset to construct the training set, and the remaining

188

molecules were used as the test set. The two training sets constructed by structural

189

diversity and random selection were referred to as training sets I and II, and the

190

corresponding test sets were named test sets I and II. Additionally, to validate the

191

prediction performance of models, 68 molecules reported by Li et al. were employed

192

as the external validation set.35

37

The

193 194

Pharmacophore generation

195

Pharmacophore hypotheses were derived by using the 3D QSAR Pharmacophore

196

Generation protocol in DS2.5.36, 38 In the generation of pharmacophore hypotheses,

197

IC50 was converted to pIC50. In order to generate diverse pharmacophore hypotheses

198

to characterize the complicated binding patterns of the hERG blockers, the following

199

strategy was employed. First, 40 molecules with experimental hERG activities were

200

randomly chosen from the training set, and they were used to generate pharmacophore

201

hypotheses. The above pharmacophore building process was repeated 200 times, and

202

then diverse pharmacophore hypotheses could be generated.

203

The low-energy conformations of each molecule were generated by the FAST

204

algorithm implemented in the Conformation Generation protocol in DS2.5.36 The

205

maximum number of conformations per molecule was limited to 255, and the

206

maximum energy threshold was set to 20 kcal/mol. The maximum number of the

207

features in each hypothesis was set to 5, and five types of pharmacophoric features

208

were used, including hydrogen-bond acceptor (HBA), hydrogen-bond donor (HBD),

209

hydrophobic center (HP), positive ionizable center (PI), and ring aromatic center (RA).

210

In addition, the excluded volume (EV) feature was used to describe the shape of the

211

binding pocket of hERG, and the maximum number of EV was set to 10. The other 7

ACS Paragon Plus Environment

Molecular Pharmaceutics

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

212

parameters for pharmacophore generation were set to the default values.

213 214

Recursive partitioning analysis

215

For the training set, 200 randomly selected subsets were generated. It should be noted

216

that the pharmacophore hypotheses could not be successfully generated for some

217

certain subsets, and finally 190 and 189 pharmacophore hypotheses were generated

218

for the training sets I and II, respectively. Each hypothesis was labeled by a serial

219

number from 1 to 200 according to the number of the corresponding subset for

220

pharmacophore generation. Each molecule in the training set was mapped onto each

221

pharmacophore by using the Ligand Pharmacophore Mapping protocol in DS2.536

222

and the match of the mapping was measured by a fit score, which can determine how

223

well a molecule is mapped onto a pharmacophore. The higher the fit score, the better

224

the match. If a molecule could not be mapped onto a pharmacophore successfully, a

225

penalty of -1 was assigned. In total, 190 and 189 fit scores were generated for each

226

molecule in the training set I and II, and they were used as the independent variables

227

(X). The response variable Y was set to 1 for blockers and 0 for nonblockers.

228

Then, the RP approach was employed to develop classifiers to distinguish the

229

blockers from nonblockers.32 RP creates a decision tree that strives to continuously

230

classify molecules by splitting them into subsets based on independent properties and

231

terminates when the stopping criteria are satisfied. The decision trees were created by

232

the Create Recursive Partitioning Model protocol in DS2.5.36 The minimum samples

233

per node was set to 7. The Gini index was adopted by the split method in RP models.

234

The 5-fold cross-validation was used to evaluate the statistical significance of each

235

model and the other parameters were set to the default values. In order to evaluate the

236

impact of the maximum tree depth, different RP models were built by changing the

237

tree depth from 2 to 7 (tree depth higher than 7 was not considered because higher

238

level did not change the decision tree anymore).

239 240

Integration of multiple Pharmacophores by SVM and NBC

241

The decision tree splits the compounds into different subsets by grouping them that

242

can be mapped onto specific pharmacophore hypotheses. The pharmacophores chosen

243

by a decision tree are most significant than the others. Moreover, the important

244

pharmacophores chosen by a decision tree are independent of each other, and thus a

245

single tree may account for the multi-mechanisms of action of hERG blockers. 8

ACS Paragon Plus Environment

Page 8 of 30

Page 9 of 30

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Molecular Pharmaceutics

246

Then, the fit values given by these important pharmacophores were used as the

247

independent variables for the SVM and NBC calculations. The effectiveness of SVM

248

and NBC for binary classification has been extensively validated by our previous

249

studies.32, 39-45 SVM is a popular statistical method for classification and regression.

250

The theory of SVM is based on Structural Risk Minimization (SRM) principle that

251

constructs a hyperplane with the largest distances to the nearest training data points of

252

any class. The detailed description of SVM can be found in previous studies.44, 46 Here,

253

the radial basis function (RBF) was used as the kernel function for the classification.

254

Grid searching was used to optimize two parameters, cost (c) and gamma (γ), which

255

were designed to exponentially grow against 2, namely 2n, which goes from -15 to 15

256

and -15 to 5, respectively, with a step of 0.5. Because the numbers of blockers and

257

nonblockers are not balanced, a higher weight (k-) of 2 was applied to the nonblocker

258

class. The statistical significance of each model was tested by 5-fold cross-validation.

259

The SVM models were developed by using the e1071 package in R.47

260

NBC is a probabilistic classification method based on Bayes' theorem. Bayesian

261

statistics not only considers the likelihood of a model, but also takes into

262

consideration the complexity of the model. Compared with other machine learning

263

approaches, NBC can deal with large-scale data, trains fast and is tolerant of random

264

noise. The mathematical details of NBC and the procedure to create a naive Bayesian

265

classifier were described previously.16,

266

continuous variables was set to 10, and the other parameters were set to the default

267

values. 5-fold cross validation was used to assess the statistical significance of each

268

model. The NBC classifiers were created by the Create Bayesian Model protocol in

269

DS2.5.36

39, 40, 48

The number of bins to divide

270 271

Validation of classifiers

272

Each classifier was evaluated by the following parameters: true positives (TP), true

273

negatives (TN), false positives (FP), false negatives (FN), sensitivity: SE =

274

TP/(TP+FN), specificity: SP = TN/(TN + FP), prediction accuracy for blockers: PRE1

275

= TP/(TP + FP), prediction accuracy for nonblockers: PRE2 = TN/(TN + FN), global

276

accuracy (GA, Eq. 1) and Matthews correlation coefficient (MCC, Eq. 2):

277

(1)

278

(2) 9

ACS Paragon Plus Environment

Molecular Pharmaceutics

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The above parameters were calculated for both training and test sets.

279 280 281

Results and Discussions

282

The distribution of the experimental hERG blocking activities (pIC50) is shown in

283

Figure S2. The chemical spaces of the training and test sets characterized by the

284

scattered distributions of the first and second principal components derived from the

285

principal component analysis (PCA) based on eight molecular descriptors is shown in

286

Figure 2. These descriptors are extensively used in ADMET predictions,49, 50 including

287

octanol-water partition coefficient (logP), number of rotatable bonds, number of rings,

288

topological polar surface area (TPSA), number of H-bond acceptors, number of

289

H-bond donors, number of heavy atoms, and molecular weight. The cumulative

290

proportion of the first two components can explain >99% of the variance. As shown in

291

Figure 2, the chemical spaces of the test sets are roughly distributed within the scope

292

of the training sets, and therefore it is reliable to predict the hERG liability for the test

293

sets by using the classifiers built from the training sets.

294 295

Important pharmacophores identified by RP

296

For the training set, the fit values of mapping provided by the 190 or 189

297

pharmacophore hypotheses were used as the input matrix for RP to establish decision

298

trees. Compared with the “blind operations” of many other machine learning

299

approaches, the results of RP can be easily converted to a series of simple hierarchical

300

rules. The pharmacophores as the splitting descriptors for a decision tree should be

301

more significant than the others. Moreover, during the RP modeling, the tree depth

302

was increased from 2 to 7, and the prediction accuracies of the established models at

303

different tree depths were compared according to their prediction accuracies to the test

304

sets.

305

The prediction results of the RP models are summarized in Table 1. For the

306

training set I, the RP model at the tree depth of 2 (RP1 model) does not achieve the

307

highest prediction accuracy for the training set I (82.7%) but it achieve the highest

308

accuracy for the test set I (78.5%). The decision tree established based on the training

309

set I at the tree depth of 2 is depicted in Figure 3a. For the training set II, the RP

310

model at the tree depth of 3 (RP8 model) reaches the highest prediction accuracy for

311

the test set II (79%). The decision tree established based on the training set II at the 10

ACS Paragon Plus Environment

Page 10 of 30

Page 11 of 30

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Molecular Pharmaceutics

312

tree depth of 3 is shown in Figure 3b. It can be observed that the prediction accuracies

313

for the test sets decrease slightly at the higher three depths (5~7).

314

Based on the theory of RP, the pharmacophores chosen by a decision tree have

315

the higher weights to split the blockers and nonblockers. The important

316

pharmacophores chosen by the decision tree at the tree depth of 2 for the training set I

317

include Hypo44, Hypo103, and Hypo178 (Figure 4a), and those chosen by the

318

decision tree at the tree depth of 3 for the training set II include Hypo70, Hypo105,

319

Hypo106, Hypo121, Hypo125, and Hypo138 (Figure 4b). Apparently, two chemical

320

features, including HP and RA, can be found in all of the nine important

321

pharmacophores, highlighting the importance of the hydrophobic interactions upon

322

the binding of hERG blockers. This observation is consistent with the experimental

323

mutagenesis data, which suggests that Tyr652 and Phe656 in the S6 helix can form

324

favorable hydrophobic interactions with high affinity blockers.19, 21, 23, 51

325

As shown in Figure 3, the most important discriminant pharmacophores are

326

Hypo178 and Hypo125. Both of these two pharmacophores have four chemical

327

features, including two aromatic rings (RA), a hydrophobic center (HP) and a positive

328

ionizable center (PI). These chemical features were also observed in most

329

pharmacophore models for hERG blockers reported in previous studies.19, 21, 25 So far,

330

the published pharmacophore hypotheses typically have the following common

331

features: (1) hydrophobic (HP) and/or aromatic centers (RA), which can form

332

hydrophobic and/or π-π stacking interactions with Tyr652 and Phe656 of hERG; (2) a

333

protonated nitrogen (PI), which is important for achieving high hERG potency but not

334

necessary to all blockers.16,

335

successfully mapped onto Hypo178, while 241 molecules could be mapped onto

336

Hypo125 for the training set II. However, the blocking activities of a number of

337

molecules could not be well predicted by the fit values given by pharmacophore

338

mapping. For example, the predicted activity of amsacrine given by Hypo178 has a

339

high deviation (∆pIC50=6.54) from the experimental data. Apparently, a single

340

pharmacophore hypothesis cannot effectively distinguish blockers from nonblockers,

341

and multiple pharmacophore hypotheses are necessary for the development of reliable

342

hERG prediction models.

20, 24

For the training set I, 296 molecules could be

343 344

NBC and SVM classifiers based on multiple pharmacophores 11

ACS Paragon Plus Environment

Molecular Pharmaceutics

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

345

In order to develop more reliable classification models for hERG liability, two

346

powerful machine learning approaches (NBC and SVM) were used to develop

347

classifiers based on multiple pharmacophores. The fit values given by the important

348

pharmacophores chosen by RP were used to generate the data matrix for model

349

construction. First, NBC was employed to integrate the predictions given by the

350

multiple pharmacophores chosen by RP at five different tree depths (2~6). The

351

performances of the Bayesian classifiers are summarized in Table 2. For the training

352

set II, based on the predictions from the 12 pharmacophores chosen by the decision

353

tree at the tree depth of 5 (Figure S3), the Bayesian classifier (NBC9 model) achieves

354

the best prediction accuracy with the global accuracy of 0.81, sensitivity of 0.814,

355

specificity of 0.80, MCC of 0.576 and AUC of 0.829 for the test set II. The Bayesian

356

scores and receiver operating characteristic (ROC) curves given by NBC9 for the

357

training set II and test set II are shown in Figure 5. The histograms show that the

358

blockers tend to have more positive values, while the nonblockers tend to have more

359

negative values. For the training set I, based on the three pharmacophores chosen by

360

the decision tree at the tree depth of 2 (Figure 3a), the Bayesian classifier (NBC1

361

model) has the global accuracy of 0.8, sensitivity of 0.822, specificity of 0.758, MCC

362

of 0.566 and AUC of 0.847 for the test set I. Based on the same training and test sets,

363

the performances of the best Bayesian classifiers are only slightly better than those of

364

the RP classifiers. The theory of naïve Bayesian supposes that probability distribution

365

obeys conditional independence assumption, that is to say, the input variables are

366

conditionally independent, but the predictions given by some similar pharmacophores

367

generated based on a set of similar molecules are possibly relevant to each other.

368

Therefore, some relevant input variables may reduce the global accuracy of NBC.

369

Moreover, data unbalance may have unfavorable impact on the reliability and

370

robustness of the NBC models. For all the Bayesian classifiers, the prediction

371

accuracies for the nonblockers are lower than those for the blockers, which may be

372

attributed to the low number of the nonblockers in the training set.

373

Similarly, SVM was employed to integrate the predictions given by the important

374

pharmacophores chosen by RP. Compared with NBC, SVM has intrinsic advantage to

375

avoid the conditional independence problem involved in NBC. The performance of

376

the SVM classifiers are summarized in Table 3. For the training set I, the best SVM

377

classifier (SVM5 model; cost = 1 and gamma = 0.0174) based on the decision tree at

378

the tree depth of 6 has the sensitivity of 0.943, specificity of 0.596, MCC of 0.597 and 12

ACS Paragon Plus Environment

Page 12 of 30

Page 13 of 30

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Molecular Pharmaceutics

379

global accuracy of 0.847 for the training set I, and the sensitivity of 0.907, specificity

380

of 0.652, MCC of 0.587 and global accuracy of 0.821 for the test set I, and the global

381

accuracy of 0.809 for the external validation set. For the training set II, the best SVM

382

classifier based on the decision tree at the tree depth of 5 (SVM9 model) has the

383

sensitivity of 0.85, specificity of 0.745, MCC of 0.575 and global accuracy of 0.821

384

for the test set II. The prediction accuracies of the best SVM classifiers are obviously

385

better than those of the best Bayesian classifiers.

386

By analyzing the average performance of the three different machine learning

387

approaches for the test set I and the test set II, SVM exhibits the best predictive

388

capability than the other two approaches (Figure S4a). The performance of the SVM

389

models based on the pharmacophore hypotheses chosen by the decision trees with

390

different tree depths was then discussed. With the increase of the decision tree depth

391

from 2 to 6, the overall prediction power improves significantly (Figure S4b, c and d).

392

To all the models shown in Table 2 and 3, the MCC and SP values are relatively low,

393

which may be explained by the following reasons. First, the unbalanced nature of

394

blockers versus nonblockers has great impact on MCC, where the MCC values would

395

decrease dramatically with the increase of false positives or false negatives.52 Second,

396

the method of pharmacophore mapping may be more suitable in predicting hERG

397

blockers than nonblockers based on the unbalanced data. Consequently, the prediction

398

accuracy for the blockers is obviously higher than that of the nonblockers.

399 400

Clustering Analysis of Important Pharmacophores

401

The SVM classifier (SVM5) based the decision tree at the depth of 6 established from

402

the training set I has the best predictive capability. As shown in Figure S5, the

403

pharmacophore hypotheses used by SVM5 include Hypo44, Hypo48, Hypo75,

404

Hypo77, Hypo84, Hypo85, Hypo87, Hypo99, Hypo102, Hypo103, Hypo154,

405

Hypo178, Hypo190, and Hypo198 (Figure S6). Different pharmacophores may imply

406

the multiple binding mechanisms of action of hERG blockers. In order to explore the

407

structural difference of these 14 pharmacophores, the distance between each pair of

408

pharmacophores was calculated, and the distance matrix was hierarchically clustered

409

and presented as a dendrogram (Figure 6). The distance is defined as a function of the

410

number of common pharmacophoric features and the RMSD (root-mean-squared

411

displacement) between the matching features. As shown in Figure 6, these

412

pharmacophores can be roughly divided into four classes. The detailed descriptions of 13

ACS Paragon Plus Environment

Molecular Pharmaceutics

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 14 of 30

413

the four representative pharmacophore hypotheses (Hypo44, Hypo75, Hypo103 and

414

Hypo178) chosen from the four clusters are summarized in Tables S2~S5. It can be

415

observed that different pharmacophores are strongly inclined to certain subsets in the

416

training set. For example, as shown in Figure 7, four representative molecules

417

(clofilium phosphate, neferine, lidoflazine, GBR12909) were mapped perfectly onto

418

Hypo44,

419

pharmacophore can characterize the binding features of certain molecules and

420

represent one binding mode. That is to say, the multi-mechanisms of action of hERG

421

blockers may be effectively characterized by the four representative pharmacophore

422

hypotheses, and therefore the integration of multiple pharmacophores is necessary to

423

improve the prediction accuracies for the whole dataset.

Hypo75,

Hypo103

and

Hypo178,

respectively.

Apparently,

each

424 425

Analysis of misclassified molecules

426

By analyzing the predictions results, we observed that 16 molecules in the test set I

427

could not be correctly predicted by any NBC or SVM classifier. The 9 misclassified

428

hERG blockers and 7 misclassified nonblockers are shown in Figure 8. We believe

429

that the following reasons may explain the misclassification.

430

First, the misclassifications may be caused by the intrinsic drawback of the

431

approach used in this study. The pharmacophore hypotheses can describe the spatial

432

arrangement of the important structural features favorable for hERG binding, but they

433

cannot precisely characterize the shape and chemical environment of the binding sites

434

of hERG. As shown in Figure S7, all of the 7 misclassified nonblockers can be

435

mapped onto at least one pharmacophore hypothesis. In these 9 misclassified blockers,

436

only one (atropine) has the charged nitrogen. The blockers may not be well

437

distinguished from the nonblockers because most pharmacophores have positive

438

ionizable centers (Figure S6). Apparently, the successful mapping of a molecule onto

439

a pharmacophore does not mean that this molecule can form favorable interaction

440

with hERG because this molecule may form unfavorable atom bump with the nearby

441

residues that are not characterized by the hypotheses. All of the 7 misclassified

442

nonblockers have aromatic rings and charged nitrogens, which are therefore more

443

likely to be mapped onto the hydrophobic centers, ring aromatic centers and positive

444

ionizable centers in the pharmacophores. Second, the noise of the experimental data is

445

another source of misclassification. Here, an arbitrary cutoff of 40 µM was used to

446

define blockers and nonblockers. In the 9 misclassified hERG blockers, 14

ACS Paragon Plus Environment

Page 15 of 30

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Molecular Pharmaceutics

447

dihydropyrrole has an IC50 of 25 µM, which is quite close to the cutoff. It is

448

understandable that dihydropyrrole was easily misclassified as a nonblocker. Actually,

449

in some previous studies, dihydropyrrole was also regarded as a nonblocker.35,

450

Similarly, two of the 7 misclassified nonblockers, Nip-142 and CHEMBL216680,

451

have relatively low IC50 (44 µM and 47.7 µM), and these two compounds were easily

452

predicted as blockers.

53

453 454

Conclusions

455

As a necessary step for safety evaluation of drugs, the assessment of hERG blockage

456

is extremely important. In this study, based on a relatively large dataset of 587 hERG

457

blockers and nonblockers, pharmacophore modeling and machine learning approaches

458

were combined to establish classification models to distinguish hERG active from

459

inactive compounds. First, a large number of pharmacophore hypotheses were

460

generated and an ensemble of pharmacophore hypotheses defined by a decision tree

461

were identified by RP. Then, the NBC and SVM approaches were employed to

462

construct classification models by integrating multiple representative pharmacophore

463

hypotheses identified by RP. The best SVM model achieved the prediction accuracies

464

of 84.7% for the training set and 82.1% for the test set. The improved prediction

465

capability of the integrated classification models suggests that the broad binding

466

poly-specificity of hERG can only be well characterized by multiple pharmacophores.

467

Moreover, the analysis of the important pharmacophores provide a deeper insight into

468

the

469

multi-mechanisms of hERG blockage can be characterized by several important

470

pharmacophores, and the combination of pharmacophores and machine learning

471

methods can provide a powerful way to evaluate hERG blockage.

multi-mechanisms

of

action

of

hERG

blockage.

In

summary,

the

472 473

Supporting Information

474

Table S1. The validation parameters (cost and gamma) of the SVM models. Table S2.

475

Weights, tolerances and three-dimensional coordinates of the chemical features in

476

pharmacophore model Hypo4; Table S3. Weights, tolerances and three-dimensional

477

coordinates of the chemical features in pharmacophore model Hypo75; Table S4.

478

Weights, tolerances and three-dimensional coordinates of the chemical features in

479

pharmacophore model Hypo103; Table S5. Weights, tolerances and three-dimensional 15

ACS Paragon Plus Environment

Molecular Pharmaceutics

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

480

coordinates of the chemical features in pharmacophore model Hypo178. Figure S1.

481

The pipline of the dataset partition. Figure S2. Frequency distribution of pIC50 for the

482

data set; Figure S3. Decision tree with the tree depth of 5 (RP5) established based on

483

the training set II. Predicted blockers and nonblockers are colored in red and blue,

484

respectively. Figure S4. The performance of the classifiers based on three different

485

machine learning approaches. (a) The average performance (GA, MCC) of the

486

modelsfor the test set I and the test set II, respectively. (b) The global accuracy (GA),

487

(c) Matthews correlation coefficient (MCC), and (d) area under curve (AUC) of

488

SVM1-5 for the training set I and test set I. Figure S5. Decision tree with the tree

489

depth of 6 (RP6) established based on the training set I. Predicted blockers and

490

nonblockers are colored in red and blue, respectively; Figure S6. Pharmacophores

491

chosen by the decision tree with the tree depth of 6 based on the training set I,

492

including Hypo44, Hypo48, Hypo75, Hypo77, Hypo84, Hypo85, Hypo87, Hypo99,

493

Hypo102, Hypo103, Hypo112, Hypo154, Hypo178, Hypo190, Hypo198. HBA, HBD,

494

HP, PI, RA and EV chemical features are colored in green, magenta, cyan, red, orange

495

and gray, respectively. Figure S7. Mapping of 7 misclassified molecules onto the best

496

fitted pharmacophores (Hypo77, Hypo99, Hypo198, Hypo48, Hypo103, Hypo85 and

497

Hypo44). DataSet.xlsx. A list of the molecular structures and activities used in this

498

study.

499 500

Acknowledgment

501

This study was supported by the National Science Foundation of China (21575128).

502 503

References

504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519

1. 2. 3. 4.

5. 6. 7.

Smith, P. L.; Baukrowitz, T.; Yellen, G. The inward rectification mechanism of the HERG cardiac potassium channel. Nature 1996, 379, (6568), 833-836. Sanguinetti, M. C.; Tristani-Firouzi, M. hERG potassium channels and cardiac arrhythmia. Nature 2006, 440, (7083), 463-469. Vandenberg, J. I.; Perry, M. D.; Perrin, M. J.; Mann, S. A.; Ke, Y.; Hill, A. P. hERG K+ channels: structure, function, and clinical significance. Physiological Reviews 2012, 92, (3), 1393-1478. Recanatini, M.; Poluzzi, E.; Masetti, M.; Cavalli, A.; De Ponti, F. QT prolongation through hERG K+ channel blockade: current knowledge and strategies for the early prediction during drug development. Medicinal research reviews 2005, 25, (2), 133-166. Villoutreix, B. O.; Taboureau, O. Computational investigations of hERG channel blockers: New insights and current predictive models. Adv. Drug Delivery Rev. 2015, 86, 72-82. Witchel, H. J. The hERG potassium channel as a therapeutic target. Expert Opin. Ther. Targets 2007, 11, (3), 321-36. Brugada, R.; Hong, K.; Dumaine, R.; Cordeiro, J.; Gaita, F.; Borggrefe, M.; Menendez, T. M.; Brugada, J.; Pollevick, G. D.; Wolpert, C. Sudden death associated with short-QT syndrome linked to mutations in HERG. Circulation 2004, 109, (1), 30-35. 16

ACS Paragon Plus Environment

Page 16 of 30

Page 17 of 30

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Molecular Pharmaceutics

520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579

8.

9. 10. 11. 12.

13.

14.

15.

16. 17. 18.

19.

20. 21.

22.

23.

24.

25.

26. 27. 28.

29. 30.

Curran, M. E.; Splawski, I.; Timothy, K. W.; Vincen, G. M.; Green, E. D.; Keating, M. T. A molecular basis for cardiac arrhythmia: HERG mutations cause long QT syndrome. Cell 1995, 80, (5), 795-803. Brown, A. Drugs, hERG and sudden death. Cell calcium 2004, 35, (6), 543-547. Aronov, A. M. Predictive in silico modeling for hERG channel blockers. Drug Discovery Today 2005, 10, (2), 149-155. Raschi, E.; Vasina, V.; Poluzzi, E.; De Ponti, F. The hERG K+ channel: target and antitarget strategies in drug development. Pharmacological Research 2008, 57, (3), 181-195. Bains, W.; Basman, A.; White, C. HERG binding specificity and binding site structure: evidence from a fragment-based evolutionary computing SAR study. Progress in biophysics and molecular biology 2004, 86, (2), 205-233. Roche, O.; Trube, G.; Zuegge, J.; Pflimlin, P.; Alanine, A.; Schneider, G. A virtual screening method for prediction of the HERG potassium channel liability of compound libraries. ChemBioChem 2002, 3, (5), 455-459. Broccatelli, F.; Mannhold, R.; Moriconi, A.; Giuli, S.; Carosati, E. QSAR modeling and data mining link torsades de pointes risk to the interplay of extent of metabolism, active transport, and hERG liability. Mol. Pharm. 2012, 9, (8), 2290-2301. Sinha, N.; Sen, S. Predicting hERG activities of compounds from their 3D structures: Development and evaluation of a global descriptors based QSAR model. European journal of medicinal chemistry 2011, 46, (2), 618-630. Wang, S.; Li, Y.; Xu, L.; Li, D.; Hou, T. Recent developments in computational prediction of HERG blockage. Current topics in medicinal chemistry 2013, 13, (11), 1317-1326. Jing, Y.; Easter, A.; Peters, D.; Kim, N.; Enyedy, I. J. In silico prediction of hERG inhibition. Future Med. Chem. 2015, 7, (5), 571-586. Dempsey, C. E.; Wright, D.; Colenso, C. K.; Sessions, R. B.; Hancox, J. C. Assessing hERG pore models as templates for drug docking using published experimental constraints: the inactivated state in the context of drug block. J. Chem. Inf. Model. 2014, 54, (2), 601-612. Ekins, S.; Crumb, W. J.; Sarazan, R. D.; Wikel, J. H.; Wrighton, S. A. Three-dimensional quantitative structure-activity relationship for inhibition of human ether-a-go-go-related gene potassium channel. Journal of Pharmacology and Experimental Therapeutics 2002, 301, (2), 427-434. Aronov, A. M. Common pharmacophores for uncharged human ether-a-go-go-related gene (hERG) blockers. Journal of medicinal chemistry 2006, 49, (23), 6917-6921. Leong, M. K. A novel approach using pharmacophore ensemble/support vector machine (PhE/SVM) for prediction of hERG liability. Chemical research in toxicology 2007, 20, (2), 217-226. Garg, D.; Gandhi, T.; Mohan, C. G. Exploring QSTR and toxicophore of hERG K+ channel blockers using GFA and HypoGen techniques. Journal of Molecular Graphics and Modelling 2008, 26, (6), 966-976. Durdagi, S.; Duff, H. J.; Noskov, S. Y. Combined receptor and ligand-based approach to the universal pharmacophore model development for studies of drug blockade to the hERG1 pore domain. Journal of chemical information and modeling 2011, 51, (2), 463-474. Tan, Y.; Chen, Y.; You, Q.; Sun, H.; Li, M. Predicting the potency of hERG K+ channel inhibition by combining 3D-QSAR pharmacophore and 2D-QSAR models. Journal of molecular modeling 2012, 18, (3), 1023-1036. Kratz, J. M.; Schuster, D.; Edtbauer, M.; Saxena, P.; Mair, C. E.; Kirchebner, J.; Matuszczak, B.; Baburin, I.; Hering, S.; Rollinger, J. M. Experimentally validated HERG pharmacophore models as cardiotoxicity prediction tools. J. Chem. Inf. Model. 2014, 54, (10), 2887-901. Stoll, F.; Goeller, A. H.; Hillisch, A. Utility of protein structures in overcoming ADMET-related issues of drug-like compounds. Drug Discov. Today 2011, 16, (11-12), 530-538. Chen, L.; Li, Y. Y.; Zhang, H. D.; Zhang, L. L.; Hou, T. J. Computational models for predicting substrates or inhibitors of P-glycoprotein. Drug Discov. Today 2012, 17, (7-8), 343-351. Xu, Y.; Liu, X.; Li, S.; Zhou, N.; Gong, L.; Luo, C.; Luo, X.; Zheng, M.; Jiang, H.; Chen, K. Combinatorial pharmacophore modeling of organic cation transporter 2 (OCT2) inhibitors: insights into multiple inhibitory mechanisms. Molecular pharmaceutics 2013, 10, (12), 4611-4619. Ding, Y.-L.; Shih, Y.-H.; Tsai, F.-Y.; Leong, M. K. In silico prediction of inhibition of promiscuous breast cancer resistance protein (BCRP/ABCG2). PloS one 2014, 9, (3). Xu, Y.; Liu, X.; Wang, Y.; Zhou, N.; Peng, J.; Gong, L.; Ren, J.; Luo, C.; Luo, X.; Jiang, H. Combinatorial Pharmacophore Modeling of Multidrug and Toxin Extrusion Transporter 1 17

ACS Paragon Plus Environment

Molecular Pharmaceutics

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638

31. 32.

33.

34. 35.

36. 37. 38. 39.

40.

41.

42.

43.

44.

45.

46. 47. 48.

49. 50. 51. 52. 53.

Inhibitors: a Theoretical Perspective for Understanding Multiple Inhibitory Mechanisms. Sci. Rep. 2015, 5, 13684. Leong, M. K.; Chen, H.-B.; Shih, Y.-H. Prediction of promiscuous p-glycoprotein inhibition using a novel machine learning scheme. PloS one 2012, 7, (3), e33829. Wang, S.; Li, Y.; Wang, J.; Chen, L.; Zhang, L.; Yu, H.; Hou, T. ADMET evaluation in drug discovery. 12. Development of binary classification models for prediction of hERG potassium channel blockage. Molecular pharmaceutics 2012, 9, (4), 996-1010. Wang, Y.; Xiao, J.; Suzek, T. O.; Zhang, J.; Wang, J.; Zhou, Z.; Han, L.; Karapetyan, K.; Dracheva, S.; Shoemaker, B. A. PubChem's BioAssay database. Nucleic acids research 2012, 40, (D1), D400-D412. Aronov, A. M.; Goldman, B. B. A model for identifying HERG K+ channel blockers. Bioorganic & medicinal chemistry 2004, 12, (9), 2307-2315. Li, Q.; Jørgensen, F. S.; Oprea, T.; Brunak, S.; Taboureau, O. hERG classification model based on a combination of support vector machine method and GRIND descriptors. Molecular pharmaceutics 2008, 5, (1), 117-127. Discovery Studio, version 2.5, Accelrys Inc.: San Diego, CA, USA, 2009. Waring, M. J.; Johnstone, C. A quantitative assessment of hERG liability as a function of lipophilicity. Bioorganic & medicinal chemistry letters 2007, 17, (6), 1759-1764. Mannhold, R.; Kubinyi, H.; Folkers, G.; Langer, T.; Hoffmann, R. D., Pharmacophores and pharmacophore searches. John Wiley & Sons: 2006; Vol. 32. Shi, H.; Tian, S.; Li, Y.; Li, D.; Yu, H.; Zhen, X.; Hou, T. Absorption, Distribution, Metabolism, Excretion, and Toxicity Evaluation in Drug Discovery. 14. Prediction of Human Pregnane X Receptor Activators by Using Naive Bayesian Classification Technique. Chem. Res. Toxicol. 2014, 28, (1), 116-125. Chen, L.; Li, Y.; Zhao, Q.; Peng, H.; Hou, T. ADME evaluation in drug discovery. 10. Predictions of P-glycoprotein inhibitors using recursive partitioning and naive Bayesian classification techniques. Molecular pharmaceutics 2011, 8, (3), 889-900. Tian, S.; Wang, J.; Li, Y.; Xu, X.; Hou, T. Drug-likeness analysis of traditional Chinese medicines: prediction of drug-likeness using machine learning approaches. Molecular pharmaceutics 2012, 9, (10), 2875-2886. Hou, T.; Zhang, W.; Wang, J.; Wang, W. Predicting drug resistance of the HIV‐1 protease using molecular interaction energy components. Proteins: Structure, Function, and Bioinformatics 2009, 74, (4), 837-846. Hou, T.; Zhang, W.; Case, D. A.; Wang, W. Characterization of domain–peptide interaction interface: a case study on the amphiphysin-1 SH3 domain. Journal of molecular biology 2008, 376, (4), 1201-1214. Hou, T.; Wang, J.; Li, Y. ADME evaluation in drug discovery. 8. The prediction of human intestinal absorption by a support vector machine. Journal of chemical information and modeling 2007, 47, (6), 2408-2415. Hou, T.; Li, N.; Li, Y.; Wang, W. Characterization of domain–peptide interaction interface: prediction of SH3 domain-mediated protein–protein interaction network in yeast by generic structure-based models. Journal of proteome research 2012, 11, (5), 2982-2995. Chang, C.-C.; Lin, C.-J. LIBSVM: A library for support vector machines. ACM T. Intel. Syst. Tec. 2011, 2, (3), 27. Meyer, D. Support vector machines: The interface to libsvm in package e1071. 2004. Liu, L.-l.; Lu, J.; Lu, Y.; Zheng, M.-y.; Luo, X.-m.; Zhu, W.-l.; Jiang, H.-l.; Chen, K.-x. Novel Bayesian classification models for predicting compounds blocking hERG potassium channels. Acta Pharmacologica Sinica 2014, 35, (8), 1093-1102. Hou, T.; Wang, J.; Zhang, W.; Wang, W.; Xu, X. Recent advances in computational prediction of drug absorption and permeability in drug discovery. Curr. Med. Chem. 2006, 13, (22), 2653-2667. Hou, T.; Wang, J. Structure - ADME relationship: still a long way to go? Expert Opinion on Drug Metabolism & Toxicology 2008, 4, (6), 759-770. Du, L.; Li, M.; You, Q. The interactions between hERG potassium channel and blockers. Current topics in medicinal chemistry 2009, 9, (4), 330-338. Wei, Q.; Dunbrack Jr, R. L. The role of balanced training and testing data sets for binary classifiers in bioinformatics. PloS one 2013, 8, (7), e67863. Sun, H. An accurate and interpretable Bayesian classification model for prediction of hERG liability. ChemMedChem 2006, 1, (3), 315-322.

18

ACS Paragon Plus Environment

Page 18 of 30

Page 19 of 30

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Molecular Pharmaceutics

639

Legend of Figures

640

Figure 1. Pipeline of the procedure to build the NBC and SVM classifiers by

641

integrating multiple pharmacophore hypotheses identified by PR. (a) A number of

642

pharmacophore hypotheses were generated; (b) the molecules in the training set were

643

mapped on the pharmacophore hypotheses, and decision trees were established by PR

644

based on the fit values of mapping; (c) the data matrix was generated based on the fit

645

values given by the important pharmacophores chosen by RP; (d) the classifiers of

646

hERG blockade were generated by NBC and SVM.

647

Figure 2. The chemical space distributions based on PCA for (a) the training set I and

648

test set I, and (b) the training set II and test set II.

649

Figure 3. (a) The decision tree at the tree depth of 2 established based on the training

650

set I, and (b) the decision three at the tree depth of 3 established based on the training

651

set II. The predicted blockers are labeled in red and nonblockers in blue.

652

Figure 4. The pharmacophore hypotheses in (a) the decision tree shown in Figure 3a

653

and (b) the decision tree shown in Figure 3b. The HBA, HBD, HP, PI, AR and EV

654

pharmacophoric features are colored in green, magenta, cyan, red, orange, and gray,

655

respectively.

656

Figure 5. The distributions of the Bayesian scores predicted by the Bayesian classifier

657

based on the pharmacophores chosen by the decision three at the tree depth of 5 for (a)

658

the training set II and (b) test set II. The ROC curves for (d) the training set II and (d)

659

test set II. The color of the right stripe displays the different cutoff to split the blockers

660

and nonblockers.

661

Figure 6. The hierarchical clustering of the fourteen pharmacophores in the SVM6

662

model established based on the training set I.

663

Figure 7. The mapping of four molecules (clofilium phosphate, neferine, lidoflazine,

664

and GBR 12909) onto Hypo44, Hypo75, Hypo103 and Hypo178.

665

Figure 8. The structures of (a) the misclassified 9 blockers and (b) 7 nonblockers.

666 667 668 669 670 671 672 19

ACS Paragon Plus Environment

Molecular Pharmaceutics

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 30

673

Table 1. The performance of the RP models for the training and test sets (RP2, RP3,

674

RP4, RP5, RP6 and RP7 represent the RP models with the tree depths of 2, 3, 4, 5, 6

675

and 7, respectively). Label

Training I

Training II

Test I

Test II

External validation

External validation

Depth

Model

TP

FN

FP

TN

SE

SP

PRE1

PRE2

GA

MCC

2

RP1

242

41

27

82

0.855

0.752

0.900

0.667

0.827

0.586

3

RP2

249

34

27

82

0.880

0.752

0.902

0.707

0.844

0.621

4

RP3

233

50

15

94

0.823

0.862

0.940

0.653

0.834

0.637

5

RP4

224

59

7

102

0.792

0.936

0.970

0.634

0.832

0.662

6

RP5

235

48

6

103

0.830

0.945

0.975

0.682

0.862

0.714

7

RP6

241

42

7

102

0.852

0.936

0.972

0.708

0.875

0.732

2

RP7

233

39

33

87

0.857

0.725

0.876

0.690

0.816

0.574

3

RP8

254

18

37

83

0.934

0.692

0.873

0.822

0.860

0.659

4

RP9

248

24

34

86

0.912

0.717

0.879

0.782

0.852

0.645

5

RP10

219

53

14

106

0.805

0.883

0.940

0.667

0.829

0.646

6

RP11

232

40

11

109

0.853

0.908

0.955

0.732

0.870

0.723

7

RP12

240

32

13

107

0.882

0.892

0.949

0.770

0.885

0.746

2

RP1

103

26

16

50

0.798

0.758

0.866

0.658

0.785

0.539

3

RP2

104

25

17

49

0.806

0.742

0.860

0.662

0.785

0.535

4

RP3

96

33

14

52

0.744

0.788

0.873

0.612

0.759

0.508

5

RP4

91

38

14

52

0.705

0.788

0.867

0.578

0.733

0.468

6

RP5

93

36

14

52

0.721

0.788

0.869

0.591

0.744

0.484

7

RP6

93

36

14

52

0.721

0.788

0.869

0.591

0.744

0.484

2

RP7

112

28

14

41

0.800

0.745

0.889

0.594

0.785

0.513

3

RP8

116

24

17

38

0.829

0.691

0.872

0.613

0.790

0.502

4

RP9

114

26

16

39

0.814

0.709

0.877

0.600

0.785

0.500

5

RP10

90

50

10

45

0.643

0.818

0.900

0.474

0.692

0.415

6

RP11

96

44

12

43

0.686

0.782

0.889

0.494

0.713

0.423

7

RP12

101

39

12

43

0.721

0.782

0.894

0.524

0.738

0.459

2

RP1

32

7

7

22

0.821

0.759

0.821

0.759

0.794

0.579

3

RP2

33

6

7

22

0.846

0.759

0.825

0.786

0.809

0.608

4

RP3

30

9

7

22

0.769

0.759

0.811

0.710

0.765

0.524

5

RP4

29

10

5

24

0.744

0.828

0.853

0.706

0.779

0.565

6

RP5

31

8

4

25

0.795

0.862

0.886

0.758

0.824

0.650

7

RP6

30

9

6

23

0.769

0.793

0.833

0.719

0.779

0.557

2

RP7

29

10

6

23

0.744

0.793

0.829

0.697

0.765

0.531

3

RP8

32

7

6

23

0.821

0.793

0.842

0.767

0.809

0.611

4

RP9

32

7

7

22

0.821

0.759

0.821

0.759

0.794

0.579

5

RP10

26

13

5

24

0.667

0.828

0.839

0.649

0.735

0.491

6

RP11

29

10

4

25

0.744

0.862

0.879

0.714

0.794

0.599

7

RP12

31

8

6

23

0.795

0.793

0.838

0.742

0.794

0.584

676 677 678 679 680 681 20

ACS Paragon Plus Environment

Page 21 of 30

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Molecular Pharmaceutics

682

Table 2. The performance of naïve Bayesian classifiers for the training and test sets

683

(NBC2, NBC3, NBC4, NBC5 and NBC6 represent the models established based on

684

the pharmacophore hypotheses chosen by the decision trees at the depths of 2, 3, 4, 5

685

and 6, respectively. Label

Training I

Training II

Test I

Test II

External validation

External validation

Level

Model

SE

SP

PRE1

PRE2

GA

MCC

2

NBC1

0.859

0.706

0.884

0.658

0.816

0.553

0.783

3

NBC2

0.777

0.771

0.898

0.571

0.776

0.507

0.778

4

NBC3

0.774

0.780

0.901

0.570

0.776

0.511

0.769

5

NBC4

0.887

0.651

0.869

0.689

0.821

0.548

0.761

6

NBC5

0.866

0.670

0.872

0.658

0.811

0.532

0.772

2

NBC6

0.838

0.733

0.877

0.667

0.806

0.557

0.799

3

NBC7

0.904

0.617

0.842

0.740

0.816

0.551

0.779

4

NBC8

0.912

0.608

0.841

0.753

0.819

0.555

0.790

5

NBC9

0.835

0.742

0.880

0.664

0.806

0.560

0.791

6

NBC10

0.857

0.733

0.879

0.693

0.819

0.581

0.793

2

NBC1

0.822

0.758

0.869

0.685

0.800

0.566

0.847

3

NBC2

0.775

0.788

0.877

0.642

0.779

0.541

0.821

4

NBC3

0.814

0.742

0.861

0.671

0.790

0.544

0.825

5

NBC4

0.822

0.727

0.855

0.676

0.790

0.540

0.819

6

NBC5

0.829

0.697

0.843

0.676

0.785

0.523

0.820

2

NBC6

0.743

0.764

0.889

0.538

0.749

0.465

0.820

3

NBC7

0.807

0.764

0.897

0.609

0.795

0.537

0.811

4

NBC8

0.779

0.764

0.893

0.575

0.774

0.504

0.813

5

NBC9

0.814

0.800

0.912

0.629

0.810

0.576

0.829

6

NBC10

0.800

0.800

0.911

0.611

0.800

0.559

0.819

2

NBC1

0.846

0.724

0.805

0.778

0.794

0.576

0.815

3

NBC2

0.846

0.759

0.825

0.786

0.809

0.608

0.850

4

NBC3

0.846

0.759

0.825

0.786

0.809

0.608

0.855

5

NBC4

0.821

0.759

0.821

0.759

0.794

0.579

0.865

6

NBC5

0.795

0.724

0.795

0.724

0.765

0.519

0.865

2

NBC6

0.769

0.828

0.857

0.727

0.794

0.591

0.832

3

NBC7

0.795

0.793

0.838

0.742

0.794

0.584

0.839

4

NBC8

0.795

0.862

0.886

0.758

0.824

0.650

0.848

5

NBC9

0.795

0.724

0.795

0.724

0.765

0.519

0.834

6

NBC10

0.795

0.759

0.816

0.733

0.779

0.551

0.826

686

21

ACS Paragon Plus Environment

AUC

Molecular Pharmaceutics

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 22 of 30

687

Table 3. The performance of SVM classifiers for the training and test sets (SVM2,

688

SVM3, SVM4, SVM5 and SVM6 represent the models established based on the

689

pharmacophore hypotheses chosen by the decision trees at the depths of 2, 3, 4, 5 and

690

6, respectively). Label

Training I

Training II

Test I

Test II

External validation

External validation

Level

Model

SE

SP

PRE1

PRE2

GA

MCC

AUC

2

SVM1

0.883

0.670

0.874

0.689

0.824

0.558

0.871

3

SVM2

0.922

0.587

0.853

0.744

0.829

0.552

0.867

4

SVM3

0.940

0.578

0.853

0.788

0.839

0.576

0.879

5

SVM4

0.936

0.578

0.852

0.778

0.837

0.569

0.885

6

SVM5

0.943

0.596

0.859

0.802

0.847

0.597

0.899

2

SVM6

0.882

0.675

0.86

0.717

0.819

0.567

0.799

3

SVM7

0.897

0.708

0.875

0.752

0.839

0.616

0.872

4

SVM8

0.897

0.708

0.875

0.752

0.839

0.616

0.876

5

SVM9

0.901

0.700

0.872

0.757

0.839

0.615

0.898

6

SVM10

0.897

0.717

0.878

0.754

0.842

0.623

0.898

2

SVM1

0.860

0.697

0.847

0.719

0.805

0.562

0.832

3

SVM2

0.891

0.652

0.833

0.754

0.810

0.565

0.819

4

SVM3

0.891

0.621

0.821

0.745

0.800

0.539

0.813

5

SVM4

0.915

0.621

0.825

0.788

0.815

0.573

0.838

6

SVM5

0.907

0.652

0.836

0.782

0.821

0.587

0.842

2

SVM6

0.843

0.709

0.881

0.639

0.805

0.536

0.759

3

SVM7

0.836

0.727

0.886

0.635

0.805

0.542

0.785

4

SVM8

0.829

0.745

0.892

0.631

0.805

0.548

0.82

5

SVM9

0.850

0.745

0.895

0.661

0.821

0.575

0.839

6

SVM10

0.843

0.745

0.894

0.651

0.815

0.566

0.845

2

SVM1

0.872

0.724

0.810

0.808

0.809

0.606

0.769

3

SVM2

0.872

0.690

0.791

0.800

0.794

0.576

0.821

4

SVM3

0.897

0.690

0.795

0.833

0.809

0.608

0.836

5

SVM4

0.872

0.690

0.791

0.800

0.794

0.576

0.805

6

SVM5

0.897

0.690

0.795

0.833

0.809

0.608

0.821

2

SVM6

0.795

0.759

0.816

0.733

0.779

0.551

0.773

3

SVM7

0.846

0.724

0.805

0.778

0.794

0.576

0.813

4

SVM8

0.846

0.690

0.786

0.769

0.779

0.545

0.814

5

SVM9

0.846

0.724

0.805

0.778

0.794

0.576

0.824

6

SVM10

0.846

0.724

0.805

0.778

0.794

0.576

0.828

691 692 693

22

ACS Paragon Plus Environment

Page 23 of 30

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Molecular Pharmaceutics

HBA

AR

HP

HBD

a

EV

PI

(a)

(b)

RP (c) Class

Fit1

blocker1

1

blocker2

1



1

blockeri

1

nonblocker1

0

nonblocker2

0



0

nonblockerj

0

Fit2

Fit3

Fit4



(d) Y

X1



X2

Xn

NBC 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709

SVM Figure 1

23

ACS Paragon Plus Environment

Fitm

Molecular Pharmaceutics

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

710 711 712 713

Figure 2

24

ACS Paragon Plus Environment

Page 24 of 30

Page 25 of 30

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Molecular Pharmaceutics

714

715 716 717 718 719

Figure 3

25

ACS Paragon Plus Environment

Molecular Pharmaceutics

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736

Figure 4

26

ACS Paragon Plus Environment

Page 26 of 30

Page 27 of 30

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Molecular Pharmaceutics

737 738 739 740 741 742 743 744 745 746

Figure 5

27

ACS Paragon Plus Environment

Molecular Pharmaceutics

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

747 748 749 750

Figure 6

28

ACS Paragon Plus Environment

Page 28 of 30

Page 29 of 30

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Molecular Pharmaceutics

751

752 753 754 755 756 757 758

Figure 7

29

ACS Paragon Plus Environment

Molecular Pharmaceutics

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

759 760

Figure 8

30

ACS Paragon Plus Environment

Page 30 of 30