NP-StructurePredictor: Prediction of Unknown Natural Products in


Article

NP-StructurePredictor: prediction of unknown natural products in plant mixtures Yeu-Chern Harn, Bo-Han Su, Yuan-Ling Ku, Olivia A. Lin, Cheng-Fu Chou, and Yufeng Jane Tseng J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.7b00565 • Publication Date (Web): 13 Nov 2017 Downloaded from http://pubs.acs.org on November 18, 2017


Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Chemical Information and Modeling

NP-StructurePredictor: prediction of unknown natural products in plant mixtures

Yeu-Chern Harn1,2#, Bo-Han Su3#, Yuan-Ling Ku4, Olivia A. Lin5, Cheng-Fu Chou3, and Y. Jane Tseng2,3,5,6*

1 Graduate Institute of Networking and Multimedia, National Taiwan University, No. 1 Roosevelt Rd. Sec. 4, Taipei 10617, Taiwan

2 The Metabolomics Core Laboratory, NTU Center of Genomic Medicine, 7F, No. 2, Syujhou Road, Taipei 10055, Taiwan

3 Department of Computer Science and Information Engineering, National Taiwan University, No. 1 Roosevelt Rd. Sec. 4, Taipei 10617, Taiwan

4 Medical and Pharmaceutical Industry Technology and Development Center, 7F, No. 9, Wuquan Rd., Wugu Dist., New Taipei City 24886, Taiwan

5 Graduate Institute of Biomedical Electronic and Bioinformatics, National Taiwan University, No. 1 Roosevelt Rd. Sec. 4, Taipei 10617, Taiwan

6 Drug Research Center, National Taiwan University College of Medicine, No.1 Jen Ai Rd. Sec. 1, Taipei 10051, Taiwan

# Equal contribution

*Corresponding author (Voice: +886.2.3366.4888#529, Fax: +886.2.23628167, [email protected])

Abstract

Identification of the individual chemical constituents of a mixture, especially solutions extracted from medicinal plants, is a time-consuming task. The identification results are often limited by challenges such as the development of separation methods and the availability of known reference standards. A novel structure elucidation system, NP-StructurePredictor, is presented and used to accelerate the process of identifying chemical structures in a mixture based on a branch and bound algorithm combined with a large collection of natural product databases. NP-StructurePredictor requires only targeted molecular weights calculated from a list of m/z values from LC-MS experiments as input information to predict the chemical structures of individual components matching the weights in a mixture. NP-StructurePredictor also provides the predicted structures with statistically calculated probabilities so that the most likely chemical structures of the natural products and their analogs can be proposed accordingly. Four datasets consisting of different Chinese herbs with mixtures containing known compounds were selected for validation studies, and all their components were correctly identified and highly predicted using NP-StructurePredictor. NP-StructurePredictor demonstrated its applicability for predicting the chemical structures of novel compounds by returning highly accurate results from four different validation case studies.

Keywords

liquid chromatography-mass spectrometry (LC-MS), computer-aided structure elucidation (CASE), cheminformatics, chemometrics, natural product determination, branch and bound algorithm

Introduction

One of the central themes in medicinal chemistry and chemical biology research involves the efficient identification of the small molecules that regulate protein functions.1 Moreover, identification of small molecules from plants (natural products) is important because those molecules are among the major sources of inspiration for drug discovery.2 However, determination of chemical constituents in plants is a time-consuming task requiring complex and lengthy procedures. For example, the identification of the components in Ligusticum chuanxiong involves, first and foremost, sample extraction via steam distillation, followed by gas chromatography and mass spectrometry analyses.3 The process may be lengthier when the components to be identified are novel chemical compounds. Although powerful chromatographic and spectroscopic analytical methods may help with the elucidation of novel structures, there is currently no automated method with high-throughput capabilities.4,5

Systems that aim to automatically propose a list of possible chemical structures for unknown compounds in a mixture based on chromatographic and spectroscopic data are commonly known as computer-assisted methods for structure elucidation (or computer-aided structure elucidation, CASE for short). CASE was developed thirty years ago6-9 to elucidate the chemical structures of small organic molecules. Different algorithms and chemical knowledge, including heuristic rules,10 stochastic optimization,11 and graph algorithms,12 have been used in this field. In the last decade, many advanced algorithms supporting the CASE expert systems were developed to realize the dream of many spectroscopists: fully automated structure elucidation.13-17 However, such methods still cannot replace human intelligence18 and still have many limitations.19 CASE systems largely rely on two-dimensional nuclear magnetic resonance spectroscopy (2D NMR) data as inputs, because they provide abundant structural information.20 However, 2D NMR experiments can be quite time-consuming; typically, data acquisition takes hours. Furthermore, if co-eluting compounds in the mixture of interest cannot be completely separated, the CASE system will need other experimental data to assist with the structure elucidation process.21

Compared to NMR methods, mass spectrometry (MS) is more sensitive and is therefore a good starting point for identification procedures.22 Furthermore, impurities in input mixtures do not usually impact the results of MS experiments. In the last decade, many MS-based computational methods for automatic identification of small molecules have been developed and were recently reviewed by Scheubert et al.23 However, there are no successful structure elucidation methods using MS data alone. Most of these MS-based methods are referred to as “in silico fragmentation spectrum prediction” techniques. Current in silico fragmentation strategies can only be successfully applied to a limited number of classes of molecules, such as lipids, glycans, and alkenes, due to their structural simplicity and homogeneity. Yetukuri et al.24 proposed an approach to predict structures of a specific group of lipids to assist with lipid identification in lipidomics research. Yetukuri and colleagues took advantage of the highly conserved patterns in lipid structures to deduce structures of other lipids using known lipid scaffolds. Once the lipid scaffold is determined, fragments attached to that scaffold can be added to construct more diverse lipids to match unknown lipid signals in the ultra-performance liquid chromatography-mass spectrometry (UPLC-MS) spectra. Although the method proposed by Yetukuri et al. can identify novel chemical structures in lipids from MS data, its application is limited to lipid scaffolds. Construction of an accurate and fully automated CASE system to predict unknown structures in mixtures using mass spectra remains a challenge. Part of the inspiration for this study comes from the study of chemical structure scaffold classification.25,26 Schuffenhauer et al.26 proposed a method to construct chemical structure trees based on their scaffolds; in this way, structures can be classified by the trees. Our study combines the advantages of two ideas, namely, to classify related structures by their scaffolds and to predict novel structures by adding different branches to the scaffolds.

In this study, we have developed a computational method, which we named NP-StructurePredictor, that can accelerate the process of identifying chemical structures using liquid chromatography-mass spectrometry (LC-MS) spectra. NP-StructurePredictor identifies compounds in a mixture by matching the compounds of interest to the most likely chemical structures of natural products and their analogs through a series of analyses. This method can also predict structures that do not exist in the current libraries by combining different scaffolds and side chains and inferring structures from similar scaffolds. For each target molecular weight from the input mass spectra, NP-StructurePredictor returns a list of possible structures and their relative probabilities. The proposed chemical structures with higher rankings are the most likely structural candidates for the unknown compounds in the mixtures. Four complex herbal mixtures with known constituents were used as a validation set in this study, and all their components were successfully predicted using this method.

There are several major differences between the previous CASE systems and our current NP-StructurePredictor system. First, NP-StructurePredictor requires only experimental data, essentially only the m/z list from the LC-MS spectra, as inputs. It does not require additional NMR or tandem mass spectrometry (MS/MS) spectra for further structural information. Second, the prediction model of NP-StructurePredictor was built using a large collection of natural products; therefore, it is tailored for the structural prediction of natural products. Finally, NP-StructurePredictor predicts unknown structures with rankings based on the possible combinations of scaffolds and side chains from our large databases of natural products. This ranking ensures NP-StructurePredictor proposes a list of the closest structural matches to the currently known plant-derived natural products.

Materials and Methods

System overview

A novel CASE system, NP-StructurePredictor, is presented and used to accelerate the process of identifying chemical structures in a mixture. The system architecture is shown in Figure 1. The first step is to extract a peak table containing processed and aligned mass peaks with molecular weights from LC-MS experiments. In our experiments, we used MAVEN27 and XCMS28 to extract the peak tables. The list of targeted molecular weights (targeted MWs) in the peak table is the required input for our system. Moreover, if users have knowledge about which potential scaffold structures (seed scaffolds) are likely to be present, NP-StructurePredictor can use this information to predict the exact compounds. When the seed scaffolds for the test material are not provided, NP-StructurePredictor performs a full search over all scaffolds in our database to select suitable seeds and generate suitable candidates. A large natural products database was collected in NP-StructurePredictor, and a side chains database was then constructed from the natural products database. In the structure elucidation procedure, a branch and bound algorithm was designed to systematically search for structures that match the targeted MW values by linking the appropriate fragments from the side chains database to the input seed scaffolds. Furthermore, a scaffolds database was also constructed from our collected natural products database, and a hierarchical scaffolds tree was constructed to capture the relationships between all the scaffolds. According to the hierarchical scaffolds tree, the scaffolds that are closely related to the input seed scaffold are also used as candidate scaffolds in the search procedure of NP-StructurePredictor. For each peak with a specific molecular weight, NP-StructurePredictor identifies it by returning a list of possible compounds matching that targeted MW, and the resulting compounds are ranked according to their calculated probability of occurrence in nature. Each module in NP-StructurePredictor is described in the following sections.
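The text does not state how the targeted MWs are derived from the m/z values in the peak table. As a minimal sketch only, the conversion below assumes that each peak is a simple protonated ion in positive mode or a deprotonated ion in negative mode; other adducts (for example sodium or formate) would need different offsets. The function names `neutral_mass` and `targeted_mws` are hypothetical and are not part of NP-StructurePredictor.

```python
PROTON_MASS = 1.007276  # Da; assumes the only adduct is a gained or lost proton


def neutral_mass(mz, mode="positive", charge=1):
    """Convert an observed m/z to a neutral mass, assuming (de)protonation only."""
    if mode == "positive":   # [M + zH]z+
        return mz * charge - charge * PROTON_MASS
    if mode == "negative":   # [M - zH]z-
        return mz * charge + charge * PROTON_MASS
    raise ValueError("mode must be 'positive' or 'negative'")


def targeted_mws(mz_list, mode="positive"):
    """Build the list of targeted MWs from a peak table's m/z column."""
    return [round(neutral_mass(mz, mode), 4) for mz in mz_list]


if __name__ == "__main__":
    # Illustrative m/z values, not taken from the validation datasets.
    print(targeted_mws([181.0707, 303.0499], mode="positive"))
```

In practice the ionization mode and the adduct type would be taken from the acquisition settings rather than assumed.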

Collection of natural products

A large natural products database (NPDB) was collected in NP-StructurePredictor for further construction of the scaffolds and side chains databases. The main concept underlying the function of NP-StructurePredictor is that it utilizes the structural information gleaned from three natural products databases to predict possible structures in each experiment. The three natural products databases used here are the Dictionary of Natural Products (DNP),29 the “ZINC natural products” subset of ZINC,30 and the Traditional Chinese Medicine Database (TCMD, updated on 2010-07-14).31 DNP listed 203615 records, the “ZINC natural products” subset of ZINC listed 89425 records, and TCMD listed 3897 records. The structure data collected from these three databases were first standardized using the ChemAxon Application Programming Interfaces (ChemAxon Kft, Máramaros köz 3/a, Budapest, 1037 Hungary). The standardization included neutralization, removal of valence errors, and retaining the largest fragment. After standardization, all records from the three databases were pooled together and the redundant records were removed. Moreover, since the majority of compounds used in medicinal chemistry and chemical biology research contain rings,32 we only considered the structures with rings in this study (a total of 243130 records, of which 226949 records contained rings). The reason for compiling a large NPDB is twofold: (1) to increase the probability of matching structures from the NPDB in our initial searches, and (2) to learn the diverse structural patterns of the NPDB. The next section covers our pattern analysis of the NPDB.

Construction of scaffold trees and generation of side chains

Since similar structures could possess similar biological functions, we constructed hierarchical scaffold trees from the collected NPDB to search for core structures similar to the given seed scaffolds and thereby aid in elucidating possible compounds in mixtures. A scaffold relationship database was first constructed by breaking each structure within the NPDB into separate substructure categories: one major chemical scaffold and several side chains. Then, the classified scaffolds were used to construct the scaffold trees. According to Bemis et al.,25 a scaffold is the core structure that remains after all terminal side chains have been eliminated. However, if the terminal chains are linked by double bonds, the chains are retained. The rule for double bonds ensures that the planar sp2 carbon atoms in the scaffolds are distinguishable from the tetrahedral sp3 carbon atoms. The hierarchical scaffold trees were then constructed using the Scaffold Tree Generator26 to illustrate the structural relationships between all the scaffolds. Each node in the tree denotes a scaffold. The parent-child relationships in the trees were defined such that a parent scaffold is a substructure of the child scaffold. To decide which substructures were the child scaffolds and to preserve substructures with more chemical characteristics, thirteen prioritization rules26 were used to remove side chains. The scaffolds having the same parent are defined as sibling scaffolds, and thus, all the sibling scaffolds have the same number of rings. For natural products in a mixture, we utilize the constructed hierarchical scaffold trees to retrieve scaffolds that are similar to the given seed scaffolds. The parent, sibling, and child scaffolds of the given seed scaffolds were all retrieved for input in the next prediction procedure; these candidates combined with the seed scaffolds are referred to as “targeted scaffolds” in this study. By selecting the right scaffolds through surveying the sibling, parent, and child relationships in the scaffold trees, the accuracy of unknown structure elucidation can be enhanced considerably.
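The retrieval of targeted scaffolds described above can be illustrated with a small sketch. It assumes the scaffold tree is available as a mapping from each scaffold to its parent (with `None` for a root); this data layout, the function name `targeted_scaffolds`, and the placeholder labels are all hypothetical and do not reflect the output format of the Scaffold Tree Generator.

```python
def targeted_scaffolds(parent_of, seed):
    """Return the seed scaffold plus its parent, sibling, and child scaffolds."""
    # Invert the child -> parent mapping so that children can be looked up.
    children_of = {}
    for child, parent in parent_of.items():
        children_of.setdefault(parent, []).append(child)

    result = [seed]
    parent = parent_of.get(seed)
    if parent is not None:
        result.append(parent)
        # Siblings share the same parent as the seed.
        result.extend(s for s in children_of[parent] if s != seed)
    result.extend(children_of.get(seed, []))
    return result


if __name__ == "__main__":
    # Placeholder labels: "A" is a one-ring root, "A-B" and "A-C" add one ring.
    tree = {"A": None, "A-B": "A", "A-C": "A", "A-B-D": "A-B"}
    print(targeted_scaffolds(tree, "A-B"))
```

For the seed `"A-B"` the sketch collects the seed itself, its parent `"A"`, its sibling `"A-C"`, and its child `"A-B-D"`.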

To elucidate chemical structures in mixtures that were not already included in the NPDB and to enhance the structural diversity of predicted compounds in our system, a side chain database is generated and used to combine the seed scaffolds for construction of compound candidates in a mixture. The side chains are defined as the parts of the structure other than the scaffold. We only considered the side chains that are not hydrogen. All possible side chains that can be linked on each position of the scaffold were collected from NPDB. Moreover, the probabilities of occurrence for side chains in particular positions were also calculated. For a scaffold with atom-positions $\{1, 2, \ldots, S\}$, the probability of occurrence of side chain $x$ at atom-position $y$ is defined as follows:

$$P(x_y) = \frac{f(x_y)}{N_y}, \quad \text{for an atom-position } y \in \{1, 2, \ldots, S\} \qquad (1)$$

where $f(x_y)$ is the frequency at which side chain $x$ occurred at atom-position $y$ of the scaffold in the NPDB, and $N_y$ is the total number of possible side chains that occurred at atom-position $y$ of the scaffold in the NPDB.
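As a hedged illustration of equation (1), the sketch below estimates per-position side chain probabilities from a list of observations for a single scaffold. It assumes that the denominator is the total number of side chain occurrences recorded at that atom-position, so that the probabilities at each position sum to one; the text can also be read differently. The function name `side_chain_probabilities` and the sample observations are hypothetical.

```python
from collections import Counter, defaultdict


def side_chain_probabilities(observations):
    """Estimate the probability of each side chain at each atom-position.

    observations: iterable of (atom_position, side_chain) pairs collected
    for one scaffold across all database structures that contain it.
    """
    counts = defaultdict(Counter)
    for position, side_chain in observations:
        counts[position][side_chain] += 1   # frequency of this side chain here

    probabilities = {}
    for position, counter in counts.items():
        total = sum(counter.values())       # assumed meaning of the denominator
        probabilities[position] = {sc: n / total for sc, n in counter.items()}
    return probabilities


if __name__ == "__main__":
    # Illustrative observations, not taken from the NPDB.
    obs = [(1, "OH"), (1, "OH"), (1, "OCH3"), (2, "CH3")]
    print(side_chain_probabilities(obs))
```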

After analyzing which side chains can be linked to each scaffold in the NPDB and calculating the probabilities of occurrences of side chains at each position of the scaffold, we also must determine which possible sets of positions on the scaffold can be linked to the side chains. These possible sets of positions on the scaffold are called atom-position configurations in this study. We used the atom-position configurations to elucidate unknown chemical structures by extending appropriate side chains.

Prediction procedure of NP-StructurePredictor

The main prediction procedure in NP-StructurePredictor is to identify possible structures that match the targeted MWs. The targeted scaffolds selected by NP-StructurePredictor and the user are used as the starting information for structural elucidation. Users can choose to provide the seed scaffolds if they have prior knowledge of the potential structural features that are likely to be found in the compounds. When the seed scaffolds for the test material are not provided, NP-StructurePredictor can regard all scaffolds in our database as the seed scaffolds for generation of suitable candidates. In that case the system requires a longer execution time; however, NP-StructurePredictor is efficient enough to complete the task in a reasonable timeframe. To generate structures having a targeted MW of $W_0$ from the input peak table, NP-StructurePredictor first takes each scaffold listed in the set of targeted scaffolds as the starting seed. The prediction procedure provides two ways of generating possible chemical structures having the target scaffold. The first approach directly searches the NPDB for existing structures that contain the relevant targeted scaffolds and match the MW criteria. The second approach applies a branch and bound algorithm to computationally formulate possible chemical structures by linking all possible side chains on the targeted scaffolds to match the target MWs based on the atom-position configurations. We define a combination of possible side chains on a considered scaffold as $C = (X_1, X_2, \ldots, X_S)$, where $X_n$ is a side chain at atom-position $n$, and $S$ is the number of atom positions of a specific scaffold. For the considered scaffold, if we want to find the $R$ most likely structures with respect to a targeted MW of $W_0$, the computational problem of generating possible structures can be formulated as the following equations:

$$\text{Find } R \text{ combinations } (C_1, C_2, \ldots, C_R) \qquad (2)$$

$$\text{such that } P(C_1) \text{ is maximum, } P(C_1) > P(C_2) > \cdots > P(C_R) \qquad (3)$$

$$\text{and } \forall C \in (C_1, C_2, \ldots, C_R), \ \sum_{i=1}^{S} MW(X_i) = w \qquad (4)$$

where $w$ is the molecular weight that remains after the MW of the scaffold is excluded from $W_0$, and $MW(X_i)$ represents the molecular weight of the side chain $X_i$. The probability of a combination $C$ is defined as follows:

$$P(C) = \prod_{i=1}^{S} P(X_i), \quad C = (X_1, X_2, \ldots, X_S) \qquad (5)$$

The probability of a side chain $X_i$ occurring at position $i$ was defined in equation (1). Formula (3) ensures that the combinations of side chains are the best $R$ candidates with the highest probability of occurrence in nature, and formula (4) ensures that the total molecular weight of the selected side chains matches $w$. The value of $w$ is the targeted MW excluding the scaffold since we only consider the MWs of the side chains in the prediction process. The probability of the selected combination of side chains is defined in equation (5). Because the probability of each side chain occurring on a scaffold is considered an independent event, the probability of the whole group of selected side chains is given by the product of the probabilities of each side chain at their corresponding atom positions. To identify the best $R$ side chain lists, a brute-force strategy would search all the combinations of side chains and compute the probability of each combination; such a strategy runs in exponential time. Since we know that some combinations are impossible for the targeted MWs, NP-StructurePredictor adopts a branch and bound algorithm to enhance search performance. The algorithm iteratively searches for the best side chain candidates from atom-position 1 to position $S$ of the targeted scaffolds and omits impossible combinations in each iteration. To illustrate the idea of this algorithm, consider the following case. Suppose we are searching all combinations of side chains on a scaffold with $S$ atom-positions, and the algorithm has reached a partial side chain combination $C_{current} = (X_1, X_2, \ldots, X_y)$ once positions 1 to $y$ ($y < S$) have been processed. The combination $C_{current}$ can be skipped once its MW, $MW(C_{current})$, is greater than the target weight $w$: because every additional side chain only adds weight, there is no combination $(X_1, X_2, \ldots, X_{y+1})$ with an MW smaller than or equal to $w$. Thus, the algorithm saves computation time by only branching to appropriate combinations.
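The search described by equations (2) to (5) can be sketched as a depth-first branch and bound. This is an illustrative reading of the text, not the authors' Java implementation: the function name `top_combinations`, the input layout, the mass tolerance `tol`, and the toy masses and probabilities are all assumptions. The only bound applied is the one described above, namely abandoning a partial combination as soon as its accumulated weight exceeds the target.

```python
import heapq


def top_combinations(options, w, R, tol=0.01):
    """Return up to R side chain combinations whose total MW matches w.

    options[i] is a list of (name, mw, probability) tuples for atom-position i.
    Results are (probability, names) pairs sorted by decreasing probability.
    """
    best = []            # min-heap holding at most R (probability, names) items
    S = len(options)

    def search(i, names, mw, prob):
        if mw > w + tol:             # bound: side chains only add weight
            return
        if i == S:                   # all positions filled
            if abs(mw - w) <= tol:   # weight constraint, equation (4)
                item = (prob, tuple(names))
                if len(best) < R:
                    heapq.heappush(best, item)
                elif item > best[0]:
                    heapq.heapreplace(best, item)
            return
        for name, side_mw, p in options[i]:      # branch on position i
            names.append(name)
            search(i + 1, names, mw + side_mw, prob * p)   # equation (5)
            names.pop()

    search(0, [], 0.0, 1.0)
    return sorted(best, reverse=True)


if __name__ == "__main__":
    # Toy nominal masses and made-up probabilities for two atom-positions.
    options = [
        [("OH", 17.0, 0.6), ("CH3", 15.0, 0.4)],
        [("CH3", 15.0, 0.7), ("OH", 17.0, 0.3)],
    ]
    for prob, names in top_combinations(options, w=32.0, R=2):
        print(names, round(prob, 2))
```

A tolerance is used because measured weights are rarely exact; a stricter or looser value would depend on the mass accuracy of the instrument.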

NP-StructurePredictor finds possible structures for each targeted MW in the user’s peak table. The whole process continues until all targeted MWs in the peak table are processed. The source code of NP-StructurePredictor can be downloaded from http://npstructurepredictor.cmdm.tw/NPSP.rar. The algorithm was implemented in Java (JDK 7) and tested on a Linux PC with an Intel Xeon(R) CPU 2.40 GHz with 32 GB of memory. Users can build NP-StructurePredictor and run the structural elucidation step by step according to the provided manual file.

Validation datasets

Four herbal datasets (Cuscuta chinensis, Ophiopogon japonicus, Polygonum multiflorum, and angelica) were selected to evaluate our system’s performance. All herbal datasets were taken from the Natural Product Laboratory of Taiwan Medical and Pharmaceutical Industry Technology and Development Center (PITDC). Moreover, the Natural Product Laboratory of PITDC identified a list of structures of each herb using their own identification procedure. The list of authentic structures was treated as validated results in our evaluation process. The raw MS data in the mzML format can be downloaded from the following link (http://npstructurepredictor.cmdm.tw/Spectra.rar). The detailed experimental methods are described in the next section.

Experimental methods

Sample powder (200 mg) was transferred to a 2-mL centrifuge tube followed by 1.5 mL methanol/water (7/3), and the tube was then placed in an ultrasonic bath (Branson 5510/5210) at maximum ultrasonication for 15 min at 40°C. The sample tube was then centrifuged at 10,000 rpm for 5 min (Hermle Z 323K). The extraction was repeated three more times, and the upper extracts were combined. Then, 70% methanol was added to the filtrate to bring the sample solutions up to a total volume of 5 mL. The solutions were filtered through 0.45-µm filters before high-performance liquid chromatography (HPLC) and high-performance liquid chromatography-electrospray ionization-mass spectrometry (HPLC-ESI-MS) analyses. HPLC analyses were carried out using an Agilent 1100 HPLC series system (Santa Clara, CA, USA). The column used was a Zorbax SB-C18 column (4.6 mm × 250 mm i.d., 5 µm; Agilent Company, USA), and it was protected by a guard column (3.9 mm × 20 mm i.d., 5 µm). The extracts of the four herbs were analyzed under the same HPLC conditions. The mobile phase consisted of solvent A, water/0.1% formic acid, and solvent B, acetonitrile/0.1% formic acid, with a gradient program at a flow rate of 1 mL/min. The gradient elution program was as follows: 0-40 min, linear gradient from 10 to 35% B; 40-50 min, linear gradient from 35 to 50% B; 50-60 min, linear gradient from 50 to 100% B; and hold at 100% B for 5 min. The effluent was monitored at 254 nm, 280 nm, and 312 nm. The MS system used was a Bruker Daltonics Esquire 2000 ion trap mass spectrometer (Bremen, Germany) equipped with an orthogonal ESI interface. The ionization parameters were as follows: positive and negative ion mode; capillary voltage, 4000 V; nebulizing gas, nitrogen at 25-30 psi; drying gas flow, 10.0 L/min at 250-300°C. The mass analyzer scanned from 50 to 1000 amu. The MS/MS spectra were recorded in auto MS/MS mode. Other instrument parameters were set according to the properties of each compound. The obtained data, including parent and daughter ion patterns, were compared with the spectra of compounds of similar medicinal herbs in earlier publications or databases. This step led to the preliminary identification of the top five high-intensity peaks. These sample peaks were further compared with the authentic compounds analyzed under the same LC conditions to compare their retention times and MS/MS spectra.

323

- 15 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

324
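The gradient elution program described above is a piecewise-linear function of time, and it can be written down compactly for reproducing the method on another instrument. The following is a minimal sketch, not part of the original workflow; the function name `percent_b` and the table `GRADIENT` are illustrative names introduced here, and the breakpoints are simply transcribed from the text.

```python
from bisect import bisect_right

# Breakpoints (time in min, %B) transcribed from the gradient program in the text:
# 10 -> 35% B over 0-40 min, 35 -> 50% B over 40-50 min,
# 50 -> 100% B over 50-60 min, then a 5 min hold at 100% B.
GRADIENT = [(0.0, 10.0), (40.0, 35.0), (50.0, 50.0), (60.0, 100.0), (65.0, 100.0)]


def percent_b(t_min: float) -> float:
    """Return the percentage of solvent B at time t_min by linear interpolation."""
    if not GRADIENT[0][0] <= t_min <= GRADIENT[-1][0]:
        raise ValueError("time is outside the 0-65 min program")
    times = [t for t, _ in GRADIENT]
    # Index of the segment end point; clamp so that t = 65 min uses the last segment.
    i = min(bisect_right(times, t_min), len(GRADIENT) - 1)
    (t0, b0), (t1, b1) = GRADIENT[i - 1], GRADIENT[i]
    return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)
```

For instance, `percent_b(20)` gives the composition halfway through the first ramp, and the percentage of solvent A at any time is simply `100 - percent_b(t)`.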

Results and discussion


The scaffold tree database

A hierarchical scaffold tree database was generated for the selection of appropriate targeted scaffolds in NP-StructurePredictor. Classification of chemical structures, as well as construction of the scaffold tree database, was achieved using hierarchical scaffold classification.26 There were 83242 different scaffolds that formed 4001 trees in our scaffold tree database. The total number of natural products in our NPDB is 243130, while the number of generated scaffolds is 83242. Because the number of unique scaffolds is only approximately one third of the number of natural product structures in the NPDB, certain structural patterns must recur frequently among the compounds. We can utilize these patterns to generate novel structures for elucidation of unknown chemical structures. Since every scaffold occurs in an average of three structures from the NPDB, we have an adequate number of side chains to generate novel structures for elucidation of unknown compounds. One of the representative trees is shown and discussed in the supplementary Additional File 1 online.
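The ratios quoted in this subsection follow directly from the three database counts. The short check below is only an arithmetic illustration using the figures stated in the text; the variable names are introduced here for convenience.

```python
# Counts stated in the text.
n_structures = 243130   # natural products in the NPDB
n_scaffolds = 83242     # unique scaffolds generated
n_trees = 4001          # scaffold trees

structures_per_scaffold = n_structures / n_scaffolds   # "an average of three"
scaffold_fraction = n_scaffolds / n_structures         # "approximately one third"
scaffolds_per_tree = n_scaffolds / n_trees             # average tree size

print(f"{structures_per_scaffold:.2f} structures per scaffold")
print(f"{scaffold_fraction:.3f} scaffolds per structure")
print(f"{scaffolds_per_tree:.1f} scaffolds per tree")
```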

The scaffold database

We designed several strategies in the searching protocol to enhance the efficiency of NP-StructurePredictor. During the process of scaffold database construction, we used the symmetrical structure principle to ensure that all possible atom positions can be linked by any given side chain. In this way, NP-StructurePredictor can generate structures that are not already available in the current NPDB. However, since a total of 243130 structures were included in the NPDB, direct database searching and structure matching cannot be achieved within a reasonable period. A threshold strategy and an indexing strategy were applied to increase the execution speed. First, we directly applied the largest molecular weight (LMW) and the smallest molecular weight (SMW) thresholds to filter out unsuitable scaffolds; if the targeted molecular weight (MW) of the scaffold is smaller than the SMW threshold or larger than the LMW threshold, then we can directly bypass this scaffold and terminate any further processing. Although this strategy may increase the risk of losing structures that should be identified, our validation experiments revealed that our system can still effectively identify all the structures. Second, we assigned an index to each scaffold that maps directly onto the structures in the NPDB. In the worst-case scenario, the final searching protocol reduced the number of structures to be examined from 243130 (the total in the NPDB) to 10214 (the largest number of structures sharing the same scaffold).
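The threshold and indexing strategies can be illustrated with a small sketch. This is not the NP-StructurePredictor implementation: the names `ScaffoldEntry`, `build_index`, and `search`, the tolerance `tol`, and the toy records are all introduced here for illustration, and it is an assumption of this sketch that the SMW and LMW thresholds of a scaffold are the smallest and largest MW among the structures indexed under it.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class ScaffoldEntry:
    smw: float = float("inf")     # smallest MW seen for this scaffold (assumed definition)
    lmw: float = float("-inf")    # largest MW seen for this scaffold (assumed definition)
    structures: List[Tuple[str, float]] = field(default_factory=list)


def build_index(records: Iterable[Tuple[str, str, float]]) -> Dict[str, ScaffoldEntry]:
    """Group (structure_id, scaffold_id, mw) records by scaffold and track MW bounds."""
    index: Dict[str, ScaffoldEntry] = {}
    for structure_id, scaffold_id, mw in records:
        entry = index.setdefault(scaffold_id, ScaffoldEntry())
        entry.smw = min(entry.smw, mw)
        entry.lmw = max(entry.lmw, mw)
        entry.structures.append((structure_id, mw))
    return index


def search(index: Dict[str, ScaffoldEntry], scaffold_id: str,
           target_mw: float, tol: float = 0.01) -> List[str]:
    """Return structures under one scaffold whose MW matches the target."""
    entry = index.get(scaffold_id)
    # Threshold check: bypass the scaffold without touching its structures.
    if entry is None or target_mw < entry.smw - tol or target_mw > entry.lmw + tol:
        return []
    # Index lookup: only structures sharing this scaffold are compared.
    return [sid for sid, mw in entry.structures if abs(mw - target_mw) <= tol]


# Toy records; MW values for the three flavonoids are those quoted in the case study.
TOY_RECORDS = [
    ("luteolin", "flavone", 286.24),
    ("kaempferol", "flavone", 286.24),
    ("quercetin", "flavone", 302.24),
    ("toy-compound", "other-scaffold", 150.00),
]
TOY_INDEX = build_index(TOY_RECORDS)
```

With such an index, a query touches only the structures filed under the chosen scaffold, which is the effect the worst-case figure of 10214 structures describes.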

Evaluation of time performance

This section compares the time performance of NP-StructurePredictor with that of a traditional brute-force algorithm. NP-StructurePredictor adopted the branch and bound algorithm to significantly improve the speed of structure elucidation. Two different modes of the branch and bound algorithm were implemented in NP-StructurePredictor to identify unknown structures: 1) using the information of our constructed atom-position configurations (“learned R-group”), and 2) letting users specify atom-position configurations (“added R-group”) to restrict the number of possible atom positions that can be linked by side chains. A traditional brute-force algorithm, which generated all combinations of possible structures, was compared with our two branch and bound modes. We used the total number (Nc) of possible structures that could be generated in the system to evaluate the computational time of these methods, because 1) determining Nc is the most time-intensive step in the whole process, and 2) computation time is approximately proportional to Nc when all other conditions are kept the same.
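The contrast between exhaustive enumeration and branch and bound can be sketched on a toy problem. The code below is a generic illustration under stated assumptions, not the algorithm implemented in NP-StructurePredictor: side chains are represented only by hypothetical nominal mass increments, every position accepts every side chain, and pruning is done purely on mass. The names `brute_force`, `branch_and_bound`, and `SIDE_CHAINS` are introduced here.

```python
from itertools import product
from typing import Dict, List, Tuple

Assignment = Tuple[str, ...]


def brute_force(base: int, increments: Dict[str, int], n_positions: int,
                target: int) -> Tuple[List[Assignment], int]:
    """Enumerate every side-chain assignment, then keep those matching the target mass."""
    names = list(increments)
    examined = 0
    hits: List[Assignment] = []
    for combo in product(names, repeat=n_positions):
        examined += 1
        if base + sum(increments[name] for name in combo) == target:
            hits.append(combo)
    return hits, examined


def branch_and_bound(base: int, increments: Dict[str, int], n_positions: int,
                     target: int) -> Tuple[List[Assignment], int]:
    """Depth-first search that prunes partial assignments unable to reach the target."""
    names = list(increments)
    lightest = min(increments.values())
    heaviest = max(increments.values())
    hits: List[Assignment] = []
    expanded = 0  # nodes that survive the bound check

    def extend(prefix: List[str], mass: int) -> None:
        nonlocal expanded
        remaining = n_positions - len(prefix)
        # Bound: even the lightest completion overshoots, or the heaviest undershoots.
        if mass + remaining * lightest > target:
            return
        if mass + remaining * heaviest < target:
            return
        expanded += 1
        if remaining == 0:
            hits.append(tuple(prefix))  # bounds guarantee mass == target here
            return
        for name in names:
            prefix.append(name)
            extend(prefix, mass + increments[name])
            prefix.pop()

    extend([], base)
    return hits, expanded


# Hypothetical nominal mass increments for replacing one H on the scaffold.
SIDE_CHAINS = {"H": 0, "CH3": 14, "OH": 16, "OCH3": 30}
# Toy query: nominal scaffold mass 222, six substitutable positions, target mass 286.
bf_hits, bf_count = brute_force(222, SIDE_CHAINS, 6, 286)
bb_hits, bb_count = branch_and_bound(222, SIDE_CHAINS, 6, 286)
```

Both functions return the same assignments, but the brute-force count grows as the number of side chains raised to the number of positions, whereas the bounded search discards whole subtrees. Restricting the set of substitutable positions, as the learned and added R-group modes do, reduces the exponent itself.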

In Figure 2, four user-identified scaffolds from the Polygonum multiflorum test case were evaluated and compared using the three algorithms mentioned above. The results show that branch and bound methods using either learned R-groups or added R-groups can significantly improve execution times compared to the brute-force method. Taking scaffold 2-2 as an example, Nc in learned R-group mode is 2.38 × 10^7, and Nc in added R-group mode is 7.68 × 10^5. However, Nc using the brute-force approach is 7.36 × 10^16. The total number of possible structures generated using the brute-force approach is therefore larger than that of the two branch and bound methods by a factor of approximately 10^10 (about 3 × 10^9 relative to learned R-group mode and about 10^11 relative to added R-group mode). Moreover, the results of the branch and bound approaches were more precise. These results are further discussed in the subsection titled “Structure elucidation using a combinatorial side chains approach.” While the rankings for some of the known structures identified by the brute-force approach fell below the top one hundred most likely structures, the branch and bound approach improved the rankings such that they all fell within the top ten most likely structures. This discrepancy arises because the brute-force algorithm for structure generation considers all possible atom positions in scaffolds, and therefore, false positive chemical structures were included in the results. Either the learned R-group or the added R-group approach can address this challenge and empirically generate better structures. It should be noted that the difference between the combination numbers (Nc) of added R-group mode and learned R-group mode is not significant; however, learned R-group mode is fully automatic and requires no user intervention. In contrast, since added R-group mode allows users to specify R-groups, the results could be biased toward the users’ preexisting knowledge.
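Because computation time is taken to be approximately proportional to Nc, the reduction factors for scaffold 2-2 can be read off directly from the three reported values. The snippet below is only an arithmetic check of those figures; the dictionary `NC` and the helper `reduction_factor` are names introduced here.

```python
import math

# Nc values reported in the text for scaffold 2-2.
NC = {
    "brute force": 7.36e16,
    "learned R-group": 2.38e7,
    "added R-group": 7.68e5,
}


def reduction_factor(mode: str) -> float:
    """Ratio of the brute-force Nc to the Nc of the given branch and bound mode."""
    return NC["brute force"] / NC[mode]


for mode in ("learned R-group", "added R-group"):
    factor = reduction_factor(mode)
    print(f"{mode}: factor {factor:.2e} (about 10^{math.log10(factor):.1f})")
```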

Case studies

The performance of our system was evaluated and validated using four experimental herbs, namely, Cuscuta chinensis, Ophiopogon japonicus, Polygonum multiflorum, and genus Angelica. We compared the predicted results from NP-StructurePredictor with the known components of the four herbal mixtures to evaluate the accuracy of our system.

Two proposed prediction methods were applied in each experimental case. The first approach directly searched our database for structures that contain the scaffolds of interest and match the targeted MW. The second approach generated new structures by linking all possible side chains onto the scaffold to match the target molecular weight. The detailed algorithms are described in the Methods section. In the four case studies below, we refer to these methods as the “first approach” and the “second approach.” It is worth noting that we only applied the learned R-group mode to the second approach.

In the next four sections, four testing herbs were validated and analyzed by NP-StructurePredictor. The “first” and “second” approaches were applied to the first two testing herbs, respectively. The third case demonstrated our capability and efficiency of structure elucidation without inputting seed scaffolds. The last case illustrated the predictive power of our system for structure elucidation even for a very complex herb. Our evaluation demonstrated the following:

1) The ranking strategy of NP-StructurePredictor is practical. We have shown its practicality in four test cases, in which the compounds that were highly ranked using NP-StructurePredictor matched the known compounds in the tested herbs.

2) NP-StructurePredictor can generate novel structures that were not already included in the current NPDB. That is, NP-StructurePredictor improves the identification outcome by suggesting novel and correct (physically and chemically feasible) structures.

Structure elucidation using database searching methods

The herbal mixture Cuscuta chinensis is known for its anti-cancer,33 immunostimulatory,34 and antiosteoporotic activities.35 The structures of the components were confirmed using liquid chromatography-tandem mass spectrometry (LC-MS/MS), and their respective spectra are shown in supplementary Additional File 2, which can be found online. The detected MW values extracted from the mass spectra were 286.24, 302.24, 354.31, and 478.41 (Figure 3B). The validated structures matching the targeted MW value of 286.24 include luteolin and kaempferol, while the validated structures matching the targeted MW values of 302.24, 354.31, and 478.41 are quercetin, 3-[3-(3,4-dihydroxy-phenyl)-acryloyloxy]-1,4,5-trihydroxy-cyclohexanecarboxylic acid, and 2-(3-hydroxy-4-methoxyphenyl)-3,5-dihydroxy-7-O-β-D-glucopyranoside-4H-1-benzopyrane-4-one, respectively. These structures were verified by the Natural Product Laboratory of Pharmaceutical Industry Technology and Development Center (PITDC) and were the correct structures for our testing set.

In this case study, NP-StructurePredictor took the four targeted MW values as input. Two possible known scaffolds, shown in Figure 3A, were used as seed scaffolds in NP-StructurePredictor. One of the scaffolds (flavone, 1-2) is a common backbone structure in Cuscuta chinensis.35-37 In this case, since most of the constituents in Cuscuta chinensis were included in our collected NPDB, we used the database searching approach to directly search existing compounds in the NPDB. The average number of predicted structures identified by NP-StructurePredictor across the four targeted MW values is 3 for scaffold 1-1 and 32 for scaffold 1-2. All five of the validated structures were successfully identified with this approach. The rankings of the five validated structures among all the predicted structures are listed in Figure 3B. All five structures in the Cuscuta chinensis mixture, which were definitively identified by experimental methods, were correctly predicted by our system; more importantly, all were ranked among the top five possible structures.

In this scenario, since all the validated chemical structures were already available in the NPDB, NP-StructurePredictor simply had to retrieve known structures from the database and rank them. The average ranking for these identified structures was approximately 2, indicating that NP-StructurePredictor consistently recommends chemical structures that closely resemble the structures of known compounds in our databases. This case study demonstrated that the searching functionality of our system is reliable and that our ranking method is reasonable.

Structure elucidation using a combinatorial side chains approach

Polygonum multiflorum, also called he shou wu, is one of the most important traditional Chinese medicines and is frequently used as a strong laxative and blood tonic. We used this herbal mixture to demonstrate the second approach of NP-StructurePredictor, which links all possible side chains onto a scaffold to match the targeted MW. The validated structures and their respective targeted MWs obtained from experiments are shown in the supplementary Additional File 3 online. The corresponding targeted MW values from the mass spectra were 270.24, 284.27, 290.27, 406.39, 406.39, 432.38 and 578.53. Four scaffolds derived from known chemical constituents of Polygonum multiflorum that have previously been published in the literature38-40 were used as seed scaffolds in this case (Figure 4A). We first applied the database searching approach, and NP-StructurePredictor returned a list of more than one hundred structures matching each targeted MW. However, most of the known structures could not be correctly predicted, nor were they highly ranked using this method. This was because the compounds in the Polygonum multiflorum herbal mixture are complex (e.g., procyanidin B2) and very diverse (a total of seven structures with four scaffolds). To better predict the chemical structures in this mixture, we applied the second prediction approach, which appends appropriate side chains to the targeted scaffolds based on the atom-position configurations. NP-StructurePredictor then utilizes the top five most common atom-position configurations to generate novel structures. All seven confirmed constituents in Polygonum multiflorum were correctly identified. Incorporation of this approach into NP-StructurePredictor improved the predictions. Although the second approach iteratively searched all possible combinations of side chains on the targeted scaffolds and generated a huge number of possible structures matching the targeted MW, the validated structures were still ranked highly (Figure 4B). The average ranking of the correct structures was approximately 4. A total of 2858 compounds containing the 2-2 scaffold were generated to match the targeted MW value of 406.39. The known structures 3,5,3',4'-tetrahydroxystilbene-4'-O-β-D-glucopyranoside and 2,3,5,4'-tetrahydroxystilbene 2-O-β-D-glucopyranoside, both containing the 2-2 scaffold, were ranked 1 and 4, respectively. This case study demonstrated that extending side chains on targeted scaffolds was useful in improving structure rankings (reducing false positive identifications).

Structure elucidation without inputting seed scaffolds

NP-StructurePredictor contains an option to elucidate “unknown” chemical structures by directly searching through all 83242 possible scaffolds, without inputting any seed scaffold. We took another popular traditional Chinese herb, Ophiopogon japonicus (also known as Maidong), as an example to compare the prediction results with or without inputting seed scaffolds. Ophiopogon japonicus has been used clinically as a treatment for chronic inflammation and coronary heart disease.41, 42 The seven known structures from the Ophiopogon japonicus mixture and their experimentally obtained mass spectra can be found in the supplementary Additional File 4 online. Their corresponding targeted MW values extracted from the mass spectra are 328.32, 342.35, 356.33, and 370.36. Several chemical constituents of the roots of Ophiopogon japonicus were elucidated by spectroscopic and chemical analyses,43, 44 and we derived three possible scaffolds (Figure 5A) from those chemical constituents. Using the atom-position configurations approach to combine possible scaffolds with a weighted list of side chains, NP-StructurePredictor correctly identified all seven compounds from the Ophiopogon japonicus herbal mixture and assigned them relatively high rankings (Figure 5B). Although methylophiopogonanone B has the lowest rank among all the prediction results, this compound still ranked 5 out of the 638 generated structures for its targeted MW. Furthermore, the other six compounds from Ophiopogon japonicus all ranked in the top 3. However, when we applied the direct NPDB searching approach, two of the seven experimentally confirmed structures (5,7,2'-trihydroxy-8-methyl-3-(3',4'-methylenedioxybenzyl)chromone and 5,7,2'-trihydroxy-6-methyl-3-(3',4'-methylenedioxybenzyl)chromone) could not be identified or matched because these two natural products do not currently exist in our NPDB. Although we have included a large number of natural products (226949) from three well-known natural products databases, we recognize that our NPDB is not all-inclusive. This demonstrated that our prediction system’s ability to generate novel chemical structures is crucial for the structural elucidation process. For structures that were not already included in the NPDB, NP-StructurePredictor is well-equipped to generate new structures to compensate for unknown natural products.

When the seed scaffolds for the test species cannot be provided by users, NP-StructurePredictor can still elucidate “unknown” chemical structures. Although this process takes longer, NP-StructurePredictor can complete this task in a reasonable timeframe. We took a targeted MW value of 342.35 from Ophiopogon japonicus as a test case to identify unknown chemical structures by searching all scaffolds in NP-StructurePredictor. Only two structural candidates were generated by NP-StructurePredictor when the known scaffold was given. However, after performing the second prediction approach on all 83242 scaffolds, the number of generated compound candidates increased to 17332. The total execution time was approximately 7 days. All structures of Ophiopogon japonicus with a targeted MW value of 342.35 were correctly identified, and the rankings of the three known structures, methylophiopogonanone A, 5,7,2'-trihydroxy-8-methyl-3-(3',4'-methylenedioxybenzyl)chromone, and 5,7,2'-trihydroxy-6-methyl-3-(3',4'-methylenedioxybenzyl)chromone, were 23, 133, and 159, respectively. This example demonstrated that the validated structures can still be ranked in the top one percent of compounds even without inputting seed scaffolds. We recommend that users choose the top 200 generated compounds as the likely candidates in the testing mixture, and further utilize known mass spectra or structural information to verify these structures.

Structure elucidation for a complex herbal mixture

We chose a complex herbal mixture for our last case study. In this case, the mixture contained Chinese angelica, Hualien angelica, and Japanese angelica to illustrate the effectiveness of structure elucidation by NP-StructurePredictor. The root of Angelica (Danggui) has been widely used for the treatment of many diseases because of its anti-oxidation, anti-tumor, and anti-inflammatory activities.45 The structures of the bioactive constituents isolated from angelica are very complex.45 The chemical components are mainly composed of different types of coumarins, acetylenic compounds, chalcones, sesquiterpenes and polysaccharides.46 There were 45 validated compounds in our genus Angelica. The complex herbal mixture used in this study contains 46 validated compounds, as shown in supplementary Additional File 5, which can be found online. Thirty-seven different targeted MW values from mass spectra ranging from 162.03 to 574.29 were reported. In Additional File 5, 46 validated structures are listed according to their MWs.

A total of six known scaffolds derived from literature reports46 are shown in Figure 6. We directly used the second prediction approach of NP-StructurePredictor to elucidate the structures based on the six given scaffolds. The prediction results are reported in supplementary Additional File 5. In this case, a total of 7079 compounds were generated by NP-StructurePredictor based on the six seed scaffolds and 35 targeted MWs. The average number of generated structures for each targeted MW was 37. As shown in the table in supplementary Additional File 5, the average ranking for the true structures in this herbal mixture was approximately 4, indicating that NP-StructurePredictor can make good predictions even for a complex chemical mixture. For example, in the mixture of angelica, the chemical constituent byakangelicin was ranked first out of the thirty-six generated compounds that contain the 4-2 scaffold. The overall prediction rate for this complex mixture was approximately 82%, since only eight of a total of 45 compounds could not be correctly predicted by our system. These compounds were oxypeucedanin, byakangelicol, japoangelone, angelol G, isoepoxypteryxin, edulisin V, japoangelol B, and japoangelol A. NP-StructurePredictor failed to predict these eight compounds because our system can only utilize side chains learned from the collected NPDB to construct possible structures on the known scaffolds; if our system lacks the specific side chains required to generate these unique structures, then NP-StructurePredictor cannot predict them. For example, since the side chain 3-(methoxymethyl)-2,2-dimethyloxirane was not included in our side chain database, NP-StructurePredictor could not link this side chain to the 4-2 scaffold to generate the correct chemical constituent, oxypeucedanin, in the angelica mixture. A solution to this limitation is to manually input extra side chains into our prediction system. To do this, commonly occurring or structurally related side chains need to be added, and the criteria used to select these side chains should be provided as well.
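The reported prediction rate can be reproduced from the counts given in this paragraph. The check below is only an arithmetic illustration; it uses the total of 45 compounds quoted alongside the 82% figure, and the variable names are introduced here.

```python
# Compounds the text lists as not predicted for the Angelica mixture.
MISSED = [
    "oxypeucedanin", "byakangelicol", "japoangelone", "angelol G",
    "isoepoxypteryxin", "edulisin V", "japoangelol B", "japoangelol A",
]
TOTAL_COMPOUNDS = 45  # total quoted with the 82% figure

predicted = TOTAL_COMPOUNDS - len(MISSED)
prediction_rate = predicted / TOTAL_COMPOUNDS

print(f"{predicted}/{TOTAL_COMPOUNDS} predicted = {100 * prediction_rate:.1f}%")
```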

This case study demonstrated the merit of our NP-StructurePredictor system: structures that do not already exist in the NPDB can still be generated by our system for the identification of complex unknown natural products. The structures unavailable in the NPDB include 4-hydroxyderricin and xanthoangelol E, and they were ranked quite highly (4-hydroxyderricin: ranked 1, xanthoangelol E: ranked 5). The ranking strategy is reliable because the predicted structures and their rankings correlate well with the experimentally validated structures. Moreover, the outcomes of case studies 3 and 4 showed that the atom-position configurations approach is an effective strategy for generating new and viable structures to enhance the predictive power of our system for structure elucidation.

Conclusions

In this study, NP-StructurePredictor was developed to efficiently and accurately predict chemical structures of individual constituents of plant mixtures from LC-MS experiments. The only inputs users need to provide to NP-StructurePredictor are a list of molecular weight (MW) values from LC-MS spectra of the sample and seed scaffold information from prior knowledge of the potential structural categories. When the seed scaffolds are not provided, NP-StructurePredictor can directly search all its 83242 scaffolds for suitable candidates. The system computationally generates possible chemical structures based on the user-inputted target MWs by combining the most likely scaffolds with a list of side chains from our curated NPDB. NP-StructurePredictor ranks the predicted structures, allowing the most likely natural product structures and their analogs to be proposed accordingly. Moreover, NP-StructurePredictor can predict novel structures that were not already available in our NPDB. NP-StructurePredictor is superior to previously developed methods that use heuristic rules or chemical structural searches to generate structures because it can automatically elucidate structures based on known side chains and correctly propose the most plausible structures with respect to current experimental results. According to our four validation case studies, our system can be used to predict natural products in any herbal mixture. NP-StructurePredictor can also be utilized as a preliminary structure elucidation screening system to reduce large numbers of possible chemical structures, accelerating further identification procedures. The source code can be downloaded from http://npstructurepredictor.cmdm.tw/NPSP.rar.

Acknowledgments

This work was funded by the Ministry of Science and Technology, Taiwan, grant numbers 105-3011-F-002-010 -, 105-2812-8-002-001-MY2, and 106-2622-B-002-008 -, and National Taiwan University, grant number NTU-ERP-106R880803. Resources of the Laboratory of Computational Molecular Design and Metabolomics and the Department of Computer Science and Information Engineering of National Taiwan University were used to perform these studies.

Abbreviations

LC-MS: liquid chromatography-mass spectrometry

CASE: computer-aided structure elucidation

2D: two-dimensional

NMR: nuclear magnetic resonance

MS: mass spectrometry

UPLC-MS: ultra performance liquid chromatography-mass spectrometry

MS/MS: tandem mass spectrometry

NPDB: natural products database

LMW: the largest molecular weight

SMW: the smallest molecular weight

MW: molecular weight

LC-MS/MS: liquid chromatography-tandem mass spectrometry

PITDC: Pharmaceutical Industry Technology and Development Center

DNP: Dictionary of Natural Products

TCMD: Traditional Chinese Medicine Database

HPLC: high performance liquid chromatography

HPLC-ESI-MS: high performance liquid chromatography-electrospray ionization-mass spectrometry

Supporting Information

Additional File 1. Detailed Results and Discussion.

Additional File 2. The spectral data as well as the structures identified from the spectra from the Cuscuta chinensis case study.

Additional File 3. The spectral data as well as the structures identified from the spectra from the Ophiopogon japonicus case study.

Additional File 4. The spectral data as well as the structures identified from the spectra from the Polygonum multiflorum case study.

Additional File 5. A list of the verified structures and prediction results from the genus Angelica case study using NP-StructurePredictor.

References

1. Koch, M. A.; Schuffenhauer, A.; Scheck, M.; Wetzel, S.; Casaulta, M.; Odermatt, A.; Ertl, P.; Waldmann, H., Charting biologically relevant chemical space: a structural classification of natural products (SCONP). Proc. Natl. Acad. Sci. U S A 2005, 102, 17272-17277.

2. Newman, D. J.; Cragg, G. M., Natural products as sources of new drugs over the last 25 years. J. Nat. Prod. 2007, 70, 461-477.

3. Zhang, C.; Qi, M.; Shao, Q.; Zhou, S.; Fu, R., Analysis of the volatile compounds in Ligusticum chuanxiong Hort. using HS-SPME-GC-MS. J. Pharm. Biomed. Anal. 2007, 44, 464-470.

4. Steinbeck, C., Recent developments in automated structure elucidation of natural products. Nat. Prod. Rep. 2004, 21, 512-518.

5. Steinbeck, C., The automation of natural product structure elucidation. Curr. Opin. Drug. Discov. Devel. 2001, 4, 338-342.

6. Elyashberg, M. E.; Gribov, L. A., Formal-logical method for interpreting infrared spectra from characteristic frequencies. J. Appl. Spectrosc. 1968, 8, 189-191.

7. Lederberg, J.; Sutherland, G. L.; Buchanan, B. G.; Feigenbaum, E. A.; Robertson, A. V.; Duffield, A. M.; Djerassi, C., Applications of artificial intelligence for chemical inference. I. The number of possible organic compounds. Acyclic structures containing C, H, O, and N. J. Am. Chem. Soc. 1969, 91.

8. Nelson, D. B.; Munk, M. E.; Gash, K. B.; Herald, D. L., Alanylactinobicyclone. An Application of Computer Techniques to Structure Elucidation. J Org Chem 1969, 34, 3800-3805.

9. Sasaki, S.; Abe, H.; Ouki, T.; Sakamoto, M.; Ochiai, S., Automated structure elucidation of several kinds of aliphatic and alicyclic compounds. Anal. Chem. 2002, 40, 2220-2223.

10. Buchanan, B. G.; Smith, D. H.; White, W. C.; Gritter, R. J.; Feigenbaum, E. A.; Lederberg, J.; Djerassi, C., Applications of artificial intelligence for chemical inference. 22. Automatic rule formation in mass spectrometry by means of the metaDENDRAL program. J Org Chem 1976, 98, 6168-6178.

11. Steinbeck, C., SENECA: A platform-independent, distributed, and parallel system for computer-assisted structure elucidation in organic chemistry. J. Chem. Inf. Comput. Sci. 2001, 41, 1500-1507.

12. Peironcely, J. E.; Rojas-Cherto, M.; Fichera, D.; Reijmers, T.; Coulier, L.; Faulon, J. L.; Hankemeier, T., OMG: Open Molecule Generator. J. Cheminform. 2012, 4, 21.

13. Christie, B. D.; Munk, M. E., The role of 2-dimensional nuclear-magnetic-resonance spectroscopy in computer-enhanced structure elucidation. J Org Chem 1991, 113, 3750-3757.

14. Peng, C.; Yuan, S. G.; Zheng, C. Z.; Hui, Y. Z., Efficient Application of 2d Nmr Correlation Information in Computer-Assisted Structure Elucidation of Complex Natural-Products. J. Chem. Inf. Comput. Sci. 1994, 34, 805-813.

15. Lindel, T.; Junker, J.; Kock, M., 2D-NMR-guided constitutional analysis of organic compounds employing the computer program COCON. Eur. J. Org. Chem. 1999, 573-577.

16. Blinov, K. A.; Carlson, D.; Elyashberg, M. E.; Martin, G. E.; Martirosian, E. R.; Molodtsov, S.; Williams, A. J., Computer-assisted structure elucidation of natural products with limited 2D NMR data: application of the StrucEluc system. Magn. Reson. Chem. 2003, 41, 359-372.

17. Elyashberg, M. E.; Blinov, K. A.; Williams, A. J.; Molodtsov, S. G.; Martin, G. E.; Martirosian, E.
R., Structure Elucidator: a versatile expert system for molecular structure elucidation from 1D and 2D NMR data and molecular fragments. J. Chem. Inf. Comput. Sci. 2004, 44, 771-792. 18. Elyashberg, M.; Blinov, K.; Molodtsov, S.; Williams, A., Elucidating 'undecipherable' chemical structures using computer-assisted structure elucidation approaches. Magn. Reson. Chem. 2012, 50, 22-27. 19. Elyashberg, M. E.; Williams, A.; Martin, G. E., Computer-assisted structure verification and elucidation tools in NMR-based structure elucidation. Prog. Nucl. Magn. Reson. Spectrosc. 2008, 53, 1-104. 20. Elyashberg, M.; Blinov, K.; Molodtsov, S.; Smurnyy, Y.; Williams, A. J.; Churanova, T., Computer-assisted methods for molecular structure elucidation: realizing a spectroscopist's dream. J Cheminform. 2009, 1, 3. 21. Kind, T.; Fiehn, O., Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. Bmc Bioinformatics 2007, 8, 105. 22. von Bargen, C.; Hubner, F.; Cramer, B.; Rzeppa, S.; Humpf, H. U., Systematic approach for structure elucidation of polyphenolic compounds using a bottom-up approach combining ion trap experiments and accurate mass measurements. J. Agric. Food Chem. 2012, 60, 11274-11282. 23. Scheubert, K.; Hufsky, F.; Bocker, S., Computational mass spectrometry for small molecules. J Cheminform 2013, 5, 12. - 30 ACS Paragon Plus Environment

Page 30 of 39

Page 31 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769

24. Yetukuri, L.; Katajamaa, M.; Medina-Gomez, G.; Seppanen-Laakso, T.; Vidal-Puig, A.; Oresic, M., Bioinformatics strategies for lipidomics analysis: characterization of obesity related hepatic steatosis. BMC Syst Biol 2007, 1, 12. 25. Bemis, G. W.; Murcko, M. A., The properties of known drugs. 1. Molecular frameworks. J Med Chem 1996, 39, 2887-2893. 26. Schuffenhauer, A.; Ertl, P.; Roggo, S.; Wetzel, S.; Koch, M. A.; Waldmann, H., The scaffold tree--visualization of the scaffold universe by hierarchical scaffold classification. J Chem Inf Model 2007, 47, 47-58. 27. Clasquin, M. F.; Melamud, E.; Rabinowitz, J. D., LC-MS data processing with MAVEN: a metabolomic analysis and visualization engine. Curr Protoc Bioinformatics 2012,, 14.11,1-23. 28. Smith, C. A.; Want, E. J.; O'Maille, G.; Abagyan, R.; Siuzdak, G., XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal Chem 2006, 78, 779-787. 29. The Dictionary of Natural Products database is available from Chapman & Hall/CRC at URL http://dnp.chemnetbase.com/. (July 10, 2010) 30. Irwin, J. J.; Shoichet, B. K., ZINC--a free database of commercially available compounds for virtual screening. J Chem Inf Model 2005, 45, 177-182. 31. Chen, C. Y.-C., TCM database@Taiwan: the world's largest traditional chinese medicine database for drug screening in silico. PLoS ONE 2011, 6, e15939. 32. Fejzo, J.; Lepre, C. A.; Peng, J. W.; Bemis, G. W.; Ajay; Murcko, M. A.; Moore, J. M., The SHAPES strategy: an NMR-based approach for lead generation in drug discovery. Chem Biol 1999, 6, 755-769. 33. Nisa, M.; Akbar, S.; Tariq, M.; Hussain, Z., Effect of Cuscuta chinensis water extract on 7,12-dimethylbenz[a]anthracene-induced skin papillomas and carcinomas in mice. J Ethnopharmacol 1986, 18, 21-31. 34. Bao, X.; Wang, Z.; Fang, J.; Li, X., Structural features of an immunostimulating and antioxidant acidic polysaccharide from the seeds of Cuscuta chinensis. 
Planta Med 2002, 68, 237-243. 35. Yang, L.; Chen, Q.; Wang, F.; Zhang, G., Antiosteoporotic compounds from seeds of Cuscuta chinensis. J Ethnopharmacol 2011, 135, 553-560. 36. Umehara, K.; Nemoto, K.; Ohkubo, T.; Miyase, T.; Degawa, M.; Noguchi, H., Isolation of a new 15-membered macrocyclic glycolipid lactone, Cuscutic Resinoside a from the seeds of Cuscuta chinensis: a stimulator of breast cancer cell proliferation. Planta Med 2004, 70, 299-304. 37. Hajimehdipoor, H.; Kondori, B. M.; Amin, G. R.; Adib, N.; Rastegar, H.; Shekarchi, M., Development of a validated HPLC method for the simultaneous determination of flavonoids in Cuscuta chinensis Lam. by ultra-violet detection. Daru 2012, 20, 57. 38. Yao, S.; Li, Y.; Kong, L., Preparative isolation and purification of chemical constituents from the root of Polygonum multiflorum by high-speed counter-current chromatography. J Chromatogr A 2006, 1115, 64-71. 39. Qiu, X.; Zhang, J.; Huang, Z.; Zhu, D.; Xu, W., Profiling of phenolic constituents in Polygonum multiflorum Thunb. by combination of ultra-high-pressure liquid chromatography with linear ion trap-Orbitrap mass spectrometry. J Chromatogr A 2013, 1292, 121-131. 40. Choi, S. G.; Kim, J.; Sung, N. D.; Son, K. H.; Cheon, H. G.; Kim, K. R.; Kwon, B. M., Anthraquinones, Cdc25B phosphatase inhibitors, isolated from the roots of Polygonum multiflorum Thunb. Nat Prod Res 2007, 21, 487-493.

- 31 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788

41. Ge, L. L.; Kan, L. D.; Zhuge, Z. B.; Ma, K. E.; Chen, S. Q., Ophiopogon japonicus strains from different cultivation regions exhibit markedly different properties on cytotoxicity, pregnane X receptor activation and cytochrome P450 3A4 induction. Biomed. Rep. 2015, 3, 430-434. 42. Chen, M. H.; Chen, X. J.; Wang, M.; Lin, L. G.; Wang, Y. T., Ophiopogon japonicus--A phytochemical, ethnomedicinal and pharmacological review. J. Ethnopharmacol 2016, 181, 193-213. 43. Hung, T. M.; Thu, C. V.; Dat, N. T.; Ryoo, S. W.; Lee, J. H.; Kim, J. C.; Na, M.; Jung, H. J.; Bae, K.; Min, B. S., Homoisoflavonoid derivatives from the roots of Ophiopogon japonicus and their in vitro anti-inflammation activity. Bioorg Med Chem Lett 2010, 20, 2412-2416. 44. Li, N.; Zhang, J. Y.; Zeng, K. W.; Zhang, L.; Che, Y. Y.; Tu, P. F., Antiinflammatory homoisoflavonoids from the tuberous roots of Ophiopogon japonicus. Fitoterapia 2012, 83, 1042-1045. 45. Jin, M.; Zhao, K.; Huang, Q.; Xu, C.; Shang, P., Isolation, structure and bioactivities of the polysaccharides from Angelica sinensis (Oliv.) Diels: a review. Carbohydr Polym 2012, 89, 713-722. 46. Sarker, S. D.; Nahar, L., Natural medicine: the genus Angelica. Curr Med Chem 2004, 11, 1479-1500.

789 790

- 32 ACS Paragon Plus Environment

Page 32 of 39

Page 33 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

791

Figure 1. NP-StructurePredictor system overview. The system is illustrated using color-coded modules. The red modules represent raw data input to the NP-StructurePredictor system; the blue modules represent the computational functions executed by the system; and the green modules represent processed data. To emphasize the roles these modules play within the overall system, sub-groups are enclosed in dashed boxes and labelled with white text on a black background. The system is described in detail in the Methods section.

Figure 2. Comparison of combination numbers (Nc) using three approaches. We used the combination number as an index for evaluating the computation times for the three approaches. Four scaffolds (2-1, 2-2, 2-3, and 2-4) were assessed in this evaluation. The y-axis values are the base-10 logarithms of the combination numbers, Nc.

Figure 3. Prediction results for Cuscuta chinensis. Two known possible scaffolds for this herbal mixture were used as input scaffolds and are shown in (A). The confirmed compounds are shown in (B). All of these structures were correctly identified by NP-StructurePredictor. The predicted ranking of each compound is listed below its structure.

Figure 4. Prediction results for Ophiopogon japonicus. Three input scaffolds are shown in (A). The confirmed compounds are shown in (B). All of these structures were correctly identified by NP-StructurePredictor. The predicted ranking of each compound is listed below its structure.

Figure 5. Prediction results for Polygonum multiflorum. Four input scaffolds are shown in (A). The confirmed compounds are shown in (B). All of these structures were correctly identified by NP-StructurePredictor. The predicted ranking of each compound is listed below its structure.

Figure 6. The six input scaffolds for the genus Angelica. These input scaffolds, the known possible scaffolds for this herbal mixture, were gleaned from published data.

TOC GRAPH

For Table of Contents Use Only

NP-StructurePredictor: prediction of unknown natural products in plant mixtures

Yeu-Chern Harn, Bo-Han Su, Yuan-Ling Ku, Olivia A. Lin, Cheng-Fu Chou, and Y. Jane Tseng*
