Exploring Chemical and Biological Space of Terpenoids
Tao Zeng, Zhihong Liu, Huawei Liu, Wengan He, Xiaowen Tang, Liwei Xie, and Ruibo Wu
J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.9b00443 • Publication Date (Web): 12 Aug 2019

Journal of Chemical Information and Modeling

Exploring Chemical and Biological Space of Terpenoids

Tao Zeng1,#, Zhihong Liu2,#, Huawei Liu1, Wengan He1, Xiaowen Tang1, Liwei Xie2, Ruibo Wu1,*

1 School of Pharmaceutical Sciences, Sun Yat-sen University, Guangzhou 510006, P.R. China

2 State Key Laboratory of Applied Microbiology Southern China, Guangdong Provincial Key Laboratory of Microbial Culture Collection and Application, Guangdong Open Laboratory of Applied Microbiology, Guangdong Institute of Microbiology, Guangdong Academy of Sciences, Guangzhou 510070, China

# T. Zeng and Z. Liu contributed equally to this work.

* E-mail: [email protected]

Abstract

Terpenoids represent the largest family of natural products (NPs), and their remarkable chemical and structural diversity makes them an important compound resource for drug discovery. However, a comprehensive understanding of the structure-function features of terpenoid NPs is limited. In this work, we have systematically explored the chemical and biological space of terpenoid NPs, including their distribution, physicochemical properties, scaffold features and functional applications, by utilizing various cheminformatics and bioinformatics approaches. We have not only confirmed that terpenoid NPs have good drug-likeness and great potential for drug discovery but, more importantly, illuminated the uniqueness of cyclic scaffold diversity in different organisms (plants, fungi, bacteria and animals) and the specificity of biological function for the dominant fused-ring scaffolds of terpenoids. The present work supplies a valuable reference for identifying new structures and unknown functions of terpenoid NPs.

Key Words: Natural Products; Cheminformatics; Chemical Diversity; Terpenoids; Bioinformatics

1. Introduction

Terpenoids (also known as isoprenoids or terpenes) are the largest class of naturally occurring hydrocarbon compounds, and the terpenome family (which includes terpenoids, steroids and carotenoids) accounts for nearly one-third of the natural products in the Dictionary of Natural Products (DNP).1 Through enzyme catalysis in all living organisms, the chemically and structurally diverse terpenoids are derived from simple isoprene units (also called five-carbon, C5, building blocks),2-4 namely dimethylallyl diphosphate (DMAPP) and isopentenyl diphosphate (IPP) (Figure 1). For example, geranyl diphosphate (GPP) is assembled from DMAPP and IPP in a “head-to-tail” fashion under the catalysis of GPP synthase; further enzyme-catalyzed addition of IPP building blocks then yields farnesyl diphosphate (FPP), geranylgeranyl diphosphate (GGPP) and geranylfarnesyl diphosphate (GFPP). Subsequently, various cyclization cascade reactions catalyzed by terpenoid cyclases convert these acyclic precursors into an enormous number of cyclic terpenoids with diverse scaffolds and complicated stereochemistry. Thus, it is a miracle of nature that the complex chemical space of terpenoids is generated by sophisticated enzyme catalysis from simple, achiral C5 units. The total number of C5 units in a terpenoid compound is used for classification, namely hemiterpenoids (C5), monoterpenoids (C10), sesquiterpenoids (C15), diterpenoids (C20), sesterterpenoids (C25), triterpenoids (C30), tetraterpenoids (C40) and polyterpenoids (>C40).
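The size-based classification above can be expressed as a small lookup. The following is a minimal sketch and not part of the original work: the function name `classify_terpenoid` and the convention of passing the number of skeleton carbons are assumptions made for illustration. Terpenoids that have lost or gained carbons during biosynthesis would need structure-aware handling that this sketch does not attempt.

```python
# Carbon counts and class names exactly as listed in the text above.
TERPENOID_CLASSES = {
    5: "hemiterpenoid",
    10: "monoterpenoid",
    15: "sesquiterpenoid",
    20: "diterpenoid",
    25: "sesterterpenoid",
    30: "triterpenoid",
    40: "tetraterpenoid",
}


def classify_terpenoid(n_skeleton_carbons: int) -> str:
    """Map a skeleton carbon count (hypothetical input) to a terpenoid class."""
    if n_skeleton_carbons > 40:
        # The text groups everything above C40 as polyterpenoids.
        return "polyterpenoid"
    try:
        return TERPENOID_CLASSES[n_skeleton_carbons]
    except KeyError:
        raise ValueError(
            f"C{n_skeleton_carbons} is not one of the classes listed in the text"
        ) from None


if __name__ == "__main__":
    for n in (10, 15, 20, 30, 45):
        print(f"C{n}: {classify_terpenoid(n)}")
```

Counts that the text does not list (for example C35) raise an error rather than being guessed.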

Figure 1. Overview of the biogenesis and applications of terpenoids.

One of the bottlenecks in modern drug discovery is the limited chemical diversity available for drug screening.5-7 Natural products (NPs) are known as a major source for drug discovery owing to their enormous structural and physicochemical diversity.8, 9 To date, more than 60% of FDA-approved drugs are NPs or their derivatives.10 As one of the largest groups of NPs, terpenoids are implicated in defense or in the attraction of beneficial organisms for their producers,11 while they also play an important role in the
pharmaceutical field. For example, the plant-derived sesquiterpenoid artemisinin (Figure 1) and its semisynthetic derivatives are well known as antimalarial drugs.12 Another plant-derived diterpenoid, taxol (its precursor taxadiene is shown in Figure 1), is famous for its antitumor activity.13 Extracts of Ginkgo biloba leaves, which include a series of diterpenoids (ginkgolides A, B, C and J) and sesquiterpenoids, afford protection against some kinds of neural and vascular damage.14 In addition, terpenoids are used as solvents and ingredients in many food, cosmetic and industrial products.15

Along with the exponential increase in the number of natural and synthetic molecules, cheminformatics and bioinformatics approaches play a powerful role in drug discovery.16 Very recently, several cheminformatics analyses of NPs have been reported. Kirchmair et al.17 analyzed NPs from virtual and physical databases; their results showed that readily obtainable NPs are highly diverse and cover similar regions of chemical space that are highly relevant to drug discovery. The NCATS team18 assembled the Canvass library of 346 NPs and screened it against 50 cell-based or biochemical assays, providing the scientific community with a valuable resource for the biological evaluation of NPs. Sheng et al.19 contributed a comprehensive review of successful cases of structural simplification of NPs for drug discovery, a powerful drug design strategy because the complexity of NPs makes them difficult to synthesize and often leads to unfavorable pharmacokinetic profiles. Hou et al.20 explored the difference in chemical space between NPs of terrestrial and marine origin and found that marine compounds have longer chains and larger rings than terrestrial NPs. Medina-Franco et al.21 carried out a cheminformatics analysis of the most recent version of NuBBEDB (a database of compounds from Brazilian biodiversity) and other collections of NPs.
They concluded that the diversity and complexity of NPs vary according to the origin of the compounds. Ertl et al.22 performed a systematic cheminformatics analysis of the functional groups of NPs; the revealed functional-group distribution has potential applications in medicinal synthesis and is helpful for the identification of biosynthetic gene clusters.

Regarding the cheminformatics analysis of terpenoids, Kikuchi et al.23 developed a terpenoid alkaloid-like compound library based on the humulene skeleton and evaluated its chemical diversity by PCA with 20 structural and physicochemical descriptors. Nuutinen et al.24 reviewed the medicinal properties of terpenoids found in Cannabis sativa and Humulus lupulus, which are rich in mono- and sesquiterpenes. Gao et al.25 explored the potential anti-inflammatory and immunomodulatory activity of 102 cassane diterpenoids by target prediction, molecular docking, molecular dynamics (MD) simulation and signaling pathway analysis, which provides a reference for structure-activity relationship studies of cassane diterpenoid-like compounds. Xiao et al.26 collected 665 anti-inflammatory NPs, in which flavonoids and triterpenoids were the major structural types, and elucidated the
biological targets, which can guide target-based and fragment-based anti-inflammatory drug development. The structural diversity of terpenoids has also attracted considerable attention. For instance, Jacobson et al.27 developed the first computational approach for the enumeration of terpenoid carbocations from monoterpenoid synthases. Hamberger et al.28 summarized the chemotaxonomic and enzyme activity data for diterpene synthases in the Lamiaceae and explored their sequence and function space by leveraging new transcriptomes, which revealed the relationship between diverse skeletons and synthases in the Lamiaceae.

Nevertheless, these cheminformatics analyses of terpenoids mostly focus on probing the physicochemical properties and structural landscape of specific terpenoids. To the best of our knowledge, a comprehensive cheminformatics analysis of the complete set of terpenoids has not yet been reported. Herein, over 60,000 unique terpenoid compounds were collected, and their drug-like properties, structural diversity and functional profiles were explored by combining a variety of cheminformatics and bioinformatics approaches. Finally, a comprehensive picture of the chemical and biological space of terpenoids is provided, which can guide further navigation of chemical diversity and validation of the biological activity of terpenoids.

2 Methods

2.1 Data collection and preparation

DNP is a well-recognized encyclopedic database of NPs. We were authorized to use the commercial DNP database29 to acquire the 77,317 records annotated as terpenoids. The two-dimensional structures of the terpenoids with available SMILES (simplified molecular-input line-entry system) strings were then generated with Open Babel.30 All molecules were standardized with Pipeline Pilot31 (version 8.0) by adding hydrogen atoms, stripping salts, keeping the largest fragment and removing duplicates based on canonical SMILES. Finally, after reassessment of all terpenoid records in DNP, we obtained a terpenoid NPs dataset of 62,611 unique molecules with reliable structural information. For comparison, 138,694 unique non-terpenoid structures derived from DNP and 9532 unique drug structures derived from DrugBank32, 33 constituted the non-terpenoid NPs dataset and the drugs dataset, respectively.

2.2 Physicochemical property analysis

We estimated 11 physicochemical properties: the number of aromatic atoms (Aro_Atoms); the oil-water partition coefficient (AlogP); the fraction of Csp3 atoms (Fsp3); the fraction of rotatable bonds (FRBonds); the number of
hydrogen-bond acceptors (H_acc); the number of hydrogen-bond donors (H_don); the number of heavy atoms (Heavy_Atoms); the number of rings (nRings); the molecular weight (MW); the topological polar surface area (TPSA); and the number of chiral centers (Chiral). Except for the chiral centers, which were evaluated with the MOE software,34 all features were calculated with PaDEL-Descriptor,35 an open-source program for calculating molecular descriptors and fingerprints. All of these physicochemical properties were used for the subsequent principal component analysis and self-organizing maps.

2.3 Principal component analysis (PCA) and self-organizing maps (SOM)

PCA is a powerful dimensionality-reduction method that is commonly used to visualize large datasets. The multidimensional variables are linearly combined into a small number of principal components (PCs) that retain as much of the information in the original features as possible. Although PCA is widely used for molecular descriptors in the cheminformatics community,36 its main problem is that most common PC values fall in a limited region where the samples are overcrowded.37 To mitigate this limitation, SOM was also used to visualize the chemical space of terpenoid NPs with the open-source program DataWarrior.38 First, a two-dimensional network (map) with a certain number of units is initialized randomly. When a sample is presented to the network, the unit most similar to that sample is activated. The activated unit and its neighboring units are then adjusted in the direction of the sample. This iterative procedure is applied to all samples.
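The SOM procedure described above (random initialization, activation of the most similar unit, and adjustment of that unit and its neighbors toward the sample) can be illustrated with a short standard-library sketch. This is a generic textbook-style SOM, not the DataWarrior implementation used in this work. The function names, the Gaussian neighborhood, the linear decay schedule and all default parameters are assumptions chosen for illustration.

```python
import math
import random


def init_som(rows, cols, dim, seed=0):
    """Randomly initialise one weight vector per map unit (values in [0, 1))."""
    rng = random.Random(seed)
    return {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }


def best_matching_unit(weights, sample):
    """Return the grid position of the unit whose weights are closest to the sample."""
    return min(weights, key=lambda unit: math.dist(weights[unit], sample))


def som_update(weights, sample, lr, radius):
    """Activate the best-matching unit and pull it and its neighbours toward the sample."""
    bmu = best_matching_unit(weights, sample)
    for unit, w in weights.items():
        grid_dist = math.dist(unit, bmu)
        if grid_dist <= radius:
            # Gaussian neighbourhood: 1.0 at the activated unit, smaller further away.
            influence = math.exp(-(grid_dist ** 2) / (2.0 * radius ** 2))
            for i, x_i in enumerate(sample):
                w[i] += lr * influence * (x_i - w[i])
    return bmu


def train_som(samples, rows=4, cols=4, epochs=20, lr0=0.5, radius0=2.0, seed=0):
    """Present every sample once per epoch while shrinking the rate and radius."""
    weights = init_som(rows, cols, len(samples[0]), seed)
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        for sample in samples:
            som_update(weights, sample, lr0 * decay, max(radius0 * decay, 0.5))
    return weights
```

In practice the 11 descriptors would presumably be scaled to comparable ranges before training, since the unit-to-sample distance is otherwise dominated by large-valued properties such as MW.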
2.4 Scaffold analysis

Scaffold analysis, which extracts the core structural skeletons and captures the features common to large numbers of molecules, is widely used as a guide in the medicinal chemistry community.39 The Bemis-Murcko framework40 (Figure S1), in which the linkers and ring systems of a molecule are kept at the atomic level, including the element type, atomic hybridization and atomic charge, was employed with the RDKit41 package to analyze the scaffold characteristics of terpenoid NPs, especially for the statistical analysis of cyclic scaffolds. The graph-level Bemis-Murcko framework was not adopted because it retains only the connectivity of the atoms, treating each atom as a vertex and each bond as an edge, so too many detailed scaffold features of terpenoid NPs would be lost.

2.5 Target mapping and gene enrichment analysis

To explore the biological space of the collected terpenoid NPs, they were first matched with the drug molecules in DrugBank. Then, the ChEMBL database,42, 43
containing a large number of compounds with drug-like bioactivity, was used for target mapping via the InChIKeys of the terpenoid NPs. The MySQL version of ChEMBL was downloaded to a local machine, and an in-house Golang script was used to gather the bioactivity information of the terpenoids. Targets with activities against terpenoids with an IC50, EC50, Ki or Kd of 5) of tetraterpenoids (C40) and polyterpenoids as shown in Figure S3. Nevertheless, the AlogP values of most terpenoid NPs satisfy the “rule of five”. Therefore, terpenoid NPs show favorable drug-likeness, indicating that they can be one of the most productive sources of drug leads among natural products. It also indicates that further structural modification to enhance hydrophilicity (reduce the AlogP) is likely to be important in the drug discovery of many-carbon (>40) terpenoid-like inhibitors.
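As an illustration of the “rule of five” referred to above, the following minimal sketch counts violations of the commonly cited Lipinski thresholds (MW ≤ 500, logP ≤ 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors). The text names the rule but does not spell out how it was applied, so the thresholds, the tolerance of one violation and the two example descriptor sets are assumptions for illustration and are not values from the dataset. The field names follow the descriptor names of Section 2.2.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Subset of the Section 2.2 descriptors needed for the rule of five."""
    mw: float      # molecular weight (MW)
    alogp: float   # oil-water partition coefficient (AlogP)
    h_don: int     # hydrogen-bond donors (H_don)
    h_acc: int     # hydrogen-bond acceptors (H_acc)


def rule_of_five_violations(d: Descriptors) -> list:
    """Return the list of violated criteria (assumed standard thresholds)."""
    violations = []
    if d.mw > 500:
        violations.append("MW > 500")
    if d.alogp > 5:
        violations.append("AlogP > 5")
    if d.h_don > 5:
        violations.append("H_don > 5")
    if d.h_acc > 10:
        violations.append("H_acc > 10")
    return violations


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """Accept a molecule with at most `max_violations` violated criteria."""
    return len(rule_of_five_violations(d)) <= max_violations


if __name__ == "__main__":
    # Hypothetical descriptor values, chosen only to exercise the function.
    small_oxygenated = Descriptors(mw=156.3, alogp=3.0, h_don=1, h_acc=1)
    long_hydrocarbon = Descriptors(mw=537.0, alogp=12.0, h_don=0, h_acc=0)
    print(passes_rule_of_five(small_oxygenated), rule_of_five_violations(small_oxygenated))
    print(passes_rule_of_five(long_hydrocarbon), rule_of_five_violations(long_hydrocarbon))
```

The second hypothetical molecule fails on both MW and AlogP, which mirrors the observation above that large, highly lipophilic terpenoids are the ones most likely to need hydrophilicity-enhancing modification.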

Figure 4. Drug-like property distributions.

Table 1. The scaffolds and ring systems of terpenoid NPs and non-terpenoid NPs (a)

Data set             Fring      Fring_3    Fring_4    Fbridge    Fspiro     Ns
Non-terpenoid NPs    89.987%    4.222%     0.779%     9.720%     8.828%     32,850
Terpenoid NPs        98.281%    15.015%    2.033%     19.826%    22.033%    20,953

(a) Fring: the fraction of compounds with rings; Fring_3: the fraction of compounds with three-membered rings; Fring_4: the fraction of compounds with four-membered rings; Fbridge: the fraction of cyclic compounds with bridgehead atoms; Fspiro: the fraction of cyclic compounds with spiro atoms; Ns: the number of scaffolds. The Cyclic System Retrieval (CSR) graphs were also depicted by using the Bemis-Murcko framework, as shown in Figure S4.

3.3 Structural features of terpenoid NPs

In natural products, ring systems occur frequently and are important for bioactivity.48 The occurrence rate of ring scaffolds in terpenoid NPs (~98%) is higher than that in non-terpenoid NPs (~90%), as shown in Table 1. Specifically, terpenoid NPs have more aliphatic rings and fewer aromatic rings, whereas non-terpenoid NPs show the opposite tendency (Figure 5a,b). In detail, as shown in Table 1, three- and four-membered rings occur more frequently, and the occurrence rate of bridged or spiro rings is also much higher, among cyclic terpenoid NPs than among non-terpenoid NPs, indicating the more complex scaffolds of cyclic terpenoid NPs. In addition, the fraction of Csp3 atoms (Fsp3) and the number of chiral centers (Chiral) were used as metrics to quantify molecular complexity, which is regarded as an important element in drug design.49 High molecular complexity implies high selectivity when a molecule acts on a target,50 suggesting fewer adverse reactions in clinical treatment. Figure 5c,d shows that terpenoid NPs also exhibit higher molecular complexity than non-terpenoid NPs, which further reflects their greater stereochemical complexity.
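To make the Fsp3 metric concrete, the sketch below uses the common definition of Fsp3 as the number of sp3-hybridized carbon atoms divided by the total number of carbon atoms. Whether PaDEL-Descriptor computes the value in exactly this way is an assumption. The hybridization counts in the examples were assigned by hand for three well-known molecules and are not taken from the dataset.

```python
def fsp3(n_sp3_carbons: int, n_carbons: int) -> float:
    """Fraction of sp3 carbons among all carbons (assumed definition of Fsp3)."""
    if n_carbons <= 0:
        raise ValueError("the molecule must contain at least one carbon atom")
    if not 0 <= n_sp3_carbons <= n_carbons:
        raise ValueError("sp3 carbon count must lie between 0 and the carbon count")
    return n_sp3_carbons / n_carbons


if __name__ == "__main__":
    # (sp3 carbons, total carbons), counted by hand from the structures.
    examples = {
        "menthol (saturated monoterpenoid alcohol)": (10, 10),
        "limonene (two C=C double bonds)": (6, 10),
        "benzene (fully aromatic)": (0, 6),
    }
    for name, (n_sp3, n_c) in examples.items():
        print(f"{name}: Fsp3 = {fsp3(n_sp3, n_c):.2f}")
```

Saturated aliphatic terpenoid skeletons sit near Fsp3 = 1, while aromatic compounds sit near 0, which is the contrast that the comparison in Figure 5c relies on.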

Figure 5. Distributions of structural properties of terpenoid NPs and non-terpenoid NPs: (a) number of aromatic rings, (b) number of aliphatic rings, (c) fraction of Csp3 atoms, (d) number of chiral centers. The distribution for terpenoid NPs is further confirmed by Figure S3.

Since ring-containing compounds are dominant in both terpenoid and non-terpenoid NPs, the Bemis-Murcko scaffolds were analyzed, and acyclic compounds were not considered here. As shown in Table 1, there are 20,953 distinct cyclic scaffolds in total in our terpenoid NPs dataset (62,611 compounds) and 32,850 cyclic scaffolds among the 138,694 compounds in our non-terpenoid NPs dataset. To some extent, the diversity of cyclic scaffolds is therefore higher for terpenoid NPs (~33%, 20,953/62,611) than for non-terpenoid NPs (~24%, 32,850/138,694). The top 50 most frequent cyclic scaffolds for each category of terpenoid NPs are shown in Figure 6, in which the structural landscape of terpenoid NPs is clearly visualized. The most common scaffolds are six-membered rings, which are essential building blocks for the biosynthesis of hormones and other substances important for life. As expected, simple monocyclic scaffolds are ubiquitous in all categories of terpenoid NPs. Moreover, each category of terpenoids has its dominant ring scaffolds. For example, triterpenoids are mostly represented by penta- and tetracyclic skeletons, and diterpenoids mainly by tetra- and tricyclic skeletons. In addition, seven bridged-ring and four three-membered-ring scaffolds were detected, while no spiro-ring or four-membered-ring scaffolds were found
in the top 50 scaffold list. Recalling the ring-scaffold populations shown in Table 1, this indicates that spiro rings are not concentrated in a few scaffolds but dispersed across various scaffolds.
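The scaffold-diversity ratios quoted in this section (number of distinct cyclic scaffolds divided by the number of compounds in the dataset) can be checked with a few lines of arithmetic. The sketch below is only a check of the quoted figures; the function name `scaffold_diversity` is an illustrative assumption, and the counts are those reported in Table 1 and Section 2.1.

```python
def scaffold_diversity(n_scaffolds: int, n_compounds: int) -> float:
    """Distinct scaffolds per compound, as used for the ~33% and ~24% figures."""
    if n_compounds <= 0:
        raise ValueError("the dataset must contain at least one compound")
    return n_scaffolds / n_compounds


# Counts reported in the text: (distinct cyclic scaffolds, compounds in dataset).
DATASETS = {
    "terpenoid NPs": (20_953, 62_611),
    "non-terpenoid NPs": (32_850, 138_694),
}

if __name__ == "__main__":
    for name, (n_scaffolds, n_compounds) in DATASETS.items():
        ratio = scaffold_diversity(n_scaffolds, n_compounds)
        print(f"{name}: {ratio:.1%}")
```

Note that the denominator is the full dataset size, as in the text, even though acyclic compounds do not contribute any scaffold.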

Figure 6. The top 50 scaffolds (ordered by occurrence frequency) of all terpenoid NPs: mono- (yellow), sesqui- (green), di- (blue), tri- (orange) and other terpenoids (purple). Bar height is relative to the category in which the scaffold occurs most frequently.

Figure 7. The top 5 dominant scaffolds in fungi, bacteria and animals, ordered by occurrence frequency. Scaffolds unique to a given group of organisms are numbered in red; the others are numbered in black.

Interestingly, there are several specifically enriched condensed rings for tri-, di- and sesquiterpenoids, respectively, reflecting the uniqueness of the fused rings for each kind of terpenoid. The common feature of these fused rings is their non-aromaticity, and aromatic rings are rarely found in terpenoids, as confirmed in Figure 5a,b and Figure S3. One representative condensed-ring structure is the pentacyclic triterpenoid, such
as the lupane, oleanane and ursane skeletons. These can be divided into 6-6-6-6-6 pentacycles and 6-6-6-6-5 pentacycles (the first and fourth in Figure 6). Hundreds of publications have highlighted the broad spectrum of biological activities of these pentacyclic triterpenoids, including antiangiogenic, anti-inflammatory and antioxidant effects.51, 52 The structural diversity of these compounds also comes from the different conformations (boat or chair) of the unsaturated six-membered rings.53 Besides the prevalent successively fused-ring and isolated single-ring scaffolds, there are four scaffolds (14th/21st/39th/50th in Figure 6) containing a linker between two ring parts, which also occur widely in nature.

Considering that more than 85% of terpenoid NPs are derived from plants, the aforementioned top 50 scaffolds cannot represent the distribution of the minor terpenoid NPs derived from bacteria, fungi and animals. To explore the differences between scaffolds produced by different organisms, we further identified the top 5 high-frequency scaffolds of terpenoid NPs in non-plant organisms (Figure 7). The dominant scaffold types in fungi-derived terpenoid NPs are similar to those in plant-derived terpenoid NPs, with only one unique scaffold (4th in Figure 7), which contains a four-membered ring. In contrast, the dominant scaffolds derived from bacteria and animals are significantly different from those in plants. Apart from the first dominant scaffold for fungi (Figure 7), the remaining 14 dominant scaffolds are not found in the top 50 list for all terpenoid NPs (Figure 6), indicating that plant-derived terpenoids account for most of the scaffold diversity of all terpenoid NPs. This may be because promiscuous terpenoid-related enzymes are widespread in plants, whereas high-fidelity enzymes are found in animals and bacteria.1, 54-57
Moreover, the larger number of unique scaffolds in animals and the smaller number in fungi are consistent with the evolutionary relationship, in which fungi are nearest to plants and animals are furthest from plants. Interestingly, the epoxide ring is prevalent in the top 5 high-frequency scaffolds of animal-derived terpenoid NPs; for example, the oxidized γ-lactone-bearing bicyclo[8.4.0] rings (11th/13th/14th in Figure 7) exist exclusively in animals. These briarane-type diterpenoids derived from marine coelenterate corals can serve as chemotaxonomic markers for the orders Gorgonacea and Alcyonacea,58 and some of them have also been shown to exhibit anti-inflammatory activity.59 It is noteworthy that no spiro or four-membered rings were found in the top 50 dominant scaffolds in plants, while spiro scaffolds occur frequently in fungi and animals. The unique N-containing dominant scaffolds in bacteria likely originate from the participation of many bacteria in biological nitrogen fixation.

3.4 Drugs and targets mapping of terpenoid NPs

The InChIKeys of the terpenoid NPs were matched with DrugBank, and 65 identical compounds were found (highlighted in bold in Table S1). They are widely used in the
treatment of various diseases, such as neurologic disorders, cardiovascular disease and cancer. In fact, there are at least 212 terpenoid-like drugs in DrugBank, as summarized in Table S1. On the one hand, this discrepancy can be attributed to the different definitions of “terpenoid” in DrugBank and DNP. Terpenoids show remarkable structural diversity and mainly exist in nature in the form of glycosides and alkaloids, which are classified as terpenoids in some databases and publications but not in others. On the other hand, structural information is lacking for some entries in DrugBank, such as paclitaxel poliglumex, AI-850 and some mixtures such as pyrethrum extract, which prevents structural comparison. Besides, DrugBank contains synthetic molecules and biotech (protein/peptide) drugs, which were not considered in this work. For example, E710760 is a semisynthetic diterpene lactone derivative of the natural product pladienolide B, which was originally isolated from Streptomyces platensis. Therefore, there is no doubt that quite a few terpenoid-derived scaffolds exist in DrugBank.

To confirm the activities of terpenoid NPs, we also compared our dataset with another database, ChEMBL. A total of 2937 terpenoid-like active compounds that are also deposited in our terpenoid NPs dataset target 803 proteins in ChEMBL. Specifically, the compounds undergoing clinical trials were picked out, and 9 of them are approved drugs (Table 2). And the compounds targeting human proteins with an IC50, EC50, Ki, or Kd of