MetaQSAR: An Integrated Database Engine to Manage and Analyze


Article

MetaQSAR: An Integrated Database Engine to Manage and Analyze Metabolic Data
Alessandro Pedretti, Angelica Mazzolari, Giulio Vistoli, and Bernard Testa
J. Med. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.jmedchem.7b01473 • Publication Date (Web): 15 Dec 2017




Journal of Medicinal Chemistry

MetaQSAR: An Integrated Database Engine to Manage and Analyze Metabolic Data Alessandro Pedretti1*, Angelica Mazzolari1, Giulio Vistoli1 and Bernard Testa2

1) Dipartimento di Scienze Farmaceutiche “Pietro Pratesi”, Facoltà di Farmacia, Università degli Studi di Milano, Via Luigi Mangiagalli, 25, I-20133 Milano, Italy

2) Emeritus Professor, University of Lausanne, Switzerland

Keywords: Metabolism, QSAR, Database, Data analysis.


Abstract

The study describes the MetaQSAR tool, a new database engine specifically tailored to collect and analyze metabolic data. It is a plug-in embedded in the VEGA suite of programs (freely downloadable at www.vegazz.net) and takes advantage of all cheminformatics features implemented in the software, with additional tools for statistical analyses, similarity searches and physicochemical profiling of the stored molecules. MetaQSAR also implements a novel metabolism classification, which groups the metabolic reactions into 101 classes and can find numerous applications in metabolic analyses. The potential of MetaQSAR is assessed here by using it to store and analyze an extended database focused on the metabolism of xenobiotics, which was collected by a manually curated meta-analysis of the recent literature. The database includes 1890 substrates taken from about 1500 original papers published in the years 2004-2015. The database was used in both physicochemical analyses and similarity searches, thus demonstrating the notable potential of MetaQSAR, which can find particularly fruitful applications in developing targeted predictive approaches.


1. Introduction

The recent and steady progress in omics disciplines, and in related analytical techniques, has strongly increased the amount of scientific data routinely available to researchers1. This growth has involved both biological data (coming for example from HTS campaigns, genomic or proteomic studies) and chemical data (coming from combinatorial synthesis, compound collections and physicochemical profiling)2. It has fostered the development of computational methods able to store and analyze a tremendous amount of scientific data, thus laying the grounds for the emerging bioinformatics and cheminformatics fields3.

While involving different objects and pursuing different objectives, these two fields similarly imply the development of in silico tools able to collect and analyze scientific databases, often combined with efficient procedures to automatically retrieve scientific data from online resources or scientific publications. Moreover, cheminformatics and bioinformatics approaches can be synergistically related since they investigate the same processes from molecular or biological standpoints, respectively4. The study of drug metabolism offers a particularly fertile arena where bioinformatics and cheminformatics approaches can feed each other, since metabolism combines the cheminformatics profiling of substrates and metabolites with the bioinformatics analyses of the involved biochemical pathways and proteins5. Furthermore, databases for metabolic data can also take advantage of the approaches adopted to manage databases of synthetic reactions, since all these databases primarily involve an efficient collection of chemical reactions6.

The aforementioned progress has resulted in a very rich arsenal of databases and web-based resources focused on metabolic data and biochemical pathways, as recently reviewed7, 8. However, the automated acquisition of scientific data from the various available resources may limit the critical curation and the accuracy of the collected data. The resulting databases are suitable for qualitative analyses and comparisons but may prove ineffective in quantitative predictions, where the inaccuracy of only a few data can impair the overall predictive power of the resulting models9.


This concern is particularly felt in metabolism prediction, which does not necessarily require huge databases but is strongly influenced by the accuracy of the experimental metabolic data on which it is based. Finally, the existing resources focused on human metabolism are primarily bioinformatics-oriented and not tailored to drug metabolism, since they are aimed mainly at predicting biochemical pathways, a task usually performed in metabolomics analyses. In contrast, only a limited number of cheminformatics-oriented applications have been proposed, even though they could find fruitful applications in metabolism prediction10.

On these grounds, the present study describes a novel application (MetaQSAR) embedded in the VEGA suite of programs11 and specifically tailored to generate and manage metabolic databases. MetaQSAR has been developed as a plug-in of VEGA ZZ and can be used to collect, classify and analyze metabolic reactions in an effective way, while representing a versatile engine to develop in-house targeted applications for metabolism analysis and prediction. In detail, MetaQSAR is a relational database with a management system that connects the database to the graphical user interface (GUI) for input, data analysis and delivery. Notably, MetaQSAR includes a novel classification of metabolic reactions, which subdivides them into 28 classes and 101 subclasses, thus allowing an efficient clustering of the stored data.

To assess its potential, MetaQSAR was used to store and analyze an extended version of the metabolic database already collected for our previous study12. Besides an updated evaluation of the relative importance of biotransformation reactions in the metabolism of xenobiotics, the analysis reported here exploits the cheminformatics features implemented in VEGA ZZ to profile the physicochemical properties of the included molecules as well as to perform similarity analyses. The results emphasize that cheminformatics methods applied to metabolism databases can provide meaningful results with noteworthy potential in developing similarity-based predictive approaches.


2. Implementation

2.1 MetaQSAR Database

Figure 1. Structure of the implemented metabolism database with the logical links between the different data types which are organized in Tables here represented as boxes colored according to the data type.

2.1.1 Database Structure

The database structure is shown in Figure 1, in which the different data types are organized in tables (shown as boxes) colored according to the type of information they contain. The relations between the tables, depicted by red arrows, indicate the logical links between rows of different tables. The pivotal table is Reactions (sandy colored), in which each metabolic reaction is characterized by the following data: the substrate, the substrate atom(s) involved in the reaction, the metabolite formed, the reaction class, the enzyme(s) catalyzing the reaction, and the generation of the metabolite along with its toxicity and/or reactivity (these last two features being encoded by simple Boolean descriptors). It should be remembered that a metabolite is classified as first generation if it is formed directly from the xenobiotic, as second generation if it comes from a first-generation metabolite, and so on.

The substrates and products of the reactions are saved in the Molecules table (light blue box), which includes the 1D and 2D structures as well as several molecular descriptors and fingerprints that can be used for similarity searches and physicochemical analyses. The corresponding 3D structures are stored in a separate table (Structures) to facilitate their analysis. Notice that the Molecules and Structures tables do not discriminate between substrates and metabolites, since a given molecule can be the product of one enzymatic reaction and the substrate of a subsequent reaction, so that a unique classification would be impossible. Each object in the Molecules table is linked to its bibliographic data (gray boxes). Moreover, a compact and human-readable paper code was implemented to simplify the management of the reviewed papers; users can define their own paper code provided it does not exceed 16 characters.

Sixteen enzyme classes (Table 1) have been defined in the Enzymes table (yellow box), which is linked to the Reactions table to specify the class of the major and possibly alternative enzymes involved in the reaction. The reaction class is probably the most informative property that can be included when a new metabolic reaction is entered into the database. Indeed, MetaQSAR includes an enhanced classification, the scientific rationale of which is explained in the Results and which subdivides the metabolic reactions into 101 classes. As compiled in Table 2, these 101 reaction classes are organized in a framework composed of three levels corresponding to (1) main classes (compiled in the ReaMain table), (2) classes (in the ReaClasses table), and (3) sub-classes (in the ReaSubClasses table). This classification method allows searches at different levels of detail. For example, one may retrieve all redox reactions (main class level), only the Csp3 oxidations (class level), or even only the hydroxylation reactions of unactivated Csp3 atoms (sub-class level).
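The three-level search described above can be sketched with a small relational schema. This is only an illustration and not the MetaQSAR schema: the table names ReaMain, ReaClasses, ReaSubClasses, Molecules and Reactions come from the text, whereas every column name, the helper functions `build_demo_db` and `count_reactions`, and the few sample rows are assumptions introduced here.

```python
import sqlite3

# Table names follow the paper; all column names are hypothetical.
SCHEMA = """
CREATE TABLE ReaMain       (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE ReaClasses    (id INTEGER PRIMARY KEY,
                            main_id INTEGER REFERENCES ReaMain(id), name TEXT);
CREATE TABLE ReaSubClasses (id INTEGER PRIMARY KEY,
                            class_id INTEGER REFERENCES ReaClasses(id), name TEXT);
CREATE TABLE Molecules     (id INTEGER PRIMARY KEY, smiles TEXT);
CREATE TABLE Reactions (
    id            INTEGER PRIMARY KEY,
    substrate_id  INTEGER REFERENCES Molecules(id),
    metabolite_id INTEGER REFERENCES Molecules(id),
    subclass_id   INTEGER REFERENCES ReaSubClasses(id),
    generation    INTEGER,
    toxic         INTEGER,  -- Boolean flag stored as 0/1
    reactive      INTEGER   -- Boolean flag stored as 0/1
);
"""


def build_demo_db():
    """Create an in-memory database filled with placeholder rows."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.executemany("INSERT INTO ReaMain VALUES (?, ?)",
                    [(1, "Redox"), (2, "Hydrolysis")])
    con.executemany("INSERT INTO ReaClasses VALUES (?, ?, ?)",
                    [(1, 1, "Csp3 oxidation"),
                     (2, 1, "Csp2/Csp oxidation"),
                     (3, 2, "Ester hydrolysis")])
    con.executemany("INSERT INTO ReaSubClasses VALUES (?, ?, ?)",
                    [(1, 1, "Hydroxylation of unactivated Csp3"),
                     (2, 1, "Hydroxylation of Csp3 carrying a heteroatom"),
                     (3, 2, "Oxidation of aryl compounds"),
                     (4, 3, "Placeholder ester subclass")])
    # Structures are left empty: only the links matter for this sketch.
    con.executemany("INSERT INTO Molecules VALUES (?, ?)",
                    [(i, None) for i in range(1, 8)])
    con.executemany("INSERT INTO Reactions VALUES (?, ?, ?, ?, ?, ?, ?)",
                    [(1, 1, 2, 1, 1, 0, 0),
                     (2, 1, 3, 2, 1, 0, 0),
                     (3, 4, 5, 3, 1, 0, 1),
                     (4, 6, 7, 4, 1, 0, 0)])
    return con


def count_reactions(con, level, name):
    """Count reactions at the 'main', 'class' or 'subclass' level."""
    column = {"main": "m.name", "class": "c.name", "subclass": "s.name"}[level]
    sql = f"""
        SELECT COUNT(*) FROM Reactions r
        JOIN ReaSubClasses s ON r.subclass_id = s.id
        JOIN ReaClasses    c ON s.class_id = c.id
        JOIN ReaMain       m ON c.main_id = m.id
        WHERE {column} = ?
    """
    return con.execute(sql, (name,)).fetchone()[0]


con = build_demo_db()
redox_total = count_reactions(con, "main", "Redox")
csp3_total = count_reactions(con, "class", "Csp3 oxidation")
unactivated_total = count_reactions(
    con, "subclass", "Hydroxylation of unactivated Csp3")
```

Storing only the sub-class key in each reaction row and resolving the upper levels by joins keeps the hierarchy in one place, which is one plausible way to support the same query at all three levels of detail.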


Figure 2. Graphical User Interface of MetaQSAR (top right window), which is organized in tabs through which the user can perform the main tasks. When a substrate is selected, its 3D structure is displayed in the main VEGA ZZ window (top left window), allowing an easy selection of the atoms involved in a given reaction. Notice that these reactive atoms remain highlighted by a blue sphere. The bottom windows show examples of molecular searches that can be carried out using the MetaQSAR interface, either by structure (bottom left window) or by properties (bottom right window).


2.1.2 MetaQSAR Graphical User Interface

The graphical user interface (GUI) of MetaQSAR was developed in C++ as a plug-in of the VEGA ZZ program and is completely integrated in its graphic environment (Figure 2). The interface is organized in five tabs in which the user can edit the metabolic reactions (Reactions tab), manage the bibliographic data (Papers and Journals tabs), search for reactions by similarity or molecular properties (Search tab) and analyze the data (Statistics tab). For interested readers, a detailed user's guide can be found at http://nova.disfarm.unimi.it/manual/plugins/metaqsar.htm.

In the Reactions tab, data entry is substrate-oriented: when a new reaction is added, the substrate selection is the first step, and thus a new substrate has to be added to the Molecules table beforehand. More advanced search modes are also available to find substrates with missing or incomplete metabolic data. Next, the metabolic reaction has to be defined using a graphical tree in which the reactions are grouped into different classes and which permits an easy selection. When a substrate is selected, the corresponding 3D structure is displayed in the VEGA ZZ main window and, for each reaction, the user can indicate the target atom(s) by clicking on them in the 3D view. When the reactive atoms are not well defined or are uncertain, a confidence level can be attached to the data. For each added reaction, the user can finally define the generation to which a given metabolite belongs; the enzymes involved and the toxicity or reactivity can also be specified. In the Papers tab, the user can manage the bibliographic data by adding new papers, editing them and linking substrates and products to the source in which they are cited.

As exemplified in Figure 2, the MetaQSAR GUI includes some tools to retrieve and analyze the metabolic data stored in the database. In particular, in the Search tab substrates and products (i.e., Molecules) can be searched by structural similarity with a query molecule or by molecular properties. For searches by structure, the SMILES string of the query molecule is required. The Bingo similarity fingerprints of the substrates, as implemented by the Indigo toolkit (http://lifescience.opensource.epam.com/indigo/), are compared to that of the query molecule and evaluated by a Tanimoto similarity index13. If the index so obtained is greater than a cut-off threshold specified by the user, the substrate and its metabolic reactions are shown as a tree report in the bottom box of the window, and a preview of the corresponding 2D structure is shown by selecting a row of the tree. The search by molecular properties can be performed by specifying a set of criteria which are related to each other by logical operators (AND, OR). In detail, each criterion consists of a molecular descriptor, a mathematical operator and a value (e.g., all substrates with logP > 5; all substrates with logP > 5 AND PSA > 80 Å², and so on). In this case too, the search results are shown in the window as a tree, without the similarity values.

Several comprehensive statistics can be shown in the Statistics tab, which includes four sections: main data, metabolic reactions, enzymes involved and reaction counts. The main data section includes generic statistics such as the number of substrates, metabolic reactions or enzyme classes. The metabolic reactions section shows the counts of the reactions according to metabolic generation (1, 2 and 3+) and of the reactions which yield toxic and/or reactive products. The enzyme and reaction sections report the statistics for each class (or subclass) of enzymes or metabolic reactions, reporting the number and the percentage of metabolic reactions belonging to each group (or subgroup). Similarly, bibliographic sections count the reviewed papers as classified by journal, publisher or country.
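The two search modes described above can be sketched in a few lines. The sketch makes several assumptions: fingerprints are represented as plain Python sets of on-bit positions rather than Indigo/Bingo fingerprints, the function names `tanimoto`, `similarity_search` and `matches` are hypothetical, and the molecule names and descriptor values are invented placeholders. It is meant to show the logic of a thresholded Tanimoto search and of AND/OR property criteria, not the code used in MetaQSAR.

```python
import operator


def tanimoto(fp_a, fp_b):
    """Tanimoto index of two fingerprints given as sets of on-bit positions."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def similarity_search(query_fp, library, cutoff):
    """Return (name, score) pairs with score above the cutoff, best first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [hit for hit in scored if hit[1] > cutoff]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Each criterion is (descriptor, operator symbol, value), as in "logP > 5".
OPS = {">": operator.gt, "<": operator.lt,
       ">=": operator.ge, "<=": operator.le, "=": operator.eq}


def matches(props, criteria, combine="AND"):
    """Evaluate the criteria on one molecule and combine them with AND or OR."""
    results = [OPS[op](props[desc], value) for desc, op, value in criteria]
    return all(results) if combine == "AND" else any(results)


# Placeholder data: bit positions and descriptor values are made up.
library = {"mol_a": {1, 2, 3}, "mol_b": {7, 8}}
hits = similarity_search({1, 2, 3, 4}, library, cutoff=0.5)

props = {"logP": 5.6, "PSA": 60.0}
criteria = [("logP", ">", 5), ("PSA", ">", 80)]
both = matches(props, criteria, "AND")
either = matches(props, criteria, "OR")
```

With these placeholder inputs only `mol_a` passes the cut-off, and the example molecule satisfies the OR combination but not the AND combination of the two criteria.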

2.2. Database collection

As mentioned in the Introduction, the MetaQSAR engine was used to store an extended version of the database already described in a previous study. The main objective was the collection of a rather large number (> 10³) of substrates and their metabolic reactions to be used in statistical analysis and as QSAR learning sets. The metabolic data were collected in a systematic paper-by-paper search of metabolic studies carried out in the primary literature (i.e., Chem. Res. Tox., Xenobiotica and Drug Metab. Dispos.) during the years 2004-2012 for the first two journals, and the years 2004-2015 for Drug Metab. Dispos. Great care was taken in collecting, selecting and screening literature metabolic data for entry into the database. Thus, the metabolites reported in each paper were examined critically for quality and adequacy according to a number of criteria which resemble those already applied in the previous study and can be briefly summarized as follows:

1. The focus was on drugs and other xenobiotics, excluding endogenous compounds except when these are used as drugs (e.g., estradiol).
2. The database was restricted to metabolic studies in humans and other mammals, carried out either in vivo, in cellular systems, or at a subcellular or enzymatic level.
3. All papers were screened for biochemical and analytical quality.
4. All reported metabolites were taken into account irrespective of their quantitative importance, provided their structural characterization was unambiguous.
5. The structure of a number of metabolites was not fully elucidated, in the sense that the regioselectivity of the reaction was left undetermined. Such metabolites were deemed unfit for inclusion in relevant learning sets.
6. The duplicate count of substrates and metabolites was avoided. For example, a given metabolite of a given substrate was counted only once when reported in several papers.
7. Regio- and stereoisomers were considered as distinct substrates (substrate selectivity) or metabolites (product selectivity).
8. When shown in the paper, or supported by existing knowledge and compatible with the biological conditions described in the paper, the formation of some metabolites was assigned to two enzymatic (super)families or groups of enzymes. This was the case, e.g., for the CYP- and FMO-catalyzed oxygenation of some types of N- and S-containing functional groups, or the CYP- and peroxidase-catalyzed formation of quinones from polyphenols.
9. Two-electron reactions such as [-CHOH → >C=O], [>C=O → -CHOH] and [-CHO → -COOH] were each counted as a single step. The same applied, e.g., to the reactions [>NH → >NOH], [-NHOH → -N=O], [-NO2 → -N=O], [-N=O → -NHOH] and [>NOH → >NH].
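Criteria 6 and 7 in the list above amount to a deduplication rule, which the following minimal sketch tries to make concrete. It assumes that each record is a (substrate, metabolite, paper code) triple and that molecules are identified by stereo-aware identifiers, so that isomers stay distinct; the function name `unique_reactions` and all identifiers in the example are hypothetical placeholders, and the manual curation described in the text is of course not reproduced.

```python
def unique_reactions(records):
    """Group records by (substrate, metabolite), collecting the paper codes.

    Each pair appears once in the result however many papers report it.
    Isomers remain separate entries as long as their identifiers differ.
    """
    merged = {}
    for substrate, metabolite, paper in records:
        merged.setdefault((substrate, metabolite), []).append(paper)
    return merged


# Placeholder records: identifiers and paper codes are invented.
records = [
    ("substrate_A", "metabolite_R", "PAPER_001"),
    ("substrate_A", "metabolite_R", "PAPER_002"),  # same pair, second paper
    ("substrate_A", "metabolite_S", "PAPER_002"),  # stereoisomer, kept apart
]
merged = unique_reactions(records)
counted = len(merged)
```

Keeping the list of paper codes for each pair preserves the link to every source while the pair itself is counted only once.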

3. Results and discussion

3.1. Classification into reaction types and enzymes

The first and major issue when setting up this database engine was to define a list of metabolic reactions that was as comprehensive and realistic as possible without being excessively long. Even though several classifications of enzymatic reactions have been proposed, mainly based on the involved enzymes (e.g., the E.C. system, http://www.sbcs.qmul.ac.uk/iubmb/enzyme/), the list proposed here is instead based on the catalyzed reactions, in order to reach a higher degree of classification in the field of xenobiotic metabolism. This list appears in Table 2 and is based on extensive works (books and chapters) authored or coauthored by one of the present authors14, 15, 16, 17. As is customary, metabolic reactions were classified into redox reactions catalyzed by oxidoreductases, hydration/dehydration reactions catalyzed by hydrolases or occurring spontaneously, and conjugations catalyzed by transferases or in a few cases by ligases. These three major categories were divided in turn into second-level categories to take into account the diversity of target sites in substrate molecules.

Carbon oxidations, being the most frequent reactions, were divided into Csp3- and Csp2/Csp-oxidations. These reactions are catalyzed mainly by cytochromes P450 [CYPs], while the molybdoflavoenzymes (xanthine oxidoreductase [XOR] and aldehyde oxidoreductase [AOR]) play a comparatively modest yet non-negligible role. Redox reactions involving carbonyl compounds as products or substrates (i.e., oxidation of alcohols and aldehydes, reduction of aldehydes and ketones) are catalyzed by dehydrogenases such as alcohol dehydrogenases [ADHs], aldehyde dehydrogenases [ALDHs] and aldo-keto reductases [AKRs]. A variety of less frequently encountered redox reactions, such as reductions at C=C moieties and reductive dehalogenations, are catalyzed by CYPs or various reductases. N-containing functional groups such as tertiary, secondary and primary amines are oxidized by CYPs and/or by flavin-containing monooxygenases [FMOs]. In contrast, the reduction of these groups involves CYPs or other reductases. As for N-containing functional groups, oxidation reactions at sulfur-containing groups are catalyzed by CYPs, and to a lesser extent also by FMOs. The role of peroxidases such as myeloperoxidase [MPO] in catalyzing the formation of reactive quinones and quinoneimines is of particular toxicological significance, given the role played by NADPH quinone reductases [NQOs] and glutathione (GSH, see below) in their detoxication.

The classification of hydrolytic reactions needs little explanation and involves, e.g., carboxylesterases and amidases. Note however that spontaneous reactions of hydration or dehydration were considered as a metabolic step.

The classification of transferases was as follows: methyltransferases such as catechol O-methyltransferase [COMT], N-methyltransferases [NMTs] and the thiol methyltransferases [TMTs]; sulfotransferases [SULTs]; UDP-glucuronosyltransferases [UGTs]; glutathione S-transferases [GSTs] plus all enzymes involved in the sequence of reactions leading to the corresponding mercapturic acids: peptidases, N-acetyltransferases [NATs], and C-S cleavage (β-lyase). Other conjugating enzymes are the fatty acyl-coenzyme A ligases [ACSs] and the enzymes catalyzing subsequent reactions (glycine N-acyltransferases [GLYAT] and glutamine N-acyltransferases) which form the corresponding amino acid conjugates, the various enzymes involved in β-oxidation, 2C-elongation or chiral inversion, and finally other transferases such as phosphotransferases.

Some conjugation reactions deserve an additional comment. Thus, the sequence from GSH conjugation to the formation of mercapturic acids and even to thiols (due to β-lyase catalyzed C-S cleavage) was considered as a single pathway since it proved futile to do otherwise. Some reactive metabolites or intermediates may react non-enzymatically with glutathione, a well-known nucleophile. Coupling with endogenous carbonyls to form hydrazones, and CO2 addition to primary and secondary amines to form carbamic acids, are also known metabolic reactions occurring non-enzymatically.

In the vast majority of reactions, there was no ambiguity in assigning the formation of a given metabolite to a given enzyme (super)family or category (e.g., to CYP, FMO or dehydrogenases/reductases), because either the enzyme (super)family was determined or no realistic alternative existed. However, a few cases (200 metabolites, i.e. < 3% of total metabolites) led to double enzymatic assignment based on experimental evidence.

3.2. Content of the MetaQSAR-based database

3.2.1 Overall metabolic statistics

As a preamble, it should be emphasized that the previous study12 considered all the metabolic reactions regardless of their generation, while the preliminary data reported here are focused only on reactions forming first-generation metabolites. As mentioned above, the screening process involved three primary journals as published in the years 2004-2015. In this way, 1455 papers were analyzed; they featured a total of 1890 substrates, yielding a total of 4776 metabolic reactions, namely 2.53 first-generation metabolites per substrate. The generation of the 2D/3D structures of the corresponding metabolites is still a work in progress, aimed at allowing the second-generation metabolites to be compiled. Currently, the database includes the structures of about 50% of the first-generation metabolites. Once this work is completed, the analysis of the second-generation metabolites can begin.

When considering the distribution of first-generation metabolites among the three main classes, the analyzed database was found to include 3261 (68.3%) products of redox reactions, 507 (10.6%) products of hydrolyses, and 994 (20.8%) conjugates. These relative abundances are in line with those previously reported for the first generation, thus confirming that functionalization reactions play a largely predominant role, which progressively decreases when considering metabolic reactions of the second and third generations, as revealed in the previous study.

Figure 3 schematizes the relative abundance of the 21 reaction classes, revealing that C-oxidations represent more than 50% of all monitored reactions, while redox reactions involving heteroatoms play a markedly minor role. Among the conjugation reactions, glucuronidations represent about 10% of all collected reactions and almost 40% when considering the conjugations only, thus emphasizing their very important role, which further increases when considering second- and third-generation metabolites. Finally, the hydrolyses of esters are understandably more abundant than those of amides, while spontaneous hydrolytic processes represent a modest minority.

Table 2 compiles the relative abundance as computed for the 101 reaction subclasses and reveals that the most abundant subclass is the oxidation of aryl compounds to epoxides, phenols or other metabolites, followed very closely by hydroxylation (or other oxidations) of Csp3 carrying a heteroatom. Hydrolyses and conjugations show a more homogeneous distribution among the monitored subclasses, the O-glucuronidation of phenols being the most populated group. The relative abundance of the reaction classes (and subclasses) is reflected in that of the involved enzymes as reported in Table 1, confirming the markedly prevalent role of cytochromes P450 followed by UDP-glucuronosyltransferases.
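The headline figures of this section follow directly from the quoted counts, as the short calculation below shows. Only numbers stated in the text are used, and the variable names are arbitrary. The three main-class counts as quoted sum to 4762, so the sketch also reports the remainder with respect to the 4776 reactions instead of assuming where it belongs.

```python
# Counts quoted in the text for first-generation metabolic reactions.
substrates = 1890
reactions = 4776
by_main_class = {"redox": 3261, "hydrolysis": 507, "conjugation": 994}

# Average number of first-generation metabolites per substrate.
per_substrate = round(reactions / substrates, 2)

# Share of each main class, as a percentage of all 4776 reactions.
shares = {name: round(100 * count / reactions, 1)
          for name, count in by_main_class.items()}

# Reactions not covered by the three quoted class counts.
remainder = reactions - sum(by_main_class.values())

print(per_substrate, shares, remainder)
```

The computed ratio and percentages reproduce the values given in the text (2.53, 68.3%, 10.6% and 20.8%).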


Figure 3. Distribution of metabolites according to reaction classes. The percentages shown in the Figure refer to 4776 first-generation metabolic reactions = 100%. The color code is as follows: redox reactions, blue; hydrolyses, yellow; conjugations, red. Alternating dark and light fields are used simply for graphical clarity. The reaction classes are reported in numerical order starting from the upper-right quadrant. For clarity, the following very poorly populated classes are not shown: 9) Redox reactions of other atoms (0.21%); 13) Epoxide hydration (0.04%); 28) Other conjugations (1.5%).


3.2.2 Physicochemical profiling of the collected substrates and metabolites: an overview As mentioned under Methods, a very relevant feature of the MetaQSAR tool is the possibility of combining the database engine with all cheminformatics features implemented in the VEGA ZZ program. With a view to offering a meaningful example of this possibility, all collected substrates and the up-to-now inserted metabolites were analyzed by calculating an extended set of conformerdependent and conformer-independent structural and physicochemical descriptors such as number of rotors, number of H-bonding groups, PSA, SAS, dipole moment and logP as computed by the MLP approach18. Figure 4 compares the average values of four representative descriptors as computed for substrates and metabolites and clustered into the three main reaction classes. The analysis was focused on two structural descriptors, namely molecular weight and number of rotors, which code for molecular size and flexibility, as well as two physicochemical descriptors, namely logP and PSA (polar surface area), which are related to polarity and H-bonding capacity. The results allow for some major considerations. First, the reported averages show clear differences between the three main reaction classes which, while considering the extent of standard deviations, suggest that, for example, (a) hydrolytic enzymes prefer larger and more flexible substrates, while (b) redox reactions preferentially involve more lipophilic molecules, and (c) conjugation reactions tend to metabolize less lipophilic and more rigid compounds. Second, the differences between substrates and metabolites show contrasting trends, the sole common trend being the lower lipophilicity of metabolites compared to substrates. 
Third, the standard deviations can be interpreted in two ways: (a) when compared between the main reaction classes, they are suggestive of enzyme promiscuity, while (b) when compared between substrates and metabolites, they encode the structural changes induced by a given reaction class. Hence, the reported standard deviations reveal that hydrolytic enzymes accommodate substrates with the widest structural heterogeneity, while redox and conjugation reactions show a comparatively narrower promiscuity. Moreover, (a) redox reactions do not show clear differences between substrates and metabolites, (b) hydrolytic reactions reduce the structural variability of their substrates, and (c) conjugations enlarge it. This last observation can be explained by considering that redox reactions usually induce small modifications in their substrates, hydrolytic reactions break the substrates, releasing two fragments (such as a drug and a promoiety in prodrugs), and conjugations add a novel portion (the endocon) to the substrate structures. Although clearly preliminary, these results reveal meaningful differences between reaction (and consequently enzyme) classes, which invite further investigation and might even have a predictive role.
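
The kind of comparison discussed above (averages and standard deviations of descriptors per main reaction class, separately for substrates and metabolites) can be illustrated with a minimal Python sketch. The record layout, the `profile` function and all numerical values are hypothetical placeholders chosen for illustration; they are not taken from the database, and the sketch does not use the VEGA ZZ interface.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical records: (main reaction class, role, molecular weight, logP).
# All values are illustrative placeholders, not data from the database.
RECORDS = [
    ("redox", "substrate", 300.0, 3.0),
    ("redox", "substrate", 320.0, 4.0),
    ("redox", "metabolite", 316.0, 2.5),
    ("redox", "metabolite", 336.0, 3.5),
    ("hydrolysis", "substrate", 450.0, 2.0),
    ("hydrolysis", "substrate", 550.0, 3.0),
]


def profile(records):
    """Return {(reaction class, role): [(mean, sd), ...]}, one pair per descriptor."""
    groups = defaultdict(list)
    for reaction_class, role, *descriptors in records:
        groups[(reaction_class, role)].append(descriptors)

    summary = {}
    for key, rows in groups.items():
        columns = zip(*rows)  # transpose: one tuple of values per descriptor
        summary[key] = [
            # The sample standard deviation needs at least two values.
            (mean(col), stdev(col) if len(col) > 1 else 0.0)
            for col in columns
        ]
    return summary


if __name__ == "__main__":
    for (reaction_class, role), stats in sorted(profile(RECORDS).items()):
        formatted = ", ".join(f"{m:.2f} ± {s:.2f}" for m, s in stats)
        print(f"{reaction_class:10s} {role:10s} {formatted}")
```

Comparing the standard deviations across classes, or between the substrate and metabolite groups of one class, mirrors the two readings of the standard deviations described in the text.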

Figure 4. Physicochemical profiling of substrates and metabolites, subdivided into the three main reaction classes and assessed by four representative descriptors, namely molecular weight (MW), number of rotors, polar surface area (PSA) and virtual logP as computed by the MLP approach.


3.2.3 Similarity searches: an exploratory study

Figure 5. Structures of the xenobiotics not included in the database and used in the exploratory predictive study, together with the atoms involved and the corresponding classes of the reported metabolic reactions (blue numbers indicate correctly predicted metabolic reactions; red numbers indicate unpredicted reactions).

As explained under Methods, the cheminformatics tools implemented in VEGA ZZ allow targeted searches by structure (and substructure) as well as by molecular properties. To offer a relevant example, a set of very recent papers dealing with the metabolism of molecules not included in the database (see Figure 5 and Table 4) was used in similarity analyses to investigate whether the results of such analyses can be used to indirectly predict the metabolism of these novel input molecules, which in effect constitute an external test set. Such an analysis can reveal what role simple molecular similarity can play in metabolism prediction, at least as a starting point for further analyses. Similarity-based predictions have the known problem of yielding a non-negligible proportion of false positive results, and usually must be combined with other predictive approaches to filter and prioritize the retrieved results (ref 19). While such filtering procedures may well remain necessary, one may hope that searches based on highly curated databases might limit the number of false positives, thus reducing the relevance of post-search filtering approaches. Clearly, the analysis is based on the assumption that the training set of metabolic data is rich enough to cover a fair proportion of the chemical space of the substrates. Despite its tentative nature, this analysis was performed by following a set of well-defined rules, which can be summarized as follows: (a) the analysis was limited to first-generation metabolites; (b) only compounds having a Tanimoto similarity, based on the Bingo fingerprints, greater than 0.5 were considered; (c) the analysis was extended to the reactions reported for the three most similar compounds. Figure 5 and Table 4 summarize the results and allow for some meaningful considerations. Of the ten molecules considered, only for one very complex molecule (Dasabuvir) was the search unable to find a similar molecule within the database, since the maximum Tanimoto similarity obtained (0.47) was below the assumed threshold. This indicates that the database adequately covers the chemical space of xenobiotics and suggests that the search criteria should be tuned according to the chemical complexity of the input molecule. Nevertheless, such a calibration analysis goes beyond the scope of this exploratory study, and Dasabuvir was therefore discarded here.
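
The search rules listed above (a similarity threshold of 0.5 and the three most similar database compounds) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: fingerprints are represented as plain sets of on-bit indices rather than actual Bingo fingerprints, and the function names, the toy database and the reaction labels are all hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def predict_reaction_classes(query_fp, database, threshold=0.5, top_n=3):
    """Collect the reactions reported for the top_n database compounds
    whose similarity to the query exceeds the threshold."""
    scored = [(tanimoto(query_fp, fp), name, reactions)
              for name, fp, reactions in database]
    hits = sorted((s for s in scored if s[0] > threshold),
                  key=lambda s: s[0], reverse=True)[:top_n]
    predicted = set()
    for _, _, reactions in hits:
        predicted.update(reactions)
    return hits, predicted


# Hypothetical toy database: (name, fingerprint bits, reported reaction labels).
TOY_DATABASE = [
    ("A", frozenset({1, 2, 3, 4, 5}), {"reaction_1"}),
    ("B", frozenset({1, 2, 3}), {"reaction_2"}),
    ("C", frozenset({1, 2, 7, 8}), {"reaction_3"}),
    ("D", frozenset({1, 2, 3, 4, 5, 6}), {"reaction_1", "reaction_4"}),
    ("E", frozenset({2, 3, 4, 9}), {"reaction_5"}),
]

if __name__ == "__main__":
    query = frozenset({1, 2, 3, 4})
    hits, predicted = predict_reaction_classes(query, TOY_DATABASE)
    for score, name, _ in hits:
        print(f"{name}: {score:.2f}")
    print(sorted(predicted))
```

When no database compound exceeds the threshold, as reported for Dasabuvir, the function returns an empty hit list and no prediction is made.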
Remarkably, the retrieved reactions are in agreement with 44 of the 51 first-generation metabolites reported for the 9 input molecules. Only 7 metabolites were not predictable; the false positives are few and exceed two in only two cases. This encouraging result emphasizes the crucial role of the number of retrieved compounds on which the analysis is focused, which should be carefully chosen to limit the number of false positives. Bearing in mind the exploratory nature of this analysis, its results reveal that metabolism prediction based solely on molecular similarity and on the present database possesses an encouraging predictive power, showing a sensitivity of 0.86, a precision of 0.76 and an F1 score of 0.80. When considering the classes of the unpredicted reactions, one may note that most of the missed predictions involve metabolic redox reactions and, in detail, aryl hydroxylations appear to be the most problematic reactions, with three unpredicted metabolites. While the relative abundance of missed redox metabolites might simply reflect their greater frequency in metabolism, the high number of unpredicted aryl hydroxylations may be due to the greater complexity of aromatic systems, which are captured with difficulty by similarity searches. It should be emphasized that such a similarity analysis predicts the reaction classes an input substrate can undergo, rather than structurally fully determined metabolites. Since several structurally compatible metabolites (e.g., regioisomers) may usually be generated by a predicted reaction class, such a similarity search should be combined with other computational approaches able to predict target sites of metabolism (soft spots), thus prioritizing the putative metabolites derived from similarity analyses.
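
The performance figures quoted above follow from the standard definitions of sensitivity, precision and F1 score, sketched below in Python. The function names are illustrative. Only the counts given in the text (44 predicted and 7 unpredicted metabolites) and the reported precision are used; the number of false positives is not restated in this section, so it is left as a function argument and no value is assumed for it.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of the reported metabolites that were retrieved."""
    return true_pos / (true_pos + false_neg)


def precision(true_pos, false_pos):
    """Fraction of the retrieved reactions that match a reported metabolite."""
    return true_pos / (true_pos + false_pos)


def f1_score(prec, sens):
    """Harmonic mean of precision and sensitivity."""
    return 2 * prec * sens / (prec + sens)


if __name__ == "__main__":
    # Counts quoted in the text: 44 of 51 metabolites predicted, 7 missed.
    sens = sensitivity(44, 7)
    print(f"sensitivity = {sens:.2f}")
    # Harmonic mean of the two rounded values reported in the text.
    print(f"F1 from rounded inputs = {f1_score(0.76, 0.86):.3f}")
```

Using the rounded inputs 0.76 and 0.86 gives an F1 of about 0.807, which agrees with the reported 0.80 within the rounding of the two inputs.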


4. Conclusions

This study describes a database engine specifically designed to collect and manage metabolism data, which can be freely downloaded at www.vegazz.net. The tool is embedded in the VEGA ZZ suite of programs and can thus take advantage of all the cheminformatics features implemented in the software. Along with an updated evaluation of the statistical relevance of the monitored reaction classes, the study includes two promising examples of the cheminformatics potential of the database engine described here. In the first example, the physicochemical profile of the collected molecules was investigated, revealing significant differences between the substrates and metabolites as grouped into the three main reaction classes. The second example emphasizes the encouraging role of similarity-based searches, which prove successful in predicting the metabolic reactions of a set of recently published molecules not yet included in the database. Finally, an exhaustive classification of the metabolic reactions is presented, which may find many applications in metabolism analysis.


Table 1. Enzyme classes stored in the Enzyme table and used for the reaction entries, together with the relative abundance of each class in the database collected here.

ID | Enzyme class | %
1 | Cytochromes P450 | 55.38
2 | Dehydrogenases | 2.29
3 | FMO | 4.19
4 | XO, AO | 2.17
5 | Peroxidases | 2.39
6 | Other reductases | 2.78
7 | Other oxidoreductases or autooxidations | 0.59
8 | Hydrolases | 9.16
9 | UDP-Glucuronosyltransferases | 10.14
10 | Sulfotransferases | 2.13
11 | Glutathione S-transferases & subsequent enzymes/reactions | 2.37
12 | Acetyltransferases | 0.94
13 | Acyl-CoA ligases & subsequent enzymes | 1.23
14 | Methyltransferases | 1.41
15 | Other transferases or non-enzymatic conjugations | 1.82
16 | Non-enzymatic hydrolyses or (de)hydrations | 1.02
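
As a quick consistency check on Table 1, the following Python sketch sums the reported percentages and aggregates them by main reaction type. The grouping of enzyme class IDs into redox (1 to 7), hydrolysis (8 and 16) and conjugation (9 to 15) is an assumption inferred here from the class names, not something stated in the table, and the variable names are illustrative.

```python
# Percentages copied from Table 1, keyed by enzyme class ID.
ENZYME_CLASS_PERCENT = {
    1: 55.38, 2: 2.29, 3: 4.19, 4: 2.17, 5: 2.39, 6: 2.78, 7: 0.59,
    8: 9.16, 9: 10.14, 10: 2.13, 11: 2.37, 12: 0.94, 13: 1.23,
    14: 1.41, 15: 1.82, 16: 1.02,
}

# ASSUMPTION: grouping inferred from the class names, for illustration only.
ASSUMED_GROUPS = {
    "redox": range(1, 8),
    "hydrolysis": (8, 16),
    "conjugation": range(9, 16),
}


def group_totals(percentages, groups):
    """Sum the class percentages within each group of IDs."""
    return {name: sum(percentages[i] for i in ids) for name, ids in groups.items()}


if __name__ == "__main__":
    total = sum(ENZYME_CLASS_PERCENT.values())
    print(f"total = {total:.2f}")  # close to 100, apart from rounding
    for name, value in group_totals(ENZYME_CLASS_PERCENT, ASSUMED_GROUPS).items():
        print(f"{name}: {value:.2f}")
```

The percentages add up to 100 within rounding, which supports the row-by-row pairing of class names and values in the table.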


Table 2. Classification of the metabolic reactions according to the main class/class/subclass scheme, together with the relative abundance of each subclass in the database collected here.

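Count | Main class | Class code | Class description | Subclass code | Subclass description | %
1 | Redox | 01 | Oxidation of Csp3 | 01 | Hydroxylation (or other oxidations) of isolated Csp3 | 7.04
2 | Redox | 01 | Oxidation of Csp3 | 02 | Hydroxylation (or other oxidations) of C alpha to an unsaturated system (>C=C, >C=O, -C≡N, aryl) | 6.89
3 | Redox | 01 | Oxidation of Csp3 | 03 | Hydroxylation (or other oxidations) of Csp3 carrying a heteroatom (N, O, S, halo) (including subsequent dealkylation, deamination or dehalogenation) | 16.35
4 | Redox | 01 | Oxidation of Csp3 | 04 | Dehydrogenation of >CHCH< to >C=C[…] | 1.97
5 | Redox | 01 | Oxidation of Csp3 | 05 | […]CH-N< to >C=N- (incl. >C=N+[…] | 0.21
6 | Redox | 02 | Oxidation of Csp2 & Csp | 01 | […] | 16.54
7 | Redox | 02 | Oxidation of Csp2 & Csp | 02 | […] | 2.37
8 | Redox | 02 | Oxidation of Csp2 & Csp | 03 | […]C=C< bonds to epoxides or other metabolites | 1.24
9 | Redox | 02 | Oxidation of Csp2 & Csp | 04 | Oxygenation of -C-C≡C-H and -C-C≡C-C- bonds | 0.17
10 | Redox | 03 | -CHOH ↔ >C=O → -COOH | 01 | Dehydrogenation of -CH2OH groups to -CHO and of >CHOH to >C=O | 0.94
11 | Redox | 03 | -CHOH ↔ >C=O → -COOH | 02 | Hydrogenation of -CHO to -CH2OH and of >C=O to >CHOH | 1.86
12 | Redox | 03 | -CHOH ↔ >C=O → -COOH | 03 | Oxidation of -CHO to -COOH | 0.15
13 | Redox | 03 | -CHOH ↔ >C=O → -COOH | 04 | Dehydrogenation of dihydrodiols to orthodiphenols | 0
14 | Redox | 04 | Various redox reactions of carbon atoms | 01 | Oxidative decarboxylation | 0.04
15 | Redox | 04 | Various redox reactions of carbon atoms | 02 | Reductive dehalogenations | 0.13
16 | Redox | 04 | Various redox reactions of carbon atoms | 03 | Reduction of arene and alkene epoxides | 0
17 | Redox | 04 | Various redox reactions of carbon atoms | 04 | Other C reductions, e.g. of >C=C< to -CH2-CH2- | 0.75
18 | Redox | 05 | Redox reactions of R3N | 01 | Oxidation of tertiary alkylamines and heterocyclic amines to N-oxides or other metabolites | 1.4
19 | Redox | 05 | Redox reactions of R3N | 02 | Oxidation of tertiary arylamines, azarenes and azo compounds to N-oxides or other metabolites | 1.05
20 | Redox | 05 | Redox reactions of R3N | 03 | Reduction of N-oxides | 0.21
21 | Redox | 06 | Oxidation of >NH, >NOH and -N=O // Reduction of -NO2, -N=O, >NOH, etc. | 01 | Hydroxylation of amines to hydroxylamines or intermediates | 1.42
22 | Redox | 06 | Oxidation of >NH, >NOH and -N=O // Reduction of -NO2, -N=O, >NOH, etc. | 02 | Hydroxylation of amides to hydroxylamides | 0.21
23 | Redox | 06 | Oxidation of >NH, >NOH and -N=O // Reduction of -NO2, -N=O, >NOH, etc. | 03 | Oxidation of primary hydroxylamines to nitroso compounds or oximes (incl. spontaneous dismutation), then to nitro compounds | 0
24 | Redox | 06 | Oxidation of >NH, >NOH and -N=O // Reduction of -NO2, -N=O, >NOH, etc. | 04 | Reduction of hydroxylamines and hydroxylamides (incl. spontaneous dismutation) | 0.27
25 | Redox | 06 | Oxidation of >NH, >NOH and -N=O // Reduction of -NO2, -N=O, >NOH, etc. | 05 | Reduction of nitroso compounds and oximes to hydroxylamines | 0.29
26 | Redox | 06 | Oxidation of >NH, >NOH and -N=O // Reduction of -NO2, -N=O, >NOH, etc. | 06 | Reduction of nitro compounds to nitroso compounds | 0.8
27 | Redox | 06 | Oxidation of >NH, >NOH and -N=O // Reduction of -NO2, -N=O, >NOH, etc. | 07 | Other N-oxidations (1,4-dihydropyridines, etc.) | 0.19
28 | Redox | 06 | Oxidation of >NH, >NOH and -N=O // Reduction of -NO2, -N=O, >NOH, etc. | 08 | Other N-reductions (e.g. azo compounds to hydrazines, hydrazines to amines, reductive N-rings opening) | 0.27
29 | Redox | 07 | Oxidation to quinones or analogs // Reduction of quinones and analogs | 01 | Oxidation of diphenols to quinones | 0.31
30 | Redox | 07 | Oxidation to quinones or analogs // Reduction of quinones and analogs | 02 | Oxidation of amino- and amido-phenols to quinoneimines or quinoneimides, resp. | 0.23
31 | Redox | 07 | Oxidation to quinones or analogs // Reduction of quinones and analogs | 03 | Oxidation of cresols and analogs to quinonemethides | 0.46
32 | Redox | 07 | Oxidation to quinones or analogs // Reduction of quinones and analogs | 04 | Other oxidations of phenols and amines (dimerization, quinone-like metabolites, etc.) | 1.19
33 | Redox | 07 | Oxidation to quinones or analogs // Reduction of quinones and analogs | 05 | Reduction of quinones and analogs | 0.25
34 | Redox | 08 | Oxidation and reduction of S atoms | 01 | Oxidation of thiols to sulfenic acids or disulfides | 0.19
35 | Redox | 08 | Oxidation and reduction of S atoms | 02 | Oxygenation of sulfenic acids to sulfinic acids, and of sulfinic acids to sulfonic acids | 0
36 | Redox | 08 | Oxidation and reduction of S atoms | 03 | Oxygenation of sulfides to sulfoxides, and of sulfoxides to sulfones | 1.97
37 | Redox | 08 | Oxidation and reduction of S atoms | 04 | Oxygenation of thiones (>C=S) or thioamides to sulfines, and of sulfines to sulfenes | 0.34
38 | Redox | 08 | Oxidation and reduction of S atoms | 05 | Oxidative desulfurations of >C=S to ketones, and of -P=S to -P=O groups |
39 | Redox | 08 | Oxidation and reduction of S atoms | 06 | S-Oxygenations of disulfides, thiosulfinates (-SO-S-), alpha-disulfoxides (-SO-SO-) and thiosulfonates (-SO2-S-) |
40 | Redox | 08 | Oxidation and reduction of S atoms | 07 | Reduction of disulfides to thiols |
41 | Redox | 08 | Oxidation and reduction of S atoms | 08 | Reduction of sulfoxides to sulfides |
42 | Redox | 08 | Oxidation and reduction of S atoms | 09 | Other S-reductions |
43 | Redox | 09 | Redox reactions of other atoms | 01 | Oxidation of silicon, phosphorus, arsenic and other elements |
44 | Redox | 09 | Redox reactions of other atoms | 02 | Reduction of Se, P, Hg, As and other elements |
45 | Hydrolysis & other | 11 | Hydrolysis of esters, lactones and inorganic esters | 01 | Hydrolysis of alkyl esters |
46 | Hydrolysis & other | 11 | Hydrolysis of esters, lactones and inorganic esters | 02 | Hydrolysis of aryl esters |
47 | Hydrolysis & other | 11 | Hydrolysis of esters, lactones and inorganic esters | 03 | Hydrolysis of anionic and cationic esters |
48 | Hydrolysis & other | 11 | Hydrolysis of esters, lactones and inorganic esters | 04 | Hydrolysis of linear and cyclic carbamates (>N-CO-OR') and carbonates (RO-CO-OR') |
49 | Hydrolysis & other | 11 | Hydrolysis of esters, lactones and inorganic esters | 05 | Hydrolysis of acyl β-glucuronides or other acylglycosides |
50 | Hydrolysis & other | 11 | Hydrolysis of esters, lactones and inorganic esters | 06 | Reversible hydrolytic opening of lactone rings |
51 | Hydrolysis & other | 11 | Hydrolysis of esters, lactones and inorganic esters | 07 | Hydrolysis of thioesters (RCO-SR' and RCS-SR') and thiolactones |
52 | Hydrolysis & other | 11 | Hydrolysis of esters, lactones and inorganic esters | 08 | Hydrolysis of esters of inorganic acids (nitrates, nitrites, sulfates, sulfamates, phosphates, phosphonates, etc.) |
53 | Hydrolysis & other | 12 | Hydrolysis of amides, lactams and peptides | 01 | Hydrolysis of alkyl and aryl amides [alkyl-CO-N< and aryl-CO-N |
54 | Hydrolysis & other | 12 | Hydrolysis of amides, lactams and peptides | 02 | |
55 | Hydrolysis & other | 12 | Hydrolysis of amides, lactams and peptides | 03 | |
56 | Hydrolysis & other | 12 | Hydrolysis of amides, lactams and peptides | 04 | |
57 | Hydrolysis & other | 13 | Epoxide hydration | 01 | |
58 | Hydrolysis & other | 14 | Other hydrolyses/hydration reactions // Non-enzymatic eliminations and rearrangements | 01 | |
59 | Hydrolysis & other | 14 | Other hydrolyses/hydration reactions // Non-enzymatic eliminations and rearrangements | 02 | |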