Article pubs.acs.org/jmc
Cite This: J. Med. Chem. 2018, 61, 1019−1030
MetaQSAR: An Integrated Database Engine to Manage and Analyze Metabolic Data
Alessandro Pedretti,*,† Angelica Mazzolari,† Giulio Vistoli,† and Bernard Testa‡,§
† Dipartimento di Scienze Farmaceutiche “Pietro Pratesi”, Facoltà di Farmacia, Università degli Studi di Milano, Via Luigi Mangiagalli 25, I-20133 Milano, Italy
‡ University of Lausanne, 1015 Lausanne, Switzerland
ABSTRACT: The study describes the MetaQSAR tool, a new database engine specifically tailored to collect and analyze metabolic data. It is a plug-in embedded in the VEGA suite of programs (freely downloadable at www.vegazz.net) and takes advantage of all the cheminformatics features implemented in the software, with additional tools for statistical analyses, similarity searches, and physicochemical profiling of the stored molecules. MetaQSAR also implements a novel metabolism classification, which groups the metabolic reactions in 101 classes and can find numerous applications in metabolic analyses. The potential of MetaQSAR is assessed here by using it to store and analyze an extended database focused on the metabolism of xenobiotics, which was collected by a manually curated meta-analysis of the recent literature. The database includes 1890 substrates taken from about 1500 original papers published in the years 2004−2015. The database was used in both physicochemical analyses and similarity searches, thus demonstrating the notable potential of MetaQSAR, which can find particularly fruitful applications in developing targeted predictive approaches.
1. INTRODUCTION
The recent and steady progress in omics disciplines, and in related analytical techniques, has strongly increased the amount of scientific data routinely available to researchers.1 This growth involved both biological data (coming for example from HTS campaigns and genomic or proteomic studies) and chemical data (coming from combinatorial synthesis, compound collections, and physicochemical profiling).2 This growth fostered the development of computational methods able to store and analyze a tremendous amount of scientific data, thus laying the ground for the emerging bioinformatics and cheminformatics fields.3 While involving different objects and pursuing different objectives, these two fields similarly imply the development of in silico tools able to collect and analyze scientific databases, often combined with efficient procedures to automatically retrieve scientific data from online resources or scientific publications. Moreover, cheminformatics and bioinformatics approaches can be synergistically related because they investigate the same processes from chemical or biological standpoints, respectively.4 The study of drug metabolism offers a particularly fertile arena where bioinformatics and cheminformatics approaches can feed each other because metabolism combines the cheminformatics profiling of substrates and metabolites with the bioinformatics analyses of the involved biochemical pathways and proteins.5 Furthermore, databases for metabolic data can also take advantage of the approaches adopted to manage the databases of synthetic reactions because all these databases primarily involve an efficient collection of chemical reactions.6 The aforementioned progress has resulted in a very rich arsenal of databases and web-based resources focused on metabolic data and biochemical pathways, as recently reviewed.7,8 However, the robotized acquisition of scientific data from various available resources may limit the critical curation and the accuracy of the collected data, resulting in databases which are suitable for qualitative analyses and comparisons but may prove ineffective in quantitative predictions, where the inaccuracy of only a few data can impair the overall predictive power of the resulting models.9 This concern is particularly felt in metabolism prediction, which, while not necessarily requiring huge databases, is strongly influenced by the accuracy of the experimental metabolic data on which it is based. Finally, the existing resources focused on human metabolism are primarily bioinformatics-oriented and not tailored to drug metabolism because they are aimed mainly at predicting biochemical pathways, a task usually performed in metabolomics analyses. In contrast, only a limited number of cheminformatics-oriented applications have been proposed, even though they could find fruitful applications in metabolism prediction.10
Received: October 4, 2017. Published: December 15, 2017. © 2017 American Chemical Society
DOI: 10.1021/acs.jmedchem.7b01473 J. Med. Chem. 2018, 61, 1019−1030
Journal of Medicinal Chemistry
Article
Figure 1. Structure of the implemented metabolism database with the logical links between the different data types which are organized in tables here represented as boxes colored according to the data type.
class, the enzyme(s) catalyzing the reaction, and the generation of the metabolite along with its toxicity and/or reactivity (these last two features encoded by simple Boolean descriptors). It should be remembered that a metabolite is classified as first generation if formed directly from the xenobiotic, as second generation when it comes from a first-generation metabolite, and so on. The substrates and products of the reactions are saved in the molecules table (light-blue box), which includes the 1D and 2D structures as well as several molecular descriptors and fingerprints that can be used for similarity searches and physicochemical analyses. The corresponding 3D structures are stored in a different table (structures) to enhance their analysis. Notice that the molecules and structures tables do not discriminate between substrates and metabolites because a given molecule can be the product of one enzymatic reaction and the substrate of a subsequent one, and thus a unique classification would be impossible. Each object in the molecules table is linked to its bibliographic data (gray boxes). Moreover, a compact and human-readable paper code was implemented to simplify the management of the reviewed papers, and users can define their own paper code provided it does not exceed 16 characters. Sixteen enzyme classes (Table 1) have been defined in the enzymes table (yellow box) and are linked to the reactions table to specify the class of the major and possibly alternative enzymes involved in the reaction. The reaction class is probably the most informative property that can be included when a new metabolic reaction is entered into the database. Indeed, MetaQSAR includes an enhanced classification, the scientific rationale of which is explained in the Results and which subdivides the metabolic reactions into 101 classes.
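The relational layout described here can be illustrated with a minimal sketch. It assumes an in-memory SQLite database; every table and column name below is hypothetical and chosen for illustration, since the actual MetaQSAR schema is not given in the text. The three example rows (toluene, benzyl alcohol, benzaldehyde) are illustrative entries, not records from the collected database.

```python
import sqlite3

# Hypothetical table and column names, chosen for illustration only;
# the actual MetaQSAR schema is not given in the text.
SCHEMA = """
CREATE TABLE molecules (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    smiles TEXT NOT NULL
);
CREATE TABLE enzymes (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE reactions (
    id           INTEGER PRIMARY KEY,
    substrate_id INTEGER NOT NULL REFERENCES molecules(id),
    product_id   INTEGER NOT NULL REFERENCES molecules(id),
    enzyme_id    INTEGER REFERENCES enzymes(id),
    generation   INTEGER NOT NULL,
    is_toxic     INTEGER NOT NULL DEFAULT 0,
    is_reactive  INTEGER NOT NULL DEFAULT 0
);
"""


def build_demo_db():
    """Create an in-memory database holding two illustrative reactions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO molecules (id, name, smiles) VALUES (?, ?, ?)",
        [
            (1, "toluene", "Cc1ccccc1"),
            (2, "benzyl alcohol", "OCc1ccccc1"),
            (3, "benzaldehyde", "O=Cc1ccccc1"),
        ],
    )
    conn.executemany(
        "INSERT INTO enzymes (id, name) VALUES (?, ?)",
        [(1, "cytochromes P450"), (2, "dehydrogenases")],
    )
    # (substrate, product, enzyme class, metabolite generation)
    conn.executemany(
        "INSERT INTO reactions (substrate_id, product_id, enzyme_id, generation) "
        "VALUES (?, ?, ?, ?)",
        [(1, 2, 1, 1), (2, 3, 2, 2)],
    )
    conn.commit()
    return conn


def roles(conn, molecule_id):
    """Return (times used as substrate, times formed as product)."""
    as_substrate = conn.execute(
        "SELECT COUNT(*) FROM reactions WHERE substrate_id = ?", (molecule_id,)
    ).fetchone()[0]
    as_product = conn.execute(
        "SELECT COUNT(*) FROM reactions WHERE product_id = ?", (molecule_id,)
    ).fetchone()[0]
    return as_substrate, as_product
```

In this sketch, `roles(build_demo_db(), 2)` reports benzyl alcohol once as a substrate and once as a product, which is why a single molecules table without a substrate/metabolite flag is the natural design.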
As compiled in Table 2, these 101 reaction classes are organized in a framework composed of three levels corresponding to (1) main classes (compiled in the ReaMain table), (2) classes (in the ReaClasses table), and (3) subclasses (in the ReaSubClasses table). This classification method allows searches at different levels of detail. For example, one may retrieve all
On these grounds, the present study describes a novel application (MetaQSAR) embedded in the VEGA suite of programs11 and specifically tailored to generate and manage metabolic databases. MetaQSAR has been developed as a plugin of VEGA ZZ and can be used to collect, classify, and analyze metabolic reactions in an effective way while representing a versatile engine to develop in-house targeted applications for metabolism analysis and prediction. In detail, MetaQSAR is a relational database that comprises a management system which connects the database to the graphic user interface (GUI) for input, data analysis, and delivery. Notably, MetaQSAR includes a novel classification for the metabolic reactions which subdivides them into 28 classes and 101 subclasses, thus allowing an efficient clustering of the stored data. To assess its potential, MetaQSAR was utilized to store and to analyze an extended version of the metabolic database already collected for our previous study.12 Besides an updated evaluation of the relative importance of biotransformation reactions in the metabolism of xenobiotics, the analysis reported here will exploit the cheminformatics features implemented in VEGA ZZ to profile the physicochemical properties of the included molecules as well as to perform similarity analyses. The results emphasize that cheminformatics methods applied to metabolism databases can provide meaningful results with noteworthy potential in developing similarity-based predictive approaches.
2. IMPLEMENTATION
2.1. MetaQSAR Database. 2.1.1. Database Structure. The database structure is shown in Figure 1, in which the different data types are organized in tables (here shown as boxes) colored in different ways according to the type of information they contain. The relations between the tables, depicted by red arrows, indicate the logical links between rows of different tables. The pivotal table is reactions (sandy colored), in which each metabolic reaction is characterized by the following data: the substrate, the substrate atom(s) involved in the reaction, the metabolite formed, the reaction
molecule is required. The Bingo similarity fingerprints of the substrates, as implemented by the Indigo toolkit (http://lifescience.opensource.epam.com/indigo/), are compared to that of the query molecule and evaluated by a Tanimoto similarity index.13 If the index so obtained is greater than a cutoff threshold specified by the user, the substrate and its metabolic reactions are shown as a tree report in the bottom box of the window, and a preview of the corresponding 2D structure is shown by selecting a row of the tree. The search by molecular properties can be performed by specifying a set of criteria which are related to each other by logical operators (AND, OR). In detail, each criterion consists of a molecular descriptor, a mathematical operator, and a value (e.g., all substrates with log P > 5; all substrates with log P > 5 AND PSA > 80 Å², and so on). In this case too, the search results are shown in the window as a tree, without the similarity values. Several comprehensive statistics can be shown in the statistics tab, which includes four sections: main data, metabolic reactions, enzymes involved, and reaction counts. The main data section includes generic statistics such as the number of substrates, metabolic reactions, or enzyme classes. The metabolic reactions section shows the counts of the reactions according to metabolic generation (1, 2, and 3+) and of the reactions which yield toxic and/or reactive products. The enzyme and reaction sections report the statistics for each class (or subclass) of enzymes or metabolic reactions, reporting the number and the percentage of metabolic reactions belonging to each group (or subgroup). Similarly, bibliographic sections count the reviewed papers as classified by journal, publisher, or country.
2.2. Database Collection. As mentioned in the Introduction, the MetaQSAR engine was used to store an extended version of the database already described in a previous study.
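The two search modes of Section 2.1.2 (Tanimoto similarity against a user cutoff, and property criteria combined by AND/OR) can be sketched as follows. This is a minimal illustration, not the MetaQSAR or Indigo implementation: fingerprints are assumed to be plain Python sets of on-bit positions, and the function names, descriptor keys, and operator table are hypothetical.

```python
import operator


def tanimoto(fp_a, fp_b):
    """Tanimoto index of two fingerprints given as sets of on-bit positions."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints: treated here as not similar
    return len(a & b) / union


def similarity_search(query_fp, library, cutoff):
    """Return (name, score) pairs scoring strictly above the cutoff, best first."""
    hits = []
    for name, fp in library.items():
        score = tanimoto(query_fp, fp)
        if score > cutoff:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical operator table for criteria such as ("logP", ">", 5).
OPS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
}


def matches(properties, criteria, mode="AND"):
    """Check a molecule's descriptor dict against (descriptor, op, value) criteria."""
    results = [OPS[op](properties[desc], value) for desc, op, value in criteria]
    return all(results) if mode == "AND" else any(results)
```

For instance, `matches({"logP": 5.6, "PSA": 95.0}, [("logP", ">", 5), ("PSA", ">", 80)])` corresponds to the "log P > 5 AND PSA > 80 Å²" example in the text.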
The main objective was the collection of a rather large number (>10³) of substrates and their metabolic reactions to be used in statistical analysis and as QSAR learning sets. The metabolic data were collected in a systematic paper-by-paper search of metabolic studies published in the primary literature (i.e., Chemical Research in Toxicology, Xenobiotica, and Drug Metabolism and Disposition) during the years 2004−2012 for the first two journals, and the years 2004−2015 for Drug Metabolism and Disposition. Great care was used in collecting, selecting, and screening literature metabolic data for entry into the database. Thus, the metabolites reported in each paper were examined critically for quality and adequacy according to a number of criteria which resemble those already exploited in the previous study and can be briefly summarized as follows:
1. The focus was on drugs and other xenobiotics, excluding endogenous compounds except when these are used as drugs (e.g., estradiol).
2. The database was restricted to metabolic studies in humans and other mammals, carried out either in vivo, in cellular systems, or at a subcellular or enzymatic level.
3. All papers were screened for biochemical and analytical quality.
4. All structurally characterized metabolites were taken into account irrespective of their quantitative importance.
5. The structure of a number of metabolites was not fully elucidated in the sense that the regioselectivity of the reaction was left undetermined. Such metabolites were deemed unfit for inclusion in relevant learning sets.
Table 1. Enzyme Classes Stored in the Enzyme Table Used for the Reaction Entries with the Corresponding Relative Abundance of Class As Monitored in the Here Collected Database
ID   enzyme class                                                   %
1    cytochromes P450                                               55.38
2    dehydrogenases                                                 2.29
3    FMO                                                            4.19
4    XO, AO                                                         2.17
5    peroxidases                                                    2.39
6    other reductases                                               2.78
7    other oxidoreductases or autoxidations                         0.59
8    hydrolases                                                     9.16
9    UDP-glucuronosyltransferases                                   10.14
10   sulfotransferases                                              2.13
11   glutathione S-transferases and subsequent enzymes/reactions    2.37
12   acetyltransferases                                             0.94
13   acyl-CoA ligases and subsequent enzymes                        1.23
14   methyltransferases                                             1.41
15   other transferases or nonenzymatic conjugations                1.82
16   nonenzymatic hydrolyses or (de)hydrations                      1.02
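The relative abundances in Table 1 are percentages of reaction entries per enzyme class. A minimal sketch of such a tally is given below; the function name and the input format (one class label per reaction entry) are assumptions made for illustration and do not reproduce the statistics code of MetaQSAR.

```python
from collections import Counter


def class_percentages(class_labels):
    """Map each class label to its share of all entries, in percent (2 decimals)."""
    counts = Counter(class_labels)
    total = sum(counts.values())
    if total == 0:
        return {}  # no reactions stored, nothing to report
    return {label: round(100.0 * n / total, 2) for label, n in counts.items()}
```

Because each percentage is rounded independently, the column total can differ slightly from 100; the sixteen values in Table 1 add up to 100.01.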
redox reactions (main class level), only the Csp3 oxidations (class level), or even only the hydroxylation reactions of an isolated Csp3 atom (subclass level).
2.1.2. MetaQSAR Graphical User Interface. The graphic user interface (GUI) of MetaQSAR was developed in C++ as a plug-in of the VEGA ZZ program and is completely integrated in its graphic environment (Figure 2). The interface is organized in five tabs in which the user can edit the metabolic reactions (reactions tab), manage the bibliographic data (papers and journals tabs), search for reactions by similarity or molecular properties (search tab), and analyze the data (statistics tab). For interested readers, a detailed user’s guide can be found at http://nova.disfarm.unimi.it/manual/plugins/metaqsar.htm. In the reactions tab, data entry is substrate-oriented: when a new reaction is added, the substrate selection is the first step, and thus a new substrate must first be added to the molecules table. More advanced search modes are also available to find substrates with missing or incomplete metabolic data. Next, the metabolic reaction has to be defined using a graphical tree in which the reactions are grouped into different classes and which permits an easy selection. When a substrate is selected, the corresponding 3D structure is displayed in the VEGA ZZ main window, and for each reaction, the user can indicate the target atom(s) by clicking them in the 3D view. When the reactive atoms are not well-defined or are uncertain, they can be better characterized by adding the confidence level of the data. For each added reaction, the user can finally define the generation to which a given metabolite belongs; the enzymes involved and the toxicity or reactivity can also be specified. In the papers tab, the user can manage the bibliographic data by adding new papers, editing them, and linking substrates and products to the source in which they are cited.
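The reaction tree used for data entry, and the searches at main class, class, or subclass level mentioned above, can be sketched with a three-field record. The record type, the function, and the short labels in the catalog are hypothetical; they paraphrase a few entries of the classification and are not the codes stored in the ReaMain, ReaClasses, and ReaSubClasses tables.

```python
from typing import NamedTuple


class ReactionType(NamedTuple):
    main: str  # main class level
    cls: str   # class level
    sub: str   # subclass level


# Hypothetical excerpt with paraphrased labels, for illustration only.
CATALOG = [
    ReactionType("redox", "Csp3 oxidation", "hydroxylation of isolated Csp3"),
    ReactionType("redox", "Csp3 oxidation", "hydroxylation of Csp3 alpha to an unsaturated system"),
    ReactionType("redox", "S oxidation and reduction", "oxygenation of sulfides to sulfoxides"),
    ReactionType("hydrolysis", "ester hydrolysis", "hydrolysis of alkyl esters"),
]


def search(catalog, main=None, cls=None, sub=None):
    """Filter reaction types; a level left as None is not constrained."""
    return [
        r for r in catalog
        if (main is None or r.main == main)
        and (cls is None or r.cls == cls)
        and (sub is None or r.sub == sub)
    ]
```

Calling `search(CATALOG, main="redox")` returns all three redox entries, while adding `cls="Csp3 oxidation"` narrows the result to two, mirroring the increasing level of detail described in the text.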
As exemplified in Figure 2, MetaQSAR GUI includes some tools to retrieve and analyze the metabolic data stored in the database. In particular, in the search tab, substrates and products (i.e., molecules) can be searched by structural similarity with a query molecule or by molecular properties. For the searches by structure, the SMILES string of the query
Table 2. Classification of the Metabolic Reactions According to Main Class/Class/Subclass Scheme with the Corresponding Relative Abundance of Each Subclass As Monitored in the Here Collected Database main classes count
description
classes code
description
description
%
01 02
7.04 6.89 16.35
16.54 2.37 1.24 0.17
3
03
4
04
5
05
hydroxylation (or other oxidations) of isolated Csp3 hydroxylation (or other oxidations) of C α to an unsaturated system (>CC, >CO, −C≡N, aryl) hydroxylation (or other oxidations) of Csp3 carrying an heteroatom (N, O, S, halo) (including subsequent dealkylation, deamination, or dehalogenation) dehydrogenation of >CH−CH< to >CCCH-N< to>CN− (incl >CN+CC< bonds to epoxides or other metabolites oxygenation of −C-CC−H and −C-CC−C− bonds
class 03: −CHOH ↔ >C=O → −COOH
count  subclass  description                                                          %
10     01        dehydrogenation of −CH2OH groups to −CHO and of >CHOH to >C=O        0.94
11     02        hydrogenation of −CHO to −CH2OH and of >C=O to >CHOH                 1.86
12     03        oxidation of −CHO to −COOH                                           0.15
13     04        dehydrogenation of dihydrodiols to ortho-diphenols                   0
class 04: various redox reactions of carbon atoms
count  subclass  description                                                          %
14     01        oxidative decarboxylation                                            0.04
15     02        reductive dehalogenations                                            0.13
16     03        reduction of arene and alkene epoxides                               0
17     04        other C reductions, e.g., of >C=C< to −CH2−CH2−                      0.75
18
05
redox reactions of R3N
01
1.4
19
02
20
03
oxidation of tertiary alkylamines and heterocyclic amines to N-oxides or other metabolites oxidation of tertiary arylamines, azarenes and azo compounds to N-oxides or other metabolites reduction of N-oxides
01
hydroxylation of amines to hydroxylamines or intermediates
1.42
22 23
02 03
0.21 0
24
04
25 26 27 28
05 06 07 08
hydroxylation of amides to hydroxylamides oxidation of primary hydroxylamines to nitroso compounds or oximes (incl spontaneous dismutation), then to nitro compounds reduction of hydroxylamines and hydroxylamides (incl spontaneous dismutation) reduction of nitroso compounds and oximes to hydroxylamines reduction of nitro compounds to nitroso compounds other N-oxidations (1,4-dihydropyridines, etc.) other N-reductions (e.g., azo compounds to hydrazines, hydrazines to amines, reductive N-rings opening)
21
29
06
07
oxidation of >NH, >NOH, and −NO /reduction of −NO2, −NO, >NOH, etc.
oxidation to quinones or analogues/reduction of quinones and analogues
1.05 0.21
0.27 0.29 0.8 0.19 0.27
01
oxidation of diphenols to quinones
0.31
30
02
0.23
31 32
03 04
33
05
oxidation of amino- and amido-phenols to quinoneimines or quinoneimides, respectively oxidation of cresols and analogues to quinonemethides other oxidations of phenols and amines (dimerization, quinone-like metabolites, etc.) reduction of quinones and analogues
34 35
08
oxidation and reduction of S atoms
01 02
36 37
03 04
38 39
05 06
40
07
oxidation of thiols to sulfenic acids or disulfides oxygenation of sulfenic acids to sulfinic acids, and of sulfinic acids to sulfonic acids oxygenation of sulfides to sulfoxides, and of sulfoxides to sulfones oxygenation of thiones (>CS) or thioamides to sulfines, and of sulfines to sulfenes oxidative desulfurations of >CS to ketones, and of −PS to −PO groups S-oxygenations of disulfides, thiosulfinates (−SO-S−), α-disulfoxides (−SO-SO−), and thiosulfonates (−SO2-S−) reduction of disulfides to thiols
0.46 1.19 0.25 0.19 0 1.97 0.34 0.23 0.04 0.02
Table 2. continued
count  subclass  description                                                          %
41     08        reduction of sulfoxides to sulfides                                  0.08
42     09        other S-reductions                                                   0
class 09: redox reactions of other atoms
count  subclass  description                                                          %
43     01        oxidation of silicon, phosphorus, arsenic and other elements         0.15
44     02        reduction of Se, P, Hg, As, and other elements                       0.06
11
hydrolysis of esters, lactones, and inorganic esters
01
hydrolysis of alkyl esters
2.05
46 47 48
02 03 04
0.25 2.01 0.54
49 50 51 52
05 06 07 08
hydrolysis of aryl esters hydrolysis of anionic and cationic esters hydrolysis of linear and cyclic carbamates (>N-CO-OR′) and carbonates (RO-CO-OR′) hydrolysis of acyl β-glucuronides or other acylglycosides reversible hydrolytic opening of lactone rings hydrolysis of thioesters (RCO-SR′ and RCS-SR′) and thiolactones hydrolysis of esters of inorganic acids (nitrates, nitrites, sulfates, sulfamates, phosphates, phosphonates, etc.)
45
hydrolysis and other
53
12
hydrolysis of amides, lactams and peptides
54 55 56
0 0.19 0.17 0.92
01
hydrolysis of alkyl and aryl amides [alkyl-CO-N< and aryl-CO-NCO → −CHOH] and [−CHO → −COOH] were each counted as a single step. The same applied, e.g., to the reactions [>NH → >NOH], [−NHOH → −N=O], [−NO2 → −NO], [−NO → −NHOH], and [>NOH → >NH].
3. RESULTS AND DISCUSSION
3.1. Classification into Reaction Types and Enzymes. The first and major issue when setting up this database engine was to define a list of metabolic reactions that was as comprehensive and realistic as possible without being excessively long. Even though several classifications of enzymatic reactions have been proposed, mainly based on the enzymes involved (e.g., the EC system, http://www.sbcs.qmul.ac.uk/iubmb/enzyme/), the list proposed here is instead based on the catalyzed reactions, to reach a finer-grained classification in the field of xenobiotic metabolism.
Figure 2. Graphical user interface of MetaQSAR (top right window), which is organized in tabs by which the user can perform the main tasks. When a substrate is selected, its 3D structure is displayed in the main VEGA ZZ window (top left window), allowing an easy selection of the atoms involved in a given reaction. Notice that these reactive atoms remain highlighted by a blue sphere. The bottom windows show examples of molecular searches that can be carried out using the MetaQSAR interface by structure (bottom left window) or by properties (bottom right window).
mercapturic acids and even to thiols (due to β-lyase catalyzed C−S cleavage) was considered as a single pathway because it proved futile to do otherwise. Some reactive metabolites or intermediates may react nonenzymatically with glutathione, a well-known nucleophile. Coupling with endogenous carbonyls to form hydrazones, and CO2 addition to primary and secondary amines to form carbamic acids, are also known metabolic reactions occurring nonenzymatically. In the vast majority of reactions, there was no ambiguity in assigning the formation of a given metabolite to a given enzyme (super)family or category (e.g., to CYP, FMO, or dehydrogenases/reductases) because either the enzyme (super)family was determined or no realistic alternative existed. However, a few cases (200 metabolites, i.e.,