Fishing the Target of Antitubercular Compounds: In Silico Target Deconvolution Model Development and Validation

Philip Prathipati,*,†,§ Ngai Ling Ma,† Ujjini H. Manjunatha,† and Andreas Bender*,‡,#

Novartis Institute for Tropical Diseases, 10 Biopolis Road, #05-01 Chromos, 138670, Singapore, and Lead Discovery Informatics, Center for Proteomic Chemistry, Novartis Institutes for BioMedical Research, Inc., 250 Massachusetts Avenue, Cambridge, Massachusetts 02139

Received December 18, 2008

This work proposes an in silico target prediction protocol for antitubercular (antiTB) compounds. The protocol extends a recently published ‘domain fishing model’ (DFM), and its predicted targets are validated on a set of 42 common antitubercular drugs. For the 23 antiTB compounds of the set which are directly linked to targets (see text for definition), the DFM exhibited a very good target prediction accuracy of 95%. For the 19 compounds indirectly linked to targets, a reasonable pathway/embedded pathway prediction accuracy of 84% was achieved. Since mostly eukaryotic ligand binding data was used for the DFM generation, the high target prediction accuracy for prokaryotes (which is an extrapolation from the training data) was unexpected and provides an additional proof of concept of the DFM. To estimate the general applicability of the model, a ligand-target coverage analysis was performed. Although the DFM only modestly covers the entire TB proteome (32% of all proteins), it captures 70% of the proteome subset targeted by the 42 common antiTB compounds, which is in agreement with the good predictive ability of the DFM for the targets of the compounds chosen here. In a prospective validation, the model successfully predicted the targets of the new antiTB compounds CBR-2092 and Amiclenomycin. Together, these findings suggest that in silico target prediction tools may be a useful supplement to existing experimental target deconvolution strategies.

Keywords: TB chemogenomics • antiTB drugs • Tb proteome • target prediction • chemoproteomics • protein domains • model domain extrapolation • in silico target deconvolution • protein-protein interactions • Tuberculosis • drugs • Molecular Target

* To whom correspondence should be addressed. E-mail: (P.P.) philipp@bii.a-star.edu.sg; (A.B.) [email protected].
† Novartis Institute for Tropical Diseases.
‡ Novartis Institutes for BioMedical Research, Inc.
§ Current Address: Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), Singapore 138671.
# Current Address: Leiden/Amsterdam Center for Drug Research, Division of Medicinal Chemistry, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands.

Journal of Proteome Research 2009, 8, 2788–2798. Published on Web 03/20/2009. 10.1021/pr8010843 CCC: $40.75 © 2009 American Chemical Society

Introduction

Target deconvolution is an important component of ‘forward chemical genetics’.1 It plays a major role in anti-infective drug research: it helps to elucidate the mechanism of drug action, to understand the biological processes taking place in the pathogen, and to support other areas of rational drug design, in addition to providing a target-specific host toxicity profile. There are several methods available for experimental target deconvolution, such as affinity chromatography;2 identifying the mutation(s) responsible for resistance by both rational and whole-genome approaches;1 transcriptional profiling;3 protein microarrays;4 whole genome expression libraries;5 and genomic library complementation,1 just to name a few. Historically, the generation and characterization of resistance-conferring mutations using either rational or whole genome approaches have been widely used; however, in many instances, multiple approaches are required in parallel to identify the molecular target of a compound. In view of the experimental efforts required to identify drug targets, well-validated in silico target deconvolution protocols can be assumed to be useful tools for guiding experiments and/or for complementing experimentally obtained information.1

In silico target predictions are based on the principle that “similar compounds bind to similar targets”.6 Several reviews of this principle provide more details,7-11 and these techniques were shown to recall known targets of several classes of pharmaceuticals.12,13 Here, we define a drug ‘target’ as the protein(s) with which a compound interacts and through which it causally influences biological processes. If one assumes that domains are the primary units of evolution for proteins, then it is reasonable to expect that a compound may interact with apparently unrelated proteins if they happen to share some of their domains, in particular if this is the case within the ligand-binding domain(s). This has been demonstrated in the literature, for example by the promiscuity of Captopril toward leukotriene A4 hydrolase and thermolysin.14 While these two targets of Captopril are from two different enzyme classes (EC 3.3.2.6 vs EC 3.4.24.27) which also possess low overall pairwise residue identity (∼20%), they share the same structural fold at the catalytic site,14 defined as a common Interpro domain in this case (IPR006025).

Given that ligands bind to apparently unrelated targets with a shared protein fold, and with the aim of extrapolating to targets outside the original training set, Bender, Jenkins, and colleagues recently published a domain fishing model (DFM), which extends the computational algorithm from predicting individual proteins as targets for compounds to predicting a protein domain as the interaction partner of the ligand.15 In addition, it is well-known that ligand selectivity is often an issue for related proteins (orthologs) from different species.16 Taken together, these two observations suggest that ligands interacting with a set of given protein domains in one species would also likely interact with the same (or a similar) set of protein domains of another species. This idea is exploited in the protein domain prediction model developed in the current work.

Given that most of the ligand-target interaction space is annotated for eukaryotes, the prediction of prokaryotic targets responsible for infectious diseases has not yet been explored. In this paper, we report our efforts in extending the work of Bender et al.15 to predict prokaryotic (in this case, Mycobacterium tuberculosis, M Tb) targets from knowledge largely gathered from eukaryotes. Validation of the model is carried out either by matching target proteins or target pathways, depending on the available data, for 42 well-studied antituberculosis (antiTB) agents. To assess the general applicability and limitation of the approach, the coverage of the model in terms of the TB proteome is investigated. Furthermore, in a prospective validation, the model is applied to two antiTB compounds currently under development, to gauge its target prediction capabilities also for novel compounds. For four compounds acting against TB with as yet unknown experimental targets, we present a list of predicted targets, and hence hypothesize their modes of action in a truly prospective manner.

Table 1. Data Sources Used in This Work(a)

description: Common TB compounds(b); Pubchem Chemical identifier; SWISS-PROT/EnsEMBL protein identifier; SWISS-PROT protein identifier; Enzyme commission (EC) number; SWISS-PROT protein identifier for M Tb; Gene ontology annotations; Gene ontology SLIMS annotations; Functional classification for MTB genome; KEGG and Biocyc pathways

source: NIAID; Pubchem(c); STITCH(d); STRING(e); Drugbank(f); IntEnz(g); SWISS-PROT(h); EBI’s GOA(i); EBI’s GOA(j); Tuberculist(k); Publication(l)

annotation in original source: AIDS#; PUBCHEM_COMPOUND_ID; Protein; Alias; SWISS-PROT_ID; ID; Accession, Genes; GO; GOS; Functional category; Pathways

(a) Shown are the description, source and field name in the source database of various biological profile annotations added to the antiTB compounds under investigation.
(b) Compounds common to Janin17 and http://chemdb.niaid.nih.gov/struct_search/oi/OI_search.asp#.
(c) http://pubchem.ncbi.nlm.nih.gov.
(d) http://stitch.embl.de/download/protein_chemical.links.v1.0.tsv.gz.
(e) http://string.embl.de/newstring_download/protein.aliases.v7.1.txt.gz.
(f) http://www.drugbank.ca/public/downloads/current/drugcards.zip.
(g) ftp://ftp.ebi.ac.uk/pub/databases/intenz/enzyme.
(h) http://www.uniprot.org/uniprot/?query=MYCTU&force=yes&format=xls&columns=id,entry.
(i) ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/30.M_tuberculosis_ATCC_25618.goa.
(j) ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/goslim/gene_association.goa_uniprot_slimmed.gz.
(k) http://genolist.pasteur.fr/TubercuList/.
(l) Supporting Information of Hasan et al.33

Methods

Common AntiTB Compounds Domain Prediction Workflow. A set of 46 well-studied antiTB compounds was collected from literature,17 and their M Tb targets (derived from domains) were predicted with the domain fishing model (42 of those compounds could finally be assigned experimental targets, while for the remaining four compounds, prospective target predictions were provided, as outlined in the following).

Figure 1. Illustration of the workflow used for validating the domain fishing model (DFM). The most crucial parts are Supporting Information Table S1 (which contains the predicted targets for the 46 compounds under investigation), and Supporting Information Table S2 (which contains the experimental targets of 42 compounds which could be found in literature). From these data, Tables 2 and 3 were constructed, which compare predicted to experimental targets. (For further details see main text.)

Figure 2. Network of predicted Interpro domains (ranked based on the Bayes score assigned by the DFM), corresponding M Tb proteins, and gene IDs, for the antiTB drug Cerulenin. The protein predictions which are in agreement with experimental data23 are highlighted with asterisks.

The domain fishing model generation workflow has been described in detail previously,15 and hence, only a short description is given here. By merging the target information in the WOMBAT database18 with the target-domain information provided by Interpro,19 the Interpro domains of targets are obtained. These domains are linked back to the ligand information in WOMBAT, and multicategory Bayesian models were trained to capture the associations between 150 000 ligand chemical features and a total of 1403 Interpro target domains. The target information collected in the WOMBAT database (estimated to be over 1300) is largely human;18 hence, in order to extend the DFM to the prediction of M Tb targets, the predicted Interpro domains of the 46 compounds were linked to the cognate Swiss-Prot protein IDs in the TB genome using information from the EBI Integr8 database.20,21 In other words, the predicted Interpro domains for each antiTB agent were mapped to the actual proteins present in the M Tb proteome that contain that Interpro domain. When the model is applied to a compound, multiple TB domains with different probability scores could be predicted to interact with the compound. Analysis by Bender et al.22 has suggested that the performance of the DFM based on the top 5 predicted domains is superior to other values. We undertook further analysis to assess the prediction accuracies when the top 5, 10, 15, and 20 predictions were considered. Our analysis of the 46 common TB compounds reveals that the top 15 predictions are the minimum required for predicting the targets of all compounds in this set (e.g., when using the top 10 domains, the targets of two compounds cannot be obtained).
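The mapping step described above (rank the predicted domains by score, keep the top k, and look up the proteome proteins that contain each retained domain) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the domain and protein labels (`DOM_A`, `PROT_1`, ...) and the scores are hypothetical placeholders, and the domain-to-protein dictionary merely stands in for the kind of lookup that a proteome annotation resource would provide.

```python
from typing import Dict, List, Set, Tuple


def top_k_domains(scores: Dict[str, float], k: int = 15) -> List[Tuple[str, float]]:
    """Return the k highest-scoring domains, best first (ties broken by ID)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


def domains_to_proteins(
    ranked: List[Tuple[str, float]],
    domain_to_proteins: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Keep only ranked domains that occur in the proteome, with their proteins."""
    hits: Dict[str, Set[str]] = {}
    for domain, _score in ranked:
        proteins = domain_to_proteins.get(domain, set())
        if proteins:  # domains absent from the proteome yield no candidate target
            hits[domain] = set(proteins)
    return hits


# Hypothetical model output for one compound: domain -> score.
scores = {"DOM_A": 42.0, "DOM_B": 17.0, "DOM_C": 30.5, "DOM_D": 5.0}

# Hypothetical proteome annotation: domain -> proteins containing it.
proteome = {
    "DOM_A": {"PROT_1", "PROT_2"},
    "DOM_C": {"PROT_2"},
    "DOM_D": {"PROT_9"},
}

ranked = top_k_domains(scores, k=3)
hits = domains_to_proteins(ranked, proteome)
predicted = set().union(*hits.values()) if hits else set()
print(sorted(predicted))
```

With a cutoff of three, `DOM_D` is dropped by rank and `DOM_B` is dropped because no proteome protein carries it, which mirrors why a larger cutoff can recover targets that a smaller one misses.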


Because the top 15 cutoff is also optimal in terms of prediction accuracy (discussed in detail below), the target prediction results based on the top 15 domains are reported in this study. (Given that the current work extrapolates from ligand binding to eukaryotic proteins to ligand binding to prokaryotic proteins, taking a larger list of model predictions into account also seems justified by the different nature of the data sets in each study.) The predicted biological profile of the 46 common antiTB compounds (in terms of their top 15 predicted domains, associated proteins and gene information) is deposited in Supporting Information Table S1. The same information can be displayed graphically as networks, as illustrated in the following. Using Cerulenin as an example (Figure 2), the DFM predicted that the compound may interact with 12 domains (depicted as red diamonds in Figure 2) which are present in 31 proteins (green hexagons), expressed by 31 genes (purple octagons). In this case, the 12 predicted target domains were predicted to have probability scores ranging from 17 to 42, with the ranking based on the Bayes score (highest score as rank 1) included in Figure 2. Further case studies of this type will be presented below.

Common AntiTB Compounds Biological Profile Annotation Workflow. To gauge how well the model agrees with experiment, we collected the experimental biological activity profile of common antiTB compounds for validation purposes.

Table 2. In Silico Target Predictions Are Listed for 23 Well-Studied AntiTB Compounds for Which Experimental Protein Targets Are Available in the Literature(a)

name                 TB protein (Swiss-Prot ID) common to predicted and experimental targets
Cerulenin            P63454, P63456, P0A5Y4, P63458
Thiolactomycin       P63454, P63456
FAS20013             P63454
Clotrimazole         P0A512
Econazole            P0A512
Ciprofloxacin        Q07702
Clinafloxacin        Q07702
Gatifloxacin         Q07702
Levofloxacin         Q07702
Moxifloxacin         Q07702
Ofloxacin            Q07702
Sparfloxacin         Q07702
Rifampin             P66701
Rifabutin            P66701
Rifapentin           P66701
Isoniazid(b)         P60176
Ethionamide(b)       P60176
Prothionamide(b)     P60176
Nicotinamide(b)      P60176
Trimethoprim         P0A546, P67044
Ethambutol           P72059, P72030, P0A560
Clavulanic Acid      P0C5C1
TMC207               no match

(a) See Supporting Information for full list.
(b) For these compounds, the targets were predicted by assuming that these compounds form adducts.55
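The rows of Table 2 can be tallied directly to see how a direct-match accuracy follows from them. The sketch below is only an illustrative count, under the assumption that a compound counts as correctly predicted when at least one Swiss-Prot ID is shared between its predicted and experimental targets; the dictionary is transcribed from Table 2, while the function name `match_rate` is hypothetical.

```python
from typing import Dict, List, Tuple

# Compound -> Swiss-Prot IDs common to predicted and experimental targets,
# transcribed from Table 2 (an empty list stands for "no match").
TABLE_2: Dict[str, List[str]] = {
    "Cerulenin": ["P63454", "P63456", "P0A5Y4", "P63458"],
    "Thiolactomycin": ["P63454", "P63456"],
    "FAS20013": ["P63454"],
    "Clotrimazole": ["P0A512"],
    "Econazole": ["P0A512"],
    "Ciprofloxacin": ["Q07702"],
    "Clinafloxacin": ["Q07702"],
    "Gatifloxacin": ["Q07702"],
    "Levofloxacin": ["Q07702"],
    "Moxifloxacin": ["Q07702"],
    "Ofloxacin": ["Q07702"],
    "Sparfloxacin": ["Q07702"],
    "Rifampin": ["P66701"],
    "Rifabutin": ["P66701"],
    "Rifapentin": ["P66701"],
    "Isoniazid": ["P60176"],
    "Ethionamide": ["P60176"],
    "Prothionamide": ["P60176"],
    "Nicotinamide": ["P60176"],
    "Trimethoprim": ["P0A546", "P67044"],
    "Ethambutol": ["P72059", "P72030", "P0A560"],
    "Clavulanic Acid": ["P0C5C1"],
    "TMC207": [],
}


def match_rate(shared: Dict[str, List[str]]) -> Tuple[int, int, float]:
    """Count compounds with at least one shared target and return the fraction."""
    matched = sum(1 for ids in shared.values() if ids)
    total = len(shared)
    return matched, total, matched / total


matched, total, rate = match_rate(TABLE_2)
print(f"{matched} of {total} compounds matched ({rate:.1%})")
```

The tally gives 22 of 23 compounds, about 95.7%, which corresponds to the 95% accuracy quoted in the abstract for the directly linked compounds.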

As no single source for this type of information is available, we merged information from various sources using the Pubchem CID identifiers (Table 1). When the two established ligand-target databases Drugbank23 and STITCH24 were consulted, only 35 of a total of 44 compounds could be found. Out of the 35 compounds, only three (namely Isoniazid, Ethambutol and Rifampin) had target species identifiers labeled with MYCTU, explicitly indicating the M Tb target. For the remaining 32 compounds, extrapolation based on Enzyme Commission (EC) numbers was made. As an example, the target for Rifabutin based on Drugbank is RPOA (EC 2.7.7.6) in Escherichia coli, with four proteins (RPOA, RPOB, RPOC and RPOZ) in M Tb sharing the same EC number. Hence, in the absence of other more definite information, all four were considered as potential targets also in M Tb.

In summary, the annotation workflow consists of the following steps: the protein identifiers (Protein) in the STITCH database were converted to Swiss-Prot identifiers (SWISS-PROT_ID)25 using the protein name (Alias) in the STRING database.26 Then, wherever an explicit M Tb target was not available, the Swiss-Prot identifiers (Accession)25 of the orthologous M Tb proteins were added via Enzyme Commission annotations (ID) from IntEnz,20 and the corresponding gene identifiers (Genes) were obtained from Swiss-Prot. With the use of the above approach, experimental targets were obtained for 27 common antiTB agents. For the remaining 19 compounds, where the above approach was not successful, M Tb targets could be assigned manually from literature17 for only 15 compounds. For these targets, gene ontology (GO)27,28 and gene ontology SLIMS terms (GOS)27,28 were subsequently added from the European Bioinformatics Institute Gene Ontology Annotation (EBI GOA) database,29 with functional category annotation obtained from Tuberculist30 and pathway information (Kyoto Encyclopedia of Genes and Genomes, KEGG31 and BioCyc pathway32) obtained from a recent metagenome analysis publication.33 Despite our efforts, experimental target annotations could not be assigned to four of the 46 common antiTB compounds (Amoxicillin, Linezolid, Ranbezolid and SQ109), as no targets were associated with them with reasonable evidence. The resulting master data set of 42 common antiTB compounds annotated with the experimental target information is available as Supporting Information (Table S2).

Table 3. In Silico Target Predictions Are Listed for 19 Well-Studied AntiTB Compounds for Which Experimental Protein Targets Are Not Available in the Literature(a)

name                        predicted TB genes found in the pathways of experimental targets(b)
d-Cycloserine(c)            Rv3423c
p-Aminosalicylic acid(c)    Rv3608, Rv1207
Pyrazinamide(d)             Rv2718c
Pyrazinoic acid(d)          Rv1253, Rv1329c, Rv2718c, Rv2973c, Rv1020
5-Chloropyrazinamide(d)     Rv0504c
Streptomycin(d)             Rv1286, Rv0684, Rv0701, Rv0685, Rv0120c
Amikacin(d)                 Rv0120c, Rv0684, Rv0685, Rv1286, Rv1713, Rv2364c
Triclosan(d)                Rv3370c
Clarithromycin(d)           Rv1638, Rv0384c, Rv3596c, Rv0058, Rv0001
CGI17341(d)                 Rv2869c
Viomycin(d)                 Rv0651
Capreomycin(d)              Rv0651
Clofazimine(d)              Rv1595, Rv1552
Mefloquine(d)               Rv2964, Rv2155c
Thiacetazone(d)             Rv1484
Kanamycin(d)                Rv0120, Rv2839, Rv1286, Rv0701, Rv0684, Rv0685
Thiocarlide                 no match
PA-824                      no match
OPC-67683                   no match

(a) For those compounds, predicted and functional target associations have been compared by means of protein-protein network interactions and KEGG and Biocyc’s pathway information.
(b) Literature references used to associate ligands with their experimental targets are provided in Supporting Information Table S2.
(c) Match obtained by comparing with classical pathway.33
(d) Match obtained by comparing with embedded pathway, in which proteins within two degrees of separation (in the protein interaction network) from a particular protein are assumed to be on the same pathway as that protein.38,39,43
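Footnote d of Table 3 treats proteins within two degrees of separation in the protein interaction network as lying on the same embedded pathway. One way to sketch such a check is a depth-limited breadth-first search from the experimental target, followed by a lookup of the predicted genes in the resulting neighbourhood. The code below is a minimal illustration under that reading, not the procedure used in the study (which relied on published network maps and Cytoscape); the gene labels (`T`, `G1`, ...), the edge list and all function names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def build_network(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from pairwise interactions."""
    network: Dict[str, Set[str]] = {}
    for a, b in edges:
        network.setdefault(a, set()).add(b)
        network.setdefault(b, set()).add(a)
    return network


def within_degrees(
    network: Dict[str, Set[str]], source: str, max_degree: int = 2
) -> Dict[str, int]:
    """Breadth-first search: node -> distance, for nodes at most max_degree away.

    The source itself is included with distance 0.
    """
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distance[node] == max_degree:
            continue  # do not expand beyond the allowed number of edges
        for neighbour in network.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


def embedded_pathway_hits(
    network: Dict[str, Set[str]],
    experimental_target: str,
    predicted: Iterable[str],
    max_degree: int = 2,
) -> Dict[str, int]:
    """Predicted genes that fall inside the target's neighbourhood, with distance."""
    neighbourhood = within_degrees(network, experimental_target, max_degree)
    return {g: neighbourhood[g] for g in sorted(predicted) if g in neighbourhood}


# Hypothetical toy network around an experimental target "T".
network = build_network([("T", "G1"), ("G1", "G2"), ("G2", "G3"), ("T", "G4")])
hits = embedded_pathway_hits(network, "T", {"G2", "G3", "G4", "G5"})
print(hits)
```

In this toy network, `G4` is a direct neighbour of the target and `G2` is two edges away, so both count as matches; `G3` is three edges away and `G5` is not in the network, so neither does.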

Results and Discussion

Prediction Accuracy for Common TB Compounds. Depending on the evidence of experimental target association, the 42 common antiTB compounds were split into two groups (illustrated in Figure 1). There is ample literature evidence suggesting that the target associations of antiTB agents with molecular weight less than 200, such as d-Cycloserine, p-Aminosalicylic acid, Pyrazinamide, Pyrazinoic acid, 5-Chloropyrazinamide, CGI17341, Thiocarlide and Triclosan, are based on indirect interactions or that they might be interacting as adducts after being activated. In addition, for compounds such as Capreomycin,34 Amikacin,35 Clarithromycin,36 Kanamycin,35 Streptomycin,36 Thioacetazone34 and Viomycin,37 the target associations were based on genetic studies (such as resistance or cross-resistance). Hence, 19 of the compounds were classified as compounds with what we call here ‘functional’ targets.

For the remaining 23 ligands (from 11 compound classes), which have solid experimental evidence of a physical interaction with their targets, a direct comparison between the predicted and experimental protein sets could be performed, which is presented in Table 2. For these compounds, very good target prediction accuracy (95%) was observed. Using Cerulenin as an example again, out of the 23 proteins with domains predicted to interact with the compound (Figure 2), four proteins were found to match the experimental targets (namely P63454: FAB1_MYCTU; P63456: FAB2_MYCTU; P0A5Y4: FABG_MYCTU and P63458: FABD_MYCTU), as highlighted with asterisks. This prediction accuracy is remarkable since mostly ligand-binding data of human (eukaryotic) drug targets were used in the model generation.

Figure 3. Network of predicted Interpro domains (ranked based on the Bayes score assigned by the DFM), associated M Tb proteins, gene IDs and pathways associated with the compound TMC207. This is shown since it was the only case found where target predictions were not successful, most likely due to insufficient coverage of the protein domain with ligand chemical information in the training set of the target prediction model.

TMC207 was the only compound in this class whose target predictions did not match the known experimental target, which is P63691 (atpE, Rv1305; Figure 3). For this compound, no match between predicted and experimental targets could be identified even if the entire list of predicted domains (instead of only the top 15 scoring domains) was used for this purpose. The analysis of protein interaction networks (see below for detail) was also unable to reconcile both types of data. Two reasons may explain the inaccurate target prediction for TMC207: (1) None of the Interpro domains of atpE (IPR002379, IPR000454 and IPR005953) were captured by the DFM, and hence, a direct match could not be expected; and (2) TMC207 does not contain any of the substructures associated with Interpro domains found in other ATP synthases; hence, no match at either the pathway or the PPI network level was found. Interestingly, the Bayes score (∼13) of the most probable predicted domains for TMC207 (IPR001236
and IPR001557) is also much lower than the average score (∼64) for the 22 correct predictions, suggesting that the prediction is associated with low statistical significance.

For the remaining 19 ligands (from 12 compound classes), which do not have any experimental evidence of a physical interaction with their targets (the above set of compounds with ‘functional targets’), an indirect comparison between either their pathways or the analogous clusters or modules in protein-protein interaction (PPI) networks38,39 was performed in Cytoscape,40 as shown in Table 3. For these compounds, the experimental targets were implicated based on evidence of indirect interactions, with mechanisms of action often described as ‘pathways’ or ‘cellular processes’. Given that the domain fishing model employed here predicts proteins (and protein folds) as compound targets, we chose to assess predictions of the model by investigating whether the predicted target and the experimental target share the same pathway.

Biological pathways have traditionally been represented by genes (as nodes in a chain) which interact directly or indirectly (as edges) in a linear fashion, that is, as classical pathways. However, for the TB proteome, only a fraction of proteins and genes (about one-fifth, 19%) possesses classical pathway annotations.41 The more recent systems biology paradigm portrays pathways as networks, that is, embedded pathways, in which the interactions between genes are not necessarily linear and where nodes (genes) can have any number of interaction partners, connected via edges in the gene network.42 In our particular case, to assess whether predicted and ‘functional’ targets are part of the same area of the network, protein-protein interaction network maps38,39,43 were utilized, and proteins
within two degrees of separation are assumed to be functionally related, in agreement with earlier publications.38,39 For these 19 antiTB compounds without explicit protein target information, the prediction accuracy of the model is 84% (Table 3).

Given our model predictions and the comparison to experimental data, a classical pathway match is exemplified by p-aminosalicylic acid. In this case, the matching pathways (Biocyc’s superpathway of chorismate, Biocyc’s tetrahydrofolate biosynthesis and KEGG’s folate biosynthesis) are highlighted using an asterisk in Figure 4a. All three pathways are associated with the most probable domains (Dihydropteroate synthase, IPR006390; Dihydropteroate synthase-like, IPR011005; Dihydropteroate synthase, DHPS, IPR000489), with a probability score of 11. On the other hand, an embedded pathway match is exemplified by Streptomycin (Figure 4b). Here, it is observed that the predicted targets (Rv0120c, Rv0701, Rv1286, Rv0684, Rv0685) are within one degree of separation from the experimental target (Rv0682), which was implicated via resistance mechanism studies.44

Figure 4. (a) Network of predicted Interpro domains, associated M Tb proteins, gene IDs and pathways, for the drug p-aminosalicylic acid. The pathways in agreement with experiment23,24 are highlighted with asterisks. This is an example of a direct match of predicted and experimental targets. (b) The network of Streptomycin’s predicted Interpro domains and their associated gene IDs, overlaid onto the embedded pathway of the experimental target of Streptomycin, Rv0682.44 In this case, the direct target could not be predicted, but neighboring targets (highlighted with asterisks) which are known to interact functionally with Rv0682 were predicted. The predicted Interpro domains in (a) and (b) are ranked based on the Bayes score assigned by the DFM.

Unlike the list of compounds presented in Table 2 (direct match), where the incorrect prediction (TMC207) is associated with a significantly lower Bayes score than the correct predictions, there is no clear distinction in the scores between the correct (16 compounds) and incorrect predictions (3 compounds: PA-824, OPC-67683, Thiocarlide) for the compounds indirectly connected to their targets (Table 3). The observed lack of agreement between predicted and experimental targets could arise from several sources. The two compounds PA-824 and OPC-67683 are nitroimidazole prodrugs, and their experimental target’s domain (IPR004378, M. tuberculosis paralogous family 11) is unique to actinomycetes45 and was not captured by the model, since this part of ligand-target interaction space was not covered by the training set derived from mainly eukaryotic binding data. Thiocarlide (Isoxyl) has been experimentally shown to target fatty acid desaturase (DesA3, Rv3229c).46 However, the DFM predicts IPR008979 (Galactose-binding like) and Rv1835c as the most likely domain and target gene, respectively. The inaccurate prediction could be due to the absence from the domain fishing model of IPR005804, the domain associated with the experimental target of Thiocarlide, Rv3229c, and of all its related domains (IPR009160, IPR011388, IPR012171, IPR014607, IPR015876, and IPR001522).

Coverage of the Domain Fishing Model. To better understand the reasons for the good prediction accuracy and gain further confidence regarding the prediction of truly novel targets for future compounds, model coverage of the TB proteome (given the WOMBAT training set employed), as well as model coverage of the TB proteome subset targeted by common antiTB compounds, was assessed. Here, coverage is defined as the percentage of the TB proteome that the domain fishing model is capable of predicting in principle; in other words, it describes the fraction of Interpro domain folds present in M Tb which are ‘filled’ with ligand structures in the eukaryotic training set.

Figure 5. Bar charts comparing (a) the distribution of the entire M Tb proteome within the 11 functional classes (in black) with the distribution of the proteome according to Interpro domain annotations (in gold) and with the proteome captured by the model (blue). Part (b) of the figure compares the distribution of the ‘M Tb drug target proteome’ among the 11 functional classes (in black) with the target proteins captured by the model (blue). The TubercuList30 functional classification codes are described as 0, virulence, detoxification, adaptation; 1, lipid metabolism; 2, information pathways; 3, cell wall and cell processes; 4, stable RNAs; 5, insertion seqs and phages; 6, PE/PPE; 7, intermediary metabolism and respiration; 8, unknown; 9, regulatory proteins; 10, conserved hypotheticals; 16, conserved hypotheticals with an orthologue in M. bovis.

Three factors affect this coverage. First, it depends on how much of the relevant biological space, in terms of druggability, is covered by the ligands in the training set. It has been estimated that the current druggable genome comprises approximately 130 privileged domains.47,48 As the DFM captured over 70% of these privileged domains, the coverage of the druggable biological space by the model seems reasonable. Second, coverage depends on how well the TB proteome is annotated with Interpro domains. For the 3947 TB proteins investigated,49 78%
of them have at least one Interpro domain annotation, suggesting the TB proteome is equally well-annotated. Lastly, the larger the number of domains shared between the biological species in the training set (in this case, mainly eukaryotes) and the TB bacteria, the better the coverage of targets in this particular organism. Here, the model captures 1403 domains, out of which 445 are found in the TB proteome. This set of 445 domains is associated with 1250 out of the 3947 TB proteins, that is, the domain fishing model covers only 32% of the entire TB proteome. While this number seems rather low, one should keep in mind that only a fraction of the proteome will be relevant in the context of drug discovery, as described before. For comparison, analyses by Overington et al.47 and Hopkins and Groom48 have indicated that existing drugs which follow the rule-of-5 are associated with only 3051 out of a total of 25 635 proteins of the human proteome, that is, ∼12% of the overall protein interaction partners. So even though the coverage of the DFM is small numerically, it may already be adequate when put into the context relevant for drug discovery in which it will be used.

In a next step, we analyzed the proteome coverage in terms of the functional classification scheme by Cole et al., in which the TB genome is divided into 11 functional classes.50 As displayed in Figure 5a, the DFM coverage of a functional class is independent of how well the Interpro domain annotations cover that particular class. For example, approximately one-quarter of the TB proteome belongs to the conserved hypothetical proteins (10), of which 72% possess Interpro domain annotation. However, only approximately 20% of the proteins of this class are captured by the model.

Figure 6. Network of predicted Interpro domains (ranked based on the Bayes score assigned by the DFM), associated M Tb proteins, and gene IDs for the drug Amiclenomycin. The dotted lines depict the relationships between predicted target domains (which are highlighted with asterisks), and experimental domains known to be contained in the target,53 BioA.

We will now turn our attention to the portion of the TB proteome that common antiTB drugs are known to act on. For ease of discussion, we term this part of the proteome the “TB drug proteome” in the following. The analysis reveals that 93% of the TB drug proteome is distributed within four functional classes (Figure 5b), namely, intermediary metabolism and respiration (7), information pathways (2), regulatory proteins (9) and lipid metabolism (1)/cell wall biosynthesis (3). On the other hand, it is interesting to note that the largest class in the entire TB proteome, namely, conserved hypothetical proteins (10), is poorly represented among drug targets (