In silico enhancing M. tuberculosis protein interaction networks in

Apr 3, 2018 - Bacterial protein-protein interaction (PPI) networks are significant to reveal the ... drug-target genes via analysis of network degree ...
0 downloads 4 Views 2MB Size
Subscriber access provided by UNIV OF DURHAM

In silico enhancing M. tuberculosis protein interaction networks in STRING to predict drug resistance pathways and pharmacological risks suyu mei J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.7b00702 • Publication Date (Web): 03 Apr 2018 Downloaded from http://pubs.acs.org on April 3, 2018

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

In silico enhancing M. tuberculosis protein interaction networks in STRING to predict drug resistance pathways and pharmacological risks Suyu Mei*# #

Software College, Shenyang Normal University, Shenyang, 110034, China

Email addresses: [email protected] Phone: 86-024-86592390 Abstract Bacterial protein-protein interaction (PPI) networks are significant to reveal the machinery of signal transduction and drug resistance within bacterial cells. The database STRING has collected a large number of bacterial pathogen PPI networks, but most of the data are of low quality without being experimentally or computationally validated, thus restricting its further biomedical applications. In this work, we exploit the experimental data via four solutions to enhance the quality of M. tuberculosis H37Rv (MTB) PPI networks in STRING. Computational results show that the experimental data derived jointly by two-hybrid and co-purification approaches are the most reliable to train an l2-regularized logistic regression model for MTB PPI networks validation. Based on the validated MTB PPI networks, we further study the three problems via breadth-first graph search algorithm (1) discovery of MTB drug resistance pathways through searching for the paths between known drug-target genes and drug-resistance genes; (2) choosing potential co-target genes via searching for the critical genes located on multiple pathways; (3) choosing essential drug-target genes via analysis of network degree distribution. In addition, we further combine the validated MTBPPI networks with human PPI networks to analyze the potential pharmacological risks of known and candidate drug-target genes from the

-1-

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

point of view of system pharmacology. The evidences from protein structure alignment demonstrate that the drugs that act on MTB target genes could also adversely act on human signaling pathways. Keywords: machine learning, M. tuberculosis H37Rv, protein-protein interaction networks, signaling pathways, drug resistance, system pharmacology Introduction Mycobacterium tuberculosis is the causative agent of tuberculosis that causes millions of deaths each year1. As a well-studied model bacterium, M. tuberculosis H37Rv (MTB) has recently received much attention about its co-infection with HIV 2, drug resistance 3,4,5 and drug target discovery 3,6. The paths between drug-target genes and drug-resistant genes in MTB protein-protein interaction (PPI) networks are especially useful to study the machinery of MTB drug resistance3,4,5,6. For instance, Raman et al.3 directly exploit the MTB PPI networks from the database STRING 7 to study the pathways of MTB drug resistance. Although the database STRING7 provides much information about bacterial protein-protein interactions, its data have been estimated to be low quality8,9. For instance, among the 796,610 entries (398,305 non-redundant PPIs) for MTB PPI networks in STRING, only 32 PPIs are derived from experiment, while the rest are inferred from other sources without being experimentally or computationally validated. To our knowledge, the existing computational methods seldom exploit experimental data to reconstruct M. tuberculosis H37Rv PPI networks4,10. As an alternative, these methods use homology modeling to derive interlogs as training data or direct MTB PPIs. Cui et al.4 use homology modeling method to infer MTB interlogs from the PPI networks of other species in DIP 11, based on which to further -2-

ACS Paragon Plus Environment

Page 2 of 39

Page 3 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

search for essential MTB genes and potential drug-target genes. Liu et al.10 infer MTB interlogs from the PPI networks of E. coli and other 14 species, with the difference to the method4 that it uses the interlogs to train a SVM classifier to further predict unknown interactions. Comparatively, these two methods4,10 are much less reliable than the methods of gene co-expression, gene fusion, gene neighborhood that STRING adopts, because the cross-species knowledge transfer between the evolutionarily distant species E. coli and M. Tuberculosis is less credible than the gene events that occur in M. tuberculosis H37Rv itself. Furthermore, the model10 has not been validated against experimental data. Actually, Wang et al.11 have experimentally constructed a small-scale M. tuberculosis H37Rv PPI networks via bacterial two-hybrid approach that contain 8,044 PPIs, wherein 144 PPIs have been verified by co-purification technique. The data have not been used for computational modeling or biological inference. In this work, we first train a machine learning model based on the reliable experimental M. tuberculosis H37Rv PPI data in11 to validate the rest data in11 and the inferred MTB PPI networks in STRING7, based on which to further study the potential MTB drug-resistance pathways. Then we combine the validated MTB PPI networks with human PPI networks to analyze the potential pharmacological risks of the known and the candidate drug-target genes from the point of view of system pharmacology. Data and method Data STRING7 has collected 398,305 non-redundant M. tuberculosis H37Rv (MTB) PPIs between 3,929 proteins, where only 32 non-redundant PPIs are derived from

-3-

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

experiments and the rest PPIs are inferred from gene structure (gene neighborhood), gene events (gene fusion, gene co-expression, gene cooccurence), other data sources (database, text mining) or direct interlogs (homology, interlog).The MTB PPI networks in STRING7 have been estimated to be of low quality8,9, most PPIs have not been subjected to experimental or computational validation. We need train a predictive model to validate these PPIs that are indirectly inferred. Wang et al.11 have derived 8,044 experimental MTB PPIs via high-throughput bacterial two-hybrid approach, most of which have not been computationally exploited. Out of the 242 PPIs derived by bacterial two-hybrid approach, 146 MTB PPIs have been experimentally verified by co-expression/co-purification technique11. Before co-purification, the pair of potential interaction proteins was cloned into the co-expression vector (pHEX) and E. coli vector (pGEX), and the pair of recombinant vectors was co-transformed into competent E. coli BL21 (DE3)11. As reviewed in12, different experimental techniques of identifying protein-protein interactions, e.g. X-ray crystallography, Yeast two-hybrid (Y2H), mass spectrometry, affinity purification, gene co-expression, etc., are suitable to capture different types of interactions and thus complementary to each other in practical PPI detection. Furthermore, these experimental identification methods exhibit a high fraction of false positive interactions and often show low agreement when generated by different techniques12. For these reasons, the 146 MTB PPIs that are simultaneously identified by two-hybrid approach and co-purification technique are supposed to be of less risk than those PPIs that are identified by two-hybrid approach alone. Hence, the training data come from three sources of experimental data: (1) the first source is the 32 PPIs labeled “experiments” in STRING7 (called STRING); (2) the second source is the 8,044 PPIs derived by bacterial two-hybrid approach (called

-4-

ACS Paragon Plus Environment

Page 4 of 39

Page 5 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

two-hybrid)11; (3) and the third source is the 146 PPIs that are simultaneously verified by two-hybrid approach and co-purification (called co-purification)11. After removing those obsolete genes, the two-hybrid data contain 8,042 PPIs and the co-purification data contain 141 PPIs. To estimate the generalization ability of the proposed model, we randomly sample 20 PPIs as positive independent test data from the co-purification data, so the co-purification data finally contain 121 PPIs and the two-hybrid data contain 8,022 PPIs (the co-purification data are originally contained in the two-hybrid data). Furthermore, the sampled 20 PPIs are disjoint with the 32 PPIs in STRING. Independent test data should have been sampled from other sources independent of the four solutions, but no other reliable data are available at present. Considering the 141 PPIs are both verified by two-hybrid approach and co-purification technique, we sample gold-standard independent test data out of the co-purification data. The positive independent test data are small, so a positive class biased model could completely recognize the positive data. To check the model bias, we randomly sample 200 protein pairs as negative independent test data from the protein pair space exclusive of the protein pairs in STRING7 and the two-hybrid data11. Negative data sampling is one major concern in computational prediction of protein-protein interactions. Random sampling is the common practice to yield negative data. To obtain more reliable negative data, cellular compartments restricted sampling method has been proposed13, where negative data are sampled only from those protein pairs that are not located in the same cellular compartments. Reliable as they are, the sampled negative data are less representative, because many proteins in the same cellular compartments also do not interact. Random sampling could yield more representative negative data. The other major concern about negative data

-5-

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

sampling is how to determine the ratio of positive to negative. We do not attempt to simulate the true scenario of biological space, e.g. ratio of positive to negative 1:10, 1:100, etc., for two reasons: (1) highly skewed distributions between positive class and negative class are prone to yield a biased model; (2) it is hard to find a direct mapping from biological problem space to computational space. For the reason, we randomly sample the negative training data with 1:1 ratio to the positive training data. The sampling space consists of MTB protein pairs exclusive of all the experimental data and the inferred PPIs in STRING11. In view of different confidence levels of the three sources, we provide four solutions to computational modeling as described below. All the solutions are subjected to 5-fold cross validation and estimated on the same independent test set. We will choose the solution that achieves best performance as the final predictive model. STRING. In this solution, we use the 32 PPIs in STRING7 as positive training data, and the negative training data are randomly sampled with 1:1 ratio to the positive training data. Two-hybrid. In this solution, we use the two-hybrid data (8,022 PPIs) as positive training data, and the negative training data are randomly sampled with 1:1 ratio to the positive training data. Two-hybrid+co-purification. In this solution, we first use the co-purification data (121 PPIs) to train a predictive model, and then use the model to validate the PPIs in two-hybrid data. The predictive positive data are incrementally added to the co-purification data to augment the positive training data. Based on the new positive training data, we resample the negative training data with 1:1 ratio to the positive training data.

-6-

ACS Paragon Plus Environment

Page 6 of 39

Page 7 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Co-purification. In this solution, we use the co-purification data (121 PPIs) as positive training data, and the negative training data are randomly sampled with 1:1 ratio to the positive training data. As compared to the solution two-hybrid, the training data are much smaller but more reliable. The computational modeling follows the idea: (1) the largest experimental two-hybrid data are preferred to be used as training data; (2) if the solution two-hybrid fails, the more reliable co-purification data will be used as training data. Considering its small size, we attempt to validate the PPIs in two-hybrid data using the trained model, and the validated PPIs are then incrementally added to the co-purification data as new training data, i.e. the two-hybrid+co-purification; (3) if the two solutions both fail, we have to use the co-purification data alone as training data; (4) if the solution co-purification works, we need further check the risk of data overfiting because of its small size, so we use the solution STRING as baseline model, whose data are also small. The reason we adopt the method of trial and error to choose training data is that the true distribution of PPIs in the space of MTB protein pairs is unknown and no method available at present guides us to obtain the best training data. Nevertheless, higher quality of experimental data is supposed to yield a model of higher quality. The performance evaluation could guide us to choose the proper solution and model. Multi-instance feature construction Feature construction is a critical step that directly or indirectly determines success or failure of machine learning modeling. In this work, feature construction serves two purposes of (1) capturing significant biological information of genes/proteins; (2) augmenting the small training data to reduce the risk of data overfitting. Among the features, gene ontology (GO)14 has been claimed to the most discriminative feature to predict protein-protein interactions15. Unfortunately, the GO terms in GOA14 are -7-

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 39

sparsely distributed among those less-studied genes so that a gene/protein is potentially depicted by null feature vector. For the reason, the GO knowledge of the homologs is transferred to study the gene/protein concerned. Considering the small training data, we introduce an independent homolog instance to depict the homolog knowledge, so that each gene/protein is depicted with two instances, and the target instance depicts the GO knowledge of the gene/protein itself. The homologs are extracted from SwissProt16 using PSI-Blast17 with default parameters against all species. For each protein i in the training set U = U pos ∪ U neg , we obtain the i

i

homolog set of GO terms S H , the target set of GO terms S T , and the whole training set of GO terms S =

∪(S i∈U

i T

∪ Si ). H

For each protein pair (i1 , i2 ) , the target instance and homolog instance are formally defined as follows: 0, g ∉ STi1 ∧ g ∉ STi2  VT(i1 ,i2 ) [ g ] = 2, g ∈ STi1 ∧ g ∈ STi2 ; 1, otherwise 

0, g ∉ S Hi1 ∧ g ∉ S Hi2  VH(i1 ,i2 ) [ g ] = 2, g ∈ S Hi1 ∧ g ∈ S Hi2 1, otherwise 

(1)

For each GO term g ∈ S , VT(i1 ,i2 ) [ g ] denotes the component g of target instance VT(i1 ,i2 ) and VH(i1 ,i2 ) [ g ] denotes the component g of homolog instance VH(i1 ,i2 ) . Those GO terms g ∉ S are discarded. If a protein pair

(i1 , i2 ) shares the same GO term g, then the

corresponding component in feature vector VT(i1 ,i2 ) or VH(i1 ,i2 ) is set 2; if neither protein possesses the GO term g, then the value is set 0; otherwise the value is set 1. Those protein pairs that are degraded into null feature vectors are removed. It is noted that the assigned values {0, 1, 2} just simply record the three cases of occurrence of a GO term in a protein pair instead of the counts of occurrence: (1) the GO term occurs in both proteins; (2) the GO term occurs in one protein alone; (3) the GO term does not -8-

ACS Paragon Plus Environment

Page 9 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

occur in any protein. A protein pair is depicted by the combinatorial occurrences of all GO terms, thus the feature vector for a protein pair may not be sparse, though the GO terms for a protein are sparse. L2-regularized logistic regression L2-regularized logistic regression18 is adopted here to fast fit training data and counteract the noise introduced by homolog instances. Given a set of instance-label pairs ( xi , yi ), i = 1, 2,..., l ; xi ∈ R n ; yi ∈ {−1, +1} , l2-regularized logistic regression solves the following unconstrained optimization problem. l T 1 T min ω ω + C ∑ log(1 + e − yiω xi ) (2) ω 2 i =1

where ω denotes weight vector, C denotes penalty parameter/regularizer, and the second term penalizes noise/outlier fitting. The optimization problem (2) is solved via its dual form. l l 1 min α T Qα + ∑ α i logα i + ∑ (C − α i ) log(C − α i ) − ∑ C logC α 2 (3) i:α i > 0 i:α i | f (VH(i1 ,i2 ) )|  (4) Decision _ value(i1 , i2 ) =  f (VH(i1 ,i2 ) ), if | f (VT(i1 ,i2 ) ) | ζ L(i1 , i2 ) =  (5) 0, otherwise

The threshold ζ is used to filter out those weak positive predictions.

Experimental setting and model evaluation Since each protein pair is depicted with two instances, the model can be estimated by both target instance and homolog instance (called combined-instance case), homolog instance alone (called homolog-instance case) and target instance alone (called target-instance case), respectively. From the former two cases, we can check the effectiveness of homolog knowledge transfer. The model performance is measured with the following metrics: ROC-AUC (AUC of Receiver Operating Characteristic), SE (sensitivity), SP (specificity), MCC (Matthews correlation coefficient) and F1 score. ROC-AUC is calculated based on the decision values defined in formula (4), and all the other metrics are derived from a confusion matrix M, whose element M i , j records the counts that class i are classified to class j . From M, we define several intermediate variables as formula (6). Based on these intermediate variables, we further define SPl, SEl and MCCl for each label as formula (7) and the overall MCC as formula (8). L

L

∑ ∑

pl = M l ,l , ql =

L

M i , j , rl =

i =1,i ≠ l j =1, j ≠ l L

p=

L



M i ,l ,sl =

i =1,i ≠ l

L

L

∑M

j =1, j ≠ l

L

l, j

(6)

∑ p ,q = ∑q ,r = ∑r ,s = ∑s l

l

l =1

SPl =

pl

SEl =

pl

l

l =1

pl + rl

pl + sl



L

l =1

l =1

, l = 1, 2..., L , l = 1, 2..., L

MCCl = ( pl ql − rl sl )

Acc =

l

l =1

Ml ,l

MCC = ( pq − rs)

(7)

( pl + rl )( pl + sl )(ql + rl )(ql + sl ) , l = 1, 2..., L

∑ ∑ L

L

i =1

j =1

Mi , j

(8)

( p + r)( p + s)(q + r )(q + s)

- 10 -

ACS Paragon Plus Environment

Page 10 of 39

Page 11 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

where L denotes the number of labels. F1 score is defined as follows: F1 score =

2 × SPl × SEl

SPl + SEl

, l = 1 denotes the positive class

(9)

Results Performance of 5-fold cross validation The ROC curves and AUC scores for the four solutions are illustrated in Figure 1 (A~D). As illustrated in Figure 1 (B), the solution two-hybrid shows the poorest performance of 5-fold cross validation among the four solutions. As shown in Table 1, the solution two-hybrid achieves overall accuracy below 60% for the three cases. The other metrics (SE, SP, MCC) also implicate that the two-hybrid data are potentially of low quality. Out of 242 PPIs derived by two-hybrid approach, only 141 MTB PPIs are verified by co-purification technique, accounting for 58.26% overlap rate11, implying that the relatively large experimental PPI data derived by high-throughput bacterial two-hybrid approach potentially contain a fraction of false interactions. The co-purification data contain more reliable PPIs that are verified by two-hybrid and co-purification approaches. As illustrated in Figure 1 (D), the solution co-purification achieves very promising ROC-AUC scores for the three cases, and the

performance measured with the metrics (SE, SP, MCC) is far better than that of solution two-hybrid. Among the three cases, the homolog-instance case shows relatively poor performance and some bias, i.e. MCC 0.7735, 0.8135 on the positive and negative class, respectively. Nevertheless, the homolog instance could help the model to work when the gene/protein concerned is novel and carries little information. We can see that data quality plays a more critical role in model performance than data quantity.

- 11 -

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The solution co-purification+two-hybrid is designed to check whether incremental increase of the co-purification data could increase the model performance. The PPIs validated by the model trained on the co-purification data are added to the training data, and the co-purification+two-hybrid data finally contain 1,257 PPIs. As shown in Figure 1 (C) and Table 1, the solution co-purification+two-hybrid does not increase but steeply deteriorate the performance with 20%~30% drop of SP, SE and over 30% drop of MCC. The results show that more data does not necessarily imply higher performance. Small data are easily prone to overfitting. The co-purification data are small, so we need to check whether the excellent performance of 5-fold cross validation potentially results from data overfitting. The solution STRING is designed as a contrast model. As shown in Figure 1 (A) and Table 1, the STRING data that contain 32 experimental PPIs do not result in overfitting, and the performance measured with SP, SE and MCC is moderate. Nevertheless, the solution STRING is much better than the solution two-hybrid. The results also imply that the experimental PPI networks11 need further

validation. We can see that small data do not necessarily result in overfitting. In the next section, we will further check the potential overfitting via independent test. From the above analysis, we could safely conclude that the solution co-purification is the best solution to be chosen and the quality co-purification data yields a credible model. We will use the model to validate the inferred PPIs in STRING7 and the rest PPIs in11 (see the following section- computational validation of M. tuberculosis H37Rv PPIs in STRING7 and two-hybrid PPIs in11). Performance of independent test The independent test set contains 20 PPIs and 200 randomly sampled protein pairs. We use the same independent test set to estimate the generalization ability and check - 12 -

ACS Paragon Plus Environment

Page 12 of 39

Page 13 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

data overfitting of the four solutions, namely STRING, two-hybrid, co-purification+ two-hybrid and co-purification. As shown in Figure 1 (E) and Table 1, the solution co-purification achieves the best independent test performance with recognition rate

94.12%, 85.71% and the positive and negative class, respectively. The performance further demonstrates that the co-purification data do not yield overfitting. The solution co-purification+two-hybrid also recognizes the positive data with 94.12% accuracy, but it shows a relatively poor performance on the negative data with 73.14%. The results show that the solution co-purification could potentially achieve good generalization ability with less risk of data overfitting. The performance of independent test may be somewhat biased since the independent test data are sampled from the co-purification data. However, the sampled 20 PPIs actually also belong to the two-hybrid data11, but the performance of independent test on the model trained on two-hybrid data is comparatively much poor (see Figure 1). For the reason, we are inclined to choose the co-purification solution. Performance comparison with the related methods The existing computational methods that are related to M. tuberculosis H37Rv are classified into two groups (1) predicting M. tuberculosis H37Rv protein interactions10; (2) predicting protein interactions between M. tuberculosis H37Rv and host8,9. The method10 trains a SVM model on E. coli PPI data to predict M. tuberculosis H37Rv PPIs, achieving 95.95% accuracy, SE 0.8986 and SP 0.9801. This performance is achieved on E. coli data instead of the target species M. tuberculosis H37Rv. Nevertheless, this performance is still inferior to our proposed model that is trained on experimental M. tuberculosis H37Rv PPI data (see Table 1). Furthermore, the method10 is not estimated by independent test on experimental M. tuberculosis H37Rv PPIs, so we do not know its true performance of predicting M. tuberculosis H37Rv - 13 -

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

protein interactions. Actually, no biological basis supports direct knowledge transfer between E. coli and M. tuberculosis H37Rv. Similarly, the KMM-SVM method19 exploits the knowledge of source pathogen-host PPIs to predict target pathogen-host PPIs via cross-species knowledge transfer. Unfortunately, the performance is not so convincing to demonstrate the effectiveness of cross-species knowledge transfer. For instance, a KMM-SVM model is trained on the known Francisella-human PPI data to predict Salmonella-human PPIs, with performance SP 0.257, SE 0.161 and F1 score 0.199. The homology method8 uses the known interacting domain pairs of multiple species as templates to predict protein interactions between M. tuberculosis H37Rv and Homo sapiens. Due to no experimental MTB-human PPIs available, the authors validate the

model against human PPIs. Among 839 predicted human PPIs, only 82 PPIs are validated against experimental data, accounting for only 9.77% overlap rate. The results do not convincingly demonstrate the feasibility of homology modeling. Computational validation of M. tuberculosis H37Rv PPIs in STRING7 and two-hybrid PPIs in11 The 398,305 non-redundant MTB PPIs in STRING7 are separately extracted according to the evidences of neighborhood, neighborhood transferred, fusion, co-occurrence, homology, co-expression transferred, experiments, experiments transferred, database, database transferred, text mining, text mining transferred. The data labeled experiments are merged with the two-hybrid data in11 as labeled data merged experiments. All these data are predicted by the model trained on the co-purification data, and the predicted positive rates are shown in Table 2. It is noted

that those protein pairs that are undetermined by the proposed model (see Formula (4)) are discarded. We can see that only a small fraction of MTB PPIs are validated by the

- 14 -

ACS Paragon Plus Environment

Page 14 of 39

Page 15 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

proposed model for each kind of evidences. The evidences with relatively high predicted positive rates include text mining (21.79 %) and experiments (16.98 %), while some evidences are validated with low predicted positive rates, e.g. database (6.20%), fusion (7.16%). We merge the validated PPIs with the positive training data and the positive independent test data and finally obtain non-redundant 45,295 PPIs (see Supplementary file S1), much smaller than the 398,305 PPIs in STRING7. Exclusive of those PPIs that are undetermined, we obtain 338,477 non-interactions (see Supplementary file S2). The computationally validated 45,295 PPIs will be used to discover novel drug target genes and drug resistance pathways (see the next sectionanalysis of potential drug resistance pathways and drug co-target genes). It is noted that we only validate the known experimentally derived or inferred PPIs in the current databases7,11 and do not predict novel PPIs in order to ensure the quality of M. tuberculosis H37Rv PPI networks. In addition, the huge space of protein pairs makes the prediction task too complicated. Analysis of potential drug resistance pathways and drug co-target genes Discovery of potential drug target genes. Critical or hub genes play important roles in regulating protein-protein interactions and signal transduction, removal or suppression of critical bacterial genes is expected to disrupt the bacterial signaling pathways and prohibit the pathogen from performing vital functions20,21. Highly connected genes with large degree distribution are assumed to be essential genes and drug target genes, inhibition of which could disconnect all the essential pathways and break down the whole signal communication systems of M. tuberculosis H37Rv cell. The overall structure of the M. tuberculosis H37Rv PPI networks validated in this work is illustrated in Figure 2. At the core of PPI networks are highly connected essential - 15 -

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 39

genes, some of which could be selected as potential drug target genes, while at the peripheral of PPI networks are non-essential genes. In the networks, the known drug target genes taken from the SInCRe database22 are explicitly colored in green, and the known drug resistance genes taken from3 are colored in red. We can see that most of the known drug target genes and drug resistance genes are located near the core of PPI networks. Figure 3 (A)~(C) illustrate the overall degree distributions of M. tuberculosis H37Rv genes, the degree distributions of known drug target genes22 and

the degree distributions of known drug resistance genes3. The details of degree distributions are provided in Supplementary File S2, S3 and S4, respectively. As indicated in Figure 3 (A), the degrees of M. tuberculosis H37Rv genes are subjected to power-law distribution, which makes it convenient for us to choose drug target genes. For instance, such the proteins with large degrees as Rv2264c {degree: 866}, Rv1023 {degree: 668}, Rv2857c {degree: 457}, Rv2737c {degree: 429}, Rv1344 {degree: 407} are potentially druggable (see Supplementary file S3). As shown in Figure (B), most of the known drug target genes have large network degrees, e.g. Rv0120c {degree: 310}, Rv2244 {degree: 309}, Rv2986c {degree: 276}, Rv3710 {degree: 255}, etc (see Supplementary file S4). Figure 3 (C) also shows that many M. tuberculosis H37Rv genes that have acquired drug resistance also have large network

degrees, e.g. Rv2737c {degree: 429}, Rv3197A {degree: 204}, etc (see Supplementary file S5). Discovery of potential drug resistance pathways. To date, the existing approaches to address bacterial drug resistance have achieved very limited success, mainly due to lack

of

knowledge

about

the

drug

resistance

machinery

in

bacteria19.

Interactome-scale networks-based system pharmacology23 has been proposed as an effective approach to understand the machinery of drug action and drug resistance in

- 16 -

ACS Paragon Plus Environment

Page 17 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

bacteria3,21. At present, the known types of drug resistance in bacteria include efflux pumps to transport drugs out of the cell, cytochromes-like enzymes that chemically modify the drug and horizontal gene transfer that imports a detoxifying protein from the environment3,21,24,25. For each drug resistance gene, there potentially exist multiple physical paths starting from a gene that drug acts on. In this work, we first map the 391 drug target genes obtained from22 and the 51 drug resistance genes obtained from3 onto the M. tuberculosis H37Rv PPI networks validated by our proposed method, and then search for the potential drug resistance pathways that start from drug target genes and end at a drug resistance gene via breadth-first graph search algorithm. The potential pathways for each drug resistance gene are provided in Supplementary file S6 (the text editor EditPlus is a preferred tool to read the files), by which we can gain insights of the drug resistance machinery of M. tuberculosis H37Rv genes Since most of the known drug resistance genes have large network degrees (see Figure 2), there potentially exist multiple drug resistance pathways for each drug resistance gene. For instance, the drug resistance gene polA(Rv1629), also a known drug target gene, has network degree 190, indicating that there are 190 paths passing through it. Computational results show that all the known 391 drug target genes have paths passing through the drug resistance gene polA(Rv1629) (see Supplementary file S6). We can see that a drug resistance gene generally acquires mutations to resist multi-drug actions on a group of drug target genes. For the sake of legibility, we only randomly choose 4 known drug target genes {Rv0120c(Rv0120c), acpP(Rv2244), hup(Rv2986c), leuA(Rv3710)} and 5 paths for each drug target gene to illustrate in

Figure 4. We can see that there are one or two intermediate genes (nodes in white) along the paths from the drug target genes to the drug resistance gene polA(Rv1629).

- 17 -

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 18 of 39

Take the drug target gene Rv0120c (Rv0120c) for example, the gene with network degree 310 happens to connect with the drug resistance gene polA(Rv1629) through 190 paths. Take the path {Rv0120c- Rv2150c- Rv1295- Rv1629} for example, the drug target gene Rv0120c fulfils the molecular functions of GTPase activity (GO:0003924) and translation elongation factor activity (GO:0003746), and is involved

in

sulfur

compound

metabolic

process

(GO:0006790)

(http://www.uniprot.org/uniprot/Rv2150c). The drug resistance gene polA(Rv1629) yields the product DNA polymerase that exhibits 3'-5' and 5'-3' exonuclease activity and is involved in the biological processes of DNA-dependent DNA replication (GO:0006261),

DNA

repair

(GO:0006281)

and

growth

(GO:0040007)

(http://www.uniprot.org/uniprot/P9WNU5). Between these two genes, there are two intermediate genes {acpP(Rv2244), hup(Rv2986c)}. The gene cpP(Rv2244) gets involved in the biological processes of acyl-CoA metabolic process (GO:0006637), growth

(GO:0040007)

and

lipid

A

biosynthetic

process

(GO:0009245)

(http://www.uniprot.org/uniprot/P9WQF3). The other gene hup(Rv2986c) is involved in the biological processes of chromosome condensation (GO:0030261), DNA protection

(GO:0042262),

growth

(GO:0040007),

nucleoid

organization

(GO:0090143) and positive regulation of transcription, DNA-templated (GO:0045893) (http://www.uniprot.org/uniprot/P9WMK7). We can see that this pathway plays critical roles in M. tuberculosis H37Rv cell cycle and cell life. Actually, the two intermediate genes {acpP(Rv2244), hup(Rv2986c)} are also known drug target genes (see Supplementary file S5), whose drug actions potentially induce the drug resistance of gene polA(Rv1629). Discovery of co-target genes. Analysis of the physical paths from multiple drug target genes to a drug resistance gene (referred to N targets to1-resitance) is useful to

- 18 -

ACS Paragon Plus Environment

Page 19 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

reveal the machinery how the drug resistance gene resists multi-drug actions imposed on multiple drug target genes. Similarly, analysis of the physical paths from a drug target gene to multiple drug resistance genes (referred to as 1 target to N resistances) is useful to reveal the mechanisms that multiple drug resistance genes jointly resist the drug action imposed on the drug target gene. Especially, some critical genes that are located on multiple drug resistance pathways can be treated as co-targets of the drug target gene to jointly counteract the drug resistance. For instance, the gene htpG(Rv2299c) occurs on 38 pathways from the target gene Rv0120c(Rv0120c) to the

known drug resistance genes. The pathways from a drug target gene to multiple drug resistance genes are provided in Supplementary file S7, based on which to calculate the number of pathways that the candidate co-target genes are located at (see Supplementary file S8). Take the drug target gene Rv0120c(Rv0120c) for example, we randomly choose 30 pathways that all pass through the same intermediate gene htpG(Rv2299c) to illustrate in Figure 5. We can see that the gene htpG denoted in light blue occupies a critical position on the pathways. Co-actions of drugs on the two genes {Rv0120c, htpG} potentially counteract the drug resistance and disrupt the signal transduction in M. tuberculosis H37Rv cell. The gene htpG yields product of molecular chaperone with

ATPase activity that is involved in the biological processes of cellular response to superoxide (GO:0071451) and protein folding (GO:0006457), partly similar to the drug target gene Rv0120c. Pharmacological risk analysis of drug target genes. Critical genes generally play essential roles in the normal life processes of bacterial cell. Suppression of critical bacterial genes with right medication is expected to rapidly and effectively kill bacterial pathogens and cure diseases. However, the drugs on bacterial critical genes

- 19 -

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

could simultaneously cause serious side effects to the host ortholog genes. If the host ortholog genes are also essential genes, the drug side effects may be fatal. If the host ortholog genes belong to host essential signaling pathways, the drugs on bacterial critical genes could also affect the host signaling system. Hence we need analyze the pharmacological risks of candidate and known M. tuberculosis H37Rv drug target genes. The host ortholog genes are obtained via PSI-Blast17 against SwissProt database16. The degree distributions of human ortholog genes are calculated via the physical PPI networks from HPRD26 and BioGrid27. Human immune signaling pathways are taken from NetPath27, which curates about 35 signaling pathways. For simplicity, we merge the 11 sub-types of Interleukin (IL-1 ~ IL-11) into one single signaling pathway denoted as IL-2, thus we obtain 27 pathways in total. The information of human ortholog genes and corresponding human immune signaling pathways is merged into Supplementary file S3 ~ S5. For the known drug target genes in SInCRe22, we illustrate their human ortholog genes and the corresponding immune signaling pathways in Figure 6 (A). Those drug target genes whose human ortholog genes are not assigned to any known signaling pathway are neglected. Drug that act on the MTB drug target genes denoted as red nodes potentially act on their human ortholog genes denoted as white nodes, and accordingly interfere with relative human immune signaling pathways. Take the MTB gene pstP(Rv0018c) for example, its human ortholog gene PPM1A is a signaling component of TGFBeta signaling pathway that gets involved in important biological processes of negative regulation of transcription from RNA polymerase II promoter (GO:0000122), protein dephosphorylation (GO:0006470), negative regulation of transforming growth factor beta receptor signaling pathway (GO:0030512), cell cycle

- 20 -

ACS Paragon Plus Environment

Page 20 of 39

Page 21 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

arrest (GO:0007050), etc. From the point of view of molecular functions, the MTB gene pstP(Rv0018c) and its human ortholog gene PPM1A fulfil the common functions of magnesium ion binding (GO:0000287), catalytic activity (GO:0003824), phosphoprotein

phosphatase

activity

(GO:0004721),

protein

serine/threonine

phosphatase activity (GO:0004722). Besides sequence homology and common functions, the two gene products pstP and PPM1A are also structurally similar. We conduct pairwise protein structure alignments

between

pstP

and

PPM1A

via

the

workbench

(http://www.rcsb.org/pdb/workbench/workbench.do)29. The structure alignment via RCSB PDB id3FXK and 2CM1 is illustrated in Figure 7. As shown in Figure 7, the protein pstP and PPM1A are well structurally aligned (see the colored parts of the structure cartoon and the corresponding sequence alignment) with alignment score 528.38 and Z-score 5.86. This result drops a hint that drug screening for MTB gene pstP(Rv0018c) needs to cautiously consider the potential side effects on its human

ortholog gene PPM1A. For all potential drug target genes, we need to minimize the drug side effects on their human ortholog genes and signaling pathways. Additionally, the human ortholog genes and signaling pathways of the MTB genes with network degree larger than 150 are illustrated in Figure 6 (B). Those candidate drug target genes whose human ortholog genes are not assigned to any known signaling pathway are also neglected. Highly connected genes in M. tuberculosis H37RvPPI networks are potentially candidate drug targets or co-targets, but we need make sure that the drug side effect on their human ortholog genes and signaling pathways is minimized. For instance, the MTB gene Rv2264c(Rv2264c) with the largest network degree 866 has human ortholog gene HSPA8 located on human signaling pathways {TGFBeta; TNF; AR; RANKL; TSH}. Suppression of gene

- 21 -

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Rv2264c could cause disruption of M. tuberculosis H37Rv PPI networks with a higher

probability, but drug design or screening needs to consider minimizing the pharmacological risks to its human ortholog gene HSPA8 and related signaling pathways. The gene HSPA8 plays important roles in cellular processes, such as regulation of cellular response to heat (GO:1900034), chaperone-mediated autophagy (GO:0061684), regulation of mRNA stability (GO:0043488), etc. Protein structure alignment could help us design or choose a proper inhibitor. Discussion M. tuberculosis H37Rv (MTB) has received much attention in recent years partly due

to its increasingly serious drug resistance. M. tuberculosis H37Rv PPI networks are an important infrastructure to study the machinery of drug resistance and discover novel drug target genes. However, the M. tuberculosis H37Rv PPI networks in the STRING database have been estimated to be of low quality. The related computational methods are generally neither based on experimental data nor validated against independent experimental data, so that the performance may not be convincing in real application scenarios as reported. In this work, we exploit the experimental MTB PPI data collected in11 via four solutions to validate the MTB PPIs in STRING7. Computational results show that the experimental data derived jointly by two-hybrid and co-purification approaches are the most reliable to train an l2-regularized logistic regression model for MTB PPI networks validation. The co-purification validated data is only a small fraction of the two-hybrid data, but achieves far better performance of cross validation and independent test. These results show that pure data accumulation does not necessarily increase the model performance and data quality is the most critical factor that determines the success or failure of computational modeling. The independent test - 22 -

ACS Paragon Plus Environment

Page 22 of 39

Page 23 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

results show that the relative small training data train a well generalized model with less risk of overfitting. Some readers may cast doubt on the small training data from co-purification technique. According to statistical learning theory30, small examples still could train a learnable model that well generalizes to unseen examples or patterns. Because of limited representative ability, small training data cannot represent the whole class distributions and is prone to narrow the purview of predictive model. The proposed model validates 45,295 PPIs from the 398,305 PPIs in STRING7 and 8,022 PPIs in11, indicating that the model is learnable and does not miss a significant fraction of interactions to a certain confidence level. Here we aim to minimize the risk of false interactions in STRING11, so the compromise to achieve high true positive rate at the sacrifice of a little higher false negative rate is still acceptable. From the viewpoints of computational methodology, this work directly exploits the experimental M. tuberculosis H37Rv PPI as training data rather than infers across-species interlogs as M. tuberculosis H37Rv PPIs3 or treats Ecoli PPIs as training data to predict M. tuberculosis H37Rv PPIs10. We use GO terms as predictive features that have been proven the most discriminative features of PPI prediction. To address the concern of GO sparsity, we conduct homolog knowledge transfer via homolog instances that effectively augment the training data and substitute for the target instances when the genes/proteins concerned are less studied. L2-regularized logistic regression is adopted in this work to reduce computational complexity and counteract the noise resulting from homolog knowledge transfer. The trained l2-regularized logistic regression model is then used to validate the MTB PPIs in STRING7 and the two-hybrid derived PPIs in11, and finally non-redundant 45,295 PPIs in total are obtained, much smaller than the inferred 398,305 PPIs in STRING. If the probability threshold for predicted interactions is

- 23 -

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

increased (see Supplementary file S1), the scale of the validated PPI networks will be further reduced with lower risk of false positive predictions; otherwise, we could obtain more MTB PPIs if we decrease the probability threshold for predicted non-interactions (see Supplementary file S2). The choice depends on our demand on quality of data quality or abundance of information. Based on the validated M. tuberculosis H37Rv PPI networks, we conduct further studies on the problems: (1)

discovery of novel drug target genes via analysis of network degree distributions; (2) discovery of drug resistance pathways via searching for the physical paths from drug target genes to drug resistance genes; (3) discovery of co-target genes via searching for the critical genes that are located on multiple drug resistance pathways. All these results could provide insights into the drug resistance machinery of M. tuberculosis H37Rv genes and drug target screening. Essential genes with large network degrees play critical roles in signaling pathways and regulatory networks, and thus drug actions on these genes could disrupt normal biological processes of M. tuberculosis H37Rv cell. However, the drug actions could potentially cause side effects to their human host ortholog genes. To estimate the pharmacological risks of known or candidate drug target genes, we integrate the validated M. tuberculosis H37Rv PPI networks, human physical PPI networks and immune signaling pathways. For each candidate or known drug target gene, we find out its mapping to human ortholog genes and the related signaling pathways. This mapping facilitates us to simultaneously consider the drug actions on drug target genes as well as the potential side effects on the human ortholog and signaling pathways. Protein structure alignment is applicable to detection of potential drug side effects. Highly similar structure similarity indicates that the drug acting on the drug target gene also potentially acts on the human ortholog gene. For instance, the known

- 24 -

ACS Paragon Plus Environment

Page 24 of 39

Page 25 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

drug target gene product pstP is structurally similar to the human ortholog gene product PPM1A, while PPM1A plays an important role in negatively regulating TGF-beta signaling through dephosphorylating SMAD2 and SMAD3. Hence we need carefully select or design drugs to minimize the pharmacological risks on gene PPM1A.

SUPPORTING INFORMATION: The following files are available free of charge at ACS website http://pubs.acs.org: File S1 Text file contains the computationally and experimentally validated M. tuberculosis H37Rv PPIs. (TXT) File S2 Text file contains the inferred PPIs in STRING7 and two-hybrid PPIs in11 that are computationally validated to be non-interactions. (TXT) File S3 Text file contains the overall degree distributions of M. tuberculosis H37Rv genes, corresponding human ortholog genes and human signaling pathways. (TXT) File S4 Text file contains the degree distributions of known drug target M. tuberculosis H37Rv genes, corresponding human ortholog genes and human signaling pathways. (TXT) File S5 Text file contains the degree distributions of known drug resistance M. tuberculosis H37Rv genes, corresponding human ortholog genes and human signaling pathways. (TXT) File S6 Text file contains the potential drug resistance pathways for each known drug resistance gene. (TXT) File S7 Text file contains the pathways from a drug target gene to multiple drug resistance genes. (TXT) File S8 Text file contains the candidate co-target genes of the known drug target genes. (TXT) Funding sources No funding sources were given. Author Contributions MS conducted the study and wrote the paper. All the authors reviewed the paper.

- 25 -

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Competing interests The authors declare that they have no competing interests. References 1. Koul, A.; Herget, T.; Klebl, B.; Ullrich, A. Interplay between mycobacteria and host signalling pathways. Nat. Rev. Microbiol. 2004, 2, 189-202. 2. Nunn, P.; Williams, B.; Floyd, K.; Dye, C.; Elzinga, G.; Raviglione, M. Tuberculosis control in the era of HIV. Nat. Rev. Immunol. 2005, 5, 819-826. 3. Raman, K.; Chandra, N. Mycobacterium tuberculosis interactome analysis unravels potential pathways to drug resistance. BMC Microbiol. 2008, 8, 234. 4. Cui, T.; Zhang, L.; Wang, X.; He, Z.G. Uncovering new signaling proteins and potential drug targets through the interactome analysis of Mycobacterium tuberculosis. BMC Genomics. 2009, 10, 118. 5. Cui, Z.J.; Yang, Q.Y.; Zhang, H.Y.; Zhu, Q.; Zhang, Q.Y. Bioinformatics Identification of Drug Resistance-Associated Gene Pairs in Mycobacterium tuberculosis. Int. J. Mol. Sci. 2016, 17 pii: E1417. 6. Melak, T.; Gakkhar, S. Maximum flow approach to prioritize potential drug targets of Mycobacterium tuberculosis H37Rv from protein-protein interaction network. Clin. Transl. Med. 2015, 4, 61. 7. Szklarczyk, D.; Franceschini, A.; Wyder, S.; Forslund, K.; Heller, D.; Huerta-Cepas, J.; Simonovic, M.; Roth, A.; Santos, A.; Tsafou, K.P.; Kuhn, M.; Bork, P.; Jensen, L.J.; von Mering, C. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 2015, 43 (Database issue), D447-52. 8. Zhou, H.; Rezaei, J.; Hugo, W.; Gao, S.; Jin, J.; Fan, M.; Yong, C.H.; Wozniak, M.; Wong, L. Stringent DDI-based prediction of H. sapiens-M. tuberculosis H37Rv protein-protein interactions. BMC Syst. Biol. 2013, 7, Suppl 6:S6. 9. Zhou, H.; Gao, S.; Nguyen, N.N.; Fan, M.; Jin, J.; Liu, B.; Zhao, L.; Xiong, G.; Tan, M.; Li, S.; Wong, L. Stringent homology-based prediction of H. sapiens-M. tuberculosis H37Rv protein-protein interactions. Biol. Direct. 2014, 9, 5. 10. Liu, Z.P.; Wang, J.; Qiu, Y.Q.; Leung, R.K.; Zhang, X.S.; Zhang, X.S.; Tsui, S.K.; Chen, L. Inferring a protein interaction map of Mycobacterium tuberculosis based on sequences and interologs. BMC Bioinformatics. 2012, 13, Suppl 7:S6. 11. Wang, Y.; Cui, T.; Zhang, C.; Yang, M.; Huang, Y.; Li, W.; Zhang, L.; Gao, C.; He, Y.; Li, Y.; Huang, F.; Zeng, J.; Huang, C.; Yang, Q.; Tian, Y.; Zhao, C.; Chen, H.; Zhang, H.; He, Z.G. Global protein-protein interaction network in the human pathogen Mycobacterium tuberculosis H37Rv. J. Proteome Res. 2010, 9, 6665-77. 12. Gonzalez, M.W.; Kann, M.G. Chapter 4: Protein interactions and disease. PLoS Comput. Biol. 2012, 8, e1002819. 13. Ben-Hur, A.; Noble, W. Choosing negative examples for the prediction of protein-protein interactions. BMC Bioinformatics. 2006, 7, S2. 14. Barrell, D.; Dimmer, E.; Huntley, R.P.; Binns, D.; O'Donovan, C.; Apweiler, R. The GOA database in 2009--an integrated Gene Ontology Annotation resource. Nucleic Acids Res. 2009, 37 (Database issue), D396-403. 15. Maetschke, S.; Simonsen, M.; Davis, M.; Ragan, M.A. Gene Ontology-driven - 26 -

ACS Paragon Plus Environment

Page 26 of 39

Page 27 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

inference of protein–protein interactions using inducers. Bioinformatics. 2012, 28, 69-75. 16. Boeckmann, B.; Bairoch, A.; Apweiler, R.; Blatter, M.C.; Estreicher, A.; Gasteiger, E.; Martin, M.J.; Michoud, K.; O'Donovan, C.; Phan, I.; Pilbout, S.; Schneider. M. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003, 31, 365-70. 17. Altschul, S.F.; Madden, T.L.; Schäffer, A.A.; Zhang, J.; Zhang, Z. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25, 3389-402. 18. Fan, R.; Chang, K.; Hsieh, C.; Wang, X.; Lin, C. LIBLINEAR: A Library for Large Linear Classification. Mach. Learn Res. 2008, 9, 1871-1874. 19. Kshirsagar, M.; Schleker, S.; Carbonell, J.; Klein-Seetharaman, J. Techniques for transferring host-pathogen protein interactions knowledge to new tasks. Front. Microbiol. 2015, 6, 36. 20. Strong, M.; Eisenberg, D. The protein network as a tool for finding novel drug targets. Prog. Drug Res. 2007, 64, 191-215. 21. Wong, L.; Liu, G. Protein Interactome Analysis for Countering Pathogen Drug Resistance. J. COMPUT. SCI. TECH-CH. 2010, 25, 124–130. 22. Metri, R.; Hariharaputran, S.; Ramakrishnan, G.; Anand, P.; Raghavender, U.S.; Ochoa-Montaño, B.; Higueruelo, A.P.; Sowdhamini, R.; Chandra, N.R.; Blundell, T.L.; Srinivasan, N. SInCRe-structural interactome computational resource for Mycobacterium tuberculosis. Database (Oxford). 2015, bav060. 23. Berger, S.; Iyengar, R. Network analyses in systems pharmacology. Bioinformatics. 2009, 25, 2466-2472. 24. Nguyen, L.; Thompson, C.J. Foundations of antibiotic resistance in bacteria physiology: The mycobacterial paradigm. Trends Microbiol. 2006, 14, 304-312. 25. Smith, P.A.; Romesberg, F.E. Combating bacteria and drug resistance by inhibiting mechanisms of persistence and adaptation. Nat. Chem. Biol. 2007, 3, 549-556. 26. Keshava Prasad, T.S.; Goel, R.; Kandasamy, K.; Keerthikumar, S.; Kumar, S.; Kumar, S.; Mathivanan, S.; Telikicherla, D.; Raju, R.; Shafreen, B.; Venugopal, A.; Balakrishnan, L.; Marimuthu, A.; Banerjee, S.; Somanathan, D.S.; Sebastian, A.; Rani, S.; Ray, S.; Harrys Kishore, C.J.; Kanth, S.; Ahmed, M.; Kashyap, M.K.; Mohmood, R.; Ramachandra, Y.L.; Krishna, V.; Rahiman, B.A.; Mohan, S.; Ranganathan, P.; Ramabadran, S.; Chaerkady, R.; Pandey, A. Human Protein Reference Database--2009 update. Nucleic Acids Res. 2009, 37(Database issue), D767-72. 27. Chatr-Aryamontri, A.; Breitkreutz, B.J.; Oughtred, R.; Boucher, L.; Heinicke, S.; Chen, D.;, Stark, C.; Breitkreutz, A.;, Kolas, N.; O'Donnell, L.; Reguly, T.; Nixon, J.; Ramage, L.; Winter, A.; Sellam, A.; Chang, C.; Hirschman, J.; Theesfeld, C.; Rust, J.; Livstone, M.S.; Dolinski, K.; Tyers, M. The BioGRID interaction database: 2015 update. Nucleic Acids Res. 2015, 43 (Database issue):D470-8. 28. Kandasamy, K.; Mohan, S.S.; Raju, R.; Keerthikumar, S.; Kumar, G.S.; Venugopal, A.K.; Telikicherla, D.; Navarro, J.D.; Mathivanan, S.; Pecquet, C.; Gollapudi, S.K.; Tattikota, S.G.; Mohan, S.; Padhukasahasram, H.; Subbannayya, Y.; Goel, R.; Jacob, H.K.; Zhong, J.; Sekhar, R.; Nanjappa, V.; Balakrishnan, L.; Subbaiah, R.; Ramachandra, Y.L.; Rahiman, B.A.; Prasad, T.S.; Lin, J.X.; Houtman, J.C.; Desiderio, S.; Renauld, J.C.; Constantinescu, - 27 -

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

S.N.; Ohara, O.; Hirano, T.; Kubo, M.; Singh, S.; Khatri, P.; Draghici, S.; Bader, G.D.; Sander, C.; Leonard, W.J.; Pandey, A. NetPath: a public resource of curated signal transduction pathways. Genome Biol . 2010, 11, R3. 29. Prlic, A.; Bliven, S.; Rose, P.W.; Bluhm, W.F.; Bizon, C.; Godzik, A.; Bourne, P.E. Pre-calculated protein structure alignments at the RCSB PDB website. Bioinformatics. 2010, 26, 2983-5. 30. Vapnik, V. An Overview of Statistical Learning Theory. IEEE Trans. Neural Netw. 1999, 10, 988–999.

Figure 1 ROC curves and AUC scores for 5-fold cross validation and independent test. (A~D) illustrates the performance of cross validation for the four solutions: STRING, two-hybrid, two-hybrid+co-purification and co-purification. (E) illustrates the performance of independent test on disjoint co-purification data. Figure 2 Overall network degree distributions in the validated M. tuberculosis H37Rv PPI networks. Dots denote M. tuberculosis H37Rv genes and edges denote interactions. Green dots denote the known drug target genes and red dots denote the known drug resistance genes. Figure 3 Network degree distributions of M. tuberculosis H37Rv genes. (A) illustrates the degree distributions of all M. tuberculosis H37Rv genes; (B) illustrates the degree distributions of known M. tuberculosis H37Rv drug target genes; (C) illustrates the degree distributions of known M. tuberculosis H37Rv drug resistance genes. Figure 4 Potential drug resistance pathways of the drug resistance gene polA(Rv1629). The node in red denotes the drug resistance gene polA(Rv1629), the nodes in green denote the known M. tuberculosis H37Rv drug target genes, and the nodes in white denote the intermediate genes.

- 28 -

ACS Paragon Plus Environment

Page 28 of 39

Page 29 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Figure 5 Potential drug co-target gene htpG(Rv2299c). The nodes in red denote the known drug resistance genes, the nodes in green denote the M. tuberculosis H37Rvdrug target genes Rv0120c(Rv0120c), the nodes in white denote intermediate genes, and the node in light blue denotes the potential co-target gene htpG(Rv2299c). Figure 6 Pharmacological risks analyses of known and candidate drug target genes from the point of view of system pharmacology.(A) illustrates the drug target genes from the SInCRe database22; (B) illustrates the candidate drug target genes with degree larger than 150 in the validated M. tuberculosis H37Rv PPI networks. The circle node in red denotes the M. tuberculosis H37Rv gene, the circle node in white denotes the human ortholog genes, and the round rectangle node in light green denotes the human immune signaling pathways.

Figure 7 Protein structure alignments between M. tuberculosis H37Rv protein pstP (Rv0018c) and human ortholog protein PPM1A.

Table 1 Results of 5-fold cross validation and independent test. combined-instance

homolog-instance

target-instance

STRING

Size SP

SE

MCC

SP

SE

MCC

SP

SE

Positive

32

0.8000

0.7500

0.6163

0.7600

0.7037

0.5694

0.8333

0.8000

0.6693

Negative

32

0.7419

0.7931

0.6123

0.7333

0.7857

0.5850

0.7727

0.8095

0.6570

MCC

[Acc; MCC; ROC-AUC]

[77.05%; 0.6132 ;.8125]

[74.55%; 0.5768;0.7950]

[80.43%; 0.6637;0.8124]

F1 Score

0.7142

0.7308

0.8163

Independent test

70.59% (+)

66.29% (-)

combined-instance

two-hybrid

homolog-instance

target-instance

Size SP

SE

MCC

SP

SE

MCC

SP

SE

MCC

Positive

8,022

0.5814

0.6020

0.3627

0.5780

0.6009

0.3605

0.5691

0.6375

0.3810

Negative

8,022

0.5879

0.5671

0.3528

0.5894

0.5663

0.3524

0.6338

0.5652

0.3783

[Acc; MCC; ROC-AUC]

[58.45%; 0.3576;0.6284]

[58.35%; 0.3563;0.6260]

[59.95%; 0.3771;0.6429]

F1 Score

0.6017

0.5892

0.6014

Independent test

70.59% (+)

61.71% (-)

- 29 -

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

co-purification+ two-hybrid

combined-instance

homolog-instance

Page 30 of 39

target-instance

Size SP

SE

MCC

SP

SE

MCC

SP

SE

MCC

Positive

1,257

0.7470

0.6963

0.5373

0.6985

0.6430

0.4841

0.7548

0.7725

0.5992

Negative

1,257

0.6819

0.7341

0.5292

0.6760

0.7285

0.5038

0.7639

0.7459

0.5944

[Acc; MCC; ROC-AUC]

[71.40%; 0.5321;0.7705]

[68.62%; 0.4934;0.7363]

[68.62%; 0.4934;0.8068]

F1 Score

0.7208

0.6646

0.7635

Independent test

94.12% (+)

73.14% (-)

combined-instance

homolog-instance

target-instance

co-purification Size SP

SE

MCC

SP

SE

MCC

SP

SE

Positive

121

0.9340

0.9706

0.9087

0.7667

0.8846

0.7735

0.9588

0.9894

MCC 0.9461

Negative

121

0.9706

0.9340

0.9087

0.9647

0.9213

0.8135

0.9878

0.9529

0.9457

[Acc; MCC; ROC-AUC]

[95.19%; 0.9081;0.9824]

[91.30%; 0.8390 ;0.9291]

[97.21%; 0.9456; 0.9912]

F1 Score

0.9519

0.8214

0.9739

Independent test

94.12% (+)

85.71% (-)

Table 2 Validation of M. tuberculosis H37Rv PPIs in STRING7 and two-hybrid PPIs in11.

Positive rate

Positive rate

neighborhood

neighborhood*

database

database*

text mining

text mining*

13.93 %

11.12 %

6.20 %

8.71 %

21.79 %

12.72 %

merged experiments

experiments*

fusion

co-occurrence homology

co-expression*

16.98 %

13.34 %

7.16 %

11.59 %

11.84 %

* denotes transferred

- 30 -

ACS Paragon Plus Environment

15.04 %

Page 31 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

for TOC only

- 31 -

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

- 32 -

ACS Paragon Plus Environment

Page 32 of 39

Page 33 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Figure 1 410x220mm (96 x 96 DPI)

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 2 69x64mm (300 x 300 DPI)

ACS Paragon Plus Environment

Page 34 of 39

Page 35 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Figure 3 405x212mm (96 x 96 DPI)

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 4 67x21mm (300 x 300 DPI)

ACS Paragon Plus Environment

Page 36 of 39

Page 37 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Figure 5 64x19mm (300 x 300 DPI)

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 6 93x58mm (300 x 300 DPI)

ACS Paragon Plus Environment

Page 38 of 39

Page 39 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Figure 7 133x120mm (300 x 300 DPI)

ACS Paragon Plus Environment