Deconvolution of Targeted Protein–Protein Interaction Maps - Journal

Article pubs.acs.org/jpr

Deconvolution of Targeted Protein−Protein Interaction Maps

Alexey Stukalov, Giulio Superti-Furga, and Jacques Colinge*

CeMM − Center for Molecular Medicine of the Austrian Academy of Sciences, AKH-BT 25.3, Lazarettgasse 14, A-1090 Vienna, Austria

Supporting Information

ABSTRACT: Current proteomic techniques allow researchers to analyze chosen biological pathways or an ensemble of related protein complexes at a global level by measuring physical protein−protein interactions with affinity purification mass spectrometry (AP-MS). Such experiments yield information-rich but complex interaction maps whose unbiased interpretation is challenging. Guided by current knowledge on the modular structure of protein complexes, we propose a novel statistical approach, named BI-MAP, complemented by software tools and a visual grammar to present the inferred modules. We show that the BI-MAP tools can be applied to data ranging from small and very detailed maps to large, sparse, and much noisier data sets. The BI-MAP tool implementation and test data are made freely available.

KEYWORDS: bioinformatics, systems biology, protein complex, AP-MS, interaction



Received: February 10, 2012
Published: June 25, 2012
J. Proteome Res. 2012, 11, 4102−4109 | dx.doi.org/10.1021/pr300137n
© 2012 American Chemical Society

INTRODUCTION

Protein complexes are molecular machines that carry out biological processes in living cells.1,2 Proteomics and molecular biology technologies3,4 have made possible large-scale mapping of physical interactions between proteins found in the same complexes, thereby depicting huge networks of protein−protein interactions (PPIs).5,6 The analysis of these networks has revealed that protein complexes are likely to be composed of smaller building blocks named protein complex modules, which are groups of proteins that are in strong physical association with each other.7,8 Such protein modules often correspond to subunits of larger protein complexes like the coat protein subcomplexes9 and the 19S proteasome.10 Hence, protein modules can be regarded as intermediate functional units that are combined to build the larger and more specialized protein complexes.

In targeted projects, PPIs are mapped to investigate a complete biological pathway or a set of related complexes.11−14 The two main methodologies to identify PPIs are the yeast-2-hybrid (Y2H) system15 and affinity purification mass spectrometry3,4,16 (AP-MS). While the Y2H system detects direct binary interactions between proteins, conventionally through expression in yeast, AP-MS measures protein association in complexes from a chosen cell type and organism, which better fits the concept of targeted medium-scale experiments that is our focus here. With the constant progress of proteomics techniques and the development of efficient sample preparation protocols,4,16−18 AP-MS-based medium-scale targeted interactome mapping has become a successful strategy. Nonetheless, the complex PPI maps that result from such experiments require unbiased and comprehensive analysis. In this study, as a significant step toward a mathematical prediction of protein complexes, we developed a sensitive and rigorous Bayesian analysis method aimed at inferring protein modules, which we complemented with analysis and visualization tools.

In AP-MS, interactions with a chosen (tagged) protein, the bait, are measured in pulldown experiments where the bait is copurified with its binding proteins, the prey (Figure 1A and 1B). Intuitively, one would assume that inferring modular structure and protein complexes is more difficult from large-scale collections of PPIs than from medium-scale PPI maps, but the opposite turns out to be true. In large-scale data sets, most of the prey proteins are baits in some other experiments, and a significant cross-coverage is thus achieved, which is exploited by adapted algorithms.7,19,20 In medium-scale data sets (10−500 baits), a large proportion of proteins are found as prey only, and this complicates data analysis substantially.

Other researchers have already investigated the AP-MS medium-scale data set analysis problem by taking spectral counts into consideration. For each bait protein of the data set, prey proteins are detected by mass spectrometry via the identification of a certain number of spectra. The spectral counts are collected in a so-called AP-MS matrix whose rows represent the prey and whose columns represent the baits (Figure 1C); such a matrix resembles a gene expression matrix (Figures 1C and 2A). Sardiu et al.13,21 have proposed to apply hierarchical clustering algorithms to the rows and the columns of the AP-MS matrix independently, thereby obtaining a better organized matrix that is manually annotated. Biclustering methods, originally introduced for gene expression microarrays, were designed to detect sets of genes that coexpress under a certain set of conditions as opposed to coexpressing under all the conditions. Recently, Choi et al.22 have adapted a statistical model named nested clustering23 to AP-MS data in order to automate cluster predictions. Nested clustering first groups baits into bait clusters depending on prey profile (AP-MS matrix column) similarities. Subsequently, the prey of each bait cluster are grouped essentially independently from the other bait clusters.

Figure 1. Generation of AP-MS data. (A) A bait (protein c) is genetically modified to be coupled to magnetic beads (top). Its direct interactor b and indirect interactors a and d are enriched in the affinity purification (bottom). (B) Example situation with 4 protein complexes. (C) Corresponding AP-MS matrix for proteins a, b, c, d, and g selected as baits in 5 separate experiments. Bait c retrieves prey a, b, and d; i.e., all the interactors of all the complexes containing c are identified. In practice, some interactors will be missed because of mass spectrometry sensitivity and variability. The numbers in the AP-MS matrix represent spectral counts, which are rough indicators of prey abundance. From the structure of the matrix, one can see that proteins c and a form a module as they are always observed or not observed simultaneously (same for a and d alone, and e,f and g,h as pairs). The spectral counts indicate the module abundances in a bait-dependent manner, and they are normalized by the protein sequence lengths in the statistical analysis.

Figure 2. BI-MAP algorithm. (A) The original AP-MS matrix is unordered (bait identifications indicated by a small red rectangle). (B) The AP-MS matrix ordered according to the BI-MAP maximum a posteriori partition features a block structure. We note in the expanded area (right) that the apparent abundance within a block is variable, with some positions even lacking protein detection (white). (C) Example BI-MAP partition corresponding to Figure 1 with bait and prey clusters, and average block abundance as estimated by the statistical model. Prey clusters are equivalent to modules. (D) Maximum posterior partition corresponding to the AP-MS matrix in (A) and (B). We note that some modules not containing any bait could be identified. (E) Prior probability versus likelihood. Maximum and nearly maximum a posteriori solutions are colored in red.

We reasoned that the biclustering model underlying the statistical analysis would benefit from integrating the notion of protein complex module, both to potentially better model the real physical entities and, in any case, as a convenient unit of data reduction to understand AP-MS data sets. Accordingly, we introduce a novel approach, which we name Bayesian inference of protein module a posteriori probabilities (BI-MAP), that is based on global features found in the AP-MS matrix (Figure 2B). Module predictions are made in a fully automatic manner, guided by statistical considerations, and we illustrate BI-MAP application on two real data sets11,13 and a collection of 100 different synthetic data sets.
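As a concrete illustration of the AP-MS matrix described in this Introduction (rows are prey, columns are baits, entries are spectral counts), the following minimal Python sketch assembles such a matrix from a list of pulldown records. The record layout, the function name `build_apms_matrix`, the choice to sum counts from repeated pulldowns of the same bait, and all counts are assumptions made for illustration; they are not taken from the BI-MAP implementation.

```python
from collections import defaultdict

def build_apms_matrix(records, seq_lengths=None):
    """Assemble an AP-MS matrix from (bait, prey, spectral_count) records.

    Returns (preys, baits, matrix) where matrix[i][j] is the spectral count
    of prey i in the pulldown(s) of bait j. If seq_lengths is given, counts
    are divided by the prey sequence length (a simple length normalization).
    Assumption: counts from repeated pulldowns of one bait are summed.
    """
    baits = sorted({bait for bait, _, _ in records})
    preys = sorted({prey for _, prey, _ in records})
    counts = defaultdict(int)
    for bait, prey, sc in records:
        counts[(prey, bait)] += sc
    matrix = []
    for prey in preys:
        scale = 1.0 if seq_lengths is None else 1.0 / seq_lengths[prey]
        matrix.append([counts[(prey, bait)] * scale for bait in baits])
    return preys, baits, matrix

# Hypothetical records loosely following Figure 1: bait c retrieves a, b, d.
records = [
    ("c", "a", 4), ("c", "b", 9), ("c", "c", 12), ("c", "d", 3),
    ("a", "a", 10), ("a", "c", 5),
]
preys, baits, matrix = build_apms_matrix(records)
for prey, row in zip(preys, matrix):
    print(prey, row)
```

Zero entries mark prey that were not detected with a given bait; these on/off patterns are what a block partition of the matrix tries to capture.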

EXPERIMENTAL SECTION

Implementation details, MCMC sampler specificities, and BI-MAP parameter values used on the diverse data sets are described in the Supporting Information (SI). TIP49a/b and autophagy data sets were taken from the supplementary materials of the corresponding publications.11,13

In comparing the BI-MAP and NestedCluster algorithms on the synthetic data, we ran NestedCluster with its default parameters and an increased number of iterations to ensure convergence (20 000 burn-in, 10 000 sampling, instead of 1000 for both). When reporting performance in Figure 5, we used the 5 best posterior probability partitions for each algorithm to have more data points (no qualitative difference was observed between these top 5 solutions).

On the autophagy data set, we tried to tune NestedCluster by changing the parameters of the nested clustering process, alpha, beta, and gamma, whose defaults are 1, by considering all the combinations of 0.5, 1, and 1.5. The other parameters of NestedCluster define noninformative priors and hence should not influence the final solution significantly; we did not change them. We increased the number of iterations (5000 burn-in, 5000 sampling).

The mapping of domain−domain physical interactions found in 3did24 (reported as pairs of Pfam domain ACs) to obtain putative protein−protein physical interactions was done using UniProt cross-references, which assign a protein to its Pfam domains.
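The domain-to-protein mapping described above can be sketched as follows. This is a minimal illustration, assuming the domain−domain interactions and the protein-to-Pfam assignments have already been loaded into plain Python containers; the function name `putative_ppis` and the identifiers `P1`, `PF_A`, etc. are hypothetical placeholders, not real UniProt or Pfam accessions, and the actual file parsing is not shown.

```python
from itertools import combinations

def putative_ppis(protein_domains, domain_pairs):
    """Derive putative physical protein pairs from domain-domain interactions.

    protein_domains: dict mapping a protein ID to the set of its Pfam domains.
    domain_pairs: iterable of (domain, domain) pairs reported as interacting.
    Two distinct proteins are paired if at least one domain of the first
    interacts with at least one domain of the second.
    """
    # Unordered pairs; a same-domain pair (d, d) becomes frozenset({d}).
    ddi = {frozenset(pair) for pair in domain_pairs}
    ppis = set()
    for p, q in combinations(sorted(protein_domains), 2):
        if any(frozenset((d, e)) in ddi
               for d in protein_domains[p]
               for e in protein_domains[q]):
            ppis.add((p, q))
    return ppis

# Hypothetical toy input.
protein_domains = {"P1": {"PF_A"}, "P2": {"PF_B", "PF_C"}, "P3": {"PF_D"}}
domain_pairs = [("PF_A", "PF_C"), ("PF_D", "PF_D")]
result = putative_ppis(protein_domains, domain_pairs)
print(result)
```

Only pairs of distinct proteins are reported here, so a domain that interacts with itself contributes a contact only when two different proteins both carry it.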



RESULTS AND DISCUSSION

Model Design

We take the operational definition of a protein complex module as a group of proteins associated with each other in every experiment, which is compatible with real modules or could alternatively be regarded as a natural way to decompose AP-MS data, as explained in the Introduction. In a properly reordered AP-MS matrix (Figure 2B), a module appears as a group of prey (rows) whose detection is materialized by blocks of positive spectral counts for certain baits (columns). Therefore, within the limits of the available data, predicting modules is equivalent to partitioning the AP-MS matrix rows and columns to obtain blocks, with each block having an on/off state and an estimated abundance (Figure 2C). Such a partition naturally relates prey clusters $P_i$, representing the modules, with groups of experiments (bait clusters $E_k$) where they coappear. It is worth noting that although a prey is assigned to one module only, a module can be present in several complexes; i.e., a prey protein can be present in several complexes.

As illustrated in Figure 2B, the spectral counts within one block come with variability,25 which motivates a statistical approach. Namely, the module inference problem is solved via a biclustering algorithm, the blocks being the biclusters, that determines a partition with maximum probability according to a statistical model. Since this model is rather complicated, although it relies on a limited number of intuitive concepts, we limit the level of detail reported here and provide full details in the SI.

The space of all possible row and column partitions, block on/off states, and abundances (Figure 2C), called the partition space for short, is explored through Bayesian inference to identify maximum posterior probability solutions (Figure 2D). Bayesian inference classically combines a likelihood function, integrating constraints we want to impose on the solution, with an appropriate prior distribution. The role of the prior is to restrict the inferred solutions to reasonable values when the AP-MS data are limited (to avoid overfitting), whereas for data sets where the AP-MS data are stronger its influence on the inferred solutions becomes marginal.

To define the model, we denote one possible solution, i.e., one point of the partition space, as a triple M = (C, G, A), with C the row and column partitions, G the block on/off states, and A the estimated abundance of the blocks (Figure 2C). By means of Bayes' formula, given an AP-MS matrix S, the posterior probability of a solution M is

$$P(M \mid S) = \frac{L(S \mid M)\,P(M)}{P(S)} \qquad (1)$$

where P(M) is the prior probability of M, and L(S|M) = P(S|M) is the likelihood of the data given an assumed model M. P(S) cannot be practically estimated and, most importantly, plays no role in identifying the optimal M because it does not depend on M. Therefore, the maximum a posteriori (MAP) solution is given by

$$M_{\mathrm{MAP}} = \arg\max_{M}\; L(S \mid M)\,P(M) \qquad (2)$$

and we only need to define the likelihood and the prior.

The prior probability P(M) relies on two distributions (Pitman−Yor) to assign probabilities to row and column partitions C, i.e., to block sizes, a third distribution (Bernoulli) for probabilities of the on/off states G, and a fourth distribution (log-normal) to define abundance A probabilities. By analyzing different 1- or 2-step purification AP-MS data sets, we identified fixed values for most prior parameters, leaving only one free parameter that controls the rate of false protein identification in the original MS data (detailed formulation and parameters in the SI).

The likelihood function L(S|M) is designed as the product of three functions implementing three constraints we want to impose on the inferred modules:

$$L = L_Q\, L_T\, L_D \qquad (3)$$

The first function $L_Q$ implements a quantitative constraint that imposes coherent apparent abundance of prey within a module as measured by the spectral counts (SCs) normalized by the protein sequence lengths. Two cases must be distinguished: the on-blocks, where SCs reflect the presence of the proteins, and the off-blocks, where the SCs reflect false positive protein identifications. Accordingly, we have

$$L_Q(S \mid M) = \prod_{B \in \{\text{on-blocks}\}} P_{\text{present}}(S_B \mid A_B) \prod_{B' \in \{\text{off-blocks}\}} P_{\text{false positive}}(S_{B'}) \qquad (4)$$

where $P_{\text{present}}(S_B \mid A_B)$ models the probability of observing the SCs in block B given the assumed abundance $A_B$ in this block (Lagrangian Poisson distribution26), and $P_{\text{false positive}}(S_{B'})$ models the probability of false positive identifications in block B' (geometric distribution); see the SI for full parametrization.

The second function $L_T$ implements a topological constraint that is responsible for a coherent partitioning of the data by checking that rows outside each prey cluster are significantly different from rows within, and the same for columns and bait clusters. $L_T$ is related to the socio-affinity score7 or its variants and is expressed as

$$L_T(S \mid M) = \prod_{r \in \{S\ \text{rows}\}} P_{\text{rowsim}}(r) \prod_{c \in \{S\ \text{columns}\}} P_{\text{colsim}}(c) \qquad (5)$$

with $P_{\text{rowsim}}$ a similarity measure that compares row r with all the other rows not contained in the same block as r, and likewise $P_{\text{colsim}}$ for the columns. $P_{\text{rowsim}}$ and $P_{\text{colsim}}$ are implemented as minima of Fisher's exact test P-values (see the SI).

The last function $L_D$ implements a constraint meant to enforce coherence with the experimental design. It measures the consistency between bait and prey clusters by enforcing that bait proteins predicted to be in the same module have similar prey and that bait proteins with similar prey are put in the same module:

$$L_D(S \mid M) = \prod_{E \in \{\text{bait clusters}\}} P_{\text{coherence}}(E) \prod_{F \in \{\text{prey clusters}\}} P_{\text{coherence}}(F) \qquad (6)$$

where the coherence of a bait or prey cluster is determined by considering the number of other clusters containing any of its proteins via a geometric distribution (see the SI).

In principle, eqs 3−6 with the prior definition are sufficient to obtain the MAP solution $M_{\mathrm{MAP}}$ according to eq 2, but the actual computation of $M_{\mathrm{MAP}}$ requires the exploration of the partition space, which is gigantic. Standard numerical methods for inferring probability distributions, such as Gibbs sampling, could not give satisfying results for a model of such complexity. We thus developed an optimized and problem-specific Markov Chain Monte Carlo (MCMC) sampler, which plays an important role in the performance results below and can run on a compute cluster (implementation details are provided in the SI). Finally, in the case of highly complex and noisy data sets, the MAP solution is no longer sufficient, and an additional stability analysis of the inferred modules must be performed, as we explain and illustrate in the second application hereafter.

Application to a Dense PPI Map

To illustrate BI-MAP performance, we first selected a data set by Sardiu et al.,13 who analyzed the chromatin remodeling complexes assembled around RUVBL1/2 (TIP49a/b) by performing 35 pulldowns with 27 distinct baits, thus obtaining a dense coverage of the complexes. In the original publication, TIP49a/b data were first filtered to remove nonspecific binders, resulting in 127 prey proteins, out of which 59 additional prey were discarded in a subsequent filtering step. We ignored the subsequent filtering to illustrate the ability of BI-MAP to work with noisier data. Sardiu et al. reported their results as subcomplexes, i.e., units of protein assembly between modules and full complexes. The BI-MAP maximal posterior probability solution (Figure 3A) is presented using a new visual grammar (Figure 3B) that intuitively summarizes modules, their relationship with baits, and bait relationships.

Figure 3. TIP49a/b data set. (A) Analysis results including proteins originally discarded by Sardiu et al. in their secondary filter depicted in gray. BI-MAP was able to predict two new modules enriched in members of the U5 snRNP and Chaperonin containing TCP1 complexes from these discarded proteins (arrows from [INO80D,INO80E,ACTR5,TFPT] colored in green). Nonspecific binders are nicely grouped into three modules containing discarded proteins only. The corresponding AP-MS matrix partition is provided as Table S1 (SI). (B) Visual grammar. Prey clusters, the modules BI-MAP infers, are denoted by blue ellipses. They constitute the elementary unit of decomposition for the BI-MAP analysis. Bait clusters, depicted by red ellipses, group baits with highly similar prey profiles. The detection of a module by a group of baits (a bait cluster) is materialized by an arrow (on-blocks in the partition). When a bait cluster is embedded in a prey cluster, it is displayed as a subset to limit the number of arrows. When a bait cluster is only partially embedded in a prey cluster, the included part is represented as a prey cluster subset and related to the rest of the bait cluster by an orange square. The square symbolizes the whole bait cluster.

Mapping BI-MAP modules onto the subcomplexes of Sardiu et al., we verified that we could cover all their predictions. Comparing the structures obtained, we naturally confirmed the core position of the [RUVBL1, RUVBL2] module, which is shared by all the other protein complexes. BI-MAP was able to identify two new prey-only modules that are enriched for proteins from known complexes: the chaperonin containing TCP1 complex and the U5 snRNP complex, which have a specific profile in Table S1 (SI), appearing in less than half the pulldowns (see also Figure S4 in the SI). Baits within the hINO80 complex were split into two bait clusters, the first one [INO80D, INO80E, ACTR5, TFPT] interacting with the newly predicted modules and the second one [ZNHIT4, ACTR8, INO80C] not. We further noticed that INO80D, INO80E, and TFPT were reported to interact with the INO80 NTD domain, whereas ZNHIT4, ACTR8, and INO80C were reported to interact with the INO80 HSA and Snf2 domains.27

In Sardiu et al., ZNHIT6 (FLJ20729) and LIN9 were proposed as attachments of the hINO80 complex, and NUFIP1 and DPCD as attachments of the Prefoldin complex. Our analysis suggests a more intricate architecture where DPCD, NUFIP1, and ZNHIT6 form a module that can be regarded as an attachment to both complexes, with LIN9 contributing as an additional, independent attachment to both as well. We also found that DNPK1 had a unique bait profile, and its involvement in nonhomologous end joining, telomeric stability, and transcription suggests it is potentially related to chromatin remodeling. Obvious nonspecific binders present in almost every pulldown were grouped into three modules according to their average abundance in the samples (Figure 3A, Table S1, and Figure S4 in the SI).
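Returning to the model of eqs 1−3 in Model Design, the MAP criterion is convenient to evaluate in log space, where the product $L_Q L_T L_D P(M)$ becomes a sum and P(S) can be dropped because it is the same for every candidate M. The sketch below only illustrates this selection rule over a handful of already scored candidates; the field names and all numerical values are hypothetical, and it does not reproduce the MCMC exploration of the partition space or the actual likelihood terms, which are detailed in the SI.

```python
def log_unnormalized_posterior(candidate):
    """log[L_Q * L_T * L_D * P(M)], i.e. log posterior up to the constant -log P(S)."""
    return (candidate["log_LQ"] + candidate["log_LT"]
            + candidate["log_LD"] + candidate["log_prior"])

def map_solution(candidates):
    """Return the candidate maximizing L(S|M) P(M), as in eq 2."""
    return max(candidates, key=log_unnormalized_posterior)

# Hypothetical scored partitions (values invented for illustration).
candidates = [
    {"name": "M1", "log_LQ": -120.0, "log_LT": -35.0, "log_LD": -8.0, "log_prior": -15.0},
    {"name": "M2", "log_LQ": -110.0, "log_LT": -40.0, "log_LD": -9.0, "log_prior": -22.0},
    {"name": "M3", "log_LQ": -125.0, "log_LT": -30.0, "log_LD": -7.0, "log_prior": -12.0},
]
best = map_solution(candidates)
print(best["name"], log_unnormalized_posterior(best))
```

In this toy example the candidate with the best quantitative fit (M2) is not selected, because its topological term and prior are weaker; this is the kind of trade-off between likelihood and prior that Figure 2E depicts.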

Application to a Large and Sparse PPI Map, Identification of Stable Modules

We selected a second example of much larger size and complexity published by Behrends et al.,11 who analyzed the human autophagy system. Their data set consisted of 105 pulldowns for 65 bait proteins and 2553 identified prey proteins. Using prey frequencies and spectral counts to stringently remove nonspecific binders, they reduced the number of prey to 409. We wanted to further demonstrate the ability of BI-MAP to deal with large and noisy data sets and hence only excluded the prey that appeared in a single pulldown with a spectral count less than 3, which left 2073 prey.

Considering the maximum posterior probability partition (see the SI), which contained 353 modules, we obtained initial evidence of the quality of the predicted modules by performing a GO enrichment analysis. Setting the significance threshold at 1% (hypergeometric test), we found 70 modules, comprising 403 proteins, with at least one biological process (BP) hit. Remarkably, the GO annotations assigned to distinct modules overlapped very little (P < 1.3 × 10−8, Kolmogorov−Smirnov, Figure 4A, Table S2 (SI)), thereby showing that BI-MAP was able to decompose the data set well enough to yield nonredundant units of function.

Figure 4. Performance with the autophagy data set. (A) GO BP analysis associated 70 modules with significant (P < 0.01) GO terms enrichment as represented in the heatmap. The overlap of GO BP terms in distinct clusters, indicating a desirable separation of functions among the modules, was significantly less than expected by chance. The histogram represents frequencies of overlap between 0, 1, 2, etc. modules (blue = randomized data, red = real data), goodness-of-fit P < 1.3 × 10−8 (Kolmogorov−Smirnov). (B) Average number of physical contacts per protein with another protein of the same module or subcomplex according to the 3did database.24 We note that BI-MAP stable modules perform better, independent of their size, compared to the maximum posterior probability distribution modules. The NestedCluster solutions obtained with default parameters and the best alternative parameter set were not significantly different. Random denotes sets of randomly selected proteins from the total data set with identical sizes as the inferred modules. In structures (modules or subcomplexes) of size smaller than or equal to 12, BI-MAP stable modules perform clearly better than the alternative methods. In larger structures, which become comparable in size with the number of interactions in a pulldown after nonspecific binder elimination, all the methods are comparable in performance, although the relative improvement over random modules is then limited, thus indicating less reliable predictions.

Behrends et al. were interested in describing a global landscape of autophagy-related protein interactions, and therefore their selection of baits covered a broad range of complexes. The PPI map was much sparser compared to TIP49a/b, and the modules of the maximum a posteriori solution should thus be interpreted as subcomplexes that delineate more accurate modules. This motivated the introduction of a module stability analysis to identify the core robust structures present in the data. We calculated the module stability index on the basis of all the partitions generated during MCMC sampling as the percentage of partitions in which any predicted module occurs as such or as a subset of another module (see the SI). On average, the stability of predicted modules for the maximum a posteriori partition was 46%. By setting a stability index threshold at 80% (similar results with values 70−95%, data not shown), we identified the stable cores of the sampled modules, which we further filtered by requiring that prey proteins present in these stable modules were observed in at least 3 pulldowns on average to ensure that enough supporting evidence was available. We obtained 150 predicted stable modules, containing 430 proteins, only 110 of which were reported as high-confidence interacting proteins in Behrends et al. (see the SI).

We found several well-known autophagy-related complexes that provided a second level of evidence of the inferred module accuracy, e.g., the AMPK complex, COPII (2 new/total 3), and the CCT complex (5/8). We also found potentially new relevant modules, e.g., [WBP2, TSR2, STBD1], which is detected in the pulldowns of Gamma-aminobutyric acid receptor-associated-like proteins (GABARAP, GABARAPL1/L2) and MAP1LC3A. WBP2 is reported to bind to ubiquitin-protein ligase NEDD4,28 which is also copurified with this putative protein module and was shown to play an important role in autophagy.11 See the SI for additional known and potentially new modules.

Finally, we wanted to obtain additional evidence of the correctness of our results by using an approach orthogonal to the function-based evidence collected so far. We checked for enrichment of physical protein contacts in the inferred modules by referring to the 3did database24 of domain−domain interactions extracted from PDB29 structures. A strong enrichment is observed in stable modules compared to the maximum a posteriori solution and random selections (ANOVA P < 2 × 10−16, see Figure 4B).
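The module stability index defined above (the percentage of sampled partitions in which a module occurs as such or as a subset of another module) can be sketched in a few lines of Python. The representation of a partition as a list of protein sets, the function name `stability_index`, and the toy protein names are assumptions for illustration; the exact procedure used by BI-MAP is described in the SI.

```python
def stability_index(module, sampled_partitions):
    """Percentage of sampled partitions containing `module` intact.

    module: iterable of protein names.
    sampled_partitions: list of partitions, each a list of sets of proteins
    (the prey clusters of one MCMC sample).
    A partition counts as a hit if some cluster equals the module or
    contains it as a subset.
    """
    module = frozenset(module)
    hits = sum(
        any(module <= frozenset(cluster) for cluster in partition)
        for partition in sampled_partitions
    )
    return 100.0 * hits / len(sampled_partitions)

# Hypothetical MCMC samples over four proteins.
partitions = [
    [{"A", "B", "C"}, {"D"}],
    [{"A", "B"}, {"C", "D"}],
    [{"A"}, {"B", "C", "D"}],
    [{"A", "B", "D"}, {"C"}],
]
score = stability_index({"A", "B"}, partitions)
print(score)  # module {A, B} is kept together in 3 of 4 samples
```

With the 80% threshold used in the text, the toy module above would not be retained as stable.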


Robustness of the Method

ter in this special case we tested alternative parameter sets, but we found its default parameters to be as good as the best alternative, which is a nice indication of robustness for this algorithm (see the Experimental Section and Figure 4B). To quantitatively characterize the performance of BI-MAP over a large range of possible data sets in the absence of a gold standard catalog, we decided to build a realistic generative model of AP-MS data. This procedure starts by creating a network of protein interactions between N abstract proteins and mixing weak and strong binding strengths. Protein complexes are subsequently identified from this network and assigned random abundances, which determine individual protein abundances (a given protein can belong to multiple complexes). Finally, a random selection of baits is operated, and the abundance of each possible prey determined for each pulldown before being converted into spectral counts to obtain a synthetic AP-MS matrix. All the details of the procedure are provided in the SI. We generated 100 reference models, containing 100 ≤ N ≤ 200 proteins each, by varying the generation parameters to cover a wide range of situations (see Figure S10 in the SI). Obviously, NestedCluster and BI-MAP target different structures: NestedCluster aims at predicting larger parts of protein complexes (subcomplexes) and is less strict since the prey clusters obtained for each bait cluster are essentially independent from each other, whereas BI-MAP aims at identifying stable and usually smaller structures (modules) that are preserved over all the bait clusters. Given an underlying reference model, we designed a method that determines the correct protein groups that each algorithm should ideally predict (Figure 5A and SI). Performance was measured in terms of true positive rate (TPR)

Analysis of the autophagy motivated the complementation of the initial BI-MAP algorithm with a stability analysis, which should be regarded as an integral part of BI-MAP. In this section, we want to characterize the robustness of the overall procedure. As expected, the stability analysis performed on TIP49a/b data yielded a very different picture compared to the autophagy data. The maximum a posteriori partition modules featured an average stability of 88% instead of 46%; i.e., these modules would be preserved after stability analysis. We further investigated inference robustness with respect to repeated analysis (BI-MAP is a stochastic algorithm), missing experiments (removed pulldowns), and the presence of nonrelated pulldowns. In general, we observed that stable modules are marginally affected by such perturbations (adjusted Rand index >0.98,30 see the SI for complete details). Another factor potentially impacting inference was the signal intensity, i.e., the average spectral count for proteins in a module. TIP49a/b did not provide enough statistics, and we performed this analysis on autophagy showing a modest correlation only. Namely, coherent module fingerprints, such as obtained after stability analysis, were strong enough even when the module was rather low abundant (see the SI). Comparison with Existing Methods and Synthetic Tests

It is possible to predict modules by ignoring the values in the AP-MS matrix, i.e., only distinguishing zeros from nonzero spectral counts and thus only taking the topology of the data set into account. As observed by previous authors,13,22 while such approaches have produced satisfying results in large data sets, they do not work well on a medium-scale for the reasons indicated above. In particular, it is important to be able to distinguish strong and specific interactions from nonspecific interactions, and hence to integrate semiquantitative information represented by spectral counts is crucial. Choi et al.22 tested the application of existing gene microarray biclustering algorithms to AP-MS matrices and reported numerous missed protein complexes thus motivating the development of AP-MS specific algorithms. Sardiu et al.13,21 proposed a way to structure data, leaving the inference problem to the user. NestedCluster,22 to our best knowledge, is hence the only algorithm comparable to BI-MAP using spectral counts. Considering the TIP49a/b data set, which represents a typical data set aimed at characterizing a few complexes precisely, we note that NestedCluster did not assign TIP49a and TIP49b to the same bait cluster (see Figure 2B in ref 22), although they form the core of the complex and they were grouped together by BI-MAP and Sardiu et al.13 Furthermore, NUFIP, DPCD, and ZnF-HIT2 were not assigned to their own bait cluster, though they were identified with reasonable spectral counts and BI-MAP assigns them correctly. For the autophagy data, comparing the average of physical protein contacts in the inferred complexes, we can observe in Figure 4B that BI-MAP performed much better on structures of small to medium size compared to NestedCluster and Behrends et al. high confident prey. On larger structures, performance was similar, which fits the design of NestedCluster and high confident prey filtering that are oriented toward larger structure predictions. 
Nonetheless, it is important to observe that in this case performance was closer to random predictions, thus suggesting much less reliable inferences for larger structures. The autophagy results are only indicative for large sparse data sets obviously, and to avoid penalizing NestedClus-

TPR = P(pair of proteins grouped together|correct grouping)

and true negative rate (TNR) was TNR = P(pair of proteins put apart| in different reference groups)

In Figure 5B, we observe that over the 100 different configurations considered, BI-MAP predictions were much more selective (TNR), which is a clear consequence of targeting smaller and stronger structures, and more sensitive (TPR). Since the synthetic data sets were rather small (≤200 proteins) we wanted to assess how better each algorithm performed compared to random predictions, which could have TP and TN rates far from zero in this case. We adjusted the TP and TN rates with respect to random partitions according to the formula ⎛ Z − E(Z) ⎞ Zadjusted = max⎜0, ⎟ ⎝ 1 − E(Z) ⎠

where Z is either the TPR or the TNR and E(Z) is the expected corresponding value for a random prediction.30 We see in Figure 5C that BI-MAP clearly outperforms random partitions, although the difference is not as spectacular as suggested by Figure 5B. The larger structures predicted by NestedCluster are closer to random predictions.
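The chance adjustment can be sketched in a few lines of Python. The function `adjust_rate` implements the formula directly. How E(Z) is obtained is an assumption on our part: the text only refers to ref 30 for the expected value under random prediction, so the sketch estimates it by Monte Carlo, shuffling the predicted labels among proteins (which preserves predicted group sizes). The helper names `true_negative_rate` and `expected_tnr` are hypothetical.

```python
import random
from itertools import combinations


def adjust_rate(z, expected_z):
    """Z_adjusted = max(0, (Z - E(Z)) / (1 - E(Z)))."""
    if not 0.0 <= expected_z < 1.0:
        raise ValueError("E(Z) must lie in [0, 1)")
    return max(0.0, (z - expected_z) / (1.0 - expected_z))


def true_negative_rate(predicted, reference):
    """Fraction of pairs in different reference groups that are predicted apart.

    Assumes both dicts have the same protein keys and that the reference
    contains at least two groups.
    """
    apart = [(a, b) for a, b in combinations(sorted(reference), 2)
             if reference[a] != reference[b]]
    kept_apart = sum(predicted[a] != predicted[b] for a, b in apart)
    return kept_apart / len(apart)


def expected_tnr(predicted, reference, n_shuffles=1000, seed=0):
    """Monte Carlo estimate of E(TNR) under random label shuffling (assumption)."""
    rng = random.Random(seed)
    proteins = sorted(predicted)
    labels = [predicted[p] for p in proteins]
    total = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(labels)  # random prediction with the same group sizes
        total += true_negative_rate(dict(zip(proteins, labels)), reference)
    return total / n_shuffles


# Toy example with hypothetical proteins A-E.
reference = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2}
predicted = {"A": "x", "B": "x", "C": "y", "D": "z", "E": "z"}
raw = true_negative_rate(predicted, reference)
baseline = expected_tnr(predicted, reference)
print(raw, baseline, adjust_rate(raw, baseline))
```

For this small example a random prediction already keeps roughly 80% of the "different group" pairs apart, which illustrates why raw rates on small data sets can look better than they are and why the adjustment matters.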

dx.doi.org/10.1021/pr300137n | J. Proteome Res. 2012, 11, 4102−4109


In addition to providing the ability to deal with noisy data sets, the stability analysis procedure integrated in BI-MAP confers robustness to the inferred protein complex decompositions. In particular, BI-MAP stable predictions are robust with respect to suppressed (missing) experiments or to the presence of unrelated pulldowns, which are difficult to avoid when new complexes are explored and wrong assumptions might lead to the inclusion of unrelated baits or the omission of a few relevant baits in a complex mapping study. The necessity of validating inference systems such as BI-MAP, despite the absence of multiple suitable reference data sets that would provide recent high-quality AP-MS data for well-known complexes, led us to validation methods exploiting external and independent orthogonal resources and to the development of a realistic generative model of synthetic AP-MS data sets. Starting with orthogonal validation methods, we showed in the autophagy data analysis that BI-MAP generated complex modules with significantly less GO annotation overlap between them than randomized data, thus indicating a meaningful functional decomposition of the data. Moreover, the mapping of protein domain physical contacts from the 3did database revealed a strong and significant increase of physical contacts within the predicted modules compared to random modules. Finally, the synthetic data allowed us to cover a wide range of complex structures (large/small, strong/weaker affinities), and BI-MAP achieved for almost all the modules a TPR of 60% and a TNR of 90% (recall 86%); see Figure 5B. We complemented our validation efforts with a comparison with NestedCluster22 and could show the superiority of BI-MAP results on both experimental and synthetic data. BI-MAP is a new tool that complements the arsenal of methods already proposed by others13,22 to analyze AP-MS data. 
The robustness of the obtained results, combined with their representation through automatic visual exports as modular networks or as spreadsheet tables, makes BI-MAP a routine tool that can be applied to virtually any PPI map. In particular, BI-MAP analysis has great potential to help in understanding structural changes in complexes under different conditions such as drug treatments or disease stages. BI-MAP analysis can reveal the module assembly dynamics, which, combined with module functional annotation (e.g., provided by GO), can uncover the induced changes in function activation. The BI-MAP software and synthetic data sets are freely available from the project website http://code.google.com/p/bi-map.

Figure 5. Synthetic tests. (A) Schematic of the approach. A reference model is generated from which a module level reference model is derived to estimate BI-MAP performance and a subcomplex level reference model is derived to estimate NestedCluster performance. The reference model is additionally used to generate a synthetic AP-MS matrix (containing spectral counts) that is provided as input data to the two algorithms. (B) BI-MAP and NestedCluster performance is reported in terms of true positive (TPR) and true negative (TNR) rates. (C) Same results but adjusted with respect to average random prediction performance.



CONCLUSIONS

We have introduced a novel model-driven statistical approach (BI-MAP) that integrates the concept of protein complex module and is designed for the analysis of small to medium size PPI maps.11−14 Beyond the inference of a most likely solution, BI-MAP includes a stability analysis procedure that restricts the inferred modules to stable parts strongly supported by the data. The application of BI-MAP to a data set of TIP49a/b-associated chromatin remodeling complexes13 yielded convincing new protein complexes and recapitulated known results accurately. The application to a large, unfiltered autophagy data set11 revealed a multitude of known and highly plausible new autophagy-associated complexes despite the size of the PPI map and its high noise content. This analysis further revealed interesting differences between the original analysis by Behrends et al. and the BI-MAP results, as well as their complementary nature. Behrends et al. were interested in pairwise interactions to map pathways, and we could not assign 299 of their high-confidence interactors to any nontrivial stable module, though we could assign 320 proteins they filtered out to stable modules, thereby enriching the original analysis by the detection of new protein complexes. This illustrates the value of applying BI-MAP to PPI maps independently of the original analysis goal, since there is a strong potential for complementary discoveries.



ASSOCIATED CONTENT

Supporting Information

Model details and implementation, TIP49a/b example, identification of stable modules, robustness of the method, synthetic tests, Figures S1−S10, and Tables S1 and S2. This material is available free of charge via the Internet at http://pubs.acs.org.



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Tel: +43 (0) 140160 70020. Fax: +43 (0) 140160 970030.

Notes

The authors declare no competing financial interest.


ACKNOWLEDGMENTS

The authors thank Dr. Christoph Baumann for valuable comments concerning the results of the autophagy data set stability analysis and Drs. Chris Soon Heng Tan and Kumaran Kandasamy for critical reading. A. Stukalov and J. Colinge are supported by a Bioinformatics Network (BIN III) grant of the GEN-AU program of the Austrian Ministry for Science and Research.

REFERENCES

(1) Alberts, B The cell as a collection of protein machines: preparing the next generation of molecular biologists. Cell 1998, 92, 291−294.
(2) Vidal, M; Cusick, M. E.; Barabasi, A. L. Interactome networks and human disease. Cell 2011, 144, 986−998.
(3) Puig, O; Caspary, F; Rigaut, G; Rutz, B; Bouveret, E; et al. The tandem affinity purification (TAP) method: a general procedure of protein complex purification. Methods 2001, 24, 218−229.
(4) Glatter, T; Wepf, A; Aebersold, R; Gstaiger, M An integrated workflow for charting the human interaction proteome: insights into the PP2A system. Mol. Syst. Biol. 2009, 5, 237.
(5) Rual, J. F.; Venkatesan, K; Hao, T; Hirozane-Kishikawa, T; Dricot, A; et al. Towards a proteome-scale map of the human protein−protein interaction network. Nature 2005, 437, 1173−1178.
(6) Li, S; Armstrong, C. M.; Bertin, N; Ge, H; Milstein, S; et al. A map of the interactome network of the metazoan C. elegans. Science 2004, 303, 540−543.
(7) Gavin, A-C; Aloy, P; Grandi, P; Krause, R; Boesche, M; et al. Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440, 631−636.
(8) Guruharsha, K. G.; Rual, J. F.; Zhai, B; Mintseris, J; Vaidya, P; et al. A protein complex network of Drosophila melanogaster. Cell 2011, 147, 690−703.
(9) Hughes, H; Stephens, D. J. Assembly, organization, and function of the COPII coat. Histochem. Cell Biol. 2008, 129, 129−151.
(10) Taverner, T; Hernandez, H; Sharon, M; Ruotolo, B. T.; Matak-Vinkovic, D; et al. Subunit architecture of intact protein complexes from mass spectrometry and homology modeling. Acc. Chem. Res. 2008, 41, 617−627.
(11) Behrends, C; Sowa, M. E.; Gygi, S. P.; Harper, J. W. Network organization of the human autophagy system. Nature 2010, 466, 68−76.
(12) Sowa, M. E.; Bennett, E. J.; Gygi, S. P.; Harper, J. W. Defining the human deubiquitinating enzyme interaction landscape. Cell 2009, 138, 389−403.
(13) Sardiu, M. E.; Cai, Y; Jin, J; Swanson, S. K.; Conaway, R. C.; et al. Probabilistic assembly of human protein interaction networks from label-free quantitative proteomics. Proc. Natl. Acad. Sci. U. S. A. 2008, 105, 1454−1459.
(14) Bouwmeester, T; Bauch, A; Ruffner, H; Angrand, P-O; Bergamini, G; et al. A physical and functional map of the human TNF-alpha/NF-kappa B signal transduction pathway. Nat. Cell Biol. 2004, 6, 97−105.
(15) Venkatesan, K; Rual, J. F.; Vazquez, A; Stelzl, U; Lemmens, I; et al. An empirical framework for binary interactome mapping. Nat. Methods 2009, 6, 83−90.
(16) Burckstummer, T; Bennett, K. L.; Preradovic, A; Schutze, G; Hantschel, O; et al. An efficient tandem affinity purification procedure for interaction proteomics in mammalian cells. Nat. Methods 2006, 3, 1013−1019.
(17) Breitkreutz, A; Choi, H; Sharom, J. R.; Boucher, L; Neduva, V; et al. A global protein kinase and phosphatase interaction network in yeast. Science 2010, 328, 1043−1046.
(18) Rees, J. S.; Lowe, N; Armean, I. M.; Roote, J; Johnson, G; et al. In vivo analysis of proteomes and interactomes using Parallel Affinity Capture (iPAC) coupled to mass spectrometry. Mol. Cell. Proteomics 2011, 10, M110.002386.
(19) Geva, G; Sharan, R Identification of protein complexes from coimmunoprecipitation data. Bioinformatics 2011, 27, 111−117.
(20) Schelhorn, S. E.; Mestre, J; Albrecht, M; Zotenko, E Inferring physical protein contacts from large-scale purification data of protein complexes. Mol. Cell. Proteomics 2011, 10, M110.004929.
(21) Sardiu, M. E.; Florens, L; Washburn, M. P. Evaluation of clustering algorithms for protein complex and protein interaction network assembly. J. Proteome Res. 2009, 8, 2944−2952.
(22) Choi, H; Kim, S; Gingras, A. C.; Nesvizhskii, A. I. Analysis of protein complexes through model-based biclustering of label-free quantitative AP-MS data. Mol. Syst. Biol. 2010, 6, 385.
(23) Rodriguez, A; Dunson, D. B.; Gelfand, A. E. The Nested Dirichlet Process. J. Am. Stat. Assoc. 2008, 103, 1131−1154.
(24) Stein, A; Céol, A; Aloy, P 3did: identification and classification of domain-based interactions of known three-dimensional structure. Nucleic Acids Res. 2011, 39, D718−D723.
(25) Lundgren, D. H.; Hwang, S. I.; Wu, L; Han, D. K. Role of spectral counting in quantitative proteomics. Expert Rev. Proteomics 2010, 7, 39−53.
(26) Johnson, N. L.; Kemp, A. W.; Kotz, S. Univariate Discrete Distributions; Wiley: Hoboken, NJ, 2005; Vol. xix, p 646.
(27) Chen, L; Cai, Y; Jin, J; Florens, L; Swanson, S. K.; et al. Subunit organization of the human INO80 chromatin remodeling complex: an evolutionarily conserved core complex catalyzes ATP-dependent nucleosome remodeling. J. Biol. Chem. 2011, 286, 11283−11289.
(28) Chen, H. I.; Einbond, A; Kwak, S. J.; Linn, H; Koepf, E; et al. Characterization of the WW domain of human yes-associated protein and its polyproline-containing ligands. J. Biol. Chem. 1997, 272, 17070−17077.
(29) Westbrook, J; Feng, Z; Jain, S; Bhat, T. N.; Thanki, N; et al. The Protein Data Bank: unifying the archive. Nucleic Acids Res. 2002, 30, 245−248.
(30) Hubert, L; Arabie, P Comparing partitions. J. Classif. 1985, 2, 193−218.
