Bait Compatibility Index: Computational Bait Selection for Interaction

The Bait Compatibility Index: Computational Bait Selection for Interaction Proteomics Experiments

Sudipto Saha, Parminder Kaur, and Rob M. Ewing*
Center for Proteomics and Bioinformatics, School of Medicine, Case Western Reserve University, Cleveland, Ohio 44106

Received March 23, 2010

Protein interaction network maps have been generated for multiple species, making use of large-scale methods such as yeast two-hybrid (Y2H) and affinity purification mass spectrometry (AP-MS). These methods take fundamentally different approaches toward characterizing protein networks, and the resulting data sets provide complementary views of the protein interactome. The specific determinants of the outcome of Y2H and AP-MS experiments, in terms of detection of interacting proteins, are, however, poorly understood. Here we show that a statistical model built using sequence- and annotation-based features of bait proteins is able to identify bait features that are significant determinants of the outcome of interaction proteomics experiments. We show that bait features are able to explain in part the disparities observed between Y2H and AP-MS constructed networks and can be used to derive the “bait compatibility index”, a numeric score that assesses the compatibility of bait proteins with each technology. Aside from understanding the bias and limitations of interaction proteomics, our approach provides a rational, data-driven method for prioritization of baits for interaction proteomics experiments, an essential requirement for future proteome-wide applications of these technologies.

Keywords: Protein-protein interactions • yeast two-hybrid • affinity-purification mass spectrometry • Bayesian model • interactome

* To whom correspondence should be addressed. Tel: 1-216-368-4380. Fax: 1-216-368-6846. Email: [email protected].

Journal of Proteome Research 2010, 9, 4972–4981. Published on Web 08/23/2010. 10.1021/pr100267t. © 2010 American Chemical Society

Introduction

Protein interactome mapping is a key component of systems-based approaches to understanding cellular function. Large-scale attempts to map the protein interactomes of model organisms have largely used two different but complementary strategies: affinity-purification mass spectrometry (AP-MS) and yeast two-hybrid (Y2H). AP-MS combines the specificity of antibody-based protein purification with the sensitivity of mass spectrometry and enables the characterization of protein complexes under approximately physiological conditions.1 On the other hand, Y2H-based methods identify binary physical protein interactions by detecting reconstitution of a split transcription factor via activation of reporter gene expression.2 Development of automated platforms for both of these approaches has enabled genome-wide analysis of protein interactions. High-throughput protein-protein interaction maps have been generated using both Y2H and AP-MS techniques for multiple species including yeast,3-7 nematode,8,9 and human.10,11

In contrast to Y2H, in which direct protein-protein interactions are identified, AP-MS groups proteins according to comembership of protein complexes. Networks derived from Y2H and AP-MS data have different topologies and, when combined in a local modeling framework, have been shown to provide improved models of protein complexes.12 Furthermore, it has been shown that Y2H and AP-MS are independently enriched with different “types” of interactions; for example, the Y2H binary interaction map is enriched for transient interactions between signaling molecules, whereas AP-MS data sets are relatively poor sources of direct binary protein interactions but represent comembership in protein complexes.13 Despite the importance of these observations, there has been little attempt to systematically classify interactions according to their detectability with each technology.

The underlying assumption in this study is that protein features play an important role in specifying the compatibility of proteins used as baits with Y2H or AP-MS and are therefore important determinants of the outcome of those experiments, in terms of detection of prey proteins. Sequence and annotation features have been widely used in computational models to predict protein-protein interactions. For example, a Bayesian framework was used to integrate multiple yeast interaction data sets with several types of genomic evidence and shown to be capable of prediction of known and novel membership in protein complexes.14 The power of such approaches is that by combining individually weak predictors such as protein colocalization or gene coexpression, a more powerful predictor can be assembled. In this study, we make use of a similar approach, not to predict individual protein-protein interactions but rather to predict the outcome of Y2H and AP-MS experiments, in terms of detection of prey proteins, based upon a set of features of the bait. The method described provides a bait-by-bait measure of compatibility with Y2H and AP-MS, which may be used to select baits for interaction proteomics experiments. This is envisaged as an important component of designing strategies to map complete protein interactomes.

Several studies have addressed the challenge of designing cost- and time-effective strategies for mapping complete interactomes. A “pay-as-you-go” strategy whereby network hubs are predicted from each successive interaction proteomics experiment and then used as baits themselves was proposed to maximize the efficiency in terms of the number of baits required to cover the interaction network.15 Alternatively, an approach that combines prioritization of binary interactions according to their probability of occurrence with pooling strategies was proposed to reduce the cost of covering the complete interactome.16 These authors stressed the need for multiple pass interaction screening to provide sufficient confidence and coverage and emphasized the importance of experimental design in future global studies of the interactome. A motivation behind our own study is to provide a rational, data-driven means of selection and prioritization of genes for use as baits in interaction proteomics experiments. With respect to yeast, for which multiple overlapping data sets have been generated, meta-analyses of the original data sets have enabled the delineation of higher quality subsets within the original data sets.13 It is clear from these meta-analyses that some classes of proteins make “good” baits, in that they enable detection of biologically relevant interactions, whereas others do not. It is not clear, however, what the underlying biases of AP-MS and Y2H really are and similarly unclear what impact those biases have on the outcome of interaction proteomics experiments in combination with the chosen bait(s).

We explored these problems within a simple conceptualization of interaction proteomics experiments as either yielding prey proteins (“successful” experiments) or not yielding preys (“unsuccessful” experiments). Each combination of bait protein, data set (species, laboratory, etc.), and technology (AP-MS, Y2H) can then be classed in this way as successful or unsuccessful. In order to predict these experimental outcomes, we compute the Bait Compatibility Index (BCI) based upon Bayesian models built using sequence and annotation-based features of each bait. We identify specific biases of Y2H and AP-MS in terms of the correlation between experimental outcomes and bait protein features. We show that there is a significant dependence of experimental outcome on bait features, in particular for AP-MS experiments, and that the calculated BCI score is able to predict experimental outcome at over 70% accuracy. Further, although coverage in human interaction proteomics data sets is much less than for yeast, we demonstrate how the same models may be applied in human, suggesting that combined bait-technology features are important determinants of the outcome of interaction proteomics experiments across studies.

Methods

Interaction Proteomics Data Sets. The data sets used in the study are listed in Table 1. A data set suitable for training the Bayesian models was constructed by integrating large-scale yeast Y2H (ref 13) and AP-MS (ref 7) data sets (data sets D1 and D2, respectively; Table 1). This superdata set consists of the intersecting set of genes from the two studies (i.e., those genes that were used as baits in both D1 and D2) (Supplementary Table S1). The outcome for each tested bait was recorded as successful (if any prey proteins are detected) or unsuccessful (if no preys are detected). For simplicity and because the required information is not always available, both binding- and activation-domain fusions in Y2H experiments were classed as baits. An additional yeast AP-MS data set, data set D3, and a nematode Y2H data set, data set D4, were used to extend the analysis of bait features. Finally, several human data sets (D5-D7) were used for testing the model and for further validation.

Table 1. Interaction Proteomics Data Sets Used in the Study^a

data set   technology   organism   total baits   successful baits   unsuccessful baits   ref
D1         Y2H          yeast      5796          2018               3778                 13
D2         AP-MS        yeast      4562          2357               2205                 7
D3         AP-MS        yeast      6466          1993               4473                 6
D4         Y2H          worm       10000         2528               7472                 9
D5         Y2H          human      6851          1482               5369                 10
D6         AP-MS        human      385           326                59                   11
D7         AP-MS        human      75            75                 0                    19

^a Data sets were utilized as follows: Bayesian model training and testing (D1, D2), term enrichment analysis (D1-D4), and validation in human data sets (D5-D7).

Bait Feature Vectors. Sequence, annotation features, and protein abundance (Table 2) for each bait were formulated into a vector, hereafter called the feature vector. “Term” is used hereafter to refer to the individual annotations corresponding to a given feature. Thus a protein may be annotated with a term (e.g., “membrane”) corresponding to a given feature (e.g., “subcellular location”). Missing data (where there is no term available for a bait/feature combination) were recorded as “unknown”. Pathway annotations (F6) were recorded as binary “present” and “absent” since overall representation is low (946 out of 4135 baits annotated with 106 unique pathways). The abundance of proteins in yeast, as measured in a large-scale study,20 ranges from approximately 10^1 to over 10^6 molecules per cell. To use this information in our Bayesian model, we binned the abundance measures into 7 groups (by order of magnitude) (Supplementary Table S2). Additive smoothing was performed on the data for assignment of non-zero probabilities to terms that do not occur in the training set.

Table 2. Annotation Features (F1-F7) and Sources (UniProtKB release 14.5, Kegg release 48.0) Used to Construct Feature Vectors for Each Bait

feature   description                          source                 ref
F1        post-translational modification      UniProtKB              17
F2        subcellular location                 UniProtKB              17
F3        prosite motifs                       GenomeNet-Kegg         18
F4        gene ontology biological process     UniProtKB              17
F5        gene ontology molecular function     UniProtKB              17
F6        pathway                              GenomeNet-Kegg         18
F7        abundance                            Ghaemmaghami et al.    20

Three criteria were applied in the selection of the seven features ultimately used in the model (Table 2). First, features were selected that are known as important mediators of protein interactions (e.g., domain/motif, molecular function). Second, since the goal is to provide predictions for baits on a genome-wide scale, coverage of the features across the sets of baits in the data sets was an important selection criterion. Third, only the set of features that exhibited minimal dependence upon each other (data not shown) were evaluated so that they could be appropriately modeled using the naïve Bayesian model.

The frequencies of the annotated terms for successful and unsuccessful baits are provided in Supplementary Table S3.
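The two preprocessing steps described above, binning abundance by order of magnitude and additive smoothing of term frequencies, can be sketched roughly as follows. This is only an illustration: the exact bin edges are those of Supplementary Table S2 and are not reproduced here, so the edges in `abundance_bin` are an assumption, as are the pseudocount `alpha = 1` and both function names.

```python
from collections import Counter


def abundance_bin(molecules_per_cell):
    """Assign an abundance to one of 7 order-of-magnitude bins (0 to 6).

    Assumed bin edges: bin 0 is below 10 molecules per cell, bin k covers
    [10**k, 10**(k + 1)), and bin 6 collects everything at or above 10**6.
    A missing measurement is recorded as "unknown", as in the Methods.
    """
    if molecules_per_cell is None:
        return "unknown"
    k = 0
    while k < 6 and molecules_per_cell >= 10 ** (k + 1):
        k += 1
    return k


def smoothed_probabilities(term_counts, vocabulary, alpha=1.0):
    """Estimate P(term | outcome class) with additive smoothing.

    alpha is an assumed pseudocount; terms in the vocabulary that were never
    seen in this class still receive a small non-zero probability.
    """
    total = sum(term_counts.get(term, 0) for term in vocabulary)
    denominator = total + alpha * len(vocabulary)
    return {term: (term_counts.get(term, 0) + alpha) / denominator
            for term in vocabulary}


# Toy counts (not from the paper): terms seen among successful baits.
counts = Counter({"membrane": 3, "nucleus": 1})
probs = smoothed_probabilities(counts, ["membrane", "nucleus", "cytoplasm"])
```

With these toy counts, the unseen term "cytoplasm" receives probability 1/7 rather than zero, which is the property the smoothing step is meant to guarantee.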

Naïve Bayesian Model. The overall strategy for computation and analysis of the statistical model is shown in Figure 1A. The naïve Bayes model calculates posterior probabilities for a given hypothesis (successful/nonsuccessful bait) assuming that the features that describe data instances are conditionally independent.21 The posterior probability of a bait being successful in a given experiment for a given feature vector is calculated as follows:

\[
P(h \mid F) = \frac{P(F \mid h)\,P(h)}{P(F)} = \frac{P(h)\prod_{i=1}^{n} P(f_i \mid h)}{P(F)} \tag{1}
\]

where h indicates one of the two hypotheses: bait being (i) successful (h = 1) or (ii) unsuccessful (h = 0) in having one or more preys; F (f1, f2, ..., fn) represents the feature vector, where n is the number of features under consideration; P(h) denotes the prior probability of a bait being successful or unsuccessful (assumed to be 0.5 in both the cases); P(F|h) represents the probability of observing feature values in F when hypothesis h is true; P(F) denotes the marginal probability of observing feature vector F. Each bait in our study was assigned a bait compatibility index score for each model (i.e., each bait was assigned a Y2H score and an AP-MS score), which represents the likelihood of a successful outcome for a given bait with each technology and is defined as follows:

\[
\text{bait compatibility index}\,(F) = \log_2 \frac{P(h = 1 \mid F)}{P(h = 0 \mid F)} \tag{2}
\]

Substituting the terms from eq 1 in eq 2,

\[
\text{bait compatibility index}\,(F) = \log_2 \frac{P(h = 1)\prod_{i=1}^{n} P(f_i \mid h = 1)}{P(h = 0)\prod_{i=1}^{n} P(f_i \mid h = 0)} \tag{3}
\]

Since the prior probabilities (P(h = 1) and P(h = 0)) are equal (0.5), eq 3 becomes

\[
\text{bait compatibility index}\,(F) = \log_2 \frac{\prod_{i=1}^{n} P(f_i \mid h = 1)}{\prod_{i=1}^{n} P(f_i \mid h = 0)} \tag{4}
\]
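Equations 2 and 4 can be illustrated with a short sketch that sums per-feature log2 likelihood ratios. The function names and the toy probabilities are hypothetical. For a feature carrying several terms, the sketch takes the geometric mean of the term probabilities within each outcome class, which is one reading of the multi-term rule given in this section and should be treated as an assumption. The helper `posterior_success` follows from eq 2 together with P(h = 1|F) + P(h = 0|F) = 1.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def bait_compatibility_index(bait_terms, p_success, p_failure):
    """log2 likelihood ratio of eq 4 for a single bait.

    bait_terms: {feature: [term, ...]} annotations of the bait.
    p_success, p_failure: {feature: {term: P(term | h)}} for h = 1 and
        h = 0, assumed to be smoothed so that every probability is > 0.
    """
    score = 0.0
    for feature, terms in bait_terms.items():
        like_1 = geometric_mean([p_success[feature][t] for t in terms])
        like_0 = geometric_mean([p_failure[feature][t] for t in terms])
        score += math.log2(like_1) - math.log2(like_0)
    return score


def posterior_success(bci):
    """P(h = 1 | F) recovered from the BCI under equal priors (eq 2)."""
    odds = 2.0 ** bci
    return odds / (1.0 + odds)


# Toy term probabilities (not from the paper).
p_success = {"location": {"membrane": 0.1, "nucleus": 0.4}, "abundance": {3: 0.5}}
p_failure = {"location": {"membrane": 0.4, "nucleus": 0.1}, "abundance": {3: 0.25}}

bci_nuclear = bait_compatibility_index(
    {"location": ["nucleus"], "abundance": [3]}, p_success, p_failure)
bci_membrane = bait_compatibility_index(
    {"location": ["membrane"], "abundance": [3]}, p_success, p_failure)
```

In this toy setting the nuclear bait scores log2(4) + log2(2) = 3 and the membrane bait scores log2(1/4) + log2(2) = -1, so a positive score favors a successful outcome and a negative score an unsuccessful one.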

Equation 4 can be interpreted as the log likelihood ratio statistic; the numerator represents the probability of observing the feature vector (F) if the bait was successful, while the denominator indicates the probability of observing F if the bait was unsuccessful. A higher value of the BCI indicates higher likelihood of a bait being successful for a given experiment. Data sets D1 and D2 were divided into training and testing sets. Naïve Bayes models for all combinations of the features were computed for AP-MS and Y2H data sets (Figure 1A). The optimal combinations of features for Y2H and AP-MS that were used to create the final models were selected on the basis of performance measures as described below. The feature vectors and frequencies of terms are used as follows in computing the posterior probability of successful (h = 1) and unsuccessful (h = 0) outcomes for each protein. For features that may have multiple terms per protein (such as motifs, biological processes or functions), the geometric mean of the frequency term values is used in the feature vector. The feature vectors are then used to compute the posterior probabilities of success and failure for each protein using the AP-MS and Y2H models, and the bait compatibility index (BCI) is computed as the log ratio of these values (eq 4).

Figure 1. Overview of data processing workflow. (A) Training and testing of Bayesian model to predict success according to features of each bait. (B) Identification of annotation terms significantly enriched in successful and unsuccessful baits.

Model Validation and Performance Measures. Five-fold cross validation methods22 were used to evaluate the performance of all models, and the overall performance of a model was calculated as the average performance over the five sets. Three performance measures, sensitivity (the percent of correctly predicted successful baits), specificity (the percent of correctly predicted unsuccessful baits), and accuracy (the proportion of overall correctly predicted baits),23 were calculated as follows:

\[
\text{sensitivity} = \frac{TP}{TP + FN} \times 100
\]

\[
\text{specificity} = \frac{TN}{TN + FP} \times 100
\]

\[
\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \times 100
\]

where TP and FN refer to true positives and false negatives, and TN and FP refer to true negatives and false positives.

Term Enrichment Analysis. To identify feature terms that are unevenly represented in successful and unsuccessful bait classes, a two-tailed Fisher’s exact test was performed by constructing a 2 × 2 contingency table for the frequency of each term (i.e., number of bait proteins annotated with and without the given term) across the successful and unsuccessful classes in data sets D1-D4. The p-values were corrected for multiple hypothesis testing, and q-values were calculated using the bootstrap method.24 The set of terms significant (p-value ≤ 0.05) in one or more of the four data sets were further filtered by removing redundant terms (terms that co-occur across the sets of baits) and are shown in Supplementary Table S4.
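The 2 × 2 contingency-table test described above can be sketched with the standard library alone. This is an illustrative implementation of a two-tailed Fisher's exact test (summing the probabilities of all tables no more likely than the observed one), not the software used in the study; the function name and the counts are made up, and the q-value step is not shown.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test p-value for the table [[a, b], [c, d]].

    a: successful baits with the term     b: successful baits without it
    c: unsuccessful baits with the term   d: unsuccessful baits without it
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum over every table that is no more probable than the observed one;
    # the small relative tolerance guards against floating-point ties.
    p_value = sum(p for p in (prob(x) for x in range(lowest, highest + 1))
                  if p <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)


# Made-up counts for one term: 8 of 10 successful baits carry the term,
# compared with 1 of 6 unsuccessful baits.
p_term = fisher_exact_two_tailed(8, 2, 1, 5)
```

For these counts the p-value is 400/11440, about 0.035, so the term would pass the 0.05 threshold in this toy data set.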

Results and Discussion

Bait-Centric Models for Predicting the Outcome of Interaction Proteomics Experiments. Figure 1 outlines the data analysis work-flows to construct predictive Bayesian models for interaction proteomics experiments (Figure 1A) and to identify annotation terms that are significantly associated with the outcome of interaction proteomics experiments (Figure 1B). The starting point for the analysis is the construction of the annotation table, comprising the unified set of baits from data sets D1 and D2, and all of their respective feature annotations and experimental outcomes (successful/unsuccessful). Preliminary analysis of several fundamental bait attributes (e.g., length of ORF) suggested that they were not significant determinants of the outcome of the interaction experiments (data not shown), and they were excluded from further analysis. The set of seven features used to train the Bayesian model are listed in Table 2 and comprise a mixture of “low” level sequence-specific features (motif, post-translational modification) and higher-level features such as biological functions, pathways, and cellular abundance. To identify the best performing models for Y2H (data set D1) and AP-MS (data set D2), the performance of each combination of the seven features (127 combinations) was tested using 5-fold cross validation in a superdata set consisting of 4135 proteins (Supplementary Table S5). Of this set of 4135 proteins, 38% and 62% were successful and unsuccessful according to data set D1 (Y2H), and 52% and 48% were successful and unsuccessful according to data set D2 (AP-MS). Receiver operating characteristic (ROC) curves were used to assess the performance of the optimal models (best feature combinations) as shown in Figure 2. The area under the curve (AUC) is an important index of the overall accuracy of the models (0.76 for AP-MS and 0.66 for Y2H). Maximal accuracy (71.25%; sensitivity = 75.81%, specificity = 66.30%) for the AP-MS data set was achieved with four features (F1, F2, F6, F7), and maximal accuracy (63.17%; sensitivity = 70.83%, specificity = 58.48%) for the Y2H data set was achieved with two features (F5, F6). The cellular abundance was found to be the single best predicting feature, especially for AP-MS experimental outcomes (66.19% and 52.38% accuracy for AP-MS and Y2H, respectively). The optimal model for AP-MS is composed of four features (PTMs, subcellular localization, pathway, and abundance), whereas the optimal Y2H model uses only molecular function and pathway. This difference reflects the underlying requirements of each technology; because AP-MS relies on recovery of endogenous complexes, there is a much stronger dependence upon cellular localization and abundance as compared to Y2H.

Figure 2. ROC plot of AP-MS (solid) and Y2H (dotted) optimum models in yeast interaction data sets. A four-feature combination (F1, PTMs; F2, subcellular location; F6, pathway; F7, abundance) achieved maximal accuracy for AP-MS, whereas two features (F5, GO molecular function; F6, pathway) achieved maximal accuracy in the Y2H data set.

The bait compatibility index, defined as the log2 ratio of the probability of a successful outcome to that of an unsuccessful outcome (Figure 1), was calculated for all yeast ORFs for each of the two data sets D1 and D2. The distributions of these scores for the successful and unsuccessful sets of baits are shown as box plots in Figure 3.
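Two of the figures quoted above can be checked with a few lines of arithmetic. The sketch below enumerates the non-empty subsets of seven features (2^7 - 1 = 127) and recomputes overall accuracy as the class-weighted mean of sensitivity and specificity, which follows from the definitions in the Methods. The function names are hypothetical, and the class fractions (52% and 38% successful) are the rounded values from the text, so agreement is only expected to within rounding.

```python
from itertools import combinations

FEATURES = ["F1", "F2", "F3", "F4", "F5", "F6", "F7"]


def nonempty_subsets(features):
    """Yield every non-empty combination of the given features."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


def performance(tp, fp, tn, fn):
    """Sensitivity, specificity, and accuracy (all in percent)."""
    return {
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "accuracy": 100.0 * (tp + tn) / (tp + fp + tn + fn),
    }


def accuracy_from_rates(sensitivity, specificity, fraction_successful):
    """Accuracy implied by sensitivity, specificity, and class balance."""
    return (sensitivity * fraction_successful
            + specificity * (1.0 - fraction_successful))


n_models = sum(1 for _ in nonempty_subsets(FEATURES))
ap_ms_accuracy = accuracy_from_rates(75.81, 66.30, 0.52)  # reported: 71.25
y2h_accuracy = accuracy_from_rates(70.83, 58.48, 0.38)    # reported: 63.17
```

Both recomputed accuracies agree with the reported values to within rounding of the class fractions, which suggests the reported sensitivity, specificity, and accuracy are mutually consistent.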
For both Y2H and AP-MS, the scores for successful baits are significantly higher than for unsuccessful baits (Student’s t test; p < 2.2 × 10^-16), showing that the bait compatibility index is a useful predictor of baits that have a higher probability of yielding prey proteins. Although the BCI score distributions for successful and unsuccessful baits are overlapping, the outcome for a significant fraction of the baits can be predicted with high accuracy. As described above, the overall accuracy for the AP-MS model is 71.25%. However, we note that for baits with either high or low BCI scores, as shown in Figure 3B, the BCI score is a good predictor of experimental outcome; for approximately 25% of the data set (>1000 baits), the accuracy of prediction is ∼90%.

Figure 3. Bait score distributions distinguish successful and unsuccessful baits in yeast interaction proteomics experiments. (A) Yeast Y2H, data set D1. (B) Yeast AP-MS, data set D2. Successful bait sets have statistically significantly higher bait scores (Student’s t test; p < 2.2 × 10^-16) in (A) and (B) (the band in each box represents the median values).

Bait Feature Analysis. In order to better understand the specific biases of the interaction proteomics data sets and techniques, we identified bait features that are statistically associated with successful and nonsuccessful interaction proteomics experiments. A set of 494 significantly enriched terms (Supplementary Table S5) were selected for further analysis (terms with p-value < 0.05 according to Fisher’s exact test in one or more of data sets D1-D4 were classed as significant). To aid analysis and identify trends within these significant terms, sets of redundant terms (co-annotated to the same sets of proteins) were collapsed into a smaller nonredundant set of 391 terms. The terms and their relative enrichment (ratio of term occurrence in successful to unsuccessful baits) across the data sets are visualized as a heat-map in Figure 4. Clustering of the terms reveals patterns of enrichment as shown in Figure 4B-E. In addition, clustering of the studies shows that the two Y2H data sets cluster separately from the two AP-MS data sets, suggesting that these patterns of enrichment represent fundamental differences between the two technologies. This analysis enables a broad overview of those annotation terms that are associated with successful and unsuccessful baits in Y2H and AP-MS. For example, annotation terms associated with membrane proteins are highly enriched in the unsuccessful bait class for both Y2H and AP-MS (Figure 4C). This is expected, since integral membrane proteins perform relatively poorly with commonly used Y2H and AP-MS work-flows, and

Figure 4. Hierarchical clustering enables visualization of patterns of enrichment of bait features across studies and technologies. (A) Heat-map showing all 391 significant terms (Fisher’s exact test, p-value