Modeling Contaminants in AP-MS/MS Experiments

Mathieu Lavallée-Adam,† Philippe Cloutier,‡ Benoit Coulombe,‡ and Mathieu Blanchette*,†

†McGill Centre for Bioinformatics and School of Computer Science, McGill University, Montréal, Canada
‡Gene Transcription and Proteomics Laboratory, Institut de recherches cliniques de Montréal, Montréal, Canada

*To whom correspondence should be addressed. E-mail: blanchem@mcb.mcgill.ca. Phone: 514-398-5209. Fax: 514-398-3387.

Received August 1, 2010

Abstract: Identification of protein-protein interactions (PPI) by affinity purification (AP) coupled with tandem mass spectrometry (AP-MS/MS) produces large data sets with high rates of false positives. This is in part because of contamination at the AP level (due to gel contamination, nonspecific binding to the TAP columns in the context of tandem affinity purification, insufficient purification, etc.). In this paper, we introduce a Bayesian approach to identify false-positive PPIs involving contaminants in AP-MS/MS experiments. Specifically, we propose a confidence assessment algorithm (called Decontaminator) that builds a model of contaminants using a small number of representative control experiments. It then uses this model to determine whether the Mascot score of a putative prey is significantly larger than what was observed in control experiments and assigns it a p-value and a false discovery rate. We show that our method identifies contaminants better than previously used approaches and results in a set of PPIs with a larger overlap with databases of known PPIs. Our approach will thus allow improved accuracy in PPI identification while reducing the number of control experiments required.

Keywords: bioinformatics • protein-protein interactions • artificial intelligence • contaminants • affinity purification • mass spectrometry • computational biology • proteomics • Bayesian statistics

Introduction

The study of protein-protein interactions (PPI) is crucial to the understanding of biological processes taking place in cells.1 Affinity purification (AP) combined with mass spectrometry (MS) is a powerful method for the large-scale identification of PPIs.2-7 The experimental pipeline of AP consists in first tagging a protein of interest (bait) by genetically inserting a small peptide sequence (tag) onto the recombinant bait protein. The bait protein is affinity purified, together with its interacting partners (preys), which are identified using MS. However, this type of experiment is prone to false-positive identifications for various reasons,8 which can seriously complicate the downstream analyses. In the context of affinity purification, contamination of manually handled gel bands, inadequate purification, purification of specific complexes from abundant proteins, and nonspecificity of the tag antibody used are some of the many ways contaminants can be introduced into the experimental pipeline before the mass spectrometry (MS) phase.


These contaminants, added to the already large set of valid preys of a given bait, create even longer lists of proteins to analyze. While common contaminants can be identified easily by a trained eye, sporadic contaminants can erroneously be considered true positive interactions. In addition to contaminants, false-positive PPIs can be introduced at the tandem mass spectrometry (MS/MS) step.9 For example, peptides of proteins with low abundance or involved in transient interactions can be difficult to identify because of the lack of spectra. Such peptides can be misidentified by database searching algorithms such as Mascot10 or SEQUEST.11 Although many approaches have been proposed to limit the number of mismatched MS/MS spectra (e.g., PeptideProphet12 and Percolator13), the modeling and detection of contaminants, which is the problem we consider in this paper, has received much less attention.

Related Work. A number of experimental and computational approaches have been proposed to reduce the rate of false-positive PPIs. Several steps in the experimental pipeline can be optimized to minimize contamination. In-cell, near-physiological expression of the tagged proteins is preferred to overexpression to prevent spurious PPIs. Additional purifications can also be performed in order to remove contaminating proteins from the affinity-purified eluate. The drawback of an increased number of purifications is a loss of sensitivity, as transient or weak PPIs are more likely to be disrupted.2 When gel-based sample separation methods are used before MS/MS, manual gel band cutting can introduce contaminants such as human keratins into the sample. This can be addressed by robotic gel cutting, although this increases equipment cost. As an alternative, gel-free protocols simply use liquid chromatography to separate the peptide mixture before MS/MS. However, depending on the complexity of the mixture, less separation might result in a substantial decrease in sensitivity. Finally, liquid chromatography column contamination from previous chromatographic runs is also important to consider. Although it is possible to wash the column to elute peptides remaining from previous runs, very limited washing is typically done because of the time it consumes.

Several computational methods have been used to identify the correct PPIs from AP-MS/MS data.14,15 Some involve the use of the topology of the network formed by the PPIs (e.g., the number of times two proteins are observed together in a purification, used to assign a Socio-affinity index4 or a Purification Enrichment score16).

 2011 American Chemical Society

technical notes

Modeling Contaminants in AP-MS/MS Experiments 16

Others have used various combinations of data features, such as mass spectrometry confidence scores, network topology features, and reproducibility data, with machine learning approaches to assign probabilities that a given PPI is a true positive.2,5,17,18 However, with each of these methods, contaminants would often be classified as true interactions because of their high database matching scores and reproducibility. Such sophisticated machine learning procedures can be prone to overfitting, and the use of small, manually curated, but often biased training sets, such as the MIPS complexes19 used by Krogan et al.5 or the manually selected training set used in our previous approach,2,18 can be problematic depending on the nature of the data analyzed. Finally, Chua et al. combined PPI data obtained from several different experimental techniques in an effort to reduce false positives.20 Although such methods can be very efficient at filtering contaminants, they typically suffer from poor sensitivity.

All of these methods attempt to model simultaneously several sources of false-positive identifications, including contamination but also, for example, misidentification of peptides at the mass spectrometry level. For instance, scoring methods relying on the topology of PPI networks will tend to assign low scores to proteins observed as preys in several experiments, because they are likely contaminants, while also scoring poorly proteins largely disconnected from the network, which are potential database misidentifications. However, none of these methods attempts to directly model and filter out contaminants resulting from AP experiments. Deconvolving the modeling of false identifications into (i) modeling of AP contaminants and (ii) modeling of the database matching of MS/MS spectra will potentially lead to methods that identify PPIs with higher accuracy.

To date, most computational methods aiming specifically at filtering out likely contaminants have been quite simplistic. Several groups maintain a manually assembled list of contaminants and systematically reject any interaction involving these proteins.8 However, it is possible that a contaminant for one bait is a true interaction for another, suggesting that a finer model of contaminant levels would be beneficial. Recently, the Significance Analysis of Interactome (SAINT), a sophisticated statistical approach attempting to filter out contaminant interactions resulting from AP-MS/MS experiments, was introduced.7 SAINT assesses the significance of an interaction according to the semiquantitative peptide count measure of the prey. It discriminates true from false interactions using mixture modeling with Bayesian statistical inference. However, the currently available version of SAINT (1.0) lacks the flexibility to learn contaminant peptide count distributions from available control data and requires a considerable number of baits (15 to 20) in order to yield optimal performance. Moreover, manual labeling of proteins as hubs or known contaminants, while optional, is needed to achieve the best possible accuracy. (Another version of SAINT is currently under development and promises to address many of these issues.)

Once the tagging is performed, some affinity purification methods require the vector carrying the bait to be induced so that the tagged protein is expressed. An alternate method to identify likely contaminant PPIs for a given bait is therefore to perform a control experiment in which the expression of the tagged protein is not induced prior to immunoprecipitation.
It is then possible to compare mass spectrometry confidence scores (e.g., from Mascot10) for the preys from both the control and induced experiments. For example, in Jeronimo et al.,18 only preys with a Mascot score at least 5 times larger in the induced experiment than in the control experiment were retained, the others being considered likely contaminants. There are limitations to such a false-positive filtering procedure. First, this method is expensive in terms of time and resources, since the cost is doubled for each bait studied. Second, because of the noisy nature of MS scores for low-abundance preys, comparing a single induced experiment to a single noninduced experiment is problematic; pooling results from several noninduced experiments could be beneficial. Third, some baits will show leaky expression from the noninduced vector. Depending on the level of leakiness, several true interactions may be mistakenly categorized as false positives.

Here, we propose a confidence assessment algorithm (Decontaminator) that uses only a limited number of high-quality controls, sufficient for the proper identification of contaminants in AP-MS/MS experiments, without prior knowledge of either hubs or contaminants. By pooling control experiments, one-to-one comparisons of induced and noninduced experiment Mascot scores are avoided. Our fast computational method thus provides accurate modeling of contaminants while limiting resource usage.
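To make this baseline concrete, here is a minimal sketch of the induced-versus-control ratio filter described above (the 5× Mascot score rule of Jeronimo et al.18). The class and method names are hypothetical, not from any published implementation, and the scores in the usage example are made up.

```java
/** Minimal sketch (hypothetical naming) of the ratio filter used in prior work:
 *  retain a prey only if its induced Mascot score is at least `ratio` times
 *  its noninduced (control) score. */
public final class RatioFilter {
    static boolean keep(double inducedScore, double nonInducedScore, double ratio) {
        // Undetected preys have a score of 0, so any induced detection
        // passes against an empty control.
        return inducedScore > 0 && inducedScore >= ratio * nonInducedScore;
    }

    public static void main(String[] args) {
        System.out.println(keep(320.0, 48.0, 5.0)); // true:  320 >= 5 * 48 = 240
        System.out.println(keep(102.0, 95.0, 5.0)); // false: 102 <  5 * 95 = 475
    }
}
```

As the text notes, such a one-to-one comparison is brittle for low-abundance preys; Decontaminator instead pools controls and builds a noise model.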

Methods

We propose a Bayesian approach called Decontaminator that makes use of a limited number of noninduced control experiments to build a model of contaminant levels and to analyze the noise in the measurements of Mascot scores. Decontaminator then uses this model to assign a p-value and an associated false discovery rate (FDR) to the Mascot score obtained for a given prey. We start by describing the AP-MS/MS approach used to generate the data and then formally describe our contaminant detection algorithm. Alternate approaches are also considered, and their accuracy is compared in the Results section.

Biological Data Set. The Proteus database contains the results of tandem affinity purification (TAP) combined with MS/MS experiments performed for a set E of 89 baits, both in noninduced and induced conditions.2,18,21,22 The baits selected revolve mostly around the transcriptional and splicing machineries. The set of proteins identified as preys by at least one bait (in either the induced or noninduced experiments) consists of 3619 proteins. The detailed TAP-MS/MS methodology has been described elsewhere.2 Briefly, a vector expressing the TAP-tagged protein of interest was stably transfected into HEK 293 cells. Following induction, the cells were harvested and lysed mechanically in detergent-free buffers. The lysate was cleared of insoluble material by centrifugation, and the tagged protein, complexed with associated factors, was purified twice using two sets of beads, each targeting a different component of the TAP tag. The purified protein complexes were separated by SDS-PAGE and stained with silver nitrate. The acrylamide gel was then cut in its entirety into about 20 slices that were subsequently trypsin-digested. Identification of the tryptic peptides obtained was performed using microcapillary reversed-phase high-pressure liquid chromatography coupled online to an LTQ-Orbitrap (Thermo Fisher Scientific) quadrupole ion trap mass spectrometer with a nanospray device. Proteins were identified using the Mascot software10 (Matrix Science) (see Supporting Information for software information). For some of the baits tested, the promoter was leaky, which resulted in the expression, at various levels, of the tagged protein even when it was not induced. These baits were identified by detection of the tagged protein in the noninduced samples, and these samples were excluded from our analysis.


Figure 1. Decontaminator workflow. First, control and induced experiments are pooled to build a noise model for each. These noise models are then used to assign a p-value to each prey obtained upon the induction of a bait. Finally, a false discovery rate is calculated for each p-value.

A set B ⊂ E of 14 nonleaky baits was selected for this study: B = {b_1, ..., b_14} = {SFRS1, NOP56, TWISTNB, PIH1D1, UXT, MEPCE, SART1, RP11-529I10.4, TCEA2, PDRG1, PAF1, KIAA0406, POLR1E, KIN}. The set P of preys detected in at least one of these 28 induced and noninduced experiments contains 2415 proteins. Of these, 1067 proteins were unique to a single noninduced bait and 808 were detected more than once in the set of 14 controls, while 540 were only observed in induced experiments. We denote by $M^{NI}_{b,p}$ the Mascot score obtained for prey p in the experiment where bait b is not induced, and by $M^{I}_{b,p}$ the analogous score in the induced experiment. Note that for most pairs (b,p), where b ∈ B and p ∈ P, p was not detected as a prey for b, in which case we set the relevant Mascot score to zero. In the noninduced experiments, the number of preys detected per bait varies from 206 to 626, with a mean of 316. These preys are likely to be contaminants, as the tagged bait is not expressed. In the induced experiments, the number of preys per bait ranges from 5 to 516, with a mean of 135. It may appear surprising that, on average, more preys are detected in the noninduced experiments than in the induced ones. This is likely because the presence of high-abundance preys in the induced experiments masks the presence of lower-abundance ones, including contaminants. Still, a significant fraction of the proteins detected in the induced condition are likely contaminants.
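As a concrete illustration of this bookkeeping, the sketch below stores Mascot scores for one condition and returns 0 for undetected (b,p) pairs, following the convention above. The class name, key encoding, and the scores in the usage example are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the score bookkeeping for one condition (induced or noninduced):
 *  M_{b,p} defaults to 0 when prey p was not detected for bait b. Naming is ours. */
public class ScoreTable {
    private final Map<String, Double> scores = new HashMap<>(); // key: bait + "\t" + prey

    void put(String bait, String prey, double mascotScore) {
        scores.put(bait + "\t" + prey, mascotScore);
    }

    /** Undetected pairs get a Mascot score of zero, as stated in the text. */
    double get(String bait, String prey) {
        return scores.getOrDefault(bait + "\t" + prey, 0.0);
    }

    public static void main(String[] args) {
        ScoreTable nonInduced = new ScoreTable();
        nonInduced.put("MEPCE", "HSPA8", 95.0);                // hypothetical control observation
        System.out.println(nonInduced.get("MEPCE", "HSPA8"));  // 95.0
        System.out.println(nonInduced.get("MEPCE", "POLR2E")); // 0.0 (not detected)
    }
}
```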


Computational Analysis. An ideal model of contaminants would specify, for each prey p, the distribution of the MS scores (in our case, Mascot scores10) in noninduced experiments, which we call the null distribution for p. However, accurately estimating this distribution would require a large number of noninduced experiments, and the cost would be prohibitive. Instead, we use a small number of noninduced experiments and make the assumption that preys with similar average Mascot scores have a similar null distribution (this assumption is substantiated in the Discussion section). This allows us to pool noninduced scores from different preys (if they have similar Mascot score averages) to build a more accurate noise model. Thus, results from a few control experiments are sufficient to build a contaminant model that can then be used to analyze the results of any number of induced experiments performed under the same conditions. Figure 1 summarizes our approach.

Our goal is now to use the data gathered from induced and noninduced AP-MS/MS experiments to build a model of noise in Mascot score measurements and eventually be able to assess the significance of a given Mascot score $M^{I}_{b,p}$. Let $\bar{M}^{NI}_{p}$ be the unobserved true mean of Mascot scores for prey p in noninduced experiments, which is defined as the mean of the Mascot scores of an infinite number of noninduced biological replicates. $\bar{M}^{NI}_{p}$ can never be observed, but its posterior distribution can be obtained if a few samples are available. Similarly, define $\bar{M}^{I}_{b,p}$ as the average of the Mascot scores of p over an infinite number of biological replicates of experiments where b is induced.


Figure 2. Plots of the posterior distributions of $\bar{M}^{NI}_{p}$ and of $\bar{M}^{I}_{b,p}$ for three different interactions. The blue curve is the posterior distribution of $\bar{M}^{I}_{b,p}$ and the red curve is the posterior distribution of $\bar{M}^{NI}_{p}$. The orange X's positions on the x-axis are the observed Mascot scores of the prey in control experiments. When the prey was not detected for a given control, an X is drawn on the x-axis close to the origin. The blue X's position on the x-axis is the Mascot score of the prey in the induced experiment for the corresponding bait. (a) The RPAP3-POLR2E interaction is an example of an interaction considered valid by the algorithm. (b) The RPAP3-SAMD1 interaction is an example of an interaction where the prey is considered a contaminant. (c) KPNA2-ACTB is an example of an interaction that is predicted as positive, even though ACTB is often observed as a contaminant (but with lower Mascot scores) in control experiments.

The essence of our approach is to calculate the posterior distribution of $\bar{M}^{NI}_{p}$, the true mean Mascot score of prey p given our data from noninduced experiments, and to compare it to the posterior distribution of $\bar{M}^{I}_{b,p}$, the true mean Mascot score of prey p in an induced experiment, given our observed data $M^{I}_{b,p}$. If the second distribution is significantly to the right of the first, prey p is likely a bona fide interactor of bait b. If not, p is probably a contaminant and should be discarded.

Before describing our method in detail, we give a few examples that illustrate how it works. Figure 2a shows an example of an interaction accepted by Decontaminator. POLR2E obtained a Mascot score of 320 in the induced experiment of RPAP3 and was only detected twice in control experiments (Mascot scores 38 and 48). From this figure, it can clearly be seen that the resulting posterior distribution of $\bar{M}^{I}_{RPAP3,POLR2E}$ is significantly to the right of the posterior distribution of $\bar{M}^{NI}_{POLR2E}$. Therefore, the RPAP3-POLR2E interaction obtains a small p-value (0.00018) and FDR (0.013) and is considered a valid interaction. This prediction is consistent with the literature about this interaction,18 which is present in databases such as BioGrid.23 Conversely, Figure 2b shows the posterior distribution of $\bar{M}^{I}_{RPAP3,SAMD1}$ for the induced experiment of RPAP3 (Mascot score: 102). The distribution largely overlaps with the posterior distribution of $\bar{M}^{NI}_{SAMD1}$; SAMD1 obtained Mascot scores of 95, 53, 147, and 164 in control experiments but was unobserved in the 10 other controls. Clearly, the RPAP3-SAMD1 interaction is not very reliable, because its Mascot score does not sufficiently exceed some of the observed noninduced Mascot scores. Decontaminator assigns a large p-value (0.18) and FDR (>0.9) and labels the interaction as a contaminant. Of note, this interaction would often have been incorrectly classified by methods based on a direct comparison of Mascot scores between matched induced and noninduced experiments (as used, for example, in Jeronimo et al.18), as SAMD1 shows up in control experiments only four out of 14 times. Finally, Figure 2c shows the results for the KPNA2-ACTB interaction. ACTB, like other actin proteins, is a known contaminant in our experimental pipeline, obtaining Mascot scores varying from 35 to 210 in eight of the controls. However, the Mascot score observed for this interaction (793) is sufficiently higher to allow the interaction to be predicted as likely positive (p-value 0.005, FDR 0.17). Beta-actin (ACTB) is known to shuttle to and from the nucleus,24 a process which may directly require the karyopherin alpha import protein KPNA2. Beta-actin and actin-like proteins have been shown to be part of a number of chromatin remodeling complexes,25 and analysis of the KPNA2 purification revealed a strong presence of such complexes (TRRAP/TIP60 and SWI/SNF-like BAF complexes). Therefore, it is possible that these remodelers are assembled in the cytoplasm and imported together to the nucleus via KPNA2. This example shows how the method can differentiate specific interactions from classic contamination.

The Decontaminator approach involves four steps (see Figure 1):

1. Use noninduced experiments for different baits as biological replicates and obtain a noise model defined by $\Pr[M^{NI}_{b,p} \mid \bar{M}^{NI}_{p}]$ and $\Pr[M^{I}_{b,p} \mid \bar{M}^{I}_{b,p}]$.
2. For each protein p ∈ P, calculate the posterior distribution of $\bar{M}^{NI}_{p}$ given $M^{NI}_{b_1,p}, M^{NI}_{b_2,p}, \ldots, M^{NI}_{b_{14},p}$. Similarly, calculate the posterior distribution of $\bar{M}^{I}_{b,p}$, given $M^{I}_{b,p}$.
3. For each pair (b,p) ∈ B × P, assign a p-value to $M^{I}_{b,p}$:
$$\text{p-value}(M^{I}_{b,p}) = \Pr[\bar{M}^{NI}_{p} \ge \bar{M}^{I}_{b,p} \mid M^{I}_{b,p}, M^{NI}_{b_1,p}, M^{NI}_{b_2,p}, \ldots, M^{NI}_{b_{14},p}]$$

4. Using the noninduced and full induced data sets, assign a false discovery rate (FDR) to each p-value.

Each step is detailed further below.

Step 1: Building a Noise Model from Noninduced Experiments. The set of 14 TAP-MS/MS experiments in which the bait's expression was not induced can be seen as a set of biological replicates of the null condition. We use these measurements to assess the amount of noise in each replicate, i.e., to estimate $\Pr[M^{NI}_{b,p} \mid \bar{M}^{NI}_{p}]$, the probability of a given observation $M^{NI}_{b,p}$ given its true mean Mascot score $\bar{M}^{NI}_{p}$. This distribution is estimated using a leave-one-out cross-validation approach on the set of 14 noninduced experiments. Specifically, for each bait b ∈ B, we compare $M^{NI}_{b,p}$ to $\mu^{*}_{b,p}$, the corrected average (see Supporting Information) of the 13 Mascot scores of p in all noninduced experiments except the one where bait b was used.


$\mu^{*}_{b,p}$ provides a good estimate of $\bar{M}^{NI}_{p}$. Let $C(i,j)$ be the number of bait-prey pairs for which $M^{NI}_{b,p} = i$ and $\mu^{*}_{b,p} = j$. Then, a straightforward estimator is

$$\Pr[M^{NI}_{b,p} = x \mid \bar{M}^{NI}_{p} = y] = C(x,y) \Big/ \sum_{x'} C(x',y)$$

Note that C is a fairly large matrix (the number of rows and columns is set to 1000; larger Mascot scores are culled to 1000). In addition, aside from the zeroth column C(·,0), it is quite sparsely populated, as the sum of all its entries is 40,306. Thus, the above formula yields a very poor estimator. Matrix C therefore needs to be smoothed into a matrix $C^{s}$ using a k-nearest-neighbors smoothing algorithm.26 Specifically, let $N_\delta(i,j) = \{(i',j') : |i - i'| \le \delta, |j - j'| \le \delta\}$ be the set of matrix cells neighboring entry (i,j), for some distance threshold δ. For each entry (i,j) in the matrix, we choose δ in such a way that $S_\delta(i,j) = \sum_{(i',j') \in N_\delta(i,j)} C(i',j') \le k$ and $S_{\delta+1}(i,j) = \sum_{(i',j') \in N_{\delta+1}(i,j)} C(i',j') > k$. Then, $C^{s}$ is obtained as

$$C^{s}(i,j) = \frac{\sum_{(a,b) \in N_{\delta+1}(i,j)} w_{a,b} \cdot C(a,b)}{\sum_{(a,b) \in N_{\delta+1}(i,j)} w_{a,b}}$$

where

$$w(a,b) = \begin{cases} 1 & \text{if } (a,b) \in N_\delta(i,j) \\ \dfrac{k - S_\delta(i,j)}{S_{\delta+1}(i,j) - S_\delta(i,j)} & \text{if } (a,b) \in N_{\delta+1}(i,j) \setminus N_\delta(i,j) \end{cases}$$

In our experiments, k = 10 produced the best results. Finally, we obtain our estimate of the noise in each measurement as

$$\Pr[M^{NI}_{b,p} = x \mid \bar{M}^{NI}_{p} = y] = C^{s}(x,y) \Big/ \sum_{x'} C^{s}(x',y) \qquad (1)$$

The approach proposed so far is appropriate to model noise in noninduced experiments. However, it does not correctly model noise in induced experiments, for the following reason. The distributions of Mascot scores observed in noninduced and induced experiments differ significantly, with many more high scores observed in the induced experiments. High Mascot scores in noninduced data are rare and, when they do occur, relatively often correspond to cases where a prey p obtains very low scores with most noninduced baits but a fairly high score with one particular bait, resulting in a large difference between $M^{NI}_{b,p}$ and $\bar{M}^{NI}_{p}$. Large Mascot scores are thus seen as particularly unreliable, which is the correct conclusion for noninduced experiments but not for induced ones. In a situation where high $\bar{M}^{I}_{b,p}$ are frequent, for example when b is induced, this does not properly reflect the uncertainty in the true Mascot score. To address this problem, a correction factor is computed. Let I(i) be the fraction of (b,p) pairs such that $M^{I}_{b,p} = i$, and let NI(i) be the fraction of preys with $\bar{M}^{NI}_{p} = i$. The I and NI distributions are first smoothed using a k-nearest-neighbor approach; the corrected matrix $C^{c}$ for the noise model of Mascot scores in induced experiments is then obtained as

$$C^{c}(i,j) = C^{s}(i,j) \cdot \frac{I(j)}{NI(j)}$$

The correction is performed by multiplying each entry (i,j) of $C^{s}$ by the ratio of the smoothed probabilities, under the induced and the true-mean distributions, of Mascot score j. This correction allows a better estimation of the noise in induced experiments by putting more weight, in the matrix $C^{c}$, on larger Mascot scores. (See the Supporting Information for the derivation of the correction factor.)

Step 2: Posterior Distributions of $\bar{M}^{NI}_{p}$ and $\bar{M}^{I}_{b,p}$. We are now interested in obtaining the posterior distribution of the mean Mascot score $\bar{M}^{NI}_{p}$, given the set of observations $M^{NI}_{b_1,p}, \ldots, M^{NI}_{b_{14},p}$. This is readily obtained using Bayes' rule and the conditional independence of the observations given their means:

$$\Pr[\bar{M}^{NI}_{p} \mid M^{NI}_{b_1,p}, \ldots, M^{NI}_{b_{14},p}] = \Pr[\bar{M}^{NI}_{p}] \cdot \prod_{i=1}^{14} \Pr[M^{NI}_{b_i,p} \mid \bar{M}^{NI}_{p}] \Big/ \zeta \qquad (2)$$

where ζ is a normalizing constant and $\Pr[\bar{M}^{NI}_{p} = \alpha] = NI(\alpha)$. Since the observed Mascot scores of induced experiments are also noisy, we estimate the noise of each Mascot score from an induced experiment as

$$\Pr[\bar{M}^{I}_{b,p} = j \mid M^{I}_{b,p} = i] = C^{c}(i,j) \Big/ \sum_{j'} C^{c}(i,j')$$
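To make Steps 1 and 2 concrete, here is a compact sketch of the count-matrix smoothing and of the posterior computation, under the assumptions stated above (scores capped at 1000, k = 10). All class, method, and variable names are ours; boundary handling is simplified relative to the paper, and the likelihood matrix passed to the posterior is assumed to be already column-normalized as in eq 1.

```java
/** Sketch of Steps 1-2: k-NN smoothing of the count matrix C (eq 1) and the
 *  posterior over the true mean noninduced Mascot score (eq 2). Naming is ours. */
public class NoiseModel {
    static final int MAX = 1001; // Mascot scores culled to 1000
    static final int K = 10;     // neighborhood mass threshold reported in the text

    /** Sum of C over the square window of half-width delta centered at (i,j). */
    static double windowSum(double[][] c, int i, int j, int delta) {
        if (delta < 0) return 0.0;
        double s = 0;
        for (int a = Math.max(0, i - delta); a <= Math.min(MAX - 1, i + delta); a++)
            for (int b = Math.max(0, j - delta); b <= Math.min(MAX - 1, j + delta); b++)
                s += c[a][b];
        return s;
    }

    /** C^s(i,j): grow the window until it holds more than K counts, then average,
     *  giving the outermost ring only the fractional weight needed to reach K. */
    static double smooth(double[][] c, int i, int j) {
        int d = 0; // on exit, d plays the role of delta+1 in the text
        while (d < MAX && windowSum(c, i, j, d) <= K) d++;
        double sInner = windowSum(c, i, j, d - 1);     // S_delta
        double sOuter = windowSum(c, i, j, d);         // S_{delta+1}
        double wRing = sOuter > sInner ? (K - sInner) / (sOuter - sInner) : 0.0;
        double num = 0, den = 0;
        for (int a = Math.max(0, i - d); a <= Math.min(MAX - 1, i + d); a++)
            for (int b = Math.max(0, j - d); b <= Math.min(MAX - 1, j + d); b++) {
                boolean inner = Math.abs(a - i) < d && Math.abs(b - j) < d;
                double w = inner ? 1.0 : wRing;        // w(a,b) of the text
                num += w * c[a][b];
                den += w;
            }
        return den > 0 ? num / den : 0.0;
    }

    /** Eq 2: posterior over the true mean score given the control observations,
     *  with prior NI(alpha) and column-normalized likelihood[x][y] = Pr[x | mean y]. */
    static double[] posteriorMean(double[][] likelihood, double[] priorNI, int[] observations) {
        double[] post = new double[MAX];
        double z = 0; // normalizing constant zeta
        for (int y = 0; y < MAX; y++) {
            double p = priorNI[y];
            for (int x : observations) p *= likelihood[x][y];
            post[y] = p;
            z += p;
        }
        if (z > 0) for (int y = 0; y < MAX; y++) post[y] /= z;
        return post;
    }
}
```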

Step 3: P-Value Computation. Given an observed Mascot score $M^{I}_{b,p}$ for prey p, Decontaminator can now assign a p-value to this score, which represents the probability that $\bar{M}^{NI}_{p}$, the true mean Mascot score of p in noninduced experiments, is larger than or equal to $\bar{M}^{I}_{b,p}$, the true mean Mascot score for the induced experiment:

$$\text{p-value}(M^{I}_{b,p}) = \Pr[\bar{M}^{NI}_{p} \ge \bar{M}^{I}_{b,p} \mid M^{I}_{b,p}, M^{NI}_{b_1,p}, M^{NI}_{b_2,p}, \ldots, M^{NI}_{b_{14},p}] = \sum_{x \ge y} \Pr[\bar{M}^{NI}_{p} = x \mid M^{NI}_{b_1,p}, M^{NI}_{b_2,p}, \ldots, M^{NI}_{b_{14},p}] \cdot \Pr[\bar{M}^{I}_{b,p} = y \mid M^{I}_{b,p}] \qquad (3)$$

where the required terms are obtained from Steps 1 and 2.
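A minimal sketch of eq 3, combining the two posteriors from Step 2; the names are ours, and both arrays are assumed to be normalized distributions over the same score grid.

```java
/** Sketch of Step 3 (eq 3): Pr[true noninduced mean >= true induced mean],
 *  given the two posterior distributions. Naming is ours. */
public class PValueComputer {
    static double pValue(double[] posteriorNI, double[] posteriorInduced) {
        int n = posteriorNI.length;
        // suffix[y] = Pr[true NI mean >= y], so the double sum of eq 3 collapses to O(n)
        double[] suffix = new double[n + 1];
        for (int x = n - 1; x >= 0; x--) suffix[x] = suffix[x + 1] + posteriorNI[x];
        double p = 0;
        for (int y = 0; y < n; y++) p += posteriorInduced[y] * suffix[y]; // sum over x >= y
        return Math.min(1.0, p);
    }
}
```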

Step 4: False Discovery Rate Estimation. Although our p-values are in principle sufficient to assess the confidence in the presence of a given PPI, they may be biased if some of the assumptions we make are violated. A hypothesis-free method to assess the accuracy of our predictions is to measure a false discovery rate (FDR) for each PPI. For a given p-value threshold t, FDR(t) is defined as the expected fraction of predictions (interactions with p-values below t) that are false positives (i.e., due to contaminants). We use a leave-one-out strategy to estimate the FDR: for every bait b in B, we compute p-value($M^{NI}_{b,p}$) for all preys p ∈ $P^{NI}_{b}$ (the set of preys detected when b was the control bait) and p-value($M^{I}_{b,p}$) for all preys p ∈ $P^{I}_{b}$ (the set of preys detected when b was induced), based on the set of noninduced experiments excluding bait b. Then, we obtain

$$\mathrm{FDR}(t) = \frac{\sum_{b \in B} \sum_{p \in P^{NI}_{b}} \mathbf{1}\left[\text{p-value}(M^{NI}_{b,p}) \le t\right]}{\sum_{b \in B} \sum_{p \in P^{I}_{b}} \mathbf{1}\left[\text{p-value}(M^{I}_{b,p}) \le t\right]}$$
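A sketch of this ratio as code; the list structure and names are ours, with p-values assumed to be gathered over all (b,p) pairs as described above.

```java
import java.util.List;

/** Sketch of Step 4: FDR(t) as the ratio of control (noninduced) to induced
 *  bait-prey pairs whose p-values fall at or below threshold t. Naming is ours. */
public class FdrEstimator {
    static double fdr(List<Double> nonInducedPValues, List<Double> inducedPValues, double t) {
        long nullHits = nonInducedPValues.stream().filter(p -> p <= t).count();
        long predictedHits = inducedPValues.stream().filter(p -> p <= t).count();
        // Noninduced runs contain only contaminants, so nullHits estimates the
        // number of contaminant-driven predictions among predictedHits.
        return predictedHits == 0 ? 0.0 : Math.min(1.0, (double) nullHits / predictedHits);
    }
}
```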

Alternate Approaches. We also compared Decontaminator to several simpler filtering procedures. The MScore method reports the interaction (b,p) as a true positive if $M^{I}_{b,p} / M^{NI}_{b,p} > t$, for some threshold t. This is the approach that was used as primary filtering by Jeronimo et al.18 and Cloutier et al.2 The MRatio method computes $r'(b,p) = M^{I}_{b,p} / (1 + M^{NI}_{b,p})$. It reports the interaction as a true positive if $r'(b,p) > t'$, for some threshold t'. We note that the MScore and MRatio approaches are only applicable if both induced and noninduced experiments are performed for all baits of interest. The Z-score method assigns a Z-score to each Mascot score, as compared to the prey-specific mean $\bar{M}^{NI}_{p}$ and standard deviation $\sigma(M^{NI}_{p})$ for prey p across all baits: $\text{Z-score}(b,p) = (M^{I}_{b,p} - \bar{M}^{NI}_{p}) / \sigma(M^{NI}_{p})$. Compared to our Bayesian approach, the Z-score method may be advantageous if the Mascot score variance of contaminants with similar averages varies significantly from prey to prey. Our approach, by pooling all noninduced Mascot score observations from different preys, assumes that such variation is low; if this is not the case, the modeling accuracy of contaminants will be negatively affected. However, the Z-score approach works under the assumption that noise in Mascot scores is normally distributed, which is not always the case. In addition, its mean and variance estimates are obtained from only as many data points as there are control experiments available, as no pooling is performed.

Implementation and Availability. The proposed methods are implemented in a fast, platform-independent Java program. Given a representative set of AP-MS/MS control experiments, our software computes FDRs for all interactions identified in induced experiments as described here. Note that any protein identification confidence score (SEQUEST XCorr, spectral counts, peptide counts, etc.) could replace the Mascot scores used in our software package through simple modifications. The Java program is available at http://www.mcb.mcgill.ca/~blanchem/Decontaminator.

Results

Contaminant Detection Accuracy. We start by studying the ability of Decontaminator to tease out contaminants from true interactions. Because neither the contaminants nor the true interactions are known beforehand, one way to assess our method and compare it to others is to compare the number of bait-prey pairs from induced experiments that achieve a p-value of at most t to the number of such pairs in noninduced experiments. The ratio of these two numbers, the false discovery rate FDR(t), indicates the fraction of predictions from the induced experiments that are expected to be due to contaminants. One can thus contrast two prediction methods by studying the number of bait-prey pairs that can be detected at a given FDR. Figure 3 shows the cumulative distributions of FDRs for data from induced experiments, for Decontaminator as well as for the Z-score approach described in the Methods section. Since the MScore and MRatio approaches require matched induced and noninduced experiments for every bait, FDRs cannot be computed for them. An FDR comparison is also not possible with SAINT, because it scores the entirety of the input data set without the possibility of leaving out a subset of the data. It can be observed that, at low FDRs, Decontaminator always yields a larger set of predicted interactions than the Z-score approach. For example, at 1, 3, and 15% FDR, our approach yields 140, 74, and 43% more interactions, respectively, than the Z-score approach. Alternatively, for the same number of predicted interactions, say 2000, the FDR of Decontaminator (1%) is more than three times lower than that of the Z-score approach (3.4%).

Number of Control Experiments Required. The number of control noninduced experiments needed to build an accurate model of contaminants is an important issue to consider. So far, we have used all 14 available controls to filter out contaminants. Figure 4 shows the cumulative distributions of the FDRs obtained from Decontaminator with different numbers of controls. It can be observed that the number of high-confidence interactions increases with the number of controls used.

A small number of preys show unexpectedly high variance in their noninduced Mascot scores, possibly owing to the large size (>1200 a.a.) of these proteins. These preys could also suffer from undersampling by the mass spectrometer and be masked by the presence of more abundant proteins. However, these are rare cases that can be treated separately by flagging, in the set of noninduced experiments, proteins with unexpectedly high variance and analyzing them separately.

Conclusion

Several methods have been proposed to identify false-positive PPIs in AP-MS/MS experiments. However, very few have considered modeling the contaminants arising solely from the AP experimental pipeline. We hypothesized that a more accurate model of contaminants would yield higher accuracy in PPI identification. We have shown here that, using Decontaminator, only a few representative control experiments are necessary to accurately discard the vast majority of contaminants while still allowing the detection of true PPIs involving preys that, in other experiments, may be contaminants. These findings will allow significant reductions in expenses and a greater number of experiments to be conducted with higher accuracy.

Abbreviations: AP, affinity purification; FDR, false discovery rate; MS, mass spectrometry; MS/MS, tandem mass spectrometry; PPI, protein-protein interaction; TAP, tandem affinity purification.

Acknowledgment. This work was funded by a CIHR operating grant to B.C. and M.B. and by NSERC Vanier and CGS scholarships to M.L.A. The authors thank Christian Poitras for his help in the software development process, Denis Faubert for the mass spectrometry analysis, and Ethan Kim for useful suggestions.

Supporting Information Available: Protein and peptide identification software information, computation of corrected averages for Mascot scores, and derivation of the Cc correction factor. This material is available free of charge via the Internet at http://pubs.acs.org.

References

(1) von Mering, C.; Krause, R.; Snel, B.; Cornell, M.; Oliver, S.; Fields, S.; Bork, P. Nature 2002, 417, 399–404.

(2) Cloutier, P.; Al-Khoury, R.; Lavallée-Adam, M.; Faubert, D.; Jiang, H.; Poitras, C.; Bouchard, A.; Forget, D.; Blanchette, M.; Coulombe, B. Methods 2009, 48, 381–386.
(3) Gavin, A.; Bosche, M.; Krause, R.; Grandi, P.; Marzioch, M.; Bauer, A.; Schultz, J.; Rick, J.; Michon, A.; Cruciat, C.; et al. Nature 2002, 415, 141–147.
(4) Gavin, A.; Aloy, P.; Grandi, P.; Krause, R.; Boesche, M.; Marzioch, M.; Rau, C.; Jensen, L.; Bastuck, S.; Dumpelfeld, B.; et al. Nature 2006, 440, 631–636.
(5) Krogan, N.; Cagney, G.; Yu, H.; Zhong, G.; Guo, X.; Ignatchenko, A.; Li, J.; Pu, S.; Datta, N.; Tikuisis, A.; et al. Nature 2006, 440, 637–643.
(6) Ho, Y.; Gruhler, A.; Heilbut, A.; Bader, G.; Moore, L.; Adams, S.; Millar, A.; Taylor, P.; Bennett, K.; Boutilier, K.; et al. Nature 2002, 415, 180–183.
(7) Breitkreutz, A.; Choi, H.; Sharom, J.; Boucher, L.; Neduva, V.; Larsen, B.; Lin, Z.; Breitkreutz, B.; Stark, C.; Liu, G.; et al. Science 2010, 328, 1043.
(8) Gingras, A.; Aebersold, R.; Raught, B. J. Physiol. 2005, 563, 11.
(9) Boutilier, K.; Ross, M.; Podtelejnikov, A.; Orsi, C.; Taylor, R.; Taylor, P.; Figeys, D. Anal. Chim. Acta 2005, 534, 11–20.
(10) Perkins, D.; Pappin, D.; Creasy, D.; Cottrell, J. Electrophoresis 1999, 20, 3551–3567.
(11) Eng, J.; McCormack, A.; Yates, J. J. Am. Soc. Mass Spectrom. 1994, 5, 976–989.
(12) Keller, A.; Nesvizhskii, A.; Kolker, E.; Aebersold, R. Anal. Chem. 2002, 74, 5383–5392.
(13) Kall, L.; Canterbury, J.; Weston, J.; Noble, W.; MacCoss, M. Nat. Methods 2007, 4, 923–926.
(14) Sardiu, M.; Cai, Y.; Jin, J.; Swanson, S.; Conaway, R.; Conaway, J.; Florens, L.; Washburn, M. Proc. Natl. Acad. Sci. U.S.A. 2008, 105, 1454.
(15) Sowa, M.; Bennett, E.; Gygi, S.; Harper, J. Cell 2009, 138, 389–403.
(16) Collins, S.; Kemmeren, P.; Zhao, X.; Greenblatt, J.; Spencer, F.; Holstege, F.; Weissman, J.; Krogan, N. Mol. Cell. Proteomics 2007, 6, 439.
(17) Ewing, R.; Chu, P.; Elisma, F.; Li, H.; Taylor, P.; Climie, S.; McBroom-Cerajewski, L.; Robinson, M.; O'Connor, L.; Li, M.; et al. Mol. Syst. Biol. 2007, 3.
(18) Jeronimo, C.; Forget, D.; Bouchard, A.; Li, Q.; Chua, G.; Poitras, C.; Thérien, C.; Bergeron, D.; Bourassa, S.; Greenblatt, J.; et al. Mol. Cell 2007, 27, 262–274.
(19) Mewes, H.; Amid, C.; Arnold, R.; Frishman, D.; Guldener, U.; Mannhaupt, G.; Munsterkotter, M.; Pagel, P.; Strack, N.; Stumpflen, V.; et al. Nucleic Acids Res. 2004, 32, D41.
(20) Chua, H.; Sung, W.; Wong, L. Bioinformatics 2006, 22, 1623.
(21) Forget, D.; Lacombe, A.; Cloutier, P.; Al-Khoury, R.; Bouchard, A.; Lavallée-Adam, M.; Faubert, D.; Jeronimo, C.; Blanchette, M.; Coulombe, B. Mol. Cell. Proteomics 2010, 9, 2827.
(22) Krueger, B.; Jeronimo, C.; Roy, B.; Bouchard, A.; Barrandon, C.; Byers, S.; Searcey, C.; Cooper, J.; Bensaude, O.; Cohen, E.; et al. Nucleic Acids Res. 2008, 36, 2219.
(23) Stark, C.; Breitkreutz, B.; Reguly, T.; Boucher, L.; Breitkreutz, A.; Tyers, M. Nucleic Acids Res. 2006, 34, D535.
(24) Wada, A.; Fukuda, M.; Mishima, M.; Nishida, E. EMBO J. 1998, 17, 1635–1641.
(25) Olave, I. A.; Reck-Peterson, S. L.; Crabtree, G. R. Annu. Rev. Biochem. 2002, 71, 755–781.
(26) Fix, E.; Hodges, J., Jr. Project 21-49-004, Report No. 11, under Contract No. AF41(148)-31E, USAF School of Aviation Medicine, 1952.
(27) Prasad, T.; Goel, R.; Kandasamy, K.; Keerthikumar, S.; Kumar, S.; Mathivanan, S.; Telikicherla, D.; Raju, R.; Shafreen, B.; Venugopal, A. Nucleic Acids Res. 2008, 37, D767.
(28) Ashburner, M.; Ball, C.; Blake, J.; Botstein, D.; Butler, H.; Cherry, J.; Davis, A.; Dolinski, K.; Dwight, S.; Eppig, J.; et al. Nat. Genet. 2000, 25, 25–29.
(29) Mosley, A.; Pattenden, S.; Carey, M.; Venkatesh, S.; Gilmore, J.; Florens, L.; Workman, J.; Washburn, M. Mol. Cell 2009, 34, 168–178.
(30) Nesvizhskii, A.; Keller, A.; Kolker, E.; Aebersold, R. Anal. Chem. 2003, 75, 4646–4658.
(31) Strittmatter, E.; Kangas, L.; Petritis, K.; Mottaz, H.; Anderson, G.; Shen, Y.; Jacobs, J.; Camp, D., II; Smith, R. J. Proteome Res. 2004, 3, 760–769.
(32) Kawakami, T.; Tateishi, K.; Yamano, Y.; Ishikawa, T.; Kuroki, K.; Nishimura, T. Proteomics 2005, 5, 856–864.
(33) Havilio, M.; Haddad, Y.; Smilansky, Z. Anal. Chem. 2003, 75, 435–444.
(34) Bernholtz, B. Stat. Papers 1977, 18, 2–12.
(35) Persson, T.; Rootzen, H. Biometrika 1977, 64, 123.
