SFINX: Straightforward Filtering Index for Affinity Purification–Mass Spectrometry Data Analysis

Nov 30, 2015 - Advanced Database Research and Modelling (ADReM), Department of Mathematics and Computer Science, University of Antwerp, B-2020 Antwerp...

Technical Note

SFINX: straightforward filtering index for affinity purification-mass spectrometry data analysis
Kevin Titeca, Pieter Meysman, Kris Gevaert, Jan Tavernier, Kris Laukens, Lennart Martens, and Sven Eyckerman
J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.5b00666 • Publication Date (Web): 30 Nov 2015
Downloaded from http://pubs.acs.org on December 1, 2015




SFINX: straightforward filtering index for affinity purification-mass spectrometry data analysis

Kevin Titecaa,b, Pieter Meysmanc,d, Kris Gevaerta,b, Jan Taverniera,b, Kris Laukensc,d, Lennart Martensa,b,e, Sven Eyckermana,b,*

KEYWORDS: Interactomics, interaction filtering, affinity purification-mass spectrometry, protein-protein interactions, interaction detection

ABSTRACT

Affinity purification-mass spectrometry is one of the most common techniques for the analysis of protein-protein interactions, but inferring bona fide interactions from the resulting datasets remains notoriously difficult. We introduce SFINX, a Straightforward Filtering INdeX that identifies true positive protein interactions in a fast, user-friendly and highly accurate way. SFINX outperforms alternative techniques on two benchmark datasets and is available via the web interface at http://sfinx.ugent.be/.

INTRODUCTION

The analysis of protein-protein interactions (PPIs) enables scientists to connect genotypes with phenotypes and to answer fundamental biological questions or generate new hypotheses on the functions of proteins1. In this field, affinity purification-mass spectrometry (AP-MS) is a
classical approach wherein a protein of interest (bait) containing an epitope tag is purified under conditions that preserve the protein complex, allowing the identification of co-purifying proteins by mass spectrometry. Although AP-MS yields rich biological data and can identify new interactors, accurate data analysis is notoriously difficult because of the many false positives, which are mainly caused by non-specific protein binding. Non-specific proteins are often abundant proteins, for example from the cytoskeleton or the ribosome, or proteins that bind to the affinity matrices or to unfolded proteins. In addition, most AP-MS experiments focus on only a few baits (instead of the complete proteome) with a limited number of replicates and controls2. Several approaches have been developed to filter AP-MS data without stable isotope labeling2-4, for example PP-NSAF5, the CompPASS scores6 and SAINT7. These approaches use spectral count data, defined as the total number of peptide-to-spectrum matches assigned to a specific protein in the project of interest (Supplementary figure 1A). PP-NSAF heuristically determines the posterior probabilities of true interaction between baits and preys based on normalized spectral abundance factors (NSAFs)8. A crucial step of PP-NSAF is the calculation of vector ratios of specific over negative purifications, which precludes the use of this technique on datasets without parallel true negative control purifications. CompPASS calculates two scores: a Z-score for each interaction after mean centering and scale normalization of the original spectral counts, and a WD-score derived from scaled original spectral counts, based on individual protein abundance in the dataset and reproducibility over replicate bait purifications. SAINT normalizes the original spectral counts to the total number of spectra per project and to the length of each protein, to ultimately model mixture distributions with one component for the true interactions and one for the false interactions.

The ideal filter technique is highly accurate, fast and user-friendly, without the need to rely on extensive parameter optimization or external databases, which also makes it reproducible and unbiased. Because none of the current filter techniques combines all these features9, we developed SFINX, the Straightforward Filtering INdeX. SFINX filters out false positive interactions and then ranks true positives by their individual certainties. SFINX determines the probability that the observed interactions are the result of random events following a binomial distribution, and only retains interactions considered sufficiently exceptional according to an automatically determined cut-off, which also incorporates corrections for multiple testing. We benchmarked SFINX on two independent datasets and compared it with the other techniques. SFINX shows superior performance over the other approaches, and is highly intuitive and extremely fast. It does not require parameter optimization and is independent of external resources, which makes it more unbiased and reproducible. Both the algorithm and its website interface are highly intuitive, with limited need for user input and the possibility of immediate network visualization and interpretation.

MATERIALS & METHODS

The SFINX concept differs fundamentally from the other filter techniques in several key aspects. Firstly, SFINX automatically detects projects that contain the baits of interest, instead of relying on user-supplied metadata (as further discussed in the next section). This allows analysis that is less sensitive to bait-presence in ‘negative controls’, bait-absence in experiments, column memory effects between MS-runs, and mislabeling. Secondly, SFINX uses peptide counts instead of spectral counts, with the peptide count of a protein defined as the number of distinct
peptide sequences (irrespective of modifications) found for the protein in the project of interest (Supplementary figure 1A). The choice for peptide counts over spectral counts in SFINX stems from the focus on the certainty of identification of the potential interactor; indeed, the detection of a higher number of distinct peptides per protein signals a higher certainty of the presence of that protein in the sample. Put simply, if a protein with a high peptide count is more typical for projects with the bait than for projects without the bait, it can be considered as a bona fide interactor. Thirdly, SFINX focuses its scoring on the relevant range of potential protein interactions and stops when the reliability drops too low (as further discussed in the next section).

Overview of the SFINX algorithm

SFINX filters out false positive interactions and ranks true positives by their individual certainties instantaneously, but the underlying algorithm consists of several parts. The following paragraphs describe these parts in more detail, while figure 1 provides a general overview. The mathematical description can be found in Supplementary text 1. SFINX only needs two files as input. One file is a list of the baits of interest, while the other is a matrix with a column for each project (an individual MS run or experiment) and a row for every detected candidate protein interactor. The matrix itself is populated with the observed peptide counts for each protein per project, with the peptide count of a protein defined as the number of distinct peptide sequences identified for that protein in the project of interest, irrespective of any protein modifications. All entries in the matrix should be derived from similar experimental set-ups, with, for example, the same mass spectrometer, the same identification algorithm and the same parameters for these.
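The two inputs can be sketched as follows in Python. All names and counts below are invented for illustration; SFINX itself is implemented in R and expects these inputs as files.

```python
# Sketch of the two SFINX inputs (all names and counts are invented).
baits = ["bait_A", "bait_B"]          # file 1: the baits of interest

projects = ["proj1", "proj2", "proj3", "proj4"]   # one column per MS run
peptide_counts = {                    # file 2: rows = detected proteins
    "bait_A": [5, 4, 0, 0],
    "bait_B": [0, 0, 6, 0],
    "prey_X": [3, 2, 0, 1],
    "prey_Y": [1, 0, 0, 2],
}

# Check-up (a): a project counts as a bait-project for a bait whenever the
# bait itself is detected there, regardless of how the run was labelled.
def bait_projects(bait):
    return [p for p, c in zip(projects, peptide_counts[bait]) if c > 0]
```

In this toy matrix, `bait_projects("bait_A")` returns `['proj1', 'proj2']`: the bait-projects are inferred from the data rather than from user-supplied metadata.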

Before starting the core algorithm, SFINX first evaluates the consistency of the data (Check-up (a)). SFINX checks for the presence of the bait proteins in the data matrix and determines which columns contain the different baits. The bait proteins that are not present in the data matrix or are represented by only a single peptide count are eliminated from further analysis. Projects containing the bait are then set as bait-projects, even if this does not match the reported bait for that sample. This bait-centric way of analyzing the data breaks with the common practice of focusing on the project, where each project needs metadata to indicate the used bait. This allows SFINX to be flexible in the identities of the different projects, so correct analysis is less burdened by bait-presence in the negative controls, bait-absence in experiments, column memory effects between MS-runs or even mislabeling. After evaluating the data consistency, SFINX applies the core algorithm to each bait individually, which determines per possible interactor the probability that the observed distribution of peptide counts between bait-projects and a restricted set of non-bait-projects is the result of random events following a binomial distribution. The binomial distribution (Probability Mass Function (PMF) in Equation 1) is one of the most straightforward discrete probability distributions that is typically used for the calculation of the number of successes in a number of independent trials, if the probability of success is known. SFINX sees the sum of peptide counts in bait-specific projects as the number of successes (k), and it uses the total peptide count sum of the possible interactor as the number of trials (n). The probability of success (p) is the sum of all peptide counts for the bait-projects divided by the total sum of all peptide counts. 
This is conceptually equivalent to dividing the number of bait-projects by the total number of projects, but also accounts for possible imbalances between projects.

$\mathrm{PMF} = \binom{n}{k}\, p^{k} (1-p)^{n-k}$  (Equation 1)
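Equation 1 can be evaluated directly. The following Python sketch uses invented numbers purely for illustration; it is not the SFINX source code.

```python
from math import comb

def binom_pmf(k, n, p):
    # Equation 1: PMF = C(n, k) * p**k * (1 - p)**(n - k)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Illustrative numbers (not from the paper): a candidate interactor with
# n = 12 peptide counts in total, k = 10 of them in bait-projects, and
# bait-projects holding 40% of all considered peptide counts (p = 0.4).
score = binom_pmf(10, 12, 0.4)
# A small PMF value means the observed skew towards the bait-projects is
# unlikely to be a random event, flagging a candidate interactor.
```

The smaller the value, the less likely the observed peptide-count distribution arose by chance under the background probability p.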

SFINX limits the total number of projects that it uses to calculate the total peptide count sum (n) to prevent the use of too many irrelevant negative control projects. The number of negative control projects is limited to five times the number of bait-projects, provided these negative controls lack all of the baits of interest. This limit on the number of negative controls can be changed in the advanced version of SFINX, but changing it generally has no drastic effect on the performance of the algorithm. Only the complete elimination of this limit can have a drastic effect in some extreme cases with an overabundance of irrelevant negative controls. The SFINX core only retains interactions that are exceptional enough according to an internally defined cut-off that is determined separately for every bait. The applied cut-off is a generalization of the concept that the total sum of the peptide counts for the bait-projects has to surpass the total sum of the peptide counts for the non-bait-projects by at least the number of bait-projects. If we consider the most borderline case that should still be rejected, each bait-project would contain only one peptide count. The number of successes (k) would then equal the number of trials (n), and these would also equal the number of bait-projects for that specific bait. So, if we define the number of bait-projects as b and the probability of success as p, the cut-off of SFINX as supplied by Equation 1 is defined as $p^{b}$. As the internal working of SFINX involves many independent tests, strict corrections for multiple testing are essential. Therefore, after selection of interactors based on the PMF-derived score, SFINX also calculates the associated p-values for the selected interactions. Based on these p-values, SFINX then controls the family-wise error rate by Bonferroni correction, where all interaction evaluations for the same bait are considered to belong to the same family. As strict elimination of false positive interactions is essential, we opted for the Bonferroni correction because it is straightforward, fast and stringent.
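The cut-off and the Bonferroni step can be sketched as follows. This is our own minimal Python reimplementation for illustration, with invented numbers, not the SFINX source code.

```python
# Sketch of the SFINX cut-off and Bonferroni correction (illustrative).
def sfinx_cutoff(p, b):
    # Borderline case: one peptide count per bait-project, so k = n = b
    # and Equation 1 collapses to p**b.
    return p ** b

def bonferroni(p_values):
    # Family-wise error rate control: multiply each p-value by the size
    # of the family (all interaction tests for the same bait), cap at 1.
    m = len(p_values)
    return [min(1.0, pv * m) for pv in p_values]

cutoff = sfinx_cutoff(0.4, 3)               # p = 0.4, b = 3 bait-projects
corrected = bonferroni([0.001, 0.02, 0.5])  # -> [0.003, 0.06, 1.0]
```

Only interactions whose Equation-1 score falls below the cut-off survive, and their Bonferroni-corrected p-values are what SFINX reports alongside the scores.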

After SFINX generates a final output list for every bait, these lists are combined and the result is sorted by SFINX score. Each retained interaction gets a SFINX score that represents the probability of false interaction and allows interactions of sufficient certainty to be distinguished quantitatively (the scores are always reported together with the corresponding p-values). Hence, a low SFINX score indicates a high certainty of interaction. SFINX also outputs the results of the check-ups, and sometimes suggests optimization of the experimental design, e.g. if some bait proteins were not included in the rows of the input matrix, if a certain bait was only reported with one peptide count, or if a bait is too abundantly present in the samples and the number of alternative projects or true negative control projects is insufficient (Check-up (b)). The main output of SFINX consists of an interaction network and a detailed data matrix that the user can immediately analyze or open directly in network visualization software like Cytoscape10. The matrix consists of four columns: one for the baits, one for the SFINX scores, one for the identified preys and one for the associated p-values that were also used for the Bonferroni corrections. As a control, SFINX always has to find the bait itself at a p-value equal to its score, because the bait should obviously always be present in the bait-projects. Hence, ‘identified’ interactions between the same bait proteins do not define true interactions, but give a reference for the scores of the other interactors. In real-life datasets, most identified interactions will have SFINX scores that are higher (so lower certainty) than the reference, but some interactions can have a lower score (so higher certainty). An example of this last situation can be found in figure 1, where the bait proteins b and e interact with the proteins c and d, respectively, at lower scores (higher certainty) than the references themselves. Generally, every interaction gets a unique score, which improves quantitative differentiation between bona fide interactions.

SFINX is written in the R language11; the web interface is partly built with Shiny12, and the network visualization uses the networkD3 package for the creation of D3 JavaScript network graphs13. The example dataset on this interface is the TIP49 dataset derived from the publication by Sardiu et al5.

Comparison with other algorithms

We compared the performance of SFINX with PP-NSAF5, SAINT7 and the CompPASS Z-score and WD-score6 on two different human AP-MS datasets. Both datasets have been used for benchmarking by the other techniques, and differ from each other in several aspects. The first dataset (the TIP49 dataset) was introduced by the developers of the PP-NSAF technique5 and the second dataset (the DUB dataset) by the developers of CompPASS6, while both datasets were also evaluated by SAINT7. The TIP49 dataset contains complexes involved in chromatin remodeling and consists of 27 baits used in 35 purifications with 35 negative controls, while the DUB dataset focuses on deubiquitinating enzymes and consists of 75 baits with only one negative control. The DUB dataset hence contains more baits, but fewer negative controls, than the TIP49 dataset, and this lack of negative controls precludes the use of PP-NSAF on the DUB dataset. To compare the different techniques in a standardized and reproducible way, we used the benchmark datasets as described in the previous paragraph and reused as much as possible the previously reported results of the filter techniques on these datasets. The TIP49 dataset was used as originally introduced by the developers of the PP-NSAF technique5, and the PP-NSAF5, SAINT7 and CompPASS scores6 were all reused for this dataset. The DUB dataset was slightly different in the SAINT article7 compared with the CompPASS article6. As the CompPASS article
described the original dataset, this version was used for our calculations. For the DUB dataset, SAINT (v_2.3.4; http://sourceforge.net/projects/saint-apms/files/) was run locally, but the CompPASS scores were reused. In the comparison with the other techniques, we only show the results of the SAINT version that functions without negative controls, as this version was also used by Choi et al.7 and our own tests showed that the resolution of SAINT was only acceptable for the non-controlled SAINT variant. Further, we used the exact parameter combinations described by Choi et al.7 and in the user manual (seed value 123, fthres 0.1, fgroup 0.07 and var 0). The form of normalization used was not clear in the description by Choi et al.7, so we evaluated both forms. We decided to use the parameter “normalize 1” in the comparison of the different techniques as it yielded the best resolution and was thus the most favorable for SAINT. On each of the datasets, we applied the standard version of SFINX (as described before) and all filtered interactions were considered without any further subsetting or parameter optimization in the comparison with the other techniques. Note that several other filter techniques exist, but these are less suited for correct comparison with SFINX. Techniques like Decontaminator14 or SAINT-MS115 use identification scores or intensities as basic input, while SAINTexpress16 depends on the integration of external data sources (iRefIndex and Gene Ontology terms) for optimal performance. Furthermore, some other techniques were developed for the analysis of much larger interaction networks, sometimes covering more than half of an organism’s proteome, such as the Markov Clustering algorithm (MCL)17, the Socioaffinity Index (SAI)18, the Purification Enrichment score (PE)19 and the Hypergeometric Spectral Counts score (HGSCore)20. Almost all of these filter techniques for larger interaction networks rely heavily on reciprocity in the data.
In other words, all proteins in the analysis should be used as bait at least once in the
ideal case. Nevertheless, the HGSCore is theoretically less strongly dependent on reciprocity and also tries to tackle column memory effects between MS-runs. Furthermore, Pu et al. showed in their comparison of filter techniques for larger interaction networks that the HGSCore was generally the best performing technique9. As the authors of the TIP49 dataset also provided NSAFs, an unbiased comparison between SFINX and a refurbished HGSCore for medium-sized datasets is possible. We implemented the HGSCore as used by Pu et al.9 and used a ‘spoke’ model instead of a ‘matrix’ model for fair comparison. The ‘spoke’ model is typical for medium-scale filter techniques and only assumes interactions between the bait and each identified prey, while the ‘matrix’ model is mainly found in filter techniques for larger interaction networks and additionally assumes that the identified preys interact with each other. We focused on four benchmark parameters for the evaluation. We assessed the recall, precision and F1-score by benchmarking against the BIOGRID database, and derived the co-annotation rate of the selected interaction partners by determining the mean Gene Ontology Jaccard Indices (mGOJI). The recall is defined as the fraction of relevant interactions in the BIOGRID database that was retained by the technique, while the precision is defined as the fraction of the retained interactions that overlaps with the BIOGRID database. The recall and precision provide conservative estimates of the absence of false negatives and false positives, respectively. The F1-score, the harmonic mean of precision and recall, provides a conservative measure of the overall accuracy, while the mGOJI allows an evaluation that does not depend on previously found interactions. For the determinations of the recall, precision and F1-score, a subset of the BIOGRID database (3.2.108, downloaded on 30/01/2014) was used.
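These three metrics can be computed as follows; the protein pairs below are invented, with a toy set standing in for the BIOGRID reference.

```python
# Toy computation of the benchmark metrics (pairs are invented).
predicted = {("baitA", "prey1"), ("baitA", "prey2"),
             ("baitB", "prey3"), ("baitB", "prey4")}   # retained by a filter
reference = {("baitA", "prey1"), ("baitB", "prey3"),
             ("baitB", "prey5")}                        # known interactions

tp = len(predicted & reference)          # true positives: 2
recall = tp / len(reference)             # fraction of reference recovered
precision = tp / len(predicted)          # fraction of predictions confirmed
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
```

Here recall is 2/3, precision is 1/2, and the F1-score, being a harmonic mean, is pulled towards the weaker of the two.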
The subset consisted of only physical interactions with both prey and bait from human origin. All entries in BIOGRID that were
uniquely derived from the study by Sowa et al. were removed, because the presence of these interactions could have skewed the benchmarking process. For each bait, the relevant entries from the BIOGRID database were extracted in both directions; the bait could appear as either bait or prey in the database. The output of the different techniques was taken as such, and was compared with this reference database relevant for the bait. Hence, baits that detected other baits as interactors were also treated as such, and these scores were not unified or modified in any sense. For the determinations of the mGOJI, the human Gene Ontologies were retrieved in R via biomaRt21 (by "ENSEMBL_MART_ENSEMBL", host="www.ensembl.org", dataset = "hsapiens_gene_ensembl") on 07/05/2015. For the determination of the speed of the SFINX algorithm, the performance of the core algorithm was evaluated with the "system.time" command in R. All performance estimates were done on a laptop (8 GB RAM, Intel® Core™ i7-2760QM 2.40GHz). For some of the techniques, the evaluations can yield different results depending on the way that the BIOGRID database is ordered, because some techniques lack resolution in parts of the relevant range of protein interactions. Therefore, all evaluated techniques were always considered under both the most favorable and the least favorable BIOGRID sorting, and the mean of both was reported. For the determination of the overlap between the results of the different filter techniques on the different datasets (as in Supplementary figure 2), we sorted the predicted interactions of each of the techniques based on the scores generated by each of these techniques. After the sorting, equal numbers of top interactors were selected from each technique, where the exact number equaled the maximal number of interactions identified by SFINX (as described in the results & discussion section). The overlap between two techniques was calculated as
the Jaccard index of this overlap, in which the intersection of both output datasets was compared to the union.
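The overlap computation reduces to a set operation; a minimal Python sketch with invented pairs:

```python
# Jaccard index of two techniques' top-interaction sets: the size of the
# intersection divided by the size of the union (pairs are invented).
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

top_sfinx = {("b1", "p1"), ("b1", "p2"), ("b2", "p3")}
top_other = {("b1", "p1"), ("b2", "p3"), ("b2", "p4")}
overlap = jaccard(top_sfinx, top_other)   # 2 shared of 4 distinct -> 0.5
```

The same index underlies the mGOJI, where the sets being compared are Gene Ontology annotations rather than interaction pairs.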

RESULTS & DISCUSSION

SFINX is a straightforward, fast and accurate technique to detect true positive interactions and rank them by their individual certainties. The underlying algorithm determines for each candidate interactor the probability that the observed distribution of peptide counts between bait-projects (the projects where the bait is present) and a limited set of non-bait-projects is the result of random events following a binomial distribution, and it only needs two input files: a list of bait proteins and a matrix with the associated peptide counts (Figure 1). SFINX automatically detects the projects that contain the baits of interest, and it functions without the need for parameter optimization or external resources (like golden reference datasets and user-supplied metadata). Hence, SFINX is more unbiased and reproducible, and less sensitive to bait-presence in ‘negative controls’, bait-absence in experiments, column memory effects between MS-runs and mislabeling. The algorithm is fast and straightforward, so SFINX also promises to scale well to larger datasets and can easily be integrated in future processing workflows. After the development of the SFINX algorithm and the creation of its web interface, we compared the performance of SFINX with that of the alternative filter techniques. We evaluated the overlap of the results and we benchmarked all techniques with different parameters on two independent datasets. The SFINX technique clearly complements the other benchmarked techniques (Supplementary figure 2) and it also outperforms them on both benchmark datasets for any tested parameter (Figure 2 and Supplementary table S1). SFINX is more accurate than any of
the alternatives upon comparison of the areas under the curve for the F1-scores, while the co-annotation rate of the selected interaction partners is also clearly enhanced upon comparison of the areas under the curve for the mGOJI scores. Moreover, the high-ranking GO terms are predominantly relevant for the individual benchmark datasets (Supplementary table S2). The techniques with the second best performance for each of the benchmark datasets are those that were originally designed or tested on these respective datasets. SAINT outperforms the CompPASS WD-score on the TIP49 dataset, but the inverse is true on the DUB dataset, although the spread between the performances of all the techniques is generally larger for the TIP49 dataset than for the DUB dataset (Figure 2 and Supplementary table S1). This difference in spread is probably the result of differences in the analytical complexity of these datasets. The DUB dataset has, for example, fewer interconnected baits, which should make correct analysis more feasible for all techniques. Hence, SFINX shows a 19.4% enhanced accuracy and a 28.6% enhanced mGOJI score on the TIP49 dataset when compared with the second best performing technique, SAINT, but it shows a 9.5% enhanced accuracy and a 5.0% enhanced mGOJI score on the DUB dataset when compared with the second best performing technique, the CompPASS WD-score. Note that the performance evaluations in Figure 2 only focus on the top 642 and 1568 interactions of the TIP49 and DUB datasets, respectively. Supplementary figure 3 shows that SFINX only generates output up to a maximum number of identified interactions. This phenomenon is the result of one of the core principles of SFINX: it avoids generating output if the underlying data are unreliable. These maxima represent the functional ranges for SFINX on these datasets, but they also match the realistic ranges according to estimates of the human interactome22. An average protein is predicted to interact with around 20 to 28 proteins. SFINX identifies maximally 642 interactions for the TIP49 dataset with 26 baits, which equals 24.69
interactions on average, while SFINX identifies maximally 1568 interactions for the DUB dataset with 74 baits, which equals 21.19 interactions on average. Moreover, most of the compared techniques also reach their maximum F1-score (and thus accuracy) around the maximum of these ranges (Figure 2; see Supplementary figure 3 for visualization of a broader range). Also note that the data points in Figure 2 summarize a range of performances. Supplementary figure 4 illustrates that some of the tested alternative techniques lack performance resolution in the relevant domain. In other words, several of the other techniques score almost all interactions identically, while SFINX is able to quantitatively distinguish between bona fide interactions. In theory, lower-resolution techniques could be redefined by, for example, changing the number of output digits of precision or the underlying distribution, but this is not typically available as a standard option in existing implementations. The SFINX tool as implemented allows a better performance evaluation and enables users to prioritize candidate interactions for downstream analyses. SFINX couples this enhanced resolution and higher accuracy with great speed; processing the entire DUB dataset takes only 140 ms. SFINX is also user-friendly in other ways, as it eliminates the need for extensive parameter optimization and is accessible through an easy-to-use web-based user interface at http://sfinx.ugent.be/. Immediately after the data matrix and the bait list are entered, SFINX generates several kinds of output, so the user can start analysing the data immediately with no need for extra information. The tabular output allows the user to focus on certain proteins or to analyse the details of the filtering, while the network output allows force-directed visualisation of the interactions.
If more advanced network visualisation and analysis are needed, the user can directly download the filtered data as a file that is immediately ready for use in spreadsheet programs or network visualization software such as Cytoscape (10). For further details, users can consult the online tutorial on the SFINX website.

As SFINX does not compensate for suboptimal or missing underlying data and takes peptide counts as input, it will not detect proteins with fewer than two observable peptides across all experiments, for example when the proteins are too small, when the peptides ionize poorly, or when the tryptic peptides do not generate sufficiently intense fragment ions. When users attempt to analyze such bait proteins, SFINX skips these proteins and warns the user. With more information, such as higher peptide counts or more experimental replicates, proteins are more readily detected, which also holds true for many other filter techniques (9). As SFINX is designed to take peptide counts as input, we discourage the use of spectral counts to bypass the restrictions mentioned above. For spectral-count input, SFINX does not guarantee similar accuracy levels for all types of datasets, and the internal cutoff system will often become too liberal (Supplementary figure 1B).

Like the scores of most other filter techniques that focus on medium-sized AP-MS datasets, SFINX scores are asymmetric. In other words, bait A might detect bait B with a different score than bait B detects bait A. This asymmetry results directly from the scoring scheme of SFINX, which accounts for the uniqueness of the interaction, a property that is often different for two interacting bait proteins. Filter techniques that focus on much larger and completely interconnected interactomes, such as the SAI (18) and the PE score (19), typically need information about the reciprocity of the interaction, and thus report symmetric scores, but they are less applicable to medium-sized datasets where this information is often absent or incomplete.
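The downloadable file mentioned above is essentially a scored edge list. A minimal sketch of producing such a file is shown below; the `write_edge_table` helper and the column names are hypothetical and do not reflect the exact SFINX output schema, but the resulting tab-separated table is the kind of file that Cytoscape's network import and any spreadsheet program can read.

```python
import csv

def write_edge_table(interactions, path):
    """Write bait-prey interactions with their scores to a TSV edge
    table. Each interaction is a (bait, prey, score) tuple; the
    header row labels the columns for downstream import."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["bait", "prey", "score"])
        for bait, prey, score in interactions:
            # %.4g keeps scores compact without losing rank order
            writer.writerow([bait, prey, f"{score:.4g}"])

# Hypothetical filtered output for a single bait.
write_edge_table([("TP53", "MDM2", 0.0123), ("TP53", "EP300", 0.2)],
                 "sfinx_network.tsv")
```

Because the scores are kept as a third column, the same file supports both force-directed visualization and score-based prioritization in downstream tools.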
As stated before, the HGSCore (20) is not as heavily dependent on reciprocity, and the authors of the TIP49 dataset provided NSAFs, which allowed an unbiased comparison between SFINX, an HGSCore refurbished for medium-sized datasets, and the other benchmarked filter techniques. Although SFINX and the HGSCore reported similar mGOJI scores in the relevant cut-off range, SFINX was clearly more accurate than the HGSCore (Supplementary figure 5), with the HGSCore showing accuracy levels between those of SAINT and the CompPASS WD-score.
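A simplified reading of the mGOJI metric used in this comparison, the mean Jaccard index of the GO-term sets of interacting protein pairs, can be sketched as follows. The helper names and the toy annotations are illustrative assumptions, not the exact benchmarking code.

```python
def jaccard(a, b):
    """Jaccard index of two GO-term sets: |intersection| / |union|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def mean_go_jaccard(interactions, go_annotations):
    """Mean GO Jaccard index (mGOJI) over a list of (protein, protein)
    interactions. go_annotations maps a protein name to its set of
    GO terms; unannotated proteins contribute an empty set."""
    scores = [jaccard(go_annotations.get(a, set()),
                      go_annotations.get(b, set()))
              for a, b in interactions]
    return sum(scores) / len(scores) if scores else 0.0

# Toy annotations: A and B share one of three distinct terms.
go = {"A": {"GO:1", "GO:2"}, "B": {"GO:2", "GO:3"}, "C": {"GO:4"}}
score = mean_go_jaccard([("A", "B"), ("A", "C")], go)
```

Higher mean overlap between the GO annotations of predicted partners indicates a more functionally coherent, and thus presumably more accurate, interaction list.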

CONCLUSION

In conclusion, SFINX identifies known as well as new protein interactions in a fast and user-friendly way, and produces highly accurate results that can be prioritized thanks to the high resolution of the obtained scores. Moreover, SFINX allows immediate network visualization and interpretation through its freely accessible web interface.

ASSOCIATED CONTENT

Supporting Information

SFINX uses peptide counts instead of spectral counts (Supplementary figure 1A); SFINX with peptide counts versus spectral counts (Supplementary figure 1B); overlap of the interactions detected by the different filter techniques (Supplementary figure 2); benchmarking visualized over a broader range (Supplementary figure 3); benchmarking visualized with confidence areas (Supplementary figure 4); benchmarking of the HGSCore in comparison with the other techniques on the TIP49 data (Supplementary figure 5); numeric performance increases of SFINX over alternative techniques (Supplementary table S1); GO terms used in the benchmarking (Supplementary table S2). This material is available free of charge via the Internet at http://pubs.acs.org.

AUTHOR INFORMATION

Corresponding Author
*Sven Eyckerman, VIB Medical Biotechnology Center, A. Baertsoenkaai 3, B-9000 Gent, Belgium. Tel: +32-9-264.92.73. Fax: +32-9-264.94.90. Email: [email protected]

Present Addresses
a) Medical Biotechnology Center, VIB, B-9000 Ghent, Belgium
b) Department of Biochemistry, Ghent University, B-9000 Ghent, Belgium
c) Advanced Database Research and Modelling (ADReM), Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
d) Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp / Antwerp University Hospital, Edegem, Belgium
e) Bioinformatics Institute Ghent, Ghent University, B-9000 Ghent, Belgium


Author Contributions
KT conceived, developed, and benchmarked the SFINX algorithm and developed the online user interface, with support from PM, KL, and LM. SE supervised the project, assisted by JT and KG. PM and LM suggested benchmarking strategies. All authors provided feedback on the algorithm, its benchmarking, and the user interface. KT wrote the manuscript with input from all authors.

Notes
The authors declare no competing financial interests.

ACKNOWLEDGEMENT

We thank M Washburn and M Sowa for providing additional information about the datasets used in the benchmarking, SJ Wodak and J Vlasblom for providing insight into the code of their version of the HGSCore, and S Degroeve, SJ Wodak, and D Ratman for discussions. We also thank Peter Van den Hemel for setting up the server, and G De Jaeger, D Eeckhout, E Van Quickelberghe, G Vandemoortele, and N Samyn for orthogonal testing of SFINX. KT is a PhD student with the Agency for Innovation by Science and Technology (IWT). SE was supported by a Methusalem grant to JT. KG and LM acknowledge support from the PRIME-XS project, grant agreement number 262067, funded by the European Union 7th Framework Program; the Ghent University Multidisciplinary Research Partnership “Bioinformatics: from nucleotides to networks”; the Ghent University Concerted Research Action grant BOF12/GOA/014; and the Fund for Scientific Research – Flanders (grant G.0113.12). KL and PM also acknowledge support from the Fund for Scientific Research – Flanders (grant G.0903.13N). JT was supported by grants from IUAP P6/36, the GROUP-ID MRP-UGent, and the Fund for Scientific Research – Flanders (grants G.0747.10N and G.0864.10), and is a recipient of an ERC Advanced grant (CYRE, 340941).

ABBREVIATIONS

AP-MS, Affinity Purification-Mass Spectrometry; CompPASS, Comparative Proteomics Analysis Software Suite; GO, Gene Ontology; HGSCore, Hypergeometric Spectral Counts score; MCL, Markov CLustering algorithm; mGOJI, mean Gene Ontology Jaccard Indices; MS, Mass Spectrometry; NSAF, Normalized Spectral Abundance Factor; PE, Purification Enrichment score; PPIs, Protein-Protein Interactions; PP-NSAF, algorithm for calculation of Posterior Probabilities based on Normalized Spectral Abundance Factors; SAI, Socio-Affinity Index; SAINT, Significance Analysis of INTeractome; SFINX, Straightforward Filtering INdeX.


REFERENCES

1. Rolland, T. et al. A proteome-scale map of the human interactome network. Cell 159, 1212-1226 (2014).
2. Nesvizhskii, A.I. Computational and informatics strategies for identification of specific protein interaction partners in affinity purification mass spectrometry experiments. Proteomics 12, 1639-1655 (2012).
3. Armean, I.M., Lilley, K.S. & Trotter, M.W. Popular computational methods to assess multiprotein complexes derived from label-free affinity purification and mass spectrometry (AP-MS) experiments. Mol Cell Proteomics 12, 1-13 (2013).
4. Meysman, P. et al. Protein complex analysis: from raw protein lists to protein interaction networks. Mass Spectrometry Reviews, in press (2015).
5. Sardiu, M.E. et al. Probabilistic assembly of human protein interaction networks from label-free quantitative proteomics. Proc Natl Acad Sci U S A 105, 1454-1459 (2008).
6. Sowa, M.E., Bennett, E.J., Gygi, S.P. & Harper, J.W. Defining the human deubiquitinating enzyme interaction landscape. Cell 138, 389-403 (2009).
7. Choi, H. et al. SAINT: probabilistic scoring of affinity purification-mass spectrometry data. Nat Methods 8, 70-73 (2011).
8. Zybailov, B. et al. Statistical analysis of membrane proteome expression changes in Saccharomyces cerevisiae. J Proteome Res 5, 2339-2347 (2006).
9. Pu, S. et al. Extracting high confidence protein interactions from affinity purification data: At the crossroads. J Proteomics (2015).
10. Smoot, M.E., Ono, K., Ruscheinski, J., Wang, P.L. & Ideker, T. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 27, 431-432 (2011).
11. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/ (2013).
12. Chang, W. Shiny: Web Application Framework for R. R package version 0.11. http://CRAN.R-project.org/package=shiny (2015).
13. Gandrud, C., Allaire, J.J. & Lewi, B.W. networkD3: Tools for Creating D3 JavaScript Network Graphs from R. R package version 0.1.1. http://CRAN.R-project.org/package=networkD3 (2014).
14. Lavallee-Adam, M., Cloutier, P., Coulombe, B. & Blanchette, M. Modeling contaminants in AP-MS/MS experiments. J Proteome Res 10, 886-895 (2011).
15. Choi, H., Glatter, T., Gstaiger, M. & Nesvizhskii, A.I. SAINT-MS1: protein-protein interaction scoring using label-free intensity data in affinity purification-mass spectrometry experiments. J Proteome Res 11, 2619-2624 (2012).
16. Teo, G. et al. SAINTexpress: improvements and additional features in Significance Analysis of INTeractome software. J Proteomics 100, 37-43 (2014).
17. Enright, A.J., Van Dongen, S. & Ouzounis, C.A. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30, 1575-1584 (2002).
18. Gavin, A.C. et al. Proteome survey reveals modularity of the yeast cell machinery. Nature 440, 631-636 (2006).
19. Collins, S.R. et al. Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol Cell Proteomics 6, 439-450 (2007).
20. Guruharsha, K.G. et al. A protein complex network of Drosophila melanogaster. Cell 147, 690-703 (2011).
21. Durinck, S., Spellman, P.T., Birney, E. & Huber, W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat Protoc 4, 1184-1191 (2009).
22. Stumpf, M.P. et al. Estimating the size of the human interactome. Proc Natl Acad Sci U S A 105, 6959-6964 (2008).


For TOC only

FIGURES

Figure 1. Overview of the SFINX pipeline.

Figure 2. Benchmarking of interactions obtained with different filter techniques.
