Protein Inference Using Peptide Quantification Patterns - Journal of

Article pubs.acs.org/jpr

Protein Inference Using Peptide Quantification Patterns

Pieter N. J. Lukasse*,†,‡,§ and Antoine H. P. America†,‡,∥

† Plant Research International, Wageningen UR, P.O. Box 16, 6700AA Wageningen, The Netherlands
‡ Netherlands Proteomics Centre, P.O. Box 80082, 3508 TB Utrecht, The Netherlands
§ Netherlands BioInformatics Centre, Route 260, 6525GA Nijmegen, The Netherlands
∥ Centre for BioSystems Genomics, P.O. Box 98, 6700 AB Wageningen, The Netherlands

Supporting Information

ABSTRACT: Determining the list of proteins present in a sample, based on the list of identified peptides, is a crucial step in the untargeted proteomics LC−MS/MS data-processing pipeline. This step, commonly referred to as protein inference, turns out to be a very challenging problem because many peptide sequences are found across multiple proteins. Current protein inference engines typically use peptide to spectrum match (PSM) quality measures and spectral count information to score protein identifications in LC−MS/MS data sets. This is, however, not enough to confidently validate or otherwise rule out many of the proteins. Here we introduce the basis for a new way of performing protein inference based on accurate quantification patterns of identified peptides, using the correlation of these patterns to validate peptide to protein matches. For the first implementation of this new approach, we focused on (1) distinguishing between unambiguously and ambiguously identified proteins and (2) generating hypotheses for the discrimination of subsets of the ambiguously identified proteins. Our preprocessing pipelines support both labeled LC−MS/MS and label-free LC−MS followed by LC−MS/MS for providing the peptide quantification. We apply our procedure to two published data sets and show that it is able to detect and infer proteins that would otherwise not be confidently inferred.

KEYWORDS: protein inference, isoforms, secondary protein, protein quantification, correlation



INTRODUCTION

In a typical LC−MS/MS data-processing pipeline, the MS/MS spectra of peptides deriving from trypsin-digested proteins are matched against theoretical spectra of peptides from in silico digested protein sequences. This generates a list of peptide identifications, which have to be compiled back into a list of proteins. The problem is that of the 3.8 million fully tryptic peptides found in the human protein database, over 2 million are shared between two or more proteins.1 This huge overlap between protein sequences makes it hard to determine, from the list of peptides, which proteins are really in the sample. How to solve this so-called protein inference problem is still an ongoing debate.2 Nevertheless, a number of solutions have already been proposed, some of which are reviewed in various articles.1−3

The main automated protein inference solutions we found published to date all focus on using a combination of peptide to spectrum match (PSM) quality measures and parsimony heuristics to rank the proteins.1,2,4 The PSM quality measures and the number of PSMs linked to each protein are used to determine the quality of the protein identifications, while the parsimony heuristics are used to give a higher ranking to proteins that are part of the minimal list of proteins that best account for the observed peptides (also called primary proteins by Meyer-Arendt et al.1). Although this may seem like a good approach, it is accompanied by the risk of missing the proteins that do not belong to the minimal list (also called secondary proteins by Meyer-Arendt et al.1), which can, in practice, be isoforms or other proteins showing some sequence overlap. Furthermore, if protein quantification is done in a next step, there is an additional risk of overestimating the quantities of the primary proteins and assigning to them a wrong quantification pattern over the different samples. Figure 1 illustrates these problems.

In this work, we present a way of indicating the putative presence of secondary proteins. Our ideas are based on the concept that peptides belonging to the same protein should show a good correlation in their quantification patterns, while peptides originating from two or more different proteins present in the sample will show lower correlation. By focusing on the inference of secondary proteins, our method, in fact, continues where parsimony-based inference methods stop.

A recent publication by Ahmad et al.6 has shown the benefits of using quantitative data, and in their work they give an example of how they used a quantification data analysis strategy similar to ours for inferring the presence of a protein isoform for which

© 2014 American Chemical Society

Received: October 28, 2013
Published: May 12, 2014

dx.doi.org/10.1021/pr401072g | J. Proteome Res. 2014, 13, 3191−3199

Journal of Proteome Research


separate quantification and identification runs (label-free MS and MS/MS), then assignments with lower quality are also removed and a ranking algorithm resolves any ambiguous quantification assignments (i.e., the same quantification assigned to different peptides). Finally, proteins that do not have a minimum coverage and a minimum number of good scoring peptide identifications, according to thresholds specified by the user, are removed as well.

(2) Correlation clustering: Sibling peptides (i.e., sharing at least one protein hit) are clustered together based on their expression patterns. They are clustered using agglomerative hierarchical clustering with (1 − Pearson’s correlation) as the distance measure and average linkage as the linkage criterion. A cluster stops growing when the next addition to the cluster causes the average correlation between the members to drop below a given correlation threshold specified by the user. Currently, the choice of this threshold is a user decision, which can be guided by the statistics Quantifere provides on the correlation differences between (full)sibling and not-(full)sibling peptides. (See also Figure 25 of the Supporting Information and Figure 7 later.)

(3) Protein classification: The proteins are classified as being primary or secondary. Primary proteins are the ones belonging to the minimum set of proteins needed to account for all peptide identifications.

(4) Secondary proteins inference: Proteins classified as secondary proteins in the previous step are verified based on their peptide matches and the clusters in which these peptides are found. If a secondary protein has a peptide match that does not cluster with a foreign peptide, then the protein is marked as “inferred”, and this peptide match is marked as an “inference peptide” for this secondary protein. A peptide is a foreign peptide to a protein if it is a sibling to one of its peptide matches but it does not match the protein in question. (For example, in Figure 3, the peptide matching only Q8VCW9 is a foreign peptide to P56593.)

(5) Inference postprocessing: Each “inference peptide” is checked again to verify whether a better explanation can be found for its deviating expression pattern. For example, in some cases the deviating pattern can be explained by the unique set of primary proteins to which the peptide matches. If the “inference peptide” can be explained in a simpler way, that is, requiring fewer secondary proteins to account for its presence, it loses its “inference peptide” status. Finally, secondary proteins that have no more inference peptides at the end of this step also lose their “inferred” status.
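As an illustration of the correlation clustering step, the following is a minimal Python sketch (Quantifere itself is written in Java, and this is not its code). It assumes that each peptide comes with a numeric expression profile over the samples; the function names, the dictionary layout, and the rule of trying candidate merges in order of increasing average-linkage distance are assumptions made for this example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def mean_pairwise_corr(members, profiles):
    """Average Pearson correlation over all peptide pairs inside one cluster."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 1.0
    return sum(pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)


def cluster_siblings(profiles, threshold):
    """Agglomerative clustering with distance 1 - Pearson and average linkage.

    A merge is only accepted if the merged cluster keeps an average
    within-cluster correlation of at least `threshold`.
    """
    clusters = [[name] for name in profiles]
    while True:
        candidates = []
        for i, j in combinations(range(len(clusters)), 2):
            # Average linkage: mean distance over all cross-cluster pairs.
            dists = [1 - pearson(profiles[a], profiles[b])
                     for a in clusters[i] for b in clusters[j]]
            candidates.append((sum(dists) / len(dists), i, j))
        for _, i, j in sorted(candidates):
            merged = clusters[i] + clusters[j]
            if mean_pairwise_corr(merged, profiles) >= threshold:
                clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
                clusters.append(merged)
                break
        else:
            # No remaining merge satisfies the threshold: clustering is done.
            return clusters


# Hypothetical sibling peptides quantified over four samples.
example_profiles = {
    "p1": [1, 2, 3, 4],
    "p2": [2, 4, 6, 8.5],
    "p3": [4, 3, 2, 1],
    "p4": [8, 6, 4, 2.5],
}
print(cluster_siblings(example_profiles, threshold=0.8))
```

With the hypothetical profiles above, the two rising and the two falling peptides end up in separate clusters, which is the kind of split that the later inference steps build on.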

Figure 1. Section of peptide to protein matches from one of the test data sets we used,5 represented as a network. Here we see a primary protein (Q9WUD0 in green) and a secondary protein (P12791 in yellow). Primary proteins like Q9WUD0 are sufficient for explaining the presence of all peptides in the results, but ruling out the presence of secondary proteins like P12791 could lead to false-negatives, which, in turn, would lead to quantification errors as a result of overestimating the primary proteins.
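The primary/secondary split shown in Figure 1 can be sketched in a few lines of Python. Finding a true minimum set of proteins is a set-cover problem, so this sketch uses the common greedy approximation; that choice, the function name, and the peptide labels are assumptions made for illustration and are not taken from the Quantifere code. Only the two accession numbers come from Figure 1.

```python
def classify_proteins(protein_to_peptides):
    """Split proteins into primary (greedy peptide cover) and secondary (the rest).

    `protein_to_peptides` maps a protein accession to the set of identified
    peptides that match it.
    """
    uncovered = set().union(*protein_to_peptides.values())
    primary = []
    while uncovered:
        # Pick the protein that explains the most still-unexplained peptides;
        # sorting first makes tie-breaking alphabetical and deterministic.
        best = max(sorted(protein_to_peptides),
                   key=lambda prot: len(protein_to_peptides[prot] & uncovered))
        primary.append(best)
        uncovered -= protein_to_peptides[best]
    secondary = sorted(set(protein_to_peptides) - set(primary))
    return primary, secondary


# Accessions from Figure 1; the peptide names are hypothetical placeholders.
matches = {
    "Q9WUD0": {"pepA", "pepB", "pepC"},
    "P12791": {"pepB", "pepC"},
}
print(classify_proteins(matches))
```

Because Q9WUD0 alone explains every peptide in this toy network, P12791 is left as a secondary protein whose presence has to be supported by other evidence, such as the quantification patterns discussed in this work.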

there was no isoform-specific peptide identification. A manual approach to augment protein inference based on expression ratios is also described in the work by Colaert et al.7 for the software tool ROVER. However, their approach is in many ways different from ours and relies on extensive manual curation work, as opposed to our approach, where we have aimed for automated and systematic ranking of candidate isoforms based on clustering procedures of expression profiles of the individual peptides. To test our method, we apply it to two published LC−MS/MS quantitation data sets and show that it provides a way of discovering likely secondary protein candidates, thereby helping the researcher to quickly find new proteins and possible isoforms present in his or her data.



IMPLEMENTATION

General Idea

The procedure we propose uses the following general ideas for inferring secondary proteins: (1) Group peptides on protein and expression: group peptides by having a common protein match and by having a good correlation in their expression patterns over the samples. (2) Infer secondary proteins based on correlation: if a secondary protein has one or more peptides that show a different expression pattern from the pattern of their other sibling peptides that are not connected to this secondary protein, then this secondary protein is inferred to be present in the sample. Figure 2 shows an example of a secondary protein being classified as “not evidenced”, and Figure 3 shows an example of a secondary protein being inferred as “present”.
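The second idea can be made concrete with a small Python sketch. It assumes that the peptides have already been grouped into correlation clusters and that a peptide is "foreign" to a protein when it shares a protein with one of that protein's peptides but does not match the protein itself. The function names and data layout are hypothetical; only the accessions and the peptide GTEVFPILGSLMTDPK are taken from Figure 3.

```python
def foreign_peptides(protein, peptide_to_proteins):
    """Peptides that are siblings of the protein's peptides but do not match it."""
    own = {pep for pep, prots in peptide_to_proteins.items() if protein in prots}
    sibling_proteins = set().union(*(peptide_to_proteins[pep] for pep in own))
    return {pep for pep, prots in peptide_to_proteins.items()
            if protein not in prots and prots & sibling_proteins}


def inference_peptides(protein, peptide_to_proteins, clusters):
    """Peptides of `protein` whose cluster contains no foreign peptide."""
    foreign = foreign_peptides(protein, peptide_to_proteins)
    evidence = set()
    for pep, prots in peptide_to_proteins.items():
        if protein not in prots:
            continue
        # A peptide missing from every cluster is treated as its own cluster.
        cluster = next((c for c in clusters if pep in c), {pep})
        if not (cluster & foreign):
            evidence.add(pep)
    return evidence


# Toy network loosely modeled on Figure 3; the shared peptide names are made up.
peptide_to_proteins = {
    "GTEVFPILGSLMTDPK": {"Q8VCW9"},
    "shared_pep_1": {"Q8VCW9", "P56593"},
    "shared_pep_2": {"Q8VCW9", "P56593"},
}
clusters = [{"GTEVFPILGSLMTDPK", "shared_pep_1"}, {"shared_pep_2"}]
print(inference_peptides("P56593", peptide_to_proteins, clusters))
```

In this toy case the secondary protein would be reported as inferred, because one of its peptides follows a pattern that no foreign peptide shares. If all three peptides fell into a single cluster, the returned set would be empty and the protein would stay "not evidenced".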

Solution Implementation

The quantification-based inference solution that we describe in this work is available on the main Galaxy8 Toolshed (http://toolshed.g2.bx.psu.edu/) under the name Quantifere. Galaxy provides the end-users with an intuitive graphical user interface to Quantifere, allowing them to run Quantifere on their uploaded data via their browser. To support users in preparing the right data files needed by Quantifere, we have also added a number of other tools to Galaxy that are listed in the following section and in Tables 1 and 2. The backend code is written in Java, and the front-end part is addressed by the standard Galaxy tool wrapping utilities. Advanced users can also use the tool via the command line. Galaxy is an open-source, web-based platform for bioinformatics research (https://usegalaxy.org/). It provides a flexible environment for tool integration into diverse data-processing workflows.

Inference Procedure

The whole inference procedure is performed by the following steps: (1) Data preprocessing and data cleansing: For the inference procedure to work, it is crucial that the quantifications assigned to each peptide are correct. Errors can arise during the quantification as well as during the identification steps. Peptide identifications with low statistical confidence measures are removed. In case the quantification assignments had to be done by aligning


Figure 2. Here we see a secondary protein group (P00184) and all of its peptides, together with the protein-specific peptides from the primary protein (P00186). Selected peptides are shown as blue nodes. The line plot on the bottom shows the scaled expression levels of the selected peptides over the four samples (labeled 0 to 3) and shows that these peptides correlate well with each other but also correlate very well with the other peptides from the primary protein. This means we have no evidence to support the presence of the secondary protein group P00184.

The graphical user interface also enables easy access for nonprogrammer experts. The use of pull-down menus provides background information for the required input parameters. Also, the use of saved histories and dedicated workflows enables easy rerunning of data-processing workflows with altered parameters or with new data. A screenshot of one of the steps of Quantifere in Galaxy is provided in the Supporting Information (Figure 29), which also shows the parameters that can be set by the user in the last step of the Quantifere workflow. The user interfaces that are available for all steps can be seen by browsing through the tools in the Galaxy Toolshed (more details provided together with Figures 27 and 28 of the Supporting Information).

Quantifere currently supports an adaptation of Institute of Systems Biology’s (ISB) APML format9 for the coupled peptide plus expression data, and this same APML format or the HUPO-PSI’s mzIdentML format10 for the complete set of peptide identifications. The scheme in Figure 4 shows an overview of the Quantifere solution in the context of both labeled and label-free experiments. Our current implementation of the Quantifere procedure is divided over two steps: Quantifere-prep and Quantifere-main. In the first one we accommodated part of the “data preprocessing and data cleansing” procedure. This split gives the user the possibility to visualize and tune the data cleanup results before continuing with the next step of the Quantifere procedure, schematically displayed in Figure 5.

Support for External Tools and Standards

Our solution potentially supports a range of proteomics tools, as the necessary data processing, data conversion, and compliance with a subset of the HUPO-PSI mzML and mzIdentML standards are supported and integrated in one way or another. Below, we highlight the main third-party tools supported. Tables 1 and 2 provide an overview of how each workflow step is currently supported. For the “MS/MS search” step, we support systems that are able to take in mzML format and generate output in either mzIdentML or our variant of the APML format. To be able to support the popular X!Tandem search engine,12 we also developed, in collaboration with others,13 a tool to convert X!Tandem output to mzIdentML. To be able to support the Waters


Figure 3. Simplified illustration of what the algorithm is looking for to validate secondary proteins. (A) An example is displayed where two peptides of the secondary protein P56593 have a different expression when compared with the protein-specific peptide GTEVFPILGSLMTDPK selected from the primary protein (Q8VCW9). (B) Selecting more peptides shows two “main” patterns (rough approximations shown in the dashed and dotted lines), suggesting the presence of more than one protein.

Table 1. Support for Labeled Data Workflow Steps^a

workflow step | tool name | tool implementation and availability | support mode | support availability
MS/MS search | X!Tandem | third party, open source | Galaxy wrapper and search results display using xiSPEC spectrum viewer27 | Galaxy Toolshed (PRIMS packages)
MS/MS search | Waters PLGS | third party, commercial | PLGS plugin exporting peptide identifications to APML format | Our PLGS Plugin can be made available on request.
MS/MS search | other search engines capable of producing mzIdentML files | third party | mzIdentML upload | Galaxy standard upload tool
peptide quantification | Quantiline | custom, open source | Galaxy wrapper | Galaxy Toolshed (PRIMS packages)
Quantifere-prep | MsFilt | custom, open source | Galaxy wrapper | Galaxy Toolshed (PRIMS packages)
Quantifere | Quantifere | custom, open source | Galaxy wrapper | Galaxy Toolshed (PRIMS packages)

^a Tools provided as Galaxy wrappers can also be used via the command line by advanced users.

ProteinLynx Global Server (PLGS) MS/MS search engines and their respective results for data acquired via both data-dependent analysis (DDA) and data-independent analysis (MSE),14 we developed a plugin for PLGS to enable exports of the MS/MS search results into our variant of the APML format.

In the label-free data pipeline variant, the “MS alignment procedure” and the “peptide quantification” step (as displayed in Figure 4) can be executed in existing commercial or open-source packages, as explained in Figure 6. This “MS alignment procedure” typically consists of the detection of MS peaks in multiple runs and the subsequent alignment of these runs, resulting in the quantification of each aligned peak over the multiple runs. For the “peptide quantification” step, we also added an option to run this in SEDMAT, which is a tool we developed as an alternative to the Progenesis merge step. It automatically assigns to each quantified MS feature its respective MS/MS identification based on mass


Table 2. Support for Label-Free Data Workflow Steps^a

workflow step | tool name | tool implementation and availability | support mode | support availability
MS/MS search | X!Tandem | third party, open source | Galaxy wrapper | Galaxy Toolshed (PRIMS packages)
MS/MS search | Waters PLGS | third party, commercial | PLGS plugin exporting peptide identifications to APML format | Our PLGS Plugin can be made available on request
MS/MS search | other search engines capable of producing mzIdentML files | third party | mzIdentML upload | Galaxy standard upload tool
MS alignment procedure | msCompare framework tools for OpenMS and MZmine | third party, open source | Galaxy wrappers, beta support | msCompare code base18
MS alignment procedure | Progenesis | third party, commercial | Conversion tool available as a Galaxy wrapper to convert Progenesis alignment and peptide quantification output to APML format | Galaxy Toolshed (PRIMS packages)
peptide quantification | Progenesis | third party, commercial | Conversion tool available as a Galaxy wrapper to convert Progenesis alignment and peptide quantification output to APML format | Galaxy Toolshed (PRIMS packages)
peptide quantification | SEDMAT | custom, open source | Galaxy wrapper | Galaxy Toolshed (PRIMS packages)
Quantifere-prep | MsFilt | custom, open source | Galaxy wrapper | Galaxy Toolshed (PRIMS packages)
Quantifere | Quantifere | custom, open source | Galaxy wrapper | Galaxy Toolshed (PRIMS packages)

^a Tools provided as Galaxy wrappers can also be used via the command line by advanced users.

The full set of inferred secondary proteins is provided in the Supporting Information together with the steps on how to visualize many of the inference details using Cytoscape.

and retention time similarity. The addition of SEDMAT also permits the users to use the open-source tools OpenMS15 or MZmine16 for the MS alignment step, as it provides (beta) support for the msCompare framework17,18 (an open-source framework, which uses the open-source systems OpenMS or MZmine to do the alignment).
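To make the idea of coupling identifications to quantified features by mass and retention time similarity more tangible, here is a minimal Python sketch. It is not the SEDMAT algorithm: the tolerance values, the tuple layout, and the way the two deviations are combined into one score are all assumptions chosen for this example.

```python
def match_features(features, identifications, ppm_tol=10.0, rt_tol=0.5):
    """Assign to each MS feature the closest MS/MS identification, if any.

    features:        list of (feature_id, mass, retention_time)
    identifications: list of (peptide, mass, retention_time)
    ppm_tol, rt_tol: hypothetical tolerances (ppm and minutes).
    """
    assignments = {}
    for feature_id, f_mass, f_rt in features:
        best = None
        for peptide, i_mass, i_rt in identifications:
            ppm = abs(f_mass - i_mass) / i_mass * 1e6
            d_rt = abs(f_rt - i_rt)
            if ppm <= ppm_tol and d_rt <= rt_tol:
                # Scale both deviations by their tolerance so they are comparable.
                score = ppm / ppm_tol + d_rt / rt_tol
                if best is None or score < best[0]:
                    best = (score, peptide)
        if best is not None:
            assignments[feature_id] = best[1]
    return assignments


# Hypothetical features and identifications.
features = [("f1", 1000.000, 20.0), ("f2", 1500.000, 35.0)]
identifications = [("PEPTIDEA", 1000.005, 20.1), ("PEPTIDEB", 1000.500, 20.0)]
print(match_features(features, identifications))
```

A real implementation would also have to resolve cases where several features compete for the same identification, which is where a ranking step such as the one mentioned for Quantifere-prep becomes relevant.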

Label-Free Data Set

For this test, we used a data set from a previously published study on immunoisolated ribosomes from Arabidopsis.20 The whole data set consists of 24 LC−MSE (DIA mode) runs and 6 LC−MS/MS runs (DDA mode). These data sets were already processed by the Waters PLGS software for peptide identifications and by the Progenesis software for peptide quantifications. The extra work we had in this case was to convert the data sets (provided in supplemental tables S1 and S3 of the original article) into formats such that our pipeline would be able to process it. This extra data set editing is reported in the Supporting Information. As was the case in the previous labeled data set, also in this data set we found a number of interesting secondary protein inferences. By combining our protein inference results with the protein score originally given by the authors to a number of secondary proteins, we were able to narrow down a number of likely isoforms. When comparing the secondary proteins list (46 secondary proteins in total) in the results of the original article with our own list we see that 24 out of 25 secondary proteins we inferred also had a high score and a low false discovery rate (FDR) in the original results. The remaining 21 secondary proteins were classified as “not evidenced” by Quantifere, while in the original results 8 were not given a score by the authors (because they were only found in DDA mode) and 13 were given a high score based on qualitative measures like the quality of the peptide identifications matching the protein. In our Supporting Information, we analyze a number of cases, where we show that the evidence points toward a better classification by Quantifere based on quantitative information.



RESULTS AND DISCUSSION

To test and validate our protein inference procedure, we applied it to two different data sets, one labeled and one label-free. Both data sets are available as part of previous publications. Necessary additions or transformations to these data sets, together with the Quantifere output, are made available with this manuscript. (See the Supporting Information.) The tools used for processing the data are available on the main Galaxy Toolshed (http://toolshed.g2.bx.psu.edu/). Information regarding which tools were used at each step can also be found in the Supporting Information.

iTRAQ-Labeled Data Set

For this test, we used the 2008 study published by Proteome Informatics Research Group (iPRG).5 This data set, composed of iTRAQ-labeled peptides from four mouse samples, was specifically made available for benchmarking protein inference procedures. Interestingly, the study participation letter reports that quantification was not the focus of the analysis.19 Nevertheless, we used the quantification data in our procedure and found evidence supporting the inference of a number of extra proteins. In general, our results agreed with the results reported by the study group. However, there were some cases in which our method also detected some extra proteins, mainly because of its ability to infer secondary proteins. One example is the complex protein cluster “107”, in which the study group reports the detection of three “isoforms”, while our method detects five. As shown in the Supporting Information, a closer inspection of the quantification patterns reveals there is indeed a good basis for inferring at least five isoforms in this cluster. Another example is the protein cluster “328”, for which our method detected two isoforms instead of only one, as reported by the study group.

Discussion on Peptide Pattern Variation Sources

In our analysis, we did not take into account possible sources of unwanted variation in quantification. In general, we assume that most variation sources equally affect the peptides from the same protein. This is true for many variations at the biological level and at the technical level, although at both levels some exceptions can


Figure 5. The Quantifere procedure is separated into two main steps. The first step performs basic data cleansing, that is, filtering out low-quality identifications and quantification assignments. It also carries out a ranking for any conflicting quantification assignments in the label-free data scenario. This first “Quantifere-prep” step can be seen as what Claassen et al.2 call a “selection scheme”, that is, a scheme designed to remove as many false-positives (FPs) as possible from the result set.

Figure 6. Detailed view of the “MS alignment procedure” and the “peptide quantification” steps (first mentioned in Figure 4). For the “MS alignment procedure” we support the alignment being done either by msCompare (an open-source framework, beta support) or by Progenesis (a commercial LC−MS data processing system). The Progenesis system can optionally also execute the “Peptide quantification” step (shown in the bottom part of the diagram). A converter is provided in our Galaxy environment to convert Progenesis results to our variant of the APML format.

Figure 4. (A) Pipeline variant supporting labeled data in which peptide quantification is done via labeling techniques, for example, labeling peptides with isobaric tags for relative and absolute quantitation (iTRAQ). The “peptide quantification” step in this case is basically a tool that reads the mzML spectrum for each mzIdentML peptide record and looks for the label peaks. These label peaks can be specified by the user and their intensity is used for the relative quantification of the peptide over the samples. (B) Pipeline variant supporting label-free data in which peptide identification and peptide quantification are done in separate MS/MS and MS runs, respectively. The “peptide quantification” step in this case can, for example, be a tool that couples MS/MS peptide identifications to quantified MS features based on mass and retention time similarity. (C) There can also be tools that are capable of executing both the “MS alignment procedure” and the “peptide quantification”, such as the Progenesis LC−MS software package,11 for which we also added support.
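For the labeled variant in panel A, looking up user-specified label peaks in a spectrum can be sketched as follows in Python. This is only an illustration: the function name, the m/z tolerance, the approximate iTRAQ 4-plex reporter m/z values, and the choice of taking the most intense peak in each window are assumptions, and reading the spectrum from mzML is left out.

```python
# Approximate iTRAQ 4-plex reporter ion m/z values (assumed defaults; in the
# procedure described here the label peaks are specified by the user).
ITRAQ_4PLEX_MZ = (114.11, 115.11, 116.11, 117.11)


def reporter_intensities(peaks, label_mzs=ITRAQ_4PLEX_MZ, tol=0.05):
    """Return one intensity per label peak for a single MS/MS spectrum.

    peaks: list of (mz, intensity) pairs from the spectrum.
    tol:   hypothetical m/z window half-width around each label peak.
    """
    intensities = []
    for target in label_mzs:
        in_window = [inten for mz, inten in peaks if abs(mz - target) <= tol]
        # Take the most intense peak in the window, or 0.0 if none is found.
        intensities.append(max(in_window, default=0.0))
    return intensities


# Hypothetical spectrum with three of the four reporter peaks present.
spectrum = [(114.11, 1200.0), (115.12, 800.0), (116.30, 50.0),
            (117.10, 400.0), (500.30, 9000.0)]
print(reporter_intensities(spectrum))
```

The resulting vector of intensities is what serves as the expression profile of a peptide over the samples in the correlation-based steps.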

also occur. At the biological level, modifications and genetic variations (substitutions) on some peptides can have an influence on their identification and quantification in some of the samples, resulting in an aberrant quantification pattern when compared with their siblings. At the technical level, phenomena like ion suppression could cause sibling peptides to get different quantification patterns. For isobaric label quantification in MS/MS mode, coselection of multiple precursor peptides is a well-known source for aberrant quantification. These different patterns could then wrongly be seen by Quantifere as the basis for the inference of one or more secondary proteins that match the affected peptides.

For both of our data sets, we assumed such variation sources to be limited. To limit even more the effect of such “spurious” errors on candidate inference peptides, we introduced a protein inference scoring scheme for scoring the quality of inference for secondary proteins. The secondary protein inference score is a function of the number of inference peptides over the number of quantified peptides matched to a protein. It is calculated as specified in eq 1 below.

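$$\text{inference score} = \frac{\sum_{i=1}^{n} \left( \left( 1 - \max\left\{ \frac{|r_{if}| + r_{if}}{2} : f \in F \right\} \right) \times 100 \right)}{m} \qquad (1)$$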

where n is the number of inference peptides matching a (secondary) protein, m is the total number of quantified peptides matching a protein, F is the set of foreign peptides (foreign to the secondary protein in question, according to the definition of foreign peptide given before), and r_if is the Pearson correlation value between the inference peptide i and each foreign peptide f. So for each inference peptide we look for its best correlation


Figure 7. Assessment of peptide correlations on subsets of the data sets before and after processing by Quantifere. (A, B) Peptide correlations assessment comparing the correlation of “full sibling peptides” (i.e., matching the same protein) versus “half or not-sibling peptides”. As expected, in both data sets we see a better correlation between the “full sibling peptides” than between the “half or not-sibling peptides”. (C, D) Same assessment as in A and B, but now after updating the peptide to protein matches network according to Quantifere’s inference results. The expected increase in the difference between the average correlation of “full sibling peptides” and the average correlation of “half or not-sibling peptides” can be seen in both data sets.

value with any foreign peptide. A high correlation value (e.g., close to 1.0) will imply a low inference score contribution from this peptide, while a low correlation value (e.g., close to 0.0) will imply a high inference score contribution from this peptide. Correlation values of 0.0 and negative correlation values contribute the same weight to the final inference score. This scoring scheme follows the overall rationale of Quantifere; that is, the more closely the inference peptide follows a pattern of other peptides that do not match this secondary protein (in other words, has a high correlation value with a foreign peptide), the less reliable it is considered to be, as its pattern cannot be confidently attributed to the putative presence of the secondary protein. Finally, as can be seen in the following section, average correlation values between peptides will vary per data set, so the previous scoring scheme could still be complemented with a statistical confidence value, indicating how significant the differences in correlation are. For now, we help the user put the protein inference scores into perspective by showing an overall protein inference score distribution plot in the Quantifere report (as shown in the Supporting Information Figure 3).
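The scoring scheme of eq 1 can be written out as a short Python sketch. The function and argument names are hypothetical, and the handling of an empty set of foreign peptides (treated here as a best correlation of 0.0) is an assumption that the text does not specify.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def inference_score(inference_profiles, foreign_profiles, n_quantified):
    """Secondary protein inference score following eq 1.

    inference_profiles: profiles of the n inference peptides of the protein
    foreign_profiles:   profiles of the foreign peptides (the set F)
    n_quantified:       m, the number of quantified peptides matching the protein
    """
    total = 0.0
    for profile in inference_profiles:
        # (|r| + r) / 2 keeps positive correlations and maps negative ones to 0.
        clipped = ((abs(r) + r) / 2
                   for r in (pearson(profile, f) for f in foreign_profiles))
        best = max(clipped, default=0.0)  # assumption: no foreign peptide -> 0.0
        total += (1 - best) * 100
    return total / n_quantified


# One inference peptide, perfectly anti-correlated with the only foreign
# peptide, for a protein with four quantified peptides.
print(inference_score([[1, 2, 3, 4]], [[4, 3, 2, 1]], n_quantified=4))
```

In this example the single inference peptide contributes the maximum of 100 points, and dividing by m = 4 gives a score of 25.0. An inference peptide that correlates perfectly with a foreign peptide would contribute nothing.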

Checking the Data against the Concept Behind the Inference

As mentioned in the Introduction, our inference is based on the concept that peptides belonging to the same set of proteins (“full sibling peptides”) should show a good correlation in their


sensitivity and specificity of protein inference methods such as the one we propose here. This should include testing the method on a data set where we know the complete list of proteins beforehand (which is not the case for the data sets we used in this work) and which contains enough differences between different isoform concentrations over the samples. One might think of solving this by developing a synthetic (spiked) sample set, for example. Such a synthetic sample set would be useful not only to validate the ability of the algorithm to detect variation (which is something we showed already in this work) but also to statistically assess the accuracy of such an approach in the light of other variation sources next to the variation introduced by the presence of isoforms. However, because of the synthetic nature of such samples and their artificial isoform concentrations, these samples (if not crafted carefully) might still not be the ideal means of validating the algorithm’s ability to detect isoforms in practice, where the relative isoform concentrations and the sources of variation may differ from the ones in the synthetic samples.

quantification patterns, while peptides originating from a different set of proteins (“half or not-sibling peptides”) will show lower correlation. We assessed this in both data sets and found that they confirm our expectations (Supporting Information, Figures 24 and 25), although the difference between both peptide groups is smaller in the label-free data set (probably due to stronger interference of the technical variation). These results are important because they corroborate our case for making use of the peptide quantification patterns to detect the presence of additional proteins. Next to the detailed assessment done for some specific inference cases in both data sets (Figures 5−12 and 18−23 in Supporting Information), we also used the previous correlation comparison method to assess the Quantifere results at a higher level. This is possible because the Quantifere algorithm, if parametrized correctly, is expected to improve the average correlation for the group of “full sibling peptides”. Such an improvement can come, for example, from peptides that were first seen as “half siblings” (because of their distinct matches to secondary proteins), which are now grouped together as “full siblings” based on the high correlation of their quantification patterns and the subsequent removal of the secondary proteins that were classified as “not evidenced” (Supporting Information Figure 26 illustrates this). For this assessment, we used subsets of both data sets comprising only the parts of the peptide to protein matches networks linked to secondary proteins that were classified as “not evidenced” (so the parts where changes occurred). Figure 7 shows the assessment results, where we observe a correlation improvement in both data sets after processing by Quantifere.
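The comparison between “full sibling peptides” and “half or not-sibling peptides” can be sketched in Python as below. The sketch assumes that two peptides are full siblings when they match exactly the same set of proteins; the function names and the toy data are hypothetical, and the figures in the Supporting Information were not produced with this code.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def sibling_correlation_summary(peptide_to_proteins, profiles):
    """Mean correlation for full-sibling pairs and for all other pairs."""
    full, other = [], []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        # Full siblings match exactly the same set of proteins.
        if peptide_to_proteins[a] == peptide_to_proteins[b]:
            full.append(r)
        else:
            other.append(r)

    def mean(values):
        return sum(values) / len(values) if values else None

    return mean(full), mean(other)


# Hypothetical example: pepA and pepB are full siblings, pepC is a half sibling.
peptide_to_proteins = {"pepA": {"P1"}, "pepB": {"P1"}, "pepC": {"P1", "P2"}}
profiles = {"pepA": [1, 2, 3, 4], "pepB": [2, 4, 6, 8], "pepC": [4, 3, 2, 1]}
print(sibling_correlation_summary(peptide_to_proteins, profiles))
```

Running the same summary before and after removing the “not evidenced” secondary proteins from `peptide_to_proteins` gives the kind of before/after comparison shown in Figure 7.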



CONCLUSIONS

The new protein inference approach we proposed here relies strongly on the quantification patterns and therefore on the quality of the quantification method itself. Although there are many events that can interfere with the quantification of the peptides, we believe that as our abilities to correctly quantify peptides increase, so will the importance of our inference approach. The fact that various recent publications,21−24 including a recent Virtual Issue of this journal,25,26 have as their central theme the improvement of peptide quantification also reassures us that new protein inference methods that are able to use quantification information are a valuable addition to the proteomics community.

Protein Quantification Discussion



One interesting advantage of our inference procedure is that by decreasing the complexity of the protein−peptide matching network, it simplifies the protein quantification procedure for the primary proteins. That is, when removing the “not evidenced” secondary proteins from the match network, the number of protein-specific peptides (i.e., the peptides that match only one protein) increases. Our tool makes use of this reduced ambiguity at the peptide level to output a simple protein quantification based on the following scheme: (1) Remove the “not evidenced” (not inferred) secondary proteins from the protein−peptide matching network. (2) For each primary protein, calculate its intensity value on each sample separately by finding the protein-specific peptides and summing their intensities for the sample in question. The next step would be to apportion quantification values also to inferred secondary proteins, for example, by taking the peptides matching these proteins and subtracting the quantification values of sibling protein-specific peptides. In practice, however, this can still be quite complicated because it would require peptides to first be quantified by an absolute quantification method that is able to deal with and correct for quantification variations such as differences in ionization efficiency between distinct peptides. In other words, we would need accurate absolute peptide quantification before we could start apportioning quantification values of peptides that are shared/matched by multiple proteins.
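The two-step scheme above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the match network is represented as a dictionary from peptide to the set of proteins it matches, intensities are given per peptide as one value per sample, and all function and variable names are hypothetical rather than taken from Quantifere.

```python
def quantify_primary_proteins(matches, not_evidenced, primary, intensities):
    """Sum protein-specific peptide intensities per sample for primary proteins."""
    # Step 1: remove the "not evidenced" secondary proteins from the network.
    pruned = {pep: prots - not_evidenced for pep, prots in matches.items()}

    # Step 2: for each primary protein, sum the intensities of the peptides
    # that now match only that protein, one sample at a time.
    n_samples = len(next(iter(intensities.values())))
    totals = {prot: [0.0] * n_samples for prot in primary}
    for pep, prots in pruned.items():
        if len(prots) != 1:
            continue  # still shared between proteins (or unmatched): skip
        (prot,) = prots
        if prot in totals:
            for i, value in enumerate(intensities[pep]):
                totals[prot][i] += value
    return totals


# Hypothetical network: S1 is a secondary protein classified "not evidenced".
matches = {
    "pep1": {"P1"},
    "pep2": {"P1", "S1"},  # becomes P1-specific once S1 is removed
    "pep3": {"P1", "P2"},  # remains shared, so it is not used
    "pep4": {"P2"},
}
intensities = {
    "pep1": [10.0, 20.0],
    "pep2": [1.0, 2.0],
    "pep3": [100.0, 100.0],
    "pep4": [5.0, 6.0],
}
totals = quantify_primary_proteins(matches, {"S1"}, {"P1", "P2"}, intensities)
```

Here `pep2` contributes to P1 only because the removal of S1 made it protein specific, which is the reduction in ambiguity the text refers to.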

ASSOCIATED CONTENT

Supporting Information

Data processing and data analysis steps for both data sets used in the Results section. Zip file containing the main files produced by Quantifere for both data sets used in the Results section. Screenshot of the Quantifere user interface in Galaxy. This material is available free of charge via the Internet at http://pubs.acs.org.



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Tel: +31 (0)317 480891. Fax: +31 (0)317 418094.

Author Contributions

The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

We acknowledge financial support from The Netherlands Genomics Initiative via the programs of the Netherlands BioInformatics Centre (NBIC), the Netherlands Proteomics Centre (NPC), the Centre for BioSystems Genomics (CBSG), the Consortium for Improving Plant Yield (CIPY), and the 7th Framework Program FUEL4ME (FP7-ENERGY-20121-2stage grant number 308983).

Data Sets and Statistical Assessment Discussion

As previously discussed, differences in abundance patterns of sibling peptides can also originate from technical variation, in addition to differences in isoform expression. One possible follow-up to the work presented here would be to assess in more detail the

dx.doi.org/10.1021/pr401072g | J. Proteome Res. 2014, 13, 3191−3199


ABBREVIATIONS

PSM, peptide to spectrum match; iTRAQ, isobaric tags for relative and absolute quantitation; FP, false positive; PLGS, Waters ProteinLynx Global Server software; DDA, data-dependent analysis; MSE or DIA, data-independent analysis; ISB, Institute for Systems Biology; FDR, false discovery rate

REFERENCES

(1) Meyer-Arendt, K.; Old, W. M.; Houel, S.; Renganathan, K.; Eichelberger, B.; Resing, K. A.; Ahn, N. G. IsoformResolver: A Peptide-Centric Algorithm for Protein Inference. J. Proteome Res. 2011, 10, 3060−3075.
(2) Claassen, M.; Reiter, L.; Hengartner, M. O.; Buhmann, J. M.; Aebersold, R. Generic Comparison of Protein Inference Engines. Mol. Cell. Proteomics 2012, 11, O110.007088.
(3) Huang, T.; Wang, J.; Yu, W.; He, Z. Protein Inference: A Review. Briefings Bioinf. 2012, 586−614.
(4) Cox, J.; Mann, M. MaxQuant Enables High Peptide Identification Rates, Individualized p.p.b.-Range Mass Accuracies and Proteome-Wide Protein Quantification. Nat. Biotechnol. 2008, 1367−1372.
(5) The Proteome Informatics Research Group. iPRG2008 Study at http://www.abrf.org/index.cfm/group.show/ProteomicsInformaticsResearchGroup.53.htm; Study materials and study report; The Association of Biomolecular Resource Facilities, 2008.
(6) Ahmad, Y.; Boisvert, F.; Lundberg, E.; Uhlen, M.; Lamond, A. Systematic Analysis of Protein Pools, Isoforms, And Modifications Affecting Turnover and Subcellular Localization. Mol. Cell. Proteomics 2012, 11, M111.013680.
(7) Colaert, N.; Helsens, K.; Impens, F.; Vandekerckhove, J.; Gevaert, K. Rover: a Tool to Visualize and Validate Quantitative Proteomics Data from Different Sources. Proteomics 2010, 10, 1226−1229.
(8) Goecks, J.; Nekrutenko, A.; Taylor, J.; Team, T. G. Galaxy: a Comprehensive Approach for Supporting Accessible, Reproducible, And Transparent Computational Research in the Life Sciences. Genome Biol. 2010, 11, R86.
(9) Brusniak, M.-Y.; Bodenmiller, B.; Campbell, D.; Cooke, K.; Eddes, J.; Garbutt, A.; Lau, H.; Letarte, S.; Mueller, L. N.; Vagisha Sharma, O. V.; Zhang, N.; Aebersold, R.; Watts, J. D. Corra: Computational framework and tools for LC−MS discovery and targeted mass spectrometry-based proteomics. BMC Bioinf. 2008, 9, 542.
(10) HUPO. mzIdentML 1.1 specification. http://www.psidev.info/mzidentml, http://www.psidev.info/mzidentml#mzIdentML1_1_0 (accessed October 2013).
(11) Nonlinear Dynamics. Progenesis LC−MS. http://www.nonlinear.com/products/progenesis/LC-MS/overview/ (accessed 2013).
(12) Craig, R.; Beavis, R. C. TANDEM: Matching Proteins with Tandem Mass Spectra. Bioinformatics 2004, 20, 1466−1467.
(13) Ghali, F.; Krishna, R.; Lukasse, P.; Martínez-Bartolomé, S.; Reisinger, F.; Hermjakob, H.; Vizcaíno, J. A.; Jones, A. R. Tools (Viewer, Library and Validator) that Facilitate Use of the Peptide and Protein Identification Standard Format, Termed mzIdentML. Mol. Cell. Proteomics 2013, 12, 3026−3035.
(14) Waters. Quantitative and Qualitative Proteomics Research Platform. http://www.waters.com (accessed 2013).
(15) OpenMS. OpenMS: an open source framework for mass spectrometry and TOPP - The OpenMS Proteomics Pipeline. http://open-ms.sourceforge.net/ (accessed 2013).
(16) MzMine. MzMine 2 - framework for mass spectrometry data processing. http://mzmine.sourceforge.net/ (accessed 2013).
(17) Hoekman, B.; Breitling, R.; Suits, F.; Bischoff, R.; Horvatovich, P. msCompare: A Framework for Quantitative Analysis of Label-free LC−MS Data for Comparative Candidate Biomarker Studies. Mol. Cell. Proteomics 2012, 11, M111.015974.
(18) msCompare. msCompare. https://trac.nbic.nl/mscompare/ (accessed 2013).
(19) Proteome Informatics Research Group (iPRG). (iPRG) 2008 Study Participation Letter. http://www.abrf.org/ResearchGroups/ProteomicsInformaticsResearchGroup/Studies/ABRF_iPRG2008_StudyParticipationLetter.pdf (accessed 2013).
(20) Hummel, M.; Cordewener, J. H. G.; M, d. G. J. C.; Smeekens, S.; America, A. H. P.; Hanson, J. Dynamic Protein Composition of Arabidopsis thaliana Cytosolic Ribosomes in Response to Sucrose Feeding As Revealed by Label Free MSE Proteomics. Proteomics 2012, 12, 1024−1038.
(21) Bond, N. J.; Shliaha, P. V.; Lilley, K. S.; Gatto, L. Improving Qualitative and Quantitative Performance for MSE-Based Label-Free Proteomics. J. Proteome Res. 2013, 12, 2340−2353.
(22) Gillet, L. C.; Navarro, P.; Tate, S.; Röst, H.; Selevsek, N.; Reiter, L.; Bonner, R.; Aebersold, R. Targeted Data Extraction of the MS/MS Spectra Generated by Data-independent Acquisition: A New Concept for Consistent and Accurate Proteome Analysis. Mol. Cell. Proteomics 2012, 11, O111.016717.
(23) Colaert, N.; Huele, C. V.; Degroeve, S.; Staes, A.; Vandekerckhove, J.; Gevaert, K.; Martens, L. Combining Quantitative Proteomics Data Processing Workflows for Greater Sensitivity. Nat. Methods 2011, 8, 481−483.
(24) Park, S. K.; Venable, J. D.; Xu, T.; Yates, J. R., III. A Quantitative Analysis Software Tool for Mass Spectrometry-based Proteomics. Nat. Methods 2008, 319−322.
(25) Journal of Proteome Research. Virtual Issue: Quantitative Proteomics. http://pubs.acs.org/page/vi/2013/quantitativeproteomics.html (accessed October 2013).
(26) Yates, J. R., III; Washburn, M. P. Quantitative Proteomics. Anal. Chem. 2013, 8881−8881.
(27) Rappsilber Laboratory. xiSPEC Spectrum Viewer for Biologists & Mass Spectrometrists. http://spectrumviewer.org (accessed 2013).
