LFQProfiler and RNPxl: Open-Source Tools for Label-Free


Technical Note

LFQProfiler and RNPxl – Open-source tools for label-free quantification and protein-RNA cross-linking integrated into Proteome Discoverer
Johannes Veit, Timo Sachsenberg, Aleksandar Chernev, Fabian Aicheler, Henning Urlaub, and Oliver Kohlbacher
J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.6b00407 • Publication Date (Web): 01 Aug 2016
Downloaded from http://pubs.acs.org on August 1, 2016


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Proteome Research

LFQProfiler and RNPxl – Open-source tools for label-free quantification and protein-RNA cross-linking integrated into Proteome Discoverer

Johannes Veit,∗,† Timo Sachsenberg,† Aleksandar Chernev,‡,¶ Fabian Aicheler,† Henning Urlaub,‡,¶ and Oliver Kohlbacher†,§,∥

†Center for Bioinformatics, University of Tübingen, Tübingen, Germany
‡Bioanalytical Mass Spectrometry, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
¶Bioanalytics Group, University Medical Center Göttingen, Göttingen, Germany
§Quantitative Biology Center, University of Tübingen, Tübingen, Germany
∥Biomolecular Interactions, Max Planck Institute for Developmental Biology, Tübingen, Germany

E-mail: [email protected]
Phone: +49 (0)7071 29 70456. Fax: +49 (0)7071 29 5152

Abstract

Modern mass spectrometry setups used in today’s proteomics studies generate vast amounts of raw data, calling for highly efficient data processing and analysis tools. Software for analyzing these data is either monolithic (easy to use, but sometimes too rigid) or workflow-driven (easy to customize, but sometimes complex). Thermo Proteome Discoverer (PD) is a powerful software package for workflow-driven data analysis in proteomics which, in our view, achieves a good tradeoff between flexibility and usability. Here we present two open-source plugins for PD providing additional functionality: LFQProfiler for label-free quantification of peptides and proteins, and RNPxl for UV-induced peptide-RNA cross-linking data analysis. LFQProfiler interacts with existing PD nodes for peptide identification and validation and takes care of the entire quantitative part of the workflow. We show that it performs at least on par with other state-of-the-art software solutions for label-free quantification in a recently published benchmark [1]. The second workflow, RNPxl, represents the first software solution to date for identification of peptide-RNA cross-links including automatic localization of the cross-links at amino acid resolution and localization scoring. It comes with a customized integrated cross-link fragment spectrum viewer for convenient manual inspection and validation of the results.

Keywords Computational, Analysis, Workflow, Nodes, Open-source Software

Introduction

Mass spectrometry coupled to high-performance liquid chromatography (HPLC-MS) has become a key technology in proteomics. The huge amounts of raw data generated by modern mass spectrometers in the context of today’s large-scale proteomics studies necessitate efficient and highly automated tools for data analysis. The variety of existing software solutions can be roughly divided into two groups: monolithic applications and modular workflow systems. The former are usually tailored towards one or a few specific purposes and feature an easy-to-use graphical user interface (GUI). Examples include open-source tools like Skyline [2], free but closed-source tools like MaxQuant [3,4], and commercial software solutions such as Progenesis QI (NonLinear Dynamics). The disadvantage of many of these systems, however, is their lack of flexibility: it is practically impossible to use the software for purposes other than those envisioned by its developers. Hence, adapting the data analysis workflow to even a small change in the experimental workflow can sometimes pose an insurmountable obstacle.

Workflow systems, on the other hand, consist of a set of many rather small computational tools which can be flexibly combined to form powerful data analysis workflows using a (graphical) workflow language that exactly defines the interplay and data flow between the individual components. Here, it often suffices to exchange one or two building blocks in order to adapt the data analysis to a change in the experimental workflow. A number of different generic workflow systems are available today, such as Galaxy [5–7], Taverna [8,9], or KNIME [10]. In theory, it is possible to integrate arbitrary tool sets into these systems, including pipeline-based tool kits for proteomics data analysis, such as OpenMS / TOPP [11,12] or the Trans-Proteomic Pipeline (TPP) [13,14]. Efforts have been made to integrate these tool sets into various workflow platforms. However, since setting up these generic environments and integrating the various tools can still require a substantial amount of work, ready-to-use workflow engines specifically tailored for HPLC-MS proteomics data analysis have been developed (e.g., TOPPAS [15] or Proteomatic [16]). Furthermore, extremely convenient deployment mechanisms for seamless integration of entire tool kits into existing workflow systems have been established, such as the integration of OpenMS / TOPP into the KNIME [10] platform via the Generic KNIME Nodes (GKN) module [17]. However, there is still a gap between monolithic GUI applications and proper workflow systems for LC-MS data analysis.
As mentioned above, monolithic applications are often too rigid and cannot be adapted to suit the specific needs of their users. Workflow systems, on the other hand, are extremely flexible, but this comes at the price of potentially high complexity of the employed workflows. The more explicit a workflow language is, the more powerful and flexible, but also the more complex, it becomes. The Thermo Proteome Discoverer (PD) approach represents a tradeoff between these two extremes: it features a convenient GUI in which users can easily load raw data directly from the instrument and explore and analyze it, all within one and the same platform. Data analysis can be performed using a variety of workflows, and Proteome Discoverer can be extended by plugins, which makes it easy to adapt the software to novel use cases. In comparison to the aforementioned workflow systems, however, the Proteome Discoverer workflow language is somewhat less explicit and thus less complex, as it consists of fewer and larger building blocks which have more built-in logic and sanity checks. Although this takes away some of the power of full-fledged workflow systems, this approach is often a good compromise between user-friendliness and flexibility. We thus provide so-called meta nodes for the Proteome Discoverer workflow engine, i.e., nodes that contain a more complex workflow under the hood which is hidden from the user for the sake of simplicity and usability.

In this article, we present a freely available plugin for Proteome Discoverer providing software solutions for two important problems in computational proteomics: LFQProfiler for label-free quantification (LFQ) and RNPxl for protein-RNA cross-linking data analysis. These two applications were chosen out of the hundreds of algorithms and tools contained in OpenMS for two reasons. First, we felt there was an urgent need for an improved, user-friendly label-free quantification tool. Second, we wanted to explore to what extent the tight integration with the raw data visualization could improve the complex annotation and curation still required for protein-RNA cross-linking analysis. These are thus the first two OpenMS tools to be integrated into Proteome Discoverer, but most likely not the last.
LFQ is a well-established technique, and a number of commercial and free software solutions exist for analyzing LFQ data, e.g., MaxQuant (MaxLFQ) [3,4], MFPaQ [18], Progenesis QI (NonLinear Dynamics), SuperHirn [19], or the various feature-finding tools contained in OpenMS / TOPP [11,12]. Until now, Proteome Discoverer provided only rather limited means to analyze LFQ data: natively, it supports spectral counting and a rather basic form of MS1 intensity-based quantification. However, comparing abundances across different samples is difficult with this approach, because the crucial step of matching between runs, also known as retention time (RT) alignment or feature linking, is missing.


This cannot simply be overcome by adding a node for matching between runs, since this would require an algorithm that can detect and quantify peptide features in the data independently of identification results. Such an algorithm is currently missing: Proteome Discoverer can only quantify XICs at those positions where an identification is already present. Since many (especially low-intensity) peptides are identified in only one or a few runs, but might be consistently quantifiable across several runs in the MS1 data, this considerably reduces the number of peptides that can be identified and quantified.

A less established but trending topic in proteomics is protein-RNA complex analysis using ultraviolet (UV) light-induced cross-linking. Protein-RNA complexes are essential components in all life forms. They play pivotal roles in a wide range of biological processes, including bacterial anti-termination, spliceosomal cleavage of intronic regions, small RNA maturation, translational control by miRNA / non-coding RNA, epigenetic modulation, and regulation of DNA degradation. In many cases, the structural arrangement of the individual subunits is still unknown and the biological processes these complexes are involved in are poorly understood. Thus, protein-RNA complexes represent a highly interesting target of biological research. Our RNPxl workflow for Proteome Discoverer is a powerful and convenient tool for analyzing protein-RNA cross-linking data. It is based on an established method by Kramer et al. [20] and provides a user-friendly interface to their algorithms, including a custom-tailored spectrum visualization widget facilitating manual validation of identified cross-links.

Implementation

Thermo Proteome Discoverer is a versatile and user-friendly software package for 64-bit Windows platforms enabling proteomics data analyses for a wide range of experimental techniques. It already supports multiple sequence database search engines (e.g., Sequest HT [21] or Mascot [22]), spectral library searching, peptide-spectrum-match validation (e.g., using Percolator [23]), as well as various quantification techniques, such as isobaric mass tagging (iTRAQ [24], TMT [25]) or SILAC [26]. Besides data processing workflows, Proteome Discoverer also offers powerful integrated visualization options including spectrum viewers, scatter charts, histograms, Venn diagrams, and many more.

In PD 2.x, data analysis is workflow-driven. Workflows are always split into two parts: the processing step and the consensus step. The idea behind this distinction is that some computationally expensive tasks have to be run only once (or a few times), while other parts further downstream in the analysis might involve more tweaking and optimization and thus have to be run more often using different parameter settings or even different workflows. The results of the processing step (e.g., the sequence database search results) can thus be computed once, and various consensus workflows, which usually run much faster, can be tried on them. Another rationale for this two-step approach is that tasks in the processing step can usually be computed individually for each input file, one after another, and hence can be parallelized using the batch processing mode, whereas the consensus step requires the processing results of all input files at once for a combined analysis.

Proteome Discoverer is written in the C# programming language and offers an application programming interface (API) for node development, enabling the community to write their own PD workflow nodes. Today, a number of PD community nodes are available free of charge, e.g., the popular search engine MS Amanda [27], or the modification site localization tools phosphoRS and ptmRS [28]. A selection of useful nodes can be found on pd-nodes.org. All algorithms utilized by LFQProfiler and RNPxl are implemented as standalone executable tools contained in the OpenMS / TOPP tool suite.
In order to make these tools and workflows accessible from PD 2.x, we developed a plugin that adds two new processing nodes and two corresponding consensus nodes to the PD node repository. These nodes control the data flow between the individual tools and convert between the input/output data formats used by PD and OpenMS. This is necessary because OpenMS data storage and exchange is based on XML files, whereas PD uses a relational database approach together with object-relational mapping for storing and accessing all its data. The results of a PD analysis are stored in a pdResult file which contains a SQLite database. From a node developer’s point of view, accessing this data from within the node is seamless, since the object-relational mapper (the so-called EntityDataService of PD) takes care of loading objects from and persisting them to the database.

From a conceptual point of view, our PD workflow nodes represent so-called meta nodes which encapsulate and allow execution of larger workflows while hiding complexity from the user. In addition, the plugin implements logic to facilitate usage and provides visualization capabilities.
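Because a pdResult file contains a SQLite database, its contents can in principle also be inspected outside of PD with any SQLite client. The following is a minimal sketch using Python's standard `sqlite3` module; the function name `list_tables` is our own, and nothing is assumed here about the actual table names or schema that PD writes.

```python
import sqlite3
from pathlib import Path


def list_tables(path):
    """Return the sorted names of all tables in the SQLite database at `path`."""
    # Open the file read-only so that an existing result file is never modified.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    con = sqlite3.connect(uri, uri=True)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]
```

Calling `list_tables("analysis.pdResult")` on a result file (the file name is a placeholder) would return whatever tables that file contains; within a node, the EntityDataService mentioned above remains the intended access path.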

Results and Discussion

LFQProfiler

Workflow

An overview of the LFQProfiler workflow is depicted in Figure 1. The workflow is split into two parts. The processing step starts with loading raw files using the “Spectrum Files” node. This node is followed by a “Spectrum Selector” for optionally filtering spectra based on various criteria. After that, the workflow branches: MS1 peptide features are quantified in all runs by “LFQProfiler FF”. In parallel, MS2 spectra are identified using the native PD node Sequest HT. Subsequently, peptide identifications are validated using Percolator.

As soon as the processing step has finished, the consensus step combines the individual processing results. It starts with the obligatory “MSF Files” node for loading the processing results stored in a Thermo MSF file. “LFQProfiler” then exports peptide identifications from the Proteome Discoverer format to a file in OpenMS’s idXML format. Then, for each run, peptide identifications are mapped onto their corresponding quantified features contained in the featureXML files from “LFQProfiler FF”. The resulting ID-annotated features are again stored in a featureXML file. At the same time, all peptide-level identification results are used for protein inference using Fido [29] via the TOPP tool FidoAdapter. The result of this step is a set of protein groups that plausibly explain the observed peptides. These will be quantified in the remaining steps.

In order to match peptide signals between runs, chromatographic shifts are first reduced by retention time alignment at the level of detected MS1 features. To this end, RT transformations are computed for each map based on the deviating retention times of corresponding MS1 features across different runs. Features are assumed to correspond to one another if they have identical peptide annotations and lie within a (user-defined) m/z and RT tolerance window. The computed transformations are applied to warp the retention times of all peptide signals. This initial warping facilitates the actual task of establishing correspondence between (potentially unidentified) peptide features across runs, the so-called feature linking step, which is achieved using a quality threshold clustering algorithm [30]. Once correspondences are established, the linked peptide signals together with the transferred identifications are stored in a consensusXML file. Feature intensities are normalized across all runs in order to make them comparable.

Both the quantified and identified features from the consensusXML file and the protein inference results in idXML format are used as input for the final protein quantification step. Here, feature intensities are summarized to peptide intensities (i.e., different charge states of the same peptide are merged), and finally, peptide intensities are summarized to protein (group) intensities using the results from the Fido protein inference. The final results are tables of feature, peptide, and protein (group) abundances for all MS runs in CSV format. These are parsed back into Proteome Discoverer result tables.
Finally, all result tables are interconnected, so the user can conveniently navigate the results and analyze, for example, all quantified peptides contributing to the quantification of a particular protein, or all PSMs mapped to a certain quantified feature. All result tables can be exported to CSV files in order to facilitate downstream data analysis using external tools such as Perseus [31] or R [32] for, e.g., missing value imputation and statistical testing.
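The retention time alignment step of the workflow can be illustrated with a small sketch. It pairs features that carry the same peptide annotation and lie within an m/z and RT tolerance window, fits a transformation to these anchor pairs, and applies it to all features of a run, including unidentified ones. The linear model, the dictionary layout of a feature, the default tolerances, and all function names are assumptions made for illustration only; they are not the OpenMS implementation.

```python
def find_anchors(run, reference, mz_tol=0.01, rt_tol=300.0):
    """Pair features sharing a peptide annotation within the tolerance window.

    Each feature is a dict with the keys 'peptide', 'mz', and 'rt'
    (a hypothetical layout). Returns (rt_in_run, rt_in_reference) tuples.
    """
    ref_by_peptide = {}
    for feature in reference:
        if feature["peptide"] is not None:
            ref_by_peptide.setdefault(feature["peptide"], []).append(feature)
    anchors = []
    for f in run:
        for g in ref_by_peptide.get(f["peptide"], []):
            if abs(f["mz"] - g["mz"]) <= mz_tol and abs(f["rt"] - g["rt"]) <= rt_tol:
                anchors.append((f["rt"], g["rt"]))
    return anchors


def fit_linear(anchors):
    """Least-squares fit of rt_reference = slope * rt_run + intercept."""
    n = len(anchors)
    if n < 2:
        raise ValueError("need at least two anchor pairs")
    mean_x = sum(x for x, _ in anchors) / n
    mean_y = sum(y for _, y in anchors) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in anchors)
    if sxx == 0:
        raise ValueError("anchor retention times must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in anchors)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def warp(run, slope, intercept):
    """Return copies of all features in `run` with transformed retention times."""
    return [dict(f, rt=slope * f["rt"] + intercept) for f in run]
```

After warping, corresponding features from different runs lie closer together in RT, which is what makes the subsequent feature linking of unidentified features feasible.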


LFQProfiler is based on an established workflow for label-free quantification originally described by Weisser et al. [30]. It provides functionality currently missing in Proteome Discoverer and features a number of improvements compared to the original version of the workflow [30]. Most notably, we have added the aforementioned protein inference step using the Fido algorithm [29] and an intensity normalization step at the feature level using median scaling or quantile normalization before quantifying proteins based on feature intensities. Moreover, we have replaced the feature detection tool FeatureFinderCentroided with the more recent FeatureFinderMultiplex algorithm [33], which is more actively developed and maintained. The benchmarking and refactoring efforts that were part of the development of the PD nodes led to several algorithmic changes that reduced the runtime of FeatureFinderMultiplex. The overall performance improvement was approximately 50-fold (less than 2 minutes instead of 1h 30min on an ∼800 MB Orbitrap Velos run). As a result of the ongoing efforts of the OpenMS developer team to constantly maintain and improve existing OpenMS tools, the retention time alignment and feature linking code has also become much faster and has a reduced memory footprint in the current OpenMS 2.0 release [34] compared to OpenMS version 1.8, which was the basis for the workflow described by Weisser et al. [30]. Except for the peptide identification using PD’s Sequest HT node, the LFQProfiler workflow can also be represented as an equivalent TOPPAS [15] or KNIME [10] workflow using the same underlying OpenMS tools. However, the main advantage of LFQProfiler over equivalent representations in other workflow systems lies in its user-friendly interface and the tight integration with PD’s raw data visualization features.
For example, it is easy to retrieve all PSMs matching a particular quantified peptide or protein of interest and to visualize the corresponding spectra or extracted ion chromatograms (XIC) using PD’s built-in visualization tools.
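The median-scaling variant of the feature-level intensity normalization mentioned above can be sketched as follows. Each run is multiplied by a factor that makes its median feature intensity equal to that of a reference run. Choosing the first run as the reference, as well as the input layout and the function name, are assumptions of this sketch rather than documented behavior of the OpenMS tool.

```python
from statistics import median


def median_scale(runs):
    """Scale every run so that its median feature intensity matches the reference.

    `runs` maps a run name to a non-empty list of positive feature intensities.
    The first run in the mapping serves as the reference (an assumption).
    """
    names = list(runs)
    reference = median(runs[names[0]])
    scaled = {}
    for name in names:
        factor = reference / median(runs[name])
        scaled[name] = [intensity * factor for intensity in runs[name]]
    return scaled
```

Median scaling corrects for global differences such as varying sample load, whereas quantile normalization additionally forces the full intensity distributions of all runs to agree.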


Benchmark

We have evaluated our method on a recently published, publicly available benchmark data set for label-free LC-MS data processing workflows from Ramus et al. [1]. They measured a proteomics standard consisting of an equimolar mix of 48 human proteins (Sigma UPS1) spiked into a complex yeast cell lysate background at nine different concentrations in three replicates. The entire data set thus comprises 27 full LC-MS runs. The main performance metric is a receiver operating characteristic (ROC) curve of a classifier determining differential abundance between two conditions with different spike-in concentrations. An ideal classifier would detect all spike-in proteins as differential and all background proteins as non-differential and would thus achieve an area under the curve (AUC) of 1. The ROC curve is plotted for the combined result containing three comparisons: 50 vs 0.5 fmol/µg, 50 vs 5 fmol/µg, and 25 vs 12.5 fmol/µg (each condition measured in three replicates). For each of the three comparisons, six LC-MS runs (two conditions, three replicates) were processed at once by the different investigated workflows. The results for all three comparisons were then merged into a single table, as in the original study, for further statistical analysis. The full table of quantified protein groups including statistical test results is available in the Supplementary Material.

We have replicated the exact statistical analysis described in that publication in order to assess the performance of LFQProfiler in comparison with MaxLFQ, MaxQuant, MFPaQ, and Skyline. In order to ensure a fair comparison, we recomputed the performance evaluation metrics for these tools based on the result tables accompanying the publication. The same statistical analysis was then applied to the results of LFQProfiler.
Where possible, we tried to choose settings comparable to the ones used in the other workflows from the benchmark publication, e.g., we set both the PSM-level and protein-level FDR thresholds to 1%. As in the original publication, missing values were imputed on the protein level as the 5th percentile of all protein abundances in the respective LC-MS run. The exact parameter settings of the LFQProfiler workflow are described in Supplementary Table S1. With these settings and after applying all filtering criteria [1], LFQProfiler was able to quantify a total of 2,535 proteins across the combined data set of all three comparisons (MaxLFQ 2,625; MaxQuant Intensity 2,644; MFPaQ 2,721; Skyline 2,620).

Figure 2 shows the ROCs of the different workflows. Since all investigated software solutions score very high on this data set (AUCs in the 97–99% range), we limited the ROC plot to the more informative false positive rate (FPR) interval [0, 0.05] and computed the corresponding relative partial AUCs (pAUC). Note that the relative pAUC of a perfect classifier would equal 100% and correspond to a pAUC of 0.05, which is the maximum pAUC possible for FPRs between 0 and 0.05. A random classifier would have a relative pAUC of 50%, corresponding to a pAUC of 0.025. LFQProfiler achieves a relative pAUC of 97.32%, which is the best performance among the evaluated workflows (MaxLFQ 96.44%; MFPaQ 93.05%; MaxQuant Intensity 91.27%; Skyline 85.70%). The slightly lower number of quantified proteins might be due to a more conservative protein-level FDR filtering strategy based on protein inference results in LFQProfiler. Supplementary Figure S1 shows the volcano plot corresponding to these results.

Running the entire LFQProfiler workflow on six input files took approximately 2h 20min (1h 47min processing step, 31 min consensus step) using a single core of a 3.20 GHz Intel Core i5-3470 machine with 16 GB of RAM. For comparison, running MaxLFQ on the same machine and input files took 5h 30min.
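To make the partial AUC concrete, the following sketch integrates a ROC curve over the FPR interval [0, 0.05] with the trapezoidal rule and divides by the maximum attainable area, so that a perfect classifier yields 0.05 and 100%, as stated above. Whether the reported values additionally include a chance correction (for example a standardization that maps a random classifier to 50%) is not spelled out in the text, so this sketch uses the plain ratio; the function names are our own.

```python
def partial_auc(fpr, tpr, fpr_max=0.05):
    """Area under the ROC curve for FPR in [0, fpr_max] (trapezoidal rule).

    `fpr` and `tpr` are equally long sequences describing the ROC curve,
    sorted by increasing FPR and starting at the point (0, 0).
    """
    area = 0.0
    for i in range(1, len(fpr)):
        x0, y0, x1, y1 = fpr[i - 1], tpr[i - 1], fpr[i], tpr[i]
        if x0 >= fpr_max:
            break
        if x1 > fpr_max:
            # Interpolate the curve linearly at the right border of the interval.
            y1 = y0 + (y1 - y0) * (fpr_max - x0) / (x1 - x0)
            x1 = fpr_max
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def relative_pauc(fpr, tpr, fpr_max=0.05):
    """Partial AUC as a fraction of the maximum possible area `fpr_max`."""
    return partial_auc(fpr, tpr, fpr_max) / fpr_max
```

For a perfect classifier, described by the ROC points (0, 0), (0, 1), (1, 1), `partial_auc` returns 0.05 and `relative_pauc` returns 1.0.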

RNPxl

We introduce RNPxl, a Proteome Discoverer workflow based on the work of Kramer et al. [20] for identification and localization of peptide-RNA cross-links that is quick and easy to use. UV cross-linking of proteins with RNA and identification of the resulting products has been used to assign novel binding regions and exact binding sites in proteins. The large number of potential cross-linked amino acids and oligonucleotides poses a data analysis challenge and has to be accounted for in MS database searching. Kramer et al. introduced an experimental MS workflow together with a processing pipeline implemented in the OpenMS


A schematic of the RNPxl workflow is shown in Figure 5. The “RNPxl” processing node encapsulates two conceptually distinct subworkflows. The first subworkflow can be seen as a computational cross-link enrichment step that aims at removing all tandem mass spectra of non-cross-linked analytes. To this end, all spectra that can be assigned to a (non-cross-linked) peptide, given a user-provided false discovery rate, are removed. If an (optional) non-cross-linked control is provided, the UV and the control file are first aligned in order to correct for chromatographic shifts. Extracted ion chromatograms (XICs) of potential cross-linked precursors are compared between control and UV, and all tandem spectra that show a strong signal in the control are discarded. The idea behind using a control to filter tandem spectra in the cross-linked sample is that signals in the control are known not to correspond to cross-linked peptides. Hence, they can be used to discard co-eluting precursors and the corresponding tandem mass spectra in the UV file. At this point, spectra that originate from non-cross-linked peptides or contaminants have been removed, and an enriched set of potential cross-linked tandem mass spectra is passed to the second subworkflow, which performs the actual cross-link search.

Compared to the original version of Kramer et al. [20], the workflow presented here is more convenient to use and its built-in visualization capabilities can greatly facilitate manual validation of the results. In addition, the current version of the underlying OpenMS tool used by our workflow contains several enhancements on the algorithmic side. To account for the complex fragmentation behavior of cross-linked moieties, we implemented a novel search engine designed specifically for peptide-RNA cross-link identification. It runs approximately 90 times faster than the original approach from Kramer et al. [20] (less than an hour instead of more than 80 hours on a large Orbitrap XL run with ∼30,000 spectra).
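The control-based filtering in the first subworkflow can be sketched as below. The text only states that precursors with a strong signal in the control are discarded; the use of an intensity ratio, its default threshold, the dictionary layout, and the function name are all hypothetical choices made for illustration.

```python
def filter_by_control(precursors, max_control_ratio=0.1):
    """Keep precursors whose XIC signal in the control is weak relative to UV.

    Each precursor is a dict with the keys 'xic_uv' and 'xic_control'
    (integrated XIC intensities after RT alignment; hypothetical layout).
    The ratio criterion and its default value are assumptions.
    """
    kept = []
    for precursor in precursors:
        if precursor["xic_uv"] <= 0:
            # No signal in the UV-irradiated sample: not a cross-link candidate.
            continue
        ratio = precursor["xic_control"] / precursor["xic_uv"]
        if ratio <= max_control_ratio:
            kept.append(precursor)
    return kept
```

Only the tandem mass spectra belonging to the retained precursors would then be passed on to the cross-link search.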
The original version by Kramer et al. [20] employed the highly customizable but relatively slow OMSSA algorithm [35] with custom parameter settings as an external tool for cross-link identification. In contrast, the new search engine implements the X!Tandem [36] scoring function but accounts for the various occurring fragment masses of the cross-linked molecules in a more efficient way. Differences in accuracy between the old and the new approach are negligible and can be attributed to the different scoring functions. Moreover, the new algorithm is able to automatically localize the cross-linked amino acids in the peptide sequence based on characteristic product ions of the cross-linked species. Localizations are scored using a heuristic in order to assess their confidence. To this end, an intensity- and distance-weighted linear score is calculated at each amino acid position that aggregates supporting and contradicting evidence from observed fragment ions (a-, b-, y-, immonium, and neutral-loss ions). The maximum scoring position is returned as the putative cross-linking site. The details of this new method (search engine and cross-link localization algorithm, scoring heuristics, benchmark comparisons) will be described in a separate publication (manuscript in preparation).

Visualization

In addition to the algorithmic improvements, our plugin offers an integrated cross-link spectrum visualization including annotations of peptide ions and cross-linked nucleotide ions, as illustrated in Figure 4. This feature substantially facilitates manual validation of the cross-link identifications, which is a crucial step when analyzing these data. For a thorough evaluation of the method and a complete example data set, see Kramer et al. [20]. Due to a lack of competing implementations, we cannot give a comparison to other tools in this case.

Figure 4: Annotation and visualization of cross-links. A tabular view for quick validation of spectrum match information contains protein accessions, precursor charge, m/z and RT, detected cross-linked peptide and RNA, charge, and score. A click on “Show Spectrum” opens an interactive spectrum visualization that highlights all detected fragment ions. In this example, a precursor heteroconjugate of a ribosomal peptide with a U-H2O RNA adduct fragmented into y-ions (blue), a-ions (green), cross-linked b-ions (violet), and a cross-linked immonium ion of tryptophan (brown).

Conclusion

We have developed user-friendly plugins for Proteome Discoverer that add novel community nodes powered by OpenMS. These enable two powerful data analysis workflows in PD: LFQProfiler for label-free quantification of peptides and proteins, and RNPxl for peptide-RNA cross-linking data analysis.

LFQProfiler is valuable for PD users who want to perform label-free quantification. Until now, the tools for label-free quantification in PD were rather unsatisfactory and limited to spectral counting and to calculating the area under the extracted ion chromatogram (XIC) of identified precursor ions. Proper MS1-based label-free quantification, however, requires a number of additional algorithmic steps (e.g., feature detection, mapping of identifications to quantified features, retention time alignment), all of which are handled by LFQProfiler. LFQProfiler is based on an established workflow described and benchmarked by Weisser et al. 30 but features a number of improvements and additions, such as protein inference, intensity normalization, faster algorithms, and a reduced memory footprint. We have demonstrated its performance and compared it to a selection of other tools in a benchmark setting defined by Ramus et al. 1 We showed that LFQProfiler performs at least on par with other state-of-the-art tools for label-free quantification.

The second workflow, termed RNPxl, represents the first software solution to date for the identification of peptide-RNA cross-links, including automatic localization of the cross-links at amino acid resolution and localization scoring. Compared to the original version described by Kramer et al., 20 it is substantially faster and more convenient to use, as it is fully integrated into the Proteome Discoverer GUI and comes with a customized interactive peptide-nucleotide cross-link spectrum viewer for convenient manual inspection of the results.

Availability

LFQProfiler and RNPxl binary installers for Proteome Discoverer 2.0 and 2.1, the user manual, and example workflows demonstrating the basic usage of our nodes are available free of charge at http://www.openms.org/pd. Both the C# plugins for Proteome Discoverer and the source code of the underlying OpenMS library and TOPP tools are open source, published under a BSD 3-clause license. Source code is available on GitHub at http://github.com/OpenMS/PDCommunityNodes and http://github.com/OpenMS/OpenMS.

Acknowledgement

The authors thank the Thermo Scientific Proteome Discoverer software development team for their support. Special thanks go to the Skyline / ProteoWizard teams for releasing their software, from parts of which our RNPxl spectrum viewer is derived, under the permissive Apache V2 license. The authors are grateful to Tjeerd Dijkstra for proof-reading the manuscript. T.S. and O.K. acknowledge funding from BMBF (de.NBI, grant no. 031 A535A). F.A. and O.K. acknowledge funding from the European Union (MARINA, grant no. 236215).

Supporting Information

The following files are available free of charge.

Veit_et_al_PD_nodes_JPR_supplement.pdf. Supplementary Table S1: Parameter settings of PD nodes in the LFQProfiler benchmark. Supplementary Figure S1: Volcano plot for LFQProfiler benchmarking results.

LFQProfiler_benchmark_results.xlsx. Quantification and statistical testing results of LFQProfiler in the Ramus et al. 1 benchmark.

References

(1) Ramus, C.; Hovasse, A.; Marcellin, M.; Hesse, A.-M. Benchmarking quantitative label-free LC-MS data processing workflows using a complex spiked proteomic standard dataset. J. Proteomics 2016, 132, 51–62.
(2) MacLean, B.; Tomazela, D. M.; Shulman, N.; Chambers, M.; Finney, G. L.; Frewen, B.; Kern, R.; Tabb, D. L.; Liebler, D. C.; MacCoss, M. J. Skyline: An open source document editor for creating and analyzing targeted proteomics experiments. Bioinformatics 2010, 26, 966–968.
(3) Cox, J.; Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 2008, 26, 1367–72.
(4) Cox, J.; Hein, M. Y.; Luber, C. A.; Paron, I.; Nagaraj, N.; Mann, M. Accurate Proteome-wide Label-free Quantification by Delayed Normalization and Maximal Peptide Ratio Extraction, Termed MaxLFQ. Mol. Cell. Proteomics 2014, 13, 2513–2526.
(5) Giardine, B.; Riemer, C.; Hardison, R. C.; Burhans, R.; Elnitski, L.; Shah, P.; Zhang, Y.; Blankenberg, D.; Albert, I.; Taylor, J.; Miller, W.; Kent, W. J.; Nekrutenko, A. Galaxy: A platform for interactive large-scale genome analysis. Genome Res. 2005, 15, 1451–1455.
(6) Blankenberg, D.; Kuster, G. V.; Coraor, N.; Ananda, G.; Lazarus, R.; Mangan, M.; Nekrutenko, A.; Taylor, J. Galaxy: A web-based genome analysis tool for experimentalists. Curr. Protoc. Mol. Biol. 2010, Chapter 19, Unit 19.10.1–21.
(7) Goecks, J.; Nekrutenko, A.; Taylor, J. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010, 11, R86.
(8) Hull, D.; Wolstencroft, K.; Stevens, R.; Goble, C.; Pocock, M. R.; Oinn, T. Taverna: a tool for building and running workflows of services. Nucleic Acids Res. 2006, 34, W729–32.
(9) Oinn, T.; Addis, M.; Ferris, J.; Marvin, D.; Carver, T.; Wipat, A.; Li, P. Taverna, lessons in creating a workflow environment for the life sciences. Computing 2006, 18, 1067–1100.
(10) Berthold, M. R.; Cebron, N.; Dill, F.; Gabriel, T. R.; Kötter, T.; Meinl, T.; Ohl, P.; Sieb, C.; Thiel, K.; Wiswedel, B. In Data Analysis, Machine Learning and Applications: Proceedings of the 31st Annual Conference of the Gesellschaft für Klassifikation e.V., Albert-Ludwigs-Universität Freiburg, March 7–9, 2007; Preisach, C., Burkhardt, H., Schmidt-Thieme, L., Decker, R., Eds.; Springer Berlin Heidelberg: Berlin, Heidelberg, 2008; Chapter KNIME: The Konstanz Information Miner, pp 319–326.
(11) Sturm, M.; Bertsch, A.; Gröpl, C.; Hildebrandt, A.; Hussong, R.; Lange, E.; Pfeifer, N.; Schulz-Trieglaff, O.; Zerck, A.; Reinert, K.; Kohlbacher, O. OpenMS - an open-source software framework for mass spectrometry. BMC Bioinformatics 2008, 9, 163.
(12) Kohlbacher, O.; Reinert, K.; Gröpl, C.; Lange, E.; Pfeifer, N.; Schulz-Trieglaff, O.; Sturm, M. TOPP - The OpenMS proteomics pipeline. Bioinformatics 2007, 23, e191–7.
(13) Deutsch, E. W.; Mendoza, L.; Shteynberg, D.; Farrah, T.; Lam, H.; Tasman, N.; Sun, Z.; Nilsson, E.; Pratt, B.; Prazen, B. A guided tour of the Trans Proteomic Pipeline. Proteomics 2010, 10, 1150–1159.
(14) Deutsch, E. W.; Mendoza, L.; Shteynberg, D.; Slagel, J.; Sun, Z.; Moritz, R. L. Trans-Proteomic Pipeline, a standardized data processing pipeline for large-scale reproducible proteomics informatics. Proteomics Clin. Appl. 2015, 9, 745–754.
(15) Junker, J.; Bielow, C.; Bertsch, A.; Sturm, M.; Reinert, K.; Kohlbacher, O. TOPPAS: A graphical workflow editor for the analysis of high-throughput proteomics data. J. Proteome Res. 2012, 11, 3914–3920.
(16) Specht, M.; Kuhlgert, S.; Fufezan, C.; Hippler, M. Proteomics to go: Proteomatic enables the user-friendly creation of versatile MS/MS data evaluation workflows. Bioinformatics 2011, 27, 1183–1184.
(17) Aiche, S.; Sachsenberg, T.; Kenar, E.; Walzer, M.; Wiswedel, B.; Kristl, T.; Boyles, M.; Duschl, A.; Huber, C. G.; Berthold, M. R.; Reinert, K.; Kohlbacher, O. Workflows for automated downstream data analysis and visualization in large-scale computational mass spectrometry. Proteomics 2015, 15, 1443–1447.
(18) Bouyssié, D.; Gonzalez de Peredo, A.; Mouton, E.; Albigot, R.; Roussel, L.; Ortega, N.; Cayrol, C.; Burlet-Schiltz, O.; Girard, J.-P.; Monsarrat, B. Mascot file parsing and quantification (MFPaQ), a new software to parse, validate, and quantify proteomics data generated by ICAT and SILAC mass spectrometric analyses: application to the proteomics study of membrane proteins from primary human endothelia. Mol. Cell. Proteomics 2007, 6, 1621–1637.
(19) Mueller, L. N.; Rinner, O.; Schmidt, A.; Letarte, S.; Bodenmiller, B. SuperHirn - a novel tool for high resolution LC-MS-based peptide / protein profiling. Proteomics 2007, 7, 1–11.

(20) Kramer, K.; Sachsenberg, T.; Beckmann, B. M.; Qamar, S.; Boon, K.-L.; Hentze, M. W.; Kohlbacher, O.; Urlaub, H. Photo-cross-linking and high-resolution mass spectrometry for assignment of RNA-binding sites in RNA-binding proteins. Nat. Methods 2014, 11, 1064–1070.
(21) Eng, J. K.; McCormack, A. L.; Yates, J. R., III. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5, 976–989.
(22) Perkins, D. N.; Pappin, D. J. C.; Creacy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551–3567.
(23) Käll, L.; Canterbury, J. D.; Weston, J.; Noble, W. S.; MacCoss, M. J. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods 2007, 4, 923–925.
(24) Ross, P. L. et al. Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol. Cell. Proteomics 2004, 3, 1154–1169.
(25) Thompson, A.; Schäfer, J.; Kuhn, K.; Kienle, S.; Schwarz, J.; Schmidt, G.; Neumann, T.; Hamon, C. Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS. Anal. Chem. 2003, 75, 1895–1904.
(26) Ong, S. E.; Blagoev, B.; Kratchmarova, I.; Kristensen, D. B.; Steen, H.; Pandey, A.; Mann, M. Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol. Cell. Proteomics 2002, 1, 376–386.
(27) Dorfer, V.; Pichler, P.; Stranzl, T.; Stadlmann, J.; Taus, T.; Winkler, S.; Mechtler, K. MS Amanda, a Universal Identification Algorithm Optimized for High Accuracy Tandem Mass Spectra. J. Proteome Res. 2014, 13, 3679–3684.

(28) Taus, T.; Köcher, T.; Pichler, P.; Paschke, C.; Schmidt, A.; Henrich, C.; Mechtler, K. Universal and Confident Phosphorylation Site Localization Using phosphoRS. J. Proteome Res. 2011, 10, 5354–5362.
(29) Serang, O.; Noble, W. S. Faster mass spectrometry-based protein inference: Junction trees are more efficient than sampling and marginalization by enumeration. IEEE/ACM Trans. Comput. Biol. Bioinform. 2012, 9, 809–817.
(30) Weisser, H.; Nahnsen, S.; Grossmann, J.; Nilse, L.; Quandt, A.; Brauer, H.; Sturm, M.; Kenar, E.; Kohlbacher, O.; Aebersold, R.; Malmström, L. An automated pipeline for high-throughput label-free quantitative proteomics. J. Proteome Res. 2013, 12, 1628–1644.
(31) Tyanova, S.; Temu, T.; Sinitcyn, P.; Carlson, A.; Hein, M. Y.; Geiger, T.; Mann, M.; Cox, J. The Perseus computational platform for comprehensive analysis of (prote)omics data. Nat. Methods 2016.
(32) R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing: Vienna, Austria, 2008; ISBN 3-900051-07-0.
(33) Nilse, L.; Sigloch, F. C.; Biniossek, M. L.; Schilling, O. Toward improved peptide feature detection in quantitative proteomics using stable isotope labeling. Proteomics Clin. Appl. 2015, 9, 706–714.
(34) Röst, H. L. et al. OpenMS: A flexible open-source software platform for computational mass spectrometry. Nat. Methods, in press.
(35) Geer, L. Y.; Markey, S. P.; Kowalak, J. A.; Wagner, L.; Xu, M.; Maynard, D. M.; Yang, X.; Shi, W.; Bryant, S. H. Open mass spectrometry search algorithm. J. Proteome Res. 2004, 3, 958–964.

(36) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20, 1466–1467.

Figure 5: for TOC only
