
TOPPAS: A Graphical Workflow Editor for the Analysis of High-Throughput Proteomics Data

Johannes Junker,*,†,⊥ Chris Bielow,*,‡,§,⊥ Andreas Bertsch,†,∥ Marc Sturm,† Knut Reinert,‡ and Oliver Kohlbacher†

†Applied Bioinformatics, Center for Bioinformatics and Quantitative Biology Center, University of Tübingen, Tübingen, Germany
‡Institute of Computer Science, Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
§International Max Planck Research School for Computational Biology and Scientific Computing, Berlin, Germany


ABSTRACT: Mass spectrometry coupled to high-performance liquid chromatography (HPLC−MS) is evolving more quickly than ever. A wide range of different instrument types and experimental setups are commonly used. Modern instruments acquire huge amounts of data, thus requiring tools for an efficient and automated data analysis. Most existing software for analyzing HPLC−MS data is monolithic and tailored toward a specific application. A more flexible alternative consists of pipeline-based tool kits allowing the construction of custom analysis workflows from small building blocks, e.g., the Trans-Proteomic Pipeline (TPP) or The OpenMS Proteomics Pipeline (TOPP). One drawback, however, is the hurdle of setting up complex workflows using command line tools. We present TOPPAS, The OpenMS Proteomics Pipeline ASsistant, a graphical user interface (GUI) for the rapid composition of HPLC−MS analysis workflows. Workflow construction reduces to simple drag-and-drop of analysis tools and drawing connections between them. Integration of external tools into these workflows is possible as well. Once workflows have been developed, they can be deployed in other workflow management systems or batch processing systems in a fully automated fashion. The implementation is portable and has been tested under Windows, Mac OS X, and Linux. TOPPAS is open-source software and available free of charge at http://www.OpenMS.de/TOPPAS.

KEYWORDS: mass spectrometry, proteomics, GUI, pipeline, OpenMS



INTRODUCTION

Mass spectrometry coupled to high-performance liquid chromatography (HPLC−MS) has become a key technology in proteomics and metabolomics. High separation performance on the HPLC column and high-resolution mass spectrometers result in huge data volumes that require automated data analysis. Additionally, many different techniques for quantification and identification and a wealth of different instrument types give rise to a broad range of computational problems. Not surprisingly, bioinformatics and data analysis have turned out to be the bottlenecks and a key research focus in proteomics and metabolomics.

Numerous algorithms and software tools have been developed over the past years. There are basically two types of open software tools: monolithic applications, usually with graphical user interfaces tailored toward specific applications (identification, quantification), and pipeline-based tool kits, such as the Trans-Proteomic Pipeline (TPP1) or The OpenMS Proteomics Pipeline (TOPP2). Monolithic tools are often easy to use but inflexible if the user deviates from the analysis workflow the developers had in mind or if different experimental setups are used. In contrast, pipeline tool kits are very flexible, and the individual tools can be freely combined to support new experimental setups with little effort. However, they are hard to use and often deployed in large core facilities only. Common to all open source platforms is the support of PSI formats, such as mzML,3 mzIdentML,4 or TraML,5 as a way to facilitate tool interoperability.

Scientific workflow management systems, such as Galaxy,6−8 Taverna,9,10 KNIME,11 Conveyor,12 Mobyle,13 Pegasus,14 or Kepler,15 can provide a more user-friendly interface to command line-based pipeline tool kits. A proof-of-concept implementation16 supporting multicore CPUs (no remote parallelization) integrated parts of the Trans-Proteomic Pipeline1 and X!Tandem17 into the Taverna Workbench. However, integrating new tools into these generic workflow systems can be difficult. An alternative specifically tailored to the analysis of HPLC−MS data is Proteomatic.18 Here, a selection of scripts for analyzing proteomics data is available and can be incorporated into custom workflows using a graphical user interface (GUI). In addition, Proteomatic provides adapters to external tools, such as OMSSA.19

We present TOPPAS, The OpenMS Proteomics Pipeline ASsistant, a graphical workflow editor integrated into the OpenMS/TOPP framework.2,20 It enables fast construction of custom analysis workflows using all TOPP tools from OpenMS as well as arbitrary external programs such as ProteinProphet.21 TOPPAS also facilitates sharing established workflows by simply sending a single file to a collaborator or through our online repository of shared standard workflows. TOPPAS is suitable for a wide range of applications without the need to write shell scripts or to do any programming whatsoever. In contrast to generic workflow management systems, setting up and using TOPPAS is straightforward.

Figure 1. Creation of a simple workflow. Tool nodes can be dragged and dropped from the list of tools (left pane) to the workflow canvas (center). Documentation for the whole workflow or individual tools is displayed on the right.

TOPPAS is included in version 1.9 or later of OpenMS. Installation takes only a few minutes using readily available binary packages for all major operating systems. TOPPAS has the complete functionality of all TOPP tools available out-of-the-box. This includes a wealth of efficient algorithms for signal processing and preprocessing, peptide property prediction, quantification using different experimental techniques, e.g., SILAC, iTRAQ, and label-free analyses, as well as adapters for identification using several popular search engines, including OMSSA,19 Mascot,22 and X!Tandem.17 Moreover, almost any other external program can be integrated into TOPPAS by providing a simple configuration file. We provide sample configuration files for some tools of interest on the OpenMS Web site.

Workflows can be created and run locally on the user's machine. Alternatively, a command line version of TOPPAS without the graphical user interface enables workflow execution for batch processing of a larger number of data sets. In order to take advantage of modern multicore CPUs, all processing steps that are independent of each other can be executed in parallel. The user can choose the number of parallel jobs to be executed on the machine. No additional configuration steps are required. In the following, we will describe the main features of TOPPAS and showcase its use with several examples, ranging from very simple to rather complex workflows.
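The parallel execution model just described, in which steps that do not depend on each other run concurrently up to a user-chosen number of jobs, can be sketched in a few lines of Python. The sketch below is a conceptual illustration only, not the OpenMS/TOPPAS implementation; the node names and the placeholder echo commands are invented.

```python
# Conceptual sketch of DAG scheduling: a node is started as soon as all of
# its predecessors have finished, with at most `max_jobs` running at a time.
# Node names and commands are placeholders, not an actual TOPPAS workflow.
import subprocess
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

predecessors = {                      # node -> nodes it depends on
    "FileConverter": [],
    "FeatureFinderCentroided": ["FileConverter"],
    "OMSSAAdapter": ["FileConverter"],            # independent of the feature finder
    "IDMapper": ["FeatureFinderCentroided", "OMSSAAdapter"],
}
commands = {node: ["echo", node] for node in predecessors}  # stand-in commands

def run_workflow(max_jobs: int) -> None:
    done, running = set(), {}
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        while len(done) < len(predecessors):
            for node, deps in predecessors.items():   # submit every runnable node
                if node not in done and node not in running and set(deps) <= done:
                    running[node] = pool.submit(subprocess.run, commands[node], check=True)
            finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
            for node, future in list(running.items()):
                if future in finished:
                    done.add(node)
                    del running[node]

run_workflow(max_jobs=2)  # e.g., two parallel jobs, as chosen by the user
```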



USAGE AND FEATURES

User Interface

TOPPAS features a user-friendly GUI that allows users to create, edit, save, and run workflows. The parameters of all involved tools can be adjusted within the application and are also saved as part of the pipeline definition in a workflow file. Furthermore, TOPPAS interactively performs validity checks during the pipeline editing process and before execution.

Figure 1 shows the TOPPAS main window with a simple pipeline that is just being created. The user has added several tool nodes to a workflow by dragging them from the TOPP tool list on the left to the central area. Additionally, special nodes for input and output files have been added. Edges were drawn between the nodes that determine the data flow of the pipeline. An edge maps an output file of a source node to an input file of the target node. A TOPP node might have more than one input or output file parameter; e.g., the OMSSAAdapter has two input files, an mzML file and a FASTA database file. When an edge is created and either the source or the target node has more than one input or output parameter, an input/output parameter mapping dialogue is displayed in which the user selects the output parameter of the source node and the input parameter of the target node. In order to facilitate workflow construction, TOPPAS does not permit adding edges whose source and target file types are incompatible with each other or edges that would lead to a cyclic workflow. Figure 2 shows the parameter editing dialogue that appears when a tool node is double-clicked.

Figure 2. Each tool has parameters that can be adjusted through a dialogue. Each parameter is explained in the lower part of the dialogue, and a simple validity check on the parameters is performed automatically.
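The interactive validity checks mentioned above (rejecting edges whose file types do not match and edges that would close a cycle) amount to a simple type lookup plus a reachability test on the workflow graph. The following Python sketch illustrates the idea; the example graph, node names, and type table are invented for illustration and are not taken from the TOPPAS source code.

```python
# Illustration of the two edge checks: type compatibility and acyclicity.
# The example graph, output types, and accepted input types are made up.
edges = {"Input(mzML)": ["OMSSAAdapter"], "OMSSAAdapter": ["IDFilter"], "IDFilter": []}
output_type = {"Input(mzML)": "mzML", "OMSSAAdapter": "idXML", "IDFilter": "idXML"}
accepted_inputs = {"OMSSAAdapter": {"mzML", "fasta"}, "IDFilter": {"idXML"}}

def creates_cycle(source: str, target: str) -> bool:
    """Adding source -> target closes a cycle iff source is reachable from target."""
    stack, seen = [target], set()
    while stack:
        node = stack.pop()
        if node == source:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return False

def edge_is_valid(source: str, target: str) -> bool:
    type_ok = output_type[source] in accepted_inputs.get(target, set())
    return type_ok and not creates_cycle(source, target)

print(edge_is_valid("Input(mzML)", "OMSSAAdapter"))  # True: types match, graph stays acyclic
print(creates_cycle("IDFilter", "OMSSAAdapter"))     # True: this edge would be rejected
```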

Figure 3. Basic identification workflow.


Once the pipeline is set up, input files have to be specified before it can be run. This is done by double-clicking an input node and selecting the desired input files in the dialogue that appears. As soon as a valid set of input files has been selected, the corresponding edge will turn green and the workflow is ready for execution. During pipeline execution, the circles in the top-right corner of the tools indicate whether a tool has finished successfully (green), is currently running (yellow), has not been executed yet (gray), or could not be executed successfully (red). When the execution has finished, the output files generated by each of the workflow’s nodes can be inspected quickly by selecting Open output in TOPPView from its context menu.

External Software Tools

In addition to all TOPP tools, which are included with the OpenMS/TOPP distributions, it is also possible to add custom nodes to a pipeline. These nodes can represent almost any external command line tool, from analysis tools like ProteinProphet to R23 for statistical data analysis. It has recently been shown that in some scenarios, heterogeneous workflows incorporating LC−MS analysis tools from different software suites can achieve higher performance than homogeneous workflows.24 Thus, the ability to also include external tools is highly desirable.

Integrating an external tool into TOPPAS requires a TOPP tool description (TTD) file. This is an XML file specifying the input and output parameters of the tool and how they should be exposed in TOPPAS. For convenience, we provide preconfigured TTD files on our Web site for a number of common tools. TTD files have a simple structure, and the examples given can be easily modified for new tools within a few minutes based on the documentation of the tool. Once the TTD file is in place, the corresponding node can be found in the EXTERNAL section of the tool menu and used in the same way as any other tool node. For the node to run, the external program has to be installed first. An example TTD file and details on its format can be found in the Supporting Information.
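Conceptually, an external-tool node only has to map the files arriving on its incoming edges onto a command line and announce the files it writes. The Python sketch below illustrates that mapping with an invented placeholder syntax; it is not the TTD format itself (an example TTD file is provided in the Supporting Information), and the echo command merely stands in for a real program such as msconvert or ProteinProphet.

```python
# Sketch of the idea behind external-tool nodes: substitute the node's input
# and output files into a command template and run the external program.
# The template syntax is invented; the real mechanism is the TTD file
# described in the Supporting Information.
import shlex
import subprocess

def run_external_tool(template: str, files: dict) -> None:
    """Fill the {placeholders} of a command template and execute the result."""
    command = shlex.split(template.format(**files))
    subprocess.run(command, check=True)

# 'echo' stands in for an actual converter or search engine binary.
run_external_tool(
    "echo converting {infile} to {outfile}",
    {"infile": "sample01.raw", "outfile": "sample01.mzML"},
)
```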

Workflow Concepts

TOPPAS pipelines follow a round-based concept: in the simplest scenario, the entire workflow is traversed exactly once for each input file. We refer to each traversal as one round of processing. However, only the most basic workflows are strictly linear, meaning that a set of input files is sequentially processed by one or more tools and exactly one output file is produced for each input file. More complex workflows may contain more than one input node. An example pipeline with two different input nodes is illustrated in Figure 4. Even more advanced workflows require results from two or more different processing branches to be merged or certain files to be reused multiple times (e.g., in identification, when several data sets are searched against one and the same FASTA database). For these purposes, we introduce three additional elements of our workflow language: two special nodes, called Merge and Collect, which combine the results of multiple incoming workflow branches, and the Recycle mode, which allows the same file to be reused over multiple rounds.

A Merge node can have arbitrarily many incoming connections from preceding nodes. In each round, it compiles a new list of files consisting of exactly one file per connected predecessor and passes this list of files to its successor node(s). Thus, the lists of files from all preceding nodes must have equal length and must be in the same order, such that corresponding files are merged together. A Collect node behaves similarly but waits for all rounds to finish before passing on a combined list of all output files from all its predecessors. Thus, successors of a Collect node will be called only once during the entire pipeline run.

Another useful concept is Recycling, where the (output) files of a node can be reused in multiple rounds. For example, the database input node in Figure 3 is set to constantly feed the same FASTA database to the OMSSAAdapter node, which is called three times (once for each mzML input file). In this case, the workflow is valid although its two input nodes contain different numbers of input files, since the FASTA database can be reused in every round. Without recycling, one would need to specify a list of three identical FASTA files instead. A use case of both merging and input recycling is illustrated in Figure 5. Figure 4 demonstrates the usage of a Collect node.
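The round semantics of Merge, Collect, and Recycle can be made concrete with a short sketch. The Python below only illustrates the scheduling behavior described above, with invented file names; it is not TOPPAS code.

```python
# Illustration of rounds, Recycle, Merge, and Collect (not TOPPAS code).
spectra = ["run1.mzML", "run2.mzML", "run3.mzML"]   # input node 1: one file per round
database = ["targets_and_decoys.fasta"]             # input node 2, set to Recycle mode
n_rounds = len(spectra)

# Recycle mode: the short list is reused so that every round sees the same file.
recycled_db = [database[i % len(database)] for i in range(n_rounds)]

# Merge node: in every round, combine exactly one file per incoming branch;
# the incoming lists must therefore correspond round by round.
merged_per_round = list(zip(spectra, recycled_db))

# Collect node: wait until the last round has finished, then hand the combined
# list to the successor in a single call (e.g., to FeatureLinkerUnlabeled).
per_round_outputs = [f"run{i + 1}.featureXML" for i in range(n_rounds)]
collected = [per_round_outputs]  # one call receiving all files at once

print(merged_per_round)
print(collected)
```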


Figure 4. This basic label-free quantification workflow is one of the example pipelines included in TOPPAS. It ships with three mzML files containing MS1 scans of varying concentrations of bovine serum albumin (BSA) as well as three idXML files containing identifications from a search engine run on the corresponding MS2 data. The context menu of any node can be used to open its output in either TOPPView or a file system browser. Visualizations from TOPPView of the mzML input files as well as the single consensusXML output file are superimposed for illustrative purposes.

Figure 5. iTRAQ identification and quantification pipeline featuring multiple search engines, easy to substitute protein databases, and FDR filtering.

Using Preconfigured Workflows

Stable workflows are often reused by collaborators, perhaps in a slightly modified form. Thus, sharing workflows should be as easy as possible. In TOPPAS, the whole pipeline and parameter information is stored in a compact file, which can be distributed conveniently. As a good starting point, we provide some standard workflows on our Web site in an online repository, which can be downloaded either through a standard web browser or directly from within TOPPAS (see File → Online Repository in the TOPPAS menu). Users of TOPPAS can also submit their pipelines. We will gladly add them to the online repository. For demonstration purposes, we also provide several example data sets and pipelines (see File → Open example file).

Batch Execution of Workflows

Once set up and saved, a workflow can also be run without the GUI using the TOPP tool ExecutePipeline. As input files (e.g., in mzML format) change frequently, the user can also provide a resource file to ExecutePipeline that specifies the input to the pipeline. Pipelines can thus be developed and tested on a desktop machine and then easily deployed in high-throughput environments for automatic processing of larger data sets (e.g., in core facilities).
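In a batch setting, this typically boils down to calling ExecutePipeline once per data set from a small driver script. The sketch below shows the general pattern in Python; the option names (-in, -resource_file, -out_dir), the .trf resource-file extension, and the directory layout are assumptions for illustration and should be checked against the ExecutePipeline documentation of the installed OpenMS version.

```python
# Hypothetical batch driver for a saved TOPPAS workflow. Option names and the
# resource-file extension are assumptions; consult `ExecutePipeline --help`.
import pathlib
import subprocess

workflow = "label_free_quant.toppas"             # previously saved with TOPPAS
resources = sorted(pathlib.Path("resources").glob("*.trf"))

for resource in resources:                       # one resource file per data set
    out_dir = pathlib.Path("results") / resource.stem
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ExecutePipeline", "-in", workflow,
         "-resource_file", str(resource), "-out_dir", str(out_dir)],
        check=True,
    )
```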

Availability

TOPPAS is included in version 1.9 or later of the open source C++ software library OpenMS,20 running on all major platforms (Windows, Linux, Mac OS X). Binary and source packages and installation instructions, as well as a TOPPAS tutorial, are available at http://www.OpenMS.de/TOPPAS.



APPLICATION EXAMPLES

In order to review the features of TOPPAS, we will describe several examples of varying complexity.

The first example is a basic identification pipeline using the database search engine OMSSA.19 Figure 3 shows the overall layout of the workflow. It accepts one or more mzML files containing the tandem spectra on input node 1. Note that vendor-specific formats, e.g., RAW, can be used after appropriate conversion.25 On the Windows operating system, this conversion can also be performed within TOPPAS. Input node 2 contains the FASTA database. The database also contains decoy versions of all protein sequences in order to allow calculation of false discovery rates (FDRs). After identification, PeptideIndexer annotates for each search result whether it originates from the target or from the decoy part of the sequence database. With this information, the FalseDiscoveryRate tool is able to estimate the FDR for each of the peptide-spectrum matches. Finally, the IDFilter is used to retain only those peptide-spectrum matches with an FDR of at most 5%. A possible extension of this pipeline would be to perform the spectrum annotation using multiple search engines and combine the results afterward using the ConsensusID tool. The results may also be exported using TextExporter for further analysis with external tools, for example, Microsoft Excel.

Our second example is the basic label-free quantification pipeline illustrated in Figure 4. Input node 1 contains three mzML files. FeatureFinderCentroided finds the peptide features in each of these maps and passes on three featureXML files. Corresponding peptide identifications in idXML format (obtained in advance) are mapped to each of these featureXML files using IDMapper, which then produces one featureXML output file (now including sequence annotations) for every pair of corresponding featureXML and idXML input files. The Collect node waits for all three rounds to finish, then runs FeatureLinkerUnlabeled once, with all three annotated featureXML files as input, which creates a single consensusXML output file.

A more complex pipeline is shown in Figure 5. It combines identification using multiple search engines and quantification of iTRAQ reporters. The database used for peptide identification is easy to substitute, as it is represented as a dedicated input node in Recycle mode.

To demonstrate the ability of TOPPAS to integrate external software tools, we wrote TTD files for ProteoWizard's msconvert and the widely known ProteinProphet included in the TPP. See Figure 6 for an example. All TTD files described in this publication are included in TOPPAS by default.

Figure 6. Workflow showing the use of the external tools msconvert and ProteinProphet.
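The target-decoy FDR filtering used in the first example can be written out explicitly. The snippet below is a generic illustration of the standard decoy-based estimate (the number of decoy hits divided by the number of target hits above a score threshold) and of retaining only matches at or below 5% FDR; it is not the FalseDiscoveryRate or IDFilter implementation, and the scores are invented.

```python
# Generic target-decoy FDR illustration (not the OpenMS implementation).
# Each peptide-spectrum match (PSM) has a search score and a flag indicating
# whether it hit the target or the decoy part of the database.
psms = [  # (score, is_decoy) -- invented example values
    (95.0, False), (90.0, False), (88.0, True), (80.0, False),
    (75.0, False), (70.0, True), (60.0, False), (55.0, True),
]

def fdr_at(threshold: float) -> float:
    targets = sum(1 for score, decoy in psms if score >= threshold and not decoy)
    decoys = sum(1 for score, decoy in psms if score >= threshold and decoy)
    return decoys / max(targets, 1)  # simple decoy/target estimate

# Keep the largest set of target PSMs whose estimated FDR stays at or below 5%.
kept = []
for threshold in sorted({score for score, _ in psms}, reverse=True):
    if fdr_at(threshold) <= 0.05:
        kept = [score for score, decoy in psms if score >= threshold and not decoy]

print(f"retained {len(kept)} target PSMs at <= 5% FDR")
```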



CONCLUSION

TOPPAS allows non-computer scientists to easily set up new data analysis workflows for mass spectrometric data. It is a valuable tool for designing custom analysis pipelines, while facilitating sharing of existing solutions. Even for bioinformaticians, building a workflow prototype with TOPPAS is much faster and more robust than with custom shell scripts. The entire workflow, together with the parameters of all involved tools as well as a workflow description, is stored in a single file. It is thus simple to share and document the final pipelines. One of the main advantages over generic workflow management systems is its straightforward setup and usability. The graphical workflow language is simple enough to be readily used by everyone. Yet, in our experience, it is sufficiently expressive to describe a wide range of MS data analysis workflows. Even complex, branched workflows can be easily modeled; the interdependencies of the separate branches are resolved correctly, while processing tasks independent of each other can be run in parallel on a multicore CPU.

By default, TOPPAS is equipped with all TOPP tools. These implement a variety of efficient algorithms for numerous tasks in the computational analysis of HPLC−MS data. Arbitrary external command line tools can be easily integrated by writing a simple configuration file describing their interfaces. Established workflows can be run without the GUI using the ExecutePipeline TOPP tool. This enables the use of TOPPAS pipelines in a high-throughput setting, where a visual interface is no longer needed once the pipeline has been tested. For future versions, we plan to facilitate the transfer of TOPPAS workflows to other workflow management systems as well as an integration with the data management and analysis framework openBIS.26

TOPPAS, TOPP, and OpenMS are open-source software. The rich infrastructure provided by OpenMS facilitates rapid prototyping of new, efficient algorithms for HPLC−MS data analysis. Novel functionality is constantly being developed, and releases of new OpenMS versions are scheduled on a regular basis. Currently, the OpenMS project is actively developed and maintained by more than 30 developers, which ensures the sustainable development of the package.





ASSOCIATED CONTENT

Supporting Information

This material is available free of charge via the Internet at http://pubs.acs.org/.



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]; [email protected].

Author Contributions

⊥Joint first authors.

Notes

The authors declare no competing financial interest.
∥Deceased March 29, 2010.



ACKNOWLEDGMENTS

C.B. gratefully acknowledges funding by the European Commission's Seventh Framework Program (GA202222). O.K. and J.J. acknowledge funding by BMBF (SARA, MoSGrid) and the European Commission's FP7 (GA236215, GA262067, and GA283481).



REFERENCES

(1) Keller, A.; Eng, J.; Zhang, N.; Li, X.-j.; Aebersold, R. A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Mol. Syst. Biol. 2005, 1, 2005.0017.
(2) Kohlbacher, O.; Reinert, K.; Gröpl, C.; Lange, E.; Pfeifer, N.; Schulz-Trieglaff, O.; Sturm, M. TOPP−the OpenMS proteomics pipeline. Bioinformatics 2007, 23, e191−7.
(3) Martens, L.; et al. mzML−a community standard for mass spectrometry data. Mol. Cell. Proteomics 2011, 10, R110.000133.
(4) Eisenacher, M. mzIdentML: an open community-built standard format for the results of proteomics spectrum identification algorithms. Methods Mol. Biol. 2011, 696, 161−77.
(5) Deutsch, E. W.; Chambers, M.; Neumann, S.; Levander, F.; Binz, P.-A.; Shofstahl, J.; Campbell, D. S.; Mendoza, L.; Ovelleiro, D.; Helsens, K.; Martens, L.; Aebersold, R.; Moritz, R. L.; Brusniak, M.-Y. TraML−a standard format for exchange of selected reaction monitoring transition lists. Mol. Cell. Proteomics 2012, 11, R111.015040.
(6) Blankenberg, D.; Von Kuster, G.; Coraor, N.; Ananda, G.; Lazarus, R.; Mangan, M.; Nekrutenko, A.; Taylor, J. Galaxy: a web-based genome analysis tool for experimentalists. Curr. Protoc. Mol. Biol. 2010, Chapter 19, Unit 19.10, 1−21.
(7) Goecks, J.; Nekrutenko, A.; Taylor, J. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010, 11, R86.
(8) Giardine, B.; Riemer, C.; Hardison, R. C.; Burhans, R.; Elnitski, L.; Shah, P.; Zhang, Y.; Blankenberg, D.; Albert, I.; Taylor, J.; Miller, W.; Kent, W. J.; Nekrutenko, A. Galaxy: a platform for interactive large-scale genome analysis. Genome Res. 2005, 15, 1451−5.
(9) Oinn, T.; et al. Taverna: lessons in creating a workflow environment for the life sciences. Concurrency Comput.: Pract. Exper. 2006, 18, 1067−100.
(10) Hull, D.; Wolstencroft, K.; Stevens, R.; Goble, C.; Pocock, M. R.; Li, P.; Oinn, T. Taverna: a tool for building and running workflows of services. Nucleic Acids Res. 2006, 34, W729−32.
(11) Berthold, M. R.; Cebron, N.; Dill, F.; Gabriel, T. R.; Kötter, T.; Meinl, T.; Ohl, P.; Thiel, K.; Wiswedel, B. KNIME - the Konstanz information miner. ACM SIGKDD Explor. Newsl. 2009, 11, 26.
(12) Linke, B.; Giegerich, R.; Goesmann, A. Conveyor: a workflow engine for bioinformatic analyses. Bioinformatics 2011, 27, 903−11.
(13) Néron, B.; Ménager, H.; Maufrais, C.; Joly, N.; Maupetit, J.; Letort, S.; Carrere, S.; Tuffery, P.; Letondal, C. Mobyle: a new full web bioinformatics framework. Bioinformatics 2009, 25, 3005−11.
(14) Deelman, E.; Singh, G.; Su, M.-H.; Blythe, J.; Gil, Y.; Kesselman, C.; Mehta, G.; Vahi, K.; Berriman, G. B.; Good, J.; Laity, A.; Jacob, J. C.; Katz, D. S. Pegasus: a framework for mapping complex scientific workflows onto distributed systems. Sci. Program. 2005, 13, 219−37.
(15) Altintas, I.; Berkley, C.; Jaeger, E.; Jones, M.; Ludascher, B.; Mock, S. Kepler: an extensible system for design and execution of scientific workflows. Proceedings of the 16th International Conference on Scientific and Statistical Database Management, 2004; pp 423−424.
(16) de Bruin, J. S.; Deelder, A. M.; Palmblad, M. Scientific workflow management in proteomics. Mol. Cell. Proteomics 2012, DOI: 10.1074/mcp.M111.010595.
(17) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20, 1466−7.
(18) Specht, M.; Kuhlgert, S.; Fufezan, C.; Hippler, M. Proteomics to go: Proteomatic enables the user-friendly creation of versatile MS/MS data evaluation workflows. Bioinformatics 2011, 27, 1183−4.
(19) Geer, L. Y.; Markey, S. P.; Kowalak, J. A.; Wagner, L.; Xu, M.; Maynard, D. M.; Yang, X.; Shi, W.; Bryant, S. H. Open mass spectrometry search algorithm. J. Proteome Res. 2004, 3, 958−64.
(20) Sturm, M.; Bertsch, A.; Gröpl, C.; Hildebrandt, A.; Hussong, R.; Lange, E.; Pfeifer, N.; Schulz-Trieglaff, O.; Zerck, A.; Reinert, K.; Kohlbacher, O. OpenMS - an open-source software framework for mass spectrometry. BMC Bioinform. 2008, 9, 163.
(21) Nesvizhskii, A. I.; Keller, A.; Kolker, E.; Aebersold, R. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 2003, 75, 4646−58.
(22) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551−67.
(23) R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, 2008; Vol. 1.
(24) Hoekman, B.; Breitling, R.; Suits, F.; Bischoff, R.; Horvatovich, P. msCompare: a framework for quantitative analysis of label-free LC−MS data for comparative biomarker studies. Mol. Cell. Proteomics 2012, DOI: 10.1074/mcp.M111.015974.
(25) Kessner, D.; Chambers, M.; Burke, R.; Agus, D.; Mallick, P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics 2008, 24, 2534−6.
(26) Bauch, A.; Adamczyk, I.; Buczek, P.; Elmer, F.-J.; Enimanev, K.; Glyzewski, P.; Kohler, M.; Pylak, T.; Quandt, A.; Ramakrishnan, C.; Beisel, C.; Malmström, L.; Aebersold, R.; Rinn, B. openBIS: a flexible framework for managing and analyzing complex data in biology research. BMC Bioinformatics 2011, 12, 468.

