
Technical Note

Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.9b01556 • Publication Date (Web): 15 Jul 2019


Analytical Chemistry

WinProphet: a user-friendly pipeline management system for proteomics data analysis based on Trans-Proteomic Pipeline

Ching-Tai Chen1,‡,*, Chu-Ling Ko2,‡, Wai-Kok Choong1,‡, Jen-Hung Wang1, Wen-Lian Hsu1, and Ting-Yi Sung1,*

1Institute of Information Science, Academia Sinica, Nankang, Taipei 115, Taiwan
2Department of Computer Science, National Chiao Tung University, Hsinchu 300, Taiwan

ABSTRACT: Protein and peptide identification and quantitation are essential tasks in proteomics research and involve a series of steps in analyzing mass spectrometry data. Trans-Proteomic Pipeline (TPP) provides a wide range of useful tools through its web interfaces for analyses such as sequence database search, statistical validation, and quantitation. To utilize the functionality of TPP without manual intervention to launch each step, we developed a software tool, WinProphet, that creates and automatically executes pipelines for proteomic analyses. It integrates with TPP and other external command-line programs and supports database search for protein and peptide identification, spectral library construction and search, DIA (Data-Independent Acquisition) data analysis, and isobaric-labeling and label-free quantitation. WinProphet is a standalone and installation-free tool with graphical interfaces for users to configure, manage, and automatically execute pipelines. The constructed pipelines can be exported as XML files with all of the parameter settings for reusability and portability. The executable files, user manual, and sample data sets of WinProphet are freely available at http://ms.iis.sinica.edu.tw/COmics/Software_WinProphet.html.

Proteomics research usually involves a workflow that analyzes mass spectrometry data for protein and peptide identification and quantitation. Several software packages have been developed for this purpose, among which Trans-Proteomic Pipeline (TPP)1 is frequently used and supports functions for different tasks, e.g., sequence database search, statistical validation, and protein quantitation. TPP provides web-based user interfaces to launch its functions, but each step of a workflow must be started manually. When a large number of mzML or mzXML files need to be processed or the analysis workflow is complicated, automated execution and reuse of the entire workflow are desirable. To satisfy this need, a number of proteomics pipeline tools have been developed. Cloud CPFP2 is built upon the tools in TPP and can be installed on either Linux or Amazon Web Services. ProteoCloud3 integrates five database search tools and one de novo sequencing algorithm, and uses a voting algorithm to combine the identification results. Both are cloud-based software tools, which have the advantage of scalability in a cloud computing environment (or a PC cluster). APP4 supports a number of TPP functions through a server/worker interface and can be executed on either cloud computing systems or local machines. It provides essential components for proteomics data analysis, although it requires complicated setup procedures and is not compatible with newer versions of TPP. While cloud and distributed computing are suitable for analyzing large-scale data sets, they require additional time and manpower to set up and maintain the computing environment. In contrast, it is usually easier and more efficient to analyze medium- or small-sized proteomics data sets locally, given the rapid improvement of multicore CPUs, memory capacity, and solid-state drives on a single machine.

Nevertheless, to date little attention has been paid to developing a standalone pipeline tool for a single machine with multithread capability. Therefore, in this paper we present WinProphet, a lightweight and installation-free Windows-based proteomics pipeline software tool. Users can incorporate into their pipelines not only all the functions in TPP but also other external programs. WinProphet provides graphical user interfaces that allow users to create, automatically execute, monitor, and reuse pipelines for proteomics data analysis.

Figure 1. The framework of WinProphet.

DESIGN AND IMPLEMENTATION

WinProphet is developed in C# with Visual Studio and can be executed without installation on the Windows platform. The prerequisites are Microsoft .NET Framework 4.0 or newer, TPP 5.1.0, and MSGF+ v2017.01.03 (if needed). The framework of WinProphet is illustrated in Figure 1. The pipeline editor provides user-friendly interfaces that allow users to configure a pipeline, i.e., selecting functions or tools, setting parameters, and adjusting the order of operations. It supports sequence database search with Comet,5 X!Tandem,6 and MSGF+,7 spectral library construction and search with SpectraST,8 file conversion of search results from t.xml and mzid to the pep.xml format, and statistical validation at different levels with PeptideProphet,9 iProphet,10 ProteinProphet,11 and Mayu.12 Graphical user interfaces are provided for setting the parameters of these database search and statistical validation tools. It also supports a “User Command” function, through which users can perform quantitation using TPP built-in programs such as Libra13 and StPeter,14 and execute any other external program by providing console-mode commands.

Once a pipeline is constructed, WinProphet automatically executes the operations of the entire pipeline in sequence. The pipeline can be exported as an XML file (winprophet.xml) with all of the settings (functions, parameters, input and output files). Users can import an XML file to edit an existing pipeline and re-execute it. This is handy when users want to tweak search parameters for purposes such as maximizing identification coverage or identifying more peptides with varying post-translational modifications. The XML schema is provided in Text S1 in the Supporting Information.

During execution, a progress bar shows the progress of the pipeline. Console output and error messages of each operation are displayed in two separate status windows. The identification results (pepXML file for PeptideProphet and iProphet, protXML file for ProteinProphet, and csv file for Mayu) can be displayed through the Report Generator function, which contains a protein table, a peptide table, and a PSM (peptide-spectrum-match) table, as shown in Figure S1 in the Supporting Information.
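The export and import of a pipeline as XML can be illustrated with a minimal Python sketch. The actual schema is given in Text S1 and is not reproduced here, so the element and attribute names (`pipeline`, `operation`, `input`, `output`, `param`), the file names, and the parameter value are all assumptions chosen for illustration; they do not describe the real winprophet.xml format.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Operation:
    """One pipeline row: a function type plus its files and parameters."""
    function: str
    inputs: list
    outputs: list
    params: dict = field(default_factory=dict)


def export_pipeline(ops):
    """Serialize operations to an XML string (hypothetical element names)."""
    root = ET.Element("pipeline")
    for op in ops:
        node = ET.SubElement(root, "operation", function=op.function)
        for path in op.inputs:
            ET.SubElement(node, "input").text = path
        for path in op.outputs:
            ET.SubElement(node, "output").text = path
        for name, value in op.params.items():
            ET.SubElement(node, "param", name=name).text = str(value)
    return ET.tostring(root, encoding="unicode")


def import_pipeline(xml_text):
    """Rebuild the operation list from XML produced by export_pipeline."""
    ops = []
    for node in ET.fromstring(xml_text).findall("operation"):
        ops.append(Operation(
            function=node.get("function"),
            inputs=[e.text for e in node.findall("input")],
            outputs=[e.text for e in node.findall("output")],
            params={e.get("name"): e.text for e in node.findall("param")},
        ))
    return ops


# Illustrative two-step pipeline; file names and the parameter are made up.
pipeline = [
    Operation("Comet", ["sample.mzXML"], ["sample.comet.pep.xml"],
              {"peptide_mass_tolerance": "20"}),
    Operation("PeptideProphet", ["sample.comet.pep.xml"],
              ["sample.comet.interact.pep.xml"]),
]
restored = import_pipeline(export_pipeline(pipeline))
```

Because every setting travels with the file, a round trip through export and import yields the same list of operations, which is the property that makes a saved pipeline reusable on another machine or with edited parameters.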

RESULTS

WinProphet can be applied to various data analyses, including identification from data-dependent acquisition and data-independent acquisition (DIA) data, spectral library search, and quantitation. To showcase its capability, we have constructed five pipeline examples.

The first pipeline performs protein identification for a single mzXML file of an Erwinia sample.15 It first uses X!Tandem and Comet for database search. The respective search results are validated by PeptideProphet; beforehand, X!Tandem’s search result must be converted by TandemConverter into a format accepted by PeptideProphet. The validated results are then integrated by iProphet and further validated by Mayu. Finally, Report Generator is used to display the identification results. The flowchart of the analysis is illustrated in Figure S2A. The entire pipeline consists of 8 operations in 6 steps: 2 operations of database search, 1 operation of file conversion, 2 operations of PeptideProphet, 1 operation of iProphet, 1 operation of Mayu, and 1 operation of Report Generator. Each operation is represented in the pipeline editor of WinProphet by a row with input file(s), output file(s), and function type. The parameters of a function, say, Comet, can be set using a parameter panel. A screenshot of the configured pipeline is shown in Figure S2B in the Supporting Information.

The second pipeline performs identification and iTRAQ-8 quantitation of 5 mzML files from human A549 cell lysate.16 The flowchart of the analysis is shown in Figure 2. The pipeline consists of 27 operations in 5 steps, of which the first 4 steps are the same as in the first pipeline (database search, file conversion, PeptideProphet, and iProphet), followed by an additional step that runs ProteinProphet with the Libra function for isobaric-labeling quantitation. The Libra function is activated by importing a condition.xml file (the default file type in TPP) through the parameter panel of ProteinProphet. A screenshot of the configured pipeline is shown in Figure 3.
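The operation counts reported for the first two pipelines can be reproduced with a few lines of arithmetic. The sketch below assumes that database search, file conversion, and PeptideProphet each contribute one operation per input file and per search engine, and that the remaining steps contribute one operation each. The text states this breakdown only for the first pipeline; applying it to the second pipeline (10 searches, 5 conversions, 10 PeptideProphet runs, then iProphet and ProteinProphet) is an inference, and the function and argument names are hypothetical.

```python
def count_operations(n_files, n_engines, n_engines_needing_conversion, n_single_ops):
    """Count pipeline operations under a one-operation-per-file assumption."""
    search = n_files * n_engines                         # one search per file and engine
    conversion = n_files * n_engines_needing_conversion  # e.g. X!Tandem output only
    peptide_prophet = n_files * n_engines                # one validation per search result
    return search + conversion + peptide_prophet + n_single_ops


# Pipeline 1: 1 file, Comet + X!Tandem, then iProphet, Mayu, Report Generator.
pipeline_1 = count_operations(1, 2, 1, 3)
# Pipeline 2: 5 files, Comet + X!Tandem, then iProphet and ProteinProphet.
pipeline_2 = count_operations(5, 2, 1, 2)
print(pipeline_1, pipeline_2)  # prints "8 27"
```

Both totals agree with the figures given in the text (8 and 27 operations).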

Figure 2. The flowchart of pipeline 2, which performs identification and iTRAQ-8 quantitation on a data set of human A549 cell lysate.

The third pipeline performs identification and label-free quantitation of 4 mzXML files from MCF-7 breast cancer cell lines.17 The analysis uses Comet, X!Tandem, and MSGF+ for database search; in addition to TandemConverter, mzIdentConverter is needed to convert MSGF+’s search results into a format accepted by PeptideProphet. The respective search result files are processed by PeptideProphet and then integrated by iProphet, followed by validation using ProteinProphet and label-free quantitation by StPeter. The pipeline consists of 35 operations in 6 steps: database search (12 operations), file conversion (8 operations), PeptideProphet (12 operations), iProphet, ProteinProphet, and StPeter. The flowchart of the analysis and a screenshot of the configured pipeline are shown in Figure S3 in the Supporting Information.

The fourth pipeline performs DIA data analysis of 3 mzXML files of human HeLa cell lysate.18 The files are first processed by DIA-Umpire v2 (ref 18), and the resulting 9 MGF (Mascot Generic Format) files of in silico MS2 spectra are converted to mzXML files by msConvert for database search. User Command is used to execute both programs. The remaining steps are similar to those in the first pipeline. The pipeline consists of 48 operations in 8 steps: User Command to run DIA-Umpire (3 operations expressed together by one User Command), User Command to run msConvert (9 operations expressed together by one User Command), database search (18 operations), file conversion (9 operations), PeptideProphet (6 operations), iProphet, Mayu, and Report Generator. The flowchart of the analysis and a screenshot of the configured pipeline are shown in Figure S4 in the Supporting Information.

The fifth pipeline demonstrates an application in spectral library searching.
It uses the Comet and X!Tandem search results and one mzXML data file from the first pipeline example to perform spectral library construction and search using SpectraST. It consists of four steps: spectral library construction, spectral library search, PeptideProphet, and iProphet. Because SpectraST is not a built-in function of WinProphet but can be conveniently executed from the command line, the first two steps use the “User Command” function to execute SpectraST. All of the commands for spectral library construction, such as spectra import, consensus spectra creation, and quality filtering, are specified through the parameter panel of User Command and thus can be integrated as a single operation in the pipeline editor. Spectral library search is set up as a separate User Command operation so that it is conceptually distinguished from the construction stage.

The numbers of identified proteins, peptides, and PSMs generated by the five pipelines are listed in Table S1 in the Supporting Information; they have been checked against those obtained by manual operation through the TPP web interface (Petunia) to ensure consistency.

Figure 3. Screenshot of the second pipeline.
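The idea of grouping several console commands into one User Command operation, with output and error messages kept apart, can be sketched in Python as follows. This is not WinProphet's implementation: the function name is hypothetical, the commands are placeholders that call the Python interpreter rather than SpectraST, and stopping at the first failing command is an assumption about sensible behavior, not something the text specifies.

```python
import subprocess
import sys


def run_user_commands(commands):
    """Run commands in order; stop at the first non-zero exit code.

    Returns a list of (returncode, stdout, stderr) tuples, one per
    command that was actually executed.
    """
    results = []
    for argv in commands:
        done = subprocess.run(argv, capture_output=True, text=True)
        results.append((done.returncode, done.stdout, done.stderr))
        if done.returncode != 0:
            break  # assumed policy: later steps depend on this one
    return results


# Placeholder commands: the Python interpreter stands in for external tools.
steps = [
    [sys.executable, "-c", "print('import spectra')"],
    [sys.executable, "-c", "import sys; sys.stderr.write('warn'); sys.exit(1)"],
    [sys.executable, "-c", "print('never reached')"],
]
results = run_user_commands(steps)
```

Keeping standard output and standard error as separate fields mirrors the two status windows described for WinProphet, and the early exit keeps a failed step from feeding incomplete files to the steps that follow.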

CONCLUSION

WinProphet, a lightweight and installation-free tool, fills a niche not covered by TPP itself through its capability of pipeline creation, automated execution, monitoring, and reuse. We have constructed five sample pipelines for varying purposes, including protein and peptide identification, iTRAQ-8 quantitation, label-free quantitation, DIA data analysis, and spectral library construction and search, to showcase its capability. Users can build customized pipelines from scratch or use these samples as templates, with minimal revision to accommodate their input files and parameters. In summary, WinProphet facilitates automated proteomics data analysis.

ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website. XML schema of the pipeline; screenshot of Report Generator showing identification results; flowchart and screenshot of the first pipeline; flowchart and screenshot of the third pipeline; flowchart and screenshot of the fourth pipeline; flowchart and screenshot of the fifth pipeline; table of the numbers of identified proteins, peptides, and PSMs for the five pipelines. (PDF)

AUTHOR INFORMATION

Corresponding Authors
* Email: [email protected]; [email protected]. Phone: +886 2 2788 3799 ext. {1711, 2352}. Fax: +886 2 2651 8660

Author Contributions
‡These authors contributed equally. All authors have given approval to the final version of the manuscript.

Notes
The authors declare no competing financial interest.

ACKNOWLEDGMENT

The authors would like to thank Luis Mendoza of the Institute for Systems Biology for technical assistance, and the Taiwan Society for Mass Spectrometry for organizing a hands-on short course on WinProphet. This work was supported by the Next-generation Pathway of Taiwan Cancer Precision Medicine Program (AS-KPQ-107-TCPMP) in Academia Sinica, and the Taiwan Protein Project (AS-KPQ-105-TPP).

REFERENCES

(1) Deutsch, E. W.; Mendoza, L.; Shteynberg, D.; Slagel, J.; Sun, Z.; Moritz, R. L. Trans-Proteomic Pipeline, a Standardized Data Processing Pipeline for Large-Scale Reproducible Proteomics Informatics. Proteomics Clin. Appl. 2015, 9 (7–8), 745–754. https://doi.org/10.1002/prca.201400164.
(2) Trudgian, D. C.; Mirzaei, H. Cloud CPFP: A Shotgun Proteomics Data Analysis Pipeline Using Cloud and High Performance Computing. J. Proteome Res. 2012, 11 (12), 6282–6290. https://doi.org/10.1021/pr300694b.
(3) Muth, T.; Peters, J.; Blackburn, J.; Rapp, E.; Martens, L. ProteoCloud: A Full-Featured Open Source Proteomics Cloud Computing Pipeline. J. Proteomics 2013, 88, 104–108. https://doi.org/10.1016/j.jprot.2012.12.026.
(4) Malm, E. K.; Srivastava, V.; Sundqvist, G.; Bulone, V. APP: An Automated Proteomics Pipeline for the Analysis of Mass Spectrometry Data Based on Multiple Open Access Tools. BMC Bioinformatics 2014, 15, 441. https://doi.org/10.1186/s12859-014-0441-8.
(5) Eng, J. K.; Jahan, T. A.; Hoopmann, M. R. Comet: An Open-Source MS/MS Sequence Database Search Tool. Proteomics 2013, 13 (1), 22–24. https://doi.org/10.1002/pmic.201200439.
(6) Craig, R.; Beavis, R. C. TANDEM: Matching Proteins with Tandem Mass Spectra. Bioinformatics 2004, 20 (9), 1466–1467. https://doi.org/10.1093/bioinformatics/bth092.
(7) Kim, S.; Pevzner, P. A. MS-GF+ Makes Progress towards a Universal Database Search Tool for Proteomics. Nat. Commun. 2014, 5, 5277. https://doi.org/10.1038/ncomms6277.
(8) Lam, H.; Deutsch, E. W.; Eddes, J. S.; Eng, J. K.; Stein, S. E.; Aebersold, R. Building Consensus Spectral Libraries for Peptide Identification in Proteomics. Nat. Methods 2008, 5 (10), 873–875. https://doi.org/10.1038/nmeth.1254.
(9) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical Statistical Model to Estimate the Accuracy of Peptide Identifications Made by MS/MS and Database Search. Anal. Chem. 2002, 74 (20), 5383–5392.
(10) Shteynberg, D.; Deutsch, E. W.; Lam, H.; Eng, J. K.; Sun, Z.; Tasman, N.; Mendoza, L.; Moritz, R. L.; Aebersold, R.; Nesvizhskii, A. I. iProphet: Multi-Level Integrative Analysis of Shotgun Proteomic Data Improves Peptide and Protein Identification Rates and Error Estimates. Mol. Cell. Proteomics 2011, 10 (12), M111.007690. https://doi.org/10.1074/mcp.M111.007690.
(11) Nesvizhskii, A. I.; Keller, A.; Kolker, E.; Aebersold, R. A Statistical Model for Identifying Proteins by Tandem Mass Spectrometry. Anal. Chem. 2003, 75 (17), 4646–4658.
(12) Reiter, L.; Claassen, M.; Schrimpf, S. P.; Jovanovic, M.; Schmidt, A.; Buhmann, J. M.; Hengartner, M. O.; Aebersold, R. Protein Identification False Discovery Rates for Very Large Proteomics Data Sets Generated by Tandem Mass Spectrometry. Mol. Cell. Proteomics 2009, 8 (11), 2405–2417. https://doi.org/10.1074/mcp.M900317-MCP200.
(13) Deutsch, E. W.; Mendoza, L.; Shteynberg, D.; Slagel, J.; Sun, Z.; Moritz, R. L. Trans-Proteomic Pipeline, a Standardized Data Processing Pipeline for Large-Scale Reproducible Proteomics Informatics. Proteomics Clin. Appl. 2015, 9 (7–8), 745–754. https://doi.org/10.1002/prca.201400164.
(14) Hoopmann, M. R.; Winget, J. M.; Mendoza, L.; Moritz, R. L. StPeter: Seamless Label-Free Quantification with the Trans-Proteomic Pipeline. J. Proteome Res. 2018, 17 (3), 1314–1320. https://doi.org/10.1021/acs.jproteome.7b00786.
(15) Gatto, L.; Christoforou, A. Using R and Bioconductor for Proteomics Data Analysis. Biochim. Biophys. Acta, Proteins Proteomics 2014, 1844 (1, Part A), 42–51. https://doi.org/10.1016/j.bbapap.2013.04.032.
(16) Hultin-Rosenberg, L.; Forshed, J.; Branca, R. M. M.; Lehtiö, J.; Johansson, H. J. Defining, Comparing, and Improving iTRAQ Quantification in Mass Spectrometry Proteomics Data. Mol. Cell. Proteomics 2013, 12 (7), 2021–2031. https://doi.org/10.1074/mcp.M112.021592.
(17) Lawrence, R. T.; Searle, B. C.; Llovet, A.; Villén, J. Plug-and-Play Analysis of the Human Phosphoproteome by Targeted High-Resolution Mass Spectrometry. Nat. Methods 2016, 13 (5), 431–434. https://doi.org/10.1038/nmeth.3811.
(18) Tsou, C.-C.; Tsai, C.-F.; Teo, G. C.; Chen, Y.-J.; Nesvizhskii, A. I. Untargeted, Spectral Library-Free Analysis of Data-Independent Acquisition Proteomics Data Generated Using Orbitrap Mass Spectrometers. Proteomics 2016, 16 (15–16), 2257–2271. https://doi.org/10.1002/pmic.201500526.
