
PACOM: a versatile tool to integrate, filter, visualize and compare multiple large MS proteomics datasets. Salvador Martínez-Bartolomé, J. Alberto Medina-Aunón, Miguel Ángel López-García, Carmen González-Tejedo, Gorka Prieto, Rosana Navajas, Emilio Salazar-Donate, Carolina Fernández-Costa, John R. Yates, and Juan Pablo Albar. J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.7b00858 • Publication Date (Web): 20 Mar 2018. Downloaded from http://pubs.acs.org on March 20, 2018



Journal of Proteome Research

PACOM: a versatile tool to integrate, filter, visualize and compare multiple large MS proteomics datasets.

Salvador Martínez-Bartolomé1,2, J. Alberto Medina-Aunon1, Miguel Ángel López-García1, Carmen González-Tejedo1, Gorka Prieto3, Rosana Navajas1, Emilio Salazar-Donate1, Carolina Fernández-Costa2,4, John R. Yates III2,*, and Juan Pablo Albar (deceased)1.

1 Proteomics Laboratory, National Center for Biotechnology, CSIC, Madrid, 28049, Spain.

2 Department of Chemical Physiology, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, CA 92037, USA

3 Department of Communications Engineering, University of the Basque Country (UPV/EHU), Bilbao, 48013, Spain.

4 Immunology, Centro de Investigaciones Biomédicas (CINBIO), Centro singular de Investigación de Galicia: Instituto de Investigación Sanitaria Galicia Sur (IIS-GS). University of Vigo, Campus Universitario, s/n, 36310, Vigo, Spain

* To whom correspondence should be addressed:

John R. Yates III

Email: [email protected]

Phone: 858-784-8863

FAX: 858-784-8883

KEYWORDS:

Proteomics data comparison, proteomics data integration, data filtering, data visualization, Java API, GUI, chart generation, MIAPE, HUPO-PSI.

ACS Paragon Plus Environment


ABSTRACT

Mass spectrometry-based proteomics has evolved into a high-throughput technology where numerous large-scale datasets are generated from diverse analytical platforms. Furthermore, several scientific journals and funding agencies have emphasized the storage of proteomics data in public repositories to facilitate its evaluation, inspection and reanalysis 1. As a consequence, public proteomics data repositories are growing rapidly. However, tools are needed to integrate multiple proteomics datasets in order to compare different experimental features or to perform quality control analysis. Here, we present a new Java stand-alone tool, PACOM (Proteomics Assay COMparator), which is able to import, combine and simultaneously compare numerous proteomics experiments in order to check the integrity of the proteomic data as well as to verify data quality. With PACOM, the user can detect sources of error that may have been introduced in any step of a proteomics workflow and which influence the final results. Datasets can be easily compared and integrated, and data quality and reproducibility can be visually assessed through a rich set of graphical representations of proteomics data features as well as a wide variety of data filters. Its flexibility and easy-to-use interface make PACOM a unique tool for daily use in a proteomics laboratory. PACOM is available at: https://github.com/smdb21/pacom.


INTRODUCTION

A proteomics experiment is exposed to multiple sources of error during both the experimental process and the subsequent data analysis. Systematic errors can lower the number of high-confidence proteins and peptides or reduce the reproducibility between technical replicates. Therefore, it is essential to routinely check the quality of the massive datasets generated in proteomics laboratories in order to diagnose potential instrument- or procedure-related problems. It is important to use different quantitative indicators to help identify potential sources of error and ensure that high data quality and reproducibility are achieved and maintained 2-5. In order to do this, individual metrics of a dataset such as coefficients of variation, score distributions, systematic mass error shifts, absolute number of peptides identified, protein coverage, number of peptides per protein, number of miscleavages, etc., need to be extracted, sometimes combined, and compared. However, modern mass spectrometers generate large amounts of data that make compilation and visualization a challenging task.

A typical bottom-up proteomics data analysis workflow involves several data integration steps between matching a spectrum to peptide sequences in a database and generating a final protein list. When results from multiple analyses or MS runs are combined (e.g. pre-fractionation steps, multiple technical or biological replicates, samples generated under two or more different experimental conditions, or peptide and protein lists that result from the use of different search engines or data analysis tools), another step of data integration is needed. Following integration, the presentation, comparison and visualization of proteomics data from multiple experiments can be extremely complicated. One of the challenges is not only to interpret the data, but also to visualize it in a simple way that facilitates communicating it. Scientists unfamiliar with bioinformatics risk missing important details during integration and visualization of large amounts of data, since most of the tools available for proteomics data visualization are designed for the inspection of a single dataset. Thus, a versatile software tool is needed to enable scientists to evaluate the quality of their data.
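To make two of the metrics named above concrete, the following Python sketch counts miscleavages in a peptide sequence and computes a coefficient of variation across replicates. It is only an illustration and not code from PACOM: the function names are hypothetical, and the cleavage rule (trypsin cutting after K or R unless the next residue is P) is an assumption about the enzyme used.

```python
import re
from statistics import mean, stdev


def count_miscleavages(peptide: str) -> int:
    """Count internal tryptic sites left uncut in a peptide.

    Assumed rule: trypsin cleaves after K or R unless the next residue is P.
    The lookahead requires a following residue, so a C-terminal K/R
    (the site that produced the peptide) is never counted.
    """
    return len(re.findall(r"[KR](?=[^P])", peptide))


def coefficient_of_variation(values) -> float:
    """Sample standard deviation divided by the mean, as a percentage."""
    return 100.0 * stdev(values) / mean(values)


if __name__ == "__main__":
    for seq in ("PEPTIDEK", "PEPKTIDER", "PEPKPTIDER"):
        print(seq, count_miscleavages(seq))
    # Hypothetical numbers of identified peptides in three replicates.
    print(coefficient_of_variation([100, 110, 90]))
```

Computed per dataset, values like these are what end up being compared across experiments when looking for digestion or reproducibility problems.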


Many quality control and data visualization tools have been developed in the field of proteomics. Some of them have been presented as quality control tools focused on the validation of search engine results to obtain a higher accuracy and minimize false positive identifications 6-7, but they do not investigate the sources of error. Others generate a comprehensive set of QC metrics which can be used to systematically check the quality of the data and sources of error, but they lack an interactive and visual interface to explore these values over the data 8-10. The scope of other tools is focused on the quality control and visualization of the proteomics raw data 11-16. Existing software packages that contain visualization modules provide a way to explore some features of proteomics datasets; however, they either lack the required flexibility to integrate and compare multiple datasets at different levels 17-20, or they are not suitable for non-expert users who are unfamiliar with bioinformatics workflows or programming 21-22.

Here we present PACOM, the Proteomics Assay COMparator, which is designed to visualize and compare multiple proteomics datasets simultaneously, allowing interactive exploration of different proteomics data features and a check on data quality and reproducibility across different experiments. This tool is equipped with a rich set of proteomics-specific data filters that help to focus on specific subsets of the data (such as modified peptides or items that pass a score threshold), and it provides an exhaustive set of graphical representations for multiple proteomics data features (Table 1). PACOM inspects the data from different perspectives, so the user can quickly compare multiple datasets before and after combining them in groups by simply selecting different levels of integration in the comparison. PACOM applies the PAnalyzer23 algorithm to automatically cluster similar proteins into groups based on shared peptide evidence, which facilitates the analysis of individual proteoforms. PACOM is useful to any researcher who wants to perform quality control checks or compare several features between different proteomics datasets in a fast and easy manner.
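The idea of clustering proteins through shared peptide evidence can be illustrated with a short Python sketch. This is deliberately not the PAnalyzer algorithm, whose classification rules are not described in this text; it is a simplified stand-in that places two proteins in the same group whenever they are linked by at least one shared peptide (connected components computed with union-find). The function name and the input layout are assumptions made for the example.

```python
from collections import defaultdict


def group_proteins(protein_to_peptides):
    """Cluster proteins that are connected through shared peptides.

    protein_to_peptides: dict mapping a protein accession to a set of
    peptide sequences. Returns a sorted list of sorted accession lists.
    """
    parent = {p: p for p in protein_to_peptides}

    def find(p):
        # Follow parent links to the group representative (path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Link each protein to the first protein seen with the same peptide.
    first_owner = {}
    for protein, peptides in protein_to_peptides.items():
        for peptide in peptides:
            if peptide in first_owner:
                union(first_owner[peptide], protein)
            else:
                first_owner[peptide] = protein

    groups = defaultdict(set)
    for protein in protein_to_peptides:
        groups[find(protein)].add(protein)
    return sorted(sorted(members) for members in groups.values())


if __name__ == "__main__":
    example = {
        "P1": {"AAK", "BBK"},
        "P2": {"BBK", "CCK"},
        "P3": {"DDK"},
    }
    print(group_proteins(example))
```

Counting groups rather than raw accessions is what prevents proteins with shared peptides from being counted more than once when datasets are merged.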


PACOM is free, open-source Java software that supports the most common proteomics data formats, including the HUPO-PSI (Human Proteome Organization – Proteomics Standards Initiative) 24 standard for protein identifications mzIdentML25-28, the output of software such as DTASelect29-30, pepXML31, X!Tandem32, the PRIDE33 XML (PRoteomics IDEntifications eXtensible Markup Language) data file, as well as simple text-based files in flat-file format. In addition, some of the functionalities of the software are provided by a new open-source Java API (Application Programming Interface), the Java MIAPE API (https://github.com/smdb21/java-miape-api). The Java MIAPE API is a new bioinformatics resource for developers to extract the most relevant data from input data files as defined in MIAPE, the Minimum Information About a Proteomics Experiment guidelines34-35 (for a more detailed description of the API, see Supporting Information). The utility of PACOM is demonstrated for the compilation, integration and comparison of the data generated in the ProteoRed36 Multi-centric Experiment 6, a multi-laboratory assay which includes 57 datasets generated on 17 different proteomics platforms (http://www.legacy.proteored.org/PME6). We also discuss six additional examples in which PACOM was used to assess the quality and reproducibility of proteomics datasets. Here, PACOM was able to detect problems that occurred during chromatographic separation of peptides, enzymatic digestion of proteins into peptides, as well as mass spectrometric data acquisition.

MATERIALS AND METHODS

Informatics

PACOM is written in Java 1.8 as standalone software with a rich GUI (Graphical User Interface). The program is released as open-source software under the permissive Apache 2.0 license and can be downloaded from the GitHub code repository at http://github.com/smdb21/pacom. The tool’s package contains the Java program and several


example datasets. PACOM only requires a Java 1.8 runtime environment (virtual machine), which can run on any hardware and platform. The disk space required is negligible and the amount of memory needed depends on the amount of data loaded. However, a modern computer with 8-16 GB of RAM should be sufficient for most tasks. The following external libraries were used to parse each of the input data types: MSFTBX 37 for reading pepXML files, jmzIdentML API 38 for reading mzIdentML files, jmzML API 39 for reading mzML files, and XTandem Parser 40 for reading X!Tandem output files. Other external libraries used for miscellaneous purposes are: compomics-utilities 41, dbToolkit 42, ms-data-core-api 43 and PSI semantic validator 44.

The ProteoRed Multi-centric Experiment 6

A full description of the experiment, the participants, as well as the results submitted by each participant can be found at http://legacy.proteored.org/PME6. The study was designed to analyze a common sample, the ProteoRed Plasma Subset Reference (PPSR), in a multi-laboratory environment and then to compare the results obtained by different platforms and approaches. The PPSR sample consisted of a human plasma sample pool obtained from a first immunoaffinity depletion using a SEPPRO IgY4 column (Sigma) to capture the 14 most abundant proteins. Then, nine different fractions coming from these depletions were combined and immuno-depleted again by a SEPPRO IgYHSA (Sigma) for removing the Human Serum Albumin protein (HSA). Finally, 4 proteins were spiked in to a total of 2.967 µg of sample, in different concentrations: 30 µg of YWHAG (P61981|1433_HUMAN: 14-3-3 protein gamma, human recombinant), 3 µg of ALDOA (P00883|ALDOA_RABIT: Fructose-bisphosphate aldolase A, Rabbit, Sigma), 0.3 µg of CASB (P02666|CASB_BOVIN: beta-casein, Bovine, Sigma) and 0.03 µg of PYGM (P00489|PYGM_RABIT: Glycogen phosphorylase, Rabbit, Sigma). Each participant re-suspended the sample in 50 mM Tris HCl pH 7.4 and stored it at -20 ºC. Each participant performed their own chromatographic


gradient, although a 90-120 minute gradient was recommended to obtain more comparable data. All participants reported their experiments using the online MIAPE Generator tool 45, creating 44 MIAPE-compliant reports publicly available at the MIAPE project with the identifier 470. Afterwards, data from 17 out of 20 participants were used to test the PACOM tool. 49 raw data files (3 replicates per participant, except one participant who performed only one replicate) from different MS platforms (AB Sciex t2d and wiff files, Thermo Scientific raw files, Bruker Daltonics and Agilent yep files, Waters raw files) were converted to mzML standard files using the CompassXport converter in the case of Bruker Daltonics data, and the msConvert converter from the ProteoWizard project 46 for all the others, with the exception of 2 datasets that were converted to MGF files due to technical problems during the analysis of the resulting mzML files with Mascot. All resulting files were then directly submitted to a Mascot search engine v2.3 using a custom target-decoy database constructed from a Human UniProtKB/Swissprot database containing a total of 20,295 protein entries and their respective decoy entries, to which the four spiked protein sequences were added (https://github.com/smdb21/PACOM/raw/master/PACom/pme6_database/PME6_decoy.fasta.gz), and using the parameters reported by the participants (Additional Table S1). 49 mzIdentML files were then exported, one from each Mascot result. Finally, these files were imported by PACOM as individual datasets and were inspected by grouping each triplet of replicates into a single node. Once the data was loaded into the tool, a False Discovery Rate (FDR) threshold of 1% at peptide level was applied before performing any comparison.

RESULTS

Once PACOM has been downloaded, it can be opened without any additional installation. After opening, the user can explore all possibilities of PACOM with preloaded example datasets.

PACOM accepts as input different file types such as PRIDE XML47, or the output files from X!Tandem32, DTASelect30 or the Trans-Proteomics Pipeline 31 (pepXML). Datasets


may be imported into the system either individually or in batch (Figure 1A). During import all input data files are parsed to extract the information defined by the MIAPE MSI (Minimum Information About a Proteomics Experiment - Mass Spectrometry Informatics) 48 guidelines.

In addition, PACOM reads data in flat file format, which is especially useful for the incorporation of published datasets typically provided as Excel spreadsheets. PACOM also supports the mzIdentML25-26 format (including the recently released mzIdentML 1.2 specification27), the proteomics community standard for the representation of protein and peptide lists. The use of the mzIdentML format enormously facilitates the integration of datasets that have been generated by different instrument and software platforms. All data is stored locally on the computer that is used to run PACOM, which avoids long wait times for uploading the data to a remote server and maintains the privacy of the data. Once datasets are imported, they can be selected for further analysis.

Figure 1: Complete pipeline provided by PACOM: In the first step, search engine results or any table containing protein-peptide pairs can be used as input files for the extraction of the MIAPE information and the creation of the dataset files (A). The second step consists of the definition of a comparison project which defines how the data will be compiled and integrated following a hierarchical structure (B). After that, peptide and protein identification lists can be inspected by generating diverse charts (45) at different levels of integration (C). The tool also allows the application of different filters over the multiple datasets in the project (D).


The second step defines a comparison between datasets (Figure 1B). The user selects the datasets of interest and organizes these into a hierarchical tree of two levels (Figure 2). The structure of the tree branches determines how the different protein and peptide lists are integrated and visualized. Thus, it is simple to either visualize the complete dataset by adding multiple individual datasets into one single node or to inspect and compare each dataset individually. The lower nodes of the comparison tree show individual dataset features, whereas the parent node of the tree displays all integrated features of the complete dataset. Figure 3 displays a representative example with three different datasets comprising three technical replicates each. Panel A in Figure 3 shows a chart generated by PACOM with the number of proteins and peptides detected overall (black arrow) as well as in each individual replicate. The chart in panel B presents the three datasets following integration of technical replicates along with the grand total (black arrow). PACOM automatically applies the PAnalyzer algorithm to avoid any redundancies or overcounting of proteins or peptides at each level of data integration. This protein group creation and classification is especially useful to distinguish isoforms, protein family members or proteoforms, or to visualize any ambiguities due to the identification of peptides that are shared by multiple proteins. Thus, the number of proteins shown in Figure 3B corresponds to the number of different protein groups formed after compiling all peptides from the individual datasets of the children nodes in the comparison.
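The effect of the comparison tree on counting can be sketched in a few lines of Python. The example below is an illustration under stated assumptions rather than PACOM's implementation: a leaf dataset is modelled as a set of peptide sequences, an inner node as a dictionary of children, and integration as a set union, so a peptide seen in several replicates is counted once at the parent level. The names `integrate` and `summarize` are hypothetical. Protein counts would additionally require regrouping proteins at each level, as described above, and are left out.

```python
def integrate(node):
    """Return the distinct peptides found anywhere under a tree node.

    A node is either a set of peptide sequences (a leaf dataset) or a
    dict mapping child names to child nodes.
    """
    if isinstance(node, (set, frozenset)):
        return set(node)
    merged = set()
    for child in node.values():
        merged |= integrate(child)  # the union removes redundancies
    return merged


def summarize(tree):
    """Count distinct peptides for each direct child and for the whole tree."""
    counts = {name: len(integrate(child)) for name, child in tree.items()}
    counts["TOTAL"] = len(integrate(tree))
    return counts


if __name__ == "__main__":
    project = {
        "experiment_1": {
            "replicate_1": {"AAK", "BBK"},
            "replicate_2": {"BBK", "CCK"},
        },
        "experiment_2": {
            "replicate_1": {"AAK", "DDK"},
        },
    }
    print(summarize(project))
```

Rearranging the dictionary (for example, moving a replicate to another experiment or dropping it) changes the counts at the upper levels without touching the leaf data, which mirrors how redefining the comparison tree changes what is integrated.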


Figure 2: Example screenshot of the definition of the comparison project: The tool presents the imported datasets in the left folder tree panel. Then, the user can include and organize them in a hierarchical tree, which will define how the data will be integrated and compared in the next step of the workflow. In the zoomed panel, several experiments containing 3 replicates each (except one of them) are organized in different nodes for comparison.

Figure 3: How data is integrated and how redundancies are automatically removed: A project containing 3 datasets, each of them with 3 technical replicates, is shown. In A, the tool shows the number of peptides (red), the number of peptides differentiating by charge state (z) (blue) and the number of proteins (green) for the 9 individual replicates and these numbers after the integration of all of them (marked with black arrows). In B, we can see these numbers for the 3 datasets (after the integration of their respective replicates) and for the total (black arrows).

The ease of this step allows the user to rapidly redefine and create different configurations with the same set of data in order to make different comparisons such as integrating them differently, or removing a particular dataset to see how it affects higher levels of data integration. Once the comparison is defined, the tool offers a wide variety of graphical representations for inspection, comparison and quality control assessment of the datasets (Figure 1C). A chart canvas dynamically adapts every time a new chart type is selected or a new filter is applied (Figure 4). A complete list of the 45 chart types together with their descriptions is available in Table 1, and an exhaustive description of the application of most of the charts can be found in the PME6 example (Supporting Information).


Figure 4: Inspecting data in PACOM: The tool provides different options such as a menu (at the top) for changing the chart types, applying or removing filters, and some general options. At the top left, some global statistics of the datasets and some export options are provided. Chart customization controls for changing the comparison level or for customizing the current chart are located on the left. In the center, the chart area is automatically generated, and at the bottom the user can see the status and memory logs. At the right, a help panel shows the diverse chart types in that analysis category with a description for each of them. In this screenshot, the chart shows the number of peptides and proteins for each of the 17 experiments performed on the same sample using different analytical platforms.
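The 1% peptide-level FDR threshold applied to the PME6 data in Materials and Methods is one of the filters that can be applied from this view. A minimal Python sketch of a target-decoy FDR cut-off follows. It rests on assumptions that the text does not specify: higher scores are taken to be better, the FDR of a ranked list is estimated as decoys divided by targets, and the function name and input layout are hypothetical. The exact formula used by PACOM may differ.

```python
def filter_by_fdr(identifications, max_fdr=0.01):
    """Keep target identifications above the lowest score cut-off
    at which the estimated FDR is still <= max_fdr.

    identifications: iterable of (score, is_decoy) pairs, higher score
    meaning a better match (assumption). For a peptide-level filter the
    input would hold one entry (best PSM) per peptide.
    """
    ranked = sorted(identifications, key=lambda item: item[0], reverse=True)
    targets = decoys = 0
    cutoff = 0
    for position, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR of the list truncated at this position.
        if targets and decoys / targets <= max_fdr:
            cutoff = position
    return [score for score, is_decoy in ranked[:cutoff] if not is_decoy]


if __name__ == "__main__":
    scored = [(90, False), (80, False), (70, True), (60, False)]
    print(filter_by_fdr(scored, max_fdr=0.01))
    print(filter_by_fdr(scored, max_fdr=0.40))
```

Scanning the whole ranked list and keeping the last position that satisfies the threshold, rather than stopping at the first violation, accepts the largest list that still meets the requested FDR.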

Additionally, PACOM offers a complete set of diverse filters (Figure 1D) widely used for data curation in proteomics. These filters, which can be applied individually or in combination, comprise:

a) Custom score thresholds: specific threshold values applied to any score that is associated with either proteins or peptides.

b) False discovery rate (FDR) calculations: an FDR threshold calculated over a list of peptides or proteins sorted by a selected score.

c) Minimum number of peptides per protein threshold: a minimal number of peptides or PSMs per protein required to keep a protein in a dataset.

d) Minimum number of replicate detection: a minimum number of technical or biological replicates in which proteins (or peptides) must be detected to keep them in a dataset.

e) Selection of peptides containing certain PTMs: specific peptide modifications and their occurrence on each peptide sequence required to keep peptides in a dataset.

f) User-defined lists of peptides or proteins threshold: pre-specified list of peptides or proteins in which peptides or proteins must be present to keep them in a dataset.

g) Minimal sequence length filter: minimal sequence length required to keep a peptide in a dataset.


h) Selection of peptides with specific features that are usually required for target selection in MRM assays: different features such as the absence of any miscleavages, the absence of methionine (Met), tryptophan (Trp), or glutamine (Gln) at the first position of the peptide sequence, or the requirement for peptides to be unique for a single proteoform and not shared by multiple proteins.

All filters can be applied at any time, and the charts will be automatically updated. After applying a filter, the individual datasets can be saved as filtered, and then be loaded as new datasets. This option allows the user to save a set of datasets with specific filter settings, thus avoiding the need to apply the settings again every time the project is loaded. It also simplifies the comparison of datasets under different filter settings.

PACOM also displays data in a table, with the option to show the data at different levels of integration, and with the option to remove redundancies at the peptide level (best PSM per peptide) or at the protein level, which allows inspection of a single protein across all datasets. The data table can in turn be filtered and sorted using any value in any column, and can be exported to a tab-separated value (TSV) text file. The datasets can also be exported in PRIDE XML 47 file format. With this option, a different PRIDE XML file will be created for each direct child node of the parent node of the comparison tree, so that all the datasets hanging from that node (grandchildren nodes) will be integrated in a single PRIDE XML file. These PRIDE XML files contain all associated spectra, provided that a peak list in MGF format was imported along with each peptide or protein file and that the identifications were generated by directly searching the respective peak list file. The output PRIDE XML file could subsequently be used to visualize the annotated spectra of the PSMs in the dataset with PRIDE Inspector 20 (PRIDE Database Team, EMBL-European Bioinformatics Institute).

The fast and agile visualization of specific features of datasets supports a quality check of experimental procedures. PACOM helps to detect features in a dataset that reflect potential errors in either experimental data or data analysis by comparing all datasets with the same


parameter settings. This allows the user to quickly pinpoint outliers in the dataset and, depending on the type of the feature, it may allow the cause of the problem to be identified. For example:

a) An altered distribution of miscleavages indicates a problem in the enzymatic digestion of the proteome (example 1 of the Supporting Information).

b) Differences in mass errors point to inconsistent calibrations of a mass spectrometer (example 2 of the Supporting Information).

c) Differences in the peptide retention time distributions or an outlier number of identifications in a certain fraction localize inconsistencies in chromatographic separation of peptides (examples 3 and 5 of the Supporting Information).

d) Very poor reproducibility among replicates hints at variabilities that arose during sample preparation (reproducibility checked in example 4 of the Supporting Information).

e) Significantly different PSM or peptide scores might reveal a difference in the search engine parameters.
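As an illustration of the reproducibility check in item d), the Python sketch below computes the pairwise overlap of the peptides identified in each replicate. The Jaccard index is used here as an assumed overlap measure; the text does not state which measure PACOM reports, and the function names are hypothetical.

```python
from itertools import combinations


def jaccard(first, second):
    """Shared identifications divided by all identifications in either set."""
    first, second = set(first), set(second)
    union = first | second
    return len(first & second) / len(union) if union else 1.0


def pairwise_overlap(replicates):
    """Jaccard index for every pair of replicates.

    replicates: dict mapping a replicate name to its set of peptides.
    """
    return {
        (a, b): jaccard(replicates[a], replicates[b])
        for a, b in combinations(sorted(replicates), 2)
    }


if __name__ == "__main__":
    runs = {
        "rep1": {"AAK", "BBK", "CCK"},
        "rep2": {"BBK", "CCK", "DDK"},
        "rep3": {"XXK"},
    }
    for pair, value in pairwise_overlap(runs).items():
        print(pair, round(value, 2))
```

A replicate whose overlap with all others is far below the rest, such as `rep3` in this toy input, is the kind of outlier that would prompt a closer look at sample preparation.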

The wide variety of features that can be visualized with PACOM makes it a great resource for data quality control across datasets. PACOM also simplifies and expedites data handling by automatically integrating large amounts of proteomics data from the most common proteomics data file formats, including the HUPO-PSI standards. Multiple large proteomics analyses can be easily imported, organized in a hierarchical tree and then visualized in many different ways. In Supporting Information, we discuss a comprehensive example in which PACOM was used to compare data in a multi-centric study: the ProteoRed 36 Multi-centric Experiment 6, PME6 (http://www.legacy.proteored.org/PME6), in which a total of 49 datasets from 17 different proteomics platforms analyzing the same sample were loaded and compared in the tool. Additionally, the Spanish Consortium of the Chromosome-Centric Human Proteome Project (Sp-HPP) 49, responsible for the characterization of proteins encoded by the human chromosome 16, employed PACOM at the first discovery phase of the project to integrate, filter and report the data from shotgun identification analyses performed on the different platforms available in the consortium. Thus, five different datasets were created and submitted to ProteomeXchange 50-51. Additionally, several other smaller published datasets were compiled and inspected using our tool 52-55.

Quality control quick examples:

Example 1: In this example, the total proteome of the human Jurkat cell line was analyzed using a 2D LC-MS/MS proteomic approach, consisting of sample fractionation by reverse phase chromatography at basic pH (RPb) followed by an LC-MS/MS analysis carried out on two different mass spectrometers: the 5600 TripleTOF from AB SCIEX (TTOF) and the Q-Exactive from Thermo Scientific (QE). We performed two biological replicates of the experiment. Notably, in both replicates the QE identified more proteins, peptides and PSMs (Figure 5A). This performance difference can be explained by the disparity in the peptide mass accuracy of each mass spectrometer, as revealed by the “peptide mass error” plot obtained with the tool (Figure 5B). We can observe that the peptide mass shift in the TTOF (red and green dots) was considerably higher than in the QE system (blue and yellow dots), affecting the subsequent number of identified peptides and proteins.


Figure 5. Differences in performance between a 5600 TripleTOF and a Q-Exactive. In A, the numbers of proteins (green), peptides (blue) and PSMs (red) are shown for each replicate of the two experiments. In B, the delta m/z (difference between the theoretical peptide mass and the experimental precursor ion mass) is plotted against the experimental m/z value of each PSM (red and green for TTOF, blue and yellow for QE).
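The quantity plotted in Figure 5B can be illustrated with a minimal Python sketch. This is not PACOM code: the function names, the input tuple layout and the sign convention (experimental minus theoretical) are assumptions made for illustration, and the error in parts per million is added only because it is a common way to compare instruments with different m/z ranges.

```python
PROTON_MASS = 1.007276  # Da; mass of the proton carried by each charge

def theoretical_mz(neutral_mass, charge):
    """m/z expected for a peptide of the given neutral mass and charge state."""
    return (neutral_mass + charge * PROTON_MASS) / charge

def mass_error_points(psms):
    """Return (experimental m/z, delta m/z) pairs, one per PSM.

    Each PSM is a hypothetical tuple: (experimental_mz, neutral_mass, charge).
    Assumed sign convention: delta = experimental - theoretical.
    """
    points = []
    for experimental_mz, neutral_mass, charge in psms:
        delta = experimental_mz - theoretical_mz(neutral_mass, charge)
        points.append((experimental_mz, delta))
    return points

def ppm_error(experimental_mz, neutral_mass, charge):
    """Relative mass error in parts per million."""
    expected = theoretical_mz(neutral_mass, charge)
    return (experimental_mz - expected) / expected * 1e6
```

Plotting the second element of each pair against the first, with one color per dataset, gives a scatter plot of the same kind as the “peptide mass error” chart; a wider vertical spread indicates lower mass accuracy.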

Example 2: In this experiment, two different chromatographic conditions were compared in an attempt to improve the separation of the peptides derived from a cell lysate. We ran two aliquots of the same digested lysate at two different column temperatures using the same column attached to an nLC pump coupled to an LTQ XL mass spectrometer from Thermo Scientific. Using PACOM, we were able to compare the retention time distributions for the two conditions (Figure 6A) as well as the retention times of the individual shared peptides (Figure 6B), showing that the column temperature affects the elution pattern of the peptides.


Figure 6. Example 2: Comparing retention times between two different chromatographic conditions. In A, the distribution of retention times (in minutes) is shown for the two conditions. In B, the retention time of each peptide detected in both experiments is shown as a scatter plot. The black line is the diagonal and the red line is the regression line (R² = 0.9404).
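The comparison in Figure 6B amounts to pairing the retention times of the peptides shared by two runs and fitting a straight line through them. The sketch below shows one way to do this with ordinary least squares in plain Python; the function names and the dictionary input (peptide sequence mapped to retention time in minutes) are illustrative assumptions, not part of PACOM.

```python
def shared_retention_times(rt_a, rt_b):
    """Pair the retention times of peptides present in both datasets.

    rt_a and rt_b are hypothetical dicts: peptide sequence -> retention time (min).
    """
    shared = sorted(set(rt_a) & set(rt_b))
    return [rt_a[p] for p in shared], [rt_b[p] for p in shared]

def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a simple linear fit, R^2 equals the squared Pearson correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared
```

Points lying on the diagonal would indicate identical elution in both conditions. A regression line that departs from the diagonal indicates a systematic shift in retention time, while an R² below 1 reflects peptide-specific scatter around that line.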

DISCUSSION

In both the experimental and the data analysis steps, procedures performed in proteomics experiments are exposed to many different error sources that are usually difficult to trace. Although several efforts have led to new approaches to the analysis of quality control metrics of proteomics data 5, 8-9, 21, none of these methods or tools is easily applicable to proteomics data without extensive expertise in programming or data analysis. Other tools, such as the PRIDE Inspector Toolsuite 20, allow users to load data from multiple assays and to visualize annotated spectra and chromatograms (features not available in PACOM), as well as some other charts (i.e. the distribution of delta m/z, and the distribution of missed tryptic cleavages, charges and precursor masses for individual experiments). However, PRIDE Inspector only visualizes each assay individually, and therefore it is not possible to perform a direct comparison between datasets. Similarly, IP2, a framework containing numerous different analysis tools 29-30, 56-57, provides a set of charts to visualize histograms and distributions of similar features, but always refers to individual datasets. PeptideShaker 19 is a freely available program that integrates multiple search engine results into a single final protein list and generates new confidence scores and statistics for some confidence measures that may help in quality control. However, the variety of features that can be visualized is limited, and it does not allow the comparison of multiple datasets. hEIDI (handy Exploration and Integration of Data and Identifications) 18 was designed to integrate multiple datasets by pooling them into different groups for comparison, but it is only able to compare the adjusted spectral count of the constructed protein groups. proteoQC 22, an R package for proteomics data quality control, is able to compare groups of experiments, generating a QC report containing intra-experiment metrics for each sample, as well as aggregated information to compare samples at the level of their fractions, technical replicates and biological replicates. However, it is based on the R programming language 58, which makes it inaccessible to users who are not proficient in that language. PTXQC 9 generates a set of QC metrics under four categories (sample preparation, LC, MS and general performance) and generates a set of plots for each of these categories. However, PTXQC is designed only as a quality control pipeline for MaxQuant 59.

PACOM provides a rich set of data charts unmatched by any other data analysis tool (Table 1), and it explores a much wider variety of proteomics features. For example, Scaffold (Proteome Software Inc. http://www.proteomesoftware.com), a commercial software package for proteomics data analysis that can be too expensive for academic use, allows the comparison of complex proteomics datasets using different filter criteria, and includes some statistics in its graphical displays of data. However, it does not offer the flexibility provided by PACOM, which can compare multiple datasets on the fly and group them into a hierarchical structure. To our knowledge, there is no other freely available tool for the visual inspection and comparison of multiple, large-scale proteomics datasets using multiple advanced and proteomics-specific data filters. Thus, our tool constitutes a unique free platform to simultaneously visualize, examine and compare the numerous features associated with protein and peptide lists obtained in multiple large-scale proteomics analyses.

Currently, the features that can be explored, compared and visualized through PACOM are not compatible with quantitative approaches. All supported input files are essentially files generated either by database search engines or by tools that re-analyze or filter the protein and peptide lists produced by those search engines. Quantitative tools generate a large variety of output data file formats, and the information they contain is even more heterogeneous than the information generated by search engines or identification validation tools. This is reflected in the fact that the existing standard data format for quantitative data, mzQuantML 60, has not been officially adopted by any of the most popular quantitative data analysis tools, each of which generates its own output data format. Therefore, for now, quantitative data are not explicitly supported in PACOM. However, the flexibility of PACOM can partly compensate for this limitation: the user can include any quantitative feature as a new column in the text file data table used to import the dataset, and PACOM will associate these values with proteins, peptides or PSMs. These new features can then be explored and compared in the tool as a different “score”, showing and comparing their distribution and values between different datasets. While mzQuantML has still not been widely adopted by the proteomics community, the number of tools implementing the mzTab 61 format is increasing. Therefore, we consider that mzTab, a tab-delimited text file format developed by the HUPO-PSI that represents both identification results and quantitative analysis results, would be the appropriate format to support in the next stable release of PACOM.
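The workaround described above, in which a quantitative value is added as an extra column of the text table used for import, could be prepared with a short script such as the following sketch. The column names (ACC, SEQ, SCORE, RATIO) and the tab-separated layout are hypothetical and chosen only for illustration; the actual column layout expected by PACOM is not specified here and should be taken from the tool's documentation.

```python
import csv
import io

def add_score_column(table_text, column_name, values_by_key, key_column):
    """Append one extra column to a tab-separated table held in a string.

    values_by_key maps the content of key_column (for example, a peptide
    sequence) to the quantitative value to attach; rows without a value
    receive an empty cell.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    fieldnames = list(reader.fieldnames) + [column_name]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column_name] = values_by_key.get(row[key_column], "")
        writer.writerow(row)
    return out.getvalue()

# Hypothetical input table and ratio values, for illustration only.
example_table = "ACC\tSEQ\tSCORE\nP1\tPEPK\t10\nP2\tSAMR\t20\n"
example_output = add_score_column(example_table, "RATIO", {"PEPK": 1.5}, "SEQ")
```

Once imported, such a column would be treated like any other score, so its distribution could be compared between datasets with the score charts listed in Table 1.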

Acknowledgment

We acknowledge funding from ProteoRed [PRB2 IPT13/0001-ISCIII-SGEFI/FEDER]; the Spanish National Research Council (CSIC); the 7th Framework Programme 'ProteomeXchange' grant 260558, and US National Institutes of Health (U54GM114833, P41GM103533, R01MH067880 and 1U01EY0). C Fernandez-Costa acknowledges a postdoctoral fellowship from Xunta de Galicia (Plan I2C, Spain). U54 GM114833-03.

Dedication

This work is dedicated to the memory of Juan Pablo Albar Ramirez. He certainly supported and helped to achieve the goals presented in this work, as nobody else could have done.

Supporting Information.

1. Description of the Java MIAPE API:
Figure S1. General scheme of the Java MIAPE API
Figure S2. An example of Java code

2. A multi-centric experiment, the ProteoRed Multi-centric Experiment 6:
Figure S3. Number of proteins, peptides and PSMs
Figure S4. FDR curves
Figure S5. Average number and standard deviations of the identified proteins and peptides
Figure S6. Score comparison
Figure S7. Peptide overlapping between the three replicates of each experiment
Figure S8. Peptide and Protein heatmaps
Figure S9. Exclusive proteins
Figure S10. Peptide length distribution
Figure S11. Average protein sequence coverage
Figure S12. Repeatability of peptide detection
Figure S13. Spiked protein analysis
Figure S14. Distribution of the PAnalyzer grouping classification
Figure S15. Sensitivity and precision
Figure S16. Examples of protein clouds
Figure S17. Human chromosome coverage
Figure S18. Search engine metadata

3. Supplementary examples (1-4) of the use of PACOM for data comparison and quality control:
Figure S19. Supplementary example 1: Detection of an anomaly in the digestion of replicate 1
Figure S20. Supplementary example 2: Detection of anomaly in fraction 2
Figure S21. Supplementary example 3: Reproducibility quality control
Figure S22. Supplementary example 4: Comparing a control versus a drug-treated sample

4. Additional Table S1: Search parameters used by each of the selected participants in the multi-laboratory study PME6.


References

1. Kinsinger, C. R.; Apffel, J.; Baker, M.; Bian, X.; Borchers, C. H.; Bradshaw, R.; Brusniak, M. Y.; Chan, D. W.; Deutsch, E. W.; Domon, B.; Gorman, J.; Grimm, R.; Hancock, W.; Hermjakob, H.; Horn, D.; Hunter, C.; Kolar, P.; Kraus, H. J.; Langen, H.; Linding, R.; Moritz, R. L.; Omenn, G. S.; Orlando, R.; Pandey, A.; Ping, P.; Rahbar, A.; Rivers, R.; Seymour, S. L.; Simpson, R. J.; Slotta, D.; Smith, R. D.; Stein, S. E.; Tabb, D. L.; Tagle, D.; Yates, J. R.; Rodriguez, H., Recommendations for mass spectrometry data quality metrics for open access data (corollary to the Amsterdam Principles). Journal of proteome research 2012, 11 (2), 1412-9.
2. Campos, A.; Diaz, R.; Martinez-Bartolome, S.; Sierra, J.; Gallardo, O.; Sabido, E.; Lopez-Lucendo, M.; Ignacio Casal, J.; Pasquarello, C.; Scherl, A.; Chiva, C.; Borras, E.; Odena, A.; Elortza, F.; Azkargorta, M.; Ibarrola, N.; Canals, F.; Albar, J. P.; Oliveira, E., Multicenter experiment for quality control of peptide-centric LC-MS/MS analysis - A longitudinal performance assessment with nLC coupled to orbitrap MS analyzers. Journal of proteomics 2015, 127 (Pt B), 264-74.
3. Kocher, T.; Pichler, P.; Swart, R.; Mechtler, K., Quality control in LC-MS/MS. Proteomics 2011, 11 (6), 1026-30.
4. Albar, J. P.; Canals, F., Standardization and quality control in proteomics. Journal of proteomics 2013, 95, 1-2.
5. Foster, J. M.; Degroeve, S.; Gatto, L.; Visser, M.; Wang, R.; Griss, J.; Apweiler, R.; Martens, L., A posteriori quality control for the curation and reuse of public proteomics data. Proteomics 2011, 11 (11), 2182-94.
6. Li, N.; Wu, S.; Zhang, C.; Chang, C.; Zhang, J.; Ma, J.; Li, L.; Qian, X.; Xu, P.; Zhu, Y.; He, F., PepDistiller: A quality control tool to improve the sensitivity and accuracy of peptide identifications in shotgun proteomics. Proteomics 2012, 12 (11), 1720-5.
7. Brosch, M.; Yu, L.; Hubbard, T.; Choudhary, J., Accurate and sensitive peptide identification with Mascot Percolator. Journal of proteome research 2009, 8 (6), 3176-81.
8. Bittremieux, W.; Meysman, P.; Martens, L.; Valkenborg, D.; Laukens, K., Unsupervised Quality Assessment of Mass Spectrometry Proteomics Experiments by Multivariate Quality Control Metrics. Journal of proteome research 2016, 15 (4), 1300-7.
9. Bielow, C.; Mastrobuoni, G.; Kempa, S., Proteomics Quality Control: Quality Control Software for MaxQuant Results. Journal of proteome research 2016, 15 (3), 777-87.
10. Taylor, R. M.; Dance, J.; Taylor, R. J.; Prince, J. T., Metriculator: quality assessment for mass spectrometry-based proteomics. Bioinformatics 2013, 29 (22), 2948-9.
11. Pluskal, T.; Castillo, S.; Villar-Briones, A.; Oresic, M., MZmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC bioinformatics 2010, 11, 395.
12. Bereman, M. S., Tools for monitoring system suitability in LC MS/MS centric proteomic experiments. Proteomics 2015, 15 (5-6), 891-902.
13. Pichler, P.; Mazanek, M.; Dusberger, F.; Weilnbock, L.; Huber, C. G.; Stingl, C.; Luider, T. M.; Straube, W. L.; Kocher, T.; Mechtler, K., SIMPATIQCO: a server-based software suite which facilitates monitoring the time course of LC-MS performance metrics on Orbitrap instruments. Journal of proteome research 2012, 11 (11), 5540-7.
14. Sturm, M.; Bertsch, A.; Gropl, C.; Hildebrandt, A.; Hussong, R.; Lange, E.; Pfeifer, N.; Schulz-Trieglaff, O.; Zerck, A.; Reinert, K.; Kohlbacher, O., OpenMS - an open-source software framework for mass spectrometry. BMC bioinformatics 2008, 9, 163.
15. Ma, Z. Q.; Polzin, K. O.; Dasari, S.; Chambers, M. C.; Schilling, B.; Gibson, B. W.; Tran, B. Q.; Vega-Montoto, L.; Liebler, D. C.; Tabb, D. L., QuaMeter: multivendor performance metrics for LC-MS/MS proteomics instrumentation. Anal Chem 2012, 84 (14), 5845-50.
16. Bittremieux, W.; Willems, H.; Kelchtermans, P.; Martens, L.; Laukens, K.; Valkenborg, D., iMonDB: Mass Spectrometry Quality Control through Instrument Monitoring. Journal of proteome research 2015, 14 (5), 2360-6.
17. Halligan, B. D.; Greene, A. S., Visualize: a free and open source multifunction tool for proteomics data analysis. Proteomics 2011, 11 (6), 1058-63.


18. Hesse, A. M.; Dupierris, V.; Adam, C.; Court, M.; Barthe, D.; Emadali, A.; Masselon, C.; Ferro, M.; Bruley, C., hEIDI: An Intuitive Application Tool To Organize and Treat Large-Scale Proteomics Data. Journal of proteome research 2016, 15 (10), 3896-3903.
19. Kopczynski, D.; Sickmann, A.; Ahrends, R., Computational proteomics tools for identification and quality control. J Biotechnol 2017, 261, 126-130.
20. Wang, R.; Fabregat, A.; Rios, D.; Ovelleiro, D.; Foster, J. M.; Cote, R. G.; Griss, J.; Csordas, A.; Perez-Riverol, Y.; Reisinger, F.; Hermjakob, H.; Martens, L.; Vizcaino, J. A., PRIDE Inspector: a tool to visualize and validate MS proteomics data. Nature biotechnology 2012, 30 (2), 135-7.
21. Aiche, S.; Sachsenberg, T.; Kenar, E.; Walzer, M.; Wiswedel, B.; Kristl, T.; Boyles, M.; Duschl, A.; Huber, C. G.; Berthold, M. R.; Reinert, K.; Kohlbacher, O., Workflows for automated downstream data analysis and visualization in large-scale computational mass spectrometry. Proteomics 2015, 15 (8), 1443-7.
22. Wen, B.; Gatto, L. proteoQC: An R package for proteomics data quality control. R package version 1.14.0, 2017.
23. Prieto, G.; Aloria, K.; Osinalde, N.; Fullaondo, A.; Arizmendi, J. M.; Matthiesen, R., PAnalyzer: a software tool for protein inference in shotgun proteomics. BMC bioinformatics 2012, 13, 288.
24. Orchard, S., Data standardization and sharing-the work of the HUPO-PSI. Biochim Biophys Acta 2014, 1844 (1 Pt A), 82-7.
25. Eisenacher, M., mzIdentML: an open community-built standard format for the results of proteomics spectrum identification algorithms. Methods in molecular biology 2011, 696, 161-77.
26. Jones, A. R.; Eisenacher, M.; Mayer, G.; Kohlbacher, O.; Siepen, J.; Hubbard, S. J.; Selley, J. N.; Searle, B. C.; Shofstahl, J.; Seymour, S. L.; Julian, R.; Binz, P. A.; Deutsch, E. W.; Hermjakob, H.; Reisinger, F.; Griss, J.; Vizcaino, J. A.; Chambers, M.; Pizarro, A.; Creasy, D., The mzIdentML data standard for mass spectrometry-based proteomics results. Molecular & cellular proteomics : MCP 2012, 11 (7), M111 014381.
27. Vizcaino, J. A.; Mayer, G.; Perkins, S.; Barsnes, H.; Vaudel, M.; Perez-Riverol, Y.; Ternent, T.; Uszkoreit, J.; Eisenacher, M.; Fischer, L.; Rappsilber, J.; Netz, E.; Walzer, M.; Kohlbacher, O.; Leitner, A.; Chalkley, R. J.; Ghali, F.; Martinez-Bartolome, S.; Deutsch, E. W.; Jones, A. R., The mzIdentML Data Standard Version 1.2, Supporting Advances in Proteome Informatics. Molecular & cellular proteomics : MCP 2017, 16 (7), 1275-1285.
28. Seymour, S. L.; Farrah, T.; Binz, P. A.; Chalkley, R. J.; Cottrell, J. S.; Searle, B. C.; Tabb, D. L.; Vizcaino, J. A.; Prieto, G.; Uszkoreit, J.; Eisenacher, M.; Martinez-Bartolome, S.; Ghali, F.; Jones, A. R., A standardized framing for reporting protein identifications in mzIdentML 1.2. Proteomics 2014, 14 (21-22), 2389-99.
29. Cociorva, D.; Tabb, D. L.; Yates, J. R., Validation of tandem mass spectrometry database search results using DTASelect. Current protocols in bioinformatics / editorial board, Andreas D. Baxevanis ... [et al.] 2007, Chapter 13, Unit 13 4.
30. Tabb, D. L.; McDonald, W. H.; Yates, J. R., 3rd, DTASelect and Contrast: tools for assembling and comparing protein identifications from shotgun proteomics. Journal of proteome research 2002, 1 (1), 21-6.
31. Pedrioli, P. G., Trans-proteomic pipeline: a pipeline for proteomic analysis. Methods in molecular biology 2010, 604, 213-38.
32. Craig, R.; Beavis, R. C., TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20 (9), 1466-7.
33. Vizcaino, J. A.; Cote, R. G.; Csordas, A.; Dianes, J. A.; Fabregat, A.; Foster, J. M.; Griss, J.; Alpi, E.; Birim, M.; Contell, J.; O'Kelly, G.; Schoenegger, A.; Ovelleiro, D.; Perez-Riverol, Y.; Reisinger, F.; Rios, D.; Wang, R.; Hermjakob, H., The PRoteomics IDEntifications (PRIDE) database and associated tools: status in 2013. Nucleic acids research 2013, 41 (Database issue), D1063-9.
34. Martinez-Bartolome, S.; Binz, P. A.; Albar, J. P., The Minimal Information about a Proteomics Experiment (MIAPE) from the Proteomics Standards Initiative. Methods in molecular biology 2014, 1072, 765-80.
35. Taylor, C. F.; Paton, N. W.; Lilley, K. S.; Binz, P. A.; Julian, R. K., Jr.; Jones, A. R.; Zhu, W.; Apweiler, R.; Aebersold, R.; Deutsch, E. W.; Dunn, M. J.; Heck, A. J.; Leitner, A.; Macht, M.; Mann, M.; Martens, L.; Neubert, T. A.; Patterson, S. D.; Ping, P.; Seymour, S. L.; Souda, P.; Tsugita, A.; Vandekerckhove, J.; Vondriska, T. M.; Whitelegge, J. P.; Wilkins, M. R.; Xenarios, I.; Yates, J. R., 3rd; Hermjakob, H., The minimum information about a proteomics experiment (MIAPE). Nature biotechnology 2007, 25 (8), 887-93.
36. Paradela, A.; Escuredo, P. R.; Albar, J. P., Geographical focus. Proteomics initiatives in Spain: ProteoRed. Proteomics 2006, 6 Suppl 2, 73-6.
37. Avtonomov, D. M.; Raskind, A.; Nesvizhskii, A. I., BatMass: a Java Software Platform for LC-MS Data Visualization in Proteomics and Metabolomics. Journal of proteome research 2016, 15 (8), 2500-9.
38. Reisinger, F.; Krishna, R.; Ghali, F.; Rios, D.; Hermjakob, H.; Vizcaino, J. A.; Jones, A. R., jmzIdentML API: A Java interface to the mzIdentML standard for peptide and protein identification data. Proteomics 2012, 12 (6), 790-4.
39. Cote, R. G.; Reisinger, F.; Martens, L., jmzML, an open-source Java API for mzML, the PSI standard for MS data. Proteomics 2010, 10 (7), 1332-5.
40. Muth, T.; Vaudel, M.; Barsnes, H.; Martens, L.; Sickmann, A., XTandem Parser: an open-source library to parse and analyse X!Tandem MS/MS search results. Proteomics 2010, 10 (7), 1522-4.
41. Barsnes, H.; Vaudel, M.; Colaert, N.; Helsens, K.; Sickmann, A.; Berven, F. S.; Martens, L., compomics-utilities: an open-source Java library for computational proteomics. BMC bioinformatics 2011, 12, 70.
42. Martens, L.; Vandekerckhove, J.; Gevaert, K., DBToolkit: processing protein databases for peptide-centric proteomics. Bioinformatics 2005, 21 (17), 3584-5.
43. Perez-Riverol, Y.; Uszkoreit, J.; Sanchez, A.; Ternent, T.; Del Toro, N.; Hermjakob, H.; Vizcaino, J. A.; Wang, R., ms-data-core-api: an open-source, metadata-oriented library for computational proteomics. Bioinformatics 2015, 31 (17), 2903-5.
44. Montecchi-Palazzi, L.; Kerrien, S.; Reisinger, F.; Aranda, B.; Jones, A. R.; Martens, L.; Hermjakob, H., The PSI semantic validator: a framework to check MIAPE compliance of proteomics data. Proteomics 2009, 9 (22), 5112-9.
45. Martinez-Bartolome, S.; Medina-Aunon, J. A.; Jones, A. R.; Albar, J. P., Semi-automatic tool to describe, store and compare proteomics experiments based on MIAPE compliant reports. Proteomics 2010, 10 (6), 1256-60.
46. Kessner, D.; Chambers, M.; Burke, R.; Agus, D.; Mallick, P., ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics 2008, 24 (21), 2534-6.
47. Martens, L.; Hermjakob, H.; Jones, P.; Adamski, M.; Taylor, C.; States, D.; Gevaert, K.; Vandekerckhove, J.; Apweiler, R., PRIDE: the proteomics identifications database. Proteomics 2005, 5 (13), 3537-45.
48. Binz, P. A.; Barkovich, R.; Beavis, R. C.; Creasy, D.; Horn, D. M.; Julian, R. K., Jr.; Seymour, S. L.; Taylor, C. F.; Vandenbrouck, Y., Guidelines for reporting the use of mass spectrometry informatics in proteomics. Nature biotechnology 2008, 26 (8), 862.
49. Paik, Y. K.; Jeong, S. K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Cho, S. Y.; Lee, H. J.; Na, K.; Choi, E. Y.; Yan, F.; Zhang, F.; Zhang, Y.; Snyder, M.; Cheng, Y.; Chen, R.; Marko-Varga, G.; Deutsch, E. W.; Kim, H.; Kwon, J. Y.; Aebersold, R.; Bairoch, A.; Taylor, A. D.; Kim, K. Y.; Lee, E. Y.; Hochstrasser, D.; Legrain, P.; Hancock, W. S., The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nature biotechnology 2012, 30 (3), 221-3.
50. Segura, V.; Medina-Aunon, J. A.; Guruceaga, E.; Gharbi, S. I.; Gonzalez-Tejedo, C.; Sanchez del Pino, M. M.; Canals, F.; Fuentes, M.; Casal, J. I.; Martinez-Bartolome, S.; Elortza, F.; Mato, J. M.; Arizmendi, J. M.; Abian, J.; Oliveira, E.; Gil, C.; Vivanco, F.; Blanco, F.; Albar, J. P.; Corrales, F. J., Spanish human proteome project: dissection of chromosome 16. Journal of proteome research 2013, 12 (1), 112-22.
51. Segura, V.; Medina-Aunon, J. A.; Mora, M. I.; Martinez-Bartolome, S.; Abian, J.; Aloria, K.; Antunez, O.; Arizmendi, J. M.; Azkargorta, M.; Barcelo-Batllori, S.; Beaskoetxea, J.; Bech-Serra, J. J.; Blanco, F.; Monteiro, M. B.; Caceres, D.; Canals, F.; Carrascal, M.; Casal, J. I.; Clemente, F.; Colome, N.; Dasilva, N.; Diaz, P.; Elortza, F.; Fernandez-Puente, P.; Fuentes, M.; Gallardo, O.; Gharbi, S. I.; Gil, C.; Gonzalez-Tejedo, C.; Hernaez, M. L.; Lombardia, M.; Lopez-Lucendo, M.; Marcilla, M.; Mato, J. M.; Mendes, M.; Oliveira, E.; Orera, I.; Pascual-Montano, A.; Prieto, G.; Ruiz-Romero, C.; Sanchez del Pino, M. M.; Tabas-Madrid, D.; Valero, M. L.; Vialas, V.; Villanueva, J.; Albar, J. P.; Corrales, F. J., Surfing transcriptomic landscapes. A step beyond the annotation of chromosome 16 proteome. Journal of proteome research 2014, 13 (1), 158-72.
52. Mohedano Mde, L.; Russo, P.; de Los Rios, V.; Capozzi, V.; Fernandez de Palencia, P.; Spano, G.; Lopez, P., A partial proteome reference map of the wine lactic acid bacterium Oenococcus oeni ATCC BAA-1163. Open Biol 2014, 4, 130154.
53. Arcos, S. C.; Ciordia, S.; Roberston, L.; Zapico, I.; Jimenez-Ruiz, Y.; Gonzalez-Munoz, M.; Moneo, I.; Carballeda-Sangiao, N.; Rodriguez-Mahillo, A.; Albar, J. P.; Navas, A., Proteomic profiling and characterization of differential allergens in the nematodes Anisakis simplex sensu stricto and A. pegreffii. Proteomics 2014, 14 (12), 1547-68.
54. Kubacka, A.; Diez, M. S.; Rojo, D.; Bargiela, R.; Ciordia, S.; Zapico, I.; Albar, J. P.; Barbas, C.; Martins dos Santos, V. A.; Fernandez-Garcia, M.; Ferrer, M., Understanding the antimicrobial mechanism of TiO(2)-based nanocomposite films in a pathogenic bacterium. Sci Rep 2014, 4, 4134.
55. Alcolea, P. J.; Alonso, A.; Garcia-Tabares, F.; Torano, A.; Larraga, V., An Insight into the proteome of Crithidia fasciculata choanomastigotes as a comparative approach to axenic growth, peanut lectin agglutination and differentiation of Leishmania spp. promastigotes. PLoS One 2014, 9 (12), e113837.
56. Park, S. K.; Aslanian, A.; McClatchy, D. B.; Han, X.; Shah, H.; Singh, M.; Rauniyar, N.; Moresco, J. J.; Pinto, A. F.; Diedrich, J. K.; Delahunty, C.; Yates, J. R., 3rd, Census 2: isobaric labeling data analysis. Bioinformatics 2014, 30 (15), 2208-9.
57. Xu, T.; Venable, J. D.; Park, S. K.; Cociorva, D.; Lu, B.; Liao, L.; Wohlschlegel, J.; Hewel, J.; Yates, J. R., 3rd, ProLuCID, a Fast and Sensitive Tandem Mass Spectra-based Protein Identification Program. Molecular Cellular Proteomics 2006, 5, S174.
58. Gatto, L.; Christoforou, A., Using R and Bioconductor for proteomics data analysis. Biochim Biophys Acta 2014, 1844 (1 Pt A), 42-51.
59. Tyanova, S.; Temu, T.; Cox, J., The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nature protocols 2016, 11 (12), 2301-2319.
60. Walzer, M.; Qi, D.; Mayer, G.; Uszkoreit, J.; Eisenacher, M.; Sachsenberg, T.; Gonzalez-Galarza, F. F.; Fan, J.; Bessant, C.; Deutsch, E. W.; Reisinger, F.; Vizcaino, J. A.; Medina-Aunon, J. A.; Albar, J. P.; Kohlbacher, O.; Jones, A. R., The mzQuantML data standard for mass spectrometry-based quantitative studies in proteomics. Molecular & cellular proteomics : MCP 2013, 12 (8), 2332-40.
61. Griss, J.; Jones, A. R.; Sachsenberg, T.; Walzer, M.; Gatto, L.; Hartler, J.; Thallinger, G. G.; Salek, R. M.; Steinbeck, C.; Neuhauser, N.; Cox, J.; Neumann, S.; Fan, J.; Reisinger, F.; Xu, Q. W.; Del Toro, N.; Perez-Riverol, Y.; Ghali, F.; Bandeira, N.; Xenarios, I.; Kohlbacher, O.; Vizcaino, J. A.; Hermjakob, H., The mzTab data exchange format: communicating mass-spectrometry-based proteomics and metabolomics experimental results to a wider audience. Molecular & cellular proteomics : MCP 2014, 13 (10), 2765-75.

Table 1. Different chart types available in PACOM: The table shows the 45 different charts available in PACOM to visualize and compare proteomics experiments, together with a brief description and the type of graphical representation. Chart name 1 2 3

PSMs/Peptides/Proteins Number of identification s

Peptide number Number of different peptides per protein

Description

Type of graphical representation

Shows the number of PSMs, Peptides and Proteins in the same chart.

Line chart

Shows the number of Peptides. Shows the number sequences per protein.

of

different

peptide

Bar chart / stacked bar chart / pie chart Bar chart / Stacked bar

25

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 26 of 28

chart

4

Protein number

5

Single hit proteins

6

Protein sensitivity and specificity

7

Peptide overlapping Overlapping

Shows the number of Proteins. Shows the number of proteins identified just with one PSM. After entering a list of true positive proteins, it shows the sensitivity, accuracy, specificity, precision, fraction of true negatives and error rate. Shows a Venn diagram representing the overlapping of the peptides. Shows a Venn diagram overlapping of the proteins.

representing

the

Bar chart / Stacked bar chart / Pie chart Bar chart / Stacked bar chart / Pie chart Bar chart

Venn diagram

8

Protein overlapping

9

Peptide score comparison

Each point represents the score of the same peptide in two different datasets.

Scatter plot

10

Peptide score distribution

Shows the distribution of a selected peptide score.

Histogram in a line chart

11

Protein score comparison

Each point represents the score of the same protein in two different datasets.

Scatter plot

12

Protein score distribution

13

Protein words cloud

14

Protein group type distribution

Scores

15

Protein features

16

Number of exclusive proteins Protein repeatability

17 Protein coverage 18

Number of proteins detected once, twice\etc, in each dataset.

Stacked bar chart Stacked bar chart with error bars

Protein coverage distribution

Shows the distribution of protein coverages over the dataset.

20

Peptide mass distribution

21

Peptide length distribution

22

Peptide charge distribution

Peptide features

Histogram in a line chart

Shows the average protein sequence coverage.

Miscleavages distribution

23

Shows the distribution of a selected protein score. Taking all protein descriptions, it represents each word depending on its frequency in the dataset. The more frequent the word is, the bigger is represented. Shows the number of proteins or protein groups that are classified in each protein group category. Number of proteins identified only in each dataset.

Protein coverage

19

Peptide mass error

24

Peptide retention time distribution

25

Peptide retention time comparison

26

Single peptide Retention Time Comparison

27

Number of exclusive peptides

28

Peptide repeatability

Venn diagram

Shows the number of peptides containing no miscleavages, one, two, etc\ Shows the distribution of masses (in Da) of the peptides. Giving a range of lengths, shows the number of peptides for each length in between that length range. Shows the number of peptides for each detected charge state. For each PSM, shows the difference between the experimental and observed m/z values against the experimental m/z values. Distribution of the number of peptides by its retention time.

Word cloud Bar chart / Stacked bar chart Combined line and bar chart

Histogram in a line chart Bar chart / Stacked bar chart Histogram in a line chart Bar chart / Stacked bar chart Bar chart / Stacked bar chart Scatter plot Histogram in a line chart

Each point represents the retention time for the same peptide in two different datasets. For each selected PSM, shows the retention time in each dataset, allowing multiple selections. Number of peptides identified only in each dataset.

Combined line and bar chart

Number of peptides detected once, twice\etc, in each dataset.

Stacked bar chart

Scatter plot Bar chart

26

ACS Paragon Plus Environment

Page 27 of 28 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Peptide modifications

29 Peptide modification distribution: Number of peptides detected with one, two, three, etc., PTM positions of a given type of PTM. (Bar chart / Stacked bar chart)

30 Peptide monitoring: Number of times that a given peptide sequence is detected in each dataset. Multiple peptide selections are possible. (Bar chart)

31 Number of modified peptides: Number of peptides containing at least one amino acid modified with a given PTM. Multiple PTM selections are allowed. (Bar chart / Stacked bar chart)

32 Number of modified sites: Number of PTM sites of a given PTM in each dataset. Multiple PTM selections are allowed. (Bar chart / Stacked bar chart)

Heat-maps

33 Protein heat-map: Heat-map representing whether a given protein is detected or not in all datasets. (Heat-map chart)

34 Number of peptides per protein heat-map: Heat-map representing the number of peptides explaining each protein in each dataset. (Heat-map chart)

35 Peptide heat-map: Heat-map representing the number of times each peptide is detected in all datasets. (Heat-map chart)

36 Peptide presence heat-map: Given a list of peptides of interest introduced by the user, represents the number of times each one is detected over the datasets. (Heat-map chart)

False discovery rates

37 False discovery rates: Shows the number of proteins/peptides/PSMs identified for each FDR value. (Line chart)

38 FDRs vs Scores / # proteins vs Score: Shows the best scores versus the FDR values and the number of proteins detected at different score thresholds. (Line chart)

Human genes and chromosomes

39 Human chromosome coverage: Shows the proportion of proteins detected versus the total number of encoded proteins in each human chromosome, for each dataset. (Bar chart / Spider plot)

40 Proteins and genes per chromosome: Shows the number of different proteins and genes identified that are encoded in each chromosome. (Bar chart / Pie chart)

41 Peptides and PSMs per chromosome: Shows the number of peptides and PSMs identified mapping to proteins encoded in each human chromosome. (Bar chart / Pie chart)

Peptide counting

42 Peptide counting ratio histogram: Histogram of the distribution of the spectral count ratio between two datasets. (Histogram in a bar chart)

43 Peptide counting ratio vs score: For each peptide, the spectral count ratio between two datasets vs the score of the peptide in each dataset. (Scatter plot)

Metadata

44 Spectrometers: Shows a table with information on the spectrometers used in each dataset. (Table)

45 Input parameters: Shows a table with the values of the input parameters of the search engine (required in MIAPE). (Table)
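Several of the charts listed in this table (number of exclusive peptides, peptide repeatability, peptide counting ratio) reduce to simple counting over per-dataset peptide lists. The following Python sketch illustrates that counting under stated assumptions; it is not PACOM's implementation. The input layout (a dictionary mapping a dataset name to one peptide sequence per PSM), the function names, and the example sequences are all hypothetical. It also assumes that "detected n times" means n PSMs, and that the spectral count ratio is a plain quotient of PSM counts for peptides found in both datasets; the tool may define or transform these quantities differently.

```python
from collections import Counter

# Hypothetical input: dataset name -> peptide sequence of every PSM
# (one entry per PSM, so a sequence can appear several times).
psms = {
    "replicate_A": ["PEPTIDEK", "PEPTIDEK", "ELVISK", "SAMPLER"],
    "replicate_B": ["PEPTIDEK", "ELVISK", "ELVISK", "ELVISK", "ANOTHERK"],
}


def exclusive_peptides(psms_by_dataset):
    """Count the distinct peptides seen in one dataset and in no other."""
    distinct = {name: set(seqs) for name, seqs in psms_by_dataset.items()}
    result = {}
    for name, peptides in distinct.items():
        # Union of the peptides of every other dataset.
        others = set().union(*(p for n, p in distinct.items() if n != name))
        result[name] = len(peptides - others)
    return result


def repeatability(psms_by_dataset):
    """Per dataset, map 'times detected' -> number of distinct peptides.

    Assumption: 'detected n times' is taken to mean n PSMs.
    """
    return {
        name: dict(Counter(Counter(seqs).values()))
        for name, seqs in psms_by_dataset.items()
    }


def spectral_count_ratio(psms_by_dataset, first, second):
    """PSM-count ratio first/second for peptides present in both datasets.

    Assumption: a plain quotient; no log transform or normalization.
    """
    a = Counter(psms_by_dataset[first])
    b = Counter(psms_by_dataset[second])
    return {pep: a[pep] / b[pep] for pep in a.keys() & b.keys()}


print(exclusive_peptides(psms))
print(repeatability(psms))
print(spectral_count_ratio(psms, "replicate_A", "replicate_B"))
```

Restricting the ratio to peptides shared by both datasets avoids division by zero; how peptides missing from one dataset should be treated is left open here.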
