ABRF Proteome Informatics Research Group (iPRG) 2015 Study

Dec 19, 2016 - ... of Biomolecular Resource Facilities (ABRF) aimed to evaluate the effects of the statistical analysis on the accuracy of the results...
3 downloads 4 Views 16MB Size
Subscriber access provided by EPFL | Scientific Information and Libraries

Article

ABRF Proteome Informatics Research Group (iPRG) 2015 Study: Detection of differentially abundant proteins in label-free quantitative LC-MS/MS experiments Meena Choi, Zeynep Filiz Eren-Dogu, Christopher M Colangelo, John S. Cottrell, Michael R. Hoopmann, Eugene A. Kapp, Sangtae Kim, Henry Lam, Thomas A. Neubert, Magnus Palmblad, Brett S Phinney, Susan T. Weintraub, Brendan MacLean, and Olga Vitek J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.6b00881 • Publication Date (Web): 19 Dec 2016 Downloaded from http://pubs.acs.org on December 20, 2016

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 43

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

ABRF Proteome Informatics Research Group (iPRG) 2015 Study: Detection of differentially abundant proteins in label-free quantitative LC-MS/MS experiments

Meena Choi#1, Zeynep F. Eren-Dogu#2, Christopher Colangelo3, John Cottrell4, Michael R. Hoopmann5, Eugene A. Kapp6, Sangtae Kim7, Henry Lam8, Thomas A. Neubert9, Magnus Palmblad10, Brett S. Phinney11, Susan T. Weintraub12, Brendan MacLean13, Olga Vitek*1 #

Equal contribution

1

Northeastern University, Boston MA, USA

2

Mugla Sitki Kocman University, Mugla, Turkey

3

Primary Ion, LLC, Old Lyme CT, USA

4

Matrix Science Ltd, London, UK

5

Institute for Systems Biology, Seattle WA, USA

6

Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia

7

Pacific Northwest National Laboratory, Richland WA, USA

8

Department of Chemical and Biomolecular Engineering and Division of Biomedical

Engineering, The Hong Kong University of Science and Technology, Hong Kong 9

Skirball Institute and Department of Biochemistry and Molecular Pharmacology, New York

University School of Medicine, New York NY, USA 10

Center for Proteomics and Metabolomics, Leiden University Medical Center, Leiden, The

Netherlands 11

University of California at Davis, Davis CA, USA

ACS Paragon Plus Environment

1

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

12

University of Texas Health Science Center at San Antonio, San Antonio TX, USA

13

University of Washington, Seattle WA, USA

Page 2 of 43

*Corresponding author:

Olga Vitek 360 Hungtington Ave. Boston, Massachusetts 02115 Northeastern University Email : [email protected] Tel : 617-373-2194

Keywords: Mass spectrometry, LC-MS/MS, Quantitative proteomics, Bioinformatics, Statistics, Differential Abundance

ABSTRACT

Detection of differentially abundant proteins in label-free quantitative shotgun liquid chromatography-tandem mass spectrometry (LC-MS/MS) experiments requires a series of computational steps that identify and quantify LC-MS features. It also requires statistical analyses that distinguish systematic changes in abundance between conditions from artifacts of biological and technical variation. The 2015 study of the Proteome Informatics Research Group (iPRG) of the Association of Biomolecular Resource Facilities (ABRF) aimed to evaluate the effects of the statistical analysis on the accuracy of the results. The study used LC-tandem mass spectra acquired from a controlled mixture, and made the data available to anonymous volunteer

ACS Paragon Plus Environment

2

Page 3 of 43

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

participants. The participants used methods of their choice to detect differentially abundant proteins, estimate the associated fold changes, and characterize the uncertainty of the results. The study found that multiple strategies (including the use of spectral counts versus peak intensities, and various software tools) could lead to accurate results, and that the performance was primarily determined by the analysts’ expertise. This manuscript summarizes the outcome of the study, and provides representative examples of good computational and statistical practice. The dataset generated as part of this study is publicly available.

ACS Paragon Plus Environment

3

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 4 of 43

1 Introduction Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) is an important tool for quantitative investigations of complex proteomes.1 It is widely used in biological2 and clinical research3. Shotgun label-free LC-MS/MS workflows are particularly attractive as they require minimal sample manipulation and yield unbiased characterization of protein relative abundance in complex biological mixtures. A common goal of quantitative proteomic experiments is to detect differentially abundant proteins among conditions. To achieve this with shotgun label-free LC-MS/MS workflows, proteins are digested into peptides with an enzyme such as trypsin. The peptides are then separated by liquid chromatography, ionized, and subjected to mass analysis to produce mass spectra (MS). The resulting two-dimensional LC-MS features corresponding to the peptide ions and their intensities are related to the abundance of the underlying protein. Features are selected by the mass spectrometer for fragmentation to produce tandem mass spectra (MS/MS; MS2). The tandem mass spectra are informative of the amino acid sequences of the peptides, and the frequency of identified MS/MS spectra can be viewed as a proxy for protein abundance. 4 Multiple computational steps are required to extract the identities and relative abundances of proteins from the LC-MS features.5 First, the sequences of amino acids are assigned by a database search, which compares the observed LC-MS/MS spectra to the predicted fragments generated from the database entries. Second, LC-MS features are extracted from the twodimensional LC-MS space, and their intensities are determined by either extracted ion current peak area or apex height of the feature. The LC-MS features are then mapped to the identified LC-MS/MS spectra and aligned across the runs. Finally, statistical analysis of the identified and quantified LC-MS features summarizes all the peptide ions in the experiment that are assigned to

ACS Paragon Plus Environment

4

Page 5 of 43

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

each protein, quantifies their biological and technical variation, and detects proteins that systematically change between conditions. Many computational and statistical tools now exist to identify and quantify LC-MS features, and to detect differentially abundant proteins.6 Various combinations of the tools are incorporated into automated and semi-automated data analysis pipelines. However, each pipeline makes specific and sometimes subjective choices of algorithms and implementations, data processing parameters, and methods of statistical analysis, and these choices have important consequences for the ability to detect differentially abundant proteins. To investigate the ability of various computational and statistical pipelines to detect differentially abundant proteins in LC-MS/MS experiments, the 2015 Proteome Informatics Research Group (iPRG) of the Association of Biomolecular Resource Facilities (ABRF) launched a study that solicited community participation. The ABRF is an international society dedicated to advancing core and biotechnology investigator laboratories through research, communication, and education. The iPRG supports and participates in the development, evaluation, and advancement of new algorithms, software tools and strategies for proteome informatics. The 2015 iPRG study provided a well-designed, controlled dataset to be analyzed by the anonymous volunteer participants using their methods of choice. It should be noted that the study did not aim to compare peptide spectrum matching or peptide peak area extraction tools. Therefore, identified LC-MS/MS spectra and their associated integrated peak intensities were also provided as optional starting points of data analysis. The participants made their own decision of whether to start from the raw data or from the provided intermediate results. The organizers of the study evaluated the submissions in terms of their ability to correctly detect the differentially abundant proteins and to accurately estimate the fold changes in protein abundance

ACS Paragon Plus Environment

5

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 43

among conditions. This manuscript summarizes the outcomes of the study, highlights the importance of the correct use of computational and statistical methods, and provides representative examples of good statistical practice.

2 Study Design and Data Acquisition The study was based on four artificially made samples of known composition, each containing a constant background 200 ng of tryptic digests of S. cerevisiae (ATCC strain 204508/S288c). Each sample was separately spiked with different quantities of six individual protein digests. All of the proteins were reduced and alkylated with iodoacetamide prior to digestion with trypsin. The concentrations of the spiked-in proteins are summarized in Table 1.

Table 1 here

The four samples were analyzed in three LC-MS/MS acquisitions (total of 12 runs) in random order. The digests were loaded directly on to a 15 cm x 75 µm PicoFrit column (New Objective) self-packed with 3 µm Reprosil-Pur C18-AQ beads (Dr. Maisch HPLC GmbH). Samples were separated using a Thermo Scientific Easy-nLC 1000 system with a 110-min linear gradient of 0-40% acetonitrile in 0.1% formic acid at 250 nL/min directly connected to a Thermo Scientific Q-Exactive mass spectrometer. Data were acquired in data-dependent (DDA) mode, with each MS survey scan followed by 10 MS/MS HCD scans (AGC target 10E6, max fill time 60 msec), with 30-sec dynamic exclusion. Both MS and MS/MS data were acquired in profile mode in the Orbitrap, with resolution 70,000 for MS and 17,500 for MS/MS. The MS1 scan

ACS Paragon Plus Environment

6

Page 7 of 43

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

range was 300-1650 m/z, the normalized collision energy was set to 27%, and singly charged ions were excluded. Based on the feedback of the ABRF community, the organizers made available both the original files, and the results of a partial analysis. Specifically, the database of the S. cerevisiae proteins was augmented with the sequences of the spiked proteins (added at random locations). The identifiers of the spiked proteins were made similar to the background proteins. The MS data files were searched against a provided target-decoy protein database using three sequence search engines, OMSSA7, MS-GF+8 and Comet9. The search parameters for each engine are detailed in Supplementary Section 2. The search results were first validated at the peptide-spectrum match (PSM) level by PeptideProphet10, employing decoy-assisted semi-parametric modeling11. Accurate mass modeling was turned on to adjust probabilities based on the precursor mass deviations. Then, iProphet12 was used to combine the results from the three search engines. The resulting iProphet cutoff is a PSM-specific probability equivalent to (1 – posterior error probability). Note that although decoys were used to help fit the mixture distribution at the PeptideProphet stage, decoys were not used in subsequent probability adjustments. Consequently, the peptide- and protein-level error estimates may not correspond to the error estimates obtained by simple decoy counting. To provide an additional error estimate, a PSM-level q-value was also computed by decoy counting and is listed in the ID tables provided to the participants. Specifically, the PSMs were first sorted by descending iProphet probability, and the q-value of a particular PSM was taken to be the fraction of decoys among all IDs at or above its iProphet probability. Decoy PSMs themselves were omitted in the ID tables, but otherwise all PSMs, regardless of q-value, were included. Participants who used the provided IDs were expected to filter the PSMs based on

ACS Paragon Plus Environment

7

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 43

criteria of their choice. The provided spectral identifications included 5,766 proteins with on average 26,242 PSMs per run. 48% of the proteins (2,801 out of 5,776) were represented with two or fewer PSMs in at least one condition, and 29% of proteins (1,663 out of 5,776) had 15 or more PSMs in every condition. 17% of the unfiltered PSMs had a PSM-level q-value greater than 0.01. In absence of additional filtering by the participants, these identifications would lead to a very high protein-level false discovery rate (FDR). The LC-MS features were quantified with Skyline13 v2.6, using the identified MS/MS spectra to locate and extract the MS1 peaks. The iProphet probability cut-off of 0.15 was used to filter peptide targets at 1% PSM-level FDR estimated by decoy counting. Supplementary Table 3 shows that this corresponds to peptide-level FDR estimated at 2.4%, and protein-level FDR estimated at 17.4%. The full parameters of the workflow are detailed in Supplementary Section 2. After the data import, peptide ions and corresponding MS1 peak areas were exported using the Transition Results report in Skyline. Finally, the iProphet probability and q-value were appended as extra columns to the report. Peptide IDs that could be mapped to multiple proteins in the database were removed. Therefore, no peptides were shared between proteins. Since the prevalence of homologous sequences among yeast proteins and the spiked-in standard proteins is low, this was not expected to significantly impact the conclusions regarding the spiked proteins. The provided data contained quantitative measurements for 3,766 proteins. The intensity of each peak was quantified as the area under its curve in the chromatographic dimension. The unit of a peak intensity was an ion count. For a given peptide ion, Skyline reported the sum of the areas under the curve over all the targeted isotopes (in this case M, M+1 and M+2). The LC-MS features of a protein were defined as distinct combinations of peptide sequence and charge. The dataset had 29,575 LC-MS features, and the median number of LC-MS features per protein was

ACS Paragon Plus Environment

8

Page 9 of 43

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

five. Overall, 25% of proteins had one or two distinct LC-MS features; 25% of proteins had 11 or more distinct LC-MS features. The resulting data table contained 29,575 LC-MS features quantified across 12 runs, i.e. 29,575x12=354,900 quantitative values. Of these values, 42% were obtained by retention time alignment alone and, therefore, lacked associated q-values. Unlike in the ID table (where 17% of PSMs had a PSM-level q-value greater than 0.01), only 1% of the assigned q-values were greater than 0.01. Therefore, the processed data table had a more stringent PSM filtering than the ID tables. The average proportion of missing values per protein was 0.05%. Only 17 peptide queries were not assigned a peak. Moreover, 1,154 peak intensities were quantified with as 0. The participants in the study received the raw data in Thermo RAW and mzML formats, the results of the database search in pepXML and tab-delimited table format, the original Skylinebased quantification in a tab-delimited table format, and an Excel template for submitting the analysis

results.

The

raw

data

can

be

downloaded

http://massive.ucsd.edu/ProteoSAFe/status.jsp?task=eccf4bd3e86a4f79af468b0010eb80b0.

from All

the information provided to the participants and the subsequent re-quantification with Skyline, can be downloaded from ftp://[email protected]/ (password ABRF329). The participants were asked to use these data to i) estimate (log2) fold changes for all the proteins in the dataset, ii) detect the proteins that change in relative abundance in all pairwise comparisons of the four samples, and iii) characterize the uncertainty associated with the differential abundance. The participants could perform relative quantification based on counts of tandem mass spectra, areas of the chromatogram peaks extracted from MS1 spectra over time, or both. If a participant used multiple approaches, they were asked to separately report each result. In order to assess the impact of identification and/or peak detection and integration on the results,

ACS Paragon Plus Environment

9

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 43

the participants who started from the raw data were asked to submit their intermediate results containing peptide identifications, as well as extracted peak intensities for submissions that used quantification with LC-MS peaks. Finally, the participants were asked to complete a survey that described their data processing and statistical methodology. In order to provide a comprehensive and fully documented workflow that exemplifies accurate detection of differentially abundant proteins, the raw data were reprocessed by the iPRG group after the conclusion of the study, with the most recent version of Skyline to the date (v3.5). The re-analysis also used a narrower m/z extraction range. Other differences between Skyline v2.6 and v3.5 were minor (e.g., while Skyline v2.6 rounded peak intensities to the nearest integer, v3.5 allowed fractional values). At the same time, the re-analysis investigated more specifically the use of isotopic peaks, and of PSM FDR controls. While the original processing used The iProphet probability cut-off of 0.15, the re-analysis varied the iProphet FDR cut-off from 0.05 to 0.95, considered including or excluding single-hit proteins, and using monoisotopic peaks or multiple isotopic peaks. The workflow characterized by an iProphet probability cut-off 0.15 and filtering for proteins with at least two peptides was considered representative of best practice. These results were not available to the participants, however we summarized them in Supplementary Section 2 and Supplementary Table 3 for the purpose of establishing a benchmark. The repeated data processing resulted in fewer proteins, but more LC-MS features per protein, as compared to the original version. It produced 3,027 proteins, and after between-run alignment, the proteins were represented by 34,783 distinct LC-MS features. The median number of LC-MS features per protein was seven. Overall, 25% of proteins had up to three LC-MS features. 25% of proteins had 14 or more LC-MS features. The reprocessing increased from 17 to

ACS Paragon Plus Environment

10

Page 11 of 43

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

233 the number of peptide queries not assigned a peak, and 2,585 peak intensities were below 0.5. Since the statistical analysis treated zeros as missing values, the re-analysis rounded the values below 0.5 to 0 for consistency between the two versions. The average proportion of such missing values per protein was 0.67%. Overall, the new processing reduced the number of considered proteins, while increasing the quality of the protein information. Unique peptides and protein FDR were estimated as 0.5% and 2.3% respectively (see Supplementary Table 3) by decoy counting, before applying statistical tests for differential abundance.

3 Results 3.1 Overview of the Participants There were sixty anonymous voluntary participants in the study, who produced 51 valid submissions. Almost all of the participants were ABRF members. The majority of the participants were from European countries, followed by Asia, North America and Australia. Most of the participants were employed in academic labs, and were conducting either core services, or investigator laboratory research. The most common job description was bioinformatician (43%), followed by mass spectrometrist (18%), statistician (11%), and software developer (10%). Only 6.6% of the participants were new to label-free proteomics, and half (50%) had at least five years of experience. More than half of the participants indicated that they routinely perform analysis of MS-based proteomics data, including label-free quantitation. Figure 1 presents a short summary of the participants.

Figure 1 here

ACS Paragon Plus Environment

11

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 12 of 43

3.2 Summary of the Submissions Below we provide a summary of the results, while anonymizing the individual submissions. Additional information is available in Supplementary Section 3. 3.2.1 Use of Input Data The participants made their own decisions about whether to start from the raw data and/or from the provided intermediate files. Figure 2A shows that most of the participants started their analyses with the raw data, and used peak intensity quantification. As of August 7, 2016, there were 6,604 ORFs reported for S. cerevisiae in the Saccharomyces Genome Database, of which 5,154 have been verified (http://www.yeastgenome.org/genomesnapshot). Prior proteomic experiments, relying on a combination of fractionation techniques and high-performance instrumentation, reported around 4,000 – 4,500 S. cerevisiae proteins.14,15,16,17 In the laboratory that provided the LC-MS/MS data for this study, a typical 2-hr Q Exactive LC-MS/MS experiment identifies 3,000 – 3,200 proteins with 1% protein-level FDR. The participants of the study made their own choices of methods to control the FDR at the level of spectral, peptide or protein identification. Since the FDR cutoff strongly impacts the number of peptide and protein identifications, the submissions differed in the reported number of proteins, and the numbers are not directly comparable. While most submissions fell well within the expected range, some reported substantially more or substantially fewer proteins. (Submissions 13, 48 and 40 only reported the differentially abundant proteins, and therefore, their number of identified/quantified proteins is artificially small.)

Figure 2 here

3.2.2 Sensitivity and Specificity of Detecting Differentially Abundant Proteins

ACS Paragon Plus Environment

12

Page 13 of 43

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Figure 2B summarizes the number of true positive (TP) out of 36 possible (six spiked-in proteins in six comparisons), and the number of false positive (FP) differentially abundant proteins in each submission across all possible pairwise comparisons of samples. As can be seen from the figure, neither the type of input data nor the quantification strategy alone affected the true positives and false positives results. There was also no direct association between the number of reported proteins and the number of false positives. In other words, the choice of input was not a major determinant of the sensitivity and specificity of the results. The details of the results separated for each pairwise comparison of samples are provided in Supplementary Section 3. 3.2.3 Accuracy of Estimation of Fold Changes Figure 2C summarizes the reported log2-fold changes among background proteins for all pairwise comparisons of samples. Although the log2-fold changes of the background proteins are expected to be close to zero, nearly every submission reported large log2-fold changes for at least some background proteins. At the same time, there was no association between these values and the reported number of false positives in Figure 2B. In other words, the successful submissions could overcome the nuisance variation among the background proteins. Supplementary Figures 3 and 4 compare the accuracy of estimation of the log2-fold changes of the spiked proteins to the true values. As above, there was no association between the observed fold changes and the false positives. It is interesting to note, however, that the variation of the estimates of log2-fold changes between submissions increased with large true fold changes. At the same time, most submissions using counting of tandem mass spectra consistently reported smaller estimates of log2-fold changes than the submissions using intensities of LC-MS peaks.

ACS Paragon Plus Environment

13

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 14 of 43

3.2.4 Computational and Statistical Analysis Strategy The previous results illustrate that the accuracy of detection of differentially abundant proteins is not a simple consequence of the number of the reported proteins or of the observed fold changes. Therefore, we sought to investigate the details of the analytical choices made by the participants, as summarized in Figure 3. The most commonly used data processing tools were Skyline, MaxQuant18 and Progenesis QI (Nonlinear Dynamics/Waters). The most common statistical analysis tools were Perseus19, Progenesis QI, and scripts written in R, Python, Excel, or Matlab. However, there was no single well-performing combination of computational tools. Good performance could be achieved with a fairly diverse set of workflows. At the same time, use of the same analytical workflow did not guarantee uniformly good results. For example, while submissions 3 and 37 (utilizing raw data, MaxQuant and Perseus) had perfect positive predictive values (PPV=1), submissions 21 and 43 used the same workflows but performed poorly. Similarly, submission 6 (utilizing provided peak intensities with Skyline and linear modeling with R) performed well, but submission 23 used the same workflow and performed worse. On the statistical analysis side, the majority of well-performing submissions used some form of comprehensive modeling of spectral counts or feature intensities between runs, and adjusted the results for multiple comparisons to control FDR. All the submissions relying solely on fold change cutoffs performed poorly. Several laboratories submitted multiple entries. Some laboratories were successful with all the analytical strategies. For example, lab 1 used spectral counting with DESeq2 (submission 44) and peak intensities with MSstats (submission 6), and performed well in both. Lab 2 made wellperforming submissions 17, 25 and 49. Submission 17 used the provided Skyline file and

ACS Paragon Plus Environment

14

Page 15 of 43

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

peptide-level inference. Submissions 25 and 49 used MaxQuant. Submission 49 used peptidelevel summarization and inference, while submission 25 used protein-level summarization and inference. All had accurate results. Lab 3 used MaxQuant and pFind20 with different adjustments for multiple testing, which were all successful. Other laboratories also used multiple strategies, but performed worse. For example, Lab 6 compared spectral counting to peak intensities, but used the same statistical analysis (t-test) for both. It performed worse than lab 1, especially for the peak intensities part. Lab 7 used the provided intensity data, and investigated a simple linear model in submission 23, an Empirical Bayesian approach in submission 31, and the R package BayesFactor with different parameters in submissions 19 and 28. All the submissions resulted in a similarly medium performance. Lab 8 made multiple submissions with MaxQuant, Skyline and Mascot+LFQuant21, and different methods for protein-level summarization. Submission 8 and 30 both used MaxQuant, however submission 30 used summed intensities of all tryptic peptides for protein summarization, and submission 8 used summarization by a least squares fit to peak intensities. Submission 34 and 35 used the provided intensity data from Skyline. Submission 34 summed intensities of all tryptic peptides, and submission 35 averaged intensities of all tryptic peptides for protein summarization. All these submissions resulted in poor performance. Lab 9 investigated the impact of protein-level summarization using MaxQuant and Skyline. Submissions 24 and 16 used the original protein-level summarization by MaxQuant. Submission 16 differed by imputing missing value by random forest. Submission 33 used Skyline. All the approaches performed relatively poorly. Overall, we observed that the data analysis expertise of a participant, and not the specific choice of the software tools, was the biggest driver of the accuracy of the results.

ACS Paragon Plus Environment

15

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 43

Figure 3 here

4 Representative Examples of Good Computational and Statistical Practice As seen from the submissions, multiple analytical workflows could produce reasonably accurate results. It is impossible to list all the combinations of the appropriate strategies. Therefore, in this section we provide just one representative example of good statistical practice when working with peak intensities, and with spectral counts, as a benchmark. Again, we would like to emphasize that these are not the only possible solutions. 4.1 Statistical Analysis of LC-MS Peak Intensities As an example of statistical analysis of LC-MS peak intensity data provided to the participants, we used the statistical package MSstats22 v3.4 implemented in R. The data are available in Supplementary Table 2. A sample workflow and the R code are also given in Supplementary R scripts. Step 1: Transformation, normalization and summarization. The first step in data analysis was quality control, data cleaning and initial preprocessing, as described in Supplementary Section 2 and Supplementary Section 4. Using MSstats, the peak intensities were first log2transformed, and normalized by adding or subtracting constants from the log2-intensities of all the features in a run, to equalize the median log2 intensities across all the runs. The intensities of the features of a protein in a run were summarized to obtain a single value per protein per run, using Tukey’s median polish (i.e., a robust version alternative to a two-way Analysis of Variance). The summarization accounted for missing values that occurred when the abundance of a peptide ion was below the limit of detection. More information about summarization used in this manuscript is available in the “User Manual” for MSstats v3.3.11 (www.msstats.org).

ACS Paragon Plus Environment

16

Page 17 of 43

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Step 2: Modeling and model-based comparisons. MSstats automatically detected the structure of the experimental design (a group comparison design). The summarized protein abundances were represented separately for each protein, with a linear mixed-effects model that reflected the design. The model assumed that on the log2 scale the technological variation in the summarized abundances was approximately normally distributed. The parameters of the model were estimated as described in the documentation of MSstats. For each pairwise comparison of samples, model-based summaries and their p-values were obtained to test each protein for differential abundance. Step 3: Adjustment for multiple testing. The p-values of the tests were adjusted for multiple comparisons using the approach by Benjamini and Hochberg23 to control the FDR of the tests for differential abundance. Step 4: Model-based conclusions. The results of tests for differential abundance between each pair of samples are displayed in Figure 4. The y-axis is the minus log10 p-value of a pairwise comparison adjusted to control FDR. High values correspond to statistically significant changes. The x-axis is the log2-fold change between two samples, which in statistical language is sometimes called the “practical significance” of a change. Among the pairwise comparisons of samples, most were true negatives and true positives. There were several false positives, especially when comparing sample 2 against sample 4, and several false negatives. Importantly, many true negatives comparisons had fairly large log2-fold changes. This highlights the importance of a statistical approach for determining differential abundance, as opposed to relying on fold change cutoffs. Supplementary Figure 5 illustrates in more detail that decisions based on fold change cutoffs or on p-values that are not adjusted to control FDR inflate the number of false positives results.

ACS Paragon Plus Environment

17

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 18 of 43

Figure 4 here

4.3 Statistical Analysis of Counts of MS/MS Spectra As an example of statistical analysis of counts of tandem mass spectra provided to the participants, we used DESeq2 package24 v1.12.3 implemented in R and Bioconductor. The data are available in Supplementary Table 2. A sample workflow and the R code are also given in Supplementary R Scripts. An alternative example of statistical analysis with edgeR (not used in this manuscript) can be found in Branson at al.25 Step 1: Transformation and normalization. Since DESeq2 incorporates normalization as part of modeling the data, the statistical analysis of spectral counts did not require prior ad hoc normalization prior to DESeq2. The normalization was done as part of the model building with DESeq2. Step 2: Modeling and model-based comparisons. The spectral counts were modeled separately for each protein, using a generalized linear model. Since the counts for some of the proteins were low, the data could not be expected to follow the Normal distribution, and a different probability distribution must be specified to accurately represent the count nature of the data. DESeq2 assumed that the spectral counts follow a Negative Binomial distribution. The model had a built-in normalization of the spectral counts in each run in form of a parameter called “size factor”.

The parameters of the model were estimated as described in the

documentation of DESeq2. For each pairwise comparison of samples, model-based summaries and their p-values were obtained to test each protein for differential abundance.

ACS Paragon Plus Environment

18

Page 19 of 43

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Step 3: Adjustment for multiple testing. Similarly to the case of peak intensities, the p-values of the tests were adjusted for multiple comparisons using the approach of Benjamini and Hochberg, to control FDR. Step 4: Model-based conclusions. At this stage the statistical analysis for count data is identical to that of the peak intensities. The results of tests for differential abundance are displayed in a volcano plot in Figure 5. In all the pairwise comparison of samples, most of the proteins were true negatives and true positives. There were also a small number of false negatives and false positives. Supplementary Figure 6 illustrates in more detail the importance of the correct model choice when working with the spectral counts data, and of adjusting pvalues to control FDR.

Figure 5 here

4.4 Accuracy of MS Peak Intensities Versus Counts of Tandem Mass Spectra We also used this opportunity to compare the performance of detecting differentially abundant proteins with LC-MS peak intensities to that of counts of tandem mass spectra. Figure 6A shows that among the differentially abundant proteins, quantification by spectral counting compressed the estimates of log2-fold changes. This is consistent with the participants’ findings (Supplementary Figures 3 and 4). In contrast, intensities of LC-MS features resulted in more accurate, but also more variable estimates. Figure 6B shows that for the background proteins, quantification by spectral counting resulted in overall smaller absolute values of estimates of log2-fold changes than quantification by peak intensities. At the same time, spectral counting resulted in fewer false positive conclusions of differential abundance, even when the participants

ACS Paragon Plus Environment

19

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 43

chose to use the complete provided peptide IDs data set without performing any filtering by FDR. These results may seem to suggest that quantification by spectral counting is superior to that by LC-MS peak intensities. However, this conclusion is likely an artifact of the large true fold changes, and of the absence of biological variation in the study. It is very likely that the fold change compression would have a stronger negative impact on the detection of differentially abundant proteins in real-life biological investigations, which usually have smaller true fold changes and larger biological variation.

Figure 6 here

4.5 Impact of Peak Area Extraction To illustrate the impact of peak area extraction (here defined as a combination of peptide spectrum matching, filtering of these matches to control the FDR, and peak area quantification) we conducted a re-analysis using the same database search results, but filtered differently and using a newer version Skyline (3.5) with additional parameter tuning. It is important to emphasize that our goal is not to evaluate or to promote Skyline, but to evaluate the importance of the decisions made upstream of the statistical analysis, and to provide a comprehensive and documented workflow that can be used as a benchmark. Qualitatively similar results can be obtained with other peak area extraction tools, and other decisions about filtering and parameters may have a similar impact. During the re-analysis, chromatograms were extracted from centroided spectra with narrower extraction tolerance (Supplementary Figure 7), more tightly controlled PSM-level FDR, and

ACS Paragon Plus Environment

20

Page 21 of 43

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

protein targets limited to those with at least two peptides detected as described in Section 2. We believe that this represents the best practice of using Skyline for this type of data at the time of writing. The newly processed peak intensity data were analyzed with the same statistical test strategy as described in Section 4.1. Figures 7 and 8 summarize the results, and show that an optimized peak detection, filtering and quantification positively affected the detection of differential abundance. The estimates of log2 fold changes among the spiked proteins were closer to the true values, the tests for differential abundance had fewer false positives and false negatives, and the estimates of log2 fold changes among the background proteins were smaller in absolute value.

Figure 7 here Figure 8 here

We also sought to investigate the impact of the data processing and of PSM FDR control, specifically the impact of 1) including or excluding proteins with only a single distinct LC-MS feature, 2) removing distinct LC-MS features based on identifications with iProphet probability less than a particular cutoff, and 3) using the monoisotopic peak or taking the sum (on the original scale) of the intensity of three isotopic peaks for each distinct LC-MS feature. The statistical analysis of each dataset was the same as described in Section 4.1. Supplementary Table 3 shows that including protein quantifications based on a single PSM, increases the number of proteins considered, but also the opportunity to produce false positive differential results. FDR analysis using varied iProphet probabilities indicates that single-hit proteins are particularly vulnerable to false discovery at the identification level, which may result

ACS Paragon Plus Environment

21

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 22 of 43

in quantifying a signal that does not belong to the targeted protein. This highlights the importance that protein-level control of the identification FDR when working with protein-level quantification. When single PSMs were excluded, the choice of the iProphet cutoff had little impact on the accuracy of the results, except in the case of the somewhat unreasonable cutoff of 0.05 (that kept PSMs with posterior probability as low as 0.05), which undermined the performance. Quantifying an LC-MS feature in a run using the sum of three isotopic peaks reduced the proportion of missing values, and reduced the number of false positive results.

5 Discussion In this manuscript we summarized the outcomes and the follow-up investigations of the 2015 study of the ABRF Proteome Informatics Research Group (iPRG) on detection of differentially abundant proteins in label-free LC-MS/MS experiments. The study had a fairly simple design in that only a few proteins were changing among the samples, and most fold changes were relatively large. Overall, there was a substantial diversity of analytical strategies selected by the participants, and many submissions performed quite well in detecting the changes. Multiple strategies (including the use of spectral counts versus peak intensities, and the use of various software) led to high sensitivity, specificity, and positive predictive values close to 1. This demonstrates that, despite the diversity of the available methods and tools, multiple groups are capable of accurately detecting changes in protein abundance. At the same time, the same analytical workflows could lead to different results, when used by different labs. This conclusion was confirmed by others. For example, researchers outside of the iPRG group re-analyzed the iPRG dataset with four different spectral processing tools (OpenMS,

ACS Paragon Plus Environment

22

Page 23 of 43

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

MaxQuant, Skyline, DeMix-Q) and their results were also unique, and not identical to the iPRG submissions that used the same workflows.26 We concluded that, in order to document the results in scientific publications, it is not enough to simply name the bioinformatics tools. Instead, it is important to provide sufficient details, such as the sequence of the analysis steps, the choice of the parameters, and the intermediate results. Many modern computational frameworks enable such documentation. For example, when performing statistical analysis in R and in MSstats, it is possible to document all these detailed choices in a single Markdown document. This is also one of the preferred documentation formats in the recently launched 2016 iPRG study (http://iprg2016.org), along with the IPython/Jupyter notebook format. Skyline documents for this manuscript can be found at http://panoramaweb.org/labkey/iPRG-2015.url We encourage core facilities to use the fully documented tools, and this reporting software. At the same time, the study showed that different statistical models are needed for different quantification types, particularly for spectral counting compared to peak intensity data. In other words, the statistical approach must be adapted to the experiment at hand. Adjustment for multiple testing is also an important part of the statistical analysis workflow, while the use of a fixed fold change cutoffs did not produce satisfactory results. In this study detection of differentially abundant proteins by spectral counting did well. However, we believe that this was likely influenced by the particular experimental design. The compression of the estimated fold changes, and the semi-quantitative nature of the quantification will likely not perform as well in real-life biological investigations, which have a larger number of differentially abundant proteins across a wider concentration range, smaller fold changes, and larger biological variation. Beyond this study, other researchers have compared the performance of spectral counting and MS signal analysis (peak intensity) using other spike-in datasets. They

ACS Paragon Plus Environment

23

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 24 of 43

found that spectral counting had higher accuracy with strict cutoffs for false positives, and peak intensity had more sensitivity when more false positives are allowed.27 Some differences between these published results and those in this paper may be due to differences in statistical methods between papers. For example, Ramus et al.27 used a Beta-Binomial model for spectral counting and a Welch t-test by Perseus for peak intensity, while the current analysis used DESeq and MSstats. Although evaluation of data processing was not the main objective of the study, it became clear that details of data processing (broadly defined as a combination of peptide spectrum matching, filtering of these matches with respect to protein-level FDR, and peak area estimation) have important implications for protein-level relative quantification. In particular, the use of proteins quantified with a single peptide resulted in a worse performance of detecting differential abundance. In the past, arguments existed in favor28 and against29 reporting single-peptide proteins from the perspective of protein identification. This study emphasized the importance of limiting the dataset to proteins with at least two peptides from the perspective of relative protein quantification, and also the importance of the FDR control. Regardless of the choice, the analysis plan should be established before seeing the data, and documented in a form of a standard operating procedure, to avoid an excessive manipulation and overfitting, The composition of the controlled mixtures also can influence the results. While Zhang et al. analyzed the iPRG dataset and found that Skyline led to fewer false negatives than MaxQuant.26 Ramus et al. analyzed a different spike-in dataset (three different fold changes with 48 variant UPS1 proteins) with data processing tools similar to the ones used in this study. They found that both MaxQuant intensity and LFQ had fewer false negatives than Skyline.27 Therefore, to calibrate the statistical methods, we suggest using as many spike-in datasets as possible. We

ACS Paragon Plus Environment

24

Page 25 of 43

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

hope that the iPRG 2015 dataset will contribute to the collection of publicly available benchmarks that can be used to evaluate the performance of workflows. Finally, it became clear from the study that the expertise of the participant is a key component to the success of detecting differentially abundant proteins. This highlights the importance of interdisciplinary training (e.g., short courses offered by the ABRF, ASMS, or other organization or research groups), which introduces proteomic scientists to both the appropriate use of the bioinformatics tools, and to the underlying data processing and statistical methodology.

6 Supporting information The following files are available free of charge at ACS website http://pubs.acs.org : Supplementary Table ST1.csv : testing result for peak intensity data provided to participants Supplementary Table ST2.csv : testing result for spectral count data Supplementary Table ST3.xlsx : Summary of results with different parameters for processing in Skyline Supplementary Table ST4.csv : testing result for re-processed peak intensity data

Supplementary Figure S1 : Number of differentially abundant proteins of three comparisons (sample 1 vs sample 2, sample 1 vs sample3, sample 1 vs sample 4). Supplementary Figure S2 : Number of differentially abundant proteins of three comparisons (sample 2 vs sample 3, sample 2 vs sample 3, sample 3 vs sample 4). Supplementary Figure S3 : Estimated log2-fold changes of 4 spike-in proteins (Protein A, B, C, D) for all pairwise comparisons

ACS Paragon Plus Environment

25

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 26 of 43

Supplementary Figure S4 : Estimated log2-fold changes of 2 spike-in proteins (Protein E, F) for all pairwise comparisons. Supplementary Figure S5 : Venn diagram, comparing multiple statistical strategies of detecting differentially abundant proteins, with LC-MS peak intensity data provided to the participants. Supplementary Figure S6 : Venn diagram, comparing multiple statistical strategies of detecting differentially abundant proteins, with spectral counts data provided to the participants. Supplementary Figure S7 : The Skyline Full-Scan graph showing the difference in chromatogram extraction windows.

Supplementary Rscript SR1.R : R script to analyze peak intensity data provided to participant Supplementary Rscript SR2.R : R script to analyze spectral count data provided to participant Supplementary Rscript SR3.R : R script to analyze re-processed peak intensity data

List of provided datasets Details of study design and data acquisition Additional information on the submissions by study participants Additional information on representative examples of good statistical practice

7 Acknowledgments We thank the participants of the iPRG 2015 study for their work in preparing the submissions. We thank Steven Blais and Jingjing Deng from the Neubert Lab (Mass Spectrometry Core for Neuroscience), Skirball Institute, NYU School of Medicine, for the LC-MS/MS analysis to produce the data for this study. We acknowledge support from the ABRF and NIH Shared

ACS Paragon Plus Environment

26

Page 27 of 43

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Instrumentation Grant RR027990, NIH NINDS grant P30 NS050276, the National Institute of General Medical Sciences under grant R01GM087221, and 2P50 GM076547/Center for Systems Biology.

REFERENCES 1. Bantscheff, M.; Lemeer, S.; Savitski, M. M.; Kuster, B., Quantitative mass spectrometry in proteomics: critical review update from 2007 to the present. Anal Bioanal Chem 2012, 404 (4), 939-65. 2. Larance, M.; Lamond, A. I., Multidimensional proteomics for cell biology. Nat Rev Mol Cell Biol 2015, 16 (5), 269-80. 3. Gallien, S.; Domon, B., Advances in high-resolution quantitative proteomics: implications for clinical applications. Expert Rev Proteomics 2015, 12 (5), 489-98. 4. Lundgren, D. H.; Hwang, S. I.; Wu, L.; Han, D. K., Role of spectral counting in quantitative proteomics. Expert Rev Proteomics 2010, 7 (1), 39-53. 5. Kall, L.; Vitek, O., Computational mass spectrometry-based proteomics. PLoS Comput Biol 2011, 7 (12), e1002277. 6. Nahnsen, S.; Bielow, C.; Reinert, K.; Kohlbacher, O., Tools for label-free peptide quantification. Mol Cell Proteomics 2013, 12 (3), 549-56. 7. Geer, L. Y.; Markey, S. P.; Kowalak, J. A.; Wagner, L.; Xu, M.; Maynard, D. M.; Yang, X.; Shi, W.; Bryant, S. H., Open mass spectrometry search algorithm. J Proteome Res 2004, 3 (5), 958-64. 8. Kim, S.; Pevzner, P. A., MS-GF+ makes progress towards a universal database search tool for proteomics. Nat Commun 2014, 5, 5277. 9. Eng, J. K.; Jahan, T. A.; Hoopmann, M. R., Comet: an open-source MS/MS sequence database search tool. Proteomics 2013, 13 (1), 22-4. 10. Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R., Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem 2002, 74 (20), 5383-92. 11. Choi, H.; Ghosh, D.; Nesvizhskii, A. I., Statistical validation of peptide identifications in large-scale proteomics using the target-decoy database search strategy and flexible mixture modeling. J Proteome Res 2008, 7 (1), 286-92. 12. Shteynberg, D.; Deutsch, E. W.; Lam, H.; Eng, J. K.; Sun, Z.; Tasman, N.; Mendoza, L.; Moritz, R. L.; Aebersold, R.; Nesvizhskii, A. I., iProphet: multi-level integrative analysis of shotgun proteomic data improves peptide and protein identification rates and error estimates. Mol Cell Proteomics 2011, 10 (12), M111 007690. 13. MacLean, B.; Tomazela, D. M.; Shulman, N.; Chambers, M.; Finney, G. L.; Frewen, B.; Kern, R.; Tabb, D. L.; Liebler, D. C.; MacCoss, M. J., Skyline: an open source document editor for creating and analyzing targeted proteomics experiments. Bioinformatics 2010, 26 (7), 966-8.

ACS Paragon Plus Environment

27

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 28 of 43

14. Ghaemmaghami, S.; Huh, W. K.; Bower, K.; Howson, R. W.; Belle, A.; Dephoure, N.; O'Shea, E. K.; Weissman, J. S., Global analysis of protein expression in yeast. Nature 2003, 425 (6959), 737-41. 15. de Godoy, L. M.; Olsen, J. V.; Cox, J.; Nielsen, M. L.; Hubner, N. C.; Frohlich, F.; Walther, T. C.; Mann, M., Comprehensive mass-spectrometry-based proteome quantification of haploid versus diploid yeast. Nature 2008, 455 (7217), 1251-4. 16. Webb, K. J.; Xu, T.; Park, S. K.; Yates, J. R., 3rd, Modified MuDPIT separation identified 4488 proteins in a system-wide analysis of quiescence in yeast. J Proteome Res 2013, 12 (5), 2177-84. 17. Hebert, A. S.; Richards, A. L.; Bailey, D. J.; Ulbrich, A.; Coughlin, E. E.; Westphall, M. S.; Coon, J. J., The one hour yeast proteome. Mol Cell Proteomics 2014, 13 (1), 339-47. 18. Cox, J.; Mann, M., MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol 2008, 26 (12), 1367-72. 19. Tyanova, S.; Temu, T.; Sinitcyn, P.; Carlson, A.; Hein, M. Y.; Geiger, T.; Mann, M.; Cox, J., The Perseus computational platform for comprehensive analysis of (prote)omics data. Nat Methods 2016, 13 (9), 731-40. 20. Wang, L. H.; Li, D. Q.; Fu, Y.; Wang, H. P.; Zhang, J. F.; Yuan, Z. F.; Sun, R. X.; Zeng, R.; He, S. M.; Gao, W., pFind 2.0: a software package for peptide and protein identification via tandem mass spectrometry. Rapid Commun Mass Spectrom 2007, 21 (18), 2985-91. 21. Zhang, W.; Zhang, J.; Xu, C.; Li, N.; Liu, H.; Ma, J.; Zhu, Y.; Xie, H., LFQuant: a labelfree fast quantitative analysis tool for high-resolution LC-MS/MS proteomics data. Proteomics 2012, 12 (23-24), 3475-84. 22. Choi, M.; Chang, C. Y.; Clough, T.; Broudy, D.; Killeen, T.; MacLean, B.; Vitek, O., MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments. Bioinformatics 2014, 30 (17), 2524-6. 23. Benjamini, Y. H., Y., Controlling the false discovery rate: a practical and powerful approach to multiple testing. J.R. Statist. Soc. B 1995, 57 (1), 289-300. 24. Love, M. I.; Huber, W.; Anders, S., Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014, 15 (12), 550. 25. Branson, O. E.; Freitas, M. A., Tag-Count Analysis of Large-Scale Proteomic Data. Journal of Proteome Research 2016. 26. Zhang, B.; Kall, L.; Zubarev, R. A., DeMix-Q: Quantification-Centered Data Processing Workflow. Mol Cell Proteomics 2016, 15 (4), 1467-78. 27. Ramus, C.; Hovasse, A.; Marcellin, M.; Hesse, A. M.; Mouton-Barbosa, E.; Bouyssie, D.; Vaca, S.; Carapito, C.; Chaoui, K.; Bruley, C.; Garin, J.; Cianferani, S.; Ferro, M.; Van Dorssaeler, A.; Burlet-Schiltz, O.; Schaeffer, C.; Coute, Y.; Gonzalez de Peredo, A., Benchmarking quantitative label-free LC-MS data processing workflows using a complex spiked proteomic standard dataset. J Proteomics 2016, 132, 51-62. 28. Gupta, N.; Pevzner, P. A., False discovery rates of protein identifications: a strike against the two-peptide rule. J Proteome Res 2009, 8 (9), 4173-81. 29. Reiter, L.; Claassen, M.; Schrimpf, S. P.; Jovanovic, M.; Schmidt, A.; Buhmann, J. M.; Hengartner, M. O.; Aebersold, R., Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry. Mol Cell Proteomics 2009, 8 (11), 2405-17.

ACS Paragon Plus Environment

28

Page 29 of 43

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

29

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 30 of 43

TABLE / FIGURE Legends Table 1. Summary of the study design. Four samples were analyzed in triplicate in random order by data-dependent LC-MS/MS (ThermoFisher Q Exactive). Each sample contained 200 ng yeast tryptic digest spiked with the indicated amounts (in fmols) of tryptic digest of six individual proteins.

Figure 1. Summary of participants. Continent, type of facility, job description, and years of experience in proteomics.

Figure 2. Summary of the submissions. (A) Number of reported proteins in each submission. The participants made their own choices regarding the False Discovery Rate control at the level of spectral, peptide or protein identification, and therefore the confidence in the reported proteins is not necessarily comparable. The panels indicate the chosen quantification strategy. The colors indicate the type of input data used for the analysis, such as raw data, raw data along with the provided intermediate files to check their results (indicated as “raw+check”), the provided peptide IDs, and the provided extracted peak intensities. The horizontal lines at n= 5766 and n=3766 indicate the number of proteins with spectral counts and the number of proteins with peak intensities in the provided data. (B) Number of differentially abundant proteins across all six pairwise comparisons of samples. The bars indicate the number of true positives (below zero) and false positives (above zero). The panels indicate different quantification methods and the color of a bar indicates the type of input data used for the analysis. Five submissions had more than 300 false positives, and those are indicated with arrows and actual values. Red dashed line shows the maximal number of differentially abundant proteins (36) that could be reported,

ACS Paragon Plus Environment

30

Page 31 of 43

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

despite 6 cases where the true change was less than 20%. Supplementary Figure 1 and 2 show the same plots for each sample-to-sample comparison. (C) Estimated log2-fold changes for background proteins for all pairwise comparisons. The panels indicate different quantification methods and the color of a box indicates the type of input data used for the analysis. Data points with estimated log2-fold change < 10 or > 10 are indicated as a grey asterix, and a number of outliers.

Figure 3. Summary of the analytical choices made in 48 out of the 51 submissions. The submissions were characterized according to their input data, peptide identification (‘others’: pFind,

PEAK

DB,

X!Tandem/MassChroQ,

MS-GF+,

SIEVE,

PeptideShaker,

X!Tandem/IdentiQuantXLTM,

MSAmanda/SIEVE,

Mascot/Genedata

Expressionist

software), feature quantification, summarization, choice of statistical analysis software (‘others’: IdentiQuantaXL, PANDA-view, SIEVE, PEAK Q), statistical inference (‘others’: Bayesian linear random effects model, quasi-Poisson generalized linear model, Wilcoxon Rank Sum test), and control of false discovery (FDR) rate in the list of differentially abundant proteins (‘others’: local FDR, positive FDR). The submissions were first separated into 3 sub-panels by positive predictive value (PPV), and then ordered by the number of true positives within each sub-panel. Three submissions did not provide sufficient details of their analyses, and were excluded from the figure.

Figure 4. Volcano plot display of the results of the statistical analysis with MSstats of LC-MS peak intensity data provided to the participants. Y axis: minus log10 p-value of a pairwise comparison between two samples, adjusted to control the False Discovery Rate in the list of

ACS Paragon Plus Environment

31

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 32 of 43

differentially abundant proteins in this comparison. X axis: log2-fold change between two samples. The dashed horizontal line corresponds to FDR = 0.05.

Figure 5. As in Figure 4, displaying the results of the statistical analysis with DESeq of counts of MS/MS spectra provided to the participants.

Figure 6. Comparison of the results obtained with spectral counts and LC-MS peak intensities provided to the participants. The differentially abundant proteins were based on the adjusted pvalue cutoff of 0.05. (A) Estimated versus true log2 fold changes among the spiked proteins in the 36 possible pairwise comparisons of samples, and the corresponding decisions of differential abundance. (B) Estimated versus true log2 fold changes among the background proteins, in all the possible pairwise comparisons of samples, and the corresponding decisions of differential abundance. The true log2 fold change for all the background proteins is zero. The line ‘cannot test’ states the number of proteins and pairwise comparisons between samples where there the protein was absent in one of the samples, or had no replicates in both samples.

Figure 7. Volcano plot for intensity data, which was re-processed with recent version of Skyline. A volcano plot demonstrating magnitude and significance of the protein comparisons between the four samples. The dashed gray line shows where adjusted p-value = 0.05, with points above the line having adjusted p-value < 0.05 and points below the line having adjusted p-value > 0.05. This plot is colored when identifying the differentially abundant proteins.

Figure 8. Comparison for processing in Skyline. Peak intensity data provided to participants using 0.15 iProphet cut-off and Orbitrap MS1 extraction defaults for Skyline v 2.6.0.6851 vs. re-

ACS Paragon Plus Environment

32

Page 33 of 43

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

processed using 0.15 iProphet cut-off, no single-hit proteins and ±10 ppm extraction with Skyline v 3.5.0.9319. As in Figure 6, plots (A) and (B) display the result from re-processed peak intensity data. Tables (A) and (B) compare the results with peak intensity data provided to participant.

ACS Paragon Plus Environment

33

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 34 of 43

‘for TOC only ‘ : Table of Contents graphic

ACS Paragon Plus Environment

34

Page 35 of 43

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Table 1 Samples A B C D E F

Name

Origin

Ovalbumin Myoglobin Phosphorylase b Beta-Galactosidase

Chicken Egg White Equine Heart Rabbit Muscle Escherichia Coli

Molecular Weight 45KD 17KD 97KD 116KD

Bovine Serum Albumin Carbonic Anhydrase

Bovine Serum Bovine Erythrocytes

66KD 29KD

1 ACS Paragon Plus Environment

1

2

3

4

65 55 15 2 11 10

55 15 2 65 0.6 500

15 2 65 55 10 11

2 65 55 15 500 0.6

Figure 1

Journal of Proteome Research (A) Continent

40

Page 36 of 43 (B) Facility type

40

1 32 53.33 % 2 30 26 330 43.33 % 4 20 520 15 25.00 % 12 6 20.00 % 9 8 15.00 % 710 10 13.33 % 5 4 3 8.33 % 8 6.67 % 1 5.00 % 1.67 % 90 0 10 Europe Asia North Australia South NA Core+ Core Research Software America NZ America Research developer 11 12 (C) Job description (D) Experience in label-free proteomics 13 40 40 14 15 30 30 26 16 43.33 % 17 19 31.67 % 20 20 18 13 11 11 21.67 % 19 18.33 % 18.33 % 8 7 10 10 6 13.33 % 20 5 11.67 % 4 10.00 % 3 8.33 % 2 6.67 % 21 5.00 % 3.33 Paragon % ACS Plus Environment 220 0 Mass Statistician Software Lab Director NA < 1 year 1−2 years 3−4 years 5−10 years >10 years 23Bioinformatician Spectrometrist Developer Scientist Manager 24

5 8.33 %

NA

5 8.33 %

NA

Figure 2

Page 37 of 43

A

Spectral counting

Intensity based

6000

Hybrid

Input data

NA

Peaks

Peptide IDs

Raw+check 4000

Raw NA

Provided # of proteins with peak intensities = 3766

40 46 12 44 47

9 51 45

18 32 50

Spectral counting

Hybrid

NA

Study ID, ordered by number of proteins Intensity based 447 514 734 2002

330

False positive

300

200

100

0

True positive

45 9 51

Spectral counting

Hybrid

50 32 18

12 44 46 40 47

3 25 37 26 13 42 6 49 17 36 4 11 20 48 31 1 23 19 7 8 14 15 21 43 28 35 34 30 2 38 16 22 27 24 39 33 29 41 10 5

36

Study ID, ordered by number of false positives

12

Intensity based 4

2

6

24

68 317

273

46

2094

NA 10

9

6

3

Study ID, ordered by number of false positives

50 32 18

ACS Paragon Plus Environment

45 9 51

0

12 44 46 40 47

Absolute value of log2 fold-change in background proteins

0

13 48 6 5 26 27 14 25 3 15 37 36 1 20 38 29 21 43 8 30 4 22 10 34 35 24 16 33 2 7 17 23 19 31 28 41 49 39 11 42

2000

3 25 37 26 13 42 6 49 17 36 4 11 20 48 31 1 23 19 7 8 14 15 21 43 28 35 34 30 2 38 16 22 27 24 39 33 29 41 10 5

Reported proteins

Provided # of proteins with spectral counts = 5766

Differentially abundant proteins ‘False positives’ and ‘True positives’

1 2 3 4 5 6 7 8 9 10 11 12 13 B14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 C31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49

Journal of Proteome Research

Figure 3

PPV < 0.2

0.2 < PPV < 0.7

PPV > 0.7

3 3 2 1 2 2 4 1 5 6 8 5 7 7 7 7 8 9 9 8 8 8 4 9

6

11 42 20 48 25 13 12 44 49 36 26 17 4 40 6 3 45 46 37

0.857 0.972 0.821 0.744 1.000 1.000 1.000 1.000 0.963 0.897 1.000 0.920 0.821 0.767 0.955 1.000 0.708 0.867 1.000

36 35 32 32 30 30 27 26 26 26 24 23 23 23 21 19 17 13 10

6 1 7 11 0 0 0 0 1 3 0 2 5 7 1 0 7 2 0

8 1 14 28 21 43 19 31 23 15

0.559 0.580 0.359 0.295 0.294 0.294 0.511 0.575 0.500 0.267

33 29 28 28 25 25 24 23 23 20

26 21 50 67 60 60 23 17 23 55

30 16 27 24 38 47 29 35 22 34 2 39 33 9 51 10 5 41 7

ACS

Page 38 of 43

Tr ue po si Fa tiv ls e e po si tiv e

PP V

ID

In pu t Id en tif Q ica ua tio nt n Su ifica m t m ion St ari z at is atio t In ical n fe re sof n t FD ce wa re R

ID Su bm is si on

La b

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38

Journal of Proteome Research

0.174 0.145 0.134 0.120 0.141 0.181 0.057 0.182 0.126 0.145 0.132 0.091 0.074 0.085 0.052 0.023 Paragon Plus 0.008 0.027 0.172

Input

Peaks Peptide ids Raw+check Raw

Identification Skyline MaxQuant Progenesis Others

Quantification

Feature intensity Spectral counting Hybrid

Summarization

Protein summarization / Protein−level inference Peptide summarization / Protein−level inference Peptide summarization / Peptide−level inference

Statistical software

Persus Progenesis QI Others R, Excel, MatLab, Python In−house scripts

Inference

t−test / SAM's t test

152 32 ANOVA Linear (mixed−effects) model 30 177 Others 194 30 30 219 FDR 28 170 Benjamini Hochberg 27 122 Permutation FDR Others 27 447 Manual validation 26 117 FC cutoff 26 180 No adjustment 25 148 24 158 No information 24 240 21 264 18 194 18 330 17 734 Environment 17 2002 14 514 5 24

Figure 4 Page 39 of 43

Journal of Proteome Research

Sample2 - Sample1

-Log10 (adjusted p-value)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

Sample3 - Sample1

Sample3 - Sample2

8 D



D

6



B ●

4 B



C

2



F

E

F ●



B ●

0



E





A

● C

E

A ● ●

F

Sample4 - Sample1

C

A ●

● ●



D

Sample4 - Sample2

Sample4 - Sample3 ● E

8



E ●

6 ●

● D

F

● ●

D

4

C ●

● C

2

F ●

-10

-5

B

F



0

5

10

-10

-5

0

5

10

-10

Log2-fold change, peak intensities ACS Paragon Plus Environment

Differentially abundant spike−in protein Not differentially abundant spike−in protein

B ●

D

B

A ●

A ●

0



E

-5

A ●● ●

C

0

Differentially abundant background protein Not differentially abundant background protein

5

10

Figure 5

Journal of Proteome Research

Sample2 - Sample1

-Log10 (adjusted p-value)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

Sample3 - Sample1

F ●● D E ●● C

8

Page 40 of 43







B

D

6

Sample3 - Sample2 ●

C

F

● B

C ●

4

B ●

2 A

0

F ●●●



Sample4 - Sample1 E

8

D E ●● ● A

A

Sample4 - Sample2





E



F

C ●

● A

6

E

Sample4 - Sample3 E ● ● B



D B

● ●

A D





C

4

F ● ●



2

F

D ●

A

B

0

C ●



-10

-5

0

5

10

-10

-5

0

5

10

-10

-5

0

Log2-fold change, spectral counts ACS Paragon Plus Environment

Differentially abundant spike−in protein Not differentially abundant spike−in protein

Differentially abundant background protein Not differentially abundant background protein

5

10

Figure 6 Page 41 of 43

Journal of Proteome Research

A

Spiked proteins Peak intensities

Estimated log2−fold change

10.0

7.5

5.0

2.5

0.0 0

2.5

5

7.5

10 0

True log2−fold change

36 comparisons among 6 spike-in proteins Differentially abundant Not differentially abundant

2.5

5

Spectral counts

Peak Intensities

27 9

25 11

7.5

10

Background proteins 75% Quantile = 0.1092 Median = 0.0477 25% Quantile = 0.0175

75% Quantile = 0.2379 Median = 0.1022 25% Quantile = 0.0429

7.5

Estimated log2−fold change

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 B 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44

Spiked proteins Spectral counts

5.0

Background proteins

Spectral counts

Peak Intensities

Number of background proteins Total number of comparisons Differentially abundant Not differentially abundant Cannot test*

5770 34620 0 34620 0

3760 22560 19 22532 9

2.5

0.0

ACS Paragon Plus Environment Spectral counts

Peak intensites

Figure 7

Journal of Proteome Research

Sample2 - Sample1

-Log10 (adjusted p-value)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

8



C

F

Page 42 of 43

Sample3 - Sample1

● D

Sample3 - Sample2



C







D

C

● F

6 B



4

E ●

2

● B

A

B



● ● E ●



Sample4 - Sample1 8

Sample4 - Sample2 F

C ● ●

6



D



Sample4 - Sample3 E

C



D

A

E

● B

● F

A ●

C F ● ●



0



● B

● A

B

-5

D ●

2

-10



E



0

● D

F



4



A A

0

E



5

10

-10

-5

0

5

10

-10

Log2-fold change, peak intensities ACS Paragon Plus Environment

Differentially abundant spike−in protein Not differentially abundant spike−in protein

-5

0

Differentially abundant background protein Not differentially abundant background protein

5

10

Figure 8

Page 43 of 43

A

B

Spiked proteins Peak intensities, v 3.5.0.9319

Background proteins Peak intensities, v 3.5.0.9319

Estimated log2−fold change

10.0

Estimated log2−fold change

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Journal of Proteome Research

7.5

5.0

2.5

0.0

7.5

75% Quantile = 0.1626 Median = 0.0799 25% Quantile = 0.0351

5.0

2.5

0.0 0.0

2.5

5.0

7.5

10.0

True log2−fold change

36 comparisons among 6 spike-in proteins Differentially abundant Not differentially abundant

Peak intensities Peak intensities, v 2.6.0.6851 v 3.5.0.9319 25 11

28 8

Background proteins

Peak intensities v 2.6.0.6851

Peak intensities, v 3.5.0.9319

Number of background proteins Total number of comparisons Differentially abundant Not differentially abundant Cannot test*

3760 22560 19 22532 9

3021 18126 1 18125 0

ACS Paragon Plus Environment