Quantitative Performance Evaluator for Proteomics (QPEP): Web

Centre for Mathematics and its Applications, Mathematical Sciences Institute, ...... are observed Biostat 2016, 17 (1) 16– 28 DOI: 10.1093/biostatis...
1 downloads 0 Views 4MB Size
Article pubs.acs.org/jpr

Quantitative Performance Evaluator for Proteomics (QPEP): Web-based Application for Reproducible Evaluation of Proteomics Preprocessing Methods Dario Strbenac,† Ling Zhong,‡ Mark J. Raftery,‡ Penghao Wang,†,# Susan R. Wilson,§,∥ Nicola J. Armstrong,†,⊥ and Jean Y. H. Yang*,† †

School of Mathematics and Statistics, University of Sydney, Sydney, New South Wales 2006, Australia Bioanalytical Mass Spectrometry Facility, University of New South Wales, Sydney, New South Wales 2052, Australia § School of Mathematics & Statistics, University of New South Wales, Sydney, New South Wales 2052, Australia ∥ Centre for Mathematics and its Applications, Mathematical Sciences Institute, Australian National University, Canberra, Australian Capital Territory 0200, Australia ‡

S Supporting Information *

ABSTRACT: Tandem mass spectrometry is one of the most popular techniques for quantitation of proteomes. There exists a large variety of options in each stage of data preprocessing that impact the bias and variance of the summarized protein-level values. Using a newly released data set satisfying a replicated Latin squares design, a diverse set of performance metrics has been developed and implemented in a web-based application, Quantitative Performance Evaluator for Proteomics (QPEP). QPEP has the flexibility to allow users to apply their own method to preprocess this data set and share the results, allowing direct and straightforward comparison of new methodologies. Application of these new metrics to three case studies highlights that (i) the summarization of peptides to proteins is robust to the choice of peptide summary used, (ii) the differences between iTRAQ labels are stronger than the differences between experimental runs, and (iii) the commercial software ProteinPilot performs equivalently well at between-sample normalization to more complicated methods developed by academics. Importantly, finding (ii) underscores the benefits of using the principles of randomization and blocking to avoid the experimental measurements being confounded by technical factors. Data are available via ProteomeXchange with identifier PXD003608. KEYWORDS: Latin square design, iTRAQ, benchmarking, reproducible research, web server



INTRODUCTION

caused by batch effects rather than the biological condition of interest. Frameworks that facilitate open comparison between different preprocessing methods by a community-wide joint effort highlight the appropriate normalization procedures for different kinds of analyses. They also enable a deeper

Evaluation of competing methods for preprocessing using designed experiments is necessary to identify which summarization and normalization methods are likely to perform best for biological data sets. It has been established that choices made for low-level analysis may have significant impact on biological analyses and thus affect eventual scientific conclusions, such as reporting protein abundance differences © XXXX American Chemical Society

Received: October 8, 2016

A

DOI: 10.1021/acs.jproteome.6b00882 J. Proteome Res. XXXX, XXX, XXX−XXX

Article

Journal of Proteome Research

12 normalization methods to be applied to the data set, disregarding whether the methods are applicable to the study’s experimental design. Also, it provides no flexibility for the user to provide a data set that has been normalized using their own novel method. Second, Normalyzer is intended for generic biological experiments rather than designed evaluation studies, so the evaluations (e.g., coefficient of variation, quantile− quantile plots) are restricted to examining unwanted variance. Finally, unlike a web-based evaluation application, it is standalone software, and comparisons between methods of different researchers cannot be centrally disseminated. This work provides an interactive and comprehensive evaluation utilizing the Latin square data set that we have recently designed and generated. The complex design of this data set provides an opportunity to thoroughly examine various preprocessing algorithms in proteomics and directly compare them. Here we describe a novel set of performance metrics implemented in QPEP for comparison of quantitation preprocessing. This is the first comprehensive evaluation of a range of summarization and normalization methods for proteomics data.

understanding of the strengths and weaknesses of various algorithms. AffyComp1 was the first such competition that evaluated microarray technology and was novel because it allowed users to upload their own preprocessing of the raw data and compare it with other methods via a web-based scoreboard. This allowed thorough evaluation that concluded that Affymetrix’s default MAS normalization method was one of the worst performing,2 being beaten by most other methods developed by academic researchers. Other studies for copy number microarrays and RNA-seq later followed,3,4 with similarly complex experimental designs to assess bias and variance in great detail. Comprehensive evaluation of protein quantitation, rather than protein identification, has rarely been done for proteomics data. The most appropriate methods to use and how various methods perform in relationship to each other are still unknown. The earliest evaluation study consisted of six proteins spiked into a complex mixture at six different concentrations to form six samples, each measured by a separate unlabeled experiment.5 Considered together, the six designed samples form a Latin square (Section S-1). Each sample was analyzed three times. Performance evaluation was minimal; variance was indirectly assessed by clustering the samples using k-means clustering and also by the standard deviation of the three technical replicates for each protein. Bias was indirectly addressed by plotting the protein quantitation and expected quantitation of every sample. Other limitations of the study are that no comparisons of any algorithms or software parameters were done and the size of the spike-in proteins was not systematically varied as part of the experimental design. A small number of other data sets that make use of spike-ins at specific concentrations have been published,6−11 but they have used simple dilutions of particular samples or compared two conditions that have a presence or absence of spike-in proteins, which limits the aspects of bias and variance that are able to be evaluated. Also, only one of the aforementioned studies used a labeling technique,6 namely, using iTRAQ. There are many potential ways to summarize numerous peptide values into a single summary value for a protein. This has only been considered in Matzke and colleagues’ study.8 However, that study only considered factors like the user interface and quality of the end-user software documentation in the evaluation of the summarization possibilities. Popular approaches include the median peptide,12 sum of the three peptides with the largest quantities,13 and sum of all peptides divided by the number of theoretically observable peptides.14 Summaries may also statistically take into account the peptides that are shared between two or more proteins, as is done by SCAMPI.15 The SCAMPI algorithm is not evaluated here due to the R package protiq containing it being no longer being maintained, according to CRAN. Summarization methods can have significant impact on inferred protein quantities, but their evaluation in terms of bias and variance has been limited in scope to date. Because of the rarity of comprehensively designed data sets in the proteomics field, there is limited software available to evaluate and compare preprocessing algorithm performance. Normalyzer16 is the first software that attempts to enable such comparisons. The software allows users to load their proteomics data set into R and applies 12 normalization methods to it, such as LOESS normalization, before producing simple data set summaries, such as mean-variance plots. Normalyzer has a number of limitations. First, it forces all



METHODS The data set used for evaluation consists of an experimental design and preprocessing of the raw mass spectrometry data into protein quantities.



EXPERIMENTAL DESIGN The experiment consists of 21 known proteins, each purchased from Sigma-Aldrich (Table S-1), and the yeast proteome. Here the non-yeast proteins are referred to as spike-in proteins and yeast proteins as background proteins. The spike-in proteins selected for purchasing were based on a set of characteristics intended to increase their identifiability and provide broad generalizability. The characteristics and their justification are shown in Table S-2. The yeast proteome extracted from wildtype Saccharomyces cerevisiae Y0000 (BY4741; MATa; his3Δ1; leu2Δ0; met15Δ0; ura3Δ0) was grown in liquid-rich medium (2% glucose, 1% peptone, 1% yeast extract) and harvested from logarithmically grown cultures (OD600-1). It was added to better simulate the background complexity of a typical protein mixture. All proteins were labeled in each run with eight iTRAQ reagents. For each experimental run each spike-in protein was added at a different concentration in each iTRAQ channel, except in channel 121, referred to as the internal reference channel. The other seven channels are referred to as dynamic channels. A sample is defined as a particular iTRAQ channel of a particular experimental run. The proteins are grouped into seven groups, such that each group contains one small, one medium, and one large protein. The amount of each protein group used in each channel is shown for Run 1 (Table 1). The full set of design matrices is available as Table S-3. Preprocessing

To summarize the raw data to normalized protein-level quantities, five out of six key stages of preprocessing (peptide−spectrum matching, peptide−protein assignment, quantitation, summarization, missing value treatment, and normalization) were performed (Figure 1). Missing value treatment is not evaluated because this data set has a low percentage of missing peptides (Table S-4) due to the high concentration of the spike-in proteins used. Missing peptides B

DOI: 10.1021/acs.jproteome.6b00882 J. Proteome Res. XXXX, XXX, XXX−XXX

Article

Journal of Proteome Research Table 1. Experimental Design for Run 1a protein group iTRAQ channel

a

113 114 115 116 117 118 119 121

A

B

C

D

E

F

G

yeast proteome

1 2 4 8 16 32 64 8

2 4 8 16 32 64 1 8

4 8 16 32 64 1 2 8

8 16 32 64 1 2 4 8

16 32 64 1 2 4 8 8

32 64 1 2 4 8 16 8

64 1 2 4 8 16 32 8

1 1 1 1 1 1 1 1

Values are volumes of diluted protein used, in units of microliters.

Figure 1. Flowchart of typical proteomics data preprocessing. Steps of preprocessing are represented by boxes and particular methods to evaluate are listed on the right side. Gray steps or gray option lists are not evaluated in this study.

were ignored in the run(s) in which they were missing. Mass spectra were acquired from an ABI QStar Elite instrument. In brief, iTRAQ samples were concentrated and desalted onto a micro precolumn (C18, Pepmap, 300 μm × 5 mm, Dionex, Netherlands) with H2O/CH3CN (98:2, 0.1% TFA) at 15 μL/min. After a 4 min wash, the trap was switched (Switchos) into line with a fritless analytical column (75 μm × ∼15 cm) containing C18 RP packing material (Magic, 3 μ, 200 Å, Michrom Bioresources). Peptides were eluted using a linear gradient of H2O:CH3CN (98:2, 0.1% formic acid-buffer A) to H2O/CH3CN (36:64, 0.1% formic acid-buffer B) at 200 nL/min over 30 min. High voltage (2300 V) was applied through a low union at the column inlet and the outlet positioned ∼1 cm from the orifice (T = 150 °C) of the mass spectrometer. Positive ions were generated by electrospray and the QStar operated in information-dependent acquisition mode (IDA). A time-of-flight MS survey scan was acquired (m/z 350−1750, 0.5 s), and the three largest multiply charged ions (counts >50, charge state ≥2 and ≤4) were sequentially selected by Q1 for MS/MS analysis. Nitrogen was used as the collision gas, enhance all for iTRAQ was enabled, and an optimum collision energy was automatically chosen (based on charge state and mass). The fragment intensity multiplier setting was 20 and tandem mass spectra were accumulated for a maximum of 2 s (m/z 65−2000) before dynamic exclusion for

30 s. Peak lists were generated and searched using ProteinPilot’s Paragon method17 against a custom protein database, which contains all proteins for all source organisms in the March 2015 release of the SwissProt18 database. Version 5.0 of ProteinPilot was used. The cysteine alkylation variable was set to iodoacetamide, the protein digestion variable was set to trypsin, the special factors list had carbamidomethylation of cysteine as a fixed modification and oxidation of methionine as a variable modification, and the search effort variable was set to Thorough ID. The instrument model was set to QSTAR Elite ESI, which sets the mass tolerances to 0.2 m/z for both MS and MS/MS and allows up to two missed cleavages. To report only proteins with 95% or higher detection confidence, a targetdecoy database search was performed and the Detected Protein Threshold variable was set to 1.3. Approximately the same number of proteins were identified from the background set and the spike-in set (Table S-5). For peptide quantitation, the reporter ion areas were corrected for isotopic impurity using the default impurity table provided by ProteinPilot. Default quantitation, which uses bias and background correction, was evaluated in the first case study. In an alternate preprocessing scenario, bias factor scaling and background correction were both turned off. Quantitated spectra were exported from ProteinPilot to text files using the Peptide Summary feature of the Export panel. Spectra with a C

DOI: 10.1021/acs.jproteome.6b00882 J. Proteome Res. XXXX, XXX, XXX−XXX

Article

Journal of Proteome Research

changes are determined based on the design matrices of each of the seven runs. Quantile regression is used, as implemented by the rq function in R. Default parameters of the function are used. An explanation of the metrics based on a regression line is presented in Table 2. Two kinds of regression lines may be constructed for this evaluation. All four of the above metrics may be calculated for both lines. In the absolute quantitation scenario, each protein summary in a sample is divided by the median summary of the proteins present at the median design amount of 8 μL in that sample and regressed on the expected fold change. In this unconventional way, the measured amounts of different proteins are directly compared with each other. The main experimental factors to affect this type of regression are precursor ion coelution and different measurement efficiencies of different peptides. The second kind of linear regression is for relative quantitation. The protein summary of each protein in a dynamic channel is divided by the same protein’s summary in the internal reference channel. The key factors to affect these metrics are systematic differences between the samples and also precursor ion coelution. These two regression lines provide a total of eight performance metrics. Summation Metrics. An additional set of metrics examines unwanted variance by considering the total amount of protein per channel, protein design group, or protein size group based on summation within the particular grouping, followed by calculation of the coefficient of variation (CV). The ideal value for all summation metrics is zero. By design, all spike-in protein summaries for each dynamic channel have the same total amount of protein used. Also, every protein is present at the same total amount across all channels within a run. The metrics that make use of these important properties of the experimental design are described in Table 3. These three summation metrics are calculated within each run and also across all runs, resulting in six separate metrics. Lastly, the CV Median Run Proteins metric is neither a regression nor a summation metric. It finds the median protein

Used value of 0 were discarded. Such an annotation indicates a peptide that is shared between multiple proteins, has the iTRAQ reagent balance group detected in locations other than the N-terminal, or is missing the iTRAQ reagent at the N-terminal. Spectra were summarized to peptides by summing the areas for each channel for all spectra that had the same peptide sequence within each run and had confidence of >95%. These peptide summaries form the basis of evaluations made in the last two case studies. Proteins with at least one matched peptide were quantified using the Top 1 and Median methods. Only proteins with at least three matched peptides could be quantified using the Top 3 method. The raw mass spectra and peptide-spectrum matches are available from the ProteomeXchange Consortium as data set PXD003608 via the PRIDE database.19



RESULTS AND DISCUSSION A broad set of performance metrics to objectively assess the quality of protein quantitation has been developed. To make the metrics interactively accessible to a wide audience, QPEP enables custom comparisons of them to be made and new preprocessing approaches to be evaluated simply by uploading text files of protein-level summaries for automated metric calculation. Performance Metrics

Eight distinct performance metrics are proposed for absolute and relative quantitation scenarios, providing a total of 15 performance metrics. Absolute quantitation involves calculating ratios using quantitation of different proteins within a particular sample, whereas relative quantitation takes the ratio of measurements of a particular protein between two different samples. The metrics may further be categorized as regression metrics and summation metrics. Regression Metrics. Regression metrics examine the relationship between expected log2-fold change versus experimental log2-fold change based on a linear regression fit between these two kinds of fold changes. The expected fold Table 2. Details of the Regression Metrics ideal value

interpretation

Intercept Slope

metric

0 1

Line Difference Sum

0

Residual Sum

0

The further the intercept is from zero, the more systematically inaccurate the fold changes are. An ideal value of one means that the measured fold changes increase at the same rate as they do in the experimental design. Values between zero and one suggest that the fold changes are systematically underestimated. The sum of the absolute differences between the fold change values predicted by the regression line and the expected values from the experimental design. It combines biases that affect either the slope or the intercept into a single value and provides an overall goodness of fit between the observed and expected fold changes. Sum of absolute residuals. It could be close to ideal, even if the slope and intercept are far from their ideal values.

Table 3. Details of the Summation Metrics metric

ideal value

interpretation

CV Channel Sums

0

Protein summaries are summed within each iTRAQ channel and the CV is calculated. It provides an indication how similar the total protein quantity is between channels.

CV Protein Group Sums

0

Protein summaries are summed within each protein group across all dynamic channels and the CV is calculated. Factors that affect this metric are proteins that have systematically high or low signal, undetected proteins, and binding of variable efficiency between peptides and particular iTRAQ labels. It provides an indication of how similar the seven protein groups are.

CV Size Group Sums

0

Proteins are alternatively grouped into three size groups: small, medium, and large. Seven proteins belong to each group. Protein summaries are summed within each group, and the CV is calculated. This metric is particularly affected by the decreasing number of peptides available to calculate protein summaries, as protein size decreases (Figure S-1).

CV Median Run Proteins

0

The median spike-in protein summary is found for each run, and the CV of those summaries is calculated. The factors that have the most effect on this metric are any extrinsic differences between runs, such as if the instrument has been recalibrated, or differences in laboratory environmental conditions, such as room temperature. It provides an indication of the repeatability of the experiments. D

DOI: 10.1021/acs.jproteome.6b00882 J. Proteome Res. XXXX, XXX, XXX−XXX

Article

Journal of Proteome Research Table 4. Summary of Performance Metrics and Their Characteristicsa quantitation Metric (Column ID)

Intercept (1, 5) Slope (2, 6) Line Difference Sum (3, 7) Residual Sum (4, 8) CV Channel Sums (9, 13) CV Protein Group Sums (10, 14) CV Size Group Sums (11, 15) CV Median Run Proteins (12)

preprocessing stage

characteristic

absolute

relative

summarization

normalization

bias

× × × × × × × ×

× × × ×

× × × × × × × ×

× (R)

× × ×

× (R) × × × ×

variance

× × × × ×

a The applicability of a metric to the quantitation type, preprocessing stage, and error characteristic is shown. (R): only affects relative quantitation. The number next to the metric name shows its column name in the web-based application.

the particular sample, creating seven groups each containing seven metrics. Numerical comparison uses the selected metrics (columns) of the selected method sets (rows) to calculate the distance of each selected method set to the ideal metric value. The best method set has either the smallest sum of absolute distances to the ideal values (weighted voting) or the largest number of metrics closest to the ideal values (unweighted voting). The result text also provides the distance value or the metrics for which the method set was best. Ties are handled by displaying all method sets that have the best set of metrics. Two kinds of graphical summaries are available. First, the “Plot PCA” button creates a PCA plot of the method sets selected based on the metrics selected. The method sets are projected into two dimensions. This enables the compact summarization of method set performance similarity. Second, the “Plot Comparison” button causes either boxplots or barcharts to be generated of the selected subset of the evaluation matrix, depending on whether there are one or more summarized values for each value of the x-axis variable. The plots can be organized into multiple rows and columns by selecting one of the “Methodology” column names from the “Rows By” and “Columns By” drop-down lists. The “Colour By” list adds a third way to group metrics in the plot. Therefore, the application allows up to three methodology factors to be visualized simultaneously.

signal for each run and calculates the CV of those seven values. Its ideal value is zero. None of the metrics use any protein summaries from the internal reference channel or the background proteins to ensure that metrics are independent of the protein summaries used for normalization. All metrics, the scenarios they are relevant to, the preprocessing stages they are affected by, and the undesirable characteristic they evaluate are summarized in Table 4. QPEP Application

To enable the unbiased and reproducible evaluation of three novel case studies in the following section, and, importantly, to encourage researchers to develop their own data preprocessing methods and compare their method’s performance to existing approaches, a web-based application has been developed. Its URL is http://shiny.maths.usyd.edu.au/QPEP. QPEP has three sections: a help page, an evaluation matrix page, and a data set upload page. To enable researchers to compare their novel methods with existing approaches, the “New Processing Upload” page (Figure 2A) enables users to upload their own preprocessing of the data set for automated performance metric determination by calculating the 15 metrics defined in the previous section. Absolute quantitation and summation metrics are not calculated if fold changes between channels are provided rather than protein quantities of each sample. When the details of preprocessing and the quantitation table have been uploaded to the server, all possible performance metrics are calculated. The “Evaluation Matrix” page (Figure 2B) is centered on an interactive performance metrics matrix. Each unique set of preprocessing parameters is referred to as a method set and forms one row of the matrix. The first row of the matrix shows the ideal value of each metric. The metrics are grouped by whether they are calculated on within-channel, within-run, or between-runs protein measurements. They are further grouped by whether they measure bias or variance. Each metric occupies one column of the matrix. To enable users to interactively make comparisons of interest, the rows and columns of the matrix are selectable by left clicking the method set number or the metric number in the footer of the table. Once a subset of the matrix has been selected, the method sets may be evaluated either numerically or graphically. The metrics are typically grouped by either experimental run or iTRAQ channel. For example, when “Channel” is selected for the “Group Metrics By” option, each of the user-chosen metrics calculated for each of the seven runs is aggregated into groups based on the iTRAQ channel used for

Case Studies

The utility of the proposed evaluation metrics and QPEP is illustrated by their application to three analysis scenarios, each examining the performance of preprocessing methods for either relative or absolute quantitation. The first scenario involves evaluating the bias and variance of relative quantitation produced by the ProteinPilot commercial software to represent a typical analysis that a proteomics beginner may perform with a data set generated by a SCIEX instrument. Second, the performance characteristics of custom relative quantitation are evaluated using peptide summarization methods and betweensample normalization methods popular in academia. Finally, a novel scenario of performing absolute quantitation without use of internal standards is examined. Between-sample and between-run normalization are of interest in the first two scenarios because clustering of samples by iTRAQ channel and, to a lesser extent, run can be seen in PCA plots of the background protein summaries (Figure S-2). Samples measured by the internal reference channel are expected to be located away from the other seven channels, where the E

DOI: 10.1021/acs.jproteome.6b00882 J. Proteome Res. XXXX, XXX, XXX−XXX

Article

Journal of Proteome Research

Figure 2. Overview of the web-based application for evaluation. (A) New Processing Upload page allows users to describe various aspects of their data preprocessing and to upload the protein quantities or fold changes as a text file to the server. The web-based application calculates all of the performance metrics and then automatically switches to the Evaluation Matrix page. (B) Evaluation Matrix page displays the matrix of performance metrics and various tools to interact with it. The numbers in the table are summaries, either the mean or standard deviation of the metric, of either all experimental runs or all iTRAQ channels, where applicable. Filtering, sorting, row, and column selection are used to subset the matrix to the comparison of interest. Full metric names can be seen by hovering the mouse over the column number. Numerical and graphical evaluation tools are located below the metric matrix.

proportion of yeast proteins (1 μL/57 μL) is larger than for the other channels (1 μL/128 μL), so the PCA plot is consistent with theoretical expectations.

make samples more comparable to each other by calculating the ratio to the internal reference channel for all proteins within a sample and representing a sample by the median ratio. By a process not described in the software documentation, scaling factors are calculated for each sample to make the median ratio of proteins in a dynamic channel to the reference channel be 1:1. From the design matrix (Table S-3), because each dynamic channel contains the same protein volumes in a different permutation, it is evident that the theoretical median fold change of each dynamic sample to the internal reference sample is the

Case 1: Preprocessing with ProteinPilot

Quantitation benchmarking of ProteinPilot has not previously been made publicly available, and its performance in comparison with methods proposed in academia is presently unknown. Two key steps of ProteinPilot processing are bias and background correction. The bias correction step aims to F

DOI: 10.1021/acs.jproteome.6b00882 J. Proteome Res. XXXX, XXX, XXX−XXX

Article

Journal of Proteome Research

Figure 3. Performance metrics of regression for relative quantitation using the default processing parameters of ProteinPilot 5. Horizontal red line represents the ideal metric value. Metrics are grouped either by iTRAQ channel (first row) or run (second row).

corrections for batch effects are unknown. Only one study has proposed a solution to remove such effects.20 The performance of this method has not been evaluated independently. Other possible approaches include not performing any correction (e.g., ProteinPilot), or using modern statistical techniques, such as RUV,21 which has not previously been applied in the proteomics field. Simple Scaling corrects the sample-to-sample variability within each run by calculating multiplicative factors using only the yeast protein summaries. Every channel has the yeast proteome added at the same quantity, so the channels should be adjusted so that some aspect of the yeast proteome is made to be consistent between samples. First, a representative control quantity value for each sample is calculated. The Top 1, Top 3, and Median summaries are made for yeast proteins, as well as the summation of all yeast proteins within a sample. Next, a scaling factor is calculated by dividing the representative value of the internal reference channel to the representative value of each dynamic channel. Every protein in a dynamic channel is multiplied by the calculated factor for that channel. This procedure is done independently for each run. Two-Stage Scaling is an extension of Simple Scaling. The first step is identical to Simple Scaling. In the second stage, another set of scaling factors to multiply entire runs by is calculated. The four ways of calculating a representative value are used, this time considering all yeast proteins within a run. A scaling factor is calculated by arbitrarily picking one run to be the reference run and then dividing its representative value by each other run’s representative value. Lastly, all proteins within a run are multiplied by the calculated factor for that run. Third, a more complicated approach involves fitting a linear model to the subset of the yeast proteins, which were measured in every run of the experiment. Here a modified version of the published model6 is used that does not include a peptide effect

same, so this normalization approach is applicable to the data set used for evaluation. The following step, background correction, is intended to make fold changes closer to their true values. It affects each protein within a sample differently. In the software documentation, it is stated to be problematic with data sets where the proteins with the highest expression amount are mostly differentially expressed. This is true for the experimental design used by this case study, where the yeast proteins are present at much lower abundances than any of the spike-in proteins. Only the regression line for relative quantitation is evaluated because the software does not output the numerator and denominator used, only the calculated fold change. Most distributions of the of the regression line slope have a median close to the ideal value of 1 (Figure 3). The slopes for channels 113 and 118 are particularly good, having little variance and being the very close to 1. This demonstrates that background correction may perform well for data sets composed of mostly differentially expressed proteins, contrary to the warning in the software documentation. Channel-specific biases are evident, with the Intercept metric of channels 114 and 115 having intercepts around −2. The biases also adversely affect the Line Difference Sum metric of these channels. There is noticeably less variability between runs than between channels. The lower bias comes at a cost of increased variance. The medians of the Residual Sum metric are higher than for the unnormalized data set, presented in the following section. The increased variance would adversely affect differential protein expression studies in practice, where increased variance would undesirably result in more false-negative findings. Case 2: Custom Relative Quantitation

The summarization approaches of peptides to a particular protein evaluated in this case study are Top 1, Top 3, and Median. The protein summary is the maximally quantitated peptide belonging to the protein, the sum of three such peptides, and the value of the median of such peptides, respectively. Normalization methods evaluated that normalize between samples, and in some cases between batches, are Simple Scaling, Two-Stage Scaling, Linear Model, and Removing Unwanted Variation (RUV). Labeled experiments inherently have batches of samples and the effectiveness of statistical

yijk = μ + runi + channel j + runChannelij + proteink + ϵ

yijk is the log2-transformed protein summary value of a protein in run i, measured in channel j of protein k. The term μ is the intercept term, runi is the effect of the ith run (i ∈ {1..7}), channelj is the effect of the jth channel (j ∈ {113, 114, 115, 116, 117, 118, 119, 121}), the interaction term runChannelij is the sample-specific effect, proteink is a factor for each of the yeast G

DOI: 10.1021/acs.jproteome.6b00882 J. Proteome Res. XXXX, XXX, XXX−XXX

Article

Journal of Proteome Research proteins detected in all runs, and ϵ represents other sources of technical variability. The corrected protein summary value, denoted z, is simply corrected by subtracting the unwanted fitted effects from the observed summary value of spike-in protein k

summation metrics that use summaries from multiple samples are also ideally reduced because the variability between samples is reduced. Apart from Simple Scaling, all of these procedures also aim to reduce variability between runs. Each of the normalization methods with parameter choices was evaluated independently to reduce the complexity of comparison between them (Table S-6). For the two scaling methods, Median Yeast summarization of a sample performed best in the most number of categories (Table S-6A) and will be used as the summary for each sample with which to calculate a scaling factor. For RUV, the performance metrics were rather sensitive to the choice of parameters (Table S-6B). The RUV software uses a default of k equal to the number of samples and ν equal to 0.001, and these values are used in subsequent comparisons. The Simple Scaling, Linear Model, and RUV methods all produce similar biases to fold changes (Figure 4). For example, channels 114 and 115 have a median intercept of about −2, and the variability of this across runs is small. Also, channel 117 has an intercept of ∼1, with little variation across runs. In contrast, when the intercepts are grouped by run, the distributions are centered close to 0, although the spread is much larger. The unnormalized intercepts are all around −2 due to the different total volume in the internal reference channel to the

zijk = yijk − runi − channel j − runChannelij

Lastly, the most complex method for making corrections is a recent extension to the RUV technique,21 which can correct a data set when the unwanted experimental factors are unknown in advance and neither is the factor of interest (e.g., disease status). Like the two-stage scaling and linear model approaches, this method makes use of controls present in each sample. For this reason, the yeast proteins are key to RUV normalization. Any systematic bias will be present in all proteins, including those known to be unrelated to biological differences between samples. The user has to provide an estimate of the number of independent factors thought to be biasing the measurements, termed the rank parameter, and a second parameter to fine-tune how much of the estimated unwanted variation is removed from the data set, termed the regularization parameter. All of these procedures mainly aim to improve relative quantitation bias because systematic differences between samples are accounted for. Variance of relative quantitation and some of the

Figure 4. Performance metrics for relative quantitation. Horizontal red line represents the ideal metric value. The intercept is the only metric affected by the normalizations, so the distribution of intercept metrics of no normalization and three of the correction methods is shown. Two-stage Scaling has identical values to Simple Scaling in this scenario and is not shown. The Slope and Residual Sum metrics are unaffected by any of the betweensample normalization methods, and only their values for the unnormalized data set are shown. H

DOI: 10.1021/acs.jproteome.6b00882 J. Proteome Res. XXXX, XXX, XXX−XXX

Article

Journal of Proteome Research

differences in protein amount are the result of the biological conditions studied or simply an artifact of the iTRAQ channels inadvertently used for each condition. The Residual Sum’s median is typically ∼15, with little variation between groupings. The peptide to protein summary type has no substantial effect on any of the metrics.

dynamic channels and the compositional nature of the data. The spread of the intercepts between channels for the unnormalized summary is much less than that for any of the normalized summaries, demonstrating that normalizing using the yeast protein summaries introduces additional unwanted variation when removing bias. The slope coefficient is particularly far from ideal for channels 114 and 115, with a median value of about 0.25. When considering the slopes being grouped by runs, all runs are consistent with medians close to 0.5. This observation has important implications for experimental design. Most published articles do not use the statistical principles of randomization and blocking when designing their iTRAQ-labeled experiment.22−24 Unless the findings of these projects are validated by a different experimental technique, it is unclear whether the

Case 3: Absolute Quantitation

The linear regression of expected to measured fold changes in this novel scenario has an intercept and slope similar to what is theoretically expected (Figure 5A). No normalization methods are applied because they normalize between samples, whereas these metrics use only within-sample protein summaries. The intercepts of the linear fit tend to be slightly positively biased, which suggests that the median protein summary used

Figure 5. Performance metrics for absolute quantitation. Horizontal red line represents the ideal metric value. (A) Distribution of metric values of a linear regression of protein summaries within a sample divided by the median protein group’s median summary of the sample, grouped by either iTRAQ channel or experimental run. (B) Summaries for three variance metrics calculated using data from either one run or all runs. I

DOI: 10.1021/acs.jproteome.6b00882 J. Proteome Res. XXXX, XXX, XXX−XXX

Article

Journal of Proteome Research

between-run normalization like RUV perform almost identically to simple scaling approaches. There appears to be no good reason to use RUV for the relative quantitation scenario when a much simpler method works equally as well.

as the denominator of the fold change is typically lower than expected. As for custom relative quantitation, the slopes are all well below 1, regardless of the run, channel, or summary method used. The median summarization has the worst median slope estimates for all seven runs and five out of seven channels. The variability of the metrics is not greatly different between the summarization varieties, as shown by similar IQRs within each Figure panel. The current approach to absolute quantitation is to add stable-isotope labeled standards to every sample and create a linear model for the known concentration of peptide and experimentally measured amount.25 This is timeconsuming and expensive because internal standards must be created for every peptide of interest.26 It is infeasible to take this approach for shotgun proteomics experiments because of the number of distinct standards that would need to be developed, but these results suggest that such an effort may be unnecessary. Regarding the summation metrics which sum across different samples, no normalization method performs consistently better than the others (Figure 5B). RUV results in the greatest improvement of the within-channel CVs but only for runs four, five, and six. For the other four runs, all methods result in similar CVs. Simple Scaling is the best method, but by only a small margin, for the CV Median Proteins metric. The CV of all of the measurements in a channel across all runs is approximately the same as the CV within each run, showing that there is little additional unwanted variability present between runs. Notably, the CV increases with run number for Top 1 and Top 3 summarizations, suggesting that a relationship between the length of time the protein mixture was stored before analysis with the mass spectrometer and the magnitude of measurement variability.



ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jproteome.6b00882. Section S-1. Latin Square. Provides introductory information about the Latin square design and its properties. Table S-1. The identities of purchased proteins and their associated information. Table S-2. Characteristics of purchased proteins and the reasons for using those rules. Table S-3. Experimental design of all seven runs for the purchased proteins. Table S-4. Percentage of peptides in a particular run detected in another run. Table S-5. Number of proteins identified in each run. Figure S-1. Relationship between length of spike-in protein and number of peptides detected in experiment. Figure S-2. Principal components analysis of background proteins. Table S-6 Performance metrics after normalization of Top 3 peptide summaries of proteins. (PDF) Archive of all 94 preprocessed protein quantity tables used as input to the web application, which the three case studies are based on. (ZIP) Contains the R source code of the benchmarking server. The software may be run like any other Shiny application. (ZIP) Information about the proteins identified in the seven experimental runs and their fold changes to the internal reference channel, as determined by ProteinPilot. The Error Factor columns are automatically determined by ProteinPilot. They provide a distance from the estimated fold change to determine the boundaries of the 95% confidence intervals if both the fold change and the error factor are converted to a logarithmic scale. (TXT)



CONCLUSIONS A range of performance metrics and an associated web-based application, QPEP, have been described, which allow for the comprehensive evaluation of bias and variance in a designed iTRAQ spike-in experiment. A number of interesting observations were found in the three case studies, which may have consequences for existing and future proteomics experiments. ProteinPilot’s background correction algorithm performed well at reducing the bias of fold changes toward unity, which no other algorithm achieved. Also, absolute quantitation had similar performance characteristics to relative quantitation, although no internal standards for the spike-in proteins were used. In the future, it would be interesting to compare an LC−MS/MS experiment that has been background-corrected by ProteinPilot to an LC−MS/MS/MS experiment of the same samples, which is a laboratory approach to addressing the underestimation of fold changes in iTRAQ experiments.27 Which of these two approaches results in measurements with smaller variance is also unknown and of importance for differential expression studies. Another outstanding issue is that ProteinPilot does not have any user options regarding peak quantitation. Methods such as peak area, peak summit height, and different types of peak area integration could be implemented and their evaluation carried out. Additionally, the poorer performance of iTRAQ channels 114 and 115 relative to the other six channels warrants further investigation. It may have only occurred for the reagent batch used by this study or may be systematically affecting all experiments that utilize iTRAQ labels, the majority of which do not randomize experimental conditions to channels. Finally, complex methods for



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Tel: +61 2 9351 3012. ORCID

Dario Strbenac: 0000-0001-9234-3243 Present Addresses ⊥

N.J.A.: Department of Mathematics and Statistics, School of Engineering and Information Technology, Murdoch University, Murdoch, Western Australia 6150, Australia. # P.W.: School of Veterinary and Life Sciences, Murdoch University, Murdoch, Western Australia 6150, Australia. Author Contributions

S.R.W., P.W., and J.Y.H.Y. designed the replicated Latin squares experiment. M.J.R. and L.Z. carried out the purchasing of proteins, mixing of proteins to create the samples according to the design, iTRAQ labeling, and mass spectrometry. D.S. processed the data, created the metrics, developed the web application, and wrote the article. J.Y.H.Y., N.J.A., and S.R.W. supervised the development of the computational methods and their application to the data set. All authors also edited the article. J

DOI: 10.1021/acs.jproteome.6b00882 J. Proteome Res. XXXX, XXX, XXX−XXX

Article

Journal of Proteome Research Notes

(13) Ning, K.; Fermin, D.; Nesvizhskii, A. I. Comparative Analysis of Different Label-Free Mass Spectrometry Based Protein Abundance Estimates and Their Correlation with RNA-Seq Gene Expression Data. J. Proteome Res. 2012, 11 (4), 2261−2271. (14) Schwanhäusser, B.; Busse, D.; Li, N.; Dittmar, G.; Schuchhardt, J.; Wolf, J.; Chen, W.; Selbach, M. Global quantification of mammalian gene expression control. Nature 2011, 473 (7347), 337−342. (15) Gerster, S.; Kwon, T.; Ludwig, C.; Matondo, M.; Vogel, C.; Marcotte, E. M.; Aebersold, R.; Bühlmann, P. Statistical Approach to Protein Quantification. Mol. Cell. Proteomics 2014, 13 (2), 666−677. (16) Chawade, A.; Alexandersson, E.; Levander, F. Normalyzer: A Tool for Rapid Evaluation of Normalization Methods for Omics Data Sets. J. Proteome Res. 2014, 13 (6), 3114−3120. (17) Shilov, I. V.; Seymour, S. L.; Patel, A. A.; Loboda, A.; Tang, W. H.; Keating, S. P.; Hunter, C. L.; Nuwaysir, L. M.; Schaeffer, D. A. The Paragon Algorithm, a Next Generation Search Engine That Uses Sequence Temperature Values and Feature Probabilities to Identify Peptides from Tandem Mass Spectra. Mol. Cell. Proteomics 2007, 6 (9), 1638−1655. (18) The UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 2015, 43 (D1), D204−D212. (19) Vizcaíno, J. A.; Csordas, A.; del-Toro, N.; Dianes, J. A.; Griss, J.; Lavidas, I.; Mayer, G.; Perez-Riverol, Y.; Reisinger, F.; Ternent, T.; et al. 2016 update of the PRIDE database and its related tools. Nucleic Acids Res. 2016, 44 (D1), D447−D456. (20) Hill, E. G.; Schwacke, J. H.; Comte-Walters, S.; Slate, E. H.; Oberg, A. L.; Eckel-Passow, J. E.; Therneau, T. M.; Schey, K. L. A Statistical Model for iTRAQ Data Analysis. J. Proteome Res. 2008, 7 (8), 3091−3101. (21) Jacob, L.; Gagnon-Bartsch, J. A.; Speed, T. P. Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed. Biostat 2016, 17 (1), 16−28. (22) Leong, S.; Nunez, A. C.; Lin, M. Z.; Crossett, B.; Christopherson, R. I.; Baxter, R. C. iTRAQ-Based Proteomic Profiling of Breast Cancer Cell Response to Doxorubicin and TRAIL. J. Proteome Res. 2012, 11 (7), 3561−3572. (23) Xu, D.; Li, Y.; Li, X.; Wei, L.-L.; Pan, Z.; Jiang, T.-T.; Chen, Z.L.; Wang, C.; Cao, W.-M.; Zhang, X.; et al. Serum protein S100A9, SOD3, and MMP9 as new diagnostic biomarkers for pulmonary tuberculosis by iTRAQ-coupled two-dimensional LC−MS/MS. Proteomics 2015, 15 (1), 58−67. (24) Kim, S.; Lee, H.-J.; Hahm, J.-H.; Jeong, S.-K.; Park, D.-H.; Hancock, W. S.; Paik, Y.-K. Quantitative Profiling Identifies Potential Regulatory Proteins Involved in Development from Dauer Stage to L4 Stage in Caenorhabditis elegans. J. Proteome Res. 2016, 15 (2), 531− 539. (25) Scott, K. B.; Turko, I. V.; Phinney, K. W. Quantitative Performance of Internal Standard Platforms for Absolute Protein Quantification Using Multiple Reaction Monitoring-Mass Spectrometry. Anal. Chem. 2015, 87 (8), 4429−4435. (26) Villanueva, J.; Carrascal, M.; Abian, J. Isotope dilution mass spectrometry for absolute quantification in proteomics: Concepts and strategies. J. Proteomics 2014, 96, 184−199. (27) Ting, L.; Rad, R.; Gygi, S. P.; Haas, W. MS3 eliminates ratio distortion in isobaric multiplexed quantitative proteomics. Nat. Methods 2011, 8 (11), 937−940.

The authors declare no competing financial interest. The raw mass spectra and peptide-spectrum matches are available from the ProteomeXchange Consortium as data set PXD003608 via the PRIDE database.19



ACKNOWLEDGMENTS We acknowledge the National Health and Medical Research Council grant 525453 awarded to S.R.W., which also provided support to P.W., and the Australian Government Department of Education and Training Australian Postgraduate Award to D.S.



ABBREVIATIONS CV, coefficient of variation; IQR, interquartile range; iTRAQ, isobaric tags for relative and absolute quantification; PCA, principal components analysis; RUV, removing unwanted variation



REFERENCES

(1) Cope, L. M.; Irizarry, R. A.; Jaffee, H. A.; Wu, Z.; Speed, T. P. A benchmark for Affymetrix GeneChip expression measures. Bioinformatics 2004, 20 (3), 323−331. (2) Irizarry, R. A.; Wu, Z.; Jaffee, H. A. Comparison of Affymetrix GeneChip expression measures. Bioinformatics 2006, 22 (7), 789−794. (3) SEQC/MAQC-III Consortium. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat. Biotechnol. 2014, 32 (9), 903−914. (4) Halper-Stromberg, E.; Frelin, L.; Ruczinski, I.; Scharpf, R.; Jie, C.; Carvalho, B.; Hao, H.; Hetrick, K.; Jedlicka, A.; Dziedzic, A.; et al. Performance assessment of copy number microarray platforms using a spike-in experiment. Bioinformatics 2011, 27 (8), 1052−1060. (5) Mueller, L. N.; Rinner, O.; Schmidt, A.; Letarte, S.; Bodenmiller, B.; Brusniak, M.-Y.; Vitek, O.; Aebersold, R.; Müller, M. SuperHirn − a novel tool for high resolution LC−MS-based peptide/protein profiling. Proteomics 2007, 7 (19), 3470−3480. (6) Mahoney, D. W.; Therneau, T. M.; Heppelmann, C. J.; Higgins, L.; Benson, L. M.; Zenka, R. M.; Jagtap, P.; Nelsestuen, G. L.; Bergen, H. R.; Oberg, A. L. Relative Quantification: Characterization of Bias, Variability and Fold Changes in Mass Spectrometry Data from iTRAQ-Labeled Peptides. J. Proteome Res. 2011, 10 (9), 4325−4333. (7) Tuli, L.; Tsai, T.-H.; Varghese, R. S.; Xiao, J. F.; Cheema, A.; Ressom, H. W. Using a spike-in experiment to evaluate analysis of LC−MS data. Proteome Sci. 2012, 10 (1), 13. (8) Matzke, M. M.; Brown, J. N.; Gritsenko, M. A.; Metz, T. O.; Pounds, J. G.; Rodland, K. D.; Shukla, A. K.; Smith, R. D.; Waters, K. M.; McDermott, J. E.; et al. A comparative analysis of computational approaches to relative protein quantification using peptide peak intensities in label-free LC−MS proteomics experiments. Proteomics 2013, 13 (3−4), 493−503. (9) Pursiheimo, A.; Vehmas, A. P.; Afzal, S.; Suomi, T.; Chand, T.; Strauss, L.; Poutanen, M.; Rokka, A.; Corthals, G. L.; Elo, L. L. Optimization of Statistical Methods Impact on Quantitative Proteomics Data. J. Proteome Res. 2015, 14 (10), 4118−4126. (10) Shalit, T.; Elinger, D.; Savidor, A.; Gabashvili, A.; Levin, Y. MS1Based Label-Free Proteomics Using a Quadrupole Orbitrap Mass Spectrometer. J. Proteome Res. 2015, 14 (4), 1979−1986. (11) Välikangas, T.; Suomi, T.; Elo, L. L. A systematic evaluation of normalization methods in quantitative label-free proteomics. Briefings Bioinf. 2016, bbw095. (12) Sturm, M.; Bertsch, A.; Gröpl, C.; Hildebrandt, A.; Hussong, R.; Lange, E.; Pfeifer, N.; Schulz-Trieglaff, O.; Zerck, A.; Reinert, K.; et al. OpenMS − An open-source software framework for mass spectrometry. BMC Bioinf. 2008, 9, 163. K

DOI: 10.1021/acs.jproteome.6b00882 J. Proteome Res. XXXX, XXX, XXX−XXX