Statistical Models for the Analysis of Isobaric Tags Multiplexed


Article

Statistical models for the analysis of isobaric tags multiplexed quantitative proteomics Gina D'Angelo, Raghothama Chaerkady, Wen Yu, Deniz Baycin Hizal, Sonja Hess, Wei Zhao, Kristen Lekstrom, Xiang Guo, Wendy I White, Lorin Roskos, Michael A. Bowen, and Harry Yang J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.6b01050 • Publication Date (Web): 26 Jul 2017 Downloaded from http://pubs.acs.org on July 27, 2017


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Proteome Research

Statistical models for the analysis of isobaric tags multiplexed quantitative proteomics

Gina D’Angelo*,1, Raghothama Chaerkady2, Wen Yu3, Deniz Baycin-Hizal2, Sonja Hess2, Wei Zhao1, Kristen Lekstrom2, Xiang Guo4, Wendy I. White4, Lorin Roskos5, Michael A. Bowen2,6, Harry Yang1

1 Statistical Sciences, MedImmune, Gaithersburg, Maryland, United States
2 Antibody Discovery and Protein Engineering, Protein Sciences, MedImmune, Gaithersburg, Maryland, United States
3 Research Bioinformatics, MedImmune, Gaithersburg, Maryland, United States
4 Clinical Biomarkers and Computational Biology, MedImmune, Gaithersburg, Maryland, United States
5 Clinical Pharmacology, Pharmacometrics, and DMPK, MedImmune, Gaithersburg, Maryland, United States
6 Currently at Juno Therapeutics, Seattle, WA, United States

KEYWORDS: Proteomics, Mixed models, Statistical Models, Biomarkers, TMT

ACS Paragon Plus Environment


ABSTRACT: Mass spectrometry is being used to identify protein biomarkers that can facilitate development of drug treatment. Mass spectrometry based labeling proteomic experiments result in complex proteomic data that is hierarchical in nature often with small sample size studies. The generalized linear model (GLM) is the most popular approach in proteomics to compare protein abundances between groups. However, GLM does not address all the complexities of proteomics data such as repeated measures and variance heterogeneity. Linear Models for Microarray Data (LIMMA) and mixed models are two approaches that can address some of these data complexities to provide better statistical estimates. We compared these three statistical models (GLM, LIMMA, and mixed models) under two different normalization approaches (quantile normalization and median sweeping) to demonstrate when each approach is the best for tagged proteins. We evaluated these methods using a spiked-in dataset of known protein abundances, a Systemic Lupus Erythematosus (SLE) dataset, and simulated data, from multiplexed labeling experiments that use tandem mass tags (TMT). Data are available via ProteomeXchange with identifier PXD005486. We found median sweeping to be a preferred approach of data normalization, and with this normalization approach there was overlap with findings across all methods with GLM being a subset of mixed models. The conclusion is that the mixed model had the best type I error with median sweeping, while LIMMA had the better overall statistical properties regardless of normalization approaches.

1 Introduction
Mass spectrometry based proteomics is a promising technology to identify significant protein differential abundances between groups1-3. Proteomics can help advance biomarker discovery and facilitate clinical development of target therapies3-4. Among many applications, proteomic profiling can help to identify subpopulations that may respond better to drug treatments thus


enabling personalized healthcare approaches5. Mass spectrometry analysis yields quantitative information based on the abundance of the ions, which may be a function of peptide concentrations and other factors. Mass spectrometry based quantitative proteomics can be accomplished using label-free or multiplexed labeling approaches. Our study employed multiplexed labeling using tandem mass tags (TMT), where multiple samples are quantitated simultaneously using isobaric tags. In shotgun proteomics, the estimates of the protein intensities are derived from the peptide-level measurements. In TMT based relative quantitation, a maximum of ten samples, labeled with chemically equivalent but isotopically distinct tags, can be multiplexed in a plate, and multiple plates are used to accommodate more than 10 samples. A multiplex experiment is referred to as a plate throughout this paper. The TMT tags used to label the samples within each plate are referred to as channels. Abundant peptides produce a large number of peptide spectral matches (PSMs); however, low-abundance peptides with better ionization properties can also produce high-intensity signals. The spectra from a peptide can also vary in their abundance, ranging from barely detected to highly abundant. Some peptides have such a low abundance that they are below the limit of detection, leading to missing values. This can be due to poor LC-MS signal and experimental variation, among other plate- and subject-driven reasons. This necessitates the use of appropriate statistical approaches to model the intensity measured by mass spectrometry in order to determine protein differential abundance.

Proteomics data has a complex hierarchical structure. Each protein has a varied number of peptide spectrum matches, with quantitative data often yielding multiple measurements of a peptide. For our purposes we consider the spectra as multiple measurements of a peptide. Each protein can have multiple peptides, each of which can have multiple spectra (measurements).


These protein and peptide “repeated measurements” for each channel add another layer of nonindependence and correlation. Analysis of such data requires proper statistical modeling so as to provide a corrected variance for statistical inference. The hierarchical structure of the proteomics data is considered to be clustered, and there are multiple approaches that can be used to handle clustered data. The general linear model (GLM) and t-test are two standard approaches for analyzing group-based proteomics data. Linear Models for Microarray Data (LIMMA)6, popularized in microarray data analysis, has been gaining traction in proteomics2, 7-9. However, LIMMA is still not as widely accepted in proteomics as it has been in microarray data2. Common to these three techniques is the use of a two-step procedure, consisting of 1) reducing the data to independent summary measures of the proteins, and 2) performing the analysis using the independent measures. LIMMA goes a step further and corrects the variance by shrinking it towards a pooled variance. This is intended to address the concern that, with a small number of samples (i.e. subjects), the variance measure is a poor estimate. With small sample sizes, proteins with extremely low fold changes can have smaller variances, resulting in more false significant findings2. By contrast, proteins with large fold changes tend to have larger variances, resulting in false null (nonsignificant) findings2. By drawing a population protein variance and shrinking towards it, LIMMA corrects the extreme variances6, 10, mitigating the risks of both false discoveries and false nonsignificant findings. Mixed models11-12 can be used in the presence of repeated measures and correlated data, and can account for sources of variation in the data.
Rather than entering a summary statistic for each protein into the model, we suggest entering all the peptide data into a single protein-specific model, improving the variance estimates by using all of the spectra data. Specifically, we want to account


for variation due to plate, channel, and peptide. Each protein-specific model will include all the peptide and spectra data per protein while incorporating the various sources of noise due to the plate, channel, and interdependence of the measurements. In doing so, more accurate variance estimates may be obtained. Mixed models have been proposed for both label-free13-16 and labeled3, 17-18 proteomics data. Herbrich et al.18 suggested mixed models to model all protein data simultaneously with one-sample data using mean sweeping. Oberg et al.3 demonstrated mixed models for groupwise comparisons by entering all the proteomics data and implementing a stepwise normalized regression approach. However, Oberg et al.3 discussed the computational barriers and complexities of this approach and the problems with unbalanced proteomics data. We also propose using mixed models; however, our approach differs from Herbrich et al.18 and Oberg et al.3 in 1) implementing two different preprocessing streams, consisting of median sweeping or quantile normalization, and 2) testing for group differences using protein-specific models rather than putting all the protein data in a single model. Due to computational limitations and the complicated structure of the data, we are unable to put all the proteins into a single model to assess group differential protein abundances. The computational limitation is caused by the large number of parameters that need to be estimated and the complexity of the likelihood optimization. The structure of proteomics data is complicated because 1) the number of proteins varies between plates, 2) the number of peptides, and of PSMs within each peptide, can vary from 1 to hundreds and is not consistent across plates, and 3) there is missing data across some channels. Such a complicated data structure can cause additional numerical optimization problems when attempting to put all data into a single mixed model.
If the missing data are missing not at random (MNAR), then the mixed model results are not reliable.
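One way to write the protein-specific mixed model described above is sketched here. The notation is assumed for illustration and is not taken from the paper's own equations: $y_{ijqk}$ is the transformed intensity of spectrum k from peptide q in channel j of plate i for a single protein, $\beta_1$ is the group effect of interest, and $a_i$, $b_{j(i)}$, and $c_{q(ij)}$ are random effects for plate, channel within plate, and peptide within plate and channel.

```latex
y_{ijqk} = \beta_0 + \beta_1\,\mathrm{group}_{ij} + a_i + b_{j(i)} + c_{q(ij)} + \epsilon_{ijqk},
\qquad
a_i \sim N(0, \sigma^2_{\mathrm{plate}}),\quad
b_{j(i)} \sim N(0, \sigma^2_{\mathrm{channel}}),\quad
c_{q(ij)} \sim N(0, \sigma^2_{\mathrm{peptide}}),\quad
\epsilon_{ijqk} \sim N(0, \sigma^2_{\epsilon})
```

Under this sketch, differential abundance corresponds to a test of $\beta_1 = 0$, with the random effects absorbing the plate, channel, and peptide sources of variation so that repeated spectra are not treated as independent observations.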


There have been earlier efforts to compare the GLM to the mixed model18, and also the GLM/t-test to LIMMA2, using mean/median sweeping in proteomics. Although the three modeling approaches (GLM, LIMMA, and mixed models) can be used to analyze proteomic data, no effort has been made to compare the utility and performance of these three methods simultaneously. We evaluate these three methods under two different normalization approaches (quantile normalization and median sweeping). In this paper, we evaluate the statistical properties characterized by risk of false findings (type I error) and risk of non-significant findings (type II error) so as to provide some practical guidance on how to use these methods for proteomic data explorations. The assessment is carried out using a dataset derived from known protein abundances (a spiked-in dataset), simulated data, and a commercial serum Systemic Lupus Erythematosus (SLE) dataset with TMT labeling data.
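The variance shrinkage that distinguishes LIMMA from the GLM, as described in the Introduction, can be illustrated with a minimal sketch. It uses the standard moderated-variance formula from the LIMMA literature; the function names are hypothetical, and the prior variance `s0_sq` and prior degrees of freedom `d0` are supplied by hand here, whereas LIMMA estimates them from all proteins by empirical Bayes.

```python
import math


def moderated_variance(s2, d, s0_sq, d0):
    """Shrink one protein's sample variance s2 (d residual df) toward the
    prior variance s0_sq, which carries the weight of d0 degrees of freedom."""
    return (d0 * s0_sq + d * s2) / (d0 + d)


def moderated_t(effect, s2, d, s0_sq, d0, unscaled_sd):
    """Moderated t-statistic; it is referred to a t distribution on d0 + d df.

    unscaled_sd is the square root of the unscaled variance of the
    coefficient, e.g. sqrt(1/n1 + 1/n2) for a two-group comparison.
    """
    s2_tilde = moderated_variance(s2, d, s0_sq, d0)
    return effect / (math.sqrt(s2_tilde) * unscaled_sd)


# Hypothetical values: two proteins with the same estimated effect but an
# unusually small and an unusually large sample variance.
prior_var, prior_df, resid_df = 1.0, 4, 4
small = moderated_variance(0.01, resid_df, prior_var, prior_df)
large = moderated_variance(4.0, resid_df, prior_var, prior_df)
```

In this toy case the very small variance is pulled up and the very large one is pulled down, which is the mechanism by which the shrinkage tempers both spurious significant findings and spurious null findings when the number of subjects is small.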

2 Experimental Section
2.1 Sample preparation and mass spectrometry analysis
We used two datasets from quantitative mass spectrometry experiments, one from 13 proteins spiked into E. coli lysates and a second one from differential proteomics of SLE using depleted serum samples from control and SLE groups. A detailed method section describing sample preparation, multiplexed isobaric labeling, MASCOT search parameters, and LC-MS2 mass spectrometry analysis is given in the supporting material section. Depleted SLE and normal serum samples were processed as described by Cole et al.19. E. coli samples were subjected to the filter-aided sample preparation (FASP) method of digestion. Peptides from both sets of samples


were labeled using 10-plex TMT (tandem mass tag) (Thermo). Both datasets were obtained from high-resolution mass spectrometry analysis of isobaric multiplexed labeled peptides.

For the spiked-in experiment, the background proteome was prepared from E. coli DH5 alpha strain. Twelve proteins encoded by ENO1, ARG1, FABP4, PTGES3, KPNA2, LASP1, HDAC3, KCNIP3, OTUB1, GABARAPL1, GAS7 and EZR were obtained from GeneCopoeia (Rockville, MD, USA) and bovine serum albumin was obtained from Sigma-Aldrich (St Louis, MO, USA) (Table S1). These 13 proteins were spiked at different concentrations ranging from 0 to 80 picomoles in ten E. coli lysates (70 µg) (Table S2). LC-MS analyses were carried out for 90 minutes using a Dionex Ultimate 3000 RSLCnano nanoflow LC system coupled to an LTQ Orbitrap Fusion Tribrid mass spectrometer. MS analysis was carried out using data dependent MS/MS acquisition. The mass resolution for the MS/MS analysis was set to 60,000 in order to resolve the TMT reporter ions, which will be referred to as “channel” in this paper.

The commercial SLE samples were bought from BioreclamationIVT (Baltimore, MD, USA). Six abundant proteins (albumin, IgG, IgA, transferrin, haptoglobin, and antitrypsin) were depleted from serum samples using the Human 6 Multiple Affinity Removal System (Agilent Technologies) according to the manufacturer’s protocol. Unbound samples were concentrated and 100 µg of protein was digested using trypsin as described previously19. Peptides from 10 samples were labeled with 10-plex tandem mass tag (TMT) reagents according to the manufacturer's instructions. Data dependent acquisition (DDA) mass spectrometry of the TMT labeled peptides was carried out on a hybrid quadrupole-orbitrap (Q-Exactive) (Thermo Fisher Scientific) MS interfaced with a Thermo Easy nLC system (Proxeon). A detailed method section describing LC-MS/MS analysis is given in the supporting material section.


The raw data for both datasets was processed using the Proteome Discoverer 1.4 (Thermo Fisher Scientific) data analysis pipeline configured with MASCOT (v2.5) search engine and data was filtered using a 1% FDR cutoff as estimated by Percolator20. The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE21 partner repository with the dataset identifier PXD005486.

2.2 Data preprocessing
Once data is produced from Proteome Discoverer, additional preprocessing is necessary. We first removed PSMs that have >=30% isolation interference2, 22 to minimize the influence of co-isolated peptides on peptide quantification. In a PSM, some of the channels can have missing values, typically due to being below the limit of detection (LOD). Missing data in TMT labeling tend to be left censored and are referred to as missing not at random (MNAR)23-24. We notice in our data that the abundance values are related to the percentage of missing data, which indicates the data are left censored and MNAR23. We use a single-value imputation approach25 that replaces the missing values in a PSM by that PSM's minimum value, also known as the PSM LOD. We are interested in analyzing low-abundance proteins, and it has been shown that imputation can improve analysis in this scenario26, although not as effectively for medium- to high-abundance proteins. We exclude the PSMs where all channels have missing values. We note that missing data is fairly common in both MS labeling experiments and label-free experiments. There are many schools of thought on how to handle missing data and imputation3, 18, 23-30. There is no general consensus in the proteomics community on how to handle missing data, and we recommend that users review their data and the literature that reflects their data. For example, TMT labeling differs from label-free data in that TMT-labeled peptides are viewed as


10 (or however many channels your multiplex has) repeat measurements of the same peptides. Based on our experience, we have noticed a higher percentage of missing data with label-free experiments than with labeled experiments, and this may be caused by label-free data having “independent” runs, where ion suppression may occur in one analysis but not in the other. We also caution the analyst to understand how software handles missing data, as the results can differ depending on the approach and the data (e.g. MaxQuant28, MSstats27, and imputeLCMD31).

Proteomics data tend to have variance heterogeneity because their variance is a function of the mean, which violates the assumptions of linear models32. Transformations can address the mean-variance relationship and stabilize the variance, while also making the data approximately normally distributed33. A log2 transformation of the spectra addresses the nonconstant variance common in proteomics data2, 32. We choose to use the raw value and not take ratios since there is no reliable reference value to use for each channel.

Both quantile normalization and median sweeping are evaluated. Normalization is performed to remove systematic bias due to the instrument, sample preparation, and experimental error34-35. Quantile normalization normalizes all channels so that they each have the same distribution. We do this by assigning the same mean value to the jth ranked value of every channel. Median sweeping2, 18 normalizes all channels so that they each have a median of zero. Median sweeping consists of multiple steps in the following order: 1) the log2 intensity values are median polished by subtracting the median of the PSM log2 intensity values from each log2 intensity value in that PSM, 2) the protein relative abundance for each channel is calculated by estimating the median of all PSMs belonging to that channel and protein, and 3) the loading material and sample processing are corrected by subtracting the median of each channel from the protein relative abundance.
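The preprocessing steps described in this section can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the authors' pipeline: the function names and the input layout (one list of channel intensities per PSM, with `None` for a missing value) are hypothetical, a log2 transformation is assumed, and ties in quantile normalization are broken by input order rather than averaged.

```python
import math
from statistics import mean, median


def impute_psm_minimum(psm):
    """Fill missing channels (None) of one PSM with that PSM's minimum observed value."""
    observed = [v for v in psm if v is not None]
    if not observed:
        return None  # every channel is missing, so the PSM is excluded
    floor = min(observed)
    return [floor if v is None else v for v in psm]


def quantile_normalize(channels):
    """Give every channel the same distribution.

    channels: one equal-length list of values per channel.
    The r-th ranked value of each channel is replaced by the mean of the
    r-th ranked values across all channels.
    """
    n = len(channels[0])
    ranked = [sorted(c) for c in channels]
    rank_means = [mean([col[r] for col in ranked]) for r in range(n)]
    normalized = []
    for c in channels:
        order = sorted(range(n), key=lambda i: c[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


def median_sweep(psms_by_protein):
    """Three-step median sweeping.

    psms_by_protein: {protein: [PSM, ...]}, each PSM a list of raw channel
    intensities with None for a missing value.
    Returns {protein: [relative abundance per channel]}.
    """
    abundance = {}
    for protein, psms in psms_by_protein.items():
        centred = []
        for psm in psms:
            filled = impute_psm_minimum(psm)
            if filled is None:
                continue
            logged = [math.log2(v) for v in filled]
            psm_median = median(logged)
            # Step 1: centre each PSM on its own median.
            centred.append([v - psm_median for v in logged])
        if centred:
            # Step 2: per channel, take the median over the protein's PSMs.
            abundance[protein] = [median(col) for col in zip(*centred)]
    # Step 3: subtract each channel's median across proteins.
    n_channels = len(next(iter(abundance.values())))
    channel_medians = [
        median([row[j] for row in abundance.values()]) for j in range(n_channels)
    ]
    return {
        protein: [v - channel_medians[j] for j, v in enumerate(row)]
        for protein, row in abundance.items()
    }
```

After `median_sweep`, every channel has a median of zero across proteins, which is the property the text attributes to median sweeping; `quantile_normalize` instead forces the full distribution of each channel to match.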


It has been recognized that more reliable proteins contain multiple peptides, each peptide having multiple spectra36-40. A protein with only one peptide and a peptide with only one spectrum are less reliable and not a good representation of that protein36-40. We explore this idea to evaluate the impact of excluding low-frequency proteins and low-frequency peptides on the estimate and inference.
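The exclusion of low-frequency peptides and proteins can be sketched as a simple counting filter. The function name and the input layout (one `(protein, peptide)` pair per PSM) are hypothetical, and the order of operations is an assumption: peptides with too few PSMs are dropped first, and a protein is then kept only if enough peptides remain.

```python
from collections import defaultdict


def filter_proteins(psm_rows, min_peptides=2, min_psms=2):
    """Keep proteins that have at least min_peptides peptides, counting only
    peptides that have at least min_psms PSMs.

    psm_rows: iterable of (protein, peptide) pairs, one pair per PSM.
    Returns {protein: {peptide: number of PSMs}} for the retained proteins.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for protein, peptide in psm_rows:
        counts[protein][peptide] += 1
    kept = {}
    for protein, peptides in counts.items():
        reliable = {pep: n for pep, n in peptides.items() if n >= min_psms}
        if len(reliable) >= min_peptides:
            kept[protein] = reliable
    return kept


# Hypothetical PSM table: protein A has two peptides with two PSMs each,
# B has a single peptide, and C has two peptides with one PSM each.
rows = [
    ("A", "p1"), ("A", "p1"), ("A", "p2"), ("A", "p2"),
    ("B", "p3"), ("B", "p3"),
    ("C", "p4"), ("C", "p5"),
]
retained = filter_proteins(rows)
```

With the defaults this corresponds to requiring more than one peptide per protein and more than one PSM per peptide; setting `min_peptides=3` gives the stricter restriction of more than two peptides per protein.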

2.3 Statistical Methods
2.3.1 Data notation
Let $y_{ijpqk}$ denote the intensity value of the kth spectra from the pth protein and qth peptide of the ith plate and jth channel, where i=1,…,I and j=1,…,J. We stack all the spectra of the ith plate and jth channel and have an $n_{ij} \times 1$ vector $y_{ij}$. Group, plate, channel, and peptide are denoted as $x_{ij} = (g, i, j, q)$, where $g = 1, \dots, G$ for group, $i = 1, \dots, I$ for plate, $j = 1, \dots, J$ for channel, and $q$ is the qth peptide belonging to the pth protein, ith plate, and jth channel.
2.3.2 Mixed models
A mixed effects model11-12, 41 is $y_{ij} = X_{ij}\beta + Z_{ij} b_{ij} + \epsilon_{ij}$, where $\beta$ is a $p \times 1$ vector of fixed effects, $b_{ij} \sim N(0, D)$ is an $l \times 1$ vector of random effects, $\epsilon_{ij} \sim N(0, R_{ij})$ is a vector of residual errors, $R_{ij} = \sigma^2 I_{n_{ij}}$, $b_{11}, \dots, b_{IJ}, \epsilon_{11}, \dots, \epsilon_{IJ}$ are independent, and […]


Table 3 Spiked-in results with quantile normalization

                               Mix1      Mix2      GLM       LIMMA
No restriction
  Type I (n=23710)             0.008     0.019     0.047     0.052
  Power (n=48)                 75.0%     77.1%     77.1%     85.4%
#PEP/Prot>1 & PSM/PEP>1
  Type I (n=15820)             0.001     0.002     0.044     0.050
  Power (n=48)                 75.0%     77.1%     77.1%     85.4%
#PEP/Prot>2 & PSM/PEP>1
  Type I (n=15220)             0.001     0.002     0.043     0.048
  Power (n=48)                 75.0%     77.1%     77.1%     85.4%

Table 4 Summary statistics of fold changes spiked-in experiment with median sweeping

                                        12-spiked                        Non spiked
                           2       4       20      40      All
No restriction
  n                        12      12      12      12      48      23710
  Mix1/Mix2   Bias         -0.6    -1.7    -11.7   -24.0   -9.5    -0.9
              √MSE         1.0     1.9     11.8    24.3    13.6    1.4
  GLM/LIMMA   Bias         -0.7    -1.8    -11.0   -21.9   -8.9    -1.0
              √MSE         1.0     1.9     11.2    22.3    12.5    1.5
# PEP/Prot>1 & PSM/PEP>1
  n                        12      12      12      12      48      15820
  Mix1/Mix2   Bias         -0.6    -1.7    -11.7   -24.1   -9.5    -1.0
              √MSE         1.0     1.9     11.9    24.4    13.6    1.5
  GLM/LIMMA   Bias         -0.7    -1.8    -11.0   -22.2   -8.9    -1.0
              √MSE         1.0     1.9     11.2    22.5    12.6    1.5
# PEP/Prot>2 & PSM/PEP>1
  n                        12      12      12      12      48      15220
  Mix1/Mix2   Bias         -0.6    -1.7    -11.7   -24.1   -9.5    -0.9
              √MSE         1.0     1.9     11.9    24.4    13.6    1.5
  GLM/LIMMA   Bias         -0.7    -1.8    -11.0   -22.2   -8.9    -1.0
              √MSE         1.0     1.9     11.2    22.5    12.6    1.5

Table 5 Summary statistics of fold changes spiked-in experiment with quantile normalization

                                        12-spiked                        Non spiked
                           2       4       20      40      All
No restriction
  n                        12      12      12      12      48      23710
  Bias                     -0.6    -1.7    -11.7   -24.1   -9.5    -1.0
  √MSE                     1.0     1.9     11.9    24.4    13.6    1.5
# PEP/Prot>1 & PSM/PEP>1
  n                        12      12      12      12      48      15820
  Bias                     -0.6    -1.8    -11.8   -24.1   -9.6    -1.0
  √MSE                     1.0     1.9     11.9    24.5    13.7    1.5
# PEP/Prot>2 & PSM/PEP>1
  n                        12      12      12      12      48      15220
  Bias                     -0.6    -1.8    -11.8   -24.1   -9.6    -1.0
  √MSE                     1.0     1.9     11.9    24.5    13.7    1.5


3.2 Simulation
In order to evaluate the statistical properties of each proposed method, we use simulated spectra-level data under multiple scenarios. The scenarios let us evaluate the type I error, power, bias, √MSE, and 95% coverage. Refer to Section 2.3.5 for an explanation of the statistical terms. The simulated data is normally distributed and not skewed; therefore, the mean and median should lead to similar results. Hence, for simulation purposes we use the mean value. Each simulation study compares 8 models:
1) Mixed model with group fixed effect and plate, channel nested within plate, and peptide nested within plate and channel random effects with all spectra data points (Mix1);
2) Mixed model with group fixed effect and plate, channel nested within plate, and peptide random effects with all spectra data points (Mix2);
3) Mixed model with group fixed effect and plate and channel nested within plate random effects with all spectra data points (Mix3);
4) Mixed model with group fixed effect and plate and channel nested within plate random effects where peptide mean value is outcome (Mix4);
5) Mixed model with group fixed effect and plate random effect where subject mean value is outcome (Mix5);
6) General linear model with group and plate as fixed effects where subject mean value is outcome (GLM1);
7) General linear model with group as fixed effect where subject mean value is outcome (GLM2);
8) LIMMA with group and plate as fixed effects where subject mean value is outcome (LIM).

The bias is the same for all methods. The differences are in the MSE and coverages, which are indicative of the variability in the coefficients. We focus on the MSE, coverages, type I error, and power. In the figures, the x-axis is a categorical variable that combines the number of spectra and the amount of variance: smpep=2 spectra, mdpep=6 spectra, and, writing $(\sigma_1, \sigma_2, \sigma_3, \sigma_\epsilon)$ for the three random-effect components followed by the residual component, novar is $(\sigma_1, \sigma_2, \sigma_3, \sigma_\epsilon) = (0,0,0,2)$, sovar is $(\sigma_1, \sigma_2, \sigma_3, \sigma_\epsilon) = (1,1,3,2)$, and movar is $(\sigma_1, \sigma_2, \sigma_3, \sigma_\epsilon) = (1,3,3,2)$. Smpep indicates small number of peptides, mdpep indicates medium number of peptides, novar indicates no variance, sovar indicates some variance, and movar indicates more variance. Refer to Figure 1 for the plots of √MSE vs. the amount of variance and number of spectra by varying coefficient values of group. All of the GLM and mixed models have the same √MSE. Hence, the lines overlap in the figure for GLM and the mixed models. LIMMA consistently has a larger √MSE. For each variance scenario, when more spectra are added the √MSE decreases. As the amount of variance in the subject and peptide components increases, the √MSE increases. The increase in the group coefficient does not seem to impact the √MSE values. This indicates LIMMA has more variability than the other methods. To evaluate type I error and power, we let the group coefficient value range from 0 to 1 with values in between. However, 0 and 1 will tell us the most about type I error and power, respectively. Figure 2 displays the % significant vs. the amount of variance and number of


spectra by varying coefficient values of group. The proportion of significant findings increases when there are more spectra per peptide across all methods. This makes sense, as more information is provided. LIMMA tends to be more liberal when there is no variability in the components, meaning that its type I error and power are larger than those of all other methods. When there is more variability in the components, the type I error tends to be slightly larger for all the mixed model approaches and the GLM with plate effect. The GLM not adjusting for plate effect tends to have smaller type I error and power. This indicates that the plate effect should be adjusted for in the GLM mean protein model. When there is no variance and B1=0, the mixed models (except for the peptide mean model) tend to have a smaller type I error (~3%) than the other approaches. When B1=1, the power is much larger and closer to 1 when there is no variance in the plate, subject, and peptide components. As more variance is added to plate, subject, and peptides, the power decreases. LIMMA tends to have more power, and the GLM without plate effect less power, than all other methods. The power is about the same across the mixed model approaches and the GLM with the plate effect.

The coverages tend to be close to the nominal 95% (Figure 3). The coverage decreases a little when there are more spectra per peptide. LIMMA and the GLM without the plate effect hover around 95% across all scenarios. In the case with no variability in all components, the mixed model approaches (all except the mean peptide model) are a bit above 96%, and the GLM with plate effect and the mixed model with mean peptide are around 93-94%. When there is more variability in the components, all the mixed model approaches and the GLM with plate effect drop to 93-94%, indicating that they are slightly more liberal, as demonstrated in the type I error plots.
The 8 models perform similarly; however, LIMMA performs the best across all scenarios, with the best power, decent type I error, and good coverages, despite having the largest MSE.
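The quantities compared across models in this section can be computed from the replicate estimates of the group coefficient as in the sketch below. It uses the textbook definitions of bias, root MSE, coverage, and proportion significant, which may differ in detail from those in Section 2.3.5; the function name is hypothetical, and a normal critical value of 1.96 is assumed in place of model-specific t quantiles.

```python
import math


def summarize_simulation(estimates, std_errors, true_value, z=1.96):
    """Summarize replicate estimates of one coefficient from a simulation.

    estimates, std_errors: one entry per simulated dataset.
    true_value: the coefficient used to generate the data.
    Returns bias, root MSE, confidence-interval coverage, and the proportion
    of replicates declared significant (type I error when true_value is 0,
    power otherwise).
    """
    n = len(estimates)
    bias = sum(estimates) / n - true_value
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    covered = sum(
        1 for e, se in zip(estimates, std_errors) if abs(e - true_value) <= z * se
    )
    significant = sum(1 for e, se in zip(estimates, std_errors) if abs(e) > z * se)
    return {
        "bias": bias,
        "rmse": math.sqrt(mse),
        "coverage": covered / n,
        "prop_significant": significant / n,
    }


# Hypothetical replicate estimates for a true group coefficient of 1.
summary = summarize_simulation(
    estimates=[0.9, 1.1, 1.3, 0.7],
    std_errors=[0.1, 0.1, 0.1, 0.1],
    true_value=1.0,
)
```

The sketch also shows how two methods can share the same bias and still differ in coverage or proportion significant: those two quantities depend on the standard errors as well as on the point estimates.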


Figure 1 RMSE (square-root MSE) vs. amount of variance and #spectra by group coefficient (B1). All mixed models and GLMs have the same RMSE values, which results in overlapping lines. This plot compares the RMSE across varying amounts of variance and numbers of spectra. The x-axis is a categorical variable that combines the number of spectra and amount of variance into six categories: smpep indicates small number (2) of peptides, mdpep indicates medium number (6) of peptides, and, writing $(\sigma_1, \sigma_2, \sigma_3, \sigma_\epsilon)$ for the three random-effect components followed by the residual component, novar indicates no variance, where $(\sigma_1, \sigma_2, \sigma_3, \sigma_\epsilon) = (0,0,0,2)$; sovar indicates some variance, where $(\sigma_1, \sigma_2, \sigma_3, \sigma_\epsilon) = (1,1,3,2)$; and movar indicates more variance, where $(\sigma_1, \sigma_2, \sigma_3, \sigma_\epsilon) = (1,3,3,2)$.


Figure 2 Proportion significant vs. amount of variance and #spectra by group coefficient (B1) (top to bottom): a) B1 =(0,.2), b) B1 =(0.5,1). This plot demonstrates the type I error and power. The x-axis is a categorical variable that combines the number of spectra and amount of variance as described in the caption for Figure 1.


Figure 3 Coverage vs. amount of variance and #spectra by group coefficient (B1). The x-axis is a categorical variable that combines the number of spectra and amount of variance as described in the caption for Figure 1.

3.3 SLE data

The SLE dataset was used to demonstrate how the various methods identify proteins that differ between the SLE and healthy normal groups. A total of 726 proteins were identified and 708 that have [...] 1) >1 peptide/protein & >1 PSM/peptide (#proteins=358), and 2) >2 peptide/protein & >1 PSM/peptide (#proteins = 294). The number of proteins was reduced by almost half for both restrictions. The trends in the findings with restrictions are similar to those with no restriction. Although there are fewer findings with the restrictions than with no restrictions, the percentage of findings increased by ~3% for both sets of restrictions compared to no restriction. The FDR reduced the findings by ~1/3 with restrictions. There are fewer FDR findings with the restrictions than with no


restrictions; however, the percentage of findings increased slightly for both sets of restrictions compared to no restriction.

Table 7 presents the results across the 4 methods using quantile normalization with and without peptide/PSM restrictions. We first summarize the results with no restrictions. MIX1 is the most conservative method, and its findings are a subset of those of all other methods. MIX2 is the next most conservative approach, with a majority of its findings being a subset of those of GLM and LIMMA (Figure 5). LIMMA is the most liberal method, and GLM and LIMMA have a large amount of overlap. The FDR reduced the findings by ~¼. After the FDR adjustment, the MIX1 findings are a subset of those of MIX2, GLM, and LIMMA; MIX2 and GLM have the same findings, and both are a subset of LIMMA. The standard errors tend to be smallest for LIMMA, next smallest for GLM, larger for MIX2, and largest for MIX1.

We also restricted the analysis by the number of unique peptides per protein and the number of PSMs per peptide: 1) >1 peptide/protein & >1 PSM/peptide (#proteins=358), and 2) >2 peptide/protein & >1 PSM/peptide (#proteins = 294). The number of proteins evaluated was reduced by almost half relative to no restriction. The number of findings was reduced for all methods, but the percentage of findings increased by ~2% for both sets of restrictions compared to no restriction. The overlap across methods remained the same across all restriction scenarios. The FDR reduced the findings by ~1/3 with the restrictions, which is less of a reduction than with no peptide/PSM restriction. With the FDR of 5%, the results are similar to those obtained without restricting the number of peptides and PSMs: MIX1 was a subset of all 3 other methods, and the other 3 methods have large overlap.

When data are correlated the SE can be larger, and if the data are very noisy across all measurements this will lead to a larger variance. LIMMA pools towards a common variance; therefore, if that common variance is smaller, the larger variances shrink


towards it and become smaller. The mixed model does have a corrected variance that measures the within and between variability. With quantile normalization, we noticed that the experiment and channel variance components tend to be much smaller than the peptide variance component. Median sweeping did not shift the variance from the experiment and subject to the peptides. We suspect this shift of variance occurs with quantile normalization because it forces the channel distributions to be the same when this may not hold, whereas median sweeping only shifts the distributions. Quantile normalization is possibly overcorrecting and reducing the biological variability too much, but this needs further investigation. The number of subjects (number of clusters) is small and the number of repeated measurements (cluster size) can vary from 1 to the hundreds, which has been shown to affect the behavior of the mixed model 47-49.

With median sweeping the number of findings was similar across all methods. There was large overlap across all 4 methods, with GLM being a subset of the mixed model. Regardless of restrictions or the use of raw p-values or FDR, with quantile normalization the order of the methods from fewest to most findings was: MIX1, MIX2, GLM, then LIMMA. In addition, with quantile normalization the findings of these methods tended to be subsets of each other or to have large overlap. For both normalization approaches, restricting the frequency of peptides and PSMs led to a larger percentage of findings. It has been noted previously that these may be more reliable proteins. Even after applying the FDR, fewer findings were lost with the peptide/PSM frequency restriction, indicating that these may be the more reliable proteins. We also noticed that the results were more consistent across all methods when using median sweeping than when using quantile normalization. Median sweeping led to more similar standard errors across all 4 methods than did quantile normalization. The trend of LIMMA having smaller standard errors held across both normalization approaches, and the


difference between standard errors was larger across methods with quantile normalization. Median sweeping is the preferred normalization method.

Table 6 SLE results with median sweeping

| Restriction | Criterion | Mix1 | Mix2 | GLM | LIMMA |
|---|---|---|---|---|---|
| No restriction | Significant at α=.05 (n=708) | 54 (7.6%) | 54 (7.6%) | 48 (6.8%) | 55 (7.7%) |
| No restriction | Significant at q=.05 (n=708) | 13 (1.8%) | 10 (1.4%) | 11 (1.6%) | 12 (1.7%) |
| #PEP/Prot>1 & PSM/PEP>1 | Significant at α=.05 (n=358) | 40 (11.2%) | 40 (11.2%) | 34 (9.5%) | 37 (10.3%) |
| #PEP/Prot>1 & PSM/PEP>1 | Significant at q=.05 (n=358) | 8 (2.2%) | 8 (2.2%) | 9 (2.5%) | 10 (2.8%) |
| #PEP/Prot>2 & PSM/PEP>1 | Significant at α=.05 (n=294) | 33 (11.2%) | 33 (11.2%) | 28 (9.5%) | 31 (10.5%) |
| #PEP/Prot>2 & PSM/PEP>1 | Significant at q=.05 (n=294) | 5 (1.7%) | 5 (1.7%) | 6 (2.0%) | 6 (2.0%) |
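The q = .05 counts reported above come from an FDR adjustment of the per-protein p-values. As an illustration only, the sketch below applies the Benjamini-Hochberg step-up adjustment; this procedure is an assumption here, since this section does not name the FDR method used. The p-values are made up and `benjamini_hochberg` is a hypothetical helper.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


p = [0.001, 0.008, 0.039, 0.041, 0.20, 0.60]  # made-up raw p-values
q = benjamini_hochberg(p)
n_raw = sum(value < 0.05 for value in p)  # findings at alpha = .05
n_fdr = sum(value < 0.05 for value in q)  # findings at q = .05
print(n_raw, n_fdr)
```

In this toy input the adjustment keeps only the strongest findings, which is the same qualitative drop seen between the α=.05 and q=.05 rows.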

Table 7 SLE results with quantile normalization

| Restriction | Criterion | Mix1 | Mix2 | GLM | LIMMA |
|---|---|---|---|---|---|
| No restriction | Significant at α=.05 (n=708) | 30 (4.2%) | 50 (7.1%) | 63 (8.9%) | 66 (9.3%) |
| No restriction | Significant at q=.05 (n=708) | 9 (1.3%) | 14 (2.0%) | 14 (2.3%) | 16 (2.3%) |
| #PEP/Prot>1 & PSM/PEP>1 | Significant at α=.05 (n=358) | 23 (6.4%) | 28 (7.8%) | 40 (11.2%) | 43 (12.0%) |
| #PEP/Prot>1 & PSM/PEP>1 | Significant at q=.05 (n=358) | 9 (2.5%) | 9 (2.5%) | 10 (2.8%) | 13 (3.6%) |
| #PEP/Prot>2 & PSM/PEP>1 | Significant at α=.05 (n=294) | 18 (6.1%) | 22 (7.5%) | 33 (11.2%) | 36 (12.2%) |
| #PEP/Prot>2 & PSM/PEP>1 | Significant at q=.05 (n=294) | 5 (1.7%) | 6 (2.0%) | 6 (2.0%) | 10 (3.4%) |
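The contrast drawn between the two normalization approaches (a shift of each distribution versus forcing all channel distributions to be identical) can be seen on a toy matrix. The sketch below is a simplified illustration under stated assumptions: rows are spectra, columns are TMT channels, values are log2 intensities, and median sweeping is taken to mean subtracting each spectrum's median and then each channel's median. The protein-level summarization step is omitted, and the function names and data are hypothetical.

```python
import statistics


def median_sweep(matrix):
    """Subtract row (spectrum) medians, then column (channel) medians."""
    rows = [[x - statistics.median(row) for x in row] for row in matrix]
    col_medians = [statistics.median(col) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, col_medians)] for row in rows]


def quantile_normalize(matrix):
    """Give every column the same distribution: the mean of the sorted columns."""
    n_rows = len(matrix)
    columns = [list(col) for col in zip(*matrix)]
    sorted_cols = [sorted(col) for col in columns]
    # Reference value for rank k is the mean of the k-th smallest value per channel.
    reference = [statistics.fmean([sc[k] for sc in sorted_cols]) for k in range(n_rows)]
    normalized_cols = []
    for col in columns:
        ranks = sorted(range(n_rows), key=lambda i: col[i])
        new_col = [0.0] * n_rows
        for k, i in enumerate(ranks):
            new_col[i] = reference[k]  # value with rank k gets reference[k]
        normalized_cols.append(new_col)
    return [list(row) for row in zip(*normalized_cols)]


# Made-up log2 intensities: 4 spectra (rows) by 3 channels (columns).
toy = [
    [10.0, 11.0, 9.0],
    [12.0, 13.5, 10.0],
    [8.0, 8.5, 8.0],
    [11.0, 13.0, 9.5],
]
swept = median_sweep(toy)
qnorm = quantile_normalize(toy)
print([sorted(col) for col in zip(*swept)])  # channels centered, spreads differ
print([sorted(col) for col in zip(*qnorm)])  # channels share one distribution
```

After median sweeping each channel is centered at zero but keeps its own spread, whereas after quantile normalization the sorted values of every channel coincide, so any real between-channel difference in distribution is removed.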


4 Conclusion

Proteomic data acquired on a mass spectrometry platform can be noisy, large, and hierarchical in nature, with considerable imbalance due to data acquisition and preprocessing. GLM is the most widely used statistical method in proteomics. LIMMA is popular in the microarray area and is gaining traction in proteomics. Given the characteristics of proteomics data, mixed models are a natural choice for addressing the complex factors and their impact on the experimental outcome. We compared the well-known GLM method to LIMMA and the mixed model by


evaluating their statistical properties with TMT labelling proteomics data. In addition, we compared quantile normalization to median sweeping across all methods. Three datasets were evaluated to determine when each method would be preferred. To assess type I error and power we needed a simulation study and/or a spiked-in experiment. The benefit of a simulation study is that the type of data is completely controlled and the answer is known; however, it can be difficult to simulate proteomics data. The nice feature of the spiked-in experiment is that the answer is known, but these experiments tend to use technical replicates, which limits our evaluation with biological replicate data.

It has been shown by Bell et al.47, Dieleman et al.48, and Galbraith et al.49 that a small number of clusters and small cluster sizes can impact the performance of the mixed model. This was demonstrated with our simulation study. As proteomic studies tend to have small sample sizes and the number of PSMs and peptides can vary drastically, the mixed model may not behave as expected. This could be the reason that the mixed model works better for some proteins than for others. We noticed that the methods can be sensitive to the normalization method, with the mixed model being the most impacted across both datasets. With median sweeping, LIMMA had the best power and decent type I error, whereas the mixed model had the best type I error. With quantile normalization, LIMMA had the best power and type I error. The bias and MSE were smaller for the mixed model when the fold changes were smaller, but as the fold change increased LIMMA and GLM had smaller bias and MSE. LIMMA is the preferred method. All models provide similar coefficient values except for the intercept, which would lead to different predictions but not different fold changes. Each model handles its variance estimates differently, which directly affects inference. A benefit of the mixed model is that it


directly addresses the noise in the model, attributes the noise to different sources, is very flexible, and can provide a corrected variance while using all measurements to estimate the parameters. However, sample sizes can be small, and LIMMA is intended to address the variance through shrinkage in the presence of small sample sizes. The disadvantage of LIMMA is that it cannot address data collected over multiple timepoints or correlated data through the variance structure and random effects, whereas the mixed model can. GLM seems acceptable and it does not make any adjustments to the SE. Median sweeping led to more similar results across methods than quantile normalization did. Median sweeping also retained the channel variability, whereas quantile normalization shifted the variability from the channel to the peptides. This finding suggests median sweeping may be preferred, since it applies a distribution shift rather than forcing the channel distributions to be the same when this may not hold true.

There have been other efforts to compare modeling approaches. Herbrich et al.18 compared the GLM with median sweeping to the mixed model with mean sweeping and found the GLM with median sweeping to be preferred. Kammers et al.2 compared a t-test/GLM to LIMMA, both with median sweeping, and found LIMMA to be preferred. We have shown that the normalization method may make a difference, as others have9, 18, 34. We will evaluate additional normalization approaches under various scenarios in future work. We also plan to evaluate different missing data approaches to determine if there is a preferred approach. A proposed method would be to combine the empirical Bayes of LIMMA with the mixed model, as suggested by an approach implementing ridge regression through the mixed model combined with empirical Bayes7. Such a hybrid approach could address both the correlated data and the variance shrinkage.
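The variance shrinkage attributed to LIMMA above can be illustrated with the empirical Bayes posterior variance of Smyth6, in which each protein's sample variance is averaged with a prior variance, weighted by their degrees of freedom. The sketch below is illustrative only: in LIMMA the prior variance and prior degrees of freedom are estimated from all proteins, whereas here they are fixed, made-up values, and `moderated_variance` is a hypothetical helper.

```python
def moderated_variance(s2, df, s2_prior, df_prior):
    """Posterior variance: degrees-of-freedom weighted average of prior and sample."""
    return (df_prior * s2_prior + df * s2) / (df_prior + df)


protein_vars = [0.02, 0.10, 0.50, 1.50]  # made-up per-protein sample variances
df_residual = 4                           # residual degrees of freedom per protein
s2_prior, df_prior = 0.20, 3.0            # hypothetical prior values (not estimated)

shrunk = [
    moderated_variance(s2, df_residual, s2_prior, df_prior) for s2 in protein_vars
]
for raw, post in zip(protein_vars, shrunk):
    print(f"sample variance {raw:.2f} -> moderated variance {post:.3f}")
```

Variances above the prior are pulled down and variances below it are pulled up, so proteins with noisy variance estimates receive smaller standard errors than an unmoderated GLM would give them, which is consistent with LIMMA having the smallest standard errors when the common variance is small.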


ASSOCIATED CONTENT

Supporting Information. Experimental details and results. This material is available free of charge via the Internet at http://pubs.acs.org. Detailed methods describing sample preparation, multiplexed isobaric labeling, MASCOT search parameters, and LC-MS2 mass spectrometry analysis; Table S1: List of purified recombinant human proteins spiked in E. coli lysates; Table S2: Amounts of proteins spiked in each TMT channel; Table S3: Summary statistics of fold changes spike-in experiment with median sweeping; Table S4: Summary statistics of fold changes spike-in experiment with quantile normalization; code for models

AUTHOR INFORMATION

Corresponding Author
*Phone: 301-398-0975. Email: [email protected].

ORCID
Gina D’Angelo: 0000-0002-5327-086X

Author Contributions
The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.

ACKNOWLEDGEMENTS

The work was solely supported by MedImmune. We have no external funding sources to report.


ABBREVIATIONS

DDA: data dependent acquisition
FASP: filter-aided sample preparation
FDR: false discovery rate
GLM: general linear model
LIMMA: Linear Models for Microarray Data
LOD: limit of detection
MNAR: missing not at random
MSE: mean-squared error
PSM: peptide spectral match
SLE: systemic lupus erythematosus
SE: standard error
TMT: tandem mass tags

REFERENCES

1. Browne, W. J.; Dryden, I. L.; Handley, K.; Mian, S.; Schadendorf, D., Mixed effect modelling of proteomic mass spectrometry data using Gaussian mixtures. Journal of the Royal Statistical Society Series C 2010, 59, 617-633.
2. Kammers, K.; Cole, R. N.; Tiengwe, C.; Ruczinski, I., Detecting significant changes in protein abundance. EuPA Open Proteom 2015, 7, 11-19.
3. Oberg, A. L.; Mahoney, D. W.; Eckel-Passow, J. E.; Malone, C. J.; Wolfinger, R. D.; Hill, E. G.; Cooper, L. T.; Onuma, O. K.; Spiro, C.; Therneau, T. M.; Bergen, H. R., 3rd., Statistical analysis of relative labeled mass spectrometry data from complex samples using ANOVA. J Proteome Res. 2008, 7 (1), 225-233.
4. Hu, J.; Coombes, K. R.; Morris, J. S.; Baggerly, K. A., The importance of experimental design in proteomic mass spectrometry experiments: some cautionary tales. Brief Funct Genomic Proteomic 2005, 3, 322-331.
5. Morris, J. S.; Baggerly, K. A.; Gutstein, H. B.; Coombes, K. R., Statistical contributions to proteomic research. Methods Mol Biol 2010, 641, 143-166.
6. Smyth, G. K., Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004, 3 (1), 1-25.
7. Goeminne, L. J. E.; Gevaert, K.; Clement, L., Peptide-level robust ridge regression improves estimation, sensitivity, and specificity in data-dependent quantitative label-free shotgun proteomics. Molecular & Cellular Proteomics 2016, 15 (2), 657-668.
8. Schwämmle, V.; León, I. R.; Jensen, O. R., Assessment and Improvement of Statistical Tools for Comparative Proteomics Analysis of Sparse Data Sets with Few Experimental Replicates. J. Proteome Res 2013, 12 (9), 3874-3883.
9. Ting, L.; Cowley, M. J.; Hoon, S. L.; Guilhaus, M.; Raftery, M. J.; Cavicchioli, R., Normalization and Statistical Analysis of Quantitative Proteomics Data Generated by Metabolic Labeling. Molecular & Cellular Proteomics 2009, 8 (10), 2227-2242.
10. Lonnstedt, I.; Speed, T., Replicated microarray data. Statistica Sinica 2002, 12, 31-46.
11. Diggle, P.; Heagerty, P.; Liang, K. Y.; Zeger, S., Analysis of longitudinal data. 2nd ed.; Oxford University Press: Oxford, 2002.
12. Liang, K. Y.; Zeger, S. L., Longitudinal data analysis using generalized linear models. Biometrika 1986, 73, 13-22.
13. Clough, T.; Braun, S.; Fokin, V.; Ott, I.; Ragg, S.; Schadow, G.; Vitek, O., Statistical design and analysis of label-free LC-MS proteomic experiments: a case study of coronary artery disease. Methods Mol Biol. 2011, (728), 293-319.
14. Clough, T.; Key, M.; Ott, I.; Ragg, S.; Schadow, G.; Vitek, O., Protein Quantification in Label-Free LC-MS Experiments. Journal of Proteome Research 2009, 8, 5275-5284.
15. Clough, T.; Thaminy, S.; Ragg, S.; R., A.; Vitek, O., Statistical protein quantification and significance analysis in label-free LC-MS experiments with complex designs. BMC bioinformatics 2012, 13 (Suppl 16), S6.
16. Daly, D. S.; Anderson, K. K.; Panisko, E. A.; Purvine, S. O.; Fang, R.; Monroe, M. E.; Baker, S. E., Mixed-effects statistical model for comparative LC-MS proteomics studies. The Journal of Proteome Research 2008, 7, 1209-1217.
17. Chang, C. Y.; Picotti, P.; Hüttenhain, R.; Heinzelmann-Schwarz, V.; Jovanovic, M.; Aebersold, R.; Vitek, O., Protein significance analysis in selected reaction monitoring (SRM) measurements. Molecular & cellular proteomics : MCP 2012, 11 (4), M111.014662.
18. Herbrich, S. M.; Cole, R. N.; West, K. P.; Schulze, K.; Yager, J. D.; Groopman, J. D.; Christian, P.; Wu, L.; O'Meally, R. N.; May, D. H.; McIntosh, M. W.; Ruczinski, I., Statistical inference from multiple iTRAQ experiments without using common reference standards. J. Proteome Res 2013, 12, 594-604.
19. Cole, R. N.; Ruczinski, I.; Schulze, K.; Christian, P.; Herbrich, S.; Wu, L.; Devine, L. R.; O'Meally, R. N.; Shrestha, S.; Boronina, T. N.; Yager, J. D.; Groopman, J.; West, K. P., Jr, The plasma proteome identifies expected and novel proteins correlated with micronutrient status in undernourished Nepalese children. J Nutr 2013, 143 (10), 1540-8.
20. Käll, L.; Canterbury, J. D.; Weston, J.; Noble, W. S.; MacCoss, M. J., Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat Methods 2007, 4 (11), 923-5.
21. Vizcaíno, J. A.; Csordas, A.; del-Toro, N.; Dianes, J. A.; Griss, J.; Lavidas, I.; Mayer, G.; Perez-Riverol, Y.; Reisinger, F.; Ternent, T.; Xu, Q. W.; Wang, R.; Hermjakob, H., 2016 update of the PRIDE database and related tools. Nucleic Acids Res 2016, 44 (D1), D447-D456.
22. Sandberg, A.; Branca, R. M. M.; Lehtiö, J.; Forshed, J., Quantitative accuracy in mass spectrometry based proteomics of complex samples: The impact of labeling and precursor interference. Journal of proteomics 2014, 96, 133-144.
23. Karpievitch, Y.; Stanley, J.; Taverner, T.; Huang, J.; Adkins, J. N.; Ansong, C.; Heffron, F.; Metz, T. O.; Qian, W. J.; Yoon, H.; Smith, R. D.; Dabney, A. R., A statistical framework for protein quantitation in bottom-up MS-based proteomics. Bioinformatics 2009, 25 (16), 2028-34.
24. Luo, R.; Zhao, H., Protein quantitation using iTRAQ: Review on the sources of variations and analysis of nonrandom missingness. Stat Interface. 2012, 5 (1), 99-107.
25. Webb-Robertson, B. J. M.; Wiberg, H. K.; Matzke, M. M.; Brown, J. N.; Wang, J.; McDermott, J. E.; Smith, R. D.; Rodland, K. D.; Metz, T. O.; Pounds, J. G.; Waters, K. M., Review, evaluation, and discussion of the challenges of missing value imputation for mass spectrometry-based label-free global proteomics. J. Proteome Res 2015, 14 (5), 1993-2001.
26. Goeminne, L. J.; Argentini, A.; Martens, L.; Clement, L., Summarization vs peptide-based models in label-free quantitative proteomics: performance, pitfalls, and data analysis guidelines. J Proteome Res. 2015, 14 (6), 2457-65.
27. Choi, M.; Chang, C. Y.; Clough, T.; Broudy, D.; Killeen, T.; MacLean, B.; Vitek, O., MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments. Bioinformatics 2014, 30 (17), 2524-26.
28. Cox, J.; Mann, M., MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nature Biotechnology 2008, 26 (12), 1367-72.
29. Lazar, C.; Gatto, L.; Ferro, M.; Bruley, C.; Burger, T., Accounting for the multiple natures of missing values in label-free quantitative proteomics data sets to compare imputation strategies. J Proteome Res. 2016, 15 (4), 1116-25.
30. Schwämmle, V.; Verano-Braga, T.; Roepstorff, P., Computational and statistical methods for high-throughput analysis of post-translational modifications of proteins. Journal of proteomics 2015, 129, 3-15.
31. Lazar, C. imputeLCMD: A Collection of Methods for Left-Censored Missing Data Imputation, R package, version 2.0.
32. Karp, N. A.; Huber, W.; Sadowski, P. G.; Charles, P. D.; Hester, S. V.; Lilley, K. S., Addressing accuracy and precision issues in iTRAQ quantitation. Mol Cell Proteomics 2010, 9 (9), 1885-97.
33. Weisberg, S., Applied Linear Regression. 2nd ed.; John Wiley & Sons, Inc.: New York, 1985.
34. Callister, S. J.; Barry, R. C.; Adkins, J. N.; Johnson, E. T.; Qian, W. J.; Webb-Robertson, B. J.; Smith, R. D.; Lipton, M. S., Normalization approaches for removing systematic biases associated with mass spectrometry and label-free proteomics. J Proteome Res. 2006, 5 (2), 277-86.
35. Maes, E.; Hadiwikarta, W. W.; Mertens, I.; Baggerman, G.; Hooyberghs, J.; Valkenborg, D., CONSTANd : a normalization method for Isobaric labeled spectra by constrained optimization. Molecular & cellular proteomics : MCP 2016, 15 (8), 2779-90.
36. Bradshaw, R. A.; Burlingame, A. L.; Carr, S.; Aebersold, R., Reporting protein identification data: the next generation of guidelines. Molecular & cellular proteomics : MCP 2006, 5 (5), 787-8.
37. Carr, S.; Aebersold, R.; Baldwin, M.; Burlingame, A.; Clauser, K.; Nesvizhskii, A.; Data, W. G. o. P. G. f. P. a. P. I., The need for guidelines in publication of peptide and protein identification data: Working Group on Publication Guidelines for Peptide and Protein Identification Data. Molecular & cellular proteomics : MCP 2004, 3 (6), 531-3.
38. Gupta, N.; Pevzner, P. A., False discovery rates of protein identifications: a strike against the two-peptide rule. J Proteome Res. 2009, 8 (9), 4173-4181.
39. Omenn, G. S.; States, D. J.; Adamski, M.; Blackwell, T. W.; Menon, R.; Hermjakob, H.; Apweiler, R.; Haab, B. B.; Simpson, R. J.; Eddes, J. S.; Kapp, E. A.; Moritz, R. L.; Chan, D. W.; Rai, A. J.; Admon, A.; Aebersold, R.; Eng, J.; Hancock, W. S.; Hefta, S. A.; Meyer, H.; Paik, Y. K.; Yoo, J. S.; Ping, P.; Pounds, J.; Adkins, J.; Qian, X.; Wang, R.; Wasinger, V.; Wu, C. Y.; Zhao, X.; Zeng, R.; Archakov, A.; Tsugita, A.; Beer, I.; Pandey, A.; Pisano, M.; Andrews, P.; Tammen, H.; Speicher, D. W.; Hanash, S. M., Overview of the HUPO Plasma Proteome Project: results from the pilot phase with 35 collaborating laboratories and multiple analytical groups, generating a core dataset of 3020 proteins and a publicly-available database. Proteomics 2005, 5 (13), 3226-45.
40. Zhang, Y.; Xu, T.; Shan, B.; Hart, J.; Aslanian, A.; Han, X.; Zong, N.; Li, H.; Choi, H.; Wang, D.; Acharya, L.; Du, L.; Vogt, P. K.; Ping, P.; Yates, J. R., 3rd, ProteinInferencer: Confident protein identification and multiple experiment comparison for large scale proteomics projects. Journal of proteomics 2015, 129, 25-32.
41. Roy, A., Estimating correlation coefficient between two variables with repeated observations using mixed effects model. Biometrical journal. Biometrische Zeitschrift 2006, 48 (2), 286-301.
42. Mahoney, D. W.; Therneau, T. M.; Heppelmann, C. J.; Higgins, L.; Benson, L. M.; Zenka, R. M.; Jagtap, P.; Nelsestuen, G. L.; Bergen, H. R.; Oberg, A. L., Relative quantification: characterization of bias, variability and fold changes in mass spectrometry data from iTRAQ-labeled peptides. J Proteome Res. 2011, 10 (9), 4325-33.
43. Ow, S. Y.; Salim, M.; Noirel, J.; Evans, C.; Rehman, I.; Wright, P. C., iTRAQ underestimation in simple and complex mixtures: “The good, the bad and the ugly”. Journal of Proteome Research 2009, 8, 5347-5355.
44. Ting, L.; Rad, R.; Gygi, S. P.; Haas, W., MS3 eliminates ratio distortion in isobaric labeling-based multiplexed quantitative proteomics. Nat Methods 2012, 8 (11), 937-940.
45. Oberg, A. L.; Mahoney, D. W., Statistical methods for quantitative mass spectrometry proteomic experiments with labeling. BMC bioinformatics 2012, 13 Suppl 16, S7.
46. Mar, J. C.; Kimura, Y.; Schroder, K.; Irvine, K. M.; Hayashizaki, Y.; Suzuki, H.; Hume, D.; Quackenbush, J., Data-driven normalization strategies for high-throughput quantitative RT-PCR. BMC bioinformatics 2009, 10 (110), 1-10.
47. Bell, B. A.; Morgan, G. B.; Kromrey, J. D.; Ferron, J. M., The Impact of Small Cluster Size on Multilevel Models: A Monte Carlo Examination of Two-Level Models with Binary and Continuous Predictors. In JSM Proceedings, Survey Research Methods Section, 2010; pp 4057-4067.
48. Dieleman, J. L.; Templin, T., Random-effects, fixed-effects and the within-between specification for clustered data in observational health studies: a simulation study. PloS one 2014, 9 (10), e110257.
49. Galbraith, S.; Daniel, J. A.; Vissel, B., A study of clustered data and approaches to its analysis. J Neurosci 2010, 30, 10601-8.
