
Article

Can “normal” protein expression ranges be estimated with high-throughput proteomics? Roger Higdon, and Eugene Kolker J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.5b00176 • Publication Date (Web): 16 Apr 2015 Downloaded from http://pubs.acs.org on April 19, 2015


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Proteome Research

Can “normal” protein expression ranges be estimated with high-throughput proteomics?

Roger Higdon 1,2 and Eugene Kolker 1-4*

1 Bioinformatics and High-Throughput Analysis Laboratory, Center for Developmental Therapeutics, Seattle Children’s Research Institute, Seattle, WA, USA
2 CDO Analytics, Seattle Children’s Hospital, Seattle, WA, USA
3 Departments of Biomedical Informatics and Medical Education and Pediatrics, University of Washington, Seattle, WA, USA
4 Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA, USA

Abstract

Although biological science discovery often involves comparing conditions to a normal state, in proteomics little is actually known about normal. Two Human Proteome studies featured in Nature offer new insights into protein expression and an opportunity to assess how high-throughput proteomics measures normal protein ranges. We use data from these studies to estimate technical and biological variability in protein expression and compare them to other expression datasets from normal tissue. Results show that measured protein expression across same-tissue replicates varies by only +/- 4 to 10 fold for most proteins. Coefficients of variation (CV) for protein expression measurements range from 62% to 117% across different tissue experiments; however, adjusting for technical variation reduced this variability by as much as 50%. In addition, the CV could be reduced by limiting comparisons to proteins with at least 3 unique peptide identifications, as the CV was on average 33% lower than for proteins with 2 or fewer peptide identifications. We also selected 13 housekeeping proteins and genes that were expressed across all tissues with low variability to determine their utility as a reference set for normalization and comparative purposes. These results present the first step towards estimating normal protein ranges by determining the variability in expression measurements through combining publicly available data. They support an approach that combines standard protocols with replicates of normal tissues to estimate normal protein ranges for large numbers of proteins and tissues. This would be a tremendous resource for normal cellular physiology and comparisons of proteomics studies.

Keywords: proteomics, protein expression, gene expression, variability, normal tissue, normal range, spectral counts

Background and Significance

Two Human Proteome studies featured in the May 29, 2014 issue of Nature (Kim et al. and Wilhelm et al.) offer valuable new insights into protein expression across a broad range of human tissues 1,2. High-throughput shotgun mass spectrometry (MS) proteomic studies have become a standard approach to identify and quantify thousands of proteins in a single experiment 3-5. These studies, as with most broad efforts, primarily examined protein abundance patterns but did not attempt to examine individual protein concentration ranges 1,2,6. Thanks to the authors’ efforts to make the data easily accessible, we were able to perform additional analyses on an individual protein level. These analyses allow us to explore how high-throughput proteomics can be used to estimate the range of “normal” protein expression for thousands of proteins.

A number of efforts exist to provide estimates of protein abundance. Our own MOPED database provides estimates of protein concentration for thousands of proteins across a range of experiments 7-9. Other resources, including PaxDB, PeptideAtlas, and MaxQB, also provide expression estimates for many proteins across tissues and cell types 10-12. Still, these resources only provide single estimates of expression for a protein rather than a range. Resources such as MOPED try to integrate all available proteomics experiments, yet because of the cost and complexity of proteomics experiments very few contain the number of biological replicates required to estimate a “normal” range of expression. Two exceptions were published by the Snyder lab. These studies examine the expression of different omics, including proteomics, and contain multiple biological replicates 13-15. Two approaches were used: repeated sampling within a single individual (blood), and sampling across many individuals (lymphoblastoid cells). Additionally, normal protein concentration ranges have been reported for a small number of protein biomarkers in blood 16. Other studies have also estimated variability of protein expression in 2-DIGE experiments for roughly 100 proteins in blood or cerebrospinal fluid 17-21.

A much larger number of gene expression datasets are available through resources such as GEO. There have been many attempts to investigate normal gene expression using these data 22. These studies include a large effort by Su et al. (2004) to assay 79 normal tissues with limited replicates (mainly 2 per tissue) that have been re-processed through resources such as BioGPS 23,24. Additionally, resources have been focused on identifying widely expressed housekeeping genes 25-27. However, it has been noted that even the expression of these housekeeping genes can vary a great deal in normal samples 28.

The goal of this paper is to take advantage of the public availability of the Kim and Wilhelm studies to help characterize normal tissue protein expression variability as measured by modern shotgun proteomics experiments. We attempt to differentiate between the components of variation that make up the measurement of protein expression (Figure 1): the biological variation that exists between different samples and the technical variation due to the measurement process. We also contrast the measurement of protein expression in normal tissues with existing measurements of gene expression, and identify housekeeping proteins. This study, combined with others, can be a first step towards estimating the normal range of protein expression across a substantial portion of the Human proteome.


Materials and Methods

Datasets

The Kim study conducted MS proteomics analysis on 17 normal human adult and 7 fetal tissues. Data from the 12 adult tissues that were analyzed in common with the Wilhelm study were used for our analysis (Table 1). The tissues came from rapid autopsy and were confirmed as histologically normal. Each sample consisted of a pool of tissues from 3 individuals; sample preparation and protocols are detailed in Kim et al. 1 Part of the Wilhelm study used an MS proteomics analysis on 36 tissues and body fluids obtained from the University of Munich Biobank. Details of the protocols are given in Wilhelm et al. 2 The two studies used similar protocols for MS analysis, including separating samples into 24 fractions using LDS- or SDS-PAGE and an in-gel digestion with trypsin. Wilhelm also used LysC and chymotrypsin digestion for some samples, but these data were excluded to make the two studies more closely comparable. Both studies analyzed fractions using LC-MS/MS on Thermo-Fisher Orbitrap Elite and Velos instruments.

Additional proteomic data were downloaded for three tissues (liver, lung, and heart) from PaxDB. The data were a combination of several proteomics studies independent of the Kim and Wilhelm studies. PaxDB used their own methodology to combine results from different experiments and generate PPM estimates for each protein 8. Ensembl identifiers from PaxDB were mapped to UniProt identifiers for comparison in this study.

Roche tissue data (heart, lung, and ovary) are available in MOPED (moped.proteinspire.org, Roche_tissue_proteomes) and were originally downloaded from PeptideAtlas (PAe0001771, PAe0001774, PAe0001778). The study is an unpublished analysis by Roche of nine tissues and body fluids using SDS-PAGE fractionation, trypsin digestion, and an LTQ Orbitrap instrument.

While the number of protein expression studies using normal tissues is limited, there are a number of microarray studies, available through GEO, containing replicate samples of normal tissue. The normal tissues are typically used as controls for disease-based studies. We downloaded 6 studies from GEO (GDS3837, GDS4389, GDS2947, GDS651, GDS505, GDS3592) corresponding to 6 of the 12 tissues used for our proteomics studies 29-33. When more than one study was available we chose the study with the greatest number of available biological replicates. These studies had from 7 to 32 replicate samples from different subjects. The data had been previously normalized and log transformed, so no further normalization was done. The data were matched to proteomics data using a gene symbol to UniProt identifier mapping from UniProt 34.

RNA-seq transcriptomics data were obtained from a study of 27 normal adult tissues 35. Data matching the 6 tissues in common with the Kim, Wilhelm, and GEO data were used in the analysis. The study obtained tissue from the Uppsala Biobank and Uppsala University Hospital and analyzed a single sample of each tissue with Illumina instruments using standard protocols. Raw reads were converted to FPKM values using Cufflinks 36. FPKM values were normalized by total FPKM expression and transformed to a log scale. Data are available at ArrayExpress (E-MTAB-1733) 37.

Proteomic Data Analysis

We downloaded the raw data from the Kim, Wilhelm, and Roche studies, converted them to mzXML format using msconvert (part of Proteowizard version 3.03 from http://proteowizard.sourceforge.net/) 38 and re-analyzed them using our SPIRE (Systematic Protein Investigative Research Environment) proteomics pipeline 39-42. SPIRE combines the X!tandem 43 search engine (version 13-09-01-1 from theGPM.org) with experimental design specifications supplied by the user to fine tune the analysis. SPIRE’s in-house post-processing tools written in R (version 3.0.2 from the R-project.org) 44 determine peptide spectrum matches (PSM) and protein identifications. SPIRE uses a decoy database approach to estimate local and cumulative false discovery rates (FDR) for both PSMs and protein identifications. The searches were conducted assuming fully tryptic cleavage of peptides with no more than 2 missed cleavages, a fixed modification for carbamidomethylation of cysteine, a variable modification for oxidation of methionine and a precursor mass tolerance of 20 ppm. The remaining X!tandem search parameters were set to default values. A local FDR threshold of 10% was used to accept PSM identifications. FDR values for each protein were estimated for the combination of experiment and tissue.

SPIRE runs on the lab’s Linux cluster, which currently has 408 processor cores (consisting of 24 4-core Xeon 5148 @ 2.33 GHz nodes, 24 16-core Xeon X5550 nodes @ 2.67 GHz and 1 24-core Xeon X56550 @ 2.67 GHz) and 500 gigabytes of RAM available to it. For each proteomics experiment the SPIRE pipeline distributes the workflow to as many as 48 nodes using Grid Engine (version 6.2u5). The workflow creates a dynamically generated python script (version 2.6.6) that takes user-supplied search parameters and experiment design information, and invokes X!tandem to search individual mzXML files. After searches are completed the python script converts the xml output to tab-delimited files using XSLT style sheets. The python script then calls the custom R scripts to analyze the data and generate summaries of peptide IDs, protein IDs and spectral counts according to the approach described above 39-42, 44. Data from these studies are available in MOPED (moped.proteinspire.org, experiment_ids: Kim_Nature_2014, Wilhelm_Nature_2014).
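The decoy database idea mentioned above can be illustrated with a minimal sketch of a cumulative FDR estimate. This is not SPIRE's implementation (which is written in R and also estimates a local FDR); the function name, the input layout and the decoys/targets ratio used here are assumptions chosen for illustration.

```python
def cumulative_fdr(psms):
    """Estimate a cumulative FDR for each PSM from target/decoy labels.

    psms: list of (score, is_decoy) pairs, higher score = better match.
    Assumed estimator (hypothetical, for illustration): decoys / targets
    among all PSMs scoring at least as well as the current one.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    results = []
    targets = 0
    decoys = 0
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdr = decoys / targets if targets else 1.0
        results.append((score, is_decoy, min(fdr, 1.0)))
    return results

# Hypothetical search scores: one decoy hit among five PSMs.
example = [(9.0, False), (8.0, False), (7.0, True), (6.0, False), (5.0, False)]
fdr_table = cumulative_fdr(example)
for row in fdr_table:
    print(row)
```

A threshold (such as the 10% local FDR cut-off used in the text) would then be applied to the estimated values to accept or reject PSMs.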

Calculating concentration

Use of spectral counting to estimate protein expression is well established 45-47. Extracted ion chromatograms (XIC) can provide definite benefits for relative comparisons within an experiment on a common set of peptides. However, for comparisons across experiments with differing methods and varied sets of identified peptides, as is the case here, the benefit is less clear, particularly given the additional complexity of the analysis with this type of data 48. We estimated parts per million (PPM) concentration using spectral counts normalized by protein size and total expression in the sample (see Higdon et al. for details) 8. PPM values were transformed to a log scale for analysis. We excluded proteins with zero PSM identifications rather than assigning a zero expression level. This decision was made because lack of identification is most likely due to low expression or a result of detection bias inherent to MS studies. An analysis that included zero PPM estimates and used an offset prior to log transformation provided slightly noisier but fundamentally similar results (see Supplemental Figure S1).
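As a rough illustration of the PPM calculation described above, the following sketch divides each spectral count by protein length and rescales so that the values in a sample sum to one million. The exact formula is given in Higdon et al. 8; the length-based normalization, the function names and the example values here are assumptions made for illustration only.

```python
import math

def ppm_from_spectral_counts(counts, lengths):
    """Assumed PPM form: (count / length), rescaled to sum to 1e6 per sample.

    Proteins with zero spectral counts are excluded rather than set to zero,
    mirroring the exclusion rule described in the text.
    """
    rates = {p: c / lengths[p] for p, c in counts.items() if c > 0}
    total = sum(rates.values())
    return {p: 1e6 * r / total for p, r in rates.items()}

def log_ppm(ppm):
    """Natural-log transform applied before any comparison."""
    return {p: math.log(v) for p, v in ppm.items()}

# Hypothetical proteins: spectral counts and sequence lengths (residues).
counts = {"P1": 30, "P2": 10, "P3": 0}
lengths = {"P1": 300, "P2": 100, "P3": 250}
ppm = ppm_from_spectral_counts(counts, lengths)
print(ppm)
print(log_ppm(ppm))
```

In this toy example P1 and P2 have the same count per residue, so they receive equal PPM values, and P3 is dropped because it was never identified.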

Calculating variability and components of variability

We wish to calculate the average variation in expression measurements across different tissues and to separate and measure technical and biological variability. We use a number of formulas and assumptions. To simplify the calculations we assume the different expression estimates (on the log scale) for each protein vary independently from a common mean with a common variance. We can then estimate the variance for each protein and pool it across proteins to get estimates of the average measurement variance ($\sigma_M^2$) as illustrated by Figure 1.

In the circumstance when there are only two independent measurements (as is the case here with a single measurement from each of the 2 studies, Kim and Wilhelm):

$$\mathrm{Var}(D) = \mathrm{Var}(X_1 - X_2) = 2\sigma_M^2, \tag{1}$$

where $D$ is the difference between the measurements $X_1$ and $X_2$ from each experiment. So, the measurement variance can be estimated as:

$$\hat{\sigma}_M^2 = s_D^2 / 2, \tag{2}$$

where $s_D^2$ is the sample variance of the differences.

In the case where there are identical technical replicates of a sample the total measurement variation (M) can be decomposed into combinations of the between sample biological variation (B) and the within sample technical variation (T) as illustrated in Figure 1:

$$\sigma_M^2 = \sigma_B^2 + \sigma_T^2 / n, \tag{3}$$

where $n$ is the number of technical replicates.

When there are only two samples we can apply (1), (2) and (3) to each sample to adjust the total measurement variance estimate for technical variability and obtain an approximation of the between sample biological variance:

$$\hat{\sigma}_B^2 = \hat{\sigma}_M^2 - \frac{\hat{\sigma}_T^2}{2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right),$$

where $\hat{\sigma}_T^2$ is the pooled estimate of variance based upon the technical replicates from each experiment and $n_1$ and $n_2$ are the number of technical replicates in each experiment.

On the natural log scale, standard deviation estimates (square roots of the above variance estimates) can be used as an approximation of the coefficient of variation (CV) on the original scale. This approximation is based upon the delta method 49.

When the number of replicates is small, estimates of variance are unreliable. Therefore, in high-throughput studies, empirical Bayes (EB) estimates are often used to provide more reliable estimates of the variability of individual protein and gene expression. EB estimates take a weighted average of individual variance estimates and a pooled estimate. Smyth (2004) provides a straightforward approach for calculating EB variance estimates 50.
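The variance calculations in this section can be sketched in a few lines of Python. This is an illustrative reading of the formulas rather than the authors' R code: the function names and example values are hypothetical, flooring the biological variance at zero is an added assumption, and the log-normal CV is included only as a point of comparison for the delta-method approximation.

```python
import math
from statistics import variance

def measurement_variance(x1, x2):
    """Equation (2): half the sample variance of the paired log-scale
    differences, taken across proteins measured in both experiments."""
    diffs = [a - b for a, b in zip(x1, x2)]
    return variance(diffs) / 2.0

def biological_variance(var_m, var_t, n1, n2):
    """Remove the technical share from the measurement variance.

    Flooring at zero is an assumption of this sketch, not a step in the text.
    """
    technical = (var_t / 2.0) * (1.0 / n1 + 1.0 / n2)
    return max(var_m - technical, 0.0)

def cv_delta(var_log):
    """Delta-method approximation: CV on the original scale is roughly
    the standard deviation on the natural-log scale."""
    return math.sqrt(var_log)

def cv_lognormal(var_log):
    """Exact CV if the data were log-normal (added assumption, for comparison)."""
    return math.sqrt(math.exp(var_log) - 1.0)

# Hypothetical log-PPM values for four proteins in two experiments.
exp1 = [1.0, 2.0, 3.0, 4.0]
exp2 = [1.5, 1.5, 3.5, 3.5]
var_m = measurement_variance(exp1, exp2)
var_b = biological_variance(var_m, var_t=0.1, n1=2, n2=1)
print(var_m, var_b, cv_delta(var_b), cv_lognormal(var_b))
```

The approximation and the log-normal value agree closely for small variances and diverge as the log-scale variance grows, which is worth keeping in mind when CVs approach or exceed 100%.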

Results

Comparing Kim and Wilhelm Studies

We compared protein expression in 12 normal tissues analyzed as part of the Wilhelm and Kim studies. The two studies obtained tissue samples from different tissue banks and performed shotgun MS analysis using similar instrumentation and protocols. To simplify the analysis we only compared proteins that were identified with an FDR of less than 1% in at least one of the studies and by at least one PSM in both studies (a summary of IDs is shown in Table 1). Figure 2 shows that the correlation in expression measured by PPM on the log scale between the two experiments across the 12 tissues ranges from R = 0.58 to 0.84. The standard deviation (SD, square root of the pooled variance estimate) is shown in Table 2 for each tissue. As noted in the Methods, the SD of log transformed data can be viewed as an approximation of the CV on the untransformed scale. In this analysis the CV estimates varied between 60% and 110%. Assuming that “normal” expression should be within 2 standard deviations of the mean, we can back transform this value ($e^{2\,\mathrm{SD}}$) to get an estimate of the “normal” fold change range in expression across different proteomics experiments. These data show that expression measurements from similar experiments vary up or down by between 4 and 10 fold (or 0.25 to 0.1 fold) (Figure 2). These fold changes are averages across all of the identified proteins; see below for further discussion of individual protein ranges.

Protein quantitation has the potential for bias when based on very few peptide sequences. We explored the impact of the number of uniquely identified peptide sequences on the results in Table 2. Of those proteins included in the analysis an average of 61% across tissues had at least 3 unique peptide identifications in each experiment (ranging from 52% to 66%). The CV for proteins with at least 3 unique peptide IDs in each experiment was on average 33% less than for those with 2 or fewer, indicating that estimates based upon more unique peptide sequences are more consistent and more reliable. Similarly, correlation in expression between experiments was nearly twice as large on average across tissues for proteins with at least 3 unique peptide IDs versus those with 2 or fewer (0.72 vs. 0.38).
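The back transformation from a log-scale SD to a fold change range can be written out as a short sketch. The function name and the choice of example SD values are illustrative assumptions; the two-SD multiplier follows the text.

```python
import math

def fold_range(sd_log, k=2.0):
    """Back-transform +/- k standard deviations on the natural-log scale
    into a (lower, upper) fold change range around the mean."""
    upper = math.exp(k * sd_log)
    return 1.0 / upper, upper

# SD values spanning the CV estimates quoted in the text (60% to 110%).
for sd in (0.6, 1.1):
    low, high = fold_range(sd)
    print(f"SD={sd}: {low:.2f} to {high:.2f} fold")
```

Because the range is multiplicative, the lower and upper bounds are reciprocals of one another, which is why the text quotes the range both as 4 to 10 fold and as 0.25 to 0.1 fold.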

Components of variation

Only a portion of the variation between the two experiments comes from actual normal fluctuation in protein expression between samples, with the remainder coming from technical variation due to measurement error in MS proteomics and differences in protocols, instrumentation and labs. The Kim study does include replicates, although the exact instrument varies between replicates. Although they are not exact technical replicates, they are reasonable surrogates to approximate the technical variability. An adjusted estimate of variability was calculated (values taken from Tables 1 and 3) using two steps. First, we applied a variance components approach as detailed in Methods. Second, the Wilhelm study lacked technical replicates, and because the two studies vary in details such as number of separations, gradients, and instrument settings, it is not acceptable to treat the single technical replicate from the Wilhelm study as equivalent to a technical replicate from the Kim study. In proteomics studies using spectral counting the power of replication is proportional to the total number of spectral counts; therefore, equivalent technical replicates should provide a similar number of total spectral counts. So, we assumed that the amount of technical replication in the Wilhelm study is approximately equal to the ratio of spectral counts times the number of replicates in the Kim study. The estimates in Table 3 show that this technical variability may comprise as much as half of the total variation in the data, and thus variation in “normal” protein expression is likely far lower than variation between two experiments, as illustrated by Figure 1.
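The replicate-equivalence assumption can be sketched as follows. This reflects one reading of the text, in which the ratio is the Wilhelm total spectral count divided by the Kim total; the function names and all numbers are hypothetical and are not values from Tables 1 or 3.

```python
def effective_replicates(counts_other, counts_ref, n_ref):
    """Assumed reading: the study without replicates is credited with
    (its total spectral counts / reference total) * reference replicates."""
    return (counts_other / counts_ref) * n_ref

def technical_share(var_m, var_t, n1, n2):
    """Fraction of the measurement variance attributed to technical
    variation, using the two-sample adjustment from Methods."""
    technical = (var_t / 2.0) * (1.0 / n1 + 1.0 / n2)
    return min(technical / var_m, 1.0)

# Hypothetical totals: reference study with 3 replicates and 600,000 spectra,
# other study with a single run and 400,000 spectra.
n_kim = 3
n_wilhelm = effective_replicates(400_000, 600_000, n_kim)
share = technical_share(var_m=0.64, var_t=0.8, n1=n_kim, n2=n_wilhelm)
print(n_wilhelm, share)
```

With these made-up inputs the single run counts as two effective replicates, and a little over half of the measurement variance is attributed to technical sources.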

Comparisons with other public datasets

Only a few other datasets providing expression estimates for normal tissue are publicly available. We compared our results to data from PaxDB and from the Roche tissue profiling experiment. Each study had 3 tissues in common with the 12 analyzed above. Expression estimates from each of these sources were compared to the mean of the Kim and Wilhelm studies, and the results are shown in Figure 3. Several studies show reasonably strong correlations while others do not. However, the numbers of proteins vary greatly between data sets (~600 to 12,000 for PaxDB and 300 to 3,000 for Roche), indicating that the scale and similarity of methods can greatly influence the variability and correlation between proteomics studies.

Comparisons with gene expression

An analysis of variation was done in the same manner as for the proteomics data. Genes from each tissue corresponding to the proteins used in the proteomics analysis were included. The results in Table 4 show that the microarray studies had considerably lower estimates of variability (CV and fold change) than the proteomics studies even after the adjustment for technical variation. It is not clear whether this is due to the relatively low technical variability in microarrays 51, the fact that comparisons are made within experiments rather than across them, or a lack of correspondence between gene and protein expression. The correlation between microarray and proteomics studies was positive but lower than the correlation between different proteomics studies (Table 4). To see if this low correlation is due to the microarray platform, we also compared the correlation between proteomics studies and available RNA-Seq data (not replicated, so not useful for assessing variability). The correlation between proteomics studies and the RNA-Seq data appears to be no higher than that between the proteomics studies and microarray data, while the microarray and RNA-Seq data show much higher correlation with each other (Table 4).

Range in variation across proteins and genes

A sample size of 2 is clearly too small to adequately estimate the variation in expression for individual proteins. To compensate we used an empirical Bayes (EB) estimate of individual protein variances to get an idea of the range of variation across proteins (10th and 90th percentiles). Although EB estimates often do not deviate dramatically from the pooled estimate, non-EB estimates have a very wide range (Table 3). The microarray studies had reasonable numbers of replicates and so likely provide a better comparison across genes. The results in Table 3 show the individual CV and fold change estimates ranging from 50% lower than the pooled estimates at the 10th percentile to 50% higher at the 90th percentile.
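The weighted-average form of the EB variance estimate can be illustrated with a minimal sketch. The shrinkage formula follows the posterior variance in Smyth (2004); here the prior variance and prior degrees of freedom are simply supplied as inputs, whereas the published method estimates them from the data, and the example values are hypothetical.

```python
def eb_variance(s2, d, s0_2, d0):
    """Shrink an individual variance s2 (with d degrees of freedom) toward
    a prior variance s0_2 that carries d0 prior degrees of freedom."""
    return (d0 * s0_2 + d * s2) / (d0 + d)

# Hypothetical per-protein variances from two measurements each (d = 1).
protein_vars = [0.1, 0.5, 2.0]
pooled = sum(protein_vars) / len(protein_vars)
# d0 = 4 is an assumed prior weight chosen only for illustration.
shrunk = [eb_variance(s2, d=1, s0_2=pooled, d0=4) for s2 in protein_vars]
print(pooled, shrunk)
```

With only one degree of freedom per protein the prior dominates, so the shrunken estimates sit close to the pooled value, consistent with the observation that EB estimates often do not deviate dramatically from the pooled estimate.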

Housekeeping proteins

A total of 1297 proteins were identified in all 12 tissues in both the Wilhelm and Kim studies. Only 107 of the proteins had within tissue measurement CVs that averaged less than 30% across the 12 tissues. After averaging the measurements from the Wilhelm and Kim studies only 34 of those proteins had a CV less than 30% across the 12 tissues. Next, only 14 of those had corresponding genes with a within tissue measurement CV less than 20% across the 6 microarray studies. All the remaining candidates had above average protein expression in each of the 12 tissues and had well above average RNA-Seq expression across all tissues (Table 4). Only 1 had a CV for RNA-Seq expression that was not well below average (EEF1A1). Removing EEF1A1 leaves the remaining candidates shown in Table 5. All of these were previously identified as housekeeping genes in Eisenberg and Levanon (2013) 25 or Chang et al. (2011) 26 and 10 were identified in both.
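The successive numeric filters described above can be sketched as a simple selection function. The record layout, field names and example entries are hypothetical; the thresholds are the ones quoted in the text, and the final RNA-Seq check is left out because the text does not give a numeric cut-off for it.

```python
def select_housekeeping(candidates):
    """Apply the numeric filters from the text to candidate records."""
    kept = []
    for c in candidates:
        if not c["in_all_tissues_both_studies"]:
            continue  # must be identified in all 12 tissues in both studies
        if c["mean_within_tissue_cv"] >= 0.30:
            continue  # within tissue measurement CV averaging under 30%
        if c["across_tissue_cv"] >= 0.30:
            continue  # CV across the 12 tissues under 30%
        if c["gene_within_tissue_cv"] >= 0.20:
            continue  # microarray within tissue CV under 20%
        kept.append(c["protein"])
    return kept

# Hypothetical candidate records (not values from the study).
candidates = [
    {"protein": "PROT_A", "in_all_tissues_both_studies": True,
     "mean_within_tissue_cv": 0.22, "across_tissue_cv": 0.25,
     "gene_within_tissue_cv": 0.12},
    {"protein": "PROT_B", "in_all_tissues_both_studies": True,
     "mean_within_tissue_cv": 0.28, "across_tissue_cv": 0.45,
     "gene_within_tissue_cv": 0.10},
    {"protein": "PROT_C", "in_all_tissues_both_studies": False,
     "mean_within_tissue_cv": 0.10, "across_tissue_cv": 0.10,
     "gene_within_tissue_cv": 0.05},
]
housekeeping = select_housekeeping(candidates)
print(housekeeping)
```

Ordering the filters from cheapest to most restrictive mirrors the narrowing from 1297 to 107 to 34 to 14 candidates reported in the text.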

Discussion

These results demonstrate that there is some agreement in the measurements of protein expression between the Kim and Wilhelm studies as well as other high-throughput proteomics studies. However, the variability in protein expression estimates between these two studies, even after a rough correction for technical variability, is higher than has been reported in previous studies based on 2D-DIGE. Those studies reported average CV estimates of approximately 25% 17-21. A high-throughput study of lymphoblastoid cells across 79 individuals has protein CV estimates averaging 15% 15. However, these CV estimates may be low given that: (1) the study used isobaric labeling, which reduces technical variability but also limits the range of measured expression samples 52, and (2) cultured cells are more homogeneous than tissue 53. The variability in protein expression was also substantially higher than in gene expression measured by microarrays on a similar set of tissues. However, the data show that a considerable portion of the variability in protein expression estimates is likely from technical variability, and still more is likely due to differences between labs, their measurement processes, and protocols. There are also other well-known issues with proteomics studies, including variation in tissue sample collection methods and bias in estimating concentration due to protein fraction. All of these issues add to the variability in protein expression measurements beyond the actual biological variability in protein expression.

These data also show that the correlation of protein expression between different experiments (Figure 2) and with gene expression (Table 4) varies across different tissues. Similar variation has arisen in recent transcriptomic studies, supporting the notion that certain tissues have higher variability in normal protein expression 54. Therefore, it might be easier or more appropriate to estimate normal protein expression in some tissues. However, analysis of a much larger number of experiments is necessary to validate this hypothesis. Given that proteins are the instruments of biological function, and that data in this study and from previous studies show that protein and gene expression do not always correlate well, it is always preferable to have differential expression definitively confirmed by proteomics methods 48, 55.

Housekeeping genes or proteins have been shown in the past to be a valuable tool for normalizing expression and providing comparisons across experiments 25, 26. This study selected a set of housekeeping genes/proteins that showed low variability in expression across tissues for both proteomic and transcriptomic data. As studies become increasingly multi-omic, such multipurpose sets will become even more essential to comparisons both between experiments and between different omics.

Concluding Remarks Although the results shown here and in other studies do not provide a conclusive answer to whether current high-throughput proteomics can reliably estimate the normal range of protein expression across a substantial portion of the Human proteome, they hint that it is plausible. 14 ACS Paragon Plus Environment


Approaches that might achieve this goal in the future need to estimate average protein expression to a reasonable level of accuracy and to estimate biological variability. They should also include several technical replicates, both to reduce measurement error and to separate technical variability from biological variability. If variation is similar to that shown in this study (~80% CV), then 16 equivalent studies would be needed to achieve a 20% CV for estimates of average expression, since the CV of an average falls with the square root of the number of studies (80%/√16 = 20%). One approach that might achieve this objective with fewer replicates would be to have a single lab combine multiple replicates of “normal” tissues from many subjects and use advanced instrumentation and a standard protocol. This would minimize measurement variability but create the potential for bias by tying the measurements too closely to one lab’s operating procedures. An alternative would be to have identical tissue samples analyzed by multiple labs, thereby allowing for estimation of inter-lab error. However, this would require an effort many times that of a single lab. Neither approach is likely to be undertaken because of the prohibitive cost and scale.

Therefore, a resource of normal protein expression should be generated by combining data across many experiments. It is critical to have complete and accurate metadata, so that studies with advanced instrumentation, proper replication, and similar protocols are integrated 56-60. Data will need to be normalized across experiments through the use of common data analysis methods. Reference sets of proteins, such as the housekeeping proteins selected in this paper, can serve as a benchmark for normalization. This focused effort could both anticipate and minimize technical variability, thus allowing estimation of normal biological variation. Such a resource would be valuable not just for studying normal cellular physiology but also for comparing proteomics studies.
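The replicate arithmetic above can be checked with a short sketch. It assumes the studies are independent and equally variable, so that the CV of their average scales as 1/√n; the function names are hypothetical and not taken from the authors' scripts.

```python
import math


def cv_of_average(cv_single: float, n_studies: int) -> float:
    """CV of the average of n independent, equally variable studies."""
    if n_studies < 1:
        raise ValueError("n_studies must be at least 1")
    return cv_single / math.sqrt(n_studies)


def studies_needed(cv_single: float, cv_target: float) -> int:
    """Smallest number of equivalent studies whose average reaches cv_target."""
    if cv_single <= 0 or cv_target <= 0:
        raise ValueError("CV values must be positive")
    ratio = cv_single / cv_target
    # Small tolerance guards against floating-point noise before rounding up.
    return max(1, math.ceil(ratio ** 2 - 1e-9))


print(studies_needed(0.80, 0.20))   # number of studies for 80% -> 20% CV
print(cv_of_average(0.80, 16))      # resulting CV with 16 studies
```

The quadratic growth is the key point: halving the target CV quadruples the number of equivalent studies required.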
Although this normal protein expression resource is no substitute for a properly replicated and controlled study, it would provide a means to validate results, test and generate new hypotheses, and make better use of the wealth of existing proteomics data. If this resource can become multi-omic, by combining with parallel efforts in genomics, transcriptomics, and metabolomics, the scientific community can deepen its understanding of the molecular function of normal human cells.

Supporting Information

Figure S1. Comparison of log PPM values between Kim and Wilhelm studies across 12 normal tissues. Includes zero PPM values by using an offset with log transformation. Correlation coefficients (R) are reported on each of the figures.

This material is available free of charge via the Internet at http://pubs.acs.org.

Acknowledgement

We thank Elizabeth Montague, Elizabeth Stewart, Larissa Stanberry, and John Choiniere for their editing of the manuscript and excellent comments and suggestions.

Author Information

Corresponding Author
*E-mail: [email protected], Phone: 425-884-7170

Funding

Research reported in this study was supported by the National Science Foundation under the Division of Biological Infrastructure [0969929]; Seattle Children’s Research Institute; and The Robert B. McMillen Foundation [to E.K.]. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Science Foundation, Seattle Children’s Research Institute, or The Robert B. McMillen Foundation.

References

1. Kim, M.-S.; Pinto, S. M.; Getnet, D.; Nirujogi, R. S.; Manda, S. S.; Chaerkady, R.; Madugundu, A. K.; Kelkar, D. S.; Isserlin, R.; Jain, S.; Thomas, J. K.; Muthusamy, B.; Leal-Rojas, P.; Kumar, P.; Sahasrabuddhe, N. A.; Balakrishnan, L.; Advani, J.; George, B.; Renuse, S.; Selvan, L. D. N.; Patil, A. H.; Nanjappa, V.; Radhakrishnan, A.; Prasad, S.; Subbannayya, T.; Raju, R.; Kumar, M.; Sreenivasamurthy, S. K.; Marimuthu, A.; Sathe, G. J.; Chavan, S.; Datta, K. K.; Subbannayya, Y.; Sahu, A.; Yelamanchi, S. D.; Jayaram, S.; Rajagopalan, P.; Sharma, J.; Murthy, K. R.; Syed, N.; Goel, R.; Khan, A. A.; Ahmad, S.; Dey, G.; Mudgal, K.; Chatterjee, A.; Huang, T.-C.; Zhong, J.; Wu, X.; Shaw, P. G.; Freed, D.; Zahari, M. S.; Mukherjee, K. K.; Shankar, S.; Mahadevan, A.; Lam, H.; Mitchell, C. J.; Shankar, S. K.; Satishchandra, P.; Schroeder, J. T.; Sirdeshmukh, R.; Maitra, A.; Leach, S. D.; Drake, C. G.; Halushka, M. K.; Prasad, T. S. K.; Hruban, R. H.; Kerr, C. L.; Bader, G. D.; Iacobuzio-Donahue, C. A.; Gowda, H.; Pandey, A. A draft map of the human proteome. Nature 2014, 509, 575–581.
2. Wilhelm, M.; Schlegl, J.; Hahne, H.; Moghaddas Gholami, A.; Lieberenz, M.; Savitski, M. M.; Ziegler, E.; Butzmann, L.; Gessulat, S.; Marx, H.; Mathieson, T.; Lemeer, S.; Schnatbaum, K.; Reimer, U.; Wenschuh, H.; Mollenhauer, M.; Slotta-Huspenina, J.; Boese, J.-H.; Bantscheff, M.; Gerstmair, A.; Faerber, F.; Kuster, B. Mass-spectrometry-based draft of the human proteome. Nature 2014, 509, 582–587.
3. Aebersold, R.; Mann, M. Mass spectrometry-based proteomics. Nature 2003, 422, 198–207.
4. Bensimon, A.; Heck, A. J. R.; Aebersold, R. Mass spectrometry-based proteomics and network biology. Annu. Rev. Biochem. 2012, 81, 379–405.
5. Cravatt, B. F.; Simon, G. M.; Yates, J. R. The biological impact of mass-spectrometry-based proteomics. Nature 2007, 450, 991–1000.
6. Lane, L.; Bairoch, A.; Beavis, R. C.; Deutsch, E. W.; Gaudet, P.; Lundberg, E.; Omenn, G. S. Metrics for the Human Proteome Project 2013-2014 and strategies for finding missing proteins. J. Proteome Res. 2014, 13, 15–20.
7. Kolker, E.; Higdon, R.; Haynes, W.; Welch, D.; Broomall, W.; Lancet, D.; Stanberry, L.; Kolker, N. MOPED: Model Organism Protein Expression Database. Nucleic Acids Res. 2012, 40, D1093–1099.
8. Higdon, R.; Stewart, E.; Stanberry, L.; Haynes, W.; Choiniere, J.; Montague, E.; Anderson, N.; Yandl, G.; Janko, I.; Broomall, W.; Fishilevich, S.; Lancet, D.; Kolker, N.; Kolker, E. MOPED enables discoveries through consistently processed proteomics data. J. Proteome Res. 2014, 13, 107–113.
9. Montague, E.; Stanberry, L.; Higdon, R.; Janko, I.; Lee, E.; Anderson, N.; Choiniere, J.; Stewart, E.; Yandl, G.; Broomall, W.; Kolker, N.; Kolker, E. MOPED 2.5--an integrated multi-omics resource: multi-omics profiling expression database now includes transcriptomics data. OMICS 2014, 18, 335–343.
10. Farrah, T.; Deutsch, E. W.; Omenn, G. S.; Sun, Z.; Watts, J. D.; Yamamoto, T.; Shteynberg, D.; Harris, M. M.; Moritz, R. L. State of the human proteome in 2013 as viewed through PeptideAtlas: comparing the kidney, urine, and plasma proteomes for the biology- and disease-driven Human Proteome Project. J. Proteome Res. 2014, 13, 60–75.
11. Wang, M.; Weiss, M.; Simonovic, M.; Haertinger, G.; Schrimpf, S. P.; Hengartner, M. O.; von Mering, C. PaxDb, a database of protein abundance averages across all three domains of life. Mol. Cell Proteomics 2012, 11, 492–500.
12. Schaab, C.; Geiger, T.; Stoehr, G.; Cox, J.; Mann, M. Analysis of high accuracy, quantitative proteomics data in the MaxQB database. Mol. Cell Proteomics 2012, 11, M111.014068.
13. Chen, R.; Mias, G. I.; Li-Pook-Than, J.; Jiang, L.; Lam, H. Y. K.; Chen, R.; Miriami, E.; Karczewski, K. J.; Hariharan, M.; Dewey, F. E.; Cheng, Y.; Clark, M. J.; Im, H.; Habegger, L.; Balasubramanian, S.; O’Huallachain, M.; Dudley, J. T.; Hillenmeyer, S.; Haraksingh, R.; Sharon, D.; Euskirchen, G.; Lacroute, P.; Bettinger, K.; Boyle, A. P.; Kasowski, M.; Grubert, F.; Seki, S.; Garcia, M.; Whirl-Carrillo, M.; Gallardo, M.; Blasco, M. A.; Greenberg, P. L.; Snyder, P.; Klein, T. E.; Altman, R. B.; Butte, A. J.; Ashley, E. A.; Gerstein, M.; Nadeau, K. C.; Tang, H.; Snyder, M. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 2012, 148, 1293–1307.
14. Stanberry, L.; Mias, G. I.; Haynes, W.; Higdon, R.; Snyder, M.; Kolker, E. Integrative analysis of longitudinal metabolomics data from a personal multi-omics profile. Metabolites 2013, 3, 741–760.
15. Wu, L.; Candille, S. I.; Choi, Y.; Xie, D.; Jiang, L.; Li-Pook-Than, J.; Tang, H.; Snyder, M. Variation and genetic control of protein abundance in humans. Nature 2013, 499, 79–82.
16. Normal Laboratory Values: Blood, Plasma, and Serum 2013.
17. Maes, E.; Landuyt, B.; Mertens, I.; Schoofs, L. Interindividual variation in the proteome of human peripheral blood mononuclear cells. PLoS ONE 2013, 8, e61933.
18. Corzett, T. H.; Fodor, I. K.; Choi, M. W.; Walsworth, V. L.; Turteltaub, K. W.; McCutchen-Maloney, S. L.; Chromy, B. A. Statistical analysis of variation in the human plasma proteome. J. Biomed. Biotechnol. 2010, 2010, 258494.
19. Winkler, W.; Zellner, M.; Diestinger, M.; Babeluk, R.; Marchetti, M.; Goll, A.; Zehetmayer, S.; Bauer, P.; Rappold, E.; Miller, I.; Roth, E.; Allmaier, G.; Oehler, R. Biological variation of the platelet proteome in the elderly population and its implication for biomarker research. Mol. Cell Proteomics 2008, 7, 193–203.
20. Hu, Y.; Malone, J. P.; Fagan, A. M.; Townsend, R. R.; Holtzman, D. M. Comparative proteomic analysis of intra- and interindividual variation in human cerebrospinal fluid. Mol. Cell Proteomics 2005, 4, 2000–2009.
21. Stoop, M. P.; Coulier, L.; Rosenling, T.; Shi, S.; Smolinska, A. M.; Buydens, L.; Ampt, K.; Stingl, C.; Dane, A.; Muilwijk, B.; Luitwieler, R. L.; Sillevis Smitt, P. A. E.; Hintzen, R. Q.; Bischoff, R.; Wijmenga, S. S.; Hankemeier, T.; van Gool, A. J.; Luider, T. M. Quantitative proteomics and metabolomics analysis of normal human cerebrospinal fluid samples. Mol. Cell Proteomics 2010, 9, 2063–2075.
22. Barrett, T.; Wilhite, S. E.; Ledoux, P.; Evangelista, C.; Kim, I. F.; Tomashevsky, M.; Marshall, K. A.; Phillippy, K. H.; Sherman, P. M.; Holko, M.; Yefanov, A.; Lee, H.; Zhang, N.; Robertson, C. L.; Serova, N.; Davis, S.; Soboleva, A. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. 2013, 41, D991–995.
23. Su, A. I.; Wiltshire, T.; Batalov, S.; Lapp, H.; Ching, K. A.; Block, D.; Zhang, J.; Soden, R.; Hayakawa, M.; Kreiman, G.; Cooke, M. P.; Walker, J. R.; Hogenesch, J. B. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl. Acad. Sci. U.S.A. 2004, 101, 6062–6067.
24. Wu, C.; Macleod, I.; Su, A. I. BioGPS and MyGene.info: organizing online, gene-centric information. Nucleic Acids Res. 2013, 41, D561–565.
25. Eisenberg, E.; Levanon, E. Y. Human housekeeping genes, revisited. Trends Genet. 2013, 29, 569–574.
26. Chang, C.-W.; Cheng, W.-C.; Chen, C.-R.; Shu, W.-Y.; Tsai, M.-L.; Huang, C.-L.; Hsu, I. C. Identification of human housekeeping genes and tissue-selective genes by microarray meta-analysis. PLoS ONE 2011, 6, e22859.
27. Zhu, J.; He, F.; Song, S.; Wang, J.; Yu, J. How many human genes can be defined as housekeeping with current expression data? BMC Genomics 2008, 9, 172.
28. Lee, P. D.; Sladek, R.; Greenwood, C. M. T.; Hudson, T. J. Control genes and variability: absence of ubiquitous reference transcripts in diverse mammalian expression studies. Genome Res. 2002, 12, 292–297.


29. Sabates-Bellver, J.; Van der Flier, L. G.; de Palo, M.; Cattaneo, E.; Maake, C.; Rehrauer, H.; Laczko, E.; Kurowski, M. A.; Bujnicki, J. M.; Menigatti, M.; Luz, J.; Ranalli, T. V.; Gomes, V.; Pastorelli, A.; Faggiani, R.; Anti, M.; Jiricny, J.; Clevers, H.; Marra, G. Transcriptome profile of human colorectal adenomas. Mol. Cancer Res. 2007, 5, 1263–1275.
30. Lu, T.-P.; Tsai, M.-H.; Lee, J.-M.; Hsu, C.-P.; Chen, P.-C.; Lin, C.-W.; Shih, J.-Y.; Yang, P.-C.; Hsiao, C. K.; Lai, L.-C.; Chuang, E. Y. Identification of a novel biomarker, SEMA5A, for non-small cell lung carcinoma in nonsmoking women. Cancer Epidemiol. Biomarkers Prev. 2010, 19, 2590–2597.
31. Lenburg, M. E.; Liou, L. S.; Gerry, N. P.; Frampton, G. M.; Cohen, H. T.; Christman, M. F. Previously unidentified changes in renal cell carcinoma gene expression identified by parametric analysis of microarray data. BMC Cancer 2003, 3, 31.
32. Affò, S.; Dominguez, M.; Lozano, J. J.; Sancho-Bru, P.; Rodrigo-Torres, D.; Morales-Ibanez, O.; Moreno, M.; Millán, C.; Loaeza-del-Castillo, A.; Altamirano, J.; García-Pagán, J. C.; Arroyo, V.; Ginès, P.; Caballería, J.; Schwabe, R. F.; Bataller, R. Transcriptome analysis identifies TNF superfamily receptors as potential therapeutic targets in alcoholic hepatitis. Gut 2013, 62, 452–460.
33. Bowen, N. J.; Walker, L. D.; Matyunina, L. V.; Logani, S.; Totten, K. A.; Benigno, B. B.; McDonald, J. F. Gene expression profiling supports the hypothesis that human ovarian surface epithelia are multipotent and capable of serving as ovarian cancer initiating cells. BMC Med. Genomics 2009, 2, 71.
34. The UniProt Consortium. Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res. 2012, 41, D43–D47.
35. Kampf, C.; Mardinoglu, A.; Fagerberg, L.; Hallström, B. M.; Edlund, K.; Lundberg, E.; Pontén, F.; Nielsen, J.; Uhlen, M. The human liver-specific proteome defined by transcriptomics and antibody-based profiling. FASEB J. 2014, 28, 2901–2914.
36. Roberts, A.; Pimentel, H.; Trapnell, C.; Pachter, L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics 2011, btr355.
37. Rustici, G.; Kolesnikov, N.; Brandizi, M.; Burdett, T.; Dylag, M.; Emam, I.; Farne, A.; Hastings, E.; Ison, J.; Keays, M.; Kurbatova, N.; Malone, J.; Mani, R.; Mupo, A.; Pedro Pereira, R.; Pilicheva, E.; Rung, J.; Sharma, A.; Tang, Y. A.; Ternent, T.; Tikhonov, A.; Welter, D.; Williams, E.; Brazma, A.; Parkinson, H.; Sarkans, U. ArrayExpress update--trends in database growth and links to data analysis tools. Nucleic Acids Res. 2013, 41, D987–990.
38. Chambers, M. C.; MacLean, B.; Burke, R.; Amode, D.; Ruderman, D. L.; Neumann, S.; Gatto, L.; Fischer, B.; Pratt, B.; Egertson, J.; Hoff, K.; Kessner, D.; Tasman, N.; Shulman, N.; Frewen, B.; Baker, T. A.; Brusniak, M.-Y.; Paulse, C.; Creasy, D.; Flashner, L.; Kani, K.; Moulding, C.; Seymour, S. L.; Nuwaysir, L. M.; Lefebvre, B.; Kuhlmann, F.; Roark, J.; Rainer, P.; Detlev, S.; Hemenway, T.; Huhmer, A.; Langridge, J.; Connolly, B.; Chadick, T.; Holly, K.; Eckels, J.; Deutsch, E. W.; Moritz, R. L.; Katz, J. E.; Agus, D. B.; MacCoss, M.; Tabb, D. L.; Mallick, P. A cross-platform toolkit for mass spectrometry and proteomics. Nat. Biotechnol. 2012, 30, 918–920.
39. Kolker, E.; Higdon, R.; Morgan, P.; Sedensky, M.; Welch, D.; Bauman, A.; Stewart, E.; Haynes, W.; Broomall, W.; Kolker, N. SPIRE: Systematic protein investigative research environment. J. Proteomics 2011, 75, 122–126.
40. Higdon, R.; Reiter, L.; Hather, G.; Haynes, W.; Kolker, N.; Stewart, E.; Bauman, A. T.; Picotti, P.; Schmidt, A.; van Belle, G.; Aebersold, R.; Kolker, E. IPM: An integrated protein model for false discovery rate estimation and identification in high-throughput proteomics. J. Proteomics 2011, 75, 116–121.
41. Higdon, R.; Kolker, E. A predictive model for identifying proteins by a single peptide match. Bioinformatics 2007, 23, 277–280.
42. Hather, G.; Higdon, R.; Bauman, A.; von Haller, P. D.; Kolker, E. Estimating false discovery rates for peptide and protein identification using randomized databases. Proteomics 2010, 10, 2369–2376.
43. Craig, R.; Cortens, J. P.; Beavis, R. C. Open source system for analyzing, validating, and storing protein identification data. J. Proteome Res. 2004, 3, 1234–1242.
44. R Development Core Team. R: A language and environment for statistical computing; R Foundation for Statistical Computing: Vienna, Austria.
45. Vogel, C.; Marcotte, E. M. Calculating absolute and relative protein abundance from mass spectrometry-based protein expression data. Nat. Protoc. 2008, 3, 1444–1451.
46. Ishihama, Y.; Oda, Y.; Tabata, T.; Sato, T.; Nagasu, T.; Rappsilber, J.; Mann, M. Exponentially modified protein abundance index (emPAI) for estimation of absolute protein amount in proteomics by the number of sequenced peptides per protein. Mol. Cell Proteomics 2005, 4, 1265–1272.
47. Lundgren, D. H.; Hwang, S.-I.; Wu, L.; Han, D. K. Role of spectral counting in quantitative proteomics. Expert Rev. Proteomics 2010, 7, 39–53.
48. Ning, K.; Fermin, D.; Nesvizhskii, A. I. Comparative analysis of different label-free mass spectrometry based protein abundance estimates and their correlation with RNA-Seq gene expression data. J. Proteome Res. 2012, 11, 2261–2271.
49. Oehlert, G. W. A Note on the Delta Method. Am. Stat. 1992, 46, 27.
50. Smyth, G. Linear Models and Empirical Bayes Methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 2004.
51. Karp, N. A.; Huber, W.; Sadowski, P. G.; Charles, P. D.; Hester, S. V.; Lilley, K. S. Addressing accuracy and precision issues in iTRAQ quantitation. Mol. Cell Proteomics 2010, 9, 1885–1897.
52. Ertel, A.; Verghese, A.; Byers, S. W.; Ochs, M.; Tozeren, A. Pathway-specific differences between tumor cell lines and normal and tumor tissue cells. Mol. Cancer 2006, 5, 55.
53. Shi, L.; Reid, L. H.; Jones, W. D.; Shippy, R.; Warrington, J. A.; Baker, S. C.; Collins, P. J.; Longueville, F. de; Kawasaki, E. S.; Lee, K. Y.; Luo, Y.; Sun, Y. A.; Willey, J. C.; Setterquist, R. A.; Fischer, G. M.; Tong, W.; Dragan, Y. P.; Dix, D. J.; Frueh, F. W.; Goodsaid, F. M.; Herman, D.; Jensen, R. V.; Johnson, C. D.; Lobenhofer, E. K.; Puri, R. K.; Scherf, U.; Thierry-Mieg, J.; Wang, C.; Wilson, M.; Wolber, P. K.; Zhang, L.; Amur, S.; Bao, W.; Barbacioru, C. C.; Lucas, A. B.; Bertholet, V.; Boysen, C.; Bromley, B.; Brown, D.; Brunner, A.; Canales, R.; Cao, X. M.; Cebula, T. A.; Chen, J. J.; Cheng, J.; Chu, T.-M.; Chudin, E.; Corson, J.; Corton, J. C.; Croner, L. J.; Davies, C.; Davison, T. S.; Delenstarr, G.; Deng, X.; Dorris, D.; Eklund, A. C.; Fan, X.; Fang, H.; Fulmer-Smentek, S.; Fuscoe, J. C.; Gallagher, K.; Ge, W.; Guo, L.; Guo, X.; Hager, J.; Haje, P. K.; Han, J.; Han, T.; Harbottle, H. C.; Harris, S. C.; Hatchwell, E.; Hauser, C. A.; Hester, S.; Hong, H.; Hurban, P.; Jackson, S. A.; Ji, H.; Knight, C. R.; Kuo, W. P.; LeClerc, J. E.; Levy, S.; Li, Q.-Z.; Liu, C.; Liu, Y.; Lombardi, M. J.; Ma, Y.; Magnuson, S. R.; Maqsodi, B.; McDaniel, T.; Mei, N.; Myklebost, O.; Ning, B.; Novoradovskaya, N.; Orr, M. S.; Osborn, T. W.; Papallo, A.; Patterson, T. A.; Perkins, R. G.; Peters, E. H.; Peterson, R.; Philips, K. L.; Pine, P. S.; Pusztai, L.; Qian, F.; Ren, H.; Rosen, M.; Rosenzweig, B. A.; Samaha, R. R.; Schena, M.; Schroth, G. P.; Shchegrova, S.; Smith, D. D.; Staedtler, F.; Su, Z.; Sun, H.; Szallasi, Z.; Tezak, Z.; Thierry-Mieg, D.; Thompson, K. L.; Tikhonova, I.; Turpaz, Y.; Vallanat, B.; Van, C.; Walker, S. J.; Wang, S. J.; Wang, Y.; Wolfinger, R.; Wong, A.; Wu, J.; Xiao, C.; Xie, Q.; Xu, J.; Yang, W.; Zhang, L.; Zhong, S.; Zong, Y.; Slikker, W. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat. Biotechnol. 2006, 24, 1151–1161.
54. Kechavarzi, B.; Janga, S. C. Dissecting the expression landscape of RNA-binding proteins in human cancers. Genome Biol. 2014, 15 (1), R14.
55. Schwanhäusser, B.; Busse, D.; Li, N.; Dittmar, G.; Schuchhardt, J.; Wolf, J.; Chen, W.; Selbach, M. Global quantification of mammalian gene expression control. Nature 2011, 473, 337–342.
56. Kolker, E.; Özdemir, V.; Martens, L.; Hancock, W.; Anderson, G.; Anderson, N.; Aynacioglu, S.; Baranova, A.; Campagna, S. R.; Chen, R.; Choiniere, J.; Dearth, S. P.; Feng, W.-C.; Ferguson, L.; Fox, G.; Frishman, D.; Grossman, R.; Heath, A.; Higdon, R.; Hutz, M. H.; Janko, I.; Jiang, L.; Joshi, S.; Kel, A.; Kemnitz, J. W.; Kohane, I. S.; Kolker, N.; Lancet, D.; Lee, E.; Li, W.; Lisitsa, A.; Llerena, A.; Macnealy-Koch, C.; Marshall, J.-C.; Masuzzo, P.; May, A.; Mias, G.; Monroe, M.; Montague, E.; Mooney, S.; Nesvizhskii, A.; Noronha, S.; Omenn, G.; Rajasimha, H.; Ramamoorthy, P.; Sheehan, J.; Smarr, L.; Smith, C. V.; Smith, T.; Snyder, M.; Rapole, S.; Srivastava, S.; Stanberry, L.; Stewart, E.; Toppo, S.; Uetz, P.; Verheggen, K.; Voy, B. H.; Warnich, L.; Wilhelm, S. W.; Yandl, G. Toward more transparent and reproducible omics studies through a common metadata checklist and data publications. OMICS 2014, 18, 10–14.
57. Kolker, E.; Altintas, I.; Bourne, P.; Faris, J.; Fox, G.; Frishman, D.; Geraci, C.; Hancock, W.; Lin, B.; Lancet, D.; Lisitsa, A.; Knight, R.; Martens, L.; Mesirov, J.; Özdemir, V.; Schultes, E.; Smith, T.; Snyder, M.; Srivastava, S.; Toppo, S.; Wilmes, P. Reproducibility: In praise of open research measures. Nature 2013, 498, 170.
58. Ioannidis, J. P. A.; et al. Repeatability of published microarray gene expression analyses. Nat. Genet. 2009, 41, 149–155.
59. Schofield, P. N.; et al. Post-publication sharing of data and tools. Nature 2009, 461, 171–173.
60. Mesirov, J. P. Accessible reproducible research. Science 2010, 327 (5964), 415–416.


Table 1. Summary of protein IDs (1% FDR), spectral counts (10% LFDR) and technical replicates across 12 tissues from the Kim and Wilhelm studies. The Comparison column gives the number of proteins used for the comparison of Kim and Wilhelm (1% FDR in at least one of the two studies and at least 1 spectrum ID in the other). The equivalent (EQ) technical replicates for Wilhelm are the number of technical replicates for Kim times the ratio of spectral counts.

| Tissue | Protein IDs: Kim | Protein IDs: Wilhelm | Protein IDs: Comparison | Spectra: Kim | Spectra: Wilhelm | Technical Replicates: Kim | Technical Replicates: EQ. Wilhelm |
|---|---|---|---|---|---|---|---|
| Adrenal gland | 5183 | 4435 | 4171 | 202338 | 124737 | 3 | 1.85 |
| Colon | 5153 | 4299 | 4204 | 197989 | 114584 | 2 | 1.16 |
| Esophagus | 2922 | 4743 | 2948 | 142668 | 128412 | 3 | 2.70 |
| Gall bladder | 4634 | 2748 | 2560 | 191726 | 96741 | 2 | 1.01 |
| Heart | 3417 | 2943 | 2484 | 307989 | 147445 | 5 | 2.39 |
| Kidney | 3771 | 3352 | 3100 | 199917 | 98901 | 3 | 1.48 |
| Liver | 5084 | 3968 | 3724 | 271329 | 100702 | 4 | 1.48 |
| Lung | 3908 | 3083 | 2703 | 178649 | 104386 | 3 | 1.75 |
| Ovary | 6659 | 3917 | 3951 | 267052 | 97425 | 2 | 0.73 |
| Prostate gland | 5951 | 3940 | 3944 | 249695 | 140501 | 2 | 1.13 |
| Rectum | 4880 | 3402 | 3449 | 205403 | 83493 | 2 | 0.81 |
| Testis | 7692 | 4527 | 4593 | 259327 | 143934 | 2 | 1.11 |
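The equivalent-replicate definition in the Table 1 caption can be reproduced directly from the tabulated spectral counts. The sketch below is illustrative only; the function name is hypothetical, and the inputs are values taken from Table 1.

```python
def equivalent_replicates(kim_replicates: int, kim_spectra: int, wilhelm_spectra: int) -> float:
    """Equivalent (EQ) technical replicates for Wilhelm, following the Table 1
    caption: Kim replicates times the ratio of spectral counts (Wilhelm / Kim)."""
    if kim_spectra <= 0:
        raise ValueError("kim_spectra must be positive")
    return kim_replicates * wilhelm_spectra / kim_spectra


# (Kim replicates, Kim spectra, Wilhelm spectra) for a few tissues from Table 1.
table1_rows = {
    "Adrenal gland": (3, 202338, 124737),
    "Colon": (2, 197989, 114584),
    "Heart": (5, 307989, 147445),
    "Ovary": (2, 267052, 97425),
}

for tissue, (reps, kim, wilhelm) in table1_rows.items():
    print(f"{tissue}: {equivalent_replicates(reps, kim, wilhelm):.2f}")
```

Scaling by spectral counts treats sequencing depth, rather than the number of instrument runs, as the measure of how much information each study contributes for a tissue.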


Table 2. Comparing the protein expression for proteins with 3 or more unique peptide identifications in each experiment to those with 2 or fewer. The table shows the proportion with 3 or more unique peptide identifications, the coefficient of variation (CV) across tissues, and the correlation between experiments across tissues.

| Tissue | % Proteins with 3+ Peptide IDs | CV: 3+ Peptides | CV: 2- Peptides | Correlation: 3+ Peptides | Correlation: 2- Peptides |
|---|---|---|---|---|---|
| Adrenal gland | 65.7% | 0.6 | 0.96 | 0.8 | 0.43 |
| Colon | 62.7% | 0.51 | 0.77 | 0.86 | 0.59 |
| Esophagus | 52.3% | 0.82 | 1.2 | 0.71 | 0.4 |
| Gall bladder | 57.0% | 1.0 | 1.2 | 0.51 | 0.24 |
| Heart | 55.1% | 1.0 | 1.3 | 0.58 | 0.21 |
| Kidney | 63.1% | 0.66 | 0.98 | 0.79 | 0.44 |
| Liver | 61.5% | 0.82 | 1.2 | 0.68 | 0.35 |
| Lung | 59.5% | 0.91 | 1.3 | 0.55 | 0.2 |
| Ovary | 65.3% | 0.59 | 1.0 | 0.77 | 0.37 |
| Prostate gland | 63.9% | 0.57 | 0.94 | 0.81 | 0.49 |
| Rectum | 59.2% | 0.52 | 0.86 | 0.85 | 0.56 |
| Testis | 65.3% | 0.64 | 1.1 | 0.74 | 0.31 |


Table 3. Comparison of variability in protein expression measurements across different tissues based upon Kim and Wilhelm studies. Measures include a pooled (across proteins) standard deviation (SD) estimate from the log PPM expression values, which also approximates the coefficient of variation (CV). Fold change range (FC) is based on $e^{2 \cdot \mathrm{SD}}$. 10th and 90th percentile values are given for individual protein SD and FC estimates using single protein and empirical Bayes (EB) estimates. An estimate of the SD for technical replicates is used to adjust the pooled SD and FC estimates.

| Tissue | Pooled: SD (CV) | Pooled: FC | SD 10% | SD 90% | FC 10% | FC 90% | EB: SD 10% | EB: SD 90% | EB: FC 10% | EB: FC 90% | Technical Reps: SD (CV) | Adjusted: SD | Adjusted: FC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adrenal gland | 0.74 | 4.43 | 0.09 | 1.24 | 1.20 | 11.92 | 0.44 | 0.82 | 2.39 | 5.20 | 1.21 | *** | *** |
| Colon | 0.62 | 3.47 | 0.09 | 1.03 | 1.19 | 7.78 | 0.55 | 0.59 | 3.02 | 3.28 | 0.60 | 0.38 | 2.14 |
| Esophagus | 1.03 | 7.78 | 0.09 | 1.69 | 1.21 | 29.14 | 0.98 | 1.05 | 7.13 | 8.10 | 0.69 | 0.94 | 6.57 |
| Gall bladder | 1.09 | 8.91 | 0.16 | 1.77 | 1.36 | 34.76 | 1.07 | 1.13 | 8.45 | 9.62 | 0.74 | 0.89 | 5.88 |
| Heart | 1.17 | 10.44 | 0.14 | 1.85 | 1.31 | 40.82 | 0.74 | 1.23 | 4.41 | 11.7 | 0.67 | 1.11 | 9.23 |
| Kidney | 0.79 | 4.89 | 0.07 | 1.32 | 1.16 | 14.07 | 0.69 | 0.75 | 3.99 | 4.45 | 0.73 | 0.60 | 3.35 |
| Liver | 0.97 | 6.90 | 0.10 | 1.60 | 1.23 | 24.35 | 0.88 | 0.95 | 5.86 | 6.64 | 0.78 | 0.81 | 5.02 |
| Lung | 1.06 | 8.32 | 0.12 | 1.75 | 1.28 | 33.07 | 0.90 | 1.02 | 5.99 | 7.68 | 0.80 | 0.91 | 6.19 |
| Ovary | 0.77 | 4.70 | 0.10 | 1.26 | 1.21 | 12.36 | 0.66 | 0.74 | 3.72 | 4.42 | 0.71 | 0.35 | 2.03 |
| Prostate gland | 0.72 | 4.24 | 0.08 | 1.19 | 1.17 | 10.82 | 0.58 | 0.70 | 3.20 | 4.05 | 0.65 | 0.48 | 2.60 |
| Rectum | 0.68 | 3.88 | 0.06 | 1.04 | 1.12 | 7.98 | 0.57 | 0.61 | 3.12 | 3.39 | 0.59 | 0.39 | 2.20 |
| Testis | 0.82 | 5.20 | 0.08 | 1.35 | 1.18 | 15.00 | 0.66 | 0.80 | 3.71 | 4.99 | 0.76 | 0.53 | 2.87 |

*** Note: the SD estimate for technical replicates was large for adrenal gland, resulting in a negative adjusted SD.
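The SD and FC columns of Table 3 appear to be related by FC ≈ exp(2 × SD) on the natural-log scale. This relation is inferred here from the reported values and is an assumption of the sketch below rather than a statement from the authors; the function names are hypothetical. Because the tabulated SDs are rounded to two decimals, the check asks only whether the reported FC falls inside the range implied by that rounding.

```python
import math


def fold_change_range(sd_log: float) -> float:
    """Fold change spanned by two standard deviations on the natural-log scale."""
    return math.exp(2.0 * sd_log)


def consistent_with_rounding(sd_reported: float, fc_reported: float,
                             half_step: float = 0.005) -> bool:
    """True if fc_reported lies within the FC range implied by an SD that was
    rounded to two decimals (allowing for rounding of the FC itself)."""
    low = fold_change_range(sd_reported - half_step)
    high = fold_change_range(sd_reported + half_step)
    return low - half_step <= fc_reported <= high + half_step


# (pooled SD, pooled FC) pairs copied from Table 3.
pooled = {
    "Adrenal gland": (0.74, 4.43),
    "Colon": (0.62, 3.47),
    "Esophagus": (1.03, 7.78),
}

for tissue, (sd, fc) in pooled.items():
    implied = round(fold_change_range(sd), 2)
    print(tissue, implied, fc, consistent_with_rounding(sd, fc))
```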


Table 4. Comparison of variability in gene expression measurements from tissues matched with Kim and Wilhelm study samples. Measures include the number of biological replicates (N) and a pooled (across proteins) standard deviation (SD) estimate from the log PPM expression values, which also approximates the coefficient of variation (CV). Fold change range (FC) is based on $e^{2 \cdot \mathrm{SD}}$. 10th and 90th percentile values are given for individual protein SD and FC estimates using empirical Bayes (EB) estimates. Correlations of microarray (MA), MS proteomics (MS) and RNA-Seq (RS) are also shown.

| Tissue | N | Pooled: SD (CV) | Pooled: Fold Change | EB: SD* 10% | EB: SD* 90% | EB: FC* 10% | EB: FC* 90% | Correlation: MA vs MS | Correlation: RS vs MS | Correlation: MA vs RS |
|---|---|---|---|---|---|---|---|---|---|---|
| Colon | 32 | 0.33 | 1.94 | 0.14 | 0.46 | 1.32 | 2.53 | 0.30 | 0.38 | 0.79 |
| Heart | 8 | 0.32 | 1.91 | 0.13 | 0.46 | 1.31 | 2.52 | 0.32 | 0.41 | 0.75 |
| Kidney | 11 | 0.34 | 1.99 | 0.15 | 0.44 | 1.35 | 2.39 | 0.39 | 0.42 | 0.71 |
| Liver | 60 | 0.23 | 1.58 | 0.11 | 0.39 | 1.24 | 2.17 | 0.50 | 0.57 | 0.80 |
| Lung | 7 | 0.30 | 1.80 | 0.09 | 0.29 | 1.20 | 1.78 | 0.34 | 0.34 | 0.74 |
| Ovary | 12 | 0.56 | 3.09 | 0.29 | 0.76 | 1.79 | 4.59 | 0.23 | 0.40 | 0.54 |

* Note: single protein and EB estimates are nearly the same, so only EB estimates are shown.


Table 5. List of housekeeping proteins/genes. These proteins had low variation (CV) both within and between tissues and high expression (log PPM) between tissues. Genes corresponding to proteins also had low variation (CV) within tissues for microarray (MA) experiments and between tissues for RNA-Seq. Genes also had high RNA-Seq expression (log FPKM) between tissues.

| Gene | Uniprot | Name | CV protein expression (within tissues) | CV protein expression (between tissues) | Mean log PPM (between tissues) | CV for MA (within tissues) | CV for RNA-Seq (between tissues) | Mean log FPKM (between tissues) |
|---|---|---|---|---|---|---|---|---|
| GDI2 | P50395 | Rab GDP dissociation inhibitor beta | 15% | 17% | 6.57 | 20% | 37% | 5.18 |
| PGAM1 | P18669 | Phosphoglycerate mutase 1 | 19% | 27% | 7.44 | 15% | 32% | 4.82 |
| PGK1 | P00558 | Phosphoglycerate kinase 1 | 20% | 29% | 7.63 | 17% | 42% | 5.99 |
| PGRMC2 | O15173 | Membrane-associated progesterone receptor component 2 | 28% | 21% | 5.57 | 20% | 37% | 4.81 |
| PRDX1 | Q06830 | Peroxiredoxin-1 | 14% | 22% | 7.86 | 19% | 21% | 6.27 |
| PSMA4 | P25789 | Proteasome subunit alpha type-4 | 29% | 27% | 5.71 | 18% | 31% | 5.05 |
| PSMD9 | O00233 | 26S proteasome non-ATPase regulatory subunit 9 | 20% | 29% | 5.23 | 20% | 35% | 4.05 |
| RAB14 | P61106 | Ras-related protein Rab-14 | 17% | 20% | 6.38 | 19% | 32% | 4.30 |
| RAB2A | P61019 | Ras-related protein Rab-2A | 13% | 26% | 6.43 | 17% | 18% | 5.08 |
| RAC1 | P63000 | Ras-related C3 botulinum toxin substrate 1 | 16% | 18% | 6.15 | 18% | 36% | 5.08 |
| TALDO1 | P37837 | Transaldolase | 28% | 22% | 6.34 | 18% | 28% | 5.06 |
| TPI1 | P60174 | Triosephosphate isomerase | 21% | 22% | 7.91 | 20% | 42% | 5.53 |
| UBA52 | P62987 | Ubiquitin-60S ribosomal protein L40 | 26% | 27% | 8.53 | 17% | 44% | 6.46 |
| Mean across all genes or proteins | | | 65% | 56% | 4.23 | 30% | 76% | 2.83 |


Figure 1. A representation of protein expression measurement. In a typical proteomics study thousands of proteins can be measured across several orders of magnitude of concentration. The “normal” range in concentration or expression (on the log scale) for any protein is roughly twice the biological variation between samples (as measured by the standard deviation) above or below the mean expression level (µ). However, the measurement range adds the technical variation due to the measurement error of proteomics assays to the biological variation. Technical variation decreases proportionally to the square root of the number of replicates (n).
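The relationship described in the Figure 1 caption can be sketched numerically. The sketch assumes that biological and technical errors are independent, so that their variances add and the technical variance of an average of n replicates is divided by n; the caption itself states only that technical variation is added and shrinks with √n. All names (`sigma_bio`, `sigma_tech`, `n_replicates`) and the example values are hypothetical.

```python
import math


def normal_range(mu: float, sigma_bio: float) -> tuple:
    """Approximate 'normal' range on the log scale: mean +/- 2 biological SDs."""
    return (mu - 2.0 * sigma_bio, mu + 2.0 * sigma_bio)


def measurement_sd(sigma_bio: float, sigma_tech: float, n_replicates: int) -> float:
    """SD of a measured value averaged over n technical replicates, assuming
    independent biological and technical errors (variances add)."""
    if n_replicates < 1:
        raise ValueError("n_replicates must be at least 1")
    return math.sqrt(sigma_bio ** 2 + sigma_tech ** 2 / n_replicates)


def measurement_range(mu: float, sigma_bio: float, sigma_tech: float,
                      n_replicates: int) -> tuple:
    """Observed range: mean +/- 2 SDs of the measured values."""
    sd = measurement_sd(sigma_bio, sigma_tech, n_replicates)
    return (mu - 2.0 * sd, mu + 2.0 * sd)


# Illustrative values only: the observed range narrows toward the biological
# range as the number of technical replicates grows.
for n in (1, 4, 16):
    print(n, measurement_range(6.0, 0.3, 0.4, n))
print("biological only:", normal_range(6.0, 0.3))
```

With more replicates the technical term shrinks, but the measured range can never become narrower than the biological range itself.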


Figure 2. Comparison of log PPM values between Kim and Wilhelm studies across 12 normal tissues. Correlation coefficients (R) and fold change range estimates are reported on each of the figures.


Figure 3. Comparison of log PPM values between the mean of Kim and Wilhelm studies and datasets from PaxDB and Roche tissue studies. Correlation coefficients (R) are reported on each of the figures.
