Modified Spectral Count Index (mSCI) for Estimation of Protein Abundance by Protein Relative Identification Possibility (RIPpro): A New Proteomic Technological Parameter

Aihua Sun,†,‡ Jiyang Zhang,‡ Chunping Wang,† Dong Yang,‡ Handong Wei,‡ Yunping Zhu,‡ Ying Jiang,*,‡ and Fuchu He*,†,‡,§

Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100730, P. R. China, State Key Laboratory of Proteomics, Beijing Proteome Research Center, Beijing Institute of Radiation Medicine, Beijing 102206, P. R. China, and Institutes of Biomedical Sciences and Department of Chemistry, Fudan University, Shanghai 200032, P. R. China

Received March 16, 2009

Peptide count (SC) is widely used for protein abundance estimation in proteomics. Building on it, Mann and co-workers corrected the SC by dividing spectrum counts by the number of observable peptides per protein and named the result PAI. Here we present the modified spectral count index (mSCI) for protein abundance estimation, defined as the number of observed peptides divided by the protein relative identification possibility (RIPpro). RIPpro was derived from 6788 mRNA and protein expression data (collected from human liver samples) and is related to three physical and chemical properties of proteins (MW/pI/Hp). For 46 proteins in mouse neuro2a cells, mSCI shows a linear relationship with the actual protein concentration that is similar to or better than that of PAI. In addition, multiple linear regressions were performed to quantitatively assess the impact of several factors on the mRNA/protein abundance correlation. Our results showed that the primary factor affecting protein levels was mRNA abundance (32-37%), followed by variability in protein measurement, MW and protein turnover (7-12%, 7-9% and 2-3%, respectively). Interestingly, the concordance between mRNA transcripts and protein expression was not consistent among all protein functional categories: the correlation was lower for signaling proteins than for metabolic proteins. RIPpro was the primary factor affecting signaling protein abundance (23% on average), followed by mRNA abundance (17%). In contrast, only 5% (on average) of the variability of metabolic protein abundance was explained by RIPpro, much less than by mRNA abundance (40%). These results provide the impetus for further investigation of the biological significance of the mechanisms regulating the mRNA/protein abundance correlation and provide additional insight into the relative importance of the technological parameter (RIPpro) in mRNA/protein correlation research.
Keywords: liver • transcriptome • proteome • mass spectra • quantification • correlation

Journal of Proteome Research 2009, Vol. 8, No. 11, 4934–4942. Published on Web 09/18/2009. 10.1021/pr900252n CCC: $40.75. © 2009 American Chemical Society

* To whom correspondence should be addressed. Dr. Fuchu He, Beijing Proteome Research Center, Beijing Institute of Radiation Medicine, Beijing 102206, P. R. China 100850. E-mail, [email protected]; tel, 8610-80705001; fax, 8610-80705155. Dr. Ying Jiang, Beijing Proteome Research Center, Beijing Institute of Radiation Medicine, Beijing 102206, P. R. China 100850. E-mail, [email protected]; tel, 8610-80705299; fax, 8610-80705002.
† Chinese Academy of Medical Sciences & Peking Union Medical College.
‡ Beijing Institute of Radiation Medicine.
§ Fudan University.

Introduction

Some parameters, such as the hit rank, the score and the number of peptides per protein,1 can be considered indicators of protein abundance in the analyzed sample.2 Among them, the number of peptides identified has been widely used.3,4 On that basis, Mann and co-workers2,5 corrected the number of observed peptides by dividing spectrum counts by the number of observable peptides per protein and named the result PAI. PAI correlated well with protein concentration in the analyzed protein mixture.5 Because observable tryptic peptides were taken to be those in the mass range, the PAI describes not only the abundance of the protein in the sample but also its response to the measurement procedure. The latter results from a complex process that includes digestion efficiency, peptide solubility, extraction, ionization and fragmentation for each protein, and PAI cannot take all of these factors into account.5 As a more conventional measure, PAI and its analogous method (effective number of peptides) have been widely used in the literature,2,5-7 but the detailed algorithm came mainly from experience and was similar but not identical in every investigation.2,5-7 To investigate how operator experience affects quantitative accuracy, the correlations between protein concentration and several PAI variants were also calculated in this study.

Integrating mRNA expression studies and proteomic analysis in high-throughput studies may enable the results of gene

expression experiments to be interpreted in the proper context. Since mRNA is eventually translated into protein, it could be assumed that a correlation should exist between mRNA and protein expression levels. However, cells have adopted a number of elaborate mechanisms to regulate protein expression at the transcriptional, post-transcriptional, translational and post-translational levels. These mechanisms include, but are not limited to, regulation of transcription factor activity and chromatin structure modification to control transcription; splicing of mRNA or differential ribosomal loading to regulate post-transcriptional activity or translation, respectively; and protein degradation or export, which regulate post-translational protein activity. Therefore, investigating the discordance between mRNA and protein expression may identify novel post-transcriptional mechanisms that could be candidates for the design of therapeutics.8

Previous attempts to correlate protein abundance with mRNA expression levels have failed to clearly elucidate the cellular mechanisms involved in the regulation of this process.9-20 Previously published studies concluded that a stronger correlation exists when the protein/mRNA correlation is investigated across different tissues or during dynamic processes, whereas only a minimal correlation has been reported between mRNA and protein levels within a single sample. A number of factors may contribute to this poor correlation, including fundamental biological processes that have been thoroughly discussed in the literature, such as the regulation of transcription and translation, the mechanisms that regulate mRNA and protein stability, and a number of other cellular processes.
In addition to biological factors, the different sources of mRNA and protein data, the limited data size and inaccurate protein abundance estimation may also contribute to the poor mRNA-protein correlation reported in the literature.15,21-23 A number of studies have reported influencing factors in proteome and transcriptome comparison research.6,21,24 In 2006, Nie et al.21 performed the first quantitative study of how biochemical factors affect the mRNA/protein correlation in prokaryotic systems. It was determined that more accurate estimation of sample protein concentration and larger data sets could improve the accuracy of these analyses. Compared with the data sets used in prior investigations, data from CNHLPP (Chinese Human Liver Proteome Project) had several advantages: (1) transcriptome and proteome data were from the same initial sample, collected by the Subcommittee of Sample Collection and Banking of CNHLPP; (2) the data set was larger; (3) the estimation of protein abundance was more accurate; (4) the measurement variation for protein abundance was lower, owing to 7 repeated experiments. We therefore considered the high-quality, large-scale data from CNHLPP a more suitable data set for the mRNA/protein correlation analysis.

It can be assumed that, for the same number of molecules, larger proteins and proteins with many peptides in the preferred scan range for mass spectrometry will generate more observed peptides. Thus, the abundance of proteins with extreme physicochemical properties (too hydrophobic, acidic or alkaline, or of low molecular weight) cannot be measured accurately, which may lower the correlation coefficient between mRNA and protein abundance. Despite numerous published studies, the contribution of proteomic technology bias to the mRNA/protein correlation is poorly understood.25,26 This may be due to the lack of proper parameters to define the technological limitations of this methodology. In this study, from 6788 proteins identified in the human liver proteome (HLP)27 and based on three physical and chemical properties of proteins (MW/pI/Hp), which are related to the sample preparation, separation and mass analysis of proteins, we derived a new technological parameter, RIPpro. On that basis, we defined a new protein abundance index, mSCI (Modified Spectral Count Index), which is the number of observed peptides divided by RIPpro (see eq 3). Because it is learned from large-scale mass spectrum data, this quantitative index takes many influencing factors into account together. The advantages of mSCI as a measure of protein abundance are discussed further in this study. In addition, the effects of several factors, including RIPpro, on the mRNA/protein abundance correlation were analyzed quantitatively.

Experimental Procedures

Transcriptome and Proteome Data. All transcriptome and proteome data used for this analysis were from CNHLPP27 and were extracted from the same sample.

Transcriptome data: The pooled RNA sample, made from equal amounts of ten human liver RNAs, was labeled and hybridized to HG-U133plus 2.0 high-density oligonucleotide arrays (Affymetrix), which contain 54 675 probe sets representing 19 164 human genes and 15 136 ESTs on a single array. One μg of poly(A)+ RNA was annealed to oligo(dT) and transcribed using SuperScript II reverse transcriptase (Invitrogen, Carlsbad, CA). Labeling, hybridization, washing and signal scanning of the microarrays were performed according to the manufacturer’s instructions. Primary image analysis of the arrays was performed using GENECHIP 3.2 (Affymetrix, Santa Clara, CA), and normalization was performed using Mas 5.0 software (Affymetrix). Only transcripts that were declared “present” and had a fluorescence intensity above 100 were taken into account. When a gene had multiple probe sets, the maximum normalized expression signal of all probe sets matched to the gene was assigned as the signal for that gene in human liver.

Proteome data: Protein data were included in our analysis only if they met the following criteria: (1) All peptide data should satisfy a unified 95% confidence cutoff. (2) A reversed IPI database search is used to estimate false positive rates. (3) The qualified peptide should have an MS/MS sequence with more than six amino acids in order to be analyzed by the “shotgun” strategy. (4) Each of the identified proteins should have two or more matching peptides. (5) Protein Semiquantitation: Proteins from 7 large-scale batches of nongel technology lines were assigned normalized spectral count indexes (SCIN).28 An expectation maximization (EM) algorithm was used to distribute spectral count (SC) values among identified proteins and form spectral count indexes (SCI) by dividing EM distributed SC values.29 SCI values in each batch were normalized, and the final semiquantitation index of each identified protein was the arithmetic mean over batches. Semiquantitation was also computed by the modified spectral count index (mSCI), which is defined in eq 3.

There are 3880 proteins for which both the mRNA and the corresponding protein abundance have been determined. Among these, both protein abundance and half-life have been determined for 1885. After discarding several outliers (see Pearson’s Correlation Analysis of mRNA and Protein Abundance), 3875 or 1881 proteins, respectively, were used for downstream analysis.
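The transcript filtering and per-gene signal assignment described under Transcriptome data could be sketched as follows. This is a minimal illustration only: the function name `gene_signals`, the tuple layout and the example gene symbols and intensities are hypothetical, and the actual CNHLPP processing pipeline may differ.

```python
def gene_signals(probe_sets, min_intensity=100.0):
    """Assign one signal per gene from (gene, signal, call) probe-set records.

    Hypothetical helper: keeps probe sets called "present" with a signal
    above min_intensity, then takes the maximum signal per gene.
    """
    best = {}
    for gene, signal, call in probe_sets:
        if call != "present" or signal <= min_intensity:
            continue  # filtered out: not "present" or too weak
        if gene not in best or signal > best[gene]:
            best[gene] = signal  # keep the strongest probe set for the gene
    return best


# Made-up example records (not data from the study).
probes = [
    ("ALB", 5000.0, "present"),
    ("ALB", 7200.0, "present"),
    ("ALB", 9000.0, "absent"),
    ("TTR", 80.0, "present"),
]
signals = gene_signals(probes)
```

In this toy input the second ALB probe set wins, the "absent" one is ignored despite its higher value, and TTR is dropped because its only probe set falls below the intensity threshold.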

Index of RIP. To measure the ability of a specific technique to identify proteins in a large-scale study, we developed a new index, the protein relative identification possibility (RIPpro). RIPpro was derived from 6788 proteins identified in the human liver proteome (HLP) and is based on three important physical and chemical properties of proteins, namely molecular weight (MW), isoelectric point (pI) and hydrophobicity (Hp), which are related to the sample preparation, separation and mass analysis of proteins. First, the entire range of each property (MW, pI or Hp) in the IPI Human v3.07 database (ftp://ftp.ebi.ac.uk/pub/databases/IPI/old/HUMAN/ipi.HUMAN.v3.07.fasta.gz) was divided equally into 100 bins, and the number of proteins falling into each bin was counted and denoted $C^{p}_{\mathrm{All}}(i)$, where i is the index of the bin (Supporting Information Figure S1, A2, B2, C2). The number of human liver proteins (from HLP) in each bin was also counted and denoted $C^{p}_{\mathrm{Exp}}(i)$ (Supporting Information Figure S1, A1, B1, C1). $R^{p}(i)$ is defined as

$$R^{p}(i) = C^{p}_{\mathrm{Exp}}(i) \,/\, C^{p}_{\mathrm{All}}(i) \qquad (1)$$

$R^{p}(i)$ was calculated for each bin. The maximal value of the ratios was set equal to 1 and all values were normalized to it:

$$\mathrm{RIP}^{p}(i) = R^{p}(i) \,/\, R^{p}_{\max} \qquad (2)$$

where $R^{p}_{\max}$ is the maximal value of $R^{p}(i)$. For each protein in the IPI Human v3.07 database, the RIP of MW, pI and Hp was calculated by the above method (Supporting Information Figure S1, A3, B3, C3). The minimum of the three RIP values (MW, pI and Hp) for each protein is used to measure the identification possibility of the protein; this value is RIPpro.

To validate the accuracy, reliability and rationality of using RIPpro as a parameter in these analyses, mouse protein abundance data were collected from previously published studies.12,30,31 It can be assumed that, regardless of the species of origin (mouse or human), proteins in the same pI, MW or Hp range will have the same RIPpro, since the index reflects the identification properties of a protein in MS. On this basis, the RIPpro of each protein in the IPI Mouse v3.07 database was calculated (Supporting Information Table S1). The calculated difficulty grade for each protein in the 43 300-entry IPI Mouse v3.07 database is shown in Supplemental Figure S2. Grades I, II, III and IV are defined by gradually increasing RIPpro. Proteins in grades I, II and III are identified with greater difficulty than proteins in grade IV (control). The MW/pI/Hp range of each RIPpro grade is shown in Supplemental Table S2 and Supplemental Figure S2.

Protein Abundance Determination and Verification. Two methods of protein abundance determination were used: (1) spectral count (SC) and (2) spectral count modified with RIPpro (Modified Spectral Count Index, mSCI). The mSCI is defined as

$$\mathrm{mSCI} = \mathrm{SC} \,/\, \mathrm{RIP}_{\mathrm{pro}} \qquad (3)$$

where SC is the number of observed peptides per protein and RIPpro is the protein relative identification possibility derived from the 6788 proteins.
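The construction in eqs 1-3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`bin_index`, `rip_table`, `rip_pro`, `msci`), the dictionary layout, the handling of empty bins (skipped) and the toy property values are all hypothetical.

```python
from collections import Counter

N_BINS = 100  # each property range is divided equally into 100 bins


def bin_index(value, lo, hi, n_bins=N_BINS):
    """Index of the equal-width bin over [lo, hi] that contains value."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)  # value == hi falls in the last bin


def rip_table(all_values, identified_values, n_bins=N_BINS):
    """Per-bin RIP for one property: eq 1 followed by eq 2."""
    lo, hi = min(all_values), max(all_values)
    c_all = Counter(bin_index(v, lo, hi, n_bins) for v in all_values)
    c_exp = Counter(bin_index(v, lo, hi, n_bins) for v in identified_values)
    # eq 1, evaluated only for bins that contain database proteins
    ratio = {i: c_exp[i] / c_all[i] for i in c_all}
    r_max = max(ratio.values())
    # eq 2: normalize so that the largest ratio equals 1
    rip = {i: (r / r_max if r_max else 0.0) for i, r in ratio.items()}
    return lo, hi, n_bins, rip


def rip_pro(protein, tables):
    """RIPpro: the smallest RIP among the properties listed in tables."""
    values = []
    for prop, (lo, hi, n_bins, rip) in tables.items():
        values.append(rip[bin_index(protein[prop], lo, hi, n_bins)])
    return min(values)


def msci(spectral_count, rip_pro_value):
    """eq 3: mSCI = SC / RIPpro (undefined when RIPpro is 0)."""
    return spectral_count / rip_pro_value


# Toy example with 2 bins per property (made-up values, not study data).
tables = {
    "mw": rip_table([10, 20, 30, 40], [10, 20, 30], n_bins=2),
    "pi": rip_table([4, 6, 8, 10], [4, 6, 8, 10], n_bins=2),
}
example_rip = rip_pro({"mw": 35, "pi": 5}, tables)
example_msci = msci(4, example_rip)
```

In the toy example the protein sits in the upper MW bin, where only half of the database proteins were identified, so MW is the limiting property and a spectral count of 4 is scaled up accordingly.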

Forty-six protein concentrations were measured by “reversed” isotope dilution using SILAC-labeled proteins and unlabeled synthetic peptides in mouse neuro2a cells.2 To test the reliability and rationality of mSCI as a protein abundance estimate, the relationship between protein concentration and mSCI was calculated.

Protein Half-Life. Protein half-life was determined using the ProtParam program (http://us.expasy.org/tools/protparam.html),32,33 which relates the half-life of a protein to the identity of its N-terminal residue. The N-terminal residue of each protein sequence was extracted from the Swiss-Prot database (http://expasy.org/).

Protein Abundance Measurement Variation. The coefficient of variation (Cv) is defined as the ratio between the standard deviation and the mean of protein abundance. It reflects measurement stability, or uncertainty, and indicates the dispersion of the data relative to the mean. The proteome data used in this research were from 7 large-scale batches of nongel technology lines, that is, seven replicates of one sample. For each protein, Cv was used to quantify the measurement variation.

Cellular Functional Category. Gene ontology34 analysis was performed on our data set using a reduced set of GO categories (GO-slim) with GOfact (http://61.50.138.118/gofact/).33 On the basis of the gene ontology annotation,34 most proteins were classified into one of 24 categories of cellular function.

Gene Cluster and TreeView. Hierarchical clustering of human liver data was performed with Gene Cluster (rana.lbl.gov/Eisen-Software.htm) using average linkage clustering.35 The resultant dendrogram was visualized using TreeView software (rana.lbl.gov/EisenSoftware.htm).

Statistical Methods. Pearson’s correlation reflects the degree of linear relationship between two variables. The Spearman correlation coefficient can be applied to data whose distribution is non-normal or unknown. Simple linear regression and multiple linear regression analyses were performed to measure the correlation pattern between mRNA and protein abundance.6,24,36,37 All statistical analyses were performed using SAS software 8.0.
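For readers who want to reproduce the basic statistics named above without SAS, the coefficient of variation and the two correlation coefficients can be sketched in plain Python. This is an illustrative sketch, not the SAS procedures used in the study; the function names are hypothetical, and the use of the sample (n − 1) standard deviation for Cv is an assumption, since the text does not say which form was used.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Cv = standard deviation / mean (sample SD assumed)."""
    return statistics.stdev(values) / statistics.mean(values)


def pearson(x, y):
    """Pearson's r: degree of linear relationship between x and y."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))
```

Because Spearman's coefficient depends only on ranks, a monotonic but nonlinear relationship (for example y = x²  on positive x) gives rho = 1 while Pearson's r stays below 1.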

Results

Stability of RIPpro. The probability of a protein being observed differs significantly from platform to platform and from sample to sample. For a group of proteins, however, the trend of RIPpro should be stable across platforms and samples, which is why the index can be used broadly. To test this property, two independent experimental data sets from different platforms (from one sample) were extracted for analysis: Data 1 (BIRM-2DLC-LTQ, Digestion_SCXLC_RPLC), 5405 proteins from BIRM (Beijing Institute of Radiation Medicine); and Data 2 (SIBS-3DLC-LTQ, SCX/SAX/SEC_Digestion_SCXLC_RPLC), 5422 proteins from SIBS (Shanghai Institutes for Biological Sciences). Following the RIPpro formula, the 6788, 5405 and 5422 proteins were used, respectively, to calculate the RIPpro of each protein in IPI Human v3.07 (Supporting Information Table S2). Strong positive correlations existed among the different data sets: r = 0.98 (BIRM-2DLC-LTQ vs SIBS-3DLC-LTQ), r = 0.95 (6788 vs SIBS-3DLC-LTQ) and r = 0.95 (6788 vs BIRM-2DLC-LTQ). The results suggest that the RIPpro values in Supporting Information Table S2 could be generalized to other shotgun proteome data.

Validation of Protein Semiquantitation. Mann and co-workers measured 46 protein concentrations in mouse neuro2a cells by “reversed” isotope dilution using SILAC-labeled proteins and unlabeled synthetic peptides.2 Here the mSCI semiquantitation method was validated against these previously reported absolute protein amounts. As shown in Figure 1, the mSCI-based protein abundances were highly consistent with the actual values (R2 = 0.76, r = 0.87) in mouse neuro2a cells.

Figure 1. Relationship between mSCI and protein concentration. The protein concentration was measured for neuro2a cells in a published article.2

mSCI and PAI. PAI was defined as the number of sequenced peptides divided by the number of calculated observable peptides (Nobsbl), where the observable peptides are those in the mass range.5 Researchers have understood “observable” peptides differently.2,5-7 Rappsilber et al.5 took observable tryptic peptides (Nobsbl) to be those in the mass range 800-2400 Da. On that basis, Ishihama et al.2 eliminated peptides that were too hydrophilic or too hydrophobic. Heller et al.7 took the theoretically observable peptides to be those following the trypsin cleavage rules with zero or one missed cleavage and having a molecular mass between 720 and 3000 Da. Nie et al.6 took the “effective number of peptides” to be peptides of length 7-25 amino acids. These differences could be associated with the different mass ranges of identified peptides achieved on their mass spectrometers, and with their own experience. Each PAI was calculated according to the previously reported algorithms.2,5-7 As shown in Table 1, the numbers of observable peptides from the different algorithms are strongly correlated (r = 0.98-1). However, the relationship between PAI and the 46 protein concentrations varies considerably (R2 = 0.66-0.79) (Table 2), and most values were lower than the mSCI/protein concentration correlation (R2 = 0.75, r = 0.87). Moreover, the correlation between protein concentration and PAI with hydrophobic peptides eliminated was lower than that for PAI calculated from Nobsbl without eliminating hydrophobic peptides.

Pearson’s Correlation Analysis of mRNA and Protein Abundance. The measurement of protein abundance by MS has a number of technical limitations. The efficiency of peptide ionization and the ability of each molecule to enter the mass spectrometer depend upon protein composition and the local chemical environment. Therefore, alteration of these factors may result in the occasional emergence of statistical outliers. Previous studies have determined that Pearson’s correlation analysis is most suitable for normally distributed data; however, this method can be affected by outliers, which may strongly increase or decrease the strength of relationships.38 An outlier can be either a sample that does not fit the model or an error in measurement, and it is acceptable to discard outliers prior to computing the line of best fit (http://mathworld.wolfram.com/Outlier.html). In our analysis, the distance $d_n$ from each protein point $(x_n, y_n)$, n = 1, 2, ..., 3880, to the trendline (ax + by + c = 0) was calculated by $d_n = \lvert a x_n + b y_n + c \rvert / (a^2 + b^2)^{1/2}$, the mean value of which was defined as the distance mean ($\bar{d}$). Five (0.13%) of the points were determined to be outliers, since the difference between the distance ($d_n$) and the distance mean ($\bar{d}$) was at least 5-fold greater than the distance standard deviation (SD), as determined by the criterion $(d_n - \bar{d})/\mathrm{SD} \geq 5$. These protein points were therefore removed from the study. Among the five outliers, four had half-life information, resulting in 1881 proteins for downstream analysis.

To obtain a reliable correlation between mRNA and protein abundance, the correlation coefficient was determined by distance hierarchy. We first performed a simple regression analysis of protein and mRNA abundance based on the equation ax + by + c = 0 (x and y stand for the log-converted absolute mRNA and protein abundance, respectively). Three estimates of protein abundance were used (SC, mSCI and SCIN); the corresponding trendlines were 0.83x - y - 1.46 = 0, 0.97x - y - 1.24 = 0 and 0.83x - y - 3.79 = 0, respectively. The distance from a point $(x_0, y_0)$ to the trendline (ax + by + c = 0) was calculated by $d = \lvert a x_0 + b y_0 + c \rvert / (a^2 + b^2)^{1/2}$. On the basis of the distance of each of the 1881 proteins from the trendline, the proteins were divided into three groups. The first group, referred to as “5%”, comprises the 5% of the 1881 proteins with the longest distance to the trendline. The second group, referred to as “75%”, is composed of the 75% of the 1881 proteins with the shortest distance to the trendline. The third group, referred to as “20%”, contains the 20% of the proteins with a shorter distance to the trendline than the 5% group but a longer distance than the 75% group. To determine the distance hierarchy between groups, a Pearson’s correlation analysis was performed (Table 3). The correlation between mRNA and protein abundance for all data (0.56-0.60) was consistent with, or greater than, the correlation levels previously reported for yeast or cell lines.16,34,39 Furthermore, a greater correlation between mRNA and protein abundance was observed in the 95% (0.61-0.66) and 75% (0.75-0.79) groups than when all data were analyzed together (0.56-0.60).
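The distance-based outlier removal and the three-level distance hierarchy described above can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the distance is taken as the absolute perpendicular distance, the sample standard deviation is assumed, and group sizes are obtained by rounding the stated fractions, none of which is specified in detail in the text.

```python
import math
import statistics


def line_distance(x, y, a, b, c):
    """Perpendicular distance from (x, y) to the line a*x + b*y + c = 0."""
    return abs(a * x + b * y + c) / math.hypot(a, b)


def flag_outliers(distances, n_sd=5.0):
    """True where (d - mean) / SD >= n_sd, i.e. the stated 5-SD criterion."""
    mean = statistics.mean(distances)
    sd = statistics.stdev(distances)
    return [(d - mean) / sd >= n_sd for d in distances]


def hierarchy_groups(distances, far=0.05, near=0.75):
    """Label each point "5%", "20%" or "75%" by its distance rank."""
    n = len(distances)
    order = sorted(range(n), key=lambda i: distances[i])  # nearest first
    n_near = round(n * near)  # closest points form the "75%" group
    n_far = round(n * far)    # most distant points form the "5%" group
    labels = [""] * n
    for rank, i in enumerate(order):
        if rank < n_near:
            labels[i] = "75%"
        elif rank >= n - n_far:
            labels[i] = "5%"
        else:
            labels[i] = "20%"
    return labels


# Distance of a made-up point from the SC trendline 0.83x - y - 1.46 = 0.
d_example = line_distance(2.0, 0.2, 0.83, -1.0, -1.46)
```

With 1881 proteins, rounding 5% gives 94 proteins in the most distant group, which matches the number of proteins analyzed further in the text; the "95%" row of Table 3 corresponds to the "75%" and "20%" groups taken together.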
Table 1. The Correlation of Nobsbl from Different Algorithms

 | N (emPAI) | N (7-25) | N (7-25) (hdy < 1.5) | N (720-3000 Da) | N (720-3000 Da) (hdy < 1.5) | N (800-2400 Da) | N (800-2400 Da) (hdy < 1.5)
N (emPAI) | 1 | 0.98 | 0.98 | 0.99 | 0.99 | 0.98 | 0.98
N (7-25) | 0.98 | 1 | 1 | 1 | 0.99 | 0.99 | 0.99
N (7-25) (hdy < 1.5) | 0.98 | 1 | 1 | 0.99 | 0.99 | 0.99 | 0.99
N (720-3000 Da) | 0.99 | 1 | 0.99 | 1 | 1 | 0.99 | 0.99
N (720-3000 Da) (hdy < 1.5) | 0.99 | 0.99 | 0.99 | 1 | 1 | 1 | 1
N (800-2400 Da) | 0.98 | 0.99 | 0.99 | 0.99 | 1 | 1 | 1
N (800-2400 Da) (hdy < 1.5) | 0.98 | 0.99 | 0.99 | 0.99 | 1 | 1 | 1

Table 2. The Correlation of PAI (from Different Algorithms) and Actual Protein Concentration

PAI variant | R2 | r
PAI^a | 0.79 | 0.89
PAI (7-25) | 0.70 | 0.84
PAI (7-25) (hdy < 1.5) | 0.66 | 0.81
PAI (720-3000 Da) | 0.71 | 0.84
PAI (720-3000 Da) (hdy < 1.5) | 0.66 | 0.81
PAI (800-2400 Da) | 0.71 | 0.84
PAI (800-2400 Da) (hdy < 1.5) | 0.67 | 0.82

a Results from Ishihama, Y. et al.2

Table 3. Pearson’s Correlation Analysis of mRNA/Protein Abundance in Hierarchy^a

protein percentage | correlation^b (A) | correlation^b (B) | correlation^c (A) | correlation^c (B) | correlation^d (A) | correlation^d (B)
100% | 0.56 | 0.56 | 0.60 | 0.60 | 0.58 | 0.59
95% | 0.61 | 0.61 | 0.66 | 0.66 | 0.63 | 0.63
75% | 0.76 | 0.75 | 0.79 | 0.78 | 0.75 | 0.75

a (A) 3875 proteins have mRNA and corresponding protein abundance; (B) 1881 proteins have protein half-life, mRNA and corresponding protein abundance. b SC. c mSCI. d SCIN.

To further substantiate the relationship between the mRNA/protein correlation and the physical and chemical properties of proteins, the 94 “outline” proteins (5% of the 1881 proteins) whose distance to the trendline was greater than average were used for subsequent analysis. The 94 proteins were divided into two groups according to the dot plot of mRNA and protein abundance. Group 1 comprised the 48 proteins whose abundance was greater than that estimated by the single regression model from mRNA abundance. Group 2 comprised the 46 proteins whose abundance was lower than that estimated by the single regression model from mRNA abundance. The physical and chemical properties of the 94 “outline” proteins are shown in Supporting Information Table S4. Hypergeometric distribution analyses of functional categories and RIPpro grade were performed on the 94 “outline” proteins. Proteins with metabolic functions were found to be enriched in Group 1 and absent in Group 2. In contrast, signaling proteins were found to be concentrated in Group 2 and absent in Group 1 (Table 4A). Furthermore, Group 1 consisted of higher abundance proteins and showed a significant increase (p < 0.01) in proteins with low RIPpro (I, II, III). Conversely, Group 2 was predominantly composed of lower abundance proteins, in which low RIPpro (I, II, III) proteins were depleted (significantly or not) (Table 4B).

Multiple Regression Analyses of Protein Abundance and Some Covariates. To study the effect of various biochemical and physical factors on the correlation between mRNA and protein abundance, a multiple regression analysis was performed. The multiple regression model explained approximately 54% of the total protein abundance variability (Table 5). These results show that variation in protein abundance was mostly explained by mRNA abundance (32-37%); however, protein measurement variation and protein half-life (7-12% and 2-3%, respectively) each also had a significant effect on protein abundance. The influence of the biochemical and physical properties of proteins differed between protein abundance estimates. For SC quantitation, MW, Hp and pI had no influence on protein abundance. For mSCI and SCIN quantitation, Hp and pI still had no influence, but MW variation explained 7-9% of the protein abundance variability.

Covariates Affecting mRNA/Protein Correlation in Different GO Functional Categories. Protein categories were clustered based on similar levels of correlation between mRNA and protein abundance, total protein abundance, the physical and chemical properties of the protein (RIPpro), and the half-life of the protein (Table 6: cluster image). The clustering identified three predominant clusters, where each branch of the dendrogram represents proteins that share similarities in each functional category. The first cluster contains metabolic proteins; the second cluster comprises signaling proteins; and the third cluster is composed of proteins that have similarities in protein biosynthesis, protein metabolism and transport function. The third cluster also contains a number of signaling proteins.

Table 4. Hypergeometric Distribution Analysis of Functional Categories and RIPpro Grade on 94 “Outline” Proteins^a

(A) Hypergeometric Distribution Analysis of Functional Categories

name | pval* | pval#
Amino acid metabolism | 0.01 | 0.11
Carbohydrate metabolism | 0.03 | 0.10
Generation of energy | 0.03 | 0.11
Lipid metabolism | 0.19 | 0.17
Macromolecule metabolism | 0.18 | 0.02
Protein metabolism | 0.06 | 0.16
Cell adhesion | 0.13 | 0.30
Cell communication | 0.04 | 0.02
Cell cycle | 0.08 | 0.41
Cell death | 0.63 | 0.05
Cell differentiation | 0.61 | 0.41
Cell proliferation | 0.52 | 0.46
Cell signaling | 0.28 | 0.27
Immune response | 0.24 | 0.01
Response to stimulus | 0.37 | 0.03
Signal transduction | 0.06 | 0.07
Transcription | 0.18 | 0.23
Transport | 0.11 | 0.40

dir*: ++ ++ ++ + -+ + -
dir#: + -+ ++ + + + + ++ ++ + + -

(B) Hypergeometric Distribution Analysis of RIPpro Grade

RIPpro grade | M | N | m | n | expect | pval | dir
I* | 1881 | 48 | 80 | 8 | 2.041 | 0.001 | ++
II* | 1881 | 48 | 184 | 12 | 4.695 | 0.002 | ++
III* | 1881 | 48 | 380 | 19 | 9.697 | 0.001 | ++
I# | 1881 | 46 | 80 | 1 | 1.956 | 0.409 | -
II# | 1881 | 46 | 184 | 2 | 4.500 | 0.156 | -
III# | 1881 | 46 | 380 | 4 | 9.293 | 0.029 | - -

a M, number of all proteins in the analysis; m, number of identified proteins of the given RIPpro grade; N, number of * or # proteins among the 94 proteins; n, number of identified proteins of the given RIPpro grade among the 94 proteins. “dir” denotes the enrichment or depletion of the function classes: “+” for enriched, “++” for significantly enriched, “-” for depleted, “- -” for significantly depleted. *Protein abundance was higher than that estimated by the single regression model from mRNA abundance; #protein abundance was lower than that estimated by the single regression model from mRNA abundance.
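The quantities in Table 4B can be sketched with a hypergeometric model. In this minimal illustration the function names are hypothetical, and reading the reported p-values as an upper tail for enrichment and a lower tail for depletion is an assumption; the expected count N·m/M, however, reproduces the “expect” column directly.

```python
from math import comb


def hypergeom_pmf(k, M, m, N):
    """P(X = k) when N proteins are drawn from M, of which m carry the grade."""
    return comb(m, k) * comb(M - m, N - k) / comb(M, N)


def enrichment(M, m, N, n):
    """Return (expected count, P(X >= n), P(X <= n)) for the observed n."""
    expect = N * m / M
    k_min = max(0, N - (M - m))
    k_max = min(m, N)
    upper = sum(hypergeom_pmf(k, M, m, N) for k in range(n, k_max + 1))
    lower = sum(hypergeom_pmf(k, M, m, N) for k in range(k_min, n + 1))
    return expect, upper, lower


# Grade I in Group 1 (row I*) and in Group 2 (row I#) of Table 4B.
expect_up, p_enriched, _ = enrichment(M=1881, m=80, N=48, n=8)
expect_down, _, p_depleted = enrichment(M=1881, m=80, N=46, n=1)
```

Under these assumptions, row I* gives a small upper-tail probability (enrichment of grade I proteins among the 48 Group 1 proteins), while row I# gives a lower-tail probability of roughly 0.41, which is not significant.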

As expected, each cluster has distinctive features. It was determined that proteins in the first cluster have a high mRNA/ protein correlation coefficient, high protein abundance, low RIPpro and low protein half-life. Proteins in the second cluster include those with a low mRNA/protein correlation coefficient, low protein abundance, high RIPpro and high protein half-life; these proteins have the inverse properties of those proteins in

mSCI for Estimation of Protein Abundance by RIPpro

research articles

Table 5. Contribution of Different Factors to the Variations of Protein Abundance

to reflect the possibility of protein identification. The Nobsbl-modified spectral count is often used for protein abundance estimation in the literature. Nobsbl is defined as the number of observable peptides, that is, the peptides within the mass range.5 Researchers differ in their understanding of “observable” peptides.2,5-7 This difference may derive from the typical mass range of peptide identifications achieved on their mass spectrometers, and also from their own experience. Each Nobsbl-modified spectral count (PAI) was calculated according to the previously reported algorithms.2,5-7 The results (Table 2) showed that the correlation between PAI and the concentrations of the 46 proteins varied widely (R2 = 0.66-0.79), and most values were lower than the correlation between protein concentration and the RIPpro-modified spectral count index (mSCI) (R2 = 0.75, r = 0.87). Also, from the definition of Nobsbl, eliminating peptides that are too hydrophobic would seem more reasonable, but the correlation between protein concentration and PAI with hydrophobic peptides eliminated was lower than that for PAI calculated from Nobsbl without such elimination. Together, these analyses suggest that mSCI might be a more stable and accurate index for protein abundance estimation than PAI. Finally, Pearson’s correlation analysis demonstrated a significant correlation between RIPpro and the divergence of the correlation between mRNA and protein abundance when data from previously published studies were analyzed12,30,31 (r = 0.92-1.0, p < 0.05) (Supporting Information Table S5, Figure S3).
A low RIPpro value indicated increased difficulty in protein identification and correlated with a greater difference between mRNA and protein abundance (Supporting Information Figure S3). These data indicate that RIPpro is a reliable technical parameter representing the influence of a protein’s physical and chemical properties during MS-based protein identification, and that it affects the mRNA/protein correlation.
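The comparison of PAI and mSCI against known protein concentrations can be sketched as follows. This is a minimal illustration, not the procedure used in the study: PAI and mSCI follow the definitions given in the text (observed peptides divided by observable peptides, and observed peptides divided by RIPpro), while the function names, the four-protein data set and the use of an untransformed Pearson correlation are assumptions made only for this example.

```python
import math

def pai(n_observed, n_observable):
    """PAI: observed (sequenced) peptides divided by observable peptides."""
    return n_observed / n_observable

def msci(n_observed, rip_pro):
    """mSCI: observed peptides divided by RIPpro."""
    return n_observed / rip_pro

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for four proteins (illustration only, not data from the study).
concentration = [1.0, 2.0, 4.0, 8.0]
n_observed = [3, 5, 12, 20]
n_observable = [30, 20, 35, 25]
rip_pro = [3.1, 2.4, 3.2, 2.6]

pai_values = [pai(o, n) for o, n in zip(n_observed, n_observable)]
msci_values = [msci(o, r) for o, r in zip(n_observed, rip_pro)]

# Report r and R^2 of each index against the known concentrations.
for name, values in (("PAI", pai_values), ("mSCI", msci_values)):
    r = pearson_r(values, concentration)
    print(f"{name}: r = {r:.2f}, R^2 = {r * r:.2f}")
```

With real data, the index whose R2 is higher and varies less across definitions of "observable" peptides would be the more stable estimator in the sense discussed above.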

| Factors | Partial R-square (a) | Partial R-square (b) | Partial R-square (c) |
|---|---|---|---|
| mRNA abundance | 0.32 (p < 0.01) | 0.37 (p < 0.01) | 0.34 (p < 0.01) |
| MW | 0.001 (p = 0.06) | 0.07 (p < 0.01) | 0.09 (p < 0.01) |
| Hp | ∼ | 0.002 (p < 0.01) | 0.007 (p < 0.01) |
| pI | ∼ | 0.0009 (p > 0.05) | ∼ |
| Protein measurement variation | 0.12 (p < 0.01) | 0.08 (p < 0.01) | 0.07 (p < 0.01) |
| Protein half-life | 0.03 (p < 0.01) | 0.02 (p < 0.01) | 0.02 (p < 0.01) |

(a) SC. (b) mSCI. (c) SCIN.

Cluster 1. The third cluster is composed of proteins with an intermediate phenotype, characterized by a medium or high mRNA/protein correlation coefficient and by high protein abundance, RIPpro and protein half-life. Collectively, these results support the use of multiple regression analysis to relate the mRNA/protein correlation to the physical and chemical properties of proteins in different functional categories. We next performed a multiple regression analysis for each of the 24 functional categories. The dependent variable was protein abundance, and the covariates were mRNA abundance, RIPpro and protein half-life (Table 6). We determined that the concordance between mRNA transcripts and metabolic proteins was greater than the concordance between mRNA transcripts and signaling proteins. Furthermore, our analysis demonstrated that mRNA abundance was the greatest factor contributing to metabolic protein abundance (40% on average). RIPpro (5%) also played a minor, but significant, role in the correlation between mRNA transcript and metabolic protein levels. In this study, a protein was classified as metabolic if it was involved in one of 11 cellular processes: carbohydrate metabolism, electron transport, generation of precursor metabolites and energy, nucleotide metabolism, amino acid and derivative metabolism, amine metabolism, lipid metabolism, organic acid metabolism, protein biosynthesis, protein metabolism or transport function. In contrast to metabolic proteins, only 17% of the variability of signaling protein abundance was due to mRNA abundance, less than the influence of RIPpro (23%). These data suggest that signaling proteins may adopt more elaborate regulatory mechanisms at the post-transcriptional and post-translational levels than metabolic proteins, while metabolic proteins are more effectively regulated at the transcriptional level.
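The per-category regression described above can be sketched in a few lines. This is a hedged illustration only: the text does not state how the partial R-square values were computed, so the sketch assumes one common convention, the increase in model R2 when a covariate is added to a model that already contains the others. The helper names and the toy values are hypothetical and are not data from the study.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def r_squared(columns, y):
    """R^2 of an ordinary least-squares fit of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * y[i] for i, row in enumerate(design)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def r2_increment(covariates, y, name):
    """Increase in R^2 when covariate `name` is added to a model holding all the others."""
    others = [values for key, values in covariates.items() if key != name]
    return r_squared(list(covariates.values()), y) - r_squared(others, y)

# Hypothetical values for eight proteins of one functional category (illustration only).
covariates = {
    "mRNA abundance": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "RIPpro": [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4],
    "half-life": [10.0, 30.0, 20.0, 5.0, 25.0, 15.0, 35.0, 12.0],
}
protein_abundance = [1.2, 1.9, 3.6, 3.8, 5.5, 5.7, 7.6, 7.9]

for name in covariates:
    print(name, round(r2_increment(covariates, protein_abundance, name), 3))
```

Running the same loop once per functional category would give a table of per-covariate contributions comparable in layout to Table 6. Note that when covariates are correlated, the increments depend on which other covariates are already in the model.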

Discussion

The Reliability and Rationality of RIPpro as a Technical Index. The RIPpro parameter was deduced from statistical and mathematical modeling and reflects the influence of physical and chemical properties on the accumulated observed component peptides of individual proteins. For this factor to be used in a multiple regression analysis, the reliability of RIPpro must first be validated. First, in this study, we derived RIPpro from a large-scale, high-quality proteome data set, which increases its reliability. Second, Nobsbl was defined as the theoretical number of observable peptides per protein,2 and previous studies have shown that the larger the Nobsbl, the greater the potential for a given protein to be identified by MS.2,5 We further determined that the derived RIPpro had a strong correlation with Nobsbl (rs = 0.82, p < 0.01) among 43 300 proteins (derived from the IPI Mouse v3.07 database) and can therefore be utilized as a parameter

Pearson’s Correlation Analysis of mRNA and Protein Abundance. By applying Pearson’s correlation analysis hierarchically, we were able to develop a comprehensive understanding of the relationship between mRNA and protein abundance. We determined that, although the correlation between mRNA and protein abundance was not strong for total proteins (r = 0.56-0.60), most proteins (75%) showed a very strong correlation (r = 0.75-0.79). This study used three estimates (SC, mSCI and SCIN), which represent the development of a quantitative approach in proteomics to estimate protein abundance. The correlation between mRNA and protein abundance was greater when analyzed by mSCI and SCIN than when analyzed with the SC protein abundance estimation (Table 3). Previous studies that investigated the mRNA/protein correlation predominantly relied on LC-MS/MS protein quantification using spectral counting directly (SC). Since the efficiency with which peptides are ionized and enter the mass spectrometer (MS) depends upon both protein composition and the local chemical environment,36 slight alterations in these factors can often produce large variation in the MS signal intensity. To decrease this variance, the protein abundance index (PAI) was developed,2,5 defined as the number of sequenced peptides of a protein divided by the number of its calculated, observable peptides. On the basis of this, the HLP protein abundance estimation (SCIN) also normalized the abundance of the same protein across different experiment batches. The more precise quantitative

Table 6. Contribution of Different Factors to Protein Abundance Variation among 24 Different GO Categories (a)

(a) Cluster image: 24 GO categories were clustered based on mRNA/protein correlation coefficient (C), protein abundance (P), RIPpro (R) and protein half-life (H). The percentages of medium/high protein abundance, RIPpro I and long protein half-life (≥20 h) were designed to represent the value of protein abundance, RIPpro and protein half-life for each “GO category”. All these values were normalized by cluster software before hierarchical clustering. Clusters with similar profiles were grouped together and graphically visualized by “TreeView” software. “Positive” is shown in shades of red; “negative” is green; and “zero” is black.

approach makes the high correlation coefficient obtained with SCIN more credible than that in previous reports. With these methods, we further analyzed the correlation between mRNA and protein abundance, taking into account the physical and chemical properties of the protein. These studies demonstrated stark differences between proteins categorized in Group 1 and Group 2. Proteins in Group 1, which predominantly comprised liver metabolism and energy-generating proteins, were characterized by a low RIPpro and contained fewer theoretical tryptic peptides. On the basis of SCIN principles, the spectral count (SC) values of identified proteins in this group were normalized by their numbers of theoretical tryptic peptides, and it was determined that proteins with similar SC levels, when normalized by a low theoretical tryptic peptide number, could be expressed at high levels. These results may explain the unexpected appearance of some proteins with low RIPpro values and high protein abundance in Group 1 and the presence of numerous high-RIPpro, low-abundance proteins in Group 2. The observation that the correlation between mRNA and protein abundance differed between proteins in different functional categories provided the impetus to perform a multiple regression analysis relating the mRNA/protein correlation to the physical and chemical properties of proteins in different functional categories. The results of this analysis suggest that the function of a protein as well as its physical and chemical properties contribute to the observed differences in the correlation between mRNA and protein abundance for each protein. Collectively, these results further elucidate mechanisms that regulate the relationship between mRNA and protein abundance.

Multiple Regression Analyses of Protein Abundance and Some Covariates. Among the covariates analyzed, we determined that mRNA abundance provided the greatest contribution to the variability of protein abundance (32-37%); this contribution was greater than previously reported in the literature (20-28%).21 Despite the obvious fact that protein synthesis is dependent upon mRNA, earlier studies that investigated the relationship between mRNA and protein abundance have consistently reported that little or no correlation exists between mRNA and protein levels.14,40 The results of our study, which utilized a large number of high-quality samples, provide strong evidence to the contrary and demonstrate that a significant correlation exists between mRNA and protein levels in the liver. In this study, we introduced the protein’s biochemical and physical properties (MW, Hp and pI) into the multiple regression analyses as covariates that affect the variation of protein abundance and determined that MW could explain 7-9% of the total variability for mSCI and SCIN, but none for SC. These results demonstrate the importance of considering protein properties when quantifying abundance and provide the first quantitative analysis of the influence of protein length (MW) on protein abundance.41-44 For estimation of protein abundance based on spectral count, which ignores the influence of a protein’s physical and chemical properties,2 the results show that protein properties made no contribution to the total variation of protein abundance. However, based on our results, it is clear that this type of measurement is neither accurate nor reliable. A second predominant source of variation in protein abundance is the variability in protein abundance measurement (7-12%), which, in our study, was found to be lower than previously reported (34-44%).21 In CNHLPP, quantitative data from different experiment batches were normalized, which reduced the contribution of measurement variation to protein abundance compared with the other estimate (12%). These results suggest that multiple batches of experimental data could provide a more robust analysis of variation. Furthermore, the adoption of the Cv as the index of variation may explain the decreased impact that protein measurement variation had on the correlation between mRNA and protein abundance. Finally, the range of protein abundance in this study was broad enough (SC: 1-26 549) that, although the standard deviation (SD) was previously used as the protein measurement variation parameter,21 Cv was more suitable. Protein half-life is a well-known factor that affects the correlation between mRNA and protein abundance. In this study, protein half-life was estimated by the N-terminal rule. This analysis determined that protein half-life contributes 2-3% of the total variation of the mRNA/protein correlation, which is lower than in previous studies (5%).21
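The preference for Cv over SD when abundance spans several orders of magnitude can be illustrated with a short sketch. It assumes the usual definition Cv = SD/mean with the sample standard deviation; the replicate spectral counts are hypothetical and chosen only so that both proteins have the same relative spread.

```python
import statistics

def coefficient_of_variation(values):
    """Cv: sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical replicate spectral counts for two proteins (illustration only).
low_abundance = [2, 3, 4]
high_abundance = [2000, 3000, 4000]  # same relative spread, 1000-fold higher counts

for name, counts in (("low", low_abundance), ("high", high_abundance)):
    sd = statistics.stdev(counts)
    cv = coefficient_of_variation(counts)
    print(f"{name}: SD = {sd:.1f}, Cv = {cv:.3f}")
```

The SD scales with the abundance itself, so in a regression it would partly act as a proxy for abundance, whereas Cv is identical for the two proteins and reflects only the relative measurement variation.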

Covariates Affecting mRNA/Protein Correlation in Different GO Functional Categories. Whether the protein/mRNA correlation is related to protein abundance has yet to be fully elucidated and is therefore controversial.14,25 A number of studies have concluded that protein abundance is the most important factor contributing to the variability in the correlation between mRNA and protein abundance.25 Other studies, however, report that the correlation between mRNA and protein levels is not dependent upon protein abundance.14 The results presented in the current study provide significant evidence in support of the theory that the correlation between mRNA and protein levels is dependent upon protein abundance. Protein half-life, which is frequently discussed as a factor contributing to the correlation between mRNA and protein levels, was not predicted by our multiple regression analysis to be a contributing factor. Multiple regression analysis enables the investigation of relationships between several independent or predictor variables and a dependent or criterion variable. Therefore, the observation that protein half-life does not contribute to this correlation may be due to the fact that protein half-life has been shown to be strongly related to protein abundance.45 Elucidating the contributions that several covariates make to variation in the correlation between mRNA and protein abundance would further increase our understanding of the cellular transcriptional and post-transcriptional regulation mechanisms required to control and maintain protein levels.

Acknowledgment. This work was partially supported by Chinese State Key Projects for Basic Research (Nos. 2006CB910801, 2006CB910401, 2006CB910602 and 2010CB912700), Chinese State High-tech Program (863) (2006AA02A308), National Natural Science Foundation of China for Creative Research Groups (30621063), National Natural Science Foundation of China (30700356, 30700988, 30972909), Chinese State Key Project Specialized for Infectious Diseases (2008ZX10002-016, 2009ZX10004-103, 2009ZX09301-002) and International Scientific Collaboration Program (2009DFB33070).

Supporting Information Available: Tables of RIPpro and Nobsbl of 43 300 proteins, the MW/pI/Hp range of four different RIPpro grades in 43 300 proteins from mouse IPI 3.07 data set, RIPpro of 50 207 proteins (derived from IPI human V 3.07 database), which were calculated by three data, physical and chemical properties of 94 “outline” proteins derived from 1881 proteins, correlation of RIPpro grade and the mRNA/protein abundance divergence. Schematic diagram of the relative identification rates for proteins with different physicochemical property range, identified difficulty grade of 43 300 mouse IPI3.07 proteins, distribution of different RIPpro grade proteins along with the divergence of mRNA/protein abundance. This material is available free of charge via the Internet at http://pubs.acs.org.

References

(1) Corbin, R. W.; Paliy, O.; Yang, F.; Shabanowitz, J.; Platt, M.; Lyons, C. E., Jr.; Root, K.; McAuliffe, J.; Jordan, M. I.; Kustu, S.; Soupene, E.; Hunt, D. F. Toward a protein profile of Escherichia coli: comparison to its transcription profile. Proc. Natl. Acad. Sci. U.S.A. 2003, 100 (16), 9232–7.
(2) Ishihama, Y.; Oda, Y.; Tabata, T.; Sato, T.; Nagasu, T.; Rappsilber, J.; Mann, M. Exponentially modified protein abundance index (emPAI) for estimation of absolute protein amount in proteomics by the number of sequenced peptides per protein. Mol. Cell. Proteomics 2005, 4 (9), 1265–72.
(3) Old, W. M.; Meyer-Arendt, K.; Aveline-Wolf, L.; Pierce, K. G.; Mendoza, A.; Sevinsky, J. R.; Resing, K. A.; Ahn, N. G. Comparison of label-free methods for quantifying human proteins by shotgun proteomics. Mol. Cell. Proteomics 2005, 4 (10), 1487–502.
(4) Liu, H.; Sadygov, R. G.; Yates, J. R., III. A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Anal. Chem. 2004, 76 (14), 4193–201.
(5) Rappsilber, J.; Ryder, U.; Lamond, A. I.; Mann, M. Large-scale proteomic analysis of the human spliceosome. Genome Res. 2002, 12 (8), 1231–45.
(6) Nie, L.; Wu, G.; Zhang, W. Correlation of mRNA expression and protein abundance affected by multiple sequence features related to translational efficiency in Desulfovibrio vulgaris: a quantitative analysis. Genetics 2006, 174 (4), 2229–43.
(7) Heller, M.; Schlappritzi, E.; Stalder, D.; Nuoffer, J. M.; Haeberli, A. Compositional protein analysis of high density lipoproteins in hypercholesterolemia by shotgun LC-MS/MS and probabilistic peptide scoring. Mol. Cell. Proteomics 2007, 6 (6), 1059–72.
(8) Chan, E. Integrating Transcriptomics and Proteomics. Genomics Proteomics 2006, 6 (3), 20–26.
(9) Griffin, T. J.; Gygi, S. P.; Ideker, T.; Rist, B.; Eng, J.; Hood, L.; Aebersold, R. Complementary profiling of gene expression at the transcriptome and proteome levels in Saccharomyces cerevisiae. Mol. Cell. Proteomics 2002, 1 (4), 323–33.
(10) Schmidt, M. W.; Houseman, A.; Ivanov, A. R.; Wolf, D. A. Comparative proteomic and transcriptomic profiling of the fission yeast Schizosaccharomyces pombe. Mol. Syst. Biol. 2007, 3, 79.
(11) Anderson, L.; Seilhamer, J. A comparison of selected mRNA and protein abundances in human liver. Electrophoresis 1997, 18 (3–4), 533–7.
(12) Kislinger, T.; Cox, B.; Kannan, A.; Chung, C.; Hu, P.; Ignatchenko, A.; Scott, M. S.; Gramolini, A. O.; Morris, Q.; Hallett, M. T.; Rossant, J.; Hughes, T. R.; Frey, B.; Emili, A. Global survey of organ and organelle protein expression in mouse: combined proteomic and transcriptomic profiling. Cell 2006, 125 (1), 173–86.
(13) Forner, F.; Foster, L. J.; Campanaro, S.; Valle, G.; Mann, M. Quantitative proteomic comparison of rat mitochondria from muscle, heart, and liver. Mol. Cell. Proteomics 2006, 5 (4), 608–19.
(14) Chen, G.; Gharib, T. G.; Huang, C. C.; Taylor, J. M.; Misek, D. E.; Kardia, S. L.; Giordano, T. J.; Iannettoni, M. D.; Orringer, M. B.; Hanash, S. M.; Beer, D. G. Discordant protein and mRNA expression in lung adenocarcinomas. Mol. Cell. Proteomics 2002, 1 (4), 304–13.
(15) Mootha, V. K.; Bunkenborg, J.; Olsen, J. V.; Hjerrild, M.; Wisniewski, J. R.; Stahl, E.; Bolouri, M. S.; Ray, H. N.; Sihag, S.; Kamal, M.; Patterson, N.; Lander, E. S.; Mann, M. Integrated analysis of protein composition, tissue diversity, and gene regulation in mouse mitochondria. Cell 2003, 115 (5), 629–40.
(16) Washburn, M. P.; Koller, A.; Oshiro, G.; Ulaszek, R. R.; Plouffe, D.; Deciu, C.; Winzeler, E.; Yates, J. R., III. Protein pathway and complex clustering of correlated mRNA and protein expression analyses in Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. U.S.A. 2003, 100 (6), 3107–12.
(17) Williamson, A. J.; Smith, D. L.; Blinco, D.; Unwin, R. D.; Pearson, S.; Wilson, C.; Miller, C.; Lancashire, L.; Lacaud, G.; Kouskoff, V.; Whetton, A. D. Quantitative proteomics analysis demonstrates post-transcriptional regulation of embryonic stem cell differentiation to hematopoiesis. Mol. Cell. Proteomics 2008, 7 (3), 459–72.
(18) Minagawa, H.; Honda, M.; Miyazaki, K.; Tabuse, Y.; Teramoto, R.; Yamashita, T.; Nishino, R.; Takatori, H.; Ueda, T.; Kamijo, K.; Kaneko, S. Comparative proteomic and transcriptomic profiling of the human hepatocellular carcinoma. Biochem. Biophys. Res. Commun. 2008, 366 (1), 186–92.
(19) Irmler, M.; Hartl, D.; Schmidt, T.; Schuchhardt, J.; Lach, C.; Meyer, H. E.; Hrabe de Angelis, M.; Klose, J.; Beckers, J. An approach to handling and interpretation of ambiguous data in transcriptome and proteome comparisons. Proteomics 2008, 8 (6), 1165–9.
(20) Ghaemmaghami, S.; Huh, W. K.; Bower, K.; Howson, R. W.; Belle, A.; Dephoure, N.; O’Shea, E. K.; Weissman, J. S. Global analysis of protein expression in yeast. Nature 2003, 425 (6959), 737–41.
(21) Nie, L.; Wu, G.; Zhang, W. Correlation between mRNA and protein abundance in Desulfovibrio vulgaris: a multiple regression to identify sources of variations. Biochem. Biophys. Res. Commun. 2006, 339 (2), 603–10.
(22) Cox, B.; Kislinger, T.; Emili, A. Integrating gene and protein expression data: pattern analysis and profile mining. Methods 2005, 35 (3), 303–14.
(23) Hegde, P. S.; White, I. R.; Debouck, C. Interplay of transcriptomics and proteomics. Curr. Opin. Biotechnol. 2003, 14 (6), 647–51.
(24) Nie, L.; Wu, G.; Brockman, F. J.; Zhang, W. Integrated analysis of transcriptomic and proteomic data of Desulfovibrio vulgaris: zero-inflated Poisson regression models to predict abundance of undetected proteins. Bioinformatics 2006, 22 (13), 1641–7.
(25) Gygi, S. P.; Rochon, Y.; Franza, B. R.; Aebersold, R. Correlation between protein and mRNA abundance in yeast. Mol. Cell. Biol. 1999, 19 (3), 1720–30.
(26) Nie, L.; Wu, G.; Culley, D. E.; Scholten, J. C.; Zhang, W. Integrative analysis of transcriptomic and proteomic data: challenges, solutions and applications. Crit. Rev. Biotechnol. 2007, 27 (2), 63–75.
(27) Zheng, J.; Gao, X.; Mato, J.; Beretta, L.; He, F. Report of the 9th HLPP Workshop October 2007, Seoul, Korea. Proteomics 2008, 8 (17), 3420–3.
(28) Jiang, Y.; Ying, W.; Wu, S.; Chen, M.; Guan, W.; Yang, D.; Song, Y.; Liu, X.; Li, J.; Hao, Y.; Sun, A.; Geng, C.; Li, H.; Mi, W.; Zhang, Y.; Zhang, J.; Chen, X.; Li, L.; Gong, Y.; Li, T.; Ma, J.; Li, D.; Yuan, X.; Zhang, X.; Xue, X.; Zhu, Y.; Qian, X.; He, F.; Zhong, F.; Shen, H.; Lin, C.; Lu, H.; Wei, L.; Cao, J.; Yun, D.; Gao, M.; Fan, H.; Cheng, G.; Yu, Y.; Xie, L.; Wang, H.; Yang, P. Y.; Shi, L.; Tong, W.; Li, X.; Wang, Y.; Liu, S.; Sheng, Q.; Zeng, R.; Sun, Y.; Xu, Y.; Cai, J.; He, P.; Gao, H.; Zhao, X. H.; Tan, Y.; Yan, H.; Yang, Y.; Huang, J.; Han, Z. G.; He, Q.; Chen, P.; Liang, S.; Zhao, M.; Mao, X.; Yu, H.; Cao, Z.; Li, Y.; Dai, W.; Jiang, H.; Wang, D.; Zheng, J.; Xue, G.; Tang, Y.; Cheng, J.; Liu, Y.; Wang, X.; Jia, J.; An, D.; Wang, Z.; Li, Q.; Cui, T. First insight into human liver proteome from PROTEOMESKYLIVERHu 1.0, a publicly-available database. J. Proteome Res. 2009, DOI: 10.1021/pr900532r.
(29) Xue XF, W. S.; Zu, Y. P.; He, F. C. Improving label-free protein quantification methods using expectation maximization-like algorithm in shotgun proteomics. Chin. J. Anal. Chem. 2007, 1, 19–24.

Journal of Proteome Research • Vol. 8, No. 11, 2009

(30) Zhang, W.; Morris, Q. D.; Chang, R.; Shai, O.; Bakowski, M. A.; Mitsakakis, N.; Mohammad, N.; Robinson, M. D.; Zirngibl, R.; Somogyi, E.; Laurin, N.; Eftekharpour, E.; Sat, E.; Grigull, J.; Pan, Q.; Peng, W. T.; Krogan, N.; Greenblatt, J.; Fehlings, M.; van der Kooy, D.; Aubin, J.; Bruneau, B. G.; Rossant, J.; Blencowe, B. J.; Frey, B. J.; Hughes, T. R. The functional landscape of mouse gene expression. J. Biol. 2004, 3 (5), 21.
(31) Su, A. I.; Wiltshire, T.; Batalov, S.; Lapp, H.; Ching, K. A.; Block, D.; Zhang, J.; Soden, R.; Hayakawa, M.; Kreiman, G.; Cooke, M. P.; Walker, J. R.; Hogenesch, J. B. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl. Acad. Sci. U.S.A. 2004, 101 (16), 6062–7.
(32) Wilkins, M. R.; Gasteiger, E.; Bairoch, A.; Sanchez, J. C.; Williams, K. L.; Appel, R. D.; Hochstrasser, D. F. Protein identification and analysis tools in the ExPASy server. Methods Mol. Biol. 1999, 112, 531–52.
(33) Li, D.; Li, J. Q.; Ouyang, S. G.; S.F., W.; Wang, J.; Xu, X. J.; Zhu, Y. P.; He, F. C. An Integrated Strategy for Functional Analysis in Large-scale Proteomic Research by Gene Ontology. Prog. Biochem. Biophys. 2005, 32 (11), 1026–9.
(34) Ideker, T.; Thorsson, V.; Ranish, J. A.; Christmas, R.; Buhler, J.; Eng, J. K.; Bumgarner, R.; Goodlett, D. R.; Aebersold, R.; Hood, L. Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 2001, 292 (5518), 929–34.
(35) Eisen, M. B.; Spellman, P. T.; Brown, P. O.; Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A. 1998, 95 (25), 14863–8.
(36) Shen, Y.; Zhao, R.; Berger, S. J.; Anderson, G. A.; Rodriguez, N.; Smith, R. D. High-efficiency nanoscale liquid chromatography coupled on-line with mass spectrometry using nanoelectrospray ionization for proteomics. Anal. Chem. 2002, 74 (16), 4235–49.
(37) Wu, G.; Nie, L.; Zhang, W. Relation between mRNA expression and sequence information in Desulfovibrio vulgaris: combinatorial contributions of upstream regulatory motifs and coding sequence features to variations in mRNA abundance. Biochem. Biophys. Res. Commun. 2006, 344 (1), 114–21.
(38) Moore, D. S.; McCabe, G. P. Introduction to the Practice of Statistics; W.H. Freeman: New York, 1999.
(39) Lian, Z.; Kluger, Y.; Greenbaum, D. S.; Tuck, D.; Gerstein, M.; Berliner, N.; Weissman, S. M.; Newburger, P. E. Genomic and proteomic analysis of the myeloid differentiation program: global analysis of gene expression during induced differentiation in the MPRO cell line. Blood 2002, 100 (9), 3209–20.
(40) Chen, G.; Gharib, T. G.; Wang, H.; Huang, C. C.; Kuick, R.; Thomas, D. G.; Shedden, K. A.; Misek, D. E.; Taylor, J. M.; Giordano, T. J.; Kardia, S. L.; Iannettoni, M. D.; Yee, J.; Hogg, P. J.; Orringer, M. B.; Hanash, S. M.; Beer, D. G. Protein profiles associated with survival in lung adenocarcinoma. Proc. Natl. Acad. Sci. U.S.A. 2003, 100 (23), 13537–42.
(41) Coghlan, A.; Wolfe, K. H. Relationship of codon bias to mRNA concentration and protein length in Saccharomyces cerevisiae. Yeast 2000, 16 (12), 1131–45.
(42) Urrutia, A. O.; Hurst, L. D. The signature of selection mediated by expression on human genes. Genome Res. 2003, 13 (10), 2260–4.
(43) Lemos, B.; Bettencourt, B. R.; Meiklejohn, C. D.; Hartl, D. L. Evolution of proteins and gene expression levels are coupled in Drosophila and are independently associated with mRNA abundance, protein length, and number of protein-protein interactions. Mol. Biol. Evol. 2005, 22 (5), 1345–54.
(44) Warringer, J.; Blomberg, A. Evolutionary constraints on yeast protein size. BMC Evol. Biol. 2006, 6, 61.
(45) Belle, A.; Tanay, A.; Bitincka, L.; Shamir, R.; O’Shea, E. K. Quantification of protein half-lives in the budding yeast proteome. Proc. Natl. Acad. Sci. U.S.A. 2006, 103 (35), 13004–9.

PR900252N