Article pubs.acs.org/jpr
Biological Network Module-Based Model for the Analysis of Differential Expression in Shotgun Proteomics
Jia Xu,† Lily Wang,‡ and Jing Li*,†,§
† Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, People’s Republic of China
‡ Department of Biostatistics, Vanderbilt University, Nashville, Tennessee 37203, United States
§ Shanghai Center for Bioinformation Technology, Shanghai 201203, People’s Republic of China
Supporting Information
ABSTRACT: Protein differential expression analysis plays an important role in the understanding of molecular mechanisms as well as the pathogenesis of complex diseases. With the rapid development of mass spectrometry, shotgun proteomics using spectral counts has become a prevailing method for the quantitative analysis of complex protein mixtures. Existing methods for differential expression analysis in proteomics typically carry out the analysis at the single-protein level. However, it is well-known that proteins interact with each other when they function in biological processes. In this study, focusing on biological network modules, we proposed a negative binomial generalized linear model for differential expression analysis of spectral count data in shotgun proteomics. To show the efficacy of the model in protein expression analysis at the level of protein modules, we conducted two simulation studies, one using synthetic data sets generated from a theoretical distribution of count data and one using a real data set with shuffled counts. Then, we applied our method to a colorectal cancer data set and a nonsmall cell lung cancer data set. Compared with single-protein analysis methods, the module-based statistical model, which takes into account the interactions among proteins, led to more effective identification of subtle but coordinated changes at the systems level.
KEYWORDS: differential expression analysis, biological network module, negative binomial model, spectral count, shotgun proteomics
Received: July 14, 2014
Published: October 20, 2014
© 2014 American Chemical Society
dx.doi.org/10.1021/pr5007203 | J. Proteome Res. 2014, 13, 5743−5750
Journal of Proteome Research

■ INTRODUCTION

High-throughput detection technology for protein expression levels is having an increasing impact on biomedical research. In particular, the application of mass spectrometry in proteomics has made it possible to measure thousands of proteins simultaneously.1 Because of the massive amount of data generated by high-throughput technology, the focus of research has shifted to data analysis, in which a key issue is to find the proteins, or lists of proteins, that are differentially expressed under different conditions. Such studies help to identify important disease-related proteins, which can provide new information for early diagnosis and may also shed light on the pathogenesis of complex diseases.

In shotgun proteomics, peptides are separated and fragmented using multidimensional liquid chromatography (LC) and tandem mass spectrometry. Mass spectrometric methods based on stable isotope labeling, such as ICAT,2 SILAC,3 and iTRAQ,4 emerged first; they rely on the labeled substance having physicochemical properties similar to, but a mass different from, the corresponding natural isotope. However, the procedures of these methods are complicated, and they assume that the protein is not changed in the labeling process.5 Label-free methods were then developed for the quantitative analysis of peptides and proteins, using either intensity or spectral counts as the measurement.6 Among the sampling statistics, spectral counts are considered the most reproducible measurement in protein expression analysis.7 Spectral counting in shotgun proteomics has become the prevailing method for protein abundance quantification.

Several methods have already been applied to the statistical analysis of protein differential expression, including the t-test,8 Fisher’s exact test,9 the G-test,10 the AC test,11 and the local-pooled-error (LPE) test.12 Zhang et al. compared these methods and reported that Fisher’s exact test, the G-test, and the AC test could be used when replicates were limited (1 or 2), while the t-test was effective when there were 3 or more replicates.7 In recent years, other statistical models for spectral counts have also been proposed. A generalized G-test proposed by Zhang et al. applies to spectral count analysis and improves the sensitivity of identification.7 Originally used for GeneChip data, the power law global error model (PLGEM) has also been applied to shotgun proteomics data, using the normalized spectral abundance factor (NSAF) to normalize the size bias.13 Choi et al. proposed a statistical framework, QSpec, based on hierarchical Bayes estimation of a generalized linear mixed effects model.14 The Poisson model assumes that the variance is equal to the mean, an assumption that real spectral count data can hardly meet. The quasi-likelihood generalized linear model proposed by Li et al. allows overdispersion in the data.15 A negative binomial model was also proposed to handle the overdispersion in spectral count data and was shown to be more effective in zero-variance situations.16

Despite the many models proposed to analyze spectral count data, to our knowledge, they all perform the analysis at the single-protein level. Nevertheless, most biological functions are carried out by sets of interacting proteins. In practice, often few or no individual proteins reach statistical significance under single-protein statistical models, since the biological differences are modest relative to the noise in shotgun proteomics. Moreover, single-protein analysis that ignores the interactions between proteins may miss important effects on pathways or protein−protein interaction (PPI) subnetworks whose member proteins show modest changes in concert. In the analysis of gene expression microarray data, gene-set or pathway-based methods have been shown to have higher statistical power and better interpretability than single-gene-based approaches. For instance, Gene Set Enrichment Analysis (GSEA)17 and PAGE18 are applied to the analysis of microarray data; they integrate gene annotation databases such as Gene Ontology19 or KEGG PATHWAY20 to identify differentially expressed gene sets. The pathway-based mixed model for the analysis of microarray data proposed by Wang et al. was shown to have higher power than GSEA and PAGE.21

As a burgeoning method, label-free shotgun proteomics has already successfully identified changes in protein expression levels in some disease conditions. However, less effort has been devoted to the statistical analysis of shotgun proteomics data than to the analysis of microarrays, and methods developed for microarrays cannot simply be copied to spectral count data because the two data types follow different distributions. In this study, we therefore proposed a negative binomial generalized linear model based on biological network modules for the differential expression analysis of shotgun proteomics using spectral counts. We performed two simulation studies using synthetic data sets to assess the sensitivity and specificity of the model. We then applied the model to two real data sets and identified significantly differentially expressed protein modules.
■ MATERIALS AND METHODS

Data Sets Collecting and Preprocessing

We used three published shotgun proteomics data sets.13,22,23 The first is the yeast data set from Saccharomyces cerevisiae strain BY4741 provided in Pavelka’s work.13 The strain was grown in media labeled with 14N and 15N, with 4 independent cultures for each condition. No difference in protein expression level is expected between the two conditions, as the 14N- and 15N-labeled proteins were mixed at a 1:1 ratio. The data set contained 1314 proteins, 7 of which were contaminant proteins; these contaminant proteins were excluded in our study. We used this yeast data set to generate synthetic data sets in the simulation study.

The second data set is a colorectal cancer data set collected from the tissue secretome of four colorectal cancer patients, provided in de Wit’s work.22 Specimens from both the tumor and adjacent normal colon mucosa were tested. The data set contained 2703 proteins. Since most biological network information is annotated at the gene level, we mapped these proteins to gene symbols using BioMart.24 When two or more proteins mapped to the same gene symbol, we randomly picked one protein entry. After preprocessing, the data set contained 2653 protein entries with unique gene symbols.

The third data set is a nonsmall cell lung cancer data set from Kikuchi’s work.23 According to the paper, 3621 protein groups were identified in the analysis comparing three conditions (adenocarcinoma, squamous cell carcinoma, and normal specimens). In our study, we performed differential analysis between the adenocarcinoma and the normal conditions, with 4 and 8 samples, respectively. We required at least 2 spectral counts in one of the samples and preprocessed the data in the same way as the colorectal cancer data set. After preprocessing, the data set contained 3100 protein entries with unique gene symbols.

Biological Network and Module Decomposition

A group of functionally related genes or proteins in a biological network is usually called a module. The integrated biological networks used in this study were obtained from KEGG pathway20 or protein−protein interaction (PPI) information. The KEGG pathway information was downloaded on April 11th, 2011. Each KEGG pathway was treated as a module. For the PPI network, we chose the STRING database (version 9.1).25 We used all the protein links with annotations of actions in the STRING database. Both experimental and predicted data were included. Here we define a PPI module as a group of closely connected proteins, namely a protein along with all its direct interactors. In addition, a PPI module was removed if fewer than 3 of its members were detected in the shotgun proteomics data set.

Generalized Linear Model

The generalized linear model (GLM) is an extension of the general linear model in which a link function transforms the expectation of the response variable. Common distributions of the response variable include the normal distribution, the Poisson distribution, the negative binomial distribution, and the gamma distribution. In our study, we proposed the following biological network module-based model for the differential analysis of spectral count data in shotgun proteomics:

log[E(y)] = μ + group + module + group × module

Here, y represents the spectral counts. The parameter μ is the overall mean. group and module are both indicator variables: group = 1 for the case group and group = 0 for the control group. Similarly, module = 1 if the protein is in the module and module = 0 otherwise. group × module represents the interaction effect between group and module. Since the interactions among proteins are taken into account, this statistical model is expected to improve the statistical power and accuracy of the analysis. Moreover, statistically significant modules can be identified when applying this model. Further functional analysis of these significantly different modules or the annotation of pathways will be beneficial to our understanding of the key biological processes or functional
As shown in Table 1, 6 scenarios were included in the first simulation study. For each scenario, both the case group and
modules in the pathogenesis of complex diseases. All analyses in this study were carried out in R.26 The corresponding R code is provided as Supporting Information (SI).
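To make the structure of the module-based model concrete, the following is a minimal Python sketch of how the predictors in log[E(y)] = μ + group + module + group × module could be laid out for one module. It is not the authors' implementation (their R code is in the SI); the function names `design_rows` and `expected_count` and the coefficient names `b_group`, `b_module`, and `b_inter` are hypothetical, and fitting the negative binomial GLM itself is left to a statistics package.

```python
import math

def design_rows(counts, in_module, is_case):
    """Return long-format rows (y, intercept, group, module, group*module).

    counts[i][j] : spectral count of protein i in sample j
    in_module[i] : True if protein i belongs to the module under test
    is_case[j]   : True if sample j belongs to the case group
    """
    rows = []
    for i, protein_counts in enumerate(counts):
        module = 1 if in_module[i] else 0
        for j, y in enumerate(protein_counts):
            group = 1 if is_case[j] else 0
            rows.append((y, 1, group, module, group * module))
    return rows

def expected_count(mu, b_group, b_module, b_inter, group, module):
    """E(y) under the log link for given (hypothetical) coefficient values."""
    eta = mu + b_group * group + b_module * module + b_inter * group * module
    return math.exp(eta)
```

Under this parameterization, the case/control ratio of expected counts is exp(b_group + b_inter) for proteins inside the module and exp(b_group) for proteins outside it, so exp(b_inter) is the module-specific fold change beyond the overall group effect.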
■ RESULTS AND DISCUSSION

Distribution Fitting

The actual distribution of spectral count data in shotgun proteomics is yet to be determined. For protein differential expression based on count data, the analysis usually starts with a Poisson model or a negative binomial model.27 To find out which model fits the spectral count data better, we used the Poisson distribution as well as the negative binomial distribution to fit the protein expression profile in the normal group in the colorectal cancer data set using the maximum-likelihood method, as shown in Figure 1. Compared with the

Table 1. AUC Values for the Model Using the Data Set Generated from the Theoretical Distribution

scenario   p^a   fc^b   AUC
1          0.3   1.5    0.7698
2          0.5   1.5    0.8638
3          0.8   1.5    0.9991
4          0.3   2      0.9301
5          0.5   2      0.9988
6          0.8   2      1.0000

a p = proportion of proteins with the treatment effect added to the first module in the case group. b fc = fold change.
the control group were simulated, each with 20 samples. For each sample, 1500 values were randomly generated from the negative binomial distribution, with the parameter mu set at 7 and the parameter size set at 0.5, as an approximation to the spectral counts. These parameter settings were based on the negative binomial fit to the average spectral counts in the normal group of the colorectal cancer data set. The 1500 values were randomly assigned to 50 modules, each with 30 proteins. We then added a treatment effect to the first module according to the parameters p and fc. Here, p indicates the proportion of proteins in the first module to which the treatment effect was added in the case group, and fc indicates the fold change. Therefore, in the first module in the case group, the spectral counts of 30 × p proteins were multiplied by fc. For instance, the treatment effect was added to 9 (= 30 × 0.3) proteins in the module when p = 0.3. We generated 20 data sets for each scenario. For each data set, we obtained the p-value and used the AUC (area under the receiver operating characteristic curve) to assess the efficacy of the model.

We plotted the receiver operating characteristic (ROC) curves for all the scenarios, as shown in Figure 2. The horizontal axis of the ROC curve indicates the false positive rate (FPR), and the vertical axis indicates the true positive rate (TPR). We calculated the AUC value for each scenario, as shown in Table 1. The AUC values were greater than 0.86 when fc = 1.5 and p reached 0.5. When fc = 2, the AUC values in all three scenarios were greater than 0.93, which demonstrates that our model can identify differentially expressed modules effectively.

In addition to the data sets generated from the theoretical distribution, we constructed synthetic data sets using the yeast data set for further assessment.
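The first simulation design described above can be sketched as follows. This is an illustrative Python reconstruction, not the authors' R code: the function names are hypothetical, the negative binomial draw is implemented as a gamma-Poisson mixture, the random assignment of values to modules is replaced by a fixed assignment (equivalent here because all values are independent draws), and rounding the scaled counts back to integers is an assumption that the text does not state.

```python
import math
import random

def rnbinom(mu, size, rng):
    """Draw one negative binomial count with mean `mu` and dispersion `size`."""
    lam = rng.gammavariate(size, mu / size)  # gamma: shape = size, scale = mu / size
    # Poisson(lam) draw by Knuth's multiplication method
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_dataset(p, fc, n_modules=50, module_size=30, n_samples=20,
                     mu=7.0, size=0.5, seed=0):
    """Return (case, control, modules); case[i][j] is protein i in sample j."""
    rng = random.Random(seed)
    n_proteins = n_modules * module_size
    modules = [i // module_size for i in range(n_proteins)]  # proteins 0..29 form module 0
    control = [[rnbinom(mu, size, rng) for _ in range(n_samples)]
               for _ in range(n_proteins)]
    case = [[rnbinom(mu, size, rng) for _ in range(n_samples)]
            for _ in range(n_proteins)]
    n_affected = round(module_size * p)  # e.g. 9 proteins when p = 0.3
    for i in range(n_affected):
        # Treatment effect: multiply case-group counts by fc (rounding is an assumption)
        case[i] = [round(y * fc) for y in case[i]]
    return case, control, modules
```

Calling `simulate_dataset(0.3, 1.5)` corresponds to scenario 1 of Table 1; repeating the call with 20 different seeds would mirror the 20 data sets generated per scenario.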
No difference in protein expression level was expected between the two conditions, since the 14N- and 15N-labeled proteins were mixed at a 1:1 ratio. To ensure there was no difference between the two groups, we shuffled the spectral counts across the rows five times, following the method introduced by Choi et al.14 Then we randomly assigned the 1307 proteins to 45 modules: the first 44 modules with 29 proteins each and the last module with 31 proteins. We applied the fold change (fc) to the first module in the case group according to the parameter p. When applying the fold change, if the original spectral count was 0, we randomly generated a value for the protein from the negative binomial distribution with the mean set to the fold change and the size set to 0.35, obtained from distribution fitting of the data, so as to prevent the situation of
Figure 1. Distribution fitting of the spectral count data in the colorectal cancer data set. The distribution with red filling shows the actual distribution of the spectral counts in the normal group in the colorectal cancer data set. The distributions with green filling and blue filling are the fitted negative binomial distribution and Poisson distribution, respectively. One-hundred eight counts that are greater than 40 are not shown in the figure in order to get a better display of the distribution.
Poisson distribution, the negative binomial distribution fits the real data better, probably because the actual spectral count data do not satisfy the Poisson assumption that the variance is equal to the mean. In contrast, the negative binomial distribution has two parameters, which model the dispersion as well as the mean of the data, leading to a better fit for overdispersed data. Therefore, we chose the negative binomial model in our study.

Simulation Study

Using synthetic data sets, we performed two simulation studies to assess the sensitivity and specificity of the module-based negative binomial generalized linear model. In the first simulation study, we generated count data from the negative binomial distribution. In the second simulation study, we generated synthetic data sets from the yeast data set.
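Both simulation studies rely on the negative binomial distribution to capture overdispersion. As a rough illustration of what that means numerically, the sketch below uses the standard mean-dispersion parameterization, in which the variance is mu + mu²/size. The function names are hypothetical, and the method-of-moments estimate shown here is a simplification swapped in for illustration; the study itself used maximum-likelihood fitting.

```python
def nb_variance(mu, size):
    """Variance of a negative binomial with mean `mu` and dispersion `size`."""
    return mu + mu * mu / size

def moment_size(mean, variance):
    """Method-of-moments estimate of `size`; None if the data are not overdispersed."""
    if variance <= mean:
        return None  # variance <= mean is compatible with a Poisson model
    return mean * mean / (variance - mean)
```

With the settings used in the first simulation study (mu = 7, size = 0.5), `nb_variance(7, 0.5)` gives 105, whereas a Poisson model with the same mean would force the variance to be 7.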
Figure 2. ROC curves for the module-based model using the data set generated from the theoretical distribution. fc = fold change; p = proportion of proteins with the treatment effect added to the testing module.
Real Data Application
multiplying the fold change with zero. Nine scenarios were included in the simulation study, as shown in Table 2. We generated 20 data sets for each scenario.
To identify colorectal cancer protein biomarkers, de Wit et al. analyzed human colorectal tissue and patient-matched normal colon tissue samples by shotgun proteomics. We applied our model to reanalyze this data set.22 We carried out two data analyses, one based on the KEGG pathway and the other one based on the PPI modules.
Table 2. AUC Values for the Model Using the Synthetic Data Set Generated from the Real Data

scenario   p^a   fc^b   AUC
1          0.3   2      0.5639
2          0.5   2      0.6360
3          0.8   2      0.8127
4          0.3   3      0.7734
5          0.5   3      0.8254
6          0.8   3      0.9841
7          0.3   4      0.8095
8          0.5   4      0.9129
9          0.8   4      0.9997
Pathway-Based Model Analysis
We assigned the proteins expressed in the colorectal cancer data set to KEGG pathways by mapping the proteins to their coding genes. We required at least 3 mapped proteins in each pathway, which yielded 218 pathways. The 1316 of the 2653 proteins that were not in these pathways were collected into a new module named “the other pathway”. Therefore, 219 pathways were considered for further analysis. We obtained the p-value for each pathway and calculated false discovery rate (FDR) adjusted p-values using the Benjamini-Hochberg adjustment for multiple testing correction.28 Using our model, we identified 10 significantly differentially expressed pathways comprising 221 proteins, with adjusted p-values less than 0.05 (see Table 3). Some of the 10 significant pathways have been shown to be associated with colorectal cancer. For example, the presence of DNA replication errors (RERs) indicates DNA microsatellite instability (MIN), which characterizes 15% of colorectal cancers.29 Errors in pre-mRNA splicing also cause cancers, including colorectal
a p = proportion of proteins with the treatment effect added to the first module in the case group. b fc = fold change.
We also plotted the ROC curves and calculated the AUC value for each scenario, as shown in Figure 3 and Table 2. The AUC value reached 0.81 when fc = 2 and p = 0.8. When fc = 3 and p = 0.8, the AUC value was 0.98. These results suggest that the model can identify modules of proteins with small but coordinated changes.
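The AUC values reported in Tables 1 and 2 summarize how well p-values separate the module carrying the treatment effect from the others. A minimal sketch of a rank-based (Mann-Whitney) AUC computed directly from p-values is given below. The function name is hypothetical, and which p-values were treated as positives and negatives in the study is an assumption here: the module with the added effect is labeled true and all other modules false.

```python
def auc_from_pvalues(pvalues, is_true_module):
    """AUC = probability that a true module has a smaller p-value than a null one."""
    pos = [p for p, t in zip(pvalues, is_true_module) if t]
    neg = [p for p, t in zip(pvalues, is_true_module) if not t]
    if not pos or not neg:
        raise ValueError("need at least one true and one null module")
    wins = 0.0
    for p_true in pos:
        for p_null in neg:
            if p_true < p_null:
                wins += 1.0      # true module ranked ahead of the null module
            elif p_true == p_null:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))
```

This pairwise count equals the area under the ROC curve obtained by sweeping a threshold over the p-values, so no explicit FPR/TPR curve needs to be built to obtain the AUC.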
Figure 3. ROC curves for the module-based model using the synthetic data set generated from real data. fc = fold change; p = proportion of proteins with the treatment effect added to the testing module.
significant pathways that were not identified by the single-protein model, such as the DNA replication pathway, the mRNA surveillance pathway, and the autoimmune thyroid disease pathway. Recent studies showed that autoimmunity is associated with the development of malignancy.32 Our results indicate that pathway-based analysis can help to capture information missed by single-protein analysis and will contribute to our understanding of the pathogenesis of complex diseases.
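The FDR-adjusted p-values used in the pathway analysis follow the Benjamini-Hochberg procedure. A minimal Python sketch of that adjustment is shown below; the function name `bh_adjust` is hypothetical and the code is an illustration, not the authors' R implementation.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity with a running minimum
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

As a consistency check against Table 3, the smallest nominal p-value, 9.49 × 10−06, multiplied by the 219 pathways tested gives 0.00208, which matches the adjusted value reported for the DNA replication pathway.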
Table 3. Statistically Significant Pathways in the Colorectal Cancer Data Set

pathway                               original size^a   # of CRC proteins   nominal p-value   FDR adjusted p-value
ko03030 DNA replication               36                15                  9.49 × 10−06      0.00208
ko03040 spliceosome                   128               89                  3.76 × 10−04      0.0333
ko04111 cell cycle yeast              68                24                  0.00106           0.0333
ko05320 autoimmune thyroid disease    54                4                   0.00107           0.0333
ko05330 allograft rejection           39                4                   0.00107           0.0333
ko05332 graft-versus-host disease     36                4                   0.00107           0.0333
ko03015 mRNA surveillance pathway     83                33                  0.00115           0.0333
ko00511 other glycan degradation      17                8                   0.00122           0.0333
ko04970 salivary secretion            86                10                  0.00142           0.0344
ko03013 RNA transport                 151               68                  0.00225           0.0493
PPI Module-Based Model Analysis
Using the STRING network, we mapped the proteins in the colorectal data set and decomposed the network into 2002 PPI modules. To control the redundancy between modules, we merged modules when the number of their common members was over 80% of the total number of proteins in both modules, resulting in 1837 PPI modules. The proteins that were not in these modules were collected into “the other module”. As in the KEGG pathway-based analysis, we obtained 49 significantly differentially expressed PPI modules from the colorectal cancer data set, with FDR adjusted p-values less than 0.05. In Table 4, we list the top 10 significant PPI modules. We carried out GO enrichment analysis using WebGestalt and obtained the main functions of these modules. The information on all 49 modules can be found in SI Table S1. We found that the top 10 significantly differentially expressed modules were mainly related to mRNA metabolic processes and transcription. For instance, mRNA splicing has been shown to be a cancer-causing process.30

We chose Module 240, which ranked fourth among the 49 significant modules, as an example and drew a graph of its network structure using Cytoscape,33 as shown in Figure 4. For 55 of the 56 proteins in this module, the average spectral count in the case group was higher than that in the normal group. Eight proteins were detected only in the cancer group. Thirty-nine proteins had fold changes of more than two. The main function of this module is chromatin organization, which has been demonstrated to be a key factor in the progression of cancer because it modulates gene transcription.34 Some members of the module have already been reported to be related to tumorigenesis. For instance, the minichromosome maintenance (MCM) gene family has a vital role in DNA replication and is involved in many types of human cancers, including prostate cancer.35
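The module construction and redundancy merging described at the start of this subsection can be sketched in Python as follows. The function names are hypothetical, and two details are assumptions rather than facts from the text: every protein in the network is allowed to be the center of a module, and the 80% criterion is interpreted as the shared members exceeding 80% of the union of the two modules. The merge is a simple single greedy pass.

```python
def build_modules(edges, detected, min_size=3):
    """One candidate module per protein: the protein plus its direct interactors,
    restricted to detected proteins and kept only if at least `min_size` remain."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    modules = {}
    for protein, partners in neighbors.items():
        members = (partners | {protein}) & detected
        if len(members) >= min_size:
            modules[protein] = members
    return modules

def merge_redundant(modules, threshold=0.8):
    """Greedily merge modules whose overlap exceeds `threshold` of their union."""
    merged = []
    for members in modules.values():
        for existing in merged:
            if len(members & existing) > threshold * len(members | existing):
                existing |= members  # absorb the redundant module
                break
        else:
            merged.append(set(members))
    return merged
```

For a toy network with edges A-B, A-C, B-C, and C-D in which only A, B, and C are detected, the modules centered on A, B, and C all reduce to {A, B, C} and are merged into a single module, while the module centered on D is dropped for having fewer than 3 detected members.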
a Original size refers to the original number of proteins in the pathway. # of CRC proteins refers to the number of proteins in the colorectal cancer data set which were also in the pathway.
cancer.30 Thus, the spliceosome pathway might also be related to the pathogenesis of colorectal cancer. To compare the performance of the module-based model with that of the single-protein model, we also conducted an analysis at the single-protein level using both the negative binomial model and Fisher’s exact test.9 We identified 149 and 460 significantly differentially expressed proteins using the negative binomial model and Fisher’s exact test, respectively. Enrichment analyses were subsequently performed on these two protein lists using WebGestalt.31 One significantly differentially expressed pathway with FDR less than 0.05 was found, the spliceosome pathway, which was also identified by the pathway-based model. From the differentially expressed protein list detected by Fisher’s exact test, we identified another significant pathway by enrichment analysis, the fatty acid metabolism pathway. These results show some overlap between the pathway-based analysis and the single-protein analysis, which supports the efficacy of our model. Meanwhile, using our model, we have also found more
Table 4. Top 10 Significant PPI Modules in the Colorectal Cancer Data Set

PPI module   size^a   nominal p-value   FDR adjusted p-value   function
Mod1735      178      1.08 × 10−07      1.99 × 10−04           mRNA splicing, via spliceosome
Mod1013      151      6.29 × 10−07      5.78 × 10−04           mRNA splicing, via spliceosome
Mod1817      124      1.92 × 10−06      0.00118                mRNA splicing, via spliceosome
Mod240       56       4.52 × 10−06      0.00192                Transcription elongation from RNA polymerase II promoter; chromatin organization
Mod1795      111      6.07 × 10−06      0.00192                mRNA splicing, via spliceosome
Mod284       90       7.37 × 10−06      0.00192                Nuclear-transcribed mRNA catabolic process, nonsense-mediated decay
Mod795       62       8.89 × 10−06      0.00192                Transcription elongation from RNA polymerase II promoter; chromatin organization; positive regulation of viral transcription
Mod503       104      8.94 × 10−06      0.00192                mRNA splicing, via spliceosome
Mod1747      106      9.38 × 10−06      0.00192                mRNA splicing, via spliceosome
Mod875       107      1.12 × 10−05      0.00197                mRNA splicing, via spliceosome

a Size refers to the number of proteins in the PPI module.
Figure 4. Subnetwork of PPI module 240. Each node represents a protein. The color and size of the node represent the changes of the protein expression. The darker the color is, the larger the fold change. The larger the node is, the larger the fold change.
Further Validation

In order to confirm the effectiveness of the model, we also performed the pathway-based model analysis on a nonsmall cell lung cancer data set. Two hundred twenty-four pathways were included after we required at least 3 proteins mapped to each pathway. Twenty-one significantly differentially expressed pathways were identified after multiple testing correction using the Benjamini-Hochberg adjustment (adjusted p-value < 0.01; see SI Table S2), of which only four pathways were detected by the single-protein negative binomial model and one pathway was identified by Fisher’s exact test. Most of the 21 significantly differentially expressed pathways have been reported to be involved in cancer development, such as RNA transport, complement and coagulation cascades, aminoacyl-tRNA biosynthesis, mismatch repair, and gluconeogenesis.36−40 The most significantly differentially expressed pathway, malaria, may stimulate host immune responses, which are considered crucial in fighting lung cancer.41 The pathways pyrimidine metabolism, endoplasmic reticulum, and ECM-receptor interaction play important roles in the prognosis, classification, and therapy of nonsmall cell lung cancer.42−47

In summary, we have proposed a module-based negative binomial generalized linear model for the differential expression analysis of shotgun proteomics data. Unlike the single-protein model, the module-based model takes into account the interactions between genes or proteins, which enables us to identify differentially expressed modules with slight but coordinated changes. The results from the real data applications demonstrated that more significantly differentially expressed modules, including disease-specific ones, can be identified using the module-based model. This result is in accordance with reports on microarray data.21 However, the risk of false positives should also be considered when a module-based statistical model is used. Methods for multiple testing correction can help to control the overall false positive rate. In addition to providing insights into the development of diseases, the identified coordinately differentially expressed modules or pathways can also be considered as biomarker candidates for diagnosis, disease classification, and prognosis, or as potential targets for medical therapy.

■ ASSOCIATED CONTENT

Supporting Information

Table S1, Statistically significant PPI modules in colorectal cancer data set. Table S2, Statistically significant pathways in nonsmall cell lung cancer data set. File S3, The R codes of the biological network module-based statistical model. This material is available free of charge via the Internet at http://pubs.acs.org.
■ AUTHOR INFORMATION

Corresponding Author
*Tel.: +86 21 34204348; e-mail: [email protected].

Notes
The authors declare no competing financial interest.
■ ACKNOWLEDGMENTS

The work was supported by grants from the National Natural Science Foundation of China (31271416, 31000582), the National Key Basic Research Program (2011CB910204), and the National High-Tech R&D Program (863) (2012AA020201, 2012AA101601). Additional support was provided by the Pujiang Talent Program (12PJ1406600) and the Program for “Chen Xing” Young Scholars, Shanghai Jiao Tong University.
■
REFERENCES
(1) Aebersold, R.; Mann, M. Mass spectrometry-based proteomics. Nature 2003, 422 (6928), 198−207. (2) Gygi, S. P.; Rist, B.; Gerber, S. A.; Turecek, F.; Gelb, M. H.; Aebersold, R. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat. Biotechnol. 1999, 17 (10), 994−999. (3) Ong, S.-E.; Blagoev, B.; Kratchmarova, I.; Kristensen, D. B.; Steen, H.; Pandey, A.; Mann, M. Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol. Cell. Proteomics 2002, 1 (5), 376−386. 5748
dx.doi.org/10.1021/pr5007203 | J. Proteome Res. 2014, 13, 5743−5750
Journal of Proteome Research
Article
(4) Ross, P. L.; Huang, Y. N.; Marchese, J. N.; Williamson, B.; Parker, K.; Hattan, S.; Khainovski, N.; Pillai, S.; Dey, S.; Daniels, S. Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol. Cell. Proteomics 2004, 3 (12), 1154−1169. (5) Bantscheff, M.; Schirle, M.; Sweetman, G.; Rick, J.; Kuster, B. Quantitative mass spectrometry in proteomics: A critical review. Anal. Bioanal. Chem. 2007, 389 (4), 1017−1031. (6) Old, W. M.; Meyer-Arendt, K.; Aveline-Wolf, L.; Pierce, K. G.; Mendoza, A.; Sevinsky, J. R.; Resing, K. A.; Ahn, N. G. Comparison of label-free methods for quantifying human proteins by shotgun proteomics. Mol. Cell. Proteomics 2005, 4 (10), 1487−1502. (7) Zhang, B.; VerBerkmoes, N. C.; Langston, M. A.; Uberbacher, E.; Hettich, R. L.; Samatova, N. F. Detecting differential and correlated protein expression in label-free shotgun proteomics. J. Proteome Res. 2006, 5 (11), 2909−2918. (8) Student, On the error of counting with a haemacytometer. Biometrika 1907, 351−360. (9) Fisher, R. A. On the interpretation of χ2 from contingency tables, and the calculation of P. J. R. Stat. Soc. 1922, 85 (1), 87−94. (10) Rohlf, F. J. Biometry: the Principles and Practice of Statistics in Biological Research; Freeman: New York, 1981. (11) Audic, S.; Claverie, J.-M. The significance of digital gene expression profiles. Genome Res. 1997, 7 (10), 986−995. (12) Jain, N.; Thatte, J.; Braciale, T.; Ley, K.; O’Connell, M.; Lee, J. K. Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays. Bioinformatics 2003, 19 (15), 1945−1951. (13) Pavelka, N.; Fournier, M. L.; Swanson, S. K.; Pelizzola, M.; Ricciardi-Castagnoli, P.; Florens, L.; Washburn, M. P. Statistical similarities between transcriptomics and quantitative shotgun proteomics data. Mol. Cell. Proteomics 2008, 7 (4), 631−644. (14) Choi, H.; Fermin, D.; Nesvizhskii, A. I. 
Significance analysis of spectral count data in label-free shotgun proteomics. Mol. Cell. Proteomics 2008, 7 (12), 2373−2385. (15) Li, M.; Gray, W.; Zhang, H.; Chung, C. H.; Billheimer, D.; Yarbrough, W. G.; Liebler, D. C.; Shyr, Y.; Slebos, R. J. Comparative shotgun proteomics using spectral count data and quasi-likelihood modeling. J. Proteome Res. 2010, 9 (8), 4295−4305. (16) Leitch, M. C.; Mitra, I.; Sadygov, R. G. Generalized linear and mixed models for label-free shotgun proteomics. Stat. Interface 2012, 5 (1), 89. (17) Subramanian, A.; Tamayo, P.; Mootha, V. K.; Mukherjee, S.; Ebert, B. L.; Gillette, M. A.; Paulovich, A.; Pomeroy, S. L.; Golub, T. R.; Lander, E. S. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 2005, 102 (43), 15545−15550. (18) Kim, S.-Y.; Volsky, D. J. PAGE: parametric analysis of gene set enrichment. BMC Bioinform. 2005, 6 (1), 144. (19) Ashburner, M.; Ball, C. A.; Blake, J. A.; Botstein, D.; Butler, H.; Cherry, J. M.; Davis, A. P.; Dolinski, K.; Dwight, S. S.; Eppig, J. T. Gene Ontology: Tool for the unification of biology. Nat. Genet. 2000, 25 (1), 25−29. (20) Kanehisa, M.; Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000, 28 (1), 27−30. (21) Wang, L.; Zhang, B.; Wolfinger, R. D.; Chen, X. An integrated approach for the analysis of biological pathways using mixed models. PLoS Genet. 2008, 4 (7), e1000115. (22) de Wit, M.; Kant, H.; Piersma, S. R.; Pham, T. V.; Mongera, S.; van Berkel, M. P. A.; Boven, E.; Pontén, F.; Meijer, G. A.; Jimenez, C. R.; Fijneman, R. J. A. Colorectal cancer candidate biomarkers identified by tissue secretome proteome profiling. J. Proteomics 2014, 99 (0), 26−39. (23) Kikuchi, T.; Hassanein, M.; Amann, J. M.; Liu, Q.; Slebos, R. J.; Rahman, S. J.; Kaufman, J. M.; Zhang, X.; Hoeksema, M. D.; Harris, B. K. 
In-depth proteomic analysis of nonsmall cell lung cancer to discover molecular targets and candidate biomarkers. Mol. Cell. Proteomics 2012, 11 (10), 916−932.
(24) Kasprzyk, A. BioMart: Driving a paradigm change in biological data management. Database 2011, 2011, bar049. (25) Franceschini, A.; Szklarczyk, D.; Frankild, S.; Kuhn, M.; Simonovic, M.; Roth, A.; Lin, J.; Minguez, P.; Bork, P.; von Mering, C. STRING v9.1: Protein−protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 2013, 41 (D1), D808−D815. (26) R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2013; ISBN 3-900051-07-0. (27) Cameron, A. C.; Trivedi, P. K. Regression Analysis of Count Data; Cambridge University Press: Cambridge, U.K., 2013. (28) Benjamini, Y.; Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 1995, 289−300. (29) Jass, J.; Do, K.; Simms, L.; Iino, H.; Wynter, C.; Pillay, S.; Searle, J.; Radford-Smith, G.; Young, J.; Leggett, B. Morphology of sporadic colorectal cancer with DNA replication errors. Gut 1998, 42 (5), 673−679. (30) Venables, J. P. Aberrant and alternative splicing in cancer. Cancer Res. 2004, 64 (21), 7647−7654. (31) Zhang, B.; Kirov, S.; Snoddy, J. WebGestalt: An integrated system for exploring gene sets in various biological contexts. Nucleic Acids Res. 2005, 33 (suppl 2), W741−W748. (32) Franks, A. L.; Slansky, J. E. Multiple associations between a broad spectrum of autoimmune diseases, chronic inflammatory diseases and cancer. Anticancer Res. 2012, 32 (4), 1119−1136. (33) Shannon, P.; Markiel, A.; Ozier, O.; Baliga, N. S.; Wang, J. T.; Ramage, D.; Amin, N.; Schwikowski, B.; Ideker, T. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 2003, 13 (11), 2498−2504. (34) Jones, P. A.; Baylin, S. B. The fundamental role of epigenetic events in cancer. Nat. Rev. Genet. 2002, 3 (6), 415−428. (35) Majid, S.; Dar, A. A.; Saini, S.; Chen, Y.; Shahryari, V.; Liu, J.; Zaman, M.
S.; Hirata, H.; Yamamura, S.; Ueno, K. Regulation of minichromosome maintenance gene family by microRNA-1296 and genistein in prostate cancer. Cancer Res. 2010, 70 (7), 2809−2818. (36) Skog, J.; Würdinger, T.; van Rijn, S.; Meijer, D. H.; Gainche, L.; Curry, W. T.; Carter, B. S.; Krichevsky, A. M.; Breakefield, X. O. Glioblastoma microvesicles transport RNA and proteins that promote tumour growth and provide diagnostic biomarkers. Nat. Cell Biol. 2008, 10 (12), 1470−1476. (37) Park, S. G.; Schimmel, P.; Kim, S. Aminoacyl tRNA synthetases and their connections to disease. Proc. Natl. Acad. Sci. U. S. A. 2008, 105 (32), 11043−11049. (38) Li, M.; Zhang, Q.; Liu, L.; Lu, W.; Wei, H.; Li, R. W.; Lu, S. Expression of the mismatch repair gene hMLH1 is enhanced in nonsmall cell lung cancer with EGFR mutations. PloS One 2013, 8 (10), e78500. (39) Leithner, K.; Hrzenjak, A.; Trötzmüller, M.; Moustafa, T.; Köfeler, H.; Wohlkoenig, C.; Stacher, E.; Lindenmann, J.; Harris, A.; Olschewski, A. PCK2 activation mediates an adaptive response to glucose depletion in lung cancer. Oncogene 2014, DOI: 10.1038/onc.2014.47. (40) Pastor, M.; Nogal, A.; Molina-Pinelo, S.; Meléndez, R.; Salinas, A.; González De la Peña, M.; Martín-Juan, J.; Corral, J.; García-Carbonero, R.; Carnero, A. Identification of proteomic signatures associated with lung cancer and COPD. J. Proteomics 2013, 89, 227−237. (41) Chen, L.; He, Z.; Qin, L.; Li, Q.; Shi, X.; Zhao, S.; Chen, L.; Zhong, N.; Chen, X. Antitumor effect of malaria parasite infection in a murine Lewis lung cancer model through induction of innate and adaptive immunity. PloS One 2011, 6 (9), e24407. (42) Hsin, I.-L.; Hsiao, Y.-C.; Wu, M.-F.; Jan, M.-S.; Tang, S.-C.; Lin, Y.-W.; Hsu, C.-P.; Ko, J.-L. Lipocalin 2, a new GADD153 target gene, as an apoptosis inducer of endoplasmic reticulum stress in lung cancer cells. Toxicol. Appl. Pharmacol. 2012, 263 (3), 330−337. (43) Navab, R.; Strumpf, D.; Bandarchi, B.; Zhu, C.-Q.; Pintilie, M.; Ramnarine, V.
R.; Ibrahimov, E.; Radulovich, N.; Leung, L.; Barczyk, M. Prognostic gene-expression signature of carcinoma-associated fibroblasts in non-small cell lung cancer. Proc. Natl. Acad. Sci. U. S. A. 2011, 108 (17), 7160−7165. (44) Li, B.-Q.; You, J.; Huang, T.; Cai, Y.-D. Classification of Non-Small Cell Lung Cancer Based on Copy Number Alterations. PloS One 2014, 9 (2), e88300. (45) Chen, Z.; Fillmore, C. M.; Hammerman, P. S.; Kim, C. F.; Wong, K.-K. Non-small-cell lung cancers: A heterogeneous set of diseases. Nat. Rev. Cancer 2014, 14 (8), 535−546. (46) Zhang, W. C.; Shyh-Chang, N.; Yang, H.; Rai, A.; Umashankar, S.; Ma, S.; Soh, B. S.; Sun, L. L.; Tai, B. C.; Nga, M. E. Glycine decarboxylase activity drives non-small cell lung cancer tumor-initiating cells and tumorigenesis. Cell 2012, 148 (1), 259−272. (47) Maring, J. G.; Groen, H. J.; Wachters, F. M.; Uges, D. R.; de Vries, E. G. Genetic factors influencing pyrimidine-antagonist chemotherapy. Pharmacogenomics J. 2005, 5 (4), 226−243.
dx.doi.org/10.1021/pr5007203 | J. Proteome Res. 2014, 13, 5743−5750