Mining the Plasma Proteome for Disease Applications Across Seven Logs of Protein Abundance

Q. Zhang, V. Faca, and S. Hanash

Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, Washington 98109, United States

Received October 19, 2010

The current state of proteomics technologies has advanced sufficiently to allow in-depth quantitative analysis of the plasma proteome and development of a related knowledge base. Here we review approaches that have been applied to increase the depth of analysis by mass spectrometry, given the substantial complexity of plasma and the vast dynamic range of protein abundance. Fractionation strategies that reduce the complexity of individual fractions, followed by mass spectrometry analysis of digests from those fractions, have allowed well in excess of 1000 proteins, spanning more than seven logs of protein abundance, to be identified and quantified with high confidence. Such depth of analysis has contributed to the elucidation of plasma proteome variation in health and of protein changes associated with disease states.

Keywords: plasma proteome • protein abundance • dynamic range • fractionation • mass spectrometry • biomarker

* To whom correspondence should be addressed. E-mail: shanash@fhcrc.org.

Journal of Proteome Research 2011, 10, 46–50. Published on Web 11/09/2010. DOI: 10.1021/pr101052y. © 2011 American Chemical Society.

Introduction

Interest in profiling serum, and the plasma from which it is derived, using protein separation and characterization technologies is long-standing because this circulating fluid is easily accessible and rich in proteins that inform on the health status of a subject. The initial methodology consisted of one-dimensional protein separations, followed by two-dimensional polyacrylamide gel electrophoresis.1,2 The advent of mass spectrometry in the postgenome era has redefined proteomics and has resulted in calls for large-scale efforts to comprehensively map the human proteome. A major concern regarding the plasma proteome is its vast complexity and dynamic range of protein abundance, which challenge the capabilities of current mass spectrometers. The Human Proteome Organization (HUPO) launched a plasma pilot study in 2003 in which plasma aliquots were distributed to many participating laboratories for mass spectrometry-based analysis, and the resulting data were combined and integrated. Accounting for multiple hypothesis testing and other factors yielded a stringent subset of 889 proteins identified with at least 95% confidence.3 In the same period, a human plasma PeptideAtlas4 with 960 distinct proteins was constructed from 28 data sets, 20 of which were part of the HUPO study. As a result of global community efforts, the most current human plasma PeptideAtlas build (May 2010) contained 16 987 distinct peptides. Many plasma proteins occur in multiple forms, as products of cleavage and posttranslational modification (PTM),5,6 which may be associated with disease. For example, fragments of calreticulin and protein disulfide isomerase A3 have been found to be significantly elevated in the serum of patients with hepatocellular carcinoma compared to healthy controls.7

Quantitative Analysis of the Plasma Proteome. Concerns have been expressed regarding the ability to reliably compare protein compositions in different samples by MS for the purpose of identifying biomarkers. Aside from the quality of samples and the need to avoid biases among the samples to be compared, MS is prone to undersampling: a peptide may be sampled in one run and not in another, even for aliquots of the same specimen. This concern has been addressed in numerous reviews, and solutions have been offered for the comparative analysis of specimens.8 Approaches for comparative analysis that do not include protein tagging are referred to as “label-free.” A commonly used label-free approach is based on the frequency of tandem mass spectra for peptides in the samples being compared.9 Several protein tagging approaches for comparative analysis are also available. Because of its many advantages, cysteine alkylation has been used extensively to tag proteins with stable isotopes for mass spectrometry-based quantitative analysis.10-12 The particular approach we have followed relies on acrylamide as a cysteine tagging reagent for quantitative studies. Its low mass (71 Da) and hydrophilicity do not introduce significant mass shifts or charge changes in the protein or peptide.13 The reaction is performed using standard protein solubilization solutions with close to 100% yield, and the reagents are relatively inexpensive. Once proteins in the samples to be compared have been differentially labeled with isotopes, the samples are mixed, which avoids the problem of a peptide being sampled in one specimen and not the other. This procedure is equally applicable to tissue lysates and plasma; the methodology has been detailed14,15 and has led to studies in which novel biomarkers were identified and independently validated.15,16

Depth of Analysis of the Plasma Proteome using Mass Spectrometry. The issue of whether mass spectrometry allows analysis of plasma with sufficient depth to identify potential biomarkers has been investigated by several groups. Whether proteins derived from diverse tissues are detectable in plasma by MS was assessed by enriching N-linked glycopeptides from tissues, cells, and plasma and identifying the corresponding peptide sequences and proteins by MS.17 A significant overlap was observed between glycoproteins identified in tissues and cells and glycoproteins identified in plasma, leading to the conclusion that extracellular glycoproteins originating from tissues and cells are released into the blood at levels that are detectable by MS. As an alternative to enrichment procedures that target a particular subset of the proteome,18 we have investigated strategies for the comprehensive analysis of the serum/plasma proteome without targeting any particular subset. The workflow for this in-depth qualitative and quantitative profiling of the plasma proteome is illustrated in Figure 1. We first engaged in studies to determine the contribution of protein fractionation, followed by tryptic digestion and LC-MS/MS analysis of individual fractions.19 Concerns regarding fractionation included reduced throughput and the potential for diluting individual proteins or losing them outright. Serum was depleted of abundant proteins and fractionated by an orthogonal two-dimensional system consisting of anion-exchange and reversed-phase chromatography. Some of the resulting protein fractions were divided into aliquots, one of which was analyzed by shotgun LC-MS/MS while another was further resolved into protein bands using SDS-PAGE.

Figure 1. Flow scheme of in-depth qualitative and quantitative profiling of the plasma proteome.

Figure 2. Extensive fractionation and protein identification of a reference plasma pool. Differentially labeled and mixed plasmas were fractionated into a total of 96 fractions that were further analyzed by LC-MS/MS. The number of proteins identified from each fraction is indicated; darker shades of red indicate higher values. On average, about 107 proteins were identified per fraction, which resulted in a total of 1607 nonredundant protein identifications for this entire experiment. AX, anion exchange; RP, reverse phase.
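The nonredundant total reported for a fractionated sample (Figure 2) is, at its core, a set union: a protein seen in several fractions counts only once. The following is a minimal sketch of that bookkeeping and not the authors' pipeline; the function `summarize_fractions`, the fraction labels, and the protein names are hypothetical.

```python
from statistics import mean


def summarize_fractions(fraction_ids):
    """Summarize protein identifications across fractions.

    fraction_ids maps a fraction label (for example an anion-exchange /
    reversed-phase coordinate) to the set of proteins identified in it.
    """
    per_fraction = {label: len(ids) for label, ids in fraction_ids.items()}
    # A protein found in several fractions is counted once in the union.
    nonredundant = set().union(*fraction_ids.values())
    return {
        "per_fraction": per_fraction,
        "mean_per_fraction": mean(per_fraction.values()),
        "nonredundant_total": len(nonredundant),
    }


# Hypothetical toy data: three fractions with overlapping identifications.
toy = {
    "AX1-RP1": {"ALB", "APOA1", "TF"},
    "AX1-RP2": {"APOA1", "C3"},
    "AX2-RP1": {"C3", "EGFR", "TF", "HP"},
}
summary = summarize_fractions(toy)
```

Applied to the figures quoted for Figure 2, 96 fractions at about 107 identifications each give roughly 10 000 fraction-level identifications, which collapse to 1607 nonredundant proteins. On average, then, each protein was seen in about six fractions.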
We demonstrated that increased fractionation resulted in increased depth of analysis, based both on the total number of proteins identified in serum and on the representation in individual fractions of specific proteins identified in gel bands following a third-dimension SDS gel analysis. With this approach, more than 1400 plasma proteins can be identified by LC-MS/MS in a single experiment, including isoforms with altered chromatographic mobility. A typical plasma profiling experiment from a study aimed at discovery of breast cancer markers is presented to illustrate depth of analysis. In this study, differentially labeled samples were combined and fractionated into a total of 96 fractions that were further analyzed by LC-MS/MS. The number of proteins identified from each fraction is shown in Figure 2. On average, about 107 proteins were identified per fraction, which resulted in a total of 1607 nonredundant plasma protein identifications.

To estimate the concentration range of the human plasma proteins identified, we correlated spectral count data (the total number of MS/MS events per protein) with known concentrations of proteins in plasma obtained from a combination of publicly available sources20 (ELISAs from R&D Systems, www.rndsystems.com). We observed a significant correlation between the spectral counts for a given protein and its plasma concentration (R² = 0.838, Figure 3A). From this analysis, we estimated that this proteomic approach allowed identification of plasma proteins across 7 orders of magnitude and detection of proteins in human plasma at concentrations as low as 1 ng/mL. Using the correlation between spectral counts and estimated protein concentration, we observed an inverse relationship between the number of proteins identified and their abundance: almost half (47%) of all identified proteins yielded ≤10 MS/MS events per protein (Figure 3B).

Using the annotation capabilities of Ingenuity Pathway Analysis software (Ingenuity Systems, www.ingenuity.com), we further characterized the subcellular locations of the proteins identified in relation to their spectral counts (Figure 4). An interesting finding is that the higher the protein concentration predicted from the spectral count, the greater the enrichment in secreted proteins. Conversely, the lower the spectral count, the greater the representation in plasma of cytoplasmic and plasma membrane proteins, which may result from cell breakdown or leakage, or from cleavage or shedding from the surface membrane into the extracellular space.

Figure 3. Correlation of spectral counts with known protein concentrations. (A) Plasma protein concentrations from a combination of publicly available sources were used to construct a graph of concentrations for identified proteins, which spans 7 orders of magnitude. Proteins at the two extremes of the range are marked. (B) Using the correlation of spectral counts and estimated protein concentration, an inverse relationship between the total number of proteins identified and their abundance was observed. Almost half (47%) of identified proteins were of low abundance, with ≤10 MS/MS events per protein.

Figure 4. Subcellular location in relation to estimated protein concentration. Using the annotation of Ingenuity Pathway Analysis software, we correlated protein subcellular locations with spectral counts. The higher the spectral count, and hence the estimated protein concentration, the greater the enrichment in secreted proteins. In contrast, the lower the spectral count, the greater the representation of cytoplasmic and plasma membrane proteins.

The tissue specificity of the proteins identified was examined using a published data set from a human tissue mRNA expression study.21 We observed that proteins estimated to be of high abundance (5 mg/mL to 50 µg/mL) in plasma had a much greater representation of liver-associated proteins (Figure 5A). At lower ranges of estimated protein concentration, we clearly observed a greater contribution of other tissues and a reduced contribution of the liver (Figure 5B and C). Of note, a relatively high percentage of the proteins that occurred below 500 ng/mL exhibited a high level of expression of their corresponding genes in peripheral blood cells. Thus, variability among study subjects in the spillage of blood cell proteins into plasma may represent a confounding factor for biomarker discovery.

Figure 5. Tissue expression of proteins identified in plasma. High-abundance proteins identified in plasma were expressed particularly highly in liver tissue (A). As the range of estimated protein concentration decreased, we clearly observed a greater contribution of other tissues and a decreased contribution of the liver (B and C). It is particularly noticeable that a large fraction of the proteins present below 500 ng/mL are highly expressed in peripheral blood cells.

Contribution of Database Search Strategies to Protein Identification. Database searches represent an important, if not critical, step in the MS-based proteomics pipeline that affects peptide and protein identification. This is illustrated by a HUPO study that sought to identify errors leading to irreproducibility, including incompleteness of peptide sampling, in LC-MS-based proteomics.22 A test sample comprising 20 highly purified recombinant human proteins was distributed to 27 laboratories. Although only 7 laboratories identified all proteins correctly, centralized analysis of the raw data revealed that all 20 proteins had been detected by the mass spectrometer in all laboratories. Database matching and curation of protein data were identified as important sources of problems in identification.
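Returning briefly to the spectral-count calibration of Figure 3A, the idea of relating spectral counts to known concentrations can be sketched as an ordinary least-squares fit. This sketch assumes the fit is done on log10-transformed values, which the text does not state; the function names and the calibration points are hypothetical and are not data from the study.

```python
import math


def fit_loglog(spectral_counts, concentrations_ng_ml):
    """Least-squares fit of log10(concentration) against log10(spectral count).

    Returns (slope, intercept, r_squared).
    """
    x = [math.log10(c) for c in spectral_counts]
    y = [math.log10(c) for c in concentrations_ng_ml]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    # For a simple linear fit, R^2 equals the squared Pearson correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


def estimate_concentration(spectral_count, slope, intercept):
    """Predict a concentration (ng/mL) from a spectral count using the fit."""
    return 10 ** (slope * math.log10(spectral_count) + intercept)


# Hypothetical calibration points (spectral count, ng/mL); illustration only.
counts = [2, 10, 100, 1000, 10000]
conc = [5.0, 80.0, 3.0e3, 4.0e5, 2.0e7]
slope, intercept, r2 = fit_loglog(counts, conc)

# Dynamic range covered by the calibration set, in orders of magnitude.
span_logs = math.log10(max(conc) / min(conc))
```

Once fitted, `estimate_concentration` assigns an approximate concentration to any identified protein from its spectral count. This is how proteins can be grouped into abundance ranges such as those used in Figure 5.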
Currently, most MS-based proteomics studies rely on one of the many available search engines for a given MS data set. The performance of these search engines depends on the data and the search conditions, with variable accuracy, sensitivity, and specificity. Consensus approaches that rely on two or more search engines are increasingly being used, resulting in significant improvements in sensitivity and specificity for peptide identification in complex biological mixtures.23-25 The importance of database search parameters is illustrated in Figure 6. One open-source program that matches tandem mass spectra with peptide sequences for protein identification is X!Tandem.26 The user defines a set of search parameters that generally includes the enzyme used for digesting proteins, the m/z range to be collected, the number of missed cleavages, the peptide mass error tolerance, the fragment ion mass error tolerance, the charge state of the parent peptide, and fixed and variable amino acid modifications. For example, allowing semitryptic peptides, in which only one terminus corresponds to the digest motif, will increase the total number of peptides identified (Figure 6B). The basic search (green line) was established via X!Tandem with a parent mass tolerance of 1.5 Da and the K-Score module plugged in.27 By turning on the semitryptic and high mass accuracy parameters, an optimized search can be achieved with approximately 50% more peptide identifications and a lower false discovery rate (red line). Richer databases that include additional peptide sequences representing splicing isoforms and sequence variations also contribute to increased depth of analysis. Databases established from translated human genomic sequence have been used in several proteome studies, which allowed identification of novel proteins and protein isoforms.28-30
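To make the semitryptic option concrete, the sketch below enumerates candidate peptides in silico. It assumes the commonly used trypsin rule (cleave after K or R unless the next residue is P), which the text does not spell out, and it is not X!Tandem's implementation; the function names and the toy sequence are hypothetical.

```python
import re


def tryptic_peptides(protein, missed_cleavages=0, min_len=6):
    """Fully tryptic peptides: cleave after K or R, except before P (assumed rule)."""
    # Each site is an index at which a new peptide can start or end.
    sites = [0] + [m.end() for m in re.finditer(r"[KR](?!P)", protein)]
    if sites[-1] != len(protein):
        sites.append(len(protein))
    peptides = set()
    for i in range(len(sites) - 1):
        for j in range(i + 1, min(i + 2 + missed_cleavages, len(sites))):
            pep = protein[sites[i]:sites[j]]
            if len(pep) >= min_len:
                peptides.add(pep)
    return peptides


def semitryptic_peptides(protein, missed_cleavages=0, min_len=6):
    """Peptides with at least one tryptic terminus (fully tryptic ones included)."""
    peptides = set()
    for pep in tryptic_peptides(protein, missed_cleavages, min_len=1):
        # Trim residues from the N-terminus; the C-terminus stays tryptic.
        for start in range(len(pep)):
            if len(pep) - start >= min_len:
                peptides.add(pep[start:])
        # Trim residues from the C-terminus; the N-terminus stays tryptic.
        for end in range(1, len(pep) + 1):
            if end >= min_len:
                peptides.add(pep[:end])
    return peptides


# Hypothetical toy sequence, not a real protein.
toy_protein = "MASTEGKLVNPDQRHIWYFC"
full = tryptic_peptides(toy_protein)
semi = semitryptic_peptides(toy_protein)
```

The semitryptic candidate list always contains the fully tryptic one and is larger. That enlarged list is why such a search can match spectra from cleaved or truncated protein forms that a fully tryptic search would miss.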

Figure 6. Optimization of database search. (A) A basic database search approach is established and subsequently augmented by turning on semitryptic and high mass accuracy search options and by applying consensus methods that employ at least two search engines. (B) An example based on different combinations of search parameters. The basic search is established via X!Tandem with a parent mass tolerance of 1.5 Da and the K-Score module plugged in (green line). By turning on the semitryptic and high mass accuracy parameters, the optimized search achieves approximately 50% more peptide identifications and a lower false discovery rate (red line).
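The consensus idea in Figure 6A can be illustrated with a deliberately simple agreement vote between two search engines. This is a simplification and an assumption on our part: the published consensus methods cited in the text (for example, the probabilistic combination of ref 25) are more elaborate. The function name, spectrum identifiers, and peptide assignments below are hypothetical.

```python
def consensus_identifications(engine_a, engine_b):
    """Split spectrum-to-peptide assignments by agreement between two engines.

    Each argument maps a spectrum identifier to the peptide sequence that
    one search engine assigned to that spectrum.
    Returns (agreed, conflicting, single).
    """
    agreed, conflicting, single = {}, {}, {}
    for spectrum in engine_a.keys() | engine_b.keys():
        pep_a = engine_a.get(spectrum)
        pep_b = engine_b.get(spectrum)
        if pep_a is not None and pep_a == pep_b:
            agreed[spectrum] = pep_a                  # both engines concur
        elif pep_a is not None and pep_b is not None:
            conflicting[spectrum] = (pep_a, pep_b)    # engines disagree
        else:
            # Only one engine produced an assignment for this spectrum.
            single[spectrum] = pep_a if pep_a is not None else pep_b
    return agreed, conflicting, single


# Hypothetical assignments from two engines for four spectra.
engine_a = {"scan1": "LVNELTEFAK", "scan2": "AEFVEVTK", "scan3": "YLYEIAR"}
engine_b = {"scan1": "LVNELTEFAK", "scan2": "AEFAEVSK", "scan4": "DLGEEHFK"}
agreed, conflicting, single = consensus_identifications(engine_a, engine_b)
```

Keeping only the `agreed` set favors specificity, whereas also accepting well-scored `single` assignments favors sensitivity. How the two are balanced is a design choice that this sketch leaves open.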

Conclusion

Current MS-based proteomic technologies have made it possible to profile the plasma proteome with substantial reliability and depth of analysis for biomarker studies. Massive amounts of complex and heterogeneous proteomics data can be produced, analyzed, and integrated with other high dimensional data.

References

(1) Hanash, S.; Pitteri, S.; Faca, V. Mining the plasma proteome for cancer biomarkers. Nature 2008, 452 (7187), 571–79.
(2) Hanash, S.; Taguchi, A. The grand challenge to decipher the cancer proteome. Nat. Rev. Cancer 2010, 10 (9), 652–60.
(3) States, D. J.; Omenn, G. S.; Blackwell, T. W.; Fermin, D.; Eng, J.; Speicher, D. W.; Hanash, S. M. Challenges in deriving high-confidence protein identifications from data gathered by a HUPO plasma proteome collaborative study. Nat. Biotechnol. 2006, 24 (3), 333–8.
(4) Deutsch, E. W.; Eng, J. K.; Zhang, H.; King, N. L.; Nesvizhskii, A. I.; Lin, B.; Lee, H.; Yi, E. C.; Ossola, R.; Aebersold, R. Human Plasma PeptideAtlas. Proteomics 2005, 5 (13), 3497–500.
(5) Misek, D. E.; Kuick, R.; Wang, H.; Deng, B.; Zhao, R.; Galchev, V.; Tra, J.; Pisano, M. R.; Amunugama, R.; Allen, D.; Strahler, J.; Andrews, P.; Omenn, G. S.; Hanash, S. M. A wide range of protein isoforms in serum and plasma uncovered by a quantitative Intact Protein Analysis System (IPAS). Proteomics 2005, 5, 3343–52.
(6) Nedelkov, D.; Kiernan, U. A.; Niederkofler, E. E.; Tubbs, K. A.; Nelson, R. W. Investigating diversity in human plasma proteins. Proc. Natl. Acad. Sci. U.S.A. 2005, 102 (31), 10852–7.
(7) Chignard, N.; Shang, S.; Wang, H.; Marrero, J.; Brechot, C.; Hanash, S.; Beretta, L. Cleavage of endoplasmic reticulum proteins in hepatocellular carcinoma: Detection of generated fragments in patient sera. Gastroenterology 2006, 130 (7), 2010–22.
(8) Nilsson, T.; Mann, M.; Aebersold, R.; Yates, J. R., 3rd; Bairoch, A.; Bergeron, J. J. Mass spectrometry in high-throughput proteomics: ready for the big time. Nat. Methods 2010, 7 (9), 681–5.
(9) Matthiesen, R.; Carvalho, A. S. Methods and algorithms for relative quantitative proteomics by mass spectrometry. Methods Mol. Biol. 2010, 593, 187–204.
(10) Sechi, S.; Chait, B. T. Modification of cysteine residues by alkylation. A tool in peptide mapping and protein identification. Anal. Chem. 1998, 70 (24), 5150–8.
(11) Gygi, S. P.; Rist, B.; Gerber, S. A.; Turecek, F.; Gelb, M. H.; Aebersold, R. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat. Biotechnol. 1999, 17, 994–9.
(12) Shen, M.; Guo, L.; Wallace, A.; Fitzner, J.; Eisenman, J.; Jacobson, E.; Johnson, R. S. Isolation and isotope labeling of cysteine- and methionine-containing tryptic peptides: application to the study of cell surface proteolysis. Mol. Cell. Proteomics 2003, 2 (5), 315–24.
(13) Sechi, S. A method to identify and simultaneously determine the relative quantities of proteins isolated by gel electrophoresis. Rapid Commun. Mass Spectrom. 2002, 16 (15), 1416–24.
(14) Faca, V.; Coram, M.; Phanstiel, D.; Glukhova, V.; Zhang, Q.; Fitzgibbon, M.; McIntosh, M.; Hanash, S. Quantitative analysis of acrylamide labeled serum proteins by LC-MS/MS. J. Proteome Res. 2006, 5 (8), 2009–18.
(15) Faca, V.; Wang, H.; Hanash, S. Proteomic global profiling for cancer biomarker discovery. Methods Mol. Biol. 2009, 492, 309–20.
(16) Paczesny, S.; Braun, T. M.; Levine, J. E.; Hogan, J.; Crawford, J.; Coffing, B.; Olsen, S.; Choi, S. W.; Wang, H.; Faca, V.; Pitteri, S.; Zhang, Q.; Chin, A.; Kitko, C.; Mineishi, S.; Yanik, G.; Peres, E.; Hanauer, D.; Wang, Y.; Reddy, P.; Hanash, S.; Ferrara, J. L. Elafin is a biomarker of graft-versus-host disease of the skin. Sci. Transl. Med. 2010, 2 (13), 13ra2.
(17) Zhang, H.; Liu, A. Y.; Loriaux, P.; Wollscheid, B.; Zhou, Y.; Watts, J. D.; Aebersold, R. Mass spectrometric detection of tissue proteins in plasma. Mol. Cell. Proteomics 2007, 6 (1), 64–71.
(18) Wildes, D.; Wells, J. A. Sampling the N-terminal proteome of human blood. Proc. Natl. Acad. Sci. U.S.A. 2010, 107 (10), 4561–6.
(19) Faca, V.; Pitteri, S. J.; Newcomb, L.; Glukhova, V.; Phanstiel, D.; Krasnoselsky, A.; Zhang, Q.; Struthers, J.; Wang, H.; Eng, J.; Fitzgibbon, M.; McIntosh, M.; Hanash, S. Contribution of protein fractionation to depth of analysis of the serum and plasma proteomes. J. Proteome Res. 2007, 6 (9), 3558–65.
(20) Haab, B. B.; Zhou, H. Multiplexed protein analysis using spotted antibody microarrays. Methods Mol. Biol. 2004, 264, 33–45.
(21) Su, A. I.; Wiltshire, T.; Batalov, S.; Lapp, H.; Ching, K. A.; Block, D.; Zhang, J.; Soden, R.; Hayakawa, M.; Kreiman, G.; Cooke, M. P.; Walker, J. R.; Hogenesch, J. B. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl. Acad. Sci. U.S.A. 2004, 101 (16), 6062–7.
(22) Bell, A. W.; Deutsch, E. W.; Au, C. E.; Kearney, R. E.; Beavis, R.; Sechi, S.; Nilsson, T.; Bergeron, J. J. A HUPO test sample study reveals common problems in mass spectrometry-based proteomics. Nat. Methods 2009, 6 (6), 423–30.
(23) Yu, W.; Taylor, J. A.; Davis, M. T.; Bonilla, L. E.; Lee, K. A.; Auger, P. L.; Farnsworth, C. C.; Welcher, A. A.; Patterson, S. D. Maximizing the sensitivity and reliability of peptide identification in large-scale proteomic experiments by harnessing multiple search engines. Proteomics 2010, 10 (6), 1172–89.
(24) Dagda, R. K.; Sultana, T.; Lyons-Weiler, J. Evaluation of the Consensus of Four Peptide Identification Algorithms for Tandem Mass Spectrometry Based Proteomics. J. Proteomics Bioinform. 2010, 3, 39–47.
(25) Searle, B. C.; Turner, M.; Nesvizhskii, A. I. Improving sensitivity by probabilistically combining results from multiple MS/MS search methodologies. J. Proteome Res. 2008, 7 (1), 245–53.
(26) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20 (9), 1466–7.
(27) MacLean, B.; Eng, J. K.; Beavis, R. C.; McIntosh, M. General framework for developing and evaluating database scoring algorithms using the TANDEM search engine. Bioinformatics 2006, 22 (22), 2830–2.
(28) Omenn, G. S.; Yocum, A. K.; Menon, R. Alternative splice variants, a new class of protein cancer biomarker candidates: findings in pancreatic cancer and breast cancer with systems biology implications. Dis. Markers 2010, 28 (4), 241–51.
(29) Menon, R.; Zhang, Q.; Zhang, Y.; Fermin, D.; Bardeesy, N.; DePinho, R. A.; Lu, C.; Hanash, S. M.; Omenn, G. S.; States, D. J. Identification of novel alternative splice isoforms of circulating proteins in a mouse model of human pancreatic cancer. Cancer Res. 2009, 69 (1), 300–9.
(30) Menon, R.; Omenn, G. S. Proteomic characterization of novel alternative splice variant proteins in human epidermal growth factor receptor 2/neu-induced breast cancers. Cancer Res. 2010, 70 (9), 3440–9.
