Single Peptide-Based Protein Identification in Human Proteome

Anal. Chem. 2003, 75, 1316-1324

Single Peptide-Based Protein Identification in Human Proteome through MALDI-TOF MS Coupled with Amino Acids Coded Mass Tagging

Songqin Pan,† Sheng Gu,† E. Morton Bradbury,†,‡ and Xian Chen*,†

BN-2, MS M888, Bioscience Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, and Department of Biological Chemistry, School of Medicine, University of California at Davis, Davis, California 95616

Identification of proteins with low sequence coverage using mass spectrometry (MS) requires tandem MS/MS peptide sequencing. For sequence assignment, it is very challenging to obtain a complete tandem MS/MS spectrum, or to interpret an incomplete one, from fragmentation of a weak peptide ion signal. Here, we have developed an effective and high-throughput MALDI-TOF-based method for the identification of membrane and other low-abundance proteins with a simple, one-dimensional separation step. In this approach, several stable isotope-labeled amino acid precursors were selected to mass-tag, in parallel, the proteome of human skin fibroblast cells in a residue-specific manner during in vivo cell culturing. These labeled residues can be recognized by their characteristic isotope patterns in MALDI-TOF MS spectra. The isotope patterns of particular peptides induced by the different labeled precursors provide information about their amino acid compositions. The specificity of peptide signals in peptide mass mapping is thus greatly enhanced, resolving the high degree of mass degeneracy of proteolytic peptides derived from the complex human proteome. Further, false-positive matches in database searching can be eliminated. More importantly, proteins can be accurately identified through a single peptide with its m/z value and partial amino acid composition. With the increased solubility of hydrophobic proteins in SDS, we have demonstrated that our approach is effective for the identification of membrane and low-abundance proteins with low sequence coverage and weak signal intensity, for which informative fragment patterns are often difficult to obtain in tandem MS/MS peptide sequencing analysis.
Along with the rapid advances in proteomics, mass spectrometry (MS) has proven to be an essential technology for the characterization of proteins on a genomic scale.1 With the development of the MALDI and ESI ionization techniques,2,3 MS-based approaches have been widely employed in proteomics studies.

* To whom correspondence should be addressed. Tel: 505-665-3197. Fax: 505-665-3024. E-mail: [email protected].
† Los Alamos National Laboratory.
‡ University of California at Davis.
(1) Mann, M.; Hendrickson, R. C.; Pandey, A. Annu. Rev. Biochem. 2001, 70, 437-473.

1316 Analytical Chemistry, Vol. 75, No. 6, March 15, 2003

With its high-throughput potential, MALDI-TOF MS measures peptide mass maps (PMMs) of proteins generated through proteolytic treatment with known enzymatic specificity. However, high sequence coverage in a PMM is usually required for unique protein identification.4,5 Alternatively, ESI tandem MS/MS is a common approach for generating the daughter fragment ions of a proteolytic peptide. The distribution of daughter ions in the MS/MS spectrum provides information from which the amino acid sequence of individual peptides can be deduced and their parent proteins identified.6,7 Although this approach is highly specific, it demands relatively strong signals for fragmentation and is relatively time-consuming when dealing with complex mixtures. In general, protein identification using either the PMM or tandem MS/MS approach is based on the availability of a protein database for a particular organism. The mass-to-charge ratio (m/z) values derived from a MALDI-TOF PMM, or the peptide sequence deduced from a tandem MS/MS spectrum, are submitted to the protein database to correlate the experimental data with the theoretical values.6,8 However, the large number of proteins present in a proteome gives a high degree of mass degeneracy in the protein database, which often results in multiple matches, including false-positive results.9 This mass ambiguity is more pronounced in large proteomes such as that of human than in those of yeast or Escherichia coli. To reduce the number of possible matches, additional parameters can be introduced to restrict database searching. These parameters include high accuracy in mass measurement using a sophisticated instrument such as the ESI-FTICR MS,10 the molecular weight and isoelectric point of proteins, the type of protease used in enzymatic digestion, possible peptide modifications, and mass tags introduced by chemical reagents.

(2) Fenn, J. B.; Mann, M.; Meng, C. K.; Wong, S. F.; Whitehouse, C. M. Science 1989, 246, 64-71.
(3) Karas, M.; Hillenkamp, F. Anal. Chem. 1988, 60, 2299-2301.
(4) Fenyo, D.; Qin, J.; Chait, B. T. Electrophoresis 1998, 19, 998-1005.
(5) Zubarev, R. A.; Håkansson, P.; Sundqvist, B. Anal. Chem. 1996, 68, 4060-4063.
(6) Mann, M.; Wilm, M. Anal. Chem. 1994, 66, 4390-4399.
(7) McCormack, A. L.; Schieltz, D. M.; Goode, B.; Yang, S.; Barnes, G.; Drubin, D.; Yates, J. R., III. Anal. Chem. 1997, 69, 767-776.
(8) Henzel, W. J.; Billeci, T. M.; Stults, J. T.; Wong, S. C.; Grimley, C.; Watanabe, C. Proc. Natl. Acad. Sci. U.S.A. 1993, 90, 5011-5015.
(9) Egelhofer, V.; Gobom, J.; Seitz, H.; Giavalisco, P.; Lehrach, H.; Nordhoff, E. Anal. Chem. 2002, 74, 1760-1771.
(10) Bruce, J. E.; Anderson, G. A.; Wen, J.; Harkewicz, R.; Smith, R. D. Anal. Chem. 1999, 71, 2595-2599.

10.1021/ac020482s CCC: $25.00

© 2003 American Chemical Society Published on Web 02/19/2003

For instruments such as MALDI-TOF MS with only moderate mass accuracy, the high-throughput potential can be fully exploited only when signal specificity is increased. Postsource decay (PSD) usually does not generate daughter fragments informative enough for peptide sequencing; on the other hand, many tagging approaches for proteolytic peptides have been developed to discriminate signals with mass degeneracy in mass spectra. For example, 18O-labeled water is commonly used in protease digestion to introduce the 18O tag through the hydrolysis reaction, uniformly labeling all proteolytic peptides at the carboxyl terminus.11,12 Similarly, peptides can be uniformly tagged at their N-terminus through the acetylation reaction13 or through modification by nicotinyl-N-hydroxysuccinimide.14 Cysteine-containing peptides can be selectively isolated and characterized using the isotope-coded affinity tag (ICAT) reagent.15 Mass-coded abundance tagging (MCAT) can be used to characterize lysine-containing peptides for de novo sequencing and protein quantification.16 These methods, which mass tag proteomes through enzymatic or chemical reactions, have been shown to differentiate certain peptide signals from the overall peptide pool and thereby facilitate unique protein identification. Without such signal specificity, high-resolution separation of cellular proteins is required to obtain individual proteins for mass spectrometry analysis. Although two-dimensional polyacrylamide gel electrophoresis (2D PAGE) can routinely be used in protein separation for this purpose, the majority of low-abundance, hydrophobic, and membrane proteins are excluded because of their limited solubility during isoelectric focusing.17 However, these proteins are very important to cellular functions.
Many low-abundance proteins, such as some protein kinases, are key regulators of biological processes.18 Membrane proteins have many specialized functions, for example as receptors or ligand-binding proteins that sense environmental signals and trigger cellular signal transduction pathways.19 The characterization of these proteins is essential to our understanding of the basic mechanisms of these cellular processes. Meanwhile, these proteins are soluble in strong detergents such as SDS, can be separated by 1D SDS-PAGE, and, therefore, can be included in MS analysis.20 Our residue-specific mass tagging allows for the identification of multiple protein species from each 1D SDS-PAGE band. A recent study by Goodlett et al. has suggested that it is possible to identify proteins through the characterization of a single peptide without the need for tandem MS/MS.21 This approach is based on mass tagging of cysteine residues through an isotope distribution encoded tag (IDEnT) and then the accurate determination of the m/z value of a single peptide by ESI-FTICR MS with ultrahigh mass accuracy. Considering the high solubility of hydrophobic proteins in 1D SDS gel, the ability to use the accurate mass of single peptides for protein identification will extend the dynamic range of mass measurement.

(11) Kosaka, T.; Takazawa, T.; Nakamura, T. Anal. Chem. 2000, 72, 1179-1185.
(12) Yao, X.; Freas, A.; Ramirez, J.; Demirev, P. A.; Fenselau, C. Anal. Chem. 2001, 73, 2836-2842.
(13) Geng, M.; Ji, J.; Regnier, F. E. J. Chromatogr., A 2000, 870, 295-313.
(14) Munchbach, M.; Quadroni, M.; Miotto, G.; James, P. Anal. Chem. 2000, 72, 4047-4057.
(15) Gygi, S. P.; Rist, B.; Gerber, S. A.; Turecek, F.; Gelb, M. H.; Aebersold, R. Nat. Biotechnol. 1999, 17, 994-999.
(16) Cagney, G.; Emili, A. Nat. Biotechnol. 2002, 20, 163-170.
(17) Herbert, B. R.; Harry, J. L.; Packer, N. H.; Gooley, A. A.; Pedersen, S. K.; Williams, K. L. Trends Biotechnol. 2001, 19, S3-9.
(18) Lewis, T. S.; Hunt, J. B.; Aveline, L. D.; Jonscher, K. R.; Louie, D. F.; Yeh, J. M.; Nahreini, T. S.; Resing, K. A.; Ahn, N. G. Mol. Cell 2000, 6, 1343-1354.
(19) Kim, E. S.; Khuri, F. R.; Herbst, R. S. Curr. Opin. Oncol. 2001, 13, 506-513.
(20) Hunter, T. C.; Yang, L.; Zhu, H.; Majidi, V.; Bradbury, E. M.; Chen, X. Anal. Chem. 2001, 73, 4891-4902.
(21) Goodlett, D. R.; Bruce, J. E.; Anderson, G. A.; Rist, B.; Pasa-Tolic, L.; Fiehn, O.; Smith, R. D.; Aebersold, R. Anal. Chem. 2000, 72, 1112-1118.

A proteome can be mass tagged during cell culturing in vivo. Stable isotope-enriched amino acid precursors can be incorporated into proteomes during the in vivo process of protein synthesis. After proteolytic digestion of a protein mixture, the isotope patterns of peptide signals induced by labeled amino acid precursors reflect the presence or absence of a particular amino acid in the corresponding peptide. This constraint on peptide amino acid composition can serve as an additional parameter in database searching, providing a higher degree of specificity and accuracy.20,22,23 In this study, we have mass tagged the human proteome with the amino acids coded mass tagging (AACM) approach to resolve its complex mass degeneracy for high-throughput protein identification. With a number of small-scale human skin fibroblast (HSF) cell cultures, each labeled with a different type of stable isotope-enriched amino acid precursor, we were able to determine the presence or absence of these precursors in the proteolytic peptides. Thus, information regarding the residue content of a particular peptide is readily coded in the form of isotope patterns induced by the labeled amino acids. The constraint provided by the AACM is so stringent that a single peptide m/z with moderate accuracy is sufficient for identification of its parent protein. Further, combined with 1D SDS gel separation, this approach is particularly useful for the direct identification, in a MALDI-TOF PMM, of low sequence coverage proteins such as membrane and low-abundance proteins.
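The composition constraint described above can be illustrated with a minimal Python sketch. It is not the authors' software: the function names and the candidate list are hypothetical, and the candidates are simply peptide strings that appear in this paper, used here without any claim that they share the same m/z. In practice the candidate list would be every database peptide within the mass tolerance of one observed signal.

```python
# Minimal sketch (hypothetical helper names): keep only candidate peptides
# whose residue content agrees with an observed AACM presence/absence pattern.

def matches_pattern(peptide, pattern):
    """pattern maps a one-letter residue code to True (isotope pattern
    observed, residue present) or False (no pattern, residue absent)."""
    return all((residue in peptide) == present
               for residue, present in pattern.items())

def filter_candidates(candidates, pattern):
    """Return the candidates consistent with the observed pattern."""
    return [p for p in candidates if matches_pattern(p, pattern)]

# Illustrative pattern: Ser and Tyr detected; Lys, Leu, Met not detected.
observed = {"K": False, "L": False, "M": False, "S": True, "Y": True}
candidates = ["AGFAGDDAPR", "GYSFTTTAER", "IWHHTFYNELR"]  # illustrative list

print(filter_candidates(candidates, observed))
```

Only the presence or absence of each labeled residue is tested here; the number of residues of each kind, which the isotope pattern can also reveal, would tighten the filter further.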
MATERIALS AND METHODS

Chemicals. The stable isotope-enriched amino acid precursors [5,5,5-d3]leucine (Leu-d3), [4,4,5,5-d4]lysine (Lys-d4), [methyl-d3]methionine (Met-d3), [2,3,3-d3]serine (Ser-d3), and [3,3-d2]tyrosine (Tyr-d2) were purchased from Cambridge Isotope Laboratories, Inc. (Andover, MA). The modified trypsin was purchased from Roche Diagnostics Corp. (Indianapolis, IN). All components of the culturing medium for the HSF cells, including the α-MEM (medium formula 32561), the dialyzed fetal bovine serum, and the antibiotic penicillin, were obtained from GIBCO-BRL (Grand Island, NY). C18 ZipTips were purchased from Millipore Corp. (Bedford, MA). Other chemicals for gel electrophoresis, peptide extraction, and sample preparation for MALDI-TOF and ESI MS were purchased from Sigma (St. Louis, MO). All chemicals were sequence- or HPLC-grade unless specifically mentioned.

(22) Zhu, H.; Hunter, T. C.; Pan, S.; Yau, P. M.; Bradbury, E. M.; Chen, X. Anal. Chem. 2002, 74, 1687-1694.
(23) Chen, X.; Smith, L. M.; Bradbury, E. M. Anal. Chem. 2000, 72, 1134-1143.

Cell Culture, Amino Acid Labeling, and Sample Preparation. The HSF cells were grown in the α-MEM medium at 37 °C with 5% CO2 and 90% relative humidity. The medium was supplemented with 10% dialyzed fetal bovine serum (FBS), 100 units/mL penicillin, and 100 µg/mL streptomycin sulfate. For the amino acid-specific labeling, the selected isotope-enriched amino acid precursor was added to the medium at a final concentration equal to that of its unlabeled counterpart to obtain 50% labeling. At ∼80% confluence, cells were harvested with trypsin-EDTA treatment and then washed three times with a large volume of PBS buffer to remove any residual trypsin and FBS. The washed cells were resuspended in PBS buffer and then lysed with the SDS loading buffer in boiling water for 10 min. Any insoluble debris was spun down in a microcentrifuge at 14,000 rpm; the supernatants were then collected and subjected to gel electrophoresis using 12% SDS-PAGE. The protein bands in the 1D SDS gel were visualized with Coomassie staining.

Trypsin In-Gel Digestion, Peptide Extraction. The bands of interest were sliced from the 1D gel and cut into small pieces. They were first destained with a solution of 50% acetonitrile and 50 mM NH4HCO3 and then dried with 100% acetonitrile for 20 min, followed by speed-vacuum centrifugation for 10 min. Trypsin was added to the dried samples to give a final concentration of 10 µg/mL, and the samples were incubated overnight at 37 °C. The tryptic peptides were extracted twice from the gel slices by 45-min sonication each time, first in 200 µL of 5% acetic acid solution and then in 200 µL of 5% acetic acid/50% acetonitrile solution. The supernatants were collected and dried to a pellet in a speed-vacuum centrifuge. The pellets were resuspended in 20 µL of 0.1% TFA solution for LCQ MS/MS analysis. For MALDI-TOF MS analysis, additional desalting steps were performed using C18 ZipTips.

MALDI-TOF MS Peptide Mass Mapping. All MALDI-TOF mass spectra were acquired with a PE Voyager DE-STR biospectrometry workstation equipped with an N2 laser (337 nm, 3-ns pulse width, 20-Hz repetition rate) using the reflector mode with delayed extraction (PE Biosystem, Framingham, MA). The matrix, α-cyano-4-hydroxycinnamic acid, was prepared as a saturated solution in 50% acetonitrile/0.1% TFA solvent.
For MALDI-TOF analysis, 1 µL of matrix was mixed with 1 µL of sample on the sample plate, and the mixture was air-dried to cocrystallize the analyte with the matrix. The m/z values of proteolytic peptides were externally calibrated with Calmix (angiotensin II and insulin chain B oxidized) obtained from PE Biosystem. For a single 1D gel band, individual MALDI-TOF PMMs were generated for each of the unlabeled and five labeled samples. The isotope patterns of the labeled amino acids were compiled for the peptides that were consistently detected in all six PMMs. Information about the amino acid composition of these peptides was obtained from the isotope patterns of the labeled amino acids.

Peptide Sequencing by the LCQ Tandem MS/MS. The ESI mass spectra were collected on a Finnigan LCQ Deca ion trap mass spectrometer equipped with an electrospray ionization source (San Jose, CA). The spray voltage (needle voltage) was set at 3.7 kV, and the sheath gas was set at 55 units; no auxiliary gas was employed. The capillary temperature was kept at 200 °C. The mass spectrometer was tuned daily using a 500 fmol/µL angiotensin I solution to maintain optimal system conditions. The mass spectra were collected in the triple-play mode, comprising full scan, zoom scan, and MS/MS, in that order. The sequence of a selected peptide was determined from the distribution of its daughter ions in the MS/MS spectrum.

Protein Identification through Database Search. Each MALDI-TOF PMM was first calibrated with the two standard peptides (angiotensin II and insulin chain B oxidized) and then submitted to the human protein database (SwissProt.9.30.2001) to determine the protein identities using the MS-Fit program (UCSF Mass Spectrometry Facility). For the best-matched protein, two peptides with known amino acids identified through their isotope patterns were selected to recalibrate the PMM using the theoretical masses to reduce searching errors. The isotope patterns of labeled amino acids were used to verify the peptide compositions of matched proteins to score protein matches. For the unmatched peptides, their newly calibrated m/z values together with the experimentally determined amino acid contents were submitted to the database to identify the parent proteins using the MS-Seq program (UCSF Mass Spectrometry Facility). The LCQ MS/MS results were submitted to the database using the SEQUEST 2.0 program to find the corresponding peptide sequence and its parent protein in the NCBI human protein database.

RESULTS

Determination of Residue Contents by the AACM Patterns for Individual Peptides. Although our residue-specific mass tagging with stable isotope-labeled amino acids has been demonstrated in E. coli and yeast to facilitate protein identification,20,23 this approach has not been applied to higher eukaryotes with larger proteomes such as human. To examine the general applicability of such an approach to the human proteome, five deuterium-labeled essential and nonessential amino acid precursors, namely, lysine-d4, leucine-d3, methionine-d3, serine-d3, and tyrosine-d2, were chosen for mass tagging the human proteome of the HSF cells. Serine and tyrosine are nonessential amino acids, and the others are essential amino acids for human cells. The labeled essential amino acids will be taken up by the human cells because the cells cannot make these amino acids. For the nonessential amino acids, however, the uptake of a labeled amino acid precursor may be less efficient than that of the essential ones if the cells can make their own instead of using the sources available in the growth medium.
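As a rough guide to what the labeled spectra should look like, the following Python sketch estimates the relative intensities of the label-induced peaks for a peptide under the 50% labeling described in Materials and Methods. It is an idealized model, not the authors' analysis: it assumes that each occurrence of the residue is labeled independently with the same probability, and it ignores the natural isotope envelope and any metabolic conversion. The function and dictionary names are hypothetical; the per-residue mass shifts follow from the deuterium content of the precursors (d2, d3, d4).

```python
from math import comb

# Nominal mass shift (Da) per labeled residue, from the deuterium content
# of Lys-d4, Leu-d3, Met-d3, Ser-d3, and Tyr-d2.
DEUTERIUM_SHIFT = {"K": 4, "L": 3, "M": 3, "S": 3, "Y": 2}

def label_envelope(peptide, residue, labeling=0.5):
    """Return [(mass shift in Da, relative abundance), ...] for a peptide
    containing n copies of `residue`, assuming each copy is independently
    labeled with probability `labeling` (binomial model, an assumption)."""
    n = peptide.count(residue)
    shift = DEUTERIUM_SHIFT[residue]
    return [(k * shift, comb(n, k) * labeling**k * (1 - labeling)**(n - k))
            for k in range(n + 1)]

# A peptide with two leucines is expected to show peaks at +3 and +6 Da.
for shift, abundance in label_envelope("KDLYANTVLSGGTTMYPGIADR", "L"):
    print(f"+{shift} Da: {abundance:.2f}")
```

Under this model a peptide with one labeled residue gives an unshifted and a shifted peak of equal height, and a peptide lacking the residue gives only the unshifted peak, which is the presence/absence readout used throughout the Results.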
Five small-scale cell cultures were grown in parallel, each with a different labeled precursor at a 1:1 ratio with the unlabeled counterpart in the medium to give 50% labeling. As a control, cells were grown in the regular medium without the labeled amino acid precursors. Based on their deuterium contents, the mass increase for these labeled amino acids should be 2 Da for tyrosine, 4 Da for lysine, and 3 Da for leucine, methionine, and serine, in each PMM. To determine whether a labeled amino acid precursor was residue-specifically incorporated into the human proteome, total cellular proteins of labeled or unlabeled HSF cells were separated by 1D SDS gel, and two unknown protein bands with different Coomassie-staining intensities were excised from the gel for trypsin digestion and MALDI-TOF MS analysis. For each 1D SDS gel band, a series of peptide peaks from the MALDI-TOF PMMs were screened for their isotope distributions by comparing the individual mass signals between the unlabeled and labeled samples. A few examples of peptides displaying characteristic isotope patterns induced by the heavy amino acid precursors are shown in Figure 1. The peptide with an m/z value of 945.56 from band 1 (Figure 1A) showed a 3-Da isotope pattern for the Ser-d3-labeled sample when compared to the unlabeled one, indicating one serine in this peptide; meanwhile, the same peptide did not show any isotope pattern for the other labeled samples, indicating no lysine, leucine, methionine, or tyrosine in this particular peptide. Similarly, the peptide at m/z 1772.89 displayed 3-, 3-, and

Figure 1. Representative peptide signals in the MALDI-TOF PMMs showing the characteristic isotope patterns induced by the incorporation of various types of labeled amino acids. The labels are listed at the right; ‘K’-lysine, ‘L’-leucine, ‘M’-methionine, ‘S’-serine, ‘Y’-tyrosine, and ‘U’ for the unlabeled control. The isotope pattern induced by each labeled amino acid is marked as “*”. (A) protein band 1. (B) protein band 2.


Table 1. Positive or False-Positive Matches in Protein Identification Determined by the AACM Patterns Induced by Multiple Amino Acid Precursors [a]

protein matched with MALDI-TOF PMM   m/z submitted   matched peptide sequence      error [g] (ppm)
actin [c] (positive)                 976.4624        AGFAGDDAPR                       14
                                     1132.5275       GYSFTTTAER                        0
                                     1198.7031       AVFPSIVGRPR                      -3
                                                     DSYVGDEAQSK [e]                 150
                                     1515.7533       IWHHTFYNELR                       2
                                     1790.8963       SYELPDGQVITIGNER                  2
                                     1954.0448       VAPEEHPVLLTEAPLNPK              -10
                                     1960.9117       YPIEHGIITNWDDMEK                  0
                                     2215.0639       DLYANTVLSGGTTMYPGIADR            -3
                                     2343.1674       KDLYANTVLSGGTTMYPGIADR            1
P25686 [d] (false positive)          976.4624        IMENGQER [e]                     10
                                     1132.5275       EGLTGTGTGPSR [e]                -29
                                                     RIMENGQER [e]                   -23
                                     1182.1477       RQGRPRPSTK [e]                 -452
                                     1501.7168       AEAGSGGPGFTFTFR [e]              6

[Columns "amino acid labeled" K, L, M, S, Y [b], giving - or + [f] for each submitted m/z: entries not recoverable from the extracted layout.]

[a] Two protein matches as the representatives from band 1 are shown here. Among the top 20 possible candidates in the database search, actin was listed in first place and P25686 (SwissProt accession number) was listed in second place.
[b] Single-letter abbreviation of amino acids.
[c] Unambiguous identification by the PMM (MOWSE score, 1260) constrained by the AACM patterns.
[d] False-positive result by the PMM (MOWSE score, 614) determined by the AACM patterns.
[e] The peptides that matched the submitted m/z values of the PMM but were determined as mismatches by the AACM patterns.
[f] -, absence; +, presence, of the labeled amino acid residue.
[g] The spectrum was internally calibrated with two ions: m/z 1132.5275 and 1960.9117.
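The error column of Table 1 can be reproduced approximately from first principles. The Python sketch below computes the monoisotopic [M+H]+ value of an unmodified peptide and its deviation in ppm from a submitted m/z. The residue masses are standard monoisotopic values; the function names are hypothetical, and the calculation is only an illustration, not the MS-Fit scoring itself.

```python
# Standard monoisotopic residue masses (Da) for the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O added for the free N- and C-termini
PROTON = 1.00728   # charge carrier for a singly protonated ion

def mh_plus(peptide):
    """Monoisotopic m/z of the singly protonated, unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON

def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6

# The two actin candidates for the submitted m/z 1198.7031 (Table 1).
for seq in ("AVFPSIVGRPR", "DSYVGDEAQSK"):
    print(seq, round(mh_plus(seq), 4), round(ppm_error(1198.7031, mh_plus(seq)), 1))
```

Both candidates fall inside a 500 ppm search window, one a few ppm away and the other roughly 150 ppm away, so the mass alone does not exclude the second; the AACM pattern does.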

2-Da isotope patterns that are characteristic for Leu-d3-, Ser-d3-, and Tyr-d2-labeled cells, respectively, whereas no pattern was observed for Lys-d4 or Met-d3, indicating that lysine and methionine are not residues of this peptide. For the peptide at m/z 2343.17, characteristic 4-, 3-, and 3-Da isotope patterns were observed for Lys-d4-, Met-d3-, and Ser-d3-labeled cells, respectively. Two sets of isotope patterns were observed for both the Leu-d3-labeled and Tyr-d2-labeled samples, and they occurred at 3 and 6 Da for Leu-d3 labeling and 2 and 4 Da for Tyr-d2 labeling, respectively, indicating that two residues of each labeled precursor were specifically incorporated in this peptide (Figure 1A). The peptide at m/z 976.46 did not display any characteristic isotope patterns, and therefore, it contains none of the five labeled amino acid residues. In Figure 1B, peptides extracted from another band of the 1D gel also showed the expected isotope patterns for the individual labeled amino acid precursors. Moreover, there was no evidence of isotopic scrambling in any case. Thus, Lys-d4, Leu-d3, Met-d3, Ser-d3, and Tyr-d2 amino acid precursors can be used in AACM in the human proteome, and the partial peptide composition can be easily determined from the observed AACM pattern for individual peptides in a MALDI-TOF PMM. These results demonstrate that both essential and nonessential amino acids can be used for AACM. The efficient labeling by nonessential amino acid precursors suggests that cellular pathways for amino acid synthesis are inhibited by the amino acid precursors provided in the medium, presumably through a mechanism of negative-feedback inhibition.

Elimination of False-Positive Results in Database Search by the AACM Approach. The MALDI-TOF PMM generated from each band was first submitted to the protein database using the MS-Fit program to search for protein candidates at a mass tolerance of 500 ppm.9 Because of the high degree of mass degeneracy in the large human proteome, a large number of candidate proteins was returned in the search results. To determine unambiguously the unique result from the candidate pool, the top 20 possible matches to each PMM were selected, and the amino acid contents of their matched peptides were examined against the experimentally determined AACM patterns. As an example of using AACM patterns to discriminate false-positive matches, the database searching result for the top two protein candidates from protein band 1 is presented in Table 1. Two peptide candidates of actin matched the submitted m/z of 1198.70. The first, with the sequence AVFPSIVGRPR, contains one serine but no leucine, lysine, methionine, or tyrosine. Its amino acid composition matched the AACM pattern shown in the right columns of Table 1. The second possible peptide, DSYVGDEAQSK, contains lysine and tyrosine, which were not observed in the AACM patterns. This comparison clearly indicates that the peptide AVFPSIVGRPR is the unique match, whereas the peptide DSYVGDEAQSK is a false-positive result due to mass degeneracy. Applying the same criteria, all the other peptides of actin were identified unambiguously, as summarized in Table 1. For the protein P25686, however, none of its five matched peptides was supported by the AACM patterns, and therefore, its identification was considered to be a false-positive result although most of its matched peptides had low mass errors (Table 1). Thus, only actin was correctly identified in protein band 1 (Table 1). Similarly, transgelin is the only candidate correctly identified in protein band 2 using the constraint of AACM patterns. All other proteins on the list simply resulted from the high degree of mass degeneracy in the human proteome and were false-positive matches. Therefore, the AACM patterns served as a filter for database searching to eliminate the false-positive identifications. Furthermore, the recalibration of the PMMs with the labeled peptides greatly reduced search errors to less than 15 ppm for actin (Table 1) and less than 100 ppm for transgelin (data not shown).
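The internal recalibration mentioned above can be sketched as a two-point linear correction. This is a simplification chosen for illustration: the paper does not describe the calibration function used by the instrument software, and the "raw" readings below are hypothetical numbers. Only the two reference values, m/z 1132.5275 and 1960.9117, come from the footnote of Table 1, where the corresponding peptides are listed with zero error after calibration.

```python
def two_point_recalibration(obs1, theo1, obs2, theo2):
    """Return a function mapping an observed m/z to a corrected m/z, using a
    straight line through two reference peaks (an assumed, simplified model)."""
    slope = (theo2 - theo1) / (obs2 - obs1)
    intercept = theo1 - slope * obs1
    return lambda mz: slope * mz + intercept

# Hypothetical raw readings for the two internal calibrants.
raw_1, raw_2 = 1132.60, 1961.05
recalibrate = two_point_recalibration(raw_1, 1132.5275, raw_2, 1960.9117)

# Any other raw peak in the same spectrum is corrected with the same line.
print(round(recalibrate(1198.79), 4))
```

The design choice is that the calibrants are peptides whose identity is already secured by their AACM patterns, so the correction is anchored on signals within the same spectrum rather than on external standards.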
Proof of Principle: Identification of Actin and Transgelin through Single Peptides with the AACM-Determined Amino Acid Composition. The database search using the m/z as the sole parameter showed the highest sequence coverage for both actin and

Table 2. Identification of Actin and Transgelin Based on the Single Peptide m/z Constrained with Multiple AACM Patterns

0 missed cleavage [a]
peptide mass (m/z)   matched peptide sequence    protein ID [c]   MW     error (ppm)
Actin
976.46239            AGFAGDDAPR                  ACTIN            42      14
                     EETITIDR                    Q92966           46.7   -33.5
1132.5275            GYSFTTTAER                  ACTIN            41.8     0
1198.7031            AVFPSIVGRPR                 ACTIN            41.8    -3
1501.7168            IWHHSFYNELR                 ACTIN            41.8   -12
1515.7533            IWHHTFYNELR                 ACTIN            42       2
1790.8963            SYELPDGQVITIGNER            ACTIN            42       2
1954.0448            VAPEEHPVLLTEAPLNPK          ACTIN            42     -10
1960.9117            YPIEHGIITNWDDMEK            ACTIN            41.8     0
Transgelin
854.3831             GPSYGMSR                    transgelin       23       0
1211.6025            HVIGLQMGSNR                 transgelin       23     -24
                     TVTSTMLGVFR                 P30793           28     -36
1295.6119            EFTESQLQEGK                 transgelin       23       0
1408.6161            GASQAGMTGYGRPR              transgelin       23     -42

1 missed cleavage [a]
peptide mass (m/z)   matched peptide sequence    protein ID [c]   MW     error (ppm)
976.46239            VCACPGRDR                   P48775           43.7    17
1954.0448            LEEGPPVTTVLTREDGLK          P08559           43.3    -2
2343.1674            KDLYANTVLSGGTTMYPGIADR      ACTIN            42       1

[Columns "amino acid labeled" K, L, M, S, Y [b], giving - or + [e], and the "total" [d] columns: entries not recoverable from the extracted layout.]

[a] Proteins were digested with trypsin.
[b] Single-letter abbreviation of amino acids.
[c] Protein accession numbers based on the SwissProt database.
[d] The number of total possible matches to a single mass without the AACM constraints with 500 ppm mass tolerance.
[e] -, absence; +, presence, of the labeled amino acid residue.

transgelin in the corresponding PMMs of protein bands 1 and 2. We use these two proteins as positive controls to demonstrate the specificity and accuracy of our single-peptide-based AACM approach in protein identification directly from MALDI-TOF PMMs. Many of the peptides in the PMMs displayed various AACM patterns in response to the type of labeled amino acid precursor used in cell culturing. Individual peptides were examined using their m/z values along with their complete AACM patterns, and the identification results are listed in Table 2. In addition to the m/z value of each possible peptide, the AACM patterns played critical roles in distinguishing the specific matches. Nine peptides from actin and four peptides from transgelin were matched to both the m/z value and the AACM pattern with errors of less than 20 ppm for actin and 40 ppm for transgelin. A peptide at m/z 2343.16 with a missed cleavage site was also identified as an actin peptide through its AACM pattern. Meanwhile, some peptides showed secondary matches in addition to the parent proteins because of mass degeneracy in database searching: m/z 976.46 matched two other proteins (accession numbers P48775 and Q92966), m/z 1954.04 matched P08559, and m/z 1211.60 matched P30793. However, the majority of identifications were single unique matches. To further confirm our protein assignments using single AACM peptides, de novo peptide sequencing was performed using LCQ MS/MS. As shown in the representative spectrum in Figure 2, the peptide sequence deduced from MS/MS fragmentation for the parent ion at m/z 1960.91 of actin was consistent with that of the matched peptide using the MALDI-TOF AACM approach shown in Table 2. Furthermore, the database search using the SEQUEST program identified the same peptide sequence for the MS/MS spectrum.
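The notion of a missed cleavage site used above can be made concrete with a small in-silico digestion sketch in Python. It applies the usual trypsin rule (cleavage C-terminal to lysine or arginine, but not before proline), which is an assumption about how the search programs generate their peptide lists rather than something stated in this paper. The input string is an artificial concatenation of two peptides from Table 2, not the real actin sequence, and the function names are hypothetical.

```python
def cleavage_fragments(sequence):
    """Split a protein sequence after every K or R that is not followed by P."""
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        following = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and following != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])  # C-terminal remainder
    return fragments

def tryptic_peptides(sequence, max_missed=1):
    """Return (peptide, number of missed cleavages) for up to max_missed."""
    fragments = cleavage_fragments(sequence)
    peptides = []
    for i in range(len(fragments)):
        last = min(i + max_missed, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append(("".join(fragments[i:j + 1]), j - i))
    return peptides

# Artificial test sequence built from two peptides listed in Table 2.
toy_sequence = "AGFAGDDAPR" + "KDLYANTVLSGGTTMYPGIADR"
for peptide, missed in tryptic_peptides(toy_sequence):
    print(missed, peptide)
```

Allowing one missed cleavage enlarges the candidate list for every observed mass, which is one reason the totals without AACM constraints grow in the missed-cleavage columns of the tables.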
The consistency between our MALDI-TOF-based approach and tandem MS/MS has also been observed in identifying other peptide sequences and their parent proteins.25 These results suggest that a single peptide m/z value from a MALDI-TOF PMM together with its AACM pattern is sufficient for accurate protein identification in a protein mixture. This should be applicable for identifying low sequence coverage proteins with few or only a single peptide detectable in MALDI-TOF MS.

Identification of Comigrating Low-Abundant Proteins from Single Peptides with Their AACM-Determined Partial Amino Acid Compositions. The unassigned peptide signals, other than those of actin and transgelin, in the PMMs of the 1D SDS gel bands suggest that they may result from other low-abundance proteins comigrating with actin and transgelin. In general, having only a few, or probably even a single, peptide detectable by MALDI-TOF MS for an individual protein would prevent it from being identified by the conventional search criteria, which usually require at least four peptides matching within an error range of less than 100 ppm. However, after internal recalibration with the labeled actin and transgelin peptides, these proteins have been identified from a single individual peptide using the multiple AACM patterns. For those unidentified signals at relatively low intensity in the same PMMs of protein bands 1 and 2, Table 3 shows the peptide signals that were observed in the MALDI-TOF MS spectra of all six unlabeled and labeled samples. Other weak signals not consistently detected in each labeled PMM were not included in the analysis. The amino acid composition determined by the AACM patterns for each of these peptides is indicated as "-" for "absence" or "+" for "presence" under each amino acid abbreviation in the table. Using the MS-Seq program, each of these m/z values was submitted along with the AACM-determined amino acid composition information to identify its parent protein. As a result, the identified proteins were given by their accession

(24) Dreger, M.; Bengtsson, L.; Schoneberg, T.; Otto, H.; Hucho, F. Proc. Natl. Acad. Sci. U.S.A. 2001, 98, 11943-11948.
(25) Gu, S.; Pan, S.; Bradbury, E. M.; Chen, X. Anal. Chem. 2002, 74, 5774-5785.

Analytical Chemistry, Vol. 75, No. 6, March 15, 2003


Figure 2. LCQ tandem MS/MS spectrum showing a peptide sequence from actin. As the positive control, an actin tryptic fragment (m/z = 1961) was sequenced by tandem MS/MS. The daughter ion series are shown in detail, and the amino acid ladders are indicated with the single-letter abbreviations. The obtained peptide sequence matched that identified by the single MALDI-TOF m/z combined with the AACM pattern induced by multiple precursors.

Table 3. Identification of Proteins for Two 1D Gel Bands Using Single Peptide m/z Constrained with the AACM Patterns

0 missed cleavageᵃ

band  peptide mass (m/z)  matched peptide sequence  protein IDᶜ  MW (kDa)  error (ppm)  totalᵈ
1     945.5623            IGNGTSGIR                 P55795       49.3      51           28
1     1182.0446           GAAGGAEQPGPGGR            Q99447ᵉ      43.8      401          32
1     1499.6862                                                                         12
1     1531.7361           YLQHHHFHQER               O75791ᵉ      37.9      4            15
1     1547.7220           ?                                                             11
1     1772.8928                                                                         12
1     1976.8922           ?                                                             7
2     944.5235                                                                          11
2     1155.4928           ?                                                             4
2     1194.6164                                                                         10
2     1198.6697                                                                         39
2     1223.5941                                                                         10
2     1227.5926           LMQESLTLHR                Q16613       23.3      -48          6
2     1316.6282           LQIWDTAGQER               P20338ᵉ      23.9      -24          22
2     1424.5975           ?                                                             10
2     1582.8274           NCTITANAECACR             P26842ᵉ      29.2      95           4
2     2384.8286           ?                                                             3

1 missed cleavageᵃ

band  peptide mass (m/z)  matched peptide sequence  protein IDᶜ  MW (kDa)  error (ppm)  totalᵈ
1     945.5623            AASRANTVR                 Q13873       44.7      39           55
1     1182.0446                                                                         72
1     1499.6862           EIFQNGHVRDER              P48775       47.9      -33          30
1     1531.7361           YPGPLAQQAQRFR             Q12926       39.6      -50          43
1     1547.7220                                                                         30
1     1772.8928           ISDFGLATVFRYNNR           O14757ᵉ      54        -9           39
1     1976.8922                                                                         36
2     944.5235            RCGGILVR                  P20718       27.3      -24          28
2     1155.4928                                                                         11
2     1194.6164           RITQGFCVVT                O00212       23.5      -12          22
2     1198.6697           VAIIIPFRNR                P15291ᵉ      44.2      -61          64
2     1223.5941           QNARFLTSMR                O75558ᵉ      33.2      -31          28
2     1227.5926                                                                         14
2     1316.6282                                                                         28
2     1424.5975                                                                         28
2     1582.8274                                                                         10
2     2384.8286                                                                         8

Amino acid labeled (Kᵇ, Lᵇ, Mᵇ, Sᵇ, Yᵇ): the +/-ᶠ marks survive only as the groups below; their row and column positions are not recoverable.
Band 1: - + / + + + - / + / + + - / + + + +
Band 2: - / + + + + - / + + + + - / + + + + - / + + -

ᵃ Proteins were digested with trypsin. ᵇ Single-letter abbreviation of amino acids. ᶜ Protein accession numbers based on the SwissProt database. ᵈ The number of total possible matches to a single mass without any amino acid constraints with 500 ppm mass tolerance. ᵉ Membrane and low-abundant proteins. ᶠ -, absence, +, presence, of the labeled amino acid residues.
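The "0 missed cleavage" and "1 missed cleavage" columns of Table 3 correspond to two classes of in-silico tryptic peptides. The following is a minimal sketch of how such peptides could be enumerated; it is not the search program used in this work. The cleavage rule (after K or R, but not before P) is a commonly used assumption, and `cleave`, `tryptic_peptides`, and `toy_protein` are hypothetical names. The toy sequence is made up and merely embeds the Table 3 peptide AASRANTVR.

```python
def cleave(sequence):
    """Split a sequence at assumed tryptic sites: after K or R, not before P."""
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])  # C-terminal piece not ending in K/R
    return fragments


def tryptic_peptides(sequence, max_missed=1):
    """Map each peptide to the number of missed cleavage sites it spans."""
    fragments = cleave(sequence)
    peptides = {}
    for i in range(len(fragments)):
        for missed in range(max_missed + 1):
            if i + missed < len(fragments):
                peptide = "".join(fragments[i:i + missed + 1])
                peptides.setdefault(peptide, missed)
    return peptides


# Hypothetical toy sequence, not a database entry.
toy_protein = "AASRANTVRLK"
peptides = tryptic_peptides(toy_protein)
for peptide, missed in sorted(peptides.items()):
    print(peptide, missed)
```

In this toy digest, AASRANTVR appears only when one missed cleavage is allowed, which is why it would be listed under the "1 missed cleavage" heading.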

numbers in the SWISS-PROT database. Significantly, from each single peptide signal, a unique protein matching both search criteria was specifically identified, mostly with low errors, including for those peptide sequences with a missed cleavage site. Among the 14 identified proteins listed in Table 3, 5 (accession numbers Q99447, P20338, P26842, P15291, and O75558) are known membrane proteins,26-30 and 2 (accession numbers O14757 and O75791) are apparently low-abundant proteins.31,32 Among all proteins identified in these two protein bands, including actin and transgelin, membrane and low-abundant proteins account for 7 out of 16, i.e., more than 40% of the identified protein population.

(26) Wang, X. M.; Moore, T. S., Jr. J. Biol. Chem. 1991, 266, 19981-19987.
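The two search criteria, a peptide m/z within a mass tolerance and agreement with the AACM-determined presence or absence of labeled residues, can be illustrated with a short sketch. This is not the MS-Seq program; the function names and the candidate list are hypothetical. The residue masses are standard monoisotopic values, the 500 ppm tolerance follows the Table 3 footnote, and the decoy peptide YLKHHHFHQER is invented here to show how a K/Q substitution (0.036 Da apart) passes the mass filter but fails the lysine constraint.

```python
# Standard monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def mh_plus(peptide):
    """Monoisotopic m/z of the singly protonated, unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER + PROTON


def ppm_error(observed, theoretical):
    return (observed - theoretical) / theoretical * 1e6


def matches_aacm(peptide, pattern):
    """pattern maps a residue to True (present, '+') or False (absent, '-')."""
    return all((res in peptide) == present for res, present in pattern.items())


def search(observed_mz, pattern, candidates, tol_ppm=500.0):
    hits = []
    for peptide in candidates:
        error = ppm_error(observed_mz, mh_plus(peptide))
        if abs(error) <= tol_ppm and matches_aacm(peptide, pattern):
            hits.append((peptide, round(error, 1)))
    return hits


# YLQHHHFHQER and LQIWDTAGQER are from Table 3; YLKHHHFHQER is a made-up decoy.
candidates = ["YLQHHHFHQER", "YLKHHHFHQER", "LQIWDTAGQER"]
mass_only = search(1531.7361, {}, candidates)
constrained = search(1531.7361, {"K": False, "Y": True}, candidates)
print(mass_only)
print(constrained)
```

With the m/z alone, both YLQHHHFHQER and the lysine-containing decoy fall within tolerance; adding the absence of lysine and the presence of tyrosine leaves only YLQHHHFHQER, at an error of about 4 ppm.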

Figure 3. MALDI-TOF PMM of protein band 1 showing peptide signals corresponding to the identified proteins listed in Tables 2 and 3. Proteins are listed by their accession numbers; “?” denotes possible novel proteins currently not present in the database. Actin shows the highest sequence coverage, with nine peptides detected, whereas all other proteins have only a single peptide signal detectable in the MS. Inset: expanded view of a small region showing two actin peptides with characteristic 4-Da isotope patterns induced by 50% lysine-d4 labeling.
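The 4-Da isotope pattern mentioned in the Figure 3 inset suggests a simple way to flag lysine-containing peptides in a peak list: look for pairs of peaks separated by the mass difference between four deuterium and four hydrogen atoms. The sketch below is only an illustration under stated assumptions; the function name, the 0.05-Da tolerance, and the peak list are hypothetical, and intensity ratios (expected to be comparable at 50% labeling) are ignored.

```python
# Mass shift of one lysine-d4 residue: four H replaced by four D.
D4_SHIFT = 4 * (2.014102 - 1.007825)  # about 4.0251 Da


def find_labeled_pairs(peaks, n_labels=1, tol_da=0.05):
    """Return (light, heavy) peak pairs spaced by n_labels lysine-d4 shifts."""
    ordered = sorted(peaks)
    pairs = []
    for i, light in enumerate(ordered):
        for heavy in ordered[i + 1:]:
            if abs((heavy - light) - n_labels * D4_SHIFT) <= tol_da:
                pairs.append((light, heavy))
    return pairs


# Synthetic m/z values chosen for illustration only.
peaks = [1000.50, 1004.53, 1200.60, 1350.70]
pairs = find_labeled_pairs(peaks)
print(pairs)
```

Setting `n_labels=2` would look for an 8-Da spacing, as expected for a peptide carrying two labeled lysines, for example one with a missed cleavage site.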

The mass of the intact protein O14757, the parent of the 1772.89-Da peptide in band 1, was found to be 54 kDa, which is significantly larger than its apparent molecular weight on the gel. The detected species is possibly a cleavage product present in the respective band, as previously reported.18 Similarly, two proteins in band 2, P15291 and O75558, fall in the same category. Meanwhile, for two peptides in band 1 (1547.72 and 1976.89 Da) and three peptides in band 2 (1155.49, 1424.60, and 2384.83 Da), no parent proteins were identified. They might derive from novel proteins not present in the protein database,24 or these peptides could be products of nonspecific cleavage or of other unknown modifications. To determine whether the proteins identified from single peptides have any other peptides that could be matched in these PMMs, we searched the peptide signals of the entire PMMs against each individual parent protein identified. This search confirmed that only a single peptide from each of these parent proteins was detected in the MALDI-TOF PMMs, with one exception: two peptides (1471.82 and 1499.69 Da) were matched to P48775. Therefore, these low-abundant parent proteins had very low sequence coverage in the PMMs, as shown for protein band 1 in Figure 3, and they were identified through a single peptide with specific amino acid composition information obtained from the AACM patterns.

DISCUSSION

In this human proteome study, the stable isotope-enriched amino acid mass tagging approach was systematically applied to the identification of proteins in a complex protein mixture from 1D gel bands. Both the essential and the nonessential amino acids tested showed characteristic isotope patterns for peptides containing the labeled amino acid(s). Moreover, isotopic scrambling was not observed in the MALDI-TOF MS spectra. Therefore, incorporating various isotope-enriched amino acid precursors into the human proteome in vivo during cell culturing can provide a highly specific constraint for protein identification. Although the AACM pattern for each peptide was compiled through visual inspection of the isotope patterns of labeled amino acids in this report, a computer-assisted method has been developed for the automated recognition of the isotope patterns for future studies.33

The need for multiple precursor labeling to improve the accuracy and specificity of protein identification correlates with the size of the proteome. With the relatively small yeast and E. coli proteomes, a single amino acid label is sufficient for accurate and specific protein identification.20,23 For the much larger human proteome with many unknown gene products, single amino acid labeling is insufficient, and multiple AACM is needed to reveal the composition of a number of amino acid residues in individual peptides. This is particularly important for protein identification based on a single peptide. As shown in Table 4, we performed searches in the human protein database for some peptides with only the m/z parameter, with single labeling, or with different combinations of multiple labelings. Statistically, the results of these searches suggest that, because of the large size of the human proteome, multiple amino acid contents need to be determined for specific protein identification.

MALDI-TOF PMM-based protein identification usually requires high peptide coverage of each parent protein for high accuracy and specificity.4 This approach is readily coupled to the

(27) van der Sluijs, P.; Hull, M.; Huber, L. A.; Male, P.; Goud, B.; Mellman, I. EMBO J. 1992, 11, 4379-4389.
(28) Camerini, D.; Walz, G.; Loenen, W. A.; Borst, J.; Seed, B. J. Immunol. 1991, 147, 3165-3169.
(29) Yamaguchi, N.; Fukuda, M. N. J. Biol. Chem. 1995, 270, 12170-12176.
(30) Valdez, A. C.; Cabaniols, J. P.; Brown, M. J.; Roche, P. A. J. Cell Sci. 1999, 112 (Pt 6), 845-854.
(31) Sanchez, Y.; Wong, C.; Thoma, R. S.; Richman, R.; Wu, Z.; Piwnica-Worms, H.; Elledge, S. J. Science 1997, 277, 1497-1501.
(32) Ellis, J. H.; Ashman, C.; Burden, M. N.; Kilpatrick, K. E.; Morse, M. A.; Hamblin, P. A. J. Immunol. 2000, 164, 5805-5814.
(33) Lubeck, O.; Sewell, C.; Gu, S.; Chen, X.; Cai, M. Proc. IEEE 2002, 90, 1868-1874.


Table 4. Number of Identified Proteins Using the Single Peptide m/z Constrained with Different AACM Combinations

Columns: no. of labeling (0, 1, 2, 3, 4, 5); amino acid labeled (Kᵇ, Lᵇ, Mᵇ, Sᵇ, Yᵇ); no. of proteins matchedᵈ.
[The "x" marks that show which labeled amino acids were combined in each row, and the assignment of rows to the 2-4 labeling groups, are not recoverable from the extracted layout.]

peptide 1 (m/z 1182.05)ᵃ
labeled residues in the peptideᶜ: K -, L -, M -, S -, Y -
no. of proteins matchedᵈ, in row order from 0 to 5 labelings: 76, 34, 28, 62, 45, 54, 10, 22, 15, 20, 6, 4, 6, 3, 5, 1

peptide 2 (m/z 1531.74)ᵃ
labeled residues in the peptideᶜ: K -, L +, M -, S -, Y +
no. of proteins matchedᵈ, in row order from 0 to 5 labelings: 43, 16, 20, 34, 22, 29, 9, 15, 10, 11, 8, 4, 4, 4, 4, 2

ᵃ The two peptides were selected from Table 2. ᵇ Amino acid single-letter abbreviation. ᶜ The presence or absence of a labeled amino acid residue in the peptides is indicated with ‘+’ and ‘-’, respectively. ᵈ One trypsin missed cleavage was allowed in the database searching with 500 ppm mass tolerance.
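The comparison in Table 4, searching with the m/z alone, with single labeling, and with combinations of labelings, can be mimicked on a toy scale. The sketch below counts how many candidate peptides remain consistent with the observed presence/absence pattern for every subset of the five labeled residues. It is an illustration only: the candidate list is invented (all candidates are assumed to have already passed the mass filter), the names are hypothetical, and the pattern is simply read off the sequence YLQHHHFHQER reported in Table 3.

```python
from itertools import combinations

LABELS = ("K", "L", "M", "S", "Y")


def consistent(peptide, pattern, used):
    """True if the peptide agrees with the pattern for every residue in `used`."""
    return all((res in peptide) == pattern[res] for res in used)


def count_matches(candidates, pattern):
    """Count surviving candidates for every combination of labeled residues."""
    counts = {}
    for n in range(len(LABELS) + 1):
        for used in combinations(LABELS, n):
            counts[used] = sum(consistent(p, pattern, used) for p in candidates)
    return counts


# Pattern assumed from the composition of YLQHHHFHQER: contains L and Y, lacks K, M, S.
pattern = {"K": False, "L": True, "M": False, "S": False, "Y": True}
# Hypothetical candidates; only the first is a peptide from Table 3.
candidates = ["YLQHHHFHQER", "YLKHHHFHQER", "FLQHHHFHQER",
              "YAQHHHFHQSR", "YLQHHMFHQER"]
counts = count_matches(candidates, pattern)
for used, n_hits in counts.items():
    print(len(used), "".join(used) or "(m/z only)", n_hits)
```

As in Table 4, the count can only stay the same or decrease as residues are added to the constraint set; in this toy list it falls from five candidates with the m/z alone to one when all five residues are used.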

2D gel approach, which has high-resolution separation power, so that a single gel spot contains only one or a few proteins. The 1D gel format, however, is much more challenging because multiple proteins comigrate in the same gel band. Identification of individual proteins from such a protein mixture is difficult because of the low peptide coverage for each parent protein, except for highly abundant proteins. This problem can be even more serious for the human proteome, which is ∼10 times larger than the yeast proteome. Consequently, a single 1D gel band likely contains 10 times more proteins from human cells than from yeast. Because of the high number of proteins present in a sample, the peptide coverage in MALDI-TOF MS for individual proteins might be relatively low. In fact, as shown in this study, for many proteins only a single peptide was detected in an MS spectrum. Because of this low sequence coverage, it is crucial to develop efficient methods that can identify proteins of the human proteome from only a few peptides or even a single peptide.

Protein identification using PMM-based single AACM peptides is an efficient approach for the large-scale identification of proteins in complex protein mixtures. A previous study with yeast demonstrated that a single peptide mass from a PMM is sufficient for the accurate identification of its parent protein within certain constraints.21 These constraints include high-accuracy mass measurement to reduce mass degeneracy and a cysteine modification to recognize the cysteine-containing peptides. However, ultrahigh mass accuracy can only be achieved with a highly sophisticated FTICR instrument, which is not widely used because of its high maintenance cost and operational complexity. Because of its low occurrence in cellular proteins, cysteine may not be an ideal target residue as a specific constraint for database searching. Some proteins may not be identified because of the lack of cysteine-containing peptides in the PMM.21 To overcome these potential limitations, we chose a set of labeled amino acid residues to constrain the database search. As illustrated, the combination of these selected amino acids is widely distributed in various protein sequences, and the labeling-dependent isotope patterns were often observed in the PMMs. Our approach can tolerate a high degree of mass degeneracy in a large proteome such as the human proteome without demanding instruments with ultrahigh mass accuracy. Furthermore, since many basic human biological questions are addressed at the cellular level by culturing different cell lines, the amino acid labeling scheme should be generally applicable to proteomic studies.

Although MS/MS analysis is commonly used for accurate protein identification by peptide sequencing, it has some limitations. First, some weak peptide signals may not yield informative tandem MS/MS spectra because of the chemical and physical properties of the peptides themselves or limitations of the fragmentation methods. Second, some peptides detected in MALDI may not be detectable in LC/ESI MS/MS because of the intrinsic difference between the two ionization mechanisms. Third, the daughter ion distribution generated by MS/MS fragmentation may be incomplete in a spectrum, which may lead to an ambiguous peptide sequence. Our AACM method can readily yield an initial and specific identification of peptides, and therefore, it should be a useful approach complementary to the existing LC/MS/MS methods. The advantage of the AACM single-peptide approach was evident in the identification of several membrane and low-abundance proteins shown in this study.

ACKNOWLEDGMENT

This work was supported by DOE Human Genome Instrumentation Grant ERW9840, Los Alamos National Laboratory LDRD Grant 20071, and DOE grant KP1103010 (to X.C.). X.C. is a recipient of the Presidential Early Career Award for Scientists and Engineers. This article is Los Alamos National Laboratory journal series LAUR 02-3035.

Received for review July 24, 2002. Accepted January 9, 2003.

AC020482S