Large-Scale Identification of Caenorhabditis elegans Proteins by Multidimensional Liquid Chromatography-Tandem Mass Spectrometry

Kwasi G. Mawuenyega,†,‡ Hiroyuki Kaji,*,†,‡ Yoshio Yamauchi,§ Takashi Shinkawa,† Haruna Saito,† Masato Taoka,† Nobuhiro Takahashi,§,| and Toshiaki Isobe†,§

Department of Chemistry, Graduate School of Science, Tokyo Metropolitan University, Hachioji, Tokyo 192-0397, Japan, Integrated Proteomics System Project, Pioneer Research on Genome the Frontier, MEXT, Japan, and Department of Applied Bioscience, United Graduate School of Agriculture Science, Tokyo University of Agriculture and Technology, Fuchu, Tokyo 183-8509, Japan

Received July 16, 2002

A proteome of a model organism, Caenorhabditis elegans, was analyzed by an integrated liquid chromatography (LC)-based protein identification system, which was constructed by coupling microscale two-dimensional liquid chromatography (2DLC) with electrospray ionization (ESI) tandem mass spectrometry (MS/MS) on a high-resolution hybrid mass spectrometer with an automated data analysis system. Soluble and insoluble protein fractions were prepared from a mixed growth phase culture of the worm C. elegans, digested with trypsin, and fractionated separately on the 2DLC system. The separated peptides were directly analyzed by on-line ESI-MS/MS in a data-dependent mode, and the resultant spectral data were automatically processed to search a genome sequence database, wormpep 66, for protein identification. The total number of proteins of the composite proteome identified by this method was 1616, including 110 secreted/targeted proteins and 242 transmembrane proteins. The codon adaptation indices of the identified proteins suggested that the system could identify proteins of relatively low abundance, which are difficult to identify by conventional 2D-gel electrophoresis (GE) followed by an offline mass spectrometric analysis such as peptide mass fingerprinting. Among the ∼5400 peptides assigned in this study, many peptides with post-translational modifications, such as N-terminal acetylation and phosphorylation, were detected. This expression profile of C. elegans, containing 571 hypothetical gene products, will serve as the basic data of a major proteome set expressed in the worm.

Keywords: liquid chromatography • tandem mass spectrometry • peptide signature • C. elegans

Introduction

Proteomics is a field of genome science that aims to uncover functional protein networks of biological systems through direct analyses of proteins expressed in cells. Thus, typical studies include the determination of quantitative changes in the expression levels of proteins, the assessment of the effects of a wide variety of cellular perturbations, and the comprehensive analysis of protein interactions by the mass identification of protein components in functional multiprotein complexes, membrane domains, and cellular organelles. Since proteomics is essentially based on the genome-wide analyses of proteins, its success depends largely on the technologies of protein separation and identification.

* To whom correspondence should be addressed. Fax: 81-426-77-2525. E-mail: [email protected].
† Tokyo Metropolitan University.
‡ These authors contributed equally to this work.
§ MEXT.
| Tokyo University of Agriculture and Technology.

10.1021/pr025551y CCC: $25.00 © 2003 American Chemical Society

Among the current technologies, two-dimensional polyacrylamide gel electrophoresis (2DGE) followed by mass spectrometry (MS) is the most widely used for protein separation and identification. 2DGE allows the resolution of thousands of proteins in a single analysis. However, protein identification is essentially performed spot by spot and requires multiple time-consuming manual steps for digestion, extraction, and pretreatment for the MS analysis. In addition, the experimental 2DGE procedure has not yet been automated, and some cellular proteins, such as those with extreme isoelectric points (pI) or molecular weights (Mr), and membrane-associated proteins are rarely found in 2DGE studies. These technical limitations prevent high-throughput, large-scale protein analyses, a key issue in proteomics.

Almost two decades ago, we presented an automated two-dimensional high-performance liquid chromatography (2DLC) technique for the systematic separation of very complex protein or peptide mixtures (refs 1-3) and applied this system to the sequence analysis of very large proteins and protein genetic variants (refs 4-6) and to profiling the proteins expressed in the developing rat cerebella (refs 7-9). This technique has since been refined into a more sophisticated approach by incorporating new MS technology, e.g., replacing the conventional UV detector with MS using an electrospray ionization (ESI) source (ref 10). Accordingly, the analytical columns were downsized to reduce the LC flow rate, and a small reversed-phase “trap” column was inserted between the two analytical columns to remove the salts from the peptides and proteins eluted from the first ion-exchange column before the MS step. This new configuration, coupled with high-resolution Q-Tof hybrid mass spectrometry and an automated data processing system, provided a fully integrated analytical platform for non-gel-based, large-scale protein identification. A similar LC-MS technology, combined with a new strategy for protein identification using MS, called the “peptide sequence tag”, served as an innovative tool for the mass identification of proteins in the functional ribosome complex and for a large number of proteins expressed in S. cerevisiae (refs 11, 12).

Here, we have applied our 2DLC-MS/MS system to the identification of proteins expressed in a model organism, C. elegans, for which genome to proteome information is present in a variety of public databases (see ref 13 for review). Analyses of the soluble and insoluble fractions of the worm revealed 1616 protein species, including 242 membrane proteins with predicted transmembrane segment(s). The expression levels and the cellular roles of the identified proteins, as well as the possibility of comprehensive detection of post-translational modifications, are also discussed.

Journal of Proteome Research 2003, 2, 23-35. Published on Web 10/16/2002

Materials and Methods

Materials. The wild-type (strain N2) C. elegans was cultured in a liquid medium at 20 °C (ref 14) with E. coli HB101 as food. Mixed growth phase cultures of worms were harvested and separated from the bacteria by centrifugation in a 30% sucrose solution. The floated worms were collected, washed with 0.1 M NaCl, and stored at -20 °C until use.

Sample Preparation. The worm cells were lysed by sonication in 50 mM Tris-HCl buffer (pH 8.0) containing a protease inhibitor (PI) cocktail (Sigma, St. Louis, MO), and soluble and insoluble protein fractions were prepared by ultracentrifugation at 100 000g for 1 h at 4 °C. The soluble extract was precipitated immediately by the addition of trichloroacetic acid to a final concentration of 10% to avoid possible artificial proteolysis and was delipidated by acetone treatment. The insoluble fraction was resuspended in the same buffered PI cocktail and was precipitated again by ultracentrifugation to remove the soluble proteins. All preparation steps were carried out in an ice bath. The soluble and insoluble protein precipitates were each dissolved in 7.0 M guanidine-HCl buffered with 500 mM Tris-HCl (pH 8.0) containing 10 mM EDTA. The preparations were reduced by the addition of 1 mM dithiothreitol (DTT) and were alkylated with 10 mM iodoacetamide under a nitrogen atmosphere. The S-carbamoylmethylated proteins were dialyzed against 10 mM Tris-HCl (pH 8.0) to remove the excess reagents and then were digested overnight at 37 °C with sequence grade modified trypsin (Promega, Madison, WI) at an enzyme-substrate ratio of 1:25 (w/w). The digests were acidified to pH 2 by the addition of an aliquot of concentrated HCl, and the precipitates thus formed were removed by centrifugation. The supernatant was adjusted to pH 8 with aqueous ammonia and was subjected immediately to the 2DLC-MS/MS analysis.

Automated 2DLC-MS/MS Analysis.
The tryptic digest was analyzed by the microscale 2DLC-MS/MS system as described (ref 10). The chromatography was carried out by a combination of first-dimensional anion-exchange (AE) and second-dimensional reversed-phase (RP) LC, which were synchronized by a computer program. The first AE-LC was performed on a Bioassist-Q column (2 mm i.d. × 35 mm L, 5 µm particles, TOSOH, Tokyo), and the second RP-LC was performed on a Mightysil-C18 column (320 µm i.d. × 100 mm L, 3 µm particles, Kanto Chemicals, Tokyo). The system was also equipped with a small “trap” column packed with Mightysil-C18 (1 mm i.d. × 5 mm L), which was inserted between the two analytical columns through a 6-way column-switching valve. This trap column served to remove the salts from the peptides eluted from the first AE-LC, which was important for the efficient ionization of peptides in the MS analysis and for reproducible analyses. Thus, the peptide mixture was first separated on the AE column by 10 discrete step-gradients of NaCl from 0 to 400 mM (0, 20, 40, 60, 80, 100, 120, 160, 200, and 400 mM NaCl in 25 mM Tris-HCl, pH 8.0) at a flow rate of 100 µL/min. The peptides eluted by each step were captured once on the trap column for desalting and then were separated further on the RP column by a 70-min linear gradient (5-40%) of acetonitrile in 0.2% formic acid at 5 µL/min. The eluted peptides were sprayed directly into a quadrupole time-of-flight (Q-TOF) hybrid mass spectrometer (Q-Tof 2, Micromass UK Ltd., Manchester, U.K.). The total analysis time for a single 2DLC process was 16 h.

Protein Identification by Tandem Mass Spectrometry and Data Analyses. The peptides eluted from 2DLC were detected in the MS mode to select a set of precursor ions for a data-dependent, collision-induced dissociation mass spectrometric (MS/MS) analysis, and every 1 or 20 s the four largest signals selected were subjected to the MS/MS analysis.
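The data-dependent selection described above, in which the four largest MS-mode signals are passed on to MS/MS, can be illustrated with a minimal Python sketch. The function name, the peak representation, and the example values are hypothetical and are not taken from the acquisition software; rules that acquisition software typically applies in addition (such as charge-state filtering) are not modeled.

```python
from typing import List, Tuple

# A peak is represented as (m/z, intensity); this layout is an assumption.
Peak = Tuple[float, float]


def select_precursors(survey_scan: List[Peak], top_n: int = 4) -> List[Peak]:
    """Return the top_n most intense peaks of one survey (MS-mode) scan."""
    ranked = sorted(survey_scan, key=lambda peak: peak[1], reverse=True)
    return ranked[:top_n]


# Made-up survey scan with six signals.
scan = [
    (402.7, 120.0),
    (515.3, 980.0),
    (633.8, 455.0),
    (710.4, 60.0),
    (845.9, 300.0),
    (921.5, 15.0),
]
selected = select_precursors(scan)
selected_mz = [mz for mz, _ in selected]
```

Here `selected_mz` lists the m/z values of the four most intense signals in decreasing order of intensity.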
The large volume of MS/MS data was acquired by the MassLynx software (Micromass, Manchester) and was converted by the ProteinLynx software (Micromass) to text files listing the mass values of the parent ions and the mass values and intensities of the fragment ions. Using these data, the MASCOT software (Matrix Science Ltd., London) searched a genome sequence database for peptide assignment. We used the wormpep 66 database, which was constructed and supplied by The Wellcome Trust Sanger Institute (Cambridge, U.K.). The database search was performed with appropriate parameters. In a typical instance, the sole fixed modification parameter was carbamoylmethylation (Cys), and the variable modification parameters were pyro-Glu, acetylation (protein N-terminus), oxidation (Met), and phosphorylation (Ser, Thr, and Tyr). The maximum number of missed cleavages was set at 3, with a peptide mass tolerance of ±500 ppm. Peptide charge states from +2 to +4 and an MS/MS tolerance of ±0.5 Da were allowed. From the search results file (dat file), the parameters of the top-ranked candidate(s), such as the amino acid sequence of the peptide, the coding sequence (CDS) identifier (such as F52D10.3), the probability (total score, threshold, and the difference), the modification, and so forth, were extracted as a text file in comma-separated values (csv) format, using an in-house program named STEM. The results were imported into Microsoft Excel for further analysis.

Figure 1. Two-dimensional display of the proteome predicted from the genome sequence (Wormpep66) (closed circles, ●; 20 219 entries) and experimentally obtained in this study (open circles, ○; 1616 proteins). Molecular mass (Mr) and isoelectric point (pI) of the whole proteome were calculated from their amino acid sequences without considering post-translational modifications. The y-axis is presented on a logarithmic scale.

We basically selected the candidate peptides with probability-based Mowse scores (total score) that exceeded their thresholds, indicating a significant (or extensive) homology (p < 0.05), and referred to them as “hits”. The criteria were based on the manufacturer’s definitions (Matrix Science, Ltd.) (ref 15). Furthermore, we set stricter criteria for protein assignment, as follows.
(1) Any peptide candidate with an MS/MS signal number of less than 2 was eliminated from the “hit” candidates, regardless of the match score (total score minus threshold).
(2) Proteins with match scores exceeding 10 (p < 0.005) were referred to as “identified”.
(3) If the protein was identified with a single peptide candidate having a match score lower than 10, then the original MS/MS spectrum was carefully inspected to confirm that the assignment was based on three or more y- or b-series ions.
(4) For all candidates, if an individual candidate carried multiple modifications, its MS/MS spectrum was visually inspected for confirmation. If all of the modifications were evident from the MS/MS signals, then the peptides were included in the “hit” peptides.
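The assignment criteria above can be summarized as a small decision function. The following Python sketch is only an illustration and is not the in-house STEM program: the class, the field names, and the three outcome labels are hypothetical, and the manual spectrum inspections of criteria (3) and (4) are reduced to a single "inspect" outcome. It also applies the rules to one peptide candidate at a time, which is a simplification of the protein-level wording.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One top-ranked peptide candidate (field names are hypothetical)."""
    total_score: float      # probability-based Mowse score
    threshold: float        # significance threshold for this search
    n_msms_signals: int     # number of MS/MS signals supporting the match
    n_modifications: int = 0


def match_score(c: Candidate) -> float:
    """Match score as defined in the text: total score minus threshold."""
    return c.total_score - c.threshold


def classify(c: Candidate) -> str:
    """Return 'rejected', 'inspect', or 'identified' for one candidate."""
    # Criterion (1): fewer than 2 MS/MS signals is never a hit.
    if c.n_msms_signals < 2:
        return "rejected"
    score = match_score(c)
    # A hit must exceed its threshold (p < 0.05).
    if score <= 0:
        return "rejected"
    # Criterion (4): multiple modifications require visual inspection.
    if c.n_modifications > 1:
        return "inspect"
    # Criterion (2): match score above 10 counts as identified.
    if score > 10:
        return "identified"
    # Criterion (3): weaker hits need manual spectrum inspection.
    return "inspect"


strong = classify(Candidate(total_score=45, threshold=30, n_msms_signals=12))
weak = classify(Candidate(total_score=35, threshold=30, n_msms_signals=12))
sparse = classify(Candidate(total_score=50, threshold=30, n_msms_signals=1))
```

In this sketch, `strong` has a match score of 15 and is labeled identified, `weak` has a match score of 5 and is flagged for inspection, and `sparse` is rejected because it has only one MS/MS signal.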

Results and Discussion

Protein Identification. The mixed growth phase worms were disrupted by sonication, and the homogenate was separated into soluble and insoluble fractions by ultracentrifugation. Both protein mixtures were separately digested with trypsin after S-carbamoylmethylation. The digests of about 300 µg of protein were subjected to the 2DLC-MS/MS system. Each fraction was analyzed twice under the same conditions. The fully automated LC separation of the tryptic peptides followed by the MS/MS analysis generated a huge amount of spectral data. Depending mainly on the population of the constituent peptides, 8000-14000 MS/MS analyses were carried out in a single operation, which lasted about 16 h. Among the spectra, about one-fourth gave reliable candidate peptides (2200-3700) by searching the sequence database, wormpep. The peptides were assigned to 1700-2500 original proteins (genes). Several peptides derived from a single protein were detected by a single analysis run. About 2.0-3.5 peptides were assigned per protein on average. However, 40-60% of the proteins were assigned by single peptide hits. If even a single peptide was identified that satisfied the criteria described in Materials and Methods, then we considered that its original protein existed in the fraction. Thus, after removing the redundant assignments, 600-1000 proteins were identified by a single analytical run.

On the other hand, since a single, short sequence can belong to multiple proteins (genes), such as multi-copy gene products, spliced variants, family gene products, and in some cases unrelated proteins, multiple candidates were assigned from a single MS/MS spectrum. Considering the possibility that all candidate proteins might exist in the source, all candidates were listed as “identified” proteins. We measured each of the two protein fractions twice. In the cases where a protein could be detected once in each analysis, the protein was regarded as an “identified” protein as well as an overlapped protein. By the same criteria, the worm proteome was composed of the identified proteins of the two fractions. Combining the 820 soluble-fraction proteins and the 1269 insoluble-fraction proteins and subtracting those that overlapped between the two fractions gave a composite proteome of 1616 proteins. The proteins identified thus far are listed in the Supporting Information.

To obtain detailed information on each protein, we searched the individual protein (gene) in the Proteome BioKnowledge Library of C. elegans, WormPD, maintained by Proteome Inc. (presently by Incyte Genomics Inc. as a pay service) (refs 16, 17). A total of 1593 proteins (genes) of the identified proteins were listed in the database. The number of transmembrane (TM) regions was predicted by the SOSUI program (refs 18-20). To distinguish between a signal peptide and a TM region, the SOSUI signal (β-version) was used. Detailed results about their categorization are described below.

Figure 2. Codon adaptation index (CAI) of the genes predicted from the C. elegans genome (gray bars) and the proteins identified in this study (black bars). The CAI was obtained from the WormPD database (19 032 entries). The figure also indicates the average number of peptides used to identify proteins with different CAI ranges (shown by triangles). Although the 2DLC-MS/MS system tends to identify proteins with CAI > 0.5, i.e., highly and moderately expressed proteins, with relatively high recovery, and to detect the peptides derived from proteins with larger CAI values more frequently, it also allows the identification of “low abundance” proteins in the cell, with CAI < 0.5 (more than 500 proteins in this analysis).

Table 1. Putative Transmembrane Proteins Identified in the C. elegans Proteome

Isoelectric Point (pI) and Molecular Mass (Mr) of the C. elegans Proteome. Unlike nucleic acids, proteins have quite a variety of physicochemical characteristics, in terms of their molecular mass (Mr) and isoelectric point (pI). This is a major issue in protein separation for expression profiling. The LC-based technology presented herein is expected to identify proteins regardless of these physicochemical parameters, as the method depends on the analysis of peptide fragments, rather than proteins. Therefore, the pI and Mr distributions of the 1616 identified proteins were calculated without considering any post-translational modification and were compared with those of the proteome predicted from the genome sequence (from wormpep 66). The comparison revealed that the two distribution sets almost overlapped (Figure 1). The most acidic protein identified (pI 3.48) was encoded by the ZK84.1 gene (human mucin like), and conversely, the most basic one was Y38F2AR9 (pI 12.41). The smallest gene product detected was C27A2.2B (Mr = 6.0 kDa), and the largest one was K07E12.1 (CAM, Mr = 1369 kDa). Only a small number of predicted gene products fell outside the pI range of the proteins detected in this experiment (19 gene products on the acidic side and 22 on the basic side). However, the low Mr proteins seemed to be difficult to identify (140 proteins were smaller than the smallest protein detected here), probably because the small proteins generated a relatively small number of peptide fragments. In general, 2DGE can detect proteins with an Mr between 8 and 200 kDa and a pI between 4 and 10. In fact, earlier studies of the C. elegans proteome using 2DGE (refs 21-23) identified only a small number of proteins larger than 180 kDa or proteins with pI values higher than 10. Thus, the 2DLC-MS/MS system can detect more than 99% of the proteins predicted from the genome sequence of C. elegans, as far as their Mr and pI are concerned.
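As an illustration of how an Mr value can be calculated from an amino acid sequence without considering post-translational modifications, the following minimal Python sketch sums average residue masses and adds one water molecule. This is a generic calculation, not the authors' program: the function name is hypothetical, the residue masses are standard average values, and the pI calculation is not covered.

```python
# Standard average residue masses in Da (amino acid minus one water).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water for the free N- and C-termini


def molecular_mass(sequence: str) -> float:
    """Average molecular mass (Da) of an unmodified polypeptide chain.

    Raises KeyError for characters outside the 20 standard residues.
    """
    residues = sequence.upper()
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in residues) + WATER_MASS


# Free glycine and the dipeptide Gly-Ala as small sanity checks.
glycine_mass = molecular_mass("G")
dipeptide_mass = molecular_mass("GA")
```

Dividing the result by 1000 gives the value in kDa, the unit used for Mr in the text.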

Table 2. Secreted or Targeted Proteins Predicted in the C. elegans Proteome

Expression Level. The codon adaptation index (CAI) is widely used as an indicator of the protein expression level. This index was shown to be effective not only in unicellular organisms such as E. coli, S. cerevisiae (ref 24), Haemophilus influenzae, and Mycobacterium tuberculosis (ref 25), but also in multicellular animals, e.g., the fruit fly Drosophila melanogaster and C. elegans (ref 26). The positive relationship between the expression level at the transcription level (EST) and the frequency of favored codon (Fav) usage in C. elegans was directly demonstrated with experimental data on over 8000 genes, rather than by a statistical prediction. Therefore, we assessed the 1616 proteins identified by this study in terms of their possible abundance in the worm, using the CAI as a convenient indicator. The CAIs of 1549 out of the 1616 identified proteins were found in the WormPD database. First, the CAI distribution of the worm proteins was compared with that of E. coli, and the worm proteins were categorized into three expression level classes with the following CAI ranges: high (CAI ≥ 0.7), moderate (0.5 ≤ CAI < 0.7), and low (CAI < 0.5), which were determined according to the method of Sharp (ref 27). The CAI distribution of the worm proteome, shown in Figure 2, suggested that only 16% of the whole proteome (about 3000 proteins) is highly or moderately expressed, and the remaining 84% has a low expression level (CAI < 0.5). All of the proteins identified in this study are classified in terms of their CAI value (Figure 2). In the WormPD, there are 506 “highly expressed” proteins (CAI ≥ 0.7), of which 336 (66%) were detected in this study. Likewise, 666 (30%) of the 2228 “moderately expressed” proteins and 549 (3.4%) of the 16 298 “low abundance” proteins were detected. Thus, the LC-based technology presented here favored the identification of proteins with higher abundance, yet it appeared to have a wider dynamic range than conventional 2DGE, as most of the “low abundance” proteins had not been identified by the previous 2DGE studies (refs 22, 23).

Most LC-based protein identification technologies, including the one reported here, are based on the analysis of peptide fragments derived from the proteolysis of a complex protein mixture. Thus, the number of “peptide hits” used to identify a protein is thought to relate primarily to the abundance of the protein in the sample mixture and to the protein length (the number of peptides generated by tryptic digestion). A plot of the number of peptide hits versus the CAI range of the identified protein (Figure 2) shows some correlation between these two parameters, in which the number of peptide hits increased with the CAI values of the identified proteins. Proteins of low abundance with CAI < 0.5 were identified with 1.82 peptides on average, and those of medium abundance (0.5 ≤ CAI < 0.7) had about 4.30 peptides. Many highly abundant proteins with CAI ≥ 0.7 were identified on the basis of 11.9 peptides (Figure 2). Meanwhile, there was little bias detected in terms of the length of the identified proteins for the CAI value (data not shown). Therefore, the number of peptide hits roughly reflects the abundance of proteins in the sample mixture. In this study, the most frequently detected proteins were the myosin heavy chains F11C3.3 (unc-54: CAI = 0.835) and K12F2.1 (myo-3: CAI = 0.751), paramyosin (F07A5.7: CAI = 0.828), the vitellogenins K07H8.6 (vit-6: CAI = 0.830) and C42D8.2 (vit-2: CAI = 0.838), actins (M03F4.2A: CAI = 0.866 and T04C12.4: CAI = 0.862), and glutamate dehydrogenase (ZK829.4: CAI = 0.805). The average CAI value of these proteins reached 0.826. However, this should not always be the case, because 60% of the identified proteins were assigned by only one or two peptide hits. Therefore, the semiquantitative information obtained by this system needs to be supplemented by additional experiments.

Insoluble Proteins with Transmembrane Segments and Secreted/Targeted Proteins. Insoluble proteins are difficult to analyze by conventional technologies. To identify the insoluble proteins efficiently, we prepared the crude precipitates by centrifuging the worm extracts and digesting them with trypsin, and the soluble peptide fragments thus produced were analyzed by the 2DLC-MS/MS system (see Materials and Methods).
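Because the whole approach rests on tryptic peptides, and the number of peptides a protein can yield depends on its length, a minimal in-silico digestion sketch in Python may help to make this concrete. The cleavage rule used here (cut after K or R, but not before P) is a common convention assumed for illustration; the text does not state the exact rule applied by the search software. The function name and the example sequence are hypothetical.

```python
import re
from typing import List


def tryptic_peptides(sequence: str, max_missed: int = 0) -> List[str]:
    """List tryptic peptides, allowing up to max_missed missed cleavages.

    Assumed rule: cleave C-terminal to K or R unless the next residue is P.
    Splitting on a zero-width pattern requires Python 3.7 or later.
    """
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", sequence) if f]
    peptides = []
    for start in range(len(fragments)):
        # Join 1 .. max_missed + 1 consecutive fragments.
        last = min(start + max_missed, len(fragments) - 1)
        for end in range(start, last + 1):
            peptides.append("".join(fragments[start:end + 1]))
    return peptides


# Made-up sequence: the R-P bond is not cleaved under the assumed rule.
complete = tryptic_peptides("MKRPAAKG")
with_one_missed = tryptic_peptides("MKRPAAKG", max_missed=1)
```

With no missed cleavages the example sequence yields three peptides; allowing missed cleavages (the database search in this study allowed up to 3) enlarges the list of candidate peptides that must be considered.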
Among the 1269 proteins identified in this insoluble fraction, 149 proteins were identified as having two or more transmembrane (TM) segments by the SOSUI program. In the soluble fraction, we also found 40 proteins with multiple TM segments. In addition to these proteins with multiple TM segments, many proteins appeared to have single TM-like segments, i.e., a stretch of hydrophobic amino acids. We analyzed these segments to distinguish whether they were regarded as a TM segment or a signal sequence for protein secretion or targeting, using a SOSUI program (SIGNAL, β version). This analysis predicted an additional 56 TM proteins in the insoluble fraction and 38 additional TM proteins in the soluble fractions. After removing the redundant protein identifications in the soluble and insoluble fractions, 242 proteins were finally predicted to be TM proteins (Table 1). Likewise, 110 proteins appeared to be secreted or targeted to specific cellular organelles, from their predicted signal sequences (Table 2). In this experiment, many TM proteins were found in the soluble fraction, and conversely, many soluble proteins were found in the insoluble fraction. This is partly because of the cross-contamination of our preparation, which could be estimated as ∼20% from the overlapped protein identification, and partly because of the interactions between soluble and insoluble proteins or between proteins and other cellular components. Among the 242 TM proteins assigned in this study, 111 were hypothetical proteins without a known homologous protein. The remaining 131 proteins include transporters, channels, cell surface receptors, structural proteins, and other membrane32

Journal of Proteome Research • Vol. 2, No. 1, 2003

Mawuenyega et al. Table 3. Cellular Roles of the C. elegans Proteins Identified by the 2DLC-MS/MS System cellular role

WormPD

signal transduction small molecule transport protein modification Pol II transcription lipid, fatty acid and sterol metabolism protein degradation cell structure protein synthesis chromatin/chromosome structure energy generation carbohydrate metabolism other metabolism amino acid metabolism RNA processing/modification DNA synthesis vesicular transport cell stress cell cycle control other meiosis nucleotide metabolism cell adhesion mitosis DNA repair cell polarity RNA splicing differentiation protein folding cytokinesis protein translocation nuclear-cytoplasmic transport axonal transport RNA turnover membrane fusion recombination Pol III transcription asymmetric cell division cell elongation dosage compensation aging phosphate metabolism protein complex assembly Pol I transcription cell wall maintenance mitochondrial transcription total having any role unknown not annotated total entries

1641 563 467 391 315 253 225 201 152 122 121 116 106 102 92 88 84 78 74 70 69 61 61 57 55 55 52 50 29 29 24 20 16 15 13 12 11 11 8 7 6 6 5 1 1 5935 4816 14 549 19 365

identified proteins (%)

45 (2.7) 57(10.1) 50 (10.7) 9 (2.3) 32 (10.2) 58 (22.9) 103 (45.8) 100 (49.8) 74 (48.7) 60 (49.2) 55 (45.5) 17 (14.7) 32 (30.2) 45 (44.1) 5 (5.4) 30 (34.1) 31 (36.9) 10 (12.8) 20 (27.0) 11 (15.7) 14 (20.3) 12 (19.7) 18 (29.5) 4 (7.0) 17 (30.9) 21 (38.2) 12 (23.1) 29 (58.0) 10 (34.5) 14 (48.3) 7 29.2) 0 (0.0) 6 (37.5) 5 (33.3) 0 (0.0) 0 (0.0) 0 (0.0) 0 (0.0) 2 (25.0) 0 (0.0) 1 (16.7) 3 (50.0) 0 (0.0) 0 (0.0) 0 (0.0) 1019 791 799 26 1616

bound enzymes (Table 1). In the worm proteome predicted from the genome, proteins with 7 TM segments, such as G-protein coupled receptors (GPCR), are significant in terms of their number of genes as compared to other eukaryotic model organisms.28 This study identified several GPCRs expressed in the worm (Table 1), but probably failed to detect many others due to their low abundance in the cell. Therefore, a more efficient strategy to concentrate membrane proteins, such as an affinity-based capture of their glycosylated moieties, will be necessary to detect minor membrane components. Cellular Role. The cellular roles of the identified 1616 proteins were searched in the protein database, WormPD, maintained by Incyte Genomics, Inc. (Table 3). As expected from the current status of gene annotation in multicellular organisms, the cellular role of about half of the identified proteins were unknown. The number of gene products with any cellular role was 799 (49.4%) out of 1616 proteins. In the

research articles

Profiling of C. elegans Proteins by 2DLC-MS/MS Table 4. List of N-Terminally Acetylated Proteins Predicted in the C. elegans Proteome CDS identifier

Wormpep accession

SWISS-PROT TrEMBL acc

B0041.4 B0393.1 C07H6.1 C16H3.2 C18D11.2 C18E9.6 C30C11.4 C30F8.2 C35B1.5 C44E4.4 C44F1.3 D1054.2 D1069.3 D2096.8 F01F1.12 F01F1.9 F13D12.7 F13H8.7 F20D6.4 F22B5.7 F25D7.1 F32A11.1 F36A2.6 F43E2.7 F47B10.7 F52D10.3A F52D10.3B F52H3.7B F53G12.1 F54C1.7 F54C9.1 K12F2.1 M04B2.3 M163.3 R06C1.4 R07E5.7 R09B3.2 R09B3.3 R10E11.2 T01C8.5 T05D4.1 T05E11.1 T05G5.10 T12D8.6 T21H3.3 T25C8.2 T27E9.2 Y37E3.7 Y38F2AL.4 Y39B6A.OO Y50D7A.7 Y57G11C.10 Y69E1A.5 Y87G2A.8 ZK418.4 ZK546.14 ZK673.7 ZK770.3 ZK945.2

CE07669 CE00854 CE00757 CE08236 CE18513 CE05298 CE00103 CE25796 CE16903 CE08718 CE02163 CE05521 CE17615 CE04306 CE01225 CE01235 CE02186 CE02641 CE07109 CE20707 CE09629 CE17737 CE09945 CE10348 CE03357 CE03389 CE28235 CE29330 CE11006 CE11052 CE02249 CE12204 CE12388 CE12450 CE18119 CE00672 CE16307 CE16308 CE06290 CE07462 CE16341 CE06360 CE00503 CE16403 CE13902 CE16463 CE14265 CE26658 CE06290 CE21680 CE26144 CE14944 CE22812 CE24687 CE25690 CE02914 CE01719 CE15413 CE01733

SW:O02056 SW:P46769 TR:Q94169 TR:Q9XTZ5 SW:Q18090 SW:Q05036 TR:O45060 TR:O01806 TR:Q18625 SW:Q27488 TR:O44790 TR:Q19007 SW:P46563 TR:Q19087 SW:P17343 TR:Q19437 TR:Q19650 TR:O61442 TR:Q93561 TR:O62198 TR:Q9XVP0 TR:O02093 SW:Q20507 SW:Q20655 TR:O01803 TR:P91328 SW:Q20751 TR:Q21440 TR:Q21501 TR:Q93901 TR:O62337 TR:Q21821 TR:O45712 TR:O45713 SW:P34546 SW:Q22067 TR:O45747 SW:P49041 SW:P34563 TR:Q9XVI9 TR:O16305 TR:O45815 TR:O45864 TR:Q9GR59

TR:Q21449 TR:Q9XW37 TR:Q9U1Q2 TR:Q23482 SW:Q23525 SW:Q09665 TR:O01634 SW:Q09583

protein

modified residue

rpl-4 ribosomal protein L1 rps-0 40S ribosomal protein lig-4 lec-9 sugar-binding protein Acyl CoA binding protein hypothetical Msi3p vaculolar ATPase hypothetical RNA-binding protein lec-4 galactoside binding lectin pas-2 proteasome component C3 hypothetical hypothetical Fructose-biphosphate aldolase hypothetical gpb-1 guanine nucleotide-binding protein beta subunit beta-ureidopropionase (rat) srp-7 serine protease inhibitor zyg-9 elongation factor hypothetical hypothetical rps-15 40S ribosomal protein S15 hypothetical acyl-CoA-binding protein ftt-2 14-3-3 protein hypothetical lec-2 galactoside-binding lectin RAS-related protein calcium binding protein initiation factor 5A myo-3 myosin heavy chain Human AF-9 protein like his-24 histone H1 RNA recognition motif. (aka RRM, RBD, or RNP domain) hypothetical RNA recognition motif. (aka RRM, RBD, or RNP domain) RNA recognition motif. (aka RRM, RBD, or RNP domain) vha-2 Vacuolar ATP synthase subunit aminotransferase Fructose-bisphosphate aldolase class-I 40S ribosomal protein S5 Initiation factor 5A EF hand calmodulin act-5 Actins ubiquinol-cytochrome c reductase complexn 11 KD protein hypothetical hypothetical hypothetical hypothetical gdi-1 GDI-1 GDP dissociation inhibitor Phosphatidylethanolamine-binding protein glucose-6-phosphate isomerase lin-37 hypothetical troponin C hypothetical pas-7 proteasome component (A-type)

A S A A S A S G A G S G M A A A S S A S M A A A S S S A G A S S M S A S S S S S A A S M A M S A S A S M S S S A G M S

genome database, the percentage of proteins annotated with any cellular role was 24.9% (4816/19365 entries). Thus, the proteins identified in this study were relatively well characterized compared with the total proteome of C. elegans. This may be because the identified proteins were relatively abundant in the worm cells (see the Expression Level section). As expected, most of the identified proteins were structural components of the cell (such as actins, myosins, tubulins, collagens, filament proteins, and major sperm proteins) and of chromatin (such as histones), metabolic and energy generation proteins, and proteins for protein folding, protein translocation, RNA processing, and protein synthesis, such as initiation and elongation factors and a series of ribosomal proteins (Table 3). For these categories, about half of the proteins listed in the protein database were detected in this study (Table 3). On the other hand, the products of many genes, such as those for signal transduction, transcription factors, and transporters, were rarely found in this study.

Post-Translational Modifications. The direct analysis of peptides using 2DLC-MS/MS allowed the identification of many peptides with post-translational modifications, such as N-terminal acetylation and site-specific phosphorylation in


Table 5. Putative Phosphoproteins in the C. elegans Proteome and the Sites of Phosphorylation

CDS identifier | Wormpep accession | SWISS-PROT/TrEMBL acc | protein | sequence (a) | match score | position(s)
C04F6.5 | CE03925 | SW:Q11177 | alcohol/ribitol dehydrogenase | HDLDYLK | 0.491 | 493
C50E3.3 | CE08896 | TR:Q22932 | hypothetical | CGPSELTQCNSLK | 2.3345 | 46
D1054.10 | CE05528 | TR:Q18943 | hypothetical | SNSRDSDEKWECR (b) | 6.209 | 47 or 49, 52
F11D11.2 | CE15793 | TR:O62151 | hypothetical | LEIQFVSSK (c) | 3.5413 | 195 or 196
F56A8.1 | CE16121 | TR:O45572 | hypothetical | TLIVWAYNYSIK | 2.7948 | 982
K10H10.4 | CE16254 | TR:O45681 | hypothetical | VHSEEQGFIEEKPR | 15.3275 | 92

(a) Phosphorylated residues are displayed in boldface.
(b) Amino acid residues indicated in italics could not be defined from its MS/MS spectrum.
(c) This sequence could alternatively be derived from F38A1.2/K08A2.6/R10H1.3/W07E6.6/Y45F10D.1/Y48G10A.5/Y66D12A.2/ZC247.4/ZK262.5/B0303.5/C01B12.6/F56A6.3/T02G5.5/T20H12.2/Y47H9C.3/Y6B3B.8/ZK218.2.

proteins. These natural modifications were detected by the analysis of peptide fragments with additions of 42 and 80 mass units, respectively, to the polypeptide masses expected from the amino acid composition. In the data derived from analyses of the collision-induced dissociation (CID) fragmentation spectra of the almost 5,400 peptides obtained in this study, 59 peptides were found to be N-terminally acetylated (Table 4). The acetylation took place at the initiation Met residue in 7 cases and at the second residue, next to the initiation Met, in 52 cases. The only second residues that were acetylated were Gly, Ala, and Ser, with incidences of 5, 21, and 26, respectively. This is consistent with the earlier notion that N-terminal acylation tends to occur at the initiation Met or at relatively small second residues.

In addition to N-terminal acetylation, the Mascot software predicted potential phosphorylation in 58 peptides. In these peptides, 85 residues were assumed to carry phosphate groups: 51 at Ser, 25 at Thr, and 9 at Tyr. Among the phosphoprotein candidates were a transcription factor (F55D12.3), tnc-1 (F54C1.7), a cytochrome P450 (T10B9.8), ptr-20 (Y53F4B.28), the protein tyrosine kinase Ark-1 (C01C7.1), and a hypothetical protein (K10H10.4). However, careful inspection of the MS/MS data suggested that only a small number of peptides provided MS/MS signals sufficient to assign the phosphopeptide sequence, including the site of phosphorylation (shown in Table 5). Although these potential post-translational modifications must await additional experiments for confirmation, they will provide further annotations for the C. elegans proteome and information toward the elucidation of the functions of these proteins.
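As a rough illustration of how additions of about 42 and 80 mass units can be recognized, the sketch below compares an observed peptide mass with the mass calculated from the sequence and reports which modification, if any, accounts for the difference. It also lists the Ser, Thr, and Tyr positions in a peptide that could carry a phosphate. The monoisotopic residue masses and the acetyl and phosphate shifts are standard values. The function names, the 0.02 Da tolerance, and the peptide start positions (47 and 189, chosen so that the output agrees with the positions listed in Table 5) are assumptions made for this example. This is not the scoring procedure of the Mascot software used in the study.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water per free peptide (N-terminal H + C-terminal OH)

# Mass added by each modification (Da): nominally 42 and 80 mass units.
MOD_SHIFT = {"acetyl": 42.01057, "phospho": 79.96633}


def peptide_mass(seq):
    """Monoisotopic mass of the unmodified, neutral peptide."""
    return sum(RESIDUE_MASS[res] for res in seq) + WATER


def explain_shift(seq, observed_mass, tol=0.02):
    """Name the single modification that explains the mass difference.

    Returns "unmodified", a key of MOD_SHIFT, or None if nothing fits.
    The tolerance `tol` (Da) is an illustrative assumption.
    """
    delta = observed_mass - peptide_mass(seq)
    if abs(delta) <= tol:
        return "unmodified"
    for name, shift in MOD_SHIFT.items():
        if abs(delta - shift) <= tol:
            return name
    return None


def candidate_phosphosites(seq, start):
    """Protein positions of Ser/Thr/Tyr in a peptide whose first residue
    sits at protein position `start` (1-based)."""
    return [start + i for i, res in enumerate(seq) if res in "STY"]


if __name__ == "__main__":
    seq = "VHSEEQGFIEEKPR"  # K10H10.4 peptide from Table 5
    observed = peptide_mass(seq) + MOD_SHIFT["phospho"]  # simulated value
    print(explain_shift(seq, observed))
    print(candidate_phosphosites("SNSRDSDEKWECR", 47))
    print(candidate_phosphosites("LEIQFVSSK", 189))
```

A mass difference alone shows only that a modification is present. When a peptide contains several Ser, Thr, or Tyr residues, as in the D1054.10 and F11D11.2 peptides, the fragment ions in the MS/MS spectrum are needed to decide which residue is modified, which is why some positions in Table 5 are given as alternatives.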

Concluding Remarks

We applied a fully integrated protein identification system, comprising a microscale 2DLC connected on-line with a high-resolution ESI-Q-Tof hybrid mass spectrometer and a data-retrieval system, to identify proteins expressed in the model organism, C. elegans. From sub-fractionated preparations of the worm proteins, 1,616 individual protein species were identified. Analyses of their physicochemical properties, such as Mr, pI, and localization, revealed that this LC-based protein identification system is applicable to almost all proteins expressed in C. elegans. At present, the maximum number of proteins that can be identified by a single analysis is about 1000. The selection of a peptide for analysis by CID MS/MS is data-dependent and somewhat irregular, because the number of peptides that coelute at some moment may overwhelm the capability of the spectrometer, which can assess 8 peaks per arbitrary period. Therefore, multiple measurements of the same preparation increase the number of identified proteins, and the prefractionation step, as carried out here, is also effective. Although this system needs further improvement to satisfy the demand for the efficient analysis of the total proteome of a multicellular organism, it should be extremely powerful for the expression profiling of a focused proteome or for functional proteomics, such as the mass identification of the protein components in large protein complexes and membrane domains, as well as affinity-captured sub-proteomes, like glycoproteins and phosphoproteins. Efforts are still in progress to increase the sensitivity of the system further, by incorporating direct nano-flow LC-MS technology with a ReNCon gradient device (29). In general, MS shows poor quantitative properties; however, previous pioneering studies have shown that the use of a stable isotope for differential labeling allows a comparative, quantitative analysis of two preparations by an LC-based technology (30, 31). Thus, further developments of the methodology to extract representative peptides without a population change of the proteome will open the next door toward more detailed information on the proteome dynamics accompanying various biological events.

Acknowledgment. This work was supported in part by Grants for the Integrated Proteomics System Project, Pioneer Research on Genome the Frontier from the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan.

Supporting Information Available: Table of C. elegans proteins identified by the 2DLC-MS/MS system. This material is available free of charge via the Internet at http://pubs.acs.org.

References

(1) Takahashi, N.; Ishioka, N.; Takahashi, Y.; Putnam, F. W. J. Chromatogr. 1985, 326, 407-18.
(2) Takahashi, N.; Isobe, T.; Putnam, F. W. In HPLC of Proteins, Peptides, and Polynucleotides; Hearn, M. T. W., Ed.; Verlag Chemie International: New York, 1991; pp 307-330.
(3) Isobe, T.; Uchida, K.; Taoka, M.; Shinkai, F.; Manabe, T.; Okuyama, T. J. Chromatogr. 1991, 588, 115-23.
(4) Takahashi, N.; Ortel, T. L.; Putnam, F. W. Proc. Natl. Acad. Sci. U.S.A. 1984, 81, 390-4.
(5) Takahashi, N.; Takahashi, Y.; Isobe, T.; Putnam, F. W.; Fujita, M.; Satoh, C.; Neel, J. V. Proc. Natl. Acad. Sci. U.S.A. 1987, 84, 8001-5.
(6) Takahashi, N.; Takahashi, Y.; Blumberg, B. S.; Putnam, F. W. Proc. Natl. Acad. Sci. U.S.A. 1987, 84, 4413-7.
(7) Isobe, T.; Takahashi, N.; Putnam, F. W. In HPLC of Peptides and Proteins; Separation, Analysis and Conformation; Hodges, R. S., et al., Eds.; CRC Press: New York, 1991; pp 835-845.
(8) Taoka, M.; Yamakuni, T.; Song, S. Y.; Yamakawa, Y.; Seta, K.; Okuyama, T.; Isobe, T. Eur. J. Biochem. 1992, 207, 615-20.
(9) Taoka, M.; Isobe, T.; Okuyama, T.; Watanabe, M.; Kondo, H.; Yamakawa, Y.; Ozawa, F.; Hishinuma, F.; Kubota, M.; Minegishi, A.; Son, S. Y.; Yamakuni, T. J. Biol. Chem. 1994, 269, 9946-51.
(10) Isobe, T.; Yamauchi, Y.; Taoka, M.; Takahashi, N. In Protein Analysis; Laboratory Manual; Simpson, R., Ed.; Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY, 2002, in press.
(11) Link, A. J.; Eng, J.; Schieltz, D. M.; Carmack, E.; Mize, G. J.; Morris, D. R.; Garvik, B. M.; Yates, J. R., III. Nat. Biotechnol. 1999, 17, 676-82.
(12) Washburn, M. P.; Wolters, D.; Yates, J. R., III. Nat. Biotechnol. 2001, 19, 242-7.
(13) Kaji, H.; Isobe, T. J. Chromatogr. B. 2002, in press.
(14) Lewis, J.; Fleming, J. In Caenorhabditis elegans: Modern Biological Analysis of an Organism; Epstein, H. F., Shakes, D., Eds.; Academic Press: San Diego, 1995; Vol. 48, pp 3-29.
(15) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551-67.
(16) Costanzo, M. C.; Hogan, J. D.; Cusick, M. E.; Davis, B. P.; Fancher, A. M.; Hodges, P. E.; Kondu, P.; Lengieza, C.; Lew-Smith, J. E.; Lingner, C.; Roberg-Perez, K. J.; Tillberg, M.; Brooks, J. E.; Garrels, J. I. Nucleic Acids Res. 2000, 28, 73-6.
(17) WormPD: http://www.Incyte.com/sequence/proteome/databases/WormPD.shtml.
(18) Hirokawa, T.; Boon-Chieng, S.; Mitaku, S. Bioinformatics 1998, 14, 378-9.
(19) Mitaku, S.; Ono, M.; Hirokawa, T.; Boon-Chieng, S.; Sonoyama, M. Biophys. Chem. 1999, 82, 165-71.
(20) SOSUI: http://sosui.proteome.bio.tuat.ac.jp/sosuiframe0.html.
(21) Bini, L.; Heid, H.; Liberatori, S.; Geier, G.; Pallini, V.; Zwilling, R. Electrophoresis 1997, 18, 557-62.
(22) Kaji, H.; Tsuji, T.; Mawuenyega, K. G.; Wakamiya, A.; Taoka, M.; Isobe, T. Electrophoresis 2000, 21, 1755-65.
(23) Schrimpf, S. P.; Langen, H.; Gomes, A. V.; Wahlestedt, C. Electrophoresis 2001, 22, 1224-32.
(24) Kurland, C. G. FEBS Lett. 1991, 285, 165-9.
(25) Pan, A.; Dutta, C.; Das, J. Gene 1998, 215, 405-13.
(26) Duret, L.; Mouchiroud, D. Proc. Natl. Acad. Sci. U.S.A. 1999, 96, 4482-7.
(27) Sharp, P. M.; Li, W. H. Nucleic Acids Res. 1987, 15, 1281-95.
(28) The Wellcome Trust Sanger Institute: http://www.sanger.ac.uk/Projects/C_elegans/wormpep/.
(29) Natsume, T.; Yamauchi, Y.; Nakayama, H.; Shinkawa, T.; Yanagida, M.; Takahashi, N.; Isobe, T. Anal. Chem. 2002, 74, 4725-33.
(30) Borisov, O. V.; Goshe, M. B.; Conrads, T. P.; Rakov, V. S.; Veenstra, T. D.; Smith, R. D. Anal. Chem. 2002, 74, 2284-92.
(31) Han, D. K.; Eng, J.; Zhou, H.; Aebersold, R. Nat. Biotechnol. 2001, 19, 946-51.

PR025551Y

Journal of Proteome Research • Vol. 2, No. 1, 2003 35