
Defining Absolute Confidence Limits in the Identification of Caulobacter Proteins by Peptide Mass Mapping

Jonathan A. Karty,† Marcia M. E. Ireland,‡ Yves V. Brun,‡ and James P. Reilly*,†

Department of Chemistry, Indiana University, Bloomington, Indiana 47405, and Department of Biology, Indiana University, Bloomington, Indiana 47405

Received March 20, 2002

A derivatization reaction, guanidination, was recently reported that increases MALDI-TOF MS sensitivity toward lysine-terminated peptides. Its application conveys sequence information that can be used as a parameter in peptide mass mapping database searches. This paper presents a systematic study of the impact of guanidination on proteomic analysis of an entire bacterial organelle. Sixty-two 2-D gel isolated proteins from Caulobacter crescentus stalks were studied. A novel computer algorithm, Prodigies, was developed to analyze the data. Absolute confidence limits associated with protein assignments were established using Monte Carlo simulations of database searches. The advantages of guanidination are illustrated using both experimental and theoretical data.

Keywords: proteomics • peptide mass mapping • database searching • MALDI-TOF • Caulobacter

Introduction

Since its introduction in the early 1990s, peptide mass mapping has become a widely used technique in the field of proteomics.1-5 The variety of proteomics applications exploiting peptide mass mapping increases as the number of sequenced genomes expands. One of the major advantages of peptide mass mapping using MALDI-TOF mass spectrometry is its speed. Mass spectra can be acquired in seconds, and database searches can be performed in near real-time, making MALDI-TOF peptide mass mapping amenable to high-throughput applications.6,7 Peptide mass mapping is based on matching a limited number of experimentally measured ion masses to a set of theoretically predicted proteolytic fragment masses.8 Complex samples containing multiple proteins often give rise to a large number of proteolytic fragments. Mass spectra contain a limited number of peaks, and not all components may be represented in MALDI-TOF mass spectra of proteolytic digests of complex mixtures. To ensure that all components in a mixture are adequately represented, a relatively simple sample is required for successful MALDI-TOF peptide mass mapping. For this reason, two-dimensional gel electrophoresis commonly precedes peptide mass mapping analyses of complex samples. Unfortunately, not all proteins isolated by 2D gel electrophoresis can be identified by MALDI-TOF peptide mass mapping. In some cases, slower techniques such as micro-Edman degradation or nano-electrospray ionization tandem mass spectrometry are used to generate sequence information enabling the identification of the components in those samples.8

* To whom correspondence should be addressed at 800 E. Kirkwood Ave., Bloomington, IN 47405. E-mail: [email protected]. † Department of Chemistry, Indiana University. ‡ Department of Biology, Indiana University. DOI: 10.1021/pr025518b.

© 2002 American Chemical Society

Often, the inability to identify a protein in a MALDI-TOF peptide mass mapping experiment is due to an insufficient number of interpretable mass spectral peaks.9 A MALDI-TOF mass spectrum provides only the masses of proteolytic fragments; no sequence information is revealed. A rapid technique that produces additional information from MALDI-TOF mass spectra would facilitate a wide variety of proteomics experiments. One difficulty encountered when using MALDI-TOF for peptide mass mapping of trypsin digests is a sensitivity bias toward arginine-terminated peptides. For example, Krause and co-workers described a 4-15-fold decrease in signal intensity for peptides whose C-terminal arginine residues were exchanged for lysines.10 Furthermore, four groups demonstrated that guanidination can increase the sensitivity of MALDI-TOF MS to lysine-terminated tryptic peptides.11-14 In addition to increasing intensities, guanidination also provides information about the lysine content of peptides. The lysine content can be exploited to refine the database searches performed during peptide mass mapping experiments. Previous studies demonstrated this on a limited number of samples.15 A survey of a much larger set of samples can provide a better appreciation for the global utility of guanidination in peptide mass mapping. An obvious but not clearly resolved issue associated with peptide mass mapping involves the number of matched peptide masses that are needed to identify a protein.4,6,8,9,16,17 The factors that affect the number of matches required include the complexity of the sample, the mass accuracy of peak measurements, and the size of the proteome. Furthermore, the degree of confidence associated with an assignment depends not only on the number of masses in a spectrum that match a particular open reading frame (ORF) but also on the total number of masses in that spectrum. 
Journal of Proteome Research 2002, 1, 325-335. Published on Web 05/08/2002.

We chose to investigate the effect of guanidination on peptide mass mapping from both experimental and statistical points of view. In particular, does guanidination cause a reduction in the number of matches needed to identify proteins? If so, how can this reduction be exploited? Statistical modeling of peptide mass mapping experiments enabled the assignment of absolute confidence limits for protein identifications and an understanding of the effect that guanidination has on these confidence limits. These issues were explored during a study of electrophoretically separated proteins from the stalk organelle of Caulobacter crescentus.

C. crescentus is an aquatic bacterium with a dimorphic life cycle. At an appropriate time, a flagellated, motile swarmer cell sheds its flagellum and replaces it with a new structure called a stalk. The stalked cell begins DNA replication. As the cell divides, two morphologically different progeny are produced. A new flagellated swarmer cell forms at the pole opposite the stalk, while the stalked cell remains unchanged. Upon division, the swarmer cell is free to swim away and the stalk cell can immediately begin division; the swarmer cell does not divide.18 An understanding of stalk function should provide insight into the Caulobacter life cycle, and identifying the proteins present in isolated stalks should facilitate characterization of stalk function. The complete genome of C. crescentus has been published;19 translation of this genome enables the prediction of the proteome and a complete set of tryptic fragment masses. The stalk protein study presented an opportunity to probe the efficacy of guanidination in improving our ability to identify proteins present in a number of gel spots. In an earlier study of the Caulobacter life cycle, approximately two-thirds of the gel spots analyzed were identified.20 We presently demonstrate that guanidination enables an 81% identification rate for stalk gel spots using MALDI-TOF MS alone.
Statistical calculations assign absolute confidence limits to the protein identifications and gauge the impact of guanidination on these limits.

Experimental Section

Cell Growth and Protein Extraction. C. crescentus strain YB2811 was created by introducing a miniTn5lacZ1100 transposon into the pstS gene of the stalk-shedding strain NY111d1.21,22 The new strain was more efficient at shedding its stalks than NY111d1, and a high-purity stalk isolate was obtained. The complete procedures for isolating the stalks and extracting the proteins are described in a concurrent publication.22 Briefly, cells were grown for 3-5 days at 30 °C in phosphate-limited HIGG medium. Cell bodies were pelleted by centrifugation at 17000g for 25 min. The supernatant containing the shed stalks was centrifuged at 48000g for 30 min. The precipitate from the second centrifugation was resuspended in 8 M urea, 2% w/v 3-[(3-cholamidopropyl)dimethylammonio]-1-propanesulfonate (CHAPS), with 0.5% v/v protease inhibitor cocktail (Roche, Indianapolis, IN). The suspension was subjected to four freeze-thaw cycles using a dry ice-ethanol bath. The resulting lysate was centrifuged at 154000g for 2 h, and the supernatant was retained. The protein concentration of the extract was estimated using a modified Lowry assay.23,24

Electrophoresis. A 200 µg portion of protein was suspended in 250 µL of rehydration buffer, containing 0.5% w/v pH 3-10 nonlinear immobilized pH gradient (IPG) buffers (APBiotech, Piscataway, NJ). This mixture was used to rehydrate 13 cm 3-10NL IPG strips (APBiotech). The hydrated strips were loaded into an IPGphor apparatus (APBiotech), and the proteins were electrofocused according to the manufacturer’s instructions.

Journal of Proteome Research • Vol. 1, No. 4, 2002

Karty et al.

The IPG strips containing focused proteins were immersed in an equilibration buffer consisting of 50 mM pH 8.8 tris(hydroxymethyl)aminomethane hydrochloride (Tris·HCl), 30% v/v glycerol, 6 M urea, and 2% w/v sodium dodecyl sulfate (SDS) containing 10 g/L dithiothreitol (DTT) and shaken for 20 min. The free cysteines generated by the reduction of disulfide bonds in the previous step were alkylated by replacing the DTT/equilibration buffer with a buffer containing 25 g/L iodoacetamide. Again, the strips were shaken for 20 min. The strips were then rinsed with purified water and placed on top of a 2.5 mm layer of 0.5% agarose that in turn formed the stacking layer for a 12% cross-linked polyacrylamide gel. Another layer of agarose was placed above the IPG strip, and a running buffer of 25 mM pH 8.8 Tris·HCl, 0.1% w/v SDS, and 192 mM glycine was used for separation. The second-dimension gels were run at 35 mA constant current for a period of 4 h. Proteins were visualized using colloidal Coomassie Blue stain (Novex, San Diego, CA) in accordance with the manufacturer’s instructions.

Spot Destaining and Protein Digestion. The gel spots were excised and destained in a manner similar to that described by Fountoulakis and Langen.25 The spots were excised within 2 days of visualization using a polypropylene pipet tip that was sheared off to a diameter of about 1 mm. The excised spots were placed in 600 µL microcentrifuge tubes and stored at -20 °C until digestion. They were destained by adding 100 µL of 100 mM ammonium bicarbonate in 50% v/v HPLC-grade acetonitrile and agitating for 20 min. The liquid was decanted, and the process was repeated. A 100 µL portion of Type I water (Barnstead E-Pure, Dubuque, IA) was added, and the gel spots were allowed to stand for 15 min. The water was decanted, and another 100 µL was added for a period of 5 min. Again, the water was discarded, and the spots were soaked in 100 µL of acetonitrile for 5 min.
Destained gel spots were dried in a vacuum centrifuge (Jouan Inc., Winchester, VA) for 15 min at 65 °C. The dried gel spots were rehydrated with 15 µL of 16.67 mg/L (250 ng total enzyme) aseptically filled, TPCK-treated bovine trypsin (Sigma, St. Louis, MO) in 10 mM ammonium bicarbonate and incubated overnight at 37 °C. Peptide Extraction and Guanidination. The digestion was stopped by adding 100 µL of 0.1% v/v trifluoroacetic acid (TFA). Tryptic peptides were extracted by sonication for 20 min, and the liquid was retained. Two more steps of extraction using 100 µL of 30% v/v and 60% v/v acetonitrile were performed. The supernatants from these three steps were combined, and the extracts were vacuum centrifuged to dryness. The peptides were resuspended in 8 µL of water. A 4 µL portion of this extract was guanidinated following the procedure described by Beardsley and Reilly.26 Eight molal O-methylisourea hemisulfate (OMI) (Acros Organic, Springfield, NJ) was prepared by adding 100 µL of water to 98.4 mg of OMI. A 1.5 µL portion of OMI solution was added to 4 µL of peptide extract and 5.5 µL of 7 M ammonia; the solution was thoroughly mixed. The mixture was incubated for 20 min at 65 °C. The ammonia was removed by vacuum centrifuging the samples at 65 °C until a pressure of approximately 100-150 mTorr registered on the rough pump gauge. This minimized adsorptive losses that could arise from drying the sample completely.27 The guanidinated reaction products were acidified with 6 µL of 5% v/v TFA. This reaction quantitatively converts lysines to homoarginines, resulting in a peptide mass increase of 42.02 Da per lysine residue.12,26 This mass shift can be used to infer the number of lysines in a particular tryptic peptide.
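The lysine bookkeeping described above can be illustrated with a short Python sketch. It is not the Prodigies code: the function names, the 0.15 Da pairing tolerance, and the cap of three lysines per peptide are assumptions made for illustration; only the 42.02 Da shift per lysine is taken from the text.

```python
from typing import List, Optional, Tuple

# Mass added per lysine converted to homoarginine (value quoted in the text).
GUANIDINATION_SHIFT_DA = 42.02


def infer_lysine_count(unguanidinated_mass: float,
                       guanidinated_mass: float,
                       tolerance_da: float = 0.15,   # assumed pairing tolerance
                       max_lysines: int = 3) -> Optional[int]:
    """Return the lysine count implied by the mass shift, or None if no
    integral multiple of 42.02 Da (including zero) explains the shift."""
    shift = guanidinated_mass - unguanidinated_mass
    for n_lys in range(max_lysines + 1):
        if abs(shift - n_lys * GUANIDINATION_SHIFT_DA) <= tolerance_da:
            return n_lys
    return None


def pair_spectra(raw_masses: List[float],
                 guan_masses: List[float]) -> List[Tuple[float, float, int]]:
    """List (raw mass, guanidinated mass, lysine count) for every pair of
    peaks whose separation is consistent with 0..max_lysines lysines."""
    pairs = []
    for raw in raw_masses:
        for guan in guan_masses:
            n_lys = infer_lysine_count(raw, guan)
            if n_lys is not None:
                pairs.append((raw, guan, n_lys))
    return pairs
```

For instance, the peaks at 1194.55 Da (unguanidinated) and 1236.55 Da (guanidinated) listed for gel spot 35 differ by 42.00 Da, so `infer_lysine_count(1194.55, 1236.55)` reports one lysine under these assumptions.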

Confidence Limits in Caulobacter Proteomics

research articles

Mass Spectrometry. MALDI spots were prepared by mixing 0.65 µL of raw peptide extract with 0.65 µL of 15 g/L α-cyano-4-hydroxycinnamic acid (CHCA) in 75% v/v acetonitrile, 0.1% v/v TFA. Guanidinated samples were purified using micropipet tips packed with C18 stationary phase (Zorbax C18, Sigma, St. Louis, MO). The peptides were eluted with 2 µL of 10 g/L CHCA in 50% v/v acetonitrile, 0.1% v/v TFA. A 0.65 µL portion of these solutions was deposited onto the MALDI target. All spots were allowed to air-dry prior to loading into the mass spectrometer. Positive-ion mass spectra were recorded using a Bruker Reflex III MALDI-Reflectron TOF mass spectrometer. All mass spectra were internally calibrated with three or four trypsin autolysis peaks ([M + H]+ = 805.417, 1153.574, 2163.057, and 2273.160 Da) using a linear fit.

Database Searching Algorithms. The peak reports for each sample were compared to masses calculated by an in silico tryptic digest of the entire C. crescentus proteome19 using Prodigies (protein digest identification and elucidation software), an in-house software package. The program reads the ASCII formatted peak reports from the Bruker XMASS software package and generates a list of masses for database comparison. For these analyses, Prodigies was configured to use a peptide mass error window of ±0.15 Da, up to two missed trypsin cleavage sites, partial single oxidation of methionines, partial loss of protein N-terminal methionine residues, total alkylation of cysteines by iodoacetamide, and partial conversion of peptide N-terminal glutamine residues to pyroglutamic acid. This led to over 600 000 theoretical tryptic fragment masses. In a simple mode of operation, Prodigies compares the data from a single mass spectrum to the masses generated by the in silico tryptic digest of the entire proteome and generates a list of open reading frame (ORF) products that could be present in the sample. Results are summarized in a “master hit array” (MHA).
In the MHA, ORFs are ranked by the number of observed masses that match theoretical tryptic peptide masses of a particular ORF, and the errors between the observed and theoretical values are displayed. Many publicly available peptide mass mapping computer programs work in a similar manner,4,17,28-31 but Prodigies incorporates a more advanced mode that directly exploits the sequence information obtained by comparing guanidinated and unguanidinated data. In this mode, data from both guanidinated and unguanidinated mass spectra are used to generate three MHAs. Prodigies creates an MHA for the unguanidinated data as described above. It performs a second in silico tryptic digest assuming that all lysine residues have been converted to homoarginines. This new list is compared against the peak report from the guanidinated mass spectrum to generate a second MHA. Often, only one ORF will be common to both the unguanidinated and guanidinated MHAs, and it can be inferred that the product of this ORF is present in the gel band. Prodigies then tabulates those features that appear either at the same mass or shifted by an integral multiple of 42.02 Da in the two MHAs. The number of 42.02 Da shifts corresponds to the number of lysine residues in a particular peptide. Each mass in this new table is again compared to the unguanidinated in silico digest; a match is indicated only when a theoretical peptide mass matches an observed mass and its sequence contains the correct number of lysines. This new “consistent” master hit array lists those ORFs with the largest numbers of matches based on the criteria just described. The consistent MHA is a powerful tool for assigning proteins to gel band tryptic digests. It reduces the number of false positives that arise when multiple peptides with different sequences share the same mass. Random matches occur because there are hundreds of thousands of predicted tryptic peptides compressed into a relatively narrow mass range. For example, there are 162 predicted Caulobacter tryptic peptides having masses within 0.15 Da of 1061.6 Da. Lysine content information excludes many of these peptides from appearing in the consistent master hit array since it only lists those peptides with the correct mass and number of lysines. For this particular example, 82 of the 162 peptides contain 0 lysines, 67 have 1, and only 13 peptides have 2 lysines in their sequences. Prodigies requires less than 1 s to search a pair of mass spectra against the Caulobacter proteome on a 700 MHz computer.

Figure 1. 2-D gel obtained from a stalk protein sample harvested from C. crescentus strain YB2811. Underlined numbers indicate spots in which no proteins were identified.

Random Spectra Generation and Statistical Analysis. The confidence with which one makes a protein identification is as important as the identification itself. A statistical analysis of several million database searches with simulated data allowed the assignment of absolute confidence levels to protein identifications. The question that these simulations address is rather simple: given a mass spectrum containing a certain number of peaks, how many of these must match peptides derived from a single ORF in order for the identification to be certified with a specific confidence level? The method proceeds as follows: One of the 3762 Caulobacter ORFs is selected at random. A specified number of “real” tryptic peptides are chosen at random from the list of predicted tryptic peptides for that particular ORF. Likewise, a specified number of “random” tryptic peptides are selected from the list of all predicted tryptic peptides from all ORFs in the Caulobacter proteome. The combination of these sets of masses defines a theoretical mass spectrum. Prodigies shifts the masses of all “real” and “random” peptides based on the number of lysines in their sequences, generating a corresponding theoretical guanidinated mass spectrum.
It then analyzes the theoretical mass spectra just as it handles experimental data. For each simulated spectrum, the “winning” ORF is determined. The “winning” ORF is defined herein as the ORF with the most matches in a master hit array. By repeating the process many times, Prodigies determines the percentage of theoretical mass spectra that are properly interpreted; i.e., the “winning” ORF is the one from which the real peptides were selected.

Figure 2. Mass spectra obtained from (A) unguanidinated and (B) guanidinated tryptic digests of gel spot 35. Measured masses are listed in Tables 1 and 2.
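The random-spectra procedure described in this section can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the Prodigies implementation: the proteome is a hypothetical set of random (mass, lysine count) pairs rather than the real in silico digest, all function names are invented here, and each simulated peak is assumed to carry a known lysine count (as if it had already been inferred from the 42.02 Da shifts). Only the 0.15 Da window and the "unique winner" criterion come from the text.

```python
import random
from collections import Counter

TOLERANCE_DA = 0.15  # mass error window quoted in the text


def make_toy_proteome(n_orfs=200, peptides_per_orf=30, rng=None):
    """Hypothetical stand-in for the in silico digest:
    ORF id -> list of (peptide mass, number of lysines)."""
    rng = rng or random.Random(0)
    return {
        orf: [(round(rng.uniform(700.0, 3000.0), 2), rng.choice([0, 0, 1, 1, 2]))
              for _ in range(peptides_per_orf)]
        for orf in range(n_orfs)
    }


def score_orfs(spectrum, proteome, use_lysines):
    """Count, per ORF, the observed peaks that match one of its peptides.
    With use_lysines=True the lysine count must agree as well
    (the "consistent" criterion)."""
    scores = Counter()
    for orf, peptides in proteome.items():
        for mass, n_lys in spectrum:
            if any(abs(mass - m) <= TOLERANCE_DA and (not use_lysines or n_lys == k)
                   for m, k in peptides):
                scores[orf] += 1
    return scores


def simulate_once(proteome, n_real, n_random, use_lysines, rng):
    """Build one simulated spectrum; True if the true ORF is the sole top scorer."""
    true_orf = rng.choice(list(proteome))
    real = rng.sample(proteome[true_orf], n_real)
    pool = [pep for peptides in proteome.values() for pep in peptides]
    noise = rng.sample(pool, n_random)
    ranked = score_orfs(real + noise, proteome, use_lysines).most_common(2)
    if not ranked:
        return False
    unique = len(ranked) == 1 or ranked[0][1] > ranked[1][1]
    return unique and ranked[0][0] == true_orf


def success_rate(proteome, n_real, n_random, use_lysines, trials=200, seed=1):
    """Fraction of simulated spectra that are interpreted correctly."""
    rng = random.Random(seed)
    wins = sum(simulate_once(proteome, n_real, n_random, use_lysines, rng)
               for _ in range(trials))
    return wins / trials
```

Calling `success_rate` over a grid of `n_real` and `n_random` values, once with `use_lysines=False` and once with `use_lysines=True`, gives toy analogues of the two kinds of confidence surface; the absolute numbers depend entirely on the assumed toy proteome and will not reproduce the published contours.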

Results

Identification of a Strong Gel Spot. Figure 1 is a photograph of a representative 2-D gel of a Caulobacter stalk protein sample. We were able to identify proteins in 50 out of 62 (81%) gel spots analyzed. Figure 2 displays the two mass spectra obtained from the unguanidinated and guanidinated tryptic digests of the proteins in gel spot 35. Tables 1-3 contain the three truncated MHAs for gel spot 35. (Note: The portion of the MHA to the right of the right-hand column has been deleted from all MHAs presented to save space. This truncation does not significantly affect interpretation since deleted items correspond to ORFs with relatively few matches.) The top row lists the ORFs ranked by number of peptides matched. The left-hand column lists the experimental peptide masses in order of decreasing intensity. The fractional elements in the table are the differences between the experimental and theoretical masses. Numbers in braces indicate the total number of matches in each column. The “winning” ORF corresponds to the protein most likely to be present in the gel spot.

Tables 1-3 display data whose preliminary interpretation is obvious. In Table 1, 29 of the 35 observed masses match predicted tryptic peptides of ORF1749 (CC1750, a tonB dependent receptor). The only question that arises is whether any proteins other than ORF1749 are also present in this gel spot. Guanidinated data help to address this. Table 2 shows the MHA derived from the tryptic digest of gel spot 35 after guanidination. Since the reaction is quantitative,26 complete conversion of lysine to homoarginine is assumed. The MHA in Table 2 compares observed guanidinated masses with theoretical masses from the guanidinated proteome. ORF1749 once again has the largest number of matches with 23 out of 32 masses submitted. More importantly, only two other ORFs, 373 and 503, appear in both Tables 1 and 2. Almost none of the matches to these two ORFs involve experimental masses that were not already assigned to ORF1749, eliminating them from further serious consideration. Prodigies makes full use of the sequence information that guanidination reveals. In the consistent MHA for gel spot 35 digest samples shown in Table 3, ORF1749 once again stands out. Nevertheless, for such a clear case as this, the consistent MHA is not needed to interpret the mass spectra.

Table 1. Master Hit Array Obtained from Interpreting the Mass Spectrum in Figure 2A

Mass (Da)    1749   2480   2922   3536   3665   3702     13    373    503    543    619    786   1142   1913   3605   3651    339
             {29}    {9}    {8}    {8}    {8}    {8}    {7}    {7}    {7}    {7}    {7}    {7}    {7}    {7}    {7}    {7}    {6}
1295.6      -0.04      *      *      *      *      *      *      *      *      *      *      *      *      *      *      *      *
1258.68     -0.03      *  -0.02      *      *  -0.06  -0.05      *      *   0.05      *      *   0.04      *      *      *      *
1902.89     -0.07      *      *      *      *   0.08      *      *      *      *      *      *      *      *      *      *      *
885.44      -0.01      *      *   0.07   0.05      *   0.09      *   0.07      *   0.09   0.05      *      *      *  -0.09      *
733.45      -0.01   0.01      *      *      *      *      *      *      *      *      *      *      *      *      *      *      *
1988.97     -0.06      *      *    0.1      *      *      *   0.06      *      *      *      *   0.08      *      *    0.1      *
1303.73     -0.05      *      *      *      *      *      *  -0.05  -0.01      *      *      *  -0.07      *      *  -0.14      *
888.48      -0.01      *      *      *      *  -0.03      *      *      *      *      *   0.00      *      *   0.00      *      *
1584.84     -0.05      *      *      *      *   0.00      *      *      *  -0.09   0.04      *      *      *      *      *      *
1308.74     -0.04   0.05      *  -0.06  -0.03      *      *      *      *      *      *      *  -0.05      *      *      *      *
1515.87     -0.07  -0.04      *      *      *      *  -0.06      *      *      *      *      *      *      *      *      *      *
1194.55      0.03      *   0.04      *      *      *      *      *   0.04      *      *    0.1      *      *      *      *   0.12
1573.76     -0.03      *   0.05      *      *   0.05      *   0.03      *   0.02      *      *      *      *   0.11      *      *
737.41       0.02      *  -0.03      *      *      *      *      *      *      *   -0.1      *  -0.06      *      *      *      *
2131.99     -0.02   0.06   0.02      *      *      *      *   0.08      *      *      *      *      *      *      *   0.05      *
1317.75     -0.02      *  -0.03      *   0.00   -0.1      *      *  -0.13      *      *      *  -0.03      *      *      *      *
1859.9      -0.01      *      *      *      *      *      *   0.08      *      *      *   0.05      *    0.1      *      *      *
1123.55      0.02      *      *      *      *      *      *      *      *   0.00      *      *      *      *      *      *      *
846.41      -0.02   0.06      *   0.00      *      *   0.03   0.04      *   0.00      *      *      *   0.08   0.01      *   0.06
2216.09     -0.04      *      *      *   0.07      *      *      *   0.07   0.02   0.00      *      *      *   0.01      *    0.1
1090.65     -0.07      *      *      *      *      *      *      *  -0.04      *      *      *      *  -0.08      *      *      *
1715.87     -0.02      *      *      *      *      *      *      *      *      *      *      *      *   0.04      *      *   0.14
1178.55      0.03      *   0.05    0.1   0.03      *   0.08      *   0.04      *      *   0.04      *      *      *      *      *
814.44       0.02      *      *      *      *      *  -0.04      *      *      *   0.05   0.03      *   0.02      *      *      *
2267.18     -0.05  -0.15      *      *      *      *      *      *      *      *      *      *      *      *      *      *      *
1561.84     -0.01      *      *      *      *      *      *      *      *      *      *      *      *      *      *  -0.08   0.01
1774.87         *      *      *   0.06   0.02      *      *      *      *      *      *      *   0.11      *      *      *      *
1433.77         *  -0.09      *      *  -0.13      *      *      *      *      *      *      *      *      *      *      *      *
1756.84         *      *      *      *      *   0.12      *      *      *      *   0.07      *      *   0.05   0.06   0.03      *
1745.05     -0.09  -0.11      *      *      *      *      *      *      *      *      *      *      *  -0.07      *      *   -0.1
2360.19     -0.01      *      *      *      *      *      *      *      *      *      *      *      *      *      *   0.04      *
2663.29      0.01      *   0.02      *   0.06      *      *      *      *      *      *      *      *      *      *      *      *
2611.35         *      *      *  -0.03      *      *  -0.06      *      *      *  -0.01      *      *      *  -0.03      *      *
2726.41         *  -0.11      *      *      *  -0.06      *  -0.12      *   0.02      *      *      *      *  -0.02      *      *
2327.1          *      *      *   0.06      *      *      *      *      *      *      *   0.07      *      *      *      *      *

Table 2. Master Hit Array Obtained from Interpreting the Mass Spectrum in Figure 2B

Mass (Da)    1749    826    373    655    701   2178     42    503   1342   1603   1958   2588   2694   2722   3536
             {23}    {9}    {8}    {8}    {8}    {8}    {7}    {7}    {7}    {7}    {7}    {7}    {7}    {7}    {7}
1615.76     -0.01   0.14   0.06      *   0.03      *   0.07      *      *      *      *   0.12   0.07      *      *
1236.55      0.06   0.12      *      *      *   0.12      *  -0.01      *      *      *      *   0.15      *      *
1988.94     -0.03   0.13    0.1      *      *   0.07      *      *   0.01      *      *      *   0.13      *   0.13
1258.64      0.02  -0.06      *   0.02   0.00      *   0.02   0.09      *      *      *      *      *      *      *
1345.69      0.01      *      *   0.02      *      *      *   0.05      *      *   0.03      *      *      *      *
1584.78      0.01      *      *   0.01      *      *      *      *      *      *      *   0.13      *      *      *
885.47      -0.05      *      *      *      *   0.06      *   0.03      *   0.00      *      *      *  -0.02   0.05
1515.84     -0.04      *      *  -0.07   0.01      *   -0.1      *      *      *   0.02      *      *   0.04  -0.05
1943.94      0.00      *      *   0.07      *      *      *      *      *      *      *      *      *      *      *
1220.58      0.03      *      *      *      *   0.02      *   0.04      *      *      *      *   0.07      *    0.1
856.46       0.03      *      *      *      *      *      *      *      *      *   0.07      *      *      *      *
779.45       0.00      *      *      *      *   0.00      *      *      *      *      *      *      *      *      *
1715.73      0.12      *      *      *      *      *   0.00      *      *      *      *      *      *      *      *
775.42          *      *  -0.03  -0.01      *   0.00      *      *      *  -0.04      *   0.05      *      *  -0.01
1387.72         *  -0.02   0.00      *   0.01      *      *   0.04      *      *      *      *      *   0.00      *
1309.66         *   0.03   0.00      *  -0.08      *      *      *      *      *      *      *      *      *      *
1178.53         *   0.05      *      *   0.04   0.12      *      *      *      *      *      *      *      *      *
890.5        0.03      *      *      *      *      *      *   0.01  -0.01      *      *      *      *      *   0.03
1295.54      0.02      *      *      *      *      *      *      *      *      *      *      *      *      *      *
888.42       0.05      *   0.05      *      *      *   0.09      *   0.08   0.08   0.07   0.11      *   0.11   0.02
1603.88     -0.02      *      *      *      *      *      *      *  -0.07      *      *      *      *      *      *
1123.5       0.07      *      *      *      *      *      *      *      *      *      *      *      *      *      *
1520.78     -0.02      *      *   0.09      *   0.05      *      *      *      *   0.01   0.03      *  -0.01      *
1986.88         *      *      *      *      *      *   0.12      *      *      *      *      *   0.13      *      *
1676.81      0.06      *      *  -0.02  -0.04      *      *      *   0.02  -0.03      *   0.09      *   0.03      *
1929.19         *      *  -0.13      *      *      *      *      *      *      *   -0.1      *      *      *      *
1359.66       0.1   0.02      *      *   0.09      *      *      *      *      *    0.1      *      *      *      *
1901.92         *      *   0.08      *      *      *      *      *   0.08    0.1      *      *   0.12   0.08      *
2187.02         *   0.12      *      *      *      *  -0.05      *   0.12      *      *      *  -0.02      *      *
2309.08      0.08      *      *      *      *      *      *      *      *   0.08      *   0.09      *      *      *
2360.19     -0.01      *      *      *      *      *      *      *      *      *      *      *      *      *      *
2091.9          *      *      *      *      *      *      *      *      *   0.14      *      *      *      *      *

Table 3. Consistent Master Hit Array Obtained from Interpreting the Two Spot 35 Mass Spectra

Mass (Da)    1749    373    503    593    804    826   1407   1784   2549     15     28     42     71    119    437    543
             {20}    {5}    {4}    {4}    {4}    {4}    {4}    {4}    {4}    {3}    {3}    {3}    {3}    {3}    {3}    {3}
1295.6      -0.04      *      *   0.04      *      *      *   0.15      *      *      *      *      *      *      *      *
1258.68     -0.03      *      *   0.04   0.00      *      *      *      *      *      *      *      *      *      *      *
1902.89         *      *      *      *      *      *      *      *      *      *      *      *      *      *      *      *
885.44      -0.01      *   0.07      *      *      *      *      *      *      *      *      *   0.04      *      *      *
733.45          *      *      *      *      *      *      *      *      *      *      *      *      *      *      *      *
1988.97     -0.06   0.06      *      *      *      *      *      *      *      *      *      *      *      *   0.06      *
1303.73     -0.05  -0.05  -0.01  -0.04      *  -0.07      *      *      *      *      *      *  -0.13      *      *      *
888.48      -0.01      *      *      *      *      *      *  -0.03      0  -0.04      *   0.04      *      *   0.02      *
1584.84     -0.05      *      *      *      *      *      *      *      *      *      *      *      *      *      *  -0.09
1515.87     -0.07      *      *      *      *      *  -0.03  -0.08      *      *      *  -0.13      *      *      *      *
1194.55      0.03      *   0.04      *      *   0.09      *      *      *      *      *      *      *      *      *      *
1573.76     -0.03   0.03      *      *  -0.06   0.12   0.07      *      *   0.04      *   0.05      *      *      *      *
737.41       0.02      *      *      *      *      *      *      *  -0.03      *      *      *      *   0.03      *      *
1317.75     -0.02      *      *  -0.08      *      *      *      *      *      *      *      *      *      *      *      *
1859.9      -0.01   0.08      *      *      *      *      *      *      *      *   0.07      *      *      *      *      *
1123.55      0.02      *      *      *      *      *      *      *      *      *      *      *      *   0.00   0.04   0.00
846.41          *   0.04      *      *      *      *   0.09      *      *      *      *      *      *      *      *   0.00
1715.87     -0.02      *      *      *      *      *      *      *   0.04      *  -0.02      *      *      *      *      *
1178.55      0.03      *   0.04      *      *   0.03      *   0.06      *      *      *      *      *      *      *      *
814.44       0.02      *      *      *      *      *   0.02      *      *      *   0.06      *      *      *      *      *
2267.18     -0.05      *      *      *   0.04      *      *      *   0.07      *      *      *      *      *      *      *
1561.84     -0.01      *      *      *      *      *      *      *      *   0.09      *      *      *  -0.05      *      *
2360.19     -0.01      *      *      *  -0.03      *      *      *      *      *      *      *  -0.03      *      *      *

Identification of a Weak Gel Band. Figure 3 displays mass spectra obtained from the tryptic digest of gel spot 16. There are about half as many peaks in these mass spectra as in Figure 2. It is clear in Figure 1 that spot 16 is not as dark as spot 35, implying that less protein is present. Tables 4-6 are the three MHAs that arise from analysis of mass spectra derived from spot 16. Unlike the previous gel spot, there is no obvious assignment from the first MHA. Three different ORFs match 6 of the 14 observed masses. A comparison of Tables 4 and 5 shows that only ORF1461 is common to both, suggesting that the protein encoded by ORF1461, flagellin fljK, is present in this gel spot. Table 6, the consistent MHA, provides confirmation of this identification. Five of the eight consistent peaks are associated with tryptic peptides of ORF1461. Thus, the only way that an ORF could be convincingly assigned to this gel spot was by using the information provided by both of the mass spectra. Nevertheless, since this identification is less clear-cut than that for spot 35, a more quantitative measure of our confidence in this assignment is desirable.

Analysis of Simulated Data. The confidence with which assignments are made can be assessed through Monte Carlo simulations. Figure 4 is a contour plot derived from analysis of 2.75 million randomly simulated unguanidinated mass spectra. Search conditions were identical to those used with the experimental data. Ten thousand analyses were performed for each number of real and random peptides. The average and standard deviation for correct identifications were computed. The standard deviations of the contours increased monotonically from a low of 1% at 99% confidence to 5% at 50% confidence. A correct identification was registered when only one ORF had the highest number of matches in an MHA. Figure 4 reveals that the minimum number of “real” peptides required to correctly identify an ORF 99% of the time is 7 and even then, only if very few “random” peptides are present. Guanidinated theoretical data yielded comparable results, and the confidence contour plot (not shown) looked very similar to Figure 4. Overall, consideration of guanidinated or unguanidinated mass spectra individually yielded nearly the same confidence level. For example, 6 “real” masses and 8 “random” masses gave the correct assignment 85% of the time for unguanidinated data, and 86% of the time for guanidinated data. This makes sense since total guanidination merely shifts the mass of lysine by 42.02 Da. Information about lysine content is not obtained when the guanidinated mass spectrum is considered by itself. Figure 5 dramatically illustrates the impact of interpreting the guanidinated and unguanidinated mass spectra in conjunction. The minimum number of consistent matches needed to identify an ORF with 99% confidence falls to 5. Likewise, one can infer from Figure 5 that having 10 or more consistent matches enables identification at the 99% confidence level even if there are 25 other consistent pairs that do not match the “winning” ORF.

Correlation between Theoretical Data and Experimental Data. The correspondence between simulated and experimental data is as follows: The number of “real” peptides in a simulated mass spectrum corresponds to the number of experimentally measured masses that match the “winning” ORF. The number of “random” masses in a simulated spectrum corresponds to the number of experimentally measured masses that do not match the “winning” ORF. An average experimental unguanidinated mass spectrum contained 24 masses. For such a case, at least 10 matches (10 “real” peptides and 14 “random” peptides) to the “winning” ORF are required to make an assignment with 95% confidence and 12 matches for 99% confidence. The average experimental guanidinated mass spectrum contained 20 masses. Nine matches are required to make an identification with 95% confidence and 10 matches for 99% confidence. The average experimental consistent MHA obtained during the stalk gel analyses had 12 masses. Perusal of Figure 5 reveals that only 5 consistent matches are required for 95% confidence and 6 for 99% confidence. The reduced slopes of the Figure 5 contours relative to those of Figure 4 imply that spot identifications based on consistent mass spectral features are less affected by the presence of random data and require fewer matches. This is especially helpful for analyzing faint gel spots that typically yield fewer mass spectral peaks.

In the case of gel spot 35, 29 of the 35 observed mass spectral features in the unguanidinated mass spectrum matched ORF1749. One can discern from Figure 4 that 29 “real” peptides and 6 “random” peptides provide a confidence greater than 99%. The more ambiguous case of gel spot 16 demonstrates the utility of guanidination. In that case, 6 of the 14 experimentally measured masses matched 3 different ORFs (503, 922, and 1461). From Figure 4, we have 85% confidence that one of these three proteins is present. Furthermore, 7 out of 14 masses in the guanidinated spectrum matched tryptic peptides of ORF1461, implying a confidence in the assignment of 96%.

research articles

Confidence Limits in Caulobacter Proteomics

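Figure 3. Mass spectra obtained from (A) unguanidinated and (B) guanidinated tryptic digests of gel spot 16. Measured masses are listed in Tables 4 and 5.

Table 4. Master Hit Array Obtained from Interpreting the Mass Spectrum in Figure 3A

| Measured mass | 503 {6} | 922 {6} | 1461 {6} | 972 {5} | 88 {4} | 273 {4} | 999 {4} | 1031 {4} | 1099 {4} | 1298 {4} | 1540 {4} | 1606 {4} | 1760 {4} | 1795 {4} | 3223 {4} | 3536 {4} | 3549 {4} | 3608 {4} |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1452.72 | * | 0.09 | -0.02 | * | * | -0.02 | * | * | * | * | * | * | * | -0.02 | * | -0.04 | 0.07 | * |
| 1914.02 | * | * | -0.04 | * | * | * | * | * | * | -0.04 | 0.00 | * | * | * | * | * | * | * |
| 2003.1 | * | * | -0.02 | -0.13 | * | * | * | * | * | * | -0.11 | * | * | * | * | * | * | * |
| 1029.46 | 0.14 | * | * | * | -0.05 | * | * | * | * | * | * | * | * | * | * | * | * | * |
| 1907.96 | 0.07 | * | -0.04 | * | 0.08 | * | * | * | * | * | 0.12 | * | * | * | 0.01 | * | * | * |
| 1308.68 | * | 0.02 | -0.06 | * | * | -0.06 | -0.07 | * | * | * | * | * | * | -0.06 | 0.01 | 0.00 | * | 0.02 |
| 1424.76 | -0.02 | -0.08 | * | 0.01 | * | * | 0.05 | 0.04 | * | * | * | 0.05 | * | * | * | * | * | 0.07 |
| 1234.71 | -0.03 | -0.07 | * | * | * | -0.09 | * | * | * | * | * | -0.13 | -0.01 | -0.09 | * | 0.02 | * | -0.08 |
| 1707.83 | 0.02 | 0.08 | * | * | * | * | 0.03 | * | 0.02 | -0.02 | * | 0 | -0.07 | * | * | * | -0.09 | * |
| 1609.83 | 0.02 | * | * | 0.04 | -0.02 | 0.14 | -0.09 | -0.03 | 0.13 | * | * | * | * | 0.14 | * | * | -0.02 | * |
| 1794.99 | * | * | -0.06 | * | * | * | * | * | -0.12 | * | * | * | * | * | -0.01 | * | 0.04 | * |
| 2384.09 | * | 0.14 | * | 0.02 | 0.04 | * | * | 0.05 | 0.15 | 0.08 | * | * | 0.03 | * | * | 0.01 | * | * |
| 2717.11 | * | * | * | * | * | * | * | * | * | * | * | * | * | * | * | * | * | * |
| 2500.27 | * | * | * | -0.06 | * | * | * | 0.12 | * | 0.01 | 0.01 | 0.07 | 0.00 | * | 0.06 | * | * | 0.03 |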
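As a rough illustration of how a master hit array such as Table 4 can be assembled, the Python sketch below compares each measured mass with the theoretical peptide masses of every candidate ORF and records the mass error when the closest peptide lies within a tolerance. It is not the Prodigies implementation. The function names, the 0.15 Da tolerance, the sign convention (measured minus theoretical), and the ORF labels used in the usage note are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# One row of the array: (ORF label, number of matched masses, per-mass errors).
HitRow = Tuple[str, int, List[Optional[float]]]


def build_master_hit_array(
    measured: List[float],
    theoretical: Dict[str, List[float]],
    tol: float = 0.15,  # assumed tolerance in Da, not taken from the paper
) -> List[HitRow]:
    """Match measured masses against each ORF's theoretical peptide masses.

    An error is recorded as measured minus theoretical (assumed convention);
    None marks a measured mass with no peptide within the tolerance.
    """
    rows: List[HitRow] = []
    for orf, peptides in theoretical.items():
        errors: List[Optional[float]] = []
        for mass in measured:
            # Signed error to the closest theoretical peptide of this ORF.
            best = min((mass - p for p in peptides), key=abs, default=None)
            if best is not None and abs(best) <= tol:
                errors.append(round(best, 2))
            else:
                errors.append(None)
        hits = sum(e is not None for e in errors)
        rows.append((orf, hits, errors))
    # ORFs with the most matches come first, as in the tables.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


def format_row(row: HitRow) -> str:
    """Render a row in the 'ORF {hits}' style, with '*' for unmatched masses."""
    orf, hits, errors = row
    cells = ["*" if e is None else f"{e:.2f}" for e in errors]
    return f"{orf} {{{hits}}}: " + " ".join(cells)
```

Calling `build_master_hit_array([1452.72, 1914.02], {"ORF_A": [1452.74, 1914.06]})` with these made-up peptide masses returns one row for the hypothetical `ORF_A` with two hits, which `format_row` renders as `ORF_A {2}: -0.02 -0.04`.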

Finally, Table 3 indicates that 5 out of 8 consistent peptides matched ORF1461. From Figure 5, one infers that these data would yield a correct identification 99% of the time. Reviewing the consistent MHAs for all 50 gel spots in which proteins were identified, all but 7 assignments were made with 99% confidence, and of those 7, only 2 were below 95% confidence. There were 14 sub-99% assignments for the unguanidinated MHAs, with 6 of those being at less than 95% confidence. The same number of sub-99% assignments, 14, was observed in the guanidinated MHAs, with only 4 of those being sub-95%. Only one assignment was made with all three MHAs demonstrating less than 95% confidence. Closer examination of the data from that gel spot (spot 50) revealed several sample handling-induced peptide modifications. Once these modifications were taken into account, the protein was identified with 98% confidence.
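The confidence values quoted in this section were obtained from Monte Carlo simulations of database searches. The Python sketch below shows the general idea under strongly simplified assumptions and is not the authors' procedure: the database is synthetic, peptide masses are drawn uniformly between 700 and 2800 Da, and a search counts as correct only when the true ORF has strictly more matches than every other ORF. All names and default parameters are hypothetical, so the numbers it returns are not comparable to the Figure 4 and Figure 5 contours.

```python
import bisect
import random
from typing import List, Tuple


def count_hits(spectrum: List[float], peptides_sorted: List[float], tol: float) -> int:
    """Count spectrum masses that lie within tol of any peptide mass."""
    hits = 0
    for mass in spectrum:
        # First peptide at or above the lower edge of the tolerance window.
        i = bisect.bisect_left(peptides_sorted, mass - tol)
        if i < len(peptides_sorted) and peptides_sorted[i] <= mass + tol:
            hits += 1
    return hits


def simulate_confidence(
    n_real: int,
    n_random: int,
    n_orfs: int = 200,            # assumed size of the synthetic database
    peptides_per_orf: int = 40,   # assumed peptides per protein
    tol: float = 0.15,            # assumed mass tolerance in Da
    trials: int = 200,
    seed: int = 0,
    mass_range: Tuple[float, float] = (700.0, 2800.0),
) -> float:
    """Fraction of simulated searches in which the true ORF is the sole top hit."""
    rng = random.Random(seed)
    low, high = mass_range
    # Synthetic database: every ORF gets uniformly distributed peptide masses.
    database = [
        sorted(rng.uniform(low, high) for _ in range(peptides_per_orf))
        for _ in range(n_orfs)
    ]
    correct = 0
    for _ in range(trials):
        true_idx = rng.randrange(n_orfs)
        # "Real" peptides come from the true ORF, "random" ones from noise.
        spectrum = rng.sample(database[true_idx], n_real)
        spectrum += [rng.uniform(low, high) for _ in range(n_random)]
        hits = [count_hits(spectrum, peptides, tol) for peptides in database]
        best_other = max(h for i, h in enumerate(hits) if i != true_idx)
        if hits[true_idx] > best_other:
            correct += 1
    return correct / trials
```

For example, `simulate_confidence(n_real=6, n_random=8)` mimics a spectrum of 14 masses in which 6 belong to the true protein. A more faithful simulation would replace the uniform synthetic database with in silico tryptic digests of the actual genome.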

Table 5. Master Hit Array Obtained from Interpreting the Mass Spectrum in Figure 3B

| Measured mass | 1461 {7} | 690 {5} | 346 {4} | 739 {4} | 792 {4} | 793 {4} | 799 {4} | 1460 {4} |
|---|---|---|---|---|---|---|---|---|
| 1494.71 | 0.01 | * | * | 0.06 | * | * | * | * |
| 1071.5 | * | * | * | * | * | * | * | * |
| 859.5 | 0.00 | 0.01 | * | * | 0.00 | 0.00 | -0.02 | 0.00 |
| 1350.69 | -0.04 | * | 0.01 | 0.04 | * | * | 0.08 | * |
| 1950.93 | * | 0.07 | * | 0.13 | * | * | * | * |
| 744.42 | 0.03 | * | -0.02 | 0.00 | 0.03 | 0.03 | -0.07 | 0.03 |
| 1955.99 | 0.02 | * | * | * | 0.02 | 0.02 | * | 0.02 |
| 1949.9 | 0.04 | * | * | * | * | * | * | * |
| 1276.74 | * | -0.09 | * | * | * | * | * | * |
| 1837.03 | -0.09 | 0.09 | -0.09 | * | -0.09 | -0.09 | -0.02 | -0.09 |
| 1601.82 | * | 0.07 | -0.01 | * | * | * | * | * |
| 2096.95 | * | * | * | * | * | * | * | * |
| 2383.9 | * | * | * | * | * | * | * | * |
| 2716.99 | * | * | * | * | * | * | * | * |

Discussion

Overview of Peptide Mass Mapping Data. On the basis of the analysis of all gel spots, 33 distinct proteins were identified, including 7 different tonB dependent receptors, 2 ompA family proteins, 4 flagellins, and 12 proteins annotated as either hypothetical or conserved hypothetical. Many of these proteins were predicted to be membrane associated.22 This supports the hypothesis that the stalk is mostly membranous in nature and should therefore contain few cytosolic proteins.32,33 The identification of a relatively large number of tonB dependent receptors is consistent with previously published proteomic analyses of Caulobacter cell membranes.34,35 A more complete discussion of the proteins identified and their biological significance is presented in a concurrent publication.22

Averaged over all gel spots, we observed 24 masses per unguanidinated tryptic digest spectrum. Of those 24 measured masses, on average, 16 matched the “winning” ORF. For the guanidinated data, there were on average 13 “winning” ORF matches for 20 observed masses. When consistent masses were considered, the percentage of those matched to the “winning” ORF increased to 83% (10/12). We observed 37% average sequence coverage in the unguanidinated samples, and 31% after guanidination. In many proteins, trypsin cleavage sites are widely spaced. The large tryptic peptides produced in these cases can be difficult to detect in the presence of smaller peptides. Likewise, MALDI mass spectrometry is generally considered nonoptimal for studying low mass analytes (