Anal. Chem. 2005, 77, 3198-3207

Evaluating Preparative Isoelectric Focusing of Complex Peptide Mixtures for Tandem Mass Spectrometry-Based Proteomics: A Case Study in Profiling Chromatin-Enriched Subcellular Fractions in Saccharomyces cerevisiae

Hongwei Xie,† Sricharan Bandhakavi,† and Timothy J. Griffin*

Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, 321 Church Street SE, 6-155 Jackson Hall, Minneapolis, Minnesota 55455

We have evaluated the use of free-flow electrophoresis (FFE), an emerging separation method for preparative isoelectric focusing of complex peptide mixtures, as a tool for high-throughput tandem mass spectrometry-based proteomic analysis. In this study, we investigated the ability of free-flow electrophoresis to resolve and fractionate complex peptide mixtures and also the effectiveness of using peptide isoelectric point (pI) in conjunction with peptide match probability scoring in sequence database searching. As a model system for this study, we analyzed a chromatin-enriched fraction from the yeast Saccharomyces cerevisiae. This mixture was fractionated using preparative isoelectric focusing by free-flow electrophoresis, followed by online capillary liquid chromatography electrospray tandem mass spectrometry and sequence database searching. Our results demonstrate that (1) FFE effectively resolves and fractionates complex peptide mixtures on the basis of peptide isoelectric point and (2) the introduction of peptide pI is effective in minimizing both false positive and false negative sequence matches in sequence database searching of tandem mass spectrometry data.

An essential component in the analysis of complex protein mixtures using mass spectrometry is the use of high-resolution separation methodologies for the detection of the thousands of components that make up these complex mixtures. The most effective of these methods has coupled strong cation-exchange (SCX) HPLC with capillary reverse-phase liquid chromatography (µLC) to separate peptide mixtures derived from proteolysis of complex protein mixtures, followed by tandem mass spectrometry (MS/MS) analysis and sequence database searching for protein identification.1-3 This HPLC-based approach has overcome many of the inherent disadvantages of two-dimensional gel-based separations of protein mixtures.4 Although the first-stage SCX separation is based upon solution-phase charge, and thus closely related to the isoelectric point (pI) determined by the amino acid sequence of each peptide, this important physicochemical property of peptide sequences is not taken into consideration in the ultimate identification of the peptides by MS/MS and sequence database searching using SCX separations.

As an alternative separation method, preparative isoelectric focusing (IEF) of peptide mixtures can be thought of as an “information-added” separation approach, as complex mixtures are fractionated such that information on peptide sequence (i.e., pI) is introduced prior to identifying the exact amino acid sequence by MS/MS and sequence database searching.5-7 Introduction of a peptide pI constraint in database searching has the potential to increase the accuracy of sequence database searching and to decrease the need for manual interpretation of database search results, which is a critical need when interpreting the results from the thousands of peptide components detected in high-throughput proteomic studies.5-7 Recently, it has been demonstrated5 that immobilized pH gradient gel strips can be used as an effective method for reproducibly separating complex peptide mixtures by high-resolution IEF; furthermore, using computational methods, the introduction of peptide pI as a constraint in the sequence database search was also demonstrated to be an effective way to filter out false positive identifications and also to increase the confidence of sequence matches.6,7 An attractive alternative for IEF of peptide mixtures is the use of free-flow electrophoresis (FFE).8 The effectiveness of FFE for preparative IEF of peptide mixtures has recently been described.9

* Corresponding author. E-mail: [email protected]. Tel: 612-624-5249. Fax: 612-624-0432.
† These authors contributed equally to this work.
(1) Link, A. J.; Eng, J.; Schieltz, D. M.; Carmack, E.; Mize, G. J.; Morris, D. R.; Garvik, B. M.; Yates, J. R., III. Nat. Biotechnol. 1999, 17, 676-682.
(2) Washburn, M. P.; Wolters, D.; Yates, J. R., III. Nat. Biotechnol. 2001, 19, 242-247.
(3) Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. J. Proteome Res. 2003, 2, 43-50.
(4) Gygi, S. P.; Corthals, G. L.; Zhang, Y.; Rochon, Y.; Aebersold, R. Proc. Natl. Acad. Sci. U.S.A. 2000, 97, 9390-9395.
(5) Cargile, B. J.; Talley, D. L.; Stephenson, J. L., Jr. Electrophoresis 2004, 25, 936-945.
(6) Cargile, B. J.; Bundy, J. L.; Freeman, T. W.; Stephenson, J. L., Jr. J. Proteome Res. 2004, 3, 112-119.
(7) Cargile, B. J.; Bundy, J. L.; Stephenson, J. L., Jr. J. Proteome Res. 2004, 3, 1082-1085.
(8) Loseva, O. I.; Gavryushkin, A. V.; Osipov, V. V.; Vanyakin, E. N. Electrophoresis 1998, 19, 1127-1134.
(9) Moritz, R. L.; Ji, H.; Schutz, F.; Connolly, L. M.; Kapp, E. A.; Speed, T. P.; Simpson, R. J. Anal. Chem. 2004, 76, 4811-4824.

10.1021/ac0482256 CCC: $30.25

3198 Analytical Chemistry, Vol. 77, No. 10, May 15, 2005

© 2005 American Chemical Society Published on Web 04/02/2005

FFE offers several advantages for peptide separations including (1) solution-phase isoelectric focusing without the need for extraction from a gel matrix, resulting in high sample recovery;9 (2) large loading capacities (up to tens of milligrams) and flexibility of loading volumes (microliter amounts to several milliliter loading volumes); (3) rapid IEF and collection of samples (∼30-60 min per sample) as well as continuous, sequential loading and collection of samples; (4) high level of reproducibility.9

In this study, we evaluate the use of FFE for preparative IEF of peptides as a first separation dimension for tandem mass spectrometry-based proteomic analysis, with emphasis on the use of peptide pI information in sequence database searching. Using subcellular fractionation to enrich for chromatin-associated proteins in Saccharomyces cerevisiae, we use this enzymatically digested protein mixture to demonstrate the effectiveness of FFE to resolve and fractionate peptides on the basis of peptide pI, followed by automated µLC electrospray ionization (ESI) MS/MS analysis. Furthermore, we present a validation of the use of peptide pI in conjunction with sequence match probability scoring10 to effectively minimize both false positive and false negative results7 using sequence database searching. We comprehensively validate the use of peptide pI and probability scoring using a combination of independent criteria, including in-silico analysis via reverse database searching, subcellular protein localization information, and biochemical detection of selected protein components by immunoblotting. Collectively, the results presented here demonstrate the power of preparative IEF using FFE as a general tool for high-throughput mass spectrometry-based analyses.

MATERIALS AND METHODS

Chemicals and Reagents. Chemicals including HPLC grade acetonitrile, formic acid, and HPLC grade water were purchased from Sigma-Aldrich Chemical Co. (St. Louis). Sep-Pak tC18 chromatography cartridges were purchased from Waters Corporation (Milford, MA).

Yeast Chromatin Preparation Protocol and Western Analysis. A crude yeast chromatin pellet was prepared essentially as described previously.11,12 Briefly, yeast spheroplasts were prepared from 10^9 actively growing yeast cells (strain BJ 5464; ATCC), lysed in a buffer containing protease inhibitors, 20 mM PIPES/KOH, pH 6.8, 0.4 M sorbitol, 2 mM magnesium acetate, 100 mM potassium acetate, and 1% Triton X-100, and centrifuged at 16 000g for 15 min. After discarding the supernatant, the crude chromatin pellet was washed gently in lysis buffer without Triton X-100, and proteins were either extracted with high salt (2 M NaCl, 50 mM Tris-HCl, pH 7.5) or resuspended in lysis buffer and boiled after addition of SDS sample buffer. To monitor efficiency of chromatin fractionation, equivalent fractions of whole cell extract, supernatant, and chromatin were analyzed by SDS-PAGE and agarose gel electrophoresis (for tracking genomic DNA). To prepare chromatin for SDS-PAGE, the pellet was resuspended in lysis buffer; equivalent amounts of each fraction were boiled in SDS sample buffer, run on a 4-20% SDS-PAGE gel, and immunoblotted for presence of the cytoplasmic protein Cdc37p (mouse monoclonal anti-Cdc37 antibody was used at 1:1000 dilution) and the chromatin-associated histone H4p (Upstate antihistone H4 used at 1:1000 dilution). Additionally, we checked for presence of genomic DNA in the whole cell extract, supernatant, and chromatin pellet by phenol-chloroform extraction of equivalent quantities of each of these fractions, followed by ethanol precipitation. DNA precipitates from the fractions were run on a 1% agarose gel and visualized by staining with ethidium bromide. For FFE and mass spectrometric analysis, 500 µg total protein from the crude chromatin pellet (diluted to 200 mM NaCl, 100 mM Tris-HCl pH 7.5) was supplemented with 5 mM TCEP and incubated overnight with 10 µg of trypsin. The resulting peptide samples were cleaned using reverse-phase Sep-Pak cartridges (Waters) and dried to completion by vacuum centrifugation.

Preparative IEF Using FFE. Preparative IEF of the peptide mixture was performed using a commercially available Pro Team free-flow electrophoresis system from Tecan (Salzburg, Austria) which has been described.9 The separation media (Pro Team Prolyte solutions) was freshly prepared as per the manufacturer’s instructions. Stabilization media was 100 mM H2SO4 and 100 mM NaOH at the anode and cathode, respectively. The separation media contained 0.2% (hydroxypropyl)methyl cellulose (HPMC, Mn ∼86 000) to minimize electroosmotic flow during the separation, which can lead to a decrease in IEF resolution. The flow rate in the separation chamber was 60 mL/hour. The chromatin sample (500 µg) was dissolved in 50 µL of separation media (Prolyte 2, pH ∼7) and introduced into the separation chamber at a rate of 1 mL/hour. The sample was separated by IEF and collected in a 96 deep-well (1.2 mL per well) polystyrene microtiter plate (Fisher Scientific), with each well containing ∼300 µL after collection. IEF and sample collection took about 45 min from the start of sample loading to collection.

Preparation of FFE Fractions and Mass Spectrometric Analysis. Immediately after FFE fractionation, the pH of each FFE fraction was measured using a microelectrode (Accumet Combination Micro Electrode, Fisher Scientific). The microtiter plate was then stored at -20 °C. A 35-µL aliquot was taken from each of the microtiter plate wells and placed in an Amicon Ultrafree-MC centrifugal filter device (5 K MW cutoff, Millipore Corporation, Bedford, MA). The device was preconditioned by 50 µL methanol followed by 50 µL HPLC buffer B (0.1% formic acid in acetonitrile) before samples were loaded. Each aliquot was centrifuged at 5000g for 1 h and washed with 50 µL buffer B and 100 µL methanol, respectively. The combined filtrate was then evaporated to dryness by vacuum centrifugation and reconstituted to about 20 µL in 0.1% formic acid in water for µLC-MS/MS analysis. The µLC-MS/MS analysis was done using an Agilent 1100 binary HPLC system, coupled to an LCQ Classic ion trap mass spectrometer (Finnigan Mat, San Francisco, CA). Samples (20 µL) were manually loaded by sample loop injection using an integrated six-port valve. The samples were loaded to a vented precolumn for desalting, similar to a previously described design.13 This design consisted of a precolumn packed behind an integrated frit (100 µm i.d. × 1.0 cm Picofrit, New Objective, Cambridge, MA), located directly behind the analytical capillary column with an integrated spray tip (75 µm i.d. × 10 cm Picofrit, New Objective). The columns were packed in-house with 5 µm, 200 Å Magic C18AQ (Michrom BioResource, Auburn, CA). Samples were loaded onto the precolumn at a flow rate of ∼0.02 mL/min. Buffer A consisted of 0.1% formic acid in water; Buffer B consisted of 0.1% formic acid in acetonitrile. A linear gradient from 10 to 35% solvent B over 60 min was used for peptide elution, at a flow rate of ∼250 nL/min across the µLC column. Peptides eluting from the capillary column were automatically selected for CID by the mass spectrometer using a protocol that alternated between one MS and three MS/MS scans for the three most abundant precursor ions in the MS survey scan. Precursor m/z values selected for CID were dynamically excluded for 1.5 min after selection. The electrospray voltage was set to 1.7 kV. The mass scan range for both MS (three microscans) and MS/MS (four microscans) was from 400 to 1800 Dalton. The operation of the mass spectrometer was controlled by the software Xcalibur Reversion B.

(10) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Anal. Chem. 2002, 74, 5383-5392.
(11) Donovan, S.; Harwood, J.; Drury, L. S.; Diffley, J. F. Proc. Natl. Acad. Sci. U.S.A. 1997, 94, 5611-5616.
(12) Liang, C.; Stillman, B. Genes Dev. 1997, 11, 3375-3386.
(13) Yi, E. C.; Lee, H.; Aebersold, R.; Goodlett, D. R. Rapid Commun. Mass Spectrom. 2003, 17, 2093-2098.

Sequence Database Searching and Data Analysis. The MS/MS spectra were searched using SEQUEST14 (Thermo Finnigan, San Jose) against a yeast sequence database containing all 6139 open reading frames, with a reversed-sequence version of the same database appended to the end of the forward version.3 The search results were validated using the publicly available peptide validation program PeptideProphet (http://www.systemsbiology.org/Default.aspx?pagename=proteomicssoftware),10 and the data were organized using the program Interact.15 The predicted pI of peptide sequences was calculated according to Bjellqvist16 and also Shimura17 using an automated script and was automatically entered into the Interact sorted results.
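The peptide pI calculation described above can be illustrated with a short sketch. The Python example below is not the automated script used in this work: it assumes a simple Henderson-Hasselbalch charge model for an unmodified peptide, uses rounded, illustrative pKa values (they are not the Bjellqvist or Shimura parameter sets), and the names `net_charge` and `estimate_pi` are hypothetical.

```python
# Illustrative pKa values (assumed for this sketch only; NOT the
# Bjellqvist or Shimura parameter sets referenced in the text).
POSITIVE_PKA = {"Nterm": 9.0, "K": 10.0, "R": 12.0, "H": 6.0}
NEGATIVE_PKA = {"Cterm": 2.0, "D": 4.0, "E": 4.5, "C": 9.0, "Y": 10.0}


def net_charge(sequence, ph):
    """Net charge of an unmodified peptide at a given pH."""
    # Free N-terminus (protonated fraction) and free C-terminus (deprotonated fraction).
    positive = 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA["Nterm"]))
    negative = 1.0 / (1.0 + 10 ** (NEGATIVE_PKA["Cterm"] - ph))
    for residue in sequence.upper():
        if residue in POSITIVE_PKA:
            positive += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            negative += 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return positive - negative


def estimate_pi(sequence, low=0.0, high=14.0, tol=1e-4):
    """Bisect for the pH at which the net charge crosses zero."""
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged: pI lies at higher pH
        else:
            high = mid  # neutral or negative: pI lies at lower pH
    return (low + high) / 2.0
```

Because the net charge in this model decreases monotonically with pH, bisection is sufficient; an acidic peptide such as `estimate_pi("DDEE")` yields a lower value than a basic one such as `estimate_pi("KKRR")`. Absolute values depend on the pKa set chosen, which is why two prediction algorithms are compared in this study.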
Western Blot Analysis against Selected Markers. Strains expressing TAP-tagged versions of selected genes were obtained from Open Biosystems (www.openbiosystems.com) and were fractionated as described above to generate chromatin-enriched, Triton-insoluble pellets. Pellets were resuspended in lysis buffer and boiled after addition of SDS sample buffer, separated on a 10% SDS-PAGE gel, and analyzed by immunoblotting using anti-TAP antibody (Open Biosystems CAB 10001) as per the manufacturer’s instructions.

RESULTS AND DISCUSSION

Effectiveness of FFE for Separation of Complex Peptide Mixtures. The operational aspects of the FFE instrumentation used in this work have been previously described.8,9 Briefly, ampholyte buffers at carefully controlled pH are introduced at a constant flow between two sealed plates via peristaltic pumps. An electrical potential is applied orthogonal to the direction of liquid flow, establishing a stable pH gradient over a pH range of ∼3-10, and protein or peptide mixtures are then introduced through capillary tubing into the ampholyte flow path via a separate peristaltic pump and focused on the basis of isoelectric point. A counter flow at the opposite end of the chamber directs the flow through tightly aligned capillaries at the end of the separation chamber, and the focused proteins or peptides exiting each of these capillaries are directed into a 96-well microtiter plate.

In the present study, we chose to study a subcellular fraction of proteins from the yeast S. cerevisiae, prepared using an adapted chromatin isolation procedure.11,12 Figure 1A details the chromatin fractionation procedure. The effectiveness of this fractionation was evaluated by first detecting total DNA content in either the soluble supernatant fraction or the insoluble pellet fraction after the centrifugation step. Figure 1B shows the DNA detected in each of these fractions, indicating the enrichment of genomic DNA in the pelleted fraction. On the basis of densitometry scanning, we estimated up to a 5-fold enrichment of chromatin in the insoluble fraction (data not shown); Figure 1C shows immunoblotting results against a known cytoplasmic protein marker (Cdc37p) and a chromatin-bound protein (histone H4p), further indicating the effectiveness of this fractionation procedure to enrich for chromatin-associated proteins.

Figure 1. (A) Procedure for crude isolation of chromatin and associated proteins from yeast. (B) Presence of genomic DNA in whole cell extract (WCE), supernatant (SUP) containing the soluble protein fraction, and the insoluble pellet fraction (PEL). (C) Equivalent amounts of protein from whole cell extracts, supernatant, and the pellet fraction were separated by SDS-PAGE, transferred onto membrane, and analyzed by immunoblotting for chromatin-associated protein, histone H4p, and cytoplasmic protein, Cdc37p.

As the objective of this study was to evaluate the utility of preparative IEF by FFE from the perspective of separation power and also the use of peptide pI in sequence database searching,5,6 this subcellular fractionated mixture of proteins from the model organism S. cerevisiae represented an ideal sample, providing several independent ways to validate the effectiveness of using peptide pI for sequence database searching. These validation methods include (1) in-silico analysis using reverse database searching of a yeast sequence database;3 (2) investigation of subcellular protein localization distribution based on the Saccharomyces genome database (http://www.yeastgenome.org/)18 and previous studies in yeast;19 (3) biochemical validation by Western blots against selected proteins enriched in the fractionated mix. As such, the strategy presented here provides for novel validation criteria, beyond previous studies which have only concentrated on in-silico validation of the advantages of peptide pI in sequence database searching.

Five hundred micrograms of total protein was prepared using the adapted chromatin isolation procedure and fractionated using FFE. Immediately after FFE fractionation, the pH in each well of the microtiter plate (approximately 300 µL total volume per well) was measured using a micro pH electrode. The standard deviation of the pH measurements was calculated to be ±0.04 pH units on average. For mass spectrometry analysis, approximately 35 µL was removed from each well across the pH range of 3.5-10, and these were subjected to ultrafiltration to remove contaminating high molecular weight HPMC polymer components of the ampholyte mixtures. The filtrate was dried under vacuum and then loaded to a µLC column and analyzed by automated ESI-MS/MS. The MS/MS spectra were searched against the sequence database using the program Sequest,14 and the results were organized using the software tool Interact.15 The identified peptides were filtered using a recently described, publicly available probabilistic scoring algorithm called Peptide Prophet.10 This statistical algorithm assigns to each peptide sequence match a probability that it has been correctly identified using several different criteria, including Sequest scores (e.g., Xcorr and ∆corr) and additional sequence information, including the number of tryptic termini and precursor mass accuracy. Therefore, Peptide Prophet represents an advanced manner by which to filter large MS/MS data sets after database searching, as compared to filtering the data on the basis only of the Sequest scoring values, as has been used in other studies.2,3 Peptide Prophet assigns a probability score (P) between 0 and 1 to each match, with a score of 1 representing the highest confidence match. Initially, only peptides showing a probability score of 0.9 or greater were considered as correct hits, providing a high stringency filter for sequence matches.

(14) Eng, J.; McCormack, A. L.; Yates, J. R., III. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
(15) Han, D. K.; Eng, J.; Zhou, H.; Aebersold, R. Nat. Biotechnol. 2001, 19, 946-951.
(16) Bjellqvist, B.; Hughes, G. J.; Pasquali, C.; Paquet, N.; Ravier, F.; Sanchez, J. C.; Frutiger, S.; Hochstrasser, D. Electrophoresis 1993, 14, 1023-1031.
(17) Shimura, K.; Kamiya, K.; Matsumoto, H.; Kasai, K. Anal. Chem. 2002, 74, 1046-1053.
(18) Christie, K. R.; Weng, S.; Balakrishnan, R.; Costanzo, M. C.; Dolinski, K.; Dwight, S. S.; Engel, S. R.; Feierbach, B.; Fisk, D. G.; Hirschman, J. E.; Hong, E. L.; Issel-Tarver, L.; Nash, R.; Sethuraman, A.; Starr, B.; Theesfeld, C. L.; Andrada, R.; Binkley, G.; Dong, Q.; Lane, C.; Schroeder, M.; Botstein, D.; Cherry, J. M. Nucleic Acids Res. 2004, 32 (Database issue), D311-D314.
(19) Huh, W. K.; Falvo, J. V.; Gerke, L. C.; Carroll, A. S.; Howson, R. W.; Weissman, J. S.; O’Shea, E. K. Nature 2003, 425, 686-691.

To calculate peptide pI values, we compared two different algorithms. One of these algorithms, described by Bjellqvist et al.,16 was first developed for predicting pI values of proteins separated using immobilized pH gradients; the other algorithm was described by Shimura et al.17 and was based on data derived using capillary isoelectric focusing of peptides. Using P ≥ 0.9 as a filter, Figure 2A shows the results of the average calculated peptide pI for identified peptide sequences using these two algorithms, along with the measured pH for each corresponding well analyzed across the microtiter plate. The standard deviation of calculated peptide pI from each well for both algorithms is also shown. A total of 124 193 MS/MS spectra were searched against a yeast sequence database, resulting in 3745 unique peptide sequence matches with P ≥ 0.9.

Figure 2. (A) Plot of measured pH value from each microtiter plate well versus average calculated pI of identified peptide sequences for two different pI prediction algorithms (Bjellqvist16 or Shimura17). (B) Distribution of calculated pI values using the Bjellqvist algorithm from well number 32.

The results shown in Figure 2A demonstrate several important points indicative of the performance of FFE in preparative IEF of peptide mixtures. The plotted pH values measured across the wells of the microtiter plate indicate the ability of the FFE system to establish a stable pH gradient across a working range of ∼3-10, consistent with other previous descriptions of this FFE system.8,9 These results also demonstrate the ability of FFE to resolve peptides from a complex mixture on the basis of peptide pI. Generally, the two algorithms compared for predicting peptide pI values show similar results across the pH range, with good correlation between average calculated peptide pI and measured pH in each well. Notably, the measured pH and average calculated pI values for both algorithms show the closest match in the acidic region of the pH gradient (∼3.5-6.5) and at the basic end of the gradient (∼8-10), with the neutral region of the gradient (∼6.5-7.5) showing a larger discrepancy between the values. This behavior is similar to other descriptions of IEF of peptide mixtures using gel-based methods.6 The Shimura algorithm shows a better correlation to the measured pH in this region, indicating that this calculation method may be slightly better suited overall for the conditions used in IEF of peptides by FFE than that described by Bjellqvist and colleagues, although for much of the pH range the algorithms give identical results. The majority of the peptides (86.3% of the total) identified were in the acidic region of the gradient having calculated pI values in the range of 3.5-7, with very few peptides being identified with a calculated pI in the range of ∼7-8; this pI distribution of tryptic peptides is in concordance with previous descriptions of separations by IEF using complex peptide mixtures from soluble whole cell extracts from Escherichia coli5 and rat.7 The low numbers of peptides identified in this region are also the cause of the larger standard deviation observed in the calculated peptide pI values in these wells (Figure 2A).
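The per-well comparison underlying Figure 2A (measured pH against the mean and standard deviation of calculated peptide pI) can be sketched as follows. This is a minimal illustration only: the function name `summarize_wells`, the data layout, and all pI values below are hypothetical; only the measured pH of 4.34 for well 32 is taken from the text.

```python
from statistics import mean, stdev


def summarize_wells(measured_ph, peptide_pis):
    """Map well -> (measured pH, mean pI, SD of pI, mean pI minus measured pH)."""
    summary = {}
    for well, ph in measured_ph.items():
        pis = peptide_pis.get(well, [])
        if len(pis) < 2:
            # A sample standard deviation needs at least two peptides per well.
            continue
        average = mean(pis)
        summary[well] = (ph, average, stdev(pis), average - ph)
    return summary


# Hypothetical example data; well 60 has too few peptides to summarize.
measured_ph = {32: 4.34, 60: 7.10}
peptide_pis = {32: [4.10, 4.30, 4.40, 4.50, 4.60], 60: [6.2]}
summary = summarize_wells(measured_ph, peptide_pis)
```

Wells with very few identified peptides are skipped here because their standard deviation is undefined or unstable, which mirrors the larger scatter noted for the sparsely populated neutral region of the gradient.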


For the purposes of this present study, we chose to focus on eight FFE fractions in the acidic region of the pH gradient (wells 29-36, measured pH 3.99-4.99), as this region showed the highest number of identified peptides as well as close concordance between measured pH values of the FFE fractions and calculated pI values of identified peptide sequences using both prediction algorithms (average deviation of ±0.12 between the average peptide pI and the measured pH using the Bjellqvist algorithm). Figure 2B shows a plot of the distribution of calculated peptide pI values from one of these acidic fractions (well 32). The distribution of peptides is approximately symmetrical, with very close concordance of the measured pH (4.34) in the well to the average peptide pI (4.38). The spread of peptide pI values is within ∼0.45 pH units of the measured pH.

Subcellular Distribution of Identified Proteins. Using the well-annotated proteome in yeast,18,19 we first plotted the known cellular localization of those proteins identified by analysis of the eight selected acidic FFE fractions. Initially, no pI information was included in filtering the sequence database search results using Sequest. Sequence matches were filtered using Peptide Prophet, and only those matches having a probability score of 0.9 or greater were considered to be correct. From the eight selected fractions, a total of 16 285 MS/MS spectra were searched against a database containing all 6139 open reading frame sequences in yeast, resulting in the identification of 281 unique proteins derived from 1165 unique peptide sequence matches. Figure 3 shows the known subcellular distribution of the 281 total proteins identified using these sequence match criteria. It is immediately clear from Figure 3 that a high proportion (∼50%) of the identified proteins are known to localize to the nucleus; a significant portion of these are also low-abundance proteins involved in transcription or DNA-binding.18 It is estimated that nuclear-localized proteins in yeast make up ∼25% of the total proteome.19,20 Therefore, these results confirm that our chromatin isolation procedure is enriching for nuclear-localized proteins. For the other proteins identified, a large proportion are either involved with translation or compartmentalized to the cytoplasm or other organelles. It is unclear whether these proteins are artifacts of the chromatin isolation procedure or are truly associated with chromatin, either directly or indirectly. These results are in concordance with a previous proteomic study investigating chromatin-associated proteins in a mammalian system, where significant numbers of translational and cytoplasmic proteins were also identified using a similar chromatin isolation procedure.21 For the purpose of this study, it is most important to recognize that the isolation procedure is enriching for nuclear-localized proteins, which will be used as an independent validation of sequence database search results incorporating peptide pI.

Minimizing False Negative and False Positive Identifications of Proteins via a Combination of pI Filtering and Probability Scoring. We next investigated our hypothesis that the combination of peptide pI constraints with probability scoring will increase the number of high confidence peptide sequence matches from sequence database searching of MS/MS data.

(20) Kumar, A.; Agarwal, S.; Heyman, J. A.; Matson, S.; Heidtman, M.; Piccirillo, S.; Umansky, L.; Drawid, A.; Jansen, R.; Liu, Y.; Cheung, K. H.; Miller, P.; Gerstein, M.; Roeder, G. S.; Snyder, M. Genes Dev. 2002, 16, 707-719.
(21) Shiio, Y.; Eisenman, R. N.; Yi, E. C.; Donohoe, S.; Goodlett, D. R.; Aebersold, R. J. Am. Soc. Mass Spectrom. 2003, 14, 696-703.

Figure 3. Subcellular distribution of proteins identified at high stringency (P ≥ 0.9).

Recent studies have provided evidence that incorporation of peptide pI into peptide sequence database searching provides a reliable way in which to increase the number of sequence matches, while minimizing the false positive identifications.5,7 These studies have relied on in-silico validation of this strategy, employing a reverse database searching approach to measure the false positive rate for any given scoring criteria of database search results.3 Using this reversed database searching method, the false positive rate is calculated using eq 1:

false positive % ) [2nreverse/(nforward + nreverse)] × 100 (1) where nforward is the number of matches against peptide sequences in the real, forward sequence database, and nreverse is the number of matches to sequences from the nonsense, reversed database. A false positive rate of 1% or less has been considered to be acceptable for sequence database searching of large data sets.5 For the work presented here, a chimeric database consisting of a forward and reverse-sequence database was used, containing protein sequences from the 6139 known open reading frames in S. cereviae.3 Our choice of a subcellular fraction of proteins in yeast provided a unique way in which to validate the use of peptide pI constraints in database searching, beyond previously used computational methods (e.g., reverse database searching) that have been used as the sole validation in other studies.3,5,7 Since the subcellular localization of the majority of the proteome of the model organism S. cerevisiae is known, the distribution of cellular localization for the proteins identified using a subcellular enrichment procedure (i.e., chromatin fraction) can be used as an independent validation of the results. This would not be possible if the entire proteome of model organisms derived from whole cell lysates was being investigated. Furthermore, analysis of an enriched subset of proteins allows for independent validation by

biochemical methods (i.e., immunoblotting against specific protein markers). Therefore, we employed a three-pronged approach to validating the use of peptide pI in sequence database searching including in-silico measurement of false positive rates using reverse database searching, subcellular protein localization distribution analysis, and Western blot detection of selected proteins. Figures 2 and 3 represent results derived from peptide matches filtered with high stringency, considering only those matches with a probability score of P g 0.9 or greater. Although this level of stringency provides high confidence peptide matches at a minimal false positive error rate, it also sacrifices sensitivity.10 This means that there is an increased number of false negatives when employing high-stringency filtering criteria, where false negatives are defined as those sequence matches that are actually correct but do not score well enough to pass the strict filtering criteria.7 To investigate the effect of the peptide pI constraint to lower the probability scoring criteria and effectively decrease the number of false negative identifications while keeping the false positive rate to a minimum, we used the measured pH values from microtiter wells 29-36 and filtered the peptide sequence matches using peptide pI values at different probability scoring values. We based these filtering criteria on the following expected relationship (eq 2):

$$\mathrm{pH}_{\text{meas}} - \Delta\mathrm{pH} \le \text{peptide pI} \le \mathrm{pH}_{\text{meas}} + \Delta\mathrm{pH} \tag{2}$$

where pHmeas is the measured pH from any given microtiter plate well, peptide pI is calculated on the basis of the amino acid sequence,16 and ∆pH is the error tolerance between pHmeas and the calculated pI. It is expected that as the ∆pH value is decreased for any P value, the false positive rate should also decrease. This is because the false positive matches should have random calculated pI values, whereas true matches should have peptide pI values close to the corresponding pHmeas value. Figure 4 shows

Analytical Chemistry, Vol. 77, No. 10, May 15, 2005


Figure 4. False positive rate versus probability score and the effect of peptide pI filtering.

the effects of ∆pH variation on false positive % at different P values. These data clearly show a decrease in false positive rate as ∆pH is decreased from ±8 to ±0.25. It is also clear from Figure 4 that when using a high-stringency filter (P ≥ 0.9), the error rate is minimized to zero; however, the tradeoff of this strict filtering level is a loss of sensitivity, which increases the number of false negative identifications and limits the number of proteins identified.10 To decrease these false negative identifications and keep the false positive rate at 1% or below, we determined that a probability score of ≥0.37 combined with a ∆pH of ±0.5 gave optimal filtering criteria for peptide sequence matches, with a false positive rate of 0.99% (nreverse = 7); the false positive rate increases to 1.74% (nreverse = 13) when using a wide-open pH tolerance of ±8 (effectively no pI filtering), demonstrating the utility of the pI constraint in minimizing false positive identifications. This false positive rate was achieved in a completely automated manner, without need for manual inspection of data, as was necessary in a previous study determining optimal Sequest scoring criteria.3 We chose a ∆pH tolerance of ±0.5 because it was slightly greater than the average standard deviation of calculated pI values across these eight wells (Figure 2A) and was also within the spread of calculated pI values within these wells, as indicated in Figure 2B. Although a ∆pH of ±0.25 provided a lower false positive rate in general, a small number of peptides having very high probability scores (0.9 or greater) were discarded using this tight pH tolerance. This is most likely due to a combination of errors in the peptide pI calculation algorithms used5 and the IEF resolution using FFE; given these factors, a conservative tolerance of ±0.5 was selected.
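The filtering described by eqs 1 and 2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the software used in this study: the `PeptideMatch` record, the function names, and the example matches are hypothetical, and the sketch assumes that each match already carries its probability score, its calculated pI, and a flag marking hits to the reversed database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideMatch:
    sequence: str       # matched peptide sequence (placeholder strings below)
    probability: float  # peptide match probability score P
    calc_pi: float      # pI calculated from the amino acid sequence
    is_reverse: bool    # True if the match is to the reversed database


def within_pi_window(calc_pi: float, ph_meas: float, delta_ph: float) -> bool:
    """Eq 2: pH_meas - delta_pH <= peptide pI <= pH_meas + delta_pH."""
    return ph_meas - delta_ph <= calc_pi <= ph_meas + delta_ph


def filter_matches(matches, ph_meas, min_probability, delta_ph):
    """Keep matches that pass both the probability and the pI criteria."""
    return [
        m for m in matches
        if m.probability >= min_probability
        and within_pi_window(m.calc_pi, ph_meas, delta_ph)
    ]


def false_positive_percent(n_forward: int, n_reverse: int) -> float:
    """Eq 1: 2 * n_reverse / (n_forward + n_reverse) * 100."""
    total = n_forward + n_reverse
    return 100.0 * 2 * n_reverse / total if total else 0.0


def false_positive_percent_of(matches) -> float:
    """Apply eq 1 to a list of filtered matches."""
    n_reverse = sum(1 for m in matches if m.is_reverse)
    return false_positive_percent(len(matches) - n_reverse, n_reverse)


# Hypothetical matches from one well with a measured pH of 4.4.
matches = [
    PeptideMatch("PEPTIDE_A", 0.95, 4.37, False),
    PeptideMatch("PEPTIDE_B", 0.45, 4.53, False),
    PeptideMatch("PEPTIDE_C", 0.40, 4.14, False),
    PeptideMatch("PEPTIDE_D", 0.50, 9.80, True),   # reversed hit, pI far from well pH
    PeptideMatch("PEPTIDE_E", 0.20, 4.40, False),  # fails the probability cutoff
]

# Compare a wide-open tolerance with the +/-0.5 tolerance at P >= 0.37.
for delta_ph in (8.0, 0.5):
    kept = filter_matches(matches, ph_meas=4.4, min_probability=0.37, delta_ph=delta_ph)
    print(delta_ph, len(kept), false_positive_percent_of(kept))
```

In this toy data set the reversed-database hit survives the wide-open tolerance but is rejected by the ±0.5 window, because its calculated pI is unrelated to the measured pH of the well, which is the behavior the pI constraint relies on.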
Using these optimized filtering criteria for peptide sequence matches, a total of 89 new peptide sequences were considered to be correct matches as compared to the original analysis using a peptide probability score of P ≥ 0.9. These 89 new peptides would constitute false negative peptide matches at high stringency (P ≥ 0.9). Thirty-six of these peptide sequences were derived from 34 completely new proteins not identified at P ≥ 0.9. Table 1 summarizes these newly identified 36 peptide sequences, including the Sequest and probability score for each, as well as information


on the protein from which each peptide is derived. The other 54 peptide sequences (not shown in Table 1) represented additional peptide matches to proteins already identified at high stringency, thus increasing the sequence coverage of these proteins. Figure 5 shows the subcellular localization distribution of the 34 new proteins identified; it is immediately apparent that the majority (∼64%) of the identified proteins have known nuclear localization, with a significant portion of these having known DNA- or RNA-binding properties. This localization distribution is also in keeping with Figure 3 for proteins identified using high-stringency filtering, providing further validation, independent of the false positive rate or probability scores, that these peptide sequences are indeed correct assignments.

Increasing the Confidence of Matches for “Single Hits” and Partially Tryptic Peptides Using Peptide pI. A common occurrence in sequence database searching of MS/MS data is the so-called “single hit”, a protein identified by only a single peptide sequence match in the database search of MS/MS data. These single-hit peptides often make up a significant proportion of sequence matches and are often derived from low-abundance proteins.2,3 In the present study, of the 315 proteins identified at P ≥ 0.37 and ∆pH = ±0.5 from the eight selected FFE fractions, single hits constituted 59.4% of the total. Because of uncertainties in the confidence of these single hits, in many cases these matches are either discarded from the final analysis or require manual validation of the sequence match.2,3 It would therefore be beneficial to provide additional constraints to these single-hit matches to increase the confidence in these matches and reduce the need for manual validation of the data.
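As a small illustration of the single-hit definition above, the following Python sketch tallies proteins supported by exactly one distinct peptide sequence. The function names and the (protein, peptide) pairs are hypothetical placeholders rather than data from this study, and the sketch assumes that repeated observations of the same peptide sequence still count as a single hit.

```python
from collections import defaultdict


def peptides_per_protein(matches):
    """Map each protein to the set of distinct peptide sequences matched to it."""
    by_protein = defaultdict(set)
    for protein, peptide in matches:
        by_protein[protein].add(peptide)
    return by_protein


def single_hit_proteins(matches):
    """Return proteins supported by exactly one distinct peptide sequence."""
    grouped = peptides_per_protein(matches)
    return sorted(p for p, peptides in grouped.items() if len(peptides) == 1)


# Hypothetical (protein, peptide) pairs; names are placeholders.
matches = [
    ("PROT_A", "AAAK"), ("PROT_A", "CCCR"),
    ("PROT_B", "DDDK"),
    ("PROT_C", "EEEK"), ("PROT_C", "EEEK"),  # same peptide seen twice: one distinct hit
]

singles = single_hit_proteins(matches)
fraction = 100.0 * len(singles) / len(peptides_per_protein(matches))
print(singles, round(fraction, 1))
```

Grouping by distinct sequence rather than by spectrum count is a design choice of this sketch; a pipeline that counts spectra instead would classify PROT_C differently.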
Using the filtering criteria of P ≥ 0.37 and ∆pH of ±0.5, the false positive rate is 6.2% (nreverse = 6 out of 193 total matches) when considering only the proteins identified by a single peptide sequence; the same calculation at P ≥ 0.37 and ∆pH of ±8 increases this false positive rate to 10.8% (nreverse = 11 out of 203 total matches). Previous studies have estimated a false positive rate of 26%3 for single-hit matches when using Sequest score filtering alone. Therefore, the use of peptide pI combined with

Table 1. 34 Additional Proteins Identified at P ≥ 0.37 and ∆pH = ±0.5

peptide sequence (a) | pI | Xcorr (charge state) | ∆Cn | prob score (P) | protein | cellular component
K.IQMLDLPGIIDGAK.D | 4.21 | 2.9 | 0.26 | 0.88 | RBG1 | cytoplasm
K.NYENGFINNPIVISPTTTVGEAK.S | 4.53 | 1.8 | 0.339 | 0.62 | IMD1 | unknown
K.SLVVDSEGQIR.Y | 4.37 | 2.0 | 0.255 | 0.78 | YCR08 | nucleolus
R.LFEMGFQEQLNELLASLPTTR.Q | 4.25 | 1.9 | 0.399 | 0.83 | DBP10 | nucleolus
K.TPSLTVYLEPGHAADQEQAK.L | 4.65 | 2.5 | 0.365 | 0.88 | RPO21 | nucleus
K.ARVDQLNLNLTDDQIK.E | 4.43 | 1.9 | 0.367 | 0.49 | LYS20 | nucleus/mitochondrial
K.ISDDILSVLDSHLIPSATTGESK.V | 4.22 | 2.1 | 0.326 | 0.42 | BMH2 | nucleus
R.IPGYSADEIR.S | 4.37 | 2.1 | 0.302 | 0.47 | TAF12 | nucleus
R.SKLDLIEEVEPLVR.T | 4.41 | 2.4 | 0.216 | 0.54 | SPC19 | spindle pole
K.AVVESVGAEVDEAR.I | 4 | 2.4 | 0.257 | 0.90 | RPP2B | cytoplasmic
R.NNTSTLAQIESNVLEDFEFPKDER.N | 4.08 | 3.6 | 0.313 | 0.83 | YER049 | nucleus
R.LLPPNLTADEFFAILR.D | 4.37 | 1.6 | 0.44 | 0.89 | UPF3 | cytoplasm/polysome
K.AALEAGAFEAVTSNHWAEGGK.G | 4.75 | 1.9 | 0.317 | 0.58 | ADE3 | nucleus/cytoplasm
K.INGMPEDVPLSVTPGIQSALNILQSYK.S | 4.37 | 1.9 | 0.277 | 0.70 | TRA1 | nucleus
K.LPLTDEQTAEGR.K | 4.14 | 2.3 | 0.47 | 0.82 | CRP1 | nucleus
R.EKDCSSSSEVESQSK.C | 4.41 | 2.6 | 0.283 | 0.78 | NVJ1 | nuclear membrane
K.WVWNLFEDAFEK.A | 4.14 | 2.1 | 0.162 | 0.46 | KGD1 | mitochondrial
K.YTPDELTTVLNQLVR. | 4.37 | 2.5 | 0.312 | 0.85 | EXO70 | exocyst/bud neck
K.ADDHASVQINVAK.V | 5.21 | 1.5 | 0.384 | 0.79 | RPS21B | cytoplasmic
R.HGDEEDESLSMDQVK.L | 4.01 | 2.4 | 0.253 | 0.69 | UTP11 | nucleolus
R.NYPEPLSGEQLSLLSIK.Y | 4.53 | 2.2 | 0.382 | 0.86 | RIX7 | nucleolus/nucleus
K.YLEELQR.K | 4.53 | 2.7 | 0.219 | 0.89 | RPL15 | cytoplasmic
R.ELAQQVYNVLEK.L | 4.53 | 2.1 | 0.226 | 0.74 | DBP9 | nucleolus
R.IPGVILDELK.T | 4.37 | 2.1 | 0.197 | 0.43 | SGD1 | nucleus
R.TQDVPQTELQEK.V | 4.14 | 1.7 | 0.246 | 0.45 | UTP14 | nucleolus/nucleus
K.LETDESPIQTK.S | 4.14 | 2.4 | 0.286 | 0.60 | HXT2 | plasma membrane
K.NKVEQQENDEEPEKDDIIR.S | 4.12 | 3.9 | 0.256 | 0.74 | ESC1 | nucleus
R.EYLNLPEHIVPGTYIQER.N | 4.75 | 1.8 | 0.396 | 0.80 | RPS10B | cytoplasmic
R.DGDDLIYTLPLSFK.E | 3.93 | 2.5 | 0.444 | 0.89 | SIS1 | cytoplasmic
K.GGIGAVFAELNQGENITK.G | 4.53 | 2.3 | 0.178 | 0.60 | SRV2 | actin cortical patch
R.LAASNLEDLVK.A | 4.37 | 2.2 | 0.285 | 0.77 | NOG2 | nucleus/nucleolus
K.ISIQEGEHSSVEDAR.A | 4.4 | 2.5 | 0.333 | 0.86 | REX4 | nucleus/nucleolus
R.SLGHWVDSNGEPIDGK.L | 4.53 | 2.3 | 0.449 | 0.50 | RPA43 | nucleus
R.KTDIIPIASGEDR.S | 4.56 | 2.2 | 0.259 | 0.61 | RRP15 | nucleus/nucleolus

(a) Partially tryptic peptides indicated in bold (those from LYS20, YER049, NVJ1, and RRP15) were derived from proteins that were selected for validation by immunoblotting (see Figure 6).

Figure 5. Subcellular distribution of 34 additional proteins identified at P ≥ 0.37 and ∆pH = ±0.5.

probability scoring significantly increases the accuracy of single-hit matches. When analyzing complex mixtures of peptides derived from proteolytic digestion (e.g., with trypsin), miscleaved peptides represent another problematic class of peptides. These are peptides in which

trypsin does not hydrolyze the peptide at every expected arginine or lysine residue. These partially tryptic peptides can make up a significant number of peptides matched in sequence database searches. For example, almost 20% of the 89 new peptide matches using the scoring criteria of P ≥ 0.37 and ∆pH of ±0.5 were


Figure 6. (A) Immunodetection of TAP-tagged versions of four different proteins, Yer049Wp, Nvj1p, Lys20p, and Rrp15p, that had been identified by a single, partially tryptic peptide at P ≥ 0.37 and ∆pH = ±0.5 in the chromatin-enriched fraction. With the exception of Nvj1-TAP, the remaining proteins were identified by the presence of bands at the expected molecular weights, as indicated by arrowheads. (* denotes nonspecific background bands.) (B) Representative MS/MS spectrum of a partially tryptic peptide derived from the protein Rrp15. The singly charged y and b ions detected are indicated in bold in the ion list, and the corresponding peaks are labeled in the spectrum.

partially tryptic sequences (16 peptides out of 89). It has been proposed that when using Sequest scores alone, more stringent match criteria should be employed to accept partially tryptic peptide sequence matches.2,3 The use of peptide pI as a constraint represents a more accurate way to validate these matches, especially in proteolysis using trypsin, in which miscleavages will have a significant effect on peptide pI because of the inclusion of basic lysine or arginine residues within the peptide sequence. To further validate the accuracy of partially tryptic peptide matches using the peptide pI constraint, we selected four peptide sequence matches (indicated in bold in Table 1), which were derived from proteins identified by a single peptide sequence match to a partially tryptic sequence. Using yeast strains expressing recombinant versions of these proteins containing a tandem affinity purification (TAP) tag (http://www.openbiosystems.com/yeast-tap-fusion-library.php), we isolated the chromatin protein fraction and performed an immunoblot against the specific TAP-tagged proteins. Figure 6A shows the results of

these Western blots. The results of these immunoblots independently confirm the presence of three of these proteins (Rrp15p, Lys20p, and YER049Wp) in the subcellular fractionated protein mixture, further validating the veracity of the partially tryptic sequence matches using a combination of probability scoring and peptide pI. We also show a representative MS/MS spectrum of one of these partially tryptic peptides (from Rrp15p) with the matching b and y ions expected for this sequence labeled in the spectrum (Figure 6B). One of the selected proteins (Nvj1-TAP) was not detected in the chromatin fractionated sample. In contrast to the other three proteins, Nvj1p is present at low copy number (1900 copies/cell) and was not detected even in whole cell yeast lysates (data not shown). Thus, the low copy number of this protein is a likely basis for our inability to detect it in the chromatin fraction. Using previously described criteria based solely on Sequest scores for partially tryptic peptides (Xcorr ≥ 3.0 or 4.0 for 2+ and 3+ peptides, respectively, and ∆corr ≥ 0.08),3 none of the partially tryptic peptide sequence matches described

here would have been accepted as correct hits. This further supports the effectiveness of peptide pI in providing more accurate peptide matches from sequence database searching.

CONCLUSIONS

Preparative IEF of complex peptide mixtures represents an emerging, powerful first dimension of separation for shotgun proteomics. This “information-added” separation approach not only provides a high-resolution method for separating peptide mixtures but also introduces peptide pI as a constraint in sequence database searching of MS/MS data. The results of our study demonstrate the use of FFE as an emerging tool for preparative IEF of peptide mixtures and provide a novel validation of the use of peptide pI in combination with probability scoring to increase the accuracy of sequence database searching in tandem mass spectrometry-based proteomic studies. We demonstrate that lowering the probability score filtering criterion to 0.37 and using a pI constraint of ±0.5 pH units for any given fraction of peptides decreased the false negative sequence matches while keeping the false positive rate below 1%. Although we demonstrate this approach for a limited pH region (∼4-5), it should be extendable across the entire pH gradient; an adjustment to the ∆pH tolerance may be necessary for the neutral pH region, where there is a slightly larger discrepancy between peptide pI and measured pH values. Importantly, this filtering can be done in a completely automated fashion, without the need for manual interpretation of the data.

The present study was conducted using a quadrupole ion trap mass spectrometer, which gives limited mass accuracy to the precursor peptide mass assignments. A similar study using more accurate mass detection (e.g., time-of-flight or Fourier transform ion cyclotron resonance) could provide added constraints in terms of peptide precursor mass, which may further help to decrease false negative assignments and the false positive rate from database searching. Combining peptide pI with an orthogonal property such as peptide hydrophobicity using reverse-phase µLC may also have great potential to more rapidly and accurately identify peptides, either by MS/MS analysis or possibly by MS alone.22,23 The approach described here, combining peptide pI with probability scoring to obtain optimized criteria for database searching, should also show general applicability to higher organisms, including humans, where methods to accurately identify proteins in tandem mass spectrometry-based proteomic studies present a significant challenge.24

ACKNOWLEDGMENT

The authors thank the laboratory of Steve Gygi for the use of the reversed yeast sequence database, Jimmy Eng for assistance with the Interact software and the automated calculation of peptide pI values, and Brennan O’Callaghan for technical assistance with the figures. Instrumental resources for this work were made possible through the Minnesota Consortium for the Life Sciences mass spectrometry facility at the University of Minnesota.

Received for review November 30, 2004. Accepted February 17, 2005. AC0482256

(22) Strittmatter, E. F.; Kangas, L. J.; Petritis, K.; Mottaz, H. M.; Anderson, G. A.; Shen, Y.; Jacobs, J. M.; Camp, D. G., II; Smith, R. D. J. Proteome Res. 2004, 3, 760-769.
(23) Cargile, B. J.; Stephenson, J. L., Jr. Anal. Chem. 2004, 76, 267-275.
(24) Resing, K. A.; Meyer-Arendt, K.; Mendoza, A. M.; Aveline-Wolf, L. D.; Jonscher, K. R.; Pierce, K. G.; Old, W. M.; Cheung, H. T.; Russell, S.; Wattawa, J. L.; Goehle, G. R.; Knight, R. D.; Ahn, N. G. Anal. Chem. 2004, 76, 3556-3568.
