Added Value for Tandem Mass Spectrometry Shotgun Proteomics Data Validation through Isoelectric Focusing of Peptides
Manfred Heller,†,‡ Mingliang Ye,§ Philippe E. Michel,† Patrick Morier,† Daniel Stalder,‡ Martin A. Jünger,| Ruedi Aebersold,§,| Frédéric Reymond,† and Joël S. Rossier*,†
DiagnoSwiss SA, Monthey, Switzerland, Department of Clinical Research, University Hospital, 3010 Bern, Switzerland, Institute for Systems Biology, Seattle, Washington 98103, and Institute for Molecular Systems Biology, Swiss Federal Institute of Technology (ETH), Zürich, Switzerland.
Received June 24, 2005
A very popular approach in proteomics is the so-called “shotgun LC-MS/MS” strategy. In its most widely used form, a total protein digest is separated by ion exchange fractionation in the first dimension, followed by off- or on-line RP LC-MS/MS. We replaced the first dimension by isoelectric focusing in the liquid phase using the Off-Gel device, producing 15 fractions. As peptides are separated by their isoelectric point in the first dimension and by hydrophobicity in the second, these experimentally derived parameters (pI and RT) can be used to validate potentially identified peptides. We applied this strategy to a cellular extract of Drosophila Kc167 cells and identified peptides with two different database search engines, namely PHENYX and SEQUEST, with PeptideProphet validation of the SEQUEST results. PHENYX returned 7582 potential peptide identifications and SEQUEST 7629. The SEQUEST results were reduced to 2006 identifications by validation with PeptideProphet. Validation of the PeptideProphet, SEQUEST, and PHENYX results by the pI and RT parameters confirmed 1837 PeptideProphet identifications, while another 1130 peptides in the remainder of the SEQUEST results were found to be likely hits. Validation of the PHENYX results allowed a firm p-value threshold of 95% to be set, and a final count of 2034 highly confident peptide identifications was achieved after pI and RT validation. Although the PeptideProphet and PHENYX datasets are both of very high confidence, the overlap of common identifications was only 79.4%, which can be explained by the fact that the data were interpreted by searching different protein databases with two search engines based on different algorithms. The approach used in this study allowed for an automated and improved data validation process for shotgun proteomics projects, producing MS/MS peptide identification results of very high confidence.
Keywords: LC-MS/MS • isoelectric focusing • retention time • peptide identification • database searching • proteomics • data validation
* To whom correspondence should be addressed. Dr. Joel S. Rossier, DiagnoSwiss SA, Route de l'Ile-au-Bois 2, c/o CIMO SA, CH-1870 Monthey, Switzerland. Fax: +41 24 471.49.01. E-mail: [email protected].
† DiagnoSwiss SA. ‡ Department of Clinical Research, University Hospital. § Institute for Systems Biology. | Institute for Molecular Systems Biology, Swiss Federal Institute of Technology (ETH).
Journal of Proteome Research 2005, 4 (No. 6), 2273-2282. 10.1021/pr050193v CCC: $30.25. © 2005 American Chemical Society. Published on Web 10/05/2005.

Introduction
The term proteomics was coined in 1994 to describe a technology with which the entire protein complement of a biological sample could be described.1 At that time, proteomics was based on two-dimensional gel electrophoresis, which separates proteins by isoelectric point in the first dimension and by molecular weight in the second. This technique seemed suitable for the separation of several thousand proteins, including isoforms, because several thousand spots could be detected on a large format gel. With the performance improvements achieved in mass spectrometry during the second half of the past decade, more and more 2D gel spots could be submitted for protein identification. It was soon realized that, although 2D gels offer tremendous separation power, only a few, mostly highly abundant, proteins were detectable and identifiable with them. This restriction can be explained by a limited loading capacity, tremendous differences in the physicochemical properties of proteins, problems in transferring proteins from the first dimensional gel onto the second one, and artifacts introduced during sample treatment.2

Technical advances on several fronts spurred the development of alternative proteomics methods, with a trend away from labor intensive gel technology toward more automatable methods that couple mass spectrometry more directly with liquid chromatography and work with proteolytic digests of complex protein samples rather than intact proteins in order to reduce physicochemical constraints.3,4 Both approaches, the 2-dimensional peptide separation by strong cation exchange and reversed phase liquid chromatography and the isotope coded affinity technology, have since been refined and combined for increased proteome coverage.5 More recently, different groups separated intact proteins in the liquid phase with multidimensional chromatography before mass spectrometric detection and identification.6-9 A very interesting attempt that was applied to subproteomes is the combination of a top-down and a bottom-up approach. The exact mass of intact proteins was determined for the detection of potential isoforms and for the relative quantification of the different proteins based on the mass spectrometry signal, and the proteins were identified by mass spectrometric analysis of the protein digests.10,11

Another tool for fractionating proteins and peptides by their isoelectric point, called off-gel isoelectric focusing (OGE), has recently been introduced and presented as a versatile device.12,13 Intact human plasma protein isoforms differing in pI were separated in a first stage IEF, followed by proteolysis of the proteins and subsequent isoelectric focusing of the peptides with the same device. The resulting 225 peptide fractions (15 protein fractions and 15 peptide fractions of each protein fraction) were directly amenable to LC-MS/MS analysis. Such large scale proteome analyses result in a tremendous amount of peptide identification data. Validation of these data becomes a major task, and researchers currently rely on peptide and/or protein scoring values returned by the identification software.
In doing so, one is confronted with identification data that either are contaminated with a high percentage of false positive identifications or suffer a large loss of correct identifications, termed false negatives, as illustratively documented by Cargile and colleagues.14 This had been recognized by different groups, who subsequently engineered new statistical models allowing for the validation of identification results returned by the database searching software, mostly SEQUEST.15-17 More recently, the new identification software OLAV was developed and is now available as PHENYX from GeneBio.18 While the frequently used SEQUEST19 and MASCOT20 software identify peptides by correlation of experimental with theoretical fragmentation spectra without involving a model, PHENYX uses a more efficient scoring scheme based on signal detection theory coupled with pattern recognition for likelihood ratio calculation in order to distinguish correct from false peptide identifications. In addition to improvements on the software side, the incorporation of experimentally measured peptide parameters into the peptide identification process will significantly increase the correct identification ratio in shotgun proteomics. In a couple of publications in 2004, the group of Stephenson suggested using experimentally determined peptide pI values to validate the correctness of peptide identifications in shotgun proteomics experiments.14,21 We have successfully used this concept for the human plasma protein work published recently and have combined it in a manual fashion with the relative retention time behavior of peptides from the reversed phase column.13 The calculation of peptide retention times was first reported by Martin in 1948,22 and since then different groups have published peptide retention time prediction algorithms.
One of the more elaborate algorithms was developed by Sakamoto et al.,23 and more recently, the group of Smith has refined it and used it as a peptide identification criterion.24,25 We have written a Microsoft Excel macro that computes pH values, charge states at defined pI’s, and a predictive retention time for peptides identified by LC-MS/MS. Here, we report the application of these tools for the confident identification of peptides from a tryptic digest of a total Drosophila cell lysate that was isoelectrically fractionated on the OGE device, pH 3-10, into 15 fractions, with subsequent LC-MS/MS analysis of each OGE fraction. The MS/MS data were searched with SEQUEST and PHENYX against relevant protein databases. The raw identification results were processed with the Excel macro, and by applying acceptance criteria for pI and retention time the raw data were filtered and cleaned of false positive identifications. The comparison of the identification results returned by both search engines was further used to validate this approach and to establish acceptance criteria for peptide identifications with PHENYX.
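The pI-based part of such a filter can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not the Excel macro used in this work: the pKa values, the function names, the even division of the linear pH 3-10 gradient into 15 wells, and the one-well tolerance are all hypothetical choices made for the example. Cysteine is left out of the ionizable groups because cysteines in this study were alkylated with iodoacetamide. Retention time prediction is not attempted here.

```python
# Illustrative pKa values (an assumption; the source does not state which
# pKa set its macro uses). Alkylated cysteine is treated as non-ionizable.
PKA_N_TERM = 8.0
PKA_C_TERM = 3.1
PKA_BASIC = {"K": 10.5, "R": 12.5, "H": 6.0}
PKA_ACIDIC = {"D": 3.9, "E": 4.1, "Y": 10.1}


def net_charge(sequence: str, ph: float) -> float:
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    basic = [PKA_N_TERM] + [PKA_BASIC[a] for a in sequence if a in PKA_BASIC]
    acidic = [PKA_C_TERM] + [PKA_ACIDIC[a] for a in sequence if a in PKA_ACIDIC]
    positive = sum(1.0 / (1.0 + 10.0 ** (ph - pka)) for pka in basic)
    negative = sum(1.0 / (1.0 + 10.0 ** (pka - ph)) for pka in acidic)
    return positive - negative


def isoelectric_point(sequence: str, tol: float = 1e-4) -> float:
    """pH at which the net charge is zero, found by bisection on [0, 14]."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # Net charge decreases monotonically with pH.
        if net_charge(sequence, mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def expected_fraction(pi: float, ph_min: float = 3.0, ph_max: float = 10.0,
                      n_wells: int = 15) -> int:
    """1-based well index for a pI, assuming equal pH windows per well."""
    width = (ph_max - ph_min) / n_wells
    index = int((pi - ph_min) / width) + 1
    return max(1, min(n_wells, index))  # clamp values outside the gradient


def passes_pi_filter(sequence: str, observed_fraction: int,
                     tolerance_wells: int = 1) -> bool:
    """Accept an identification if its predicted well is close to the
    well in which the peptide was actually found (hypothetical tolerance)."""
    predicted = expected_fraction(isoelectric_point(sequence))
    return abs(predicted - observed_fraction) <= tolerance_wells
```

With these assumptions each well spans 7/15 of a pH unit, so a candidate sequence whose calculated pI places it many wells away from the fraction in which its spectrum was acquired would be flagged as a likely false positive.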
Materials and Methods
Sample Preparation for OGE Fractionation. The Drosophila Kc167 cell line was generously provided by E. Hafen (Zoological Institute, University of Zurich) and was originally derived from disaggregated 8 to 12 h old Drosophila embryos.26 Cells were cultured in Schneider medium containing 10% FCS at 25 °C. Density-arrested Kc167 cells were washed with PBS, collected in hypotonic lysis buffer (10 mM HEPES-KOH pH 7.9, 1.5 mM MgCl2, 10 mM KCl, 0.5 mM DTT, 1× Complete Protease inhibitor cocktail, Roche), and lysed by Dounce homogenization. After centrifugation at 100 000 × g for 30 min, the pellet was discarded, and cytoplasmic proteins were precipitated with acetone. After precipitation, proteins were redissolved in 100 mM Tris-HCl buffer (pH 8.3) containing 0.05% SDS, 5 mM EDTA, and 6 M urea. An aliquot of 2 mg protein was reduced with 5 mM tributylphosphine (Aldrich) and alkylated with 10 mM iodoacetamide. The sample was then diluted 8-fold with water, and sequencing-grade modified trypsin was added at an enzyme:protein ratio of 1:50 (w/w) and incubated at 37 °C overnight. The Drosophila protein digest was desalted by strong cation exchange (SCX) purification as follows. The digest solution was acidified to pH 2.8 with 1% formic acid and then loaded onto an SCX SPE cartridge (Polysulfoethyl Aspartamide, PolyLC, Columbia, MD). After extensive washing with 4 mL of 0.1% (v/v) formic acid and 2 mL of 25% (v/v) acetonitrile in 0.1% (v/v) formic acid, peptides were eluted with 1.5 mL of 10% (v/v) NH4OH solution containing 25% (v/v) acetonitrile. The purified peptides were evaporated to dryness in a vacuum centrifuge and redissolved in OGE buffer composed of 5% (v/v) glycerol and 0.5% (v/v) ampholytes pH 3.0-10.0 (Amersham Biosciences, Otelfingen, Switzerland) in pure water.
OGE Fractionation of Peptides.
Isoelectric focusing of peptides was performed with the off-gel electrophoresis device (OGE) composed of 15 wells over a 13 cm IPG strip (Amersham Biosciences, Otelfingen, Switzerland) exhibiting a linear pH gradient ranging from 3 to 10, as described elsewhere.13 Briefly, the separations were run by dispensing 50 µL of peptide solution into each well (total of 750 µL), and the potential was held at 500 V for 1 h, then at 1000 V for 1 h, and finally at 8000 V for 3.5 h (total of 29.5 kVh). The current limit was set at 200 µA per strip and the temperature was maintained at 20 °C. Fractionations were run with a peptide loading equivalent to 100 µg of the starting protein preparation. After OGE, liquid fractions were withdrawn (20 µL on average) and the OGE wells were rinsed once in order to enhance the peptide yield. For this purpose, 100 µL of a water/methanol/formic acid (49:50:1 by volume) mixture was added per well and incubated for 90 min without voltage. Corresponding peptide fractions from 10 runs were pooled and concentrated by vacuum centrifugation. Each OGE peptide fraction was purified from residual traces of glycerol, urea, and ampholytes as follows. The fractions were diluted to 1 mL in 0.1% (v/v) TFA and acidified to pH 3.0 by addition of 1% (v/v) TFA, then loaded onto a Sep-Pak C18 cartridge (Waters) for purification as recommended by the manufacturer. The purified fractions were then evaporated to dryness and redissolved in 20 µL of 0.4% (v/v) acetic acid for capillary RPLC-MS/MS analysis.
RPLC-MS/MS Analysis. The setup of the capillary RP-LC system was as described previously.27 The system consisted of a binary HPLC pump (HP1100, Agilent Technologies, Wilmington, DE), a micro-autosampler (Famos, Dionex LC Packings, San Francisco, CA), a ten-port switching valve integrated on a Finnigan ion trap mass spectrometer (model LCQ XP, Thermo Electron Corporation, San Jose, CA), a precolumn (100 µm i.d. × 2.0 cm length), and an analytical capillary column (75 µm × 12 cm). Fused silica capillary tubing with an integrated borosilicate frit (Integrafrit, New Objective, Cambridge, MA) was used for the precolumn. For the capillary column, one end of a polyimide-coated fused-silica capillary (Polymicro Technologies, Phoenix, AZ) was manually pulled to a fine point of ∼5 µm with a micro-flame torch. The columns were packed in-house with C18 resin (5 µm, 200 Å Magic C18AQ, Michrom BioResources, Auburn, CA) using a pneumatic pump (Brechbuehler, Spring, TX) at a constant helium gas pressure of 1500 psi. Sample volumes of 6 µL were loaded onto the precolumn at a flow rate of 5 µL/min over 5 min with solvent A {0.1% (v/v) formic acid in water}. After sample loading and cleanup, a linear binary gradient of 5-35% solvent B {0.1% (v/v) formic acid in acetonitrile} in solvent A was applied over 80 min, followed by isocratic elution at 80% B for 10 min. Peptides eluting from the capillary column were selected for CID by the mass spectrometer using a protocol that alternated between one MS scan and three MS/MS scans.
Data Processing.
MS/MS spectra were searched with SEQUEST against a Drosophila sequence database downloaded from NCI (ftp://ftp.ncifcrf.gov/pub/nonredun/protein.nrdb.Z) at the Institute for Systems Biology in Seattle. Carbamidomethylated cysteine was set as a fixed modification and oxidation of methionine as a variable modification. Furthermore, no enzyme specificity was chosen, and mass tolerances of ±3 Da and ±0.5 Da were accepted for precursor and fragment ions, respectively. The database search results were validated using the PeptideProphet program.15 PeptideProphet assigns to each SEQUEST peptide identification a probability that it has been correctly identified, based on its SEQUEST scores and additional information about the assigned peptides, including the number of tryptic termini. Peptides with a probability of 0.9 or higher were considered a match in this study. Peptide identification by PHENYX was performed on the vital-it processor cluster at the EPFL in Lausanne, Switzerland (www.phenyx-ms.com/). The UniProt-TrEMBL database release 26.2 and SwissProt release 46.0 were searched, restricted to Drosophila melanogaster protein entries. Otherwise, the same search criteria were applied as for the SEQUEST search, except that PHENYX requires a protease specificity (trypsin); half-tryptic peptides with up to two missed cleavages were allowed. Furthermore, only PHENYX identifications with a p-value of