Anal. Chem. 2004, 76, 6017-6028
Increased Identification of Peptides by Enhanced Data Processing of High-Resolution MALDI TOF/TOF Mass Spectra Prior to Database Searching
Tomas Rejtar, Hsuan-shen Chen, Victor Andreev, Eugene Moskovets, and Barry L. Karger*
Barnett Institute and Department of Chemistry, Northeastern University, Boston, Massachusetts 02115
This paper presents the application of sequential enhanced data processing procedures to high-resolution tandem mass spectra for identification of peptides using the Mascot database search algorithm. A strategy for (1) selection of fragment ion peaks from MS/MS spectra, (2) utilization of improved mass accuracy of the precursor ions, and (3) wavelet denoising of the mass spectra prior to fragment ion selection has been developed. The number of peptide identifications obtained using the enhanced processing was then compared with that obtained using software provided by the instrument manufacturer. Approximately 9000 MS/MS spectra acquired by the Applied Biosystems 4700 TOF/TOF MS instrument were used as a model data set. After application of the new processing, increases of 33% in unique peptides and 22% in protein identifications with at least two unique peptides were found. The percentage of false positive identifications, estimated by searching against a randomized database, increased from 2.7 to 3.9% with the new processing, which was still below the 5% error rate specified in the Mascot search. These data processing approaches increase the amount of information that can be extracted from LC-MS analysis without the necessity of additional experiments.
Mass spectrometric analysis followed by database searching has become a well-established tool for peptide and protein identification. In a traditional approach, proteins are separated by 2D-gel electrophoresis, followed by staining, excision, and in-gel digestion, and the resulting peptides are analyzed by mass spectrometry (MS).1-5 An alternative method for protein identification utilizes multidimensional separation to analyze complex peptide mixtures resulting from proteolytic digestion of proteins.6-8
In both 2D-gel and liquid-phase separation approaches, identifications based on tandem mass spectra typically rely on database searching algorithms that assign a specific peptide sequence to a particular MS/MS spectrum, based on similarity to the theoretically predicted fragmentation of peptides generated by in silico digestion of a protein sequence database.9 Various algorithms for database searching have been developed including, among others, Sequest,10 Profound,11 Tandem,12 and Mascot.13 On the other hand, only limited effort has been devoted to processing of MS/MS spectra prior to searching.14,15 Current software tools focus mostly on elimination of low-quality MS/MS spectra prior to database searching,16-18 in order to reduce the search time and the percentage of false positives; however, this approach does not reduce the percentage of false negatives or increase the amount of information extracted from a sample. To increase the number of successful assignments of peptide sequences to MS/MS spectra, an advanced strategy must be employed. For example, database search engines require selection of peaks corresponding to sequencing ion fragments while minimizing the number of peaks due to spectral noise. This strategy can be utilized for preprocessing of high-resolution mass spectra acquired in a full profile mode (e.g., q-TOF, TOF/TOF, FT-ICR). The S/N of the ions in the MS and MS/MS spectra can be increased by spectral denoising, allowing detection of low-intensity peaks and also more accurate selection of monoisotopic peaks.19
* Corresponding author. E-mail: b.karger@neu.edu.
(1) Anderson, N. L.; Anderson, N. G. Electrophoresis 1998, 19, 1853-1861.
(2) Wilm, M.; Shevchenko, A.; Houthaeve, T.; Breit, S.; Schweigerer, L.; Fotsis, T.; Mann, M. Nature 1996, 379, 466-469.
(3) Patterson, S. D.; Aebersold, R. Electrophoresis 1995, 16, 1791-1814.
(4) Shevchenko, A.; Wilm, M.; Vorm, O.; Jensen, O. N.; Podtelejnikov, A. V.; Neubauer, G.; Mortensen, P.; Mann, M. Biochem. Soc. Trans. 1996, 24, 893-896.
(5) Shevchenko, A.; Jensen, O. N.; Podtelejnikov, A. V.; Sagliocco, F.; Wilm, M.; Vorm, O.; Mortensen, P.; Shevchenko, A.; Boucherie, H.; Mann, M. Proc. Natl. Acad. Sci. U.S.A. 1996, 93, 14440-14445.
(6) Link, A. J.; Eng, J.; Schieltz, D. M.; Carmack, E.; Mize, G. J.; Morris, D. R.; Garvik, B. M.; Yates, J. R., III. Nat. Biotechnol. 1999, 17, 676-682.
(7) Washburn, M. P.; Wolters, D.; Yates, J. R., III. Nat. Biotechnol. 2001, 19, 242-247.
(8) Wolters, D. A.; Washburn, M. P.; Yates, J. R., III. Anal. Chem. 2001, 73, 5683-5690.
(9) McCormack, A. L.; Schieltz, D. M.; Goode, B.; Yang, S.; Barnes, G.; Drubin, D.; Yates, J. R., III. Anal. Chem. 1997, 69, 767-776.
(10) Yates, J. R., III; Eng, J. K.; McCormack, A. L.; Schieltz, D. Anal. Chem. 1995, 67, 1426-1436.
(11) Zhang, W.; Chait, B. T. Anal. Chem. 2000, 72, 2482-2489.
(12) Craig, R.; Beavis, R. C. Bioinformatics 2004.
(13) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551-3567.
(14) Chamrad, D. C.; Korting, G.; Stuhler, K.; Meyer, H. E.; Klose, J.; Bluggel, M. Proteomics 2004, 4, 619-628.
(15) Gentzel, M.; Kocher, T.; Ponnusamy, S.; Wilm, M. Proteomics 2003, 3, 1597-1610.
(16) www.agilent.com.
(17) Moore, R. E.; Young, M. K.; Lee, T. D. J. Am. Soc. Mass Spectrom. 2000, 11, 422-426.
(18) Sadygov, R. G.; Eng, J.; Durr, E.; Saraf, A.; McDonald, H.; MacCoss, M. J.; Yates, J. R., III. J. Proteome Res. 2002, 1, 211-215.
10.1021/ac049247v CCC: $27.50 Published on Web 09/14/2004
© 2004 American Chemical Society
Analytical Chemistry, Vol. 76, No. 20, October 15, 2004 6017
Recently, our laboratory introduced a new algorithm (MEND) for denoising of LC-MS data.20 In this approach, the denoising is performed by matched filtration in the chromatographic time domain, based on the assumption of Gaussian chromatographic peak shapes and the experimentally determined noise profile. The algorithm provides a significant improvement in the signal-to-noise ratio (S/N) and mass accuracy, which facilitates proper selection of the precursor ions for MS/MS analysis, including low-intensity precursors. After selection of appropriate peaks from the mass spectrum, the database search is performed with specified precursor mass and fragment ion tolerances. Most database searching algorithms use a precursor ion mass tolerance window to select all potential peptide candidates for comparison with the experimental spectrum. Thus, the accuracy of mass calibration in the MS mode substantially influences the results of database searching.21 A wider mass tolerance window results in more potential peptide candidates, which not only increases the chances of false positive identification but also results in longer search times. For the Mascot algorithm, the precursor ion mass tolerance window is particularly important since the number of potential peptides is also used for estimation of the score for a significant match. A narrower mass window tolerance yields lower Mascot significant scores. Thus, by improving the MS mass accuracy, more peptide assignments can be significant. The mass accuracy in the MS mode for axial MALDI MS, used in this study, is ∼±50 ppm with external calibration. This value can be improved to ±10 ppm or better using internal calibration; however, the addition of an internal standard to the sample can limit the dynamic range of analysis by suppressing analyte signal.22 To avoid this limitation, several approaches have been suggested.
In one case, the MS spectra were mass recalibrated based on the peptides identified by MS/MS with high confidence.23 On the other hand, our laboratory has introduced a universal method to achieve a mass accuracy similar to that of internal calibration without ion suppression, in which closely placed external calibrants were used.24 Compared to the method described in ref 23, our approach is generally applicable to all MS/MS spectra without the necessity to know initially the identity of sample components.
After database searching, the results of the search must be validated. A commonly used method is calculation of the percentage of false positive (selectivity) and false negative (sensitivity) identifications. Several approaches describing validation of the database search results have been reported. In one example, a mixture of known components was experimentally analyzed by LC-MS from which an estimate of both the selectivity and sensitivity of database searching algorithms was made.25,26 In another example, the Sequest algorithm was modified to calculate the probability of correct identification allowing statistical determination of the significant thresholds.27 Similarly, the percentage of false positive identifications can also be estimated by searching against a database of different species or a database for the same organism but with reversed protein sequences.28,29 In both cases, it is expected that all significant identifications would be incorrect assignments.
In this paper, we continue our efforts to extract the maximum amount of information from the LC-MS/MS experiment using a new algorithm for the processing of high-resolution MS/MS spectra. The algorithm sequentially utilizes (1) multiple criteria for selection of the fragment ions from the MS/MS spectrum, (2) improved mass accuracy of the precursor ion, and (3) wavelet denoising of the raw MS/MS spectrum to increase the number of identified peptides. The enhanced approach was applied for processing of MALDI MS/MS spectra acquired by a MALDI TOF/TOF MS instrument from a tryptic digest of a portion of soluble yeast proteins separated by 2D LC, and the results were compared with those obtained by software provided by the instrument manufacturer.
(19) Kast, J.; Gentzel, M.; Wilm, M.; Richardson, K. J. Am. Soc. Mass Spectrom. 2003, 14, 766-776.
(20) Andreev, V. P.; Rejtar, T.; Chen, H. S.; Moskovets, E. V.; Ivanov, A. R.; Karger, B. L. Anal. Chem. 2003, 75, 6314-6326.
(21) Clauser, K. R.; Baker, P.; Burlingame, A. L. Anal. Chem. 1999, 71, 2871-2882.
(22) Preisler, J.; Hu, P.; Rejtar, T.; Karger, B. L. Anal. Chem. 2000, 72, 4785-4795.
(23) Graber, A.; Juhasz, P. S.; Khainovski, N.; Parker, K. C.; Patterson, D. H.; Martin, S. A. Proteomics 2004, 4, 474-489.
(24) Moskovets, E.; Chen, H. S.; Pashkova, A.; Rejtar, T.; Andreev, V.; Karger, B. L. Rapid Commun. Mass Spectrom. 2003, 17, 2177-2187.
(25) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Anal. Chem. 2002, 74, 5383-5392.
EXPERIMENTAL SECTION
MS Instrumentation. Mass spectra were acquired from an early access version of the AB 4700 Proteomics Analyzer (Applied Biosystems, Framingham, MA) in both the MS and MS/MS modes. The instrument was equipped with a Nd:YAG laser (PowerChip, JDS Uniphase, San Jose, CA) operating at 200 Hz and controlled by Applied Biosystems Explorer version 1.1 software. The mass resolution of the instrument was 15 000 and 4000 in the MS and MS/MS modes, respectively. The mass accuracy in the MS mode, using internal calibration, was roughly ±10 ppm, while in the MS/MS mode, the mass accuracy was ±0.2 Da. Both the MS and MS/MS spectra were acquired in the full profile mode with approximately 130 000 and 80 000 data points per spectrum, respectively.
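To put the quoted mass accuracies in absolute terms, and to illustrate why the introduction ties the precursor tolerance window to the number of peptide candidates, the following minimal sketch converts a ppm tolerance into an m/z window and counts the candidates that fall inside it. The candidate masses are hypothetical values chosen for illustration. The helper `threshold_shift` rests on an assumption that is not stated in this paper: that a significance threshold on a −10·log10(P) score grows as 10·log10(N) with the number N of candidates, so only the relative shift between two windows is computed.

```python
import math


def ppm_window(mz, tol_ppm):
    """Return the (low, high) m/z bounds of a symmetric tolerance given in ppm."""
    delta = mz * tol_ppm / 1e6
    return mz - delta, mz + delta


def candidates_in_window(precursor_mz, candidate_mzs, tol_ppm):
    """Keep only the candidate m/z values inside the precursor tolerance window."""
    low, high = ppm_window(precursor_mz, tol_ppm)
    return [m for m in candidate_mzs if low <= m <= high]


def threshold_shift(n_wide, n_narrow):
    """Score-threshold difference between two candidate counts.

    Assumption (not from the paper): the threshold scales as 10*log10(N).
    """
    return 10.0 * math.log10(n_wide / n_narrow)


if __name__ == "__main__":
    precursor = 1500.0
    # Hypothetical theoretical peptide m/z values near the precursor.
    hypothetical = [1499.93, 1499.99, 1500.004, 1500.02, 1500.07]
    for tol in (50, 10):
        low, high = ppm_window(precursor, tol)
        kept = candidates_in_window(precursor, hypothetical, tol)
        print(f"{tol} ppm: {low:.3f} to {high:.3f}, {len(kept)} candidates")
```

At m/z 1500, a ±50 ppm window is ±0.075 Da wide, whereas ±10 ppm is ±0.015 Da, so fewer candidates survive the narrower window.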
2D LC. Strong cation exchange chromatography was carried out on a Series II 1090 liquid chromatograph (Agilent Technologies, Palo Alto, CA). The yeast lysate sample (500 µL, ∼1 mg of total protein; see below) was injected into a polysulfoethyl A column (2.1 mm × 15 cm, PolyLC, Columbia, MD) and separated with a 1 h linear gradient at a flow rate of 200 µL/min. The composition of solvent A was 25% (v/v) acetonitrile (ACN), 10 mM phosphate buffer at pH 3.0, and solvent B was 25% (v/v) ACN, 10 mM phosphate buffer at pH 3.0 with 350 mM KCl. The separation was monitored using a Waters 486 UV detector (Waters Corp., Milford, MA). Fractions of ion exchange eluent, collected in 3-min time intervals (600 µL), were concentrated by a SpeedVac (Thermo Savant, Holbrook, NY) to a final volume of 100 µL and frozen at -80 °C. Ten SCX fractions with the highest UV absorbance out of a total of 20 were selected for analysis by reversed-phase HPLC. Aliquots of the SCX fractions were diluted 10-fold and analyzed by an UltiMate chromatographic system (Dionex, Sunnyvale, CA), using a 75 µm × 15 cm column packed with 3-µm C18 stationary phase of 100-Å pore size (PepMap, Dionex). The sample (10 µL) was loaded, using the manual injector of the UltiMate system, to a precolumn (1 mm × 300 µm) packed with the same particles as the analytical column. After desalting, the sample was eluted onto an analytical column at 200 nL/min flow rate. Solvent A was 0.1% (v/v) trifluoroacetic acid (TFA) in 2% (v/v) ACN/water, and solvent B was 0.1% (v/v) TFA, 5% (v/v) 2-propanol in 85% (v/v) ACN/water, with a linear gradient time of 60 min. The analytical column was directly connected to a MicroTee (Upchurch, Oak Harbor, WA) where the eluent was mixed with the MALDI matrix solution at a flow rate of 1.3 µL/min and streaked on the MALDI plate.
(26) Nesvizhskii, A. I.; Keller, A.; Kolker, E.; Aebersold, R. Anal. Chem. 2003, 75, 4646-4658.
(27) MacCoss, M. J.; Wu, C. C.; Yates, J. R., III. Anal. Chem. 2002, 74, 5593-5599.
(28) Moore, R. E.; Young, M. K.; Lee, T. D. J. Am. Soc. Mass Spectrom. 2002, 13, 378-386.
(29) Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. J. Proteome Res. 2003, 2, 43-50.
Chemicals. Reagents, organic solvents, and standard peptides were purchased from Sigma (St. Louis, MO) unless otherwise specified. Nitrocellulose was from Bio-Rad Laboratories (Hercules, CA), and HPLC grade ACN, methanol (MeOH), and TFA were from Fisher Scientific (Fair Lawn, NJ). Deionized water was produced by an Alpha-Q system (Millipore Corp., Marlborough, MA).
Preparation of the Yeast Proteomic Sample. Yeast Saccharomyces cerevisiae strain YPH499 (American Type Culture Collection, Manassas, VA) was grown in peptone/dextrose/yeast broth at 30 °C to a density of 6 × 10^8 cells/mL and then harvested by centrifugation at 3000g at 4 °C for 10 min. YeastBuster Protein Extraction Reagent (Novagen, Madison, WI) was used to extract proteins from the cell lysis solution, following a standard protocol. The crude cell extract was mixed with a 6-fold larger volume of -20 °C cold acetone, followed by centrifugation at 13 000 rpm for 10 min. The precipitated proteins were then resuspended in the denaturing buffer. The total protein concentration was determined by the standard Bradford assay (Sigma-Aldrich, St. Louis, MO).
The protein pellets were first suspended in 100 µL of denaturing buffer, consisting of 6 M urea, 50 mM Tris-HCl, 2 mM DTT solution, and the resultant mixture was heated at 95 °C for 20 min. The sample solution was then cooled to room temperature, followed by addition of 15 mM iodoacetamide, and placed in the dark for 30 min at room temperature. After alkylation, 600 µL of 50 mM NH4HCO3, 1 mM CaCl2 solution was added to the sample. Sequencing grade trypsin (Promega, Madison, WI) was added in a ratio of 1:25 (w/w), and the sample was incubated for 12 h at 37 °C.
(30) Rejtar, T.; Hu, P.; Juhasz, P.; Campbell, J. M.; Vestal, M. L.; Preisler, J.; Karger, B. L. J. Proteome Res. 2002, 1, 171-179.
Continuous Deposition. The mixture of LC eluent was deposited on a MALDI target plate via the continuous deposition interface, as described previously.24,30 In brief, the LC eluent was mixed with the matrix solution (7 mg/mL α-cyano-4-hydroxycinnamic acid, 50% (v/v) ACN/water, 1 mM ammonium citrate, 0.1% (v/v) TFA) and deposited onto a standard MALDI plate in a sealed chamber under a rough vacuum (∼200 Torr). The mixture was transferred to the MALDI plate via a 20 cm × 40 µm i.d., 150-µm-o.d. capillary, resulting in a uniform, roughly 250-µm-wide streak. The speed of the plate during deposition was 0.5 mm/s. Prior to the deposition, the MALDI plate was coated with a thin layer of nitrocellulose for proper wetting by the matrix solution.31,32
After the LC separation was completed, a mixture of external standards containing five peptides covering a mass range from 900 to 2500 Da was rapidly deposited as a separate streak parallel to the sample streak, with a 100-µm gap.
MS and MS/MS Acquisition. The MS and MS/MS spectra were acquired in a manner similar to that described previously.20 The deposited streaks of sample and external mass standards were segmented into a series of adjacent 1.0 mm × 0.5 mm wells, each representing 2 s of chromatographic time; each well could be independently accessed by the AB 4700 control software for data acquisition. The MS signal was acquired with 600 laser shots from each well using predefined positions within the well with ∼80% of the signal collected from the sample streak and ∼20% from the mass calibrants. Note that each sample spectrum thus included peaks of the five mass standards used as internal calibrants. The individual files were then combined into one binary file that was analyzed by the MEND denoising algorithm.20 The denoised spectra were subsequently analyzed by the PRESEL algorithm to generate a reliable list of precursor ions for MS/MS analysis.33 The process of MS and MS/MS spectral acquisition and calibration was automated using a combination of the AB 4700 controlling software and an in-house Visual Basic program. The MS/MS spectra were acquired by accumulating 1000 laser shots from the sample streak alone. In total, 9504 MS/MS spectra were acquired. The MS/MS data were processed by either the enhanced data processing method or the method provided by the manufacturer.
Database Searching. All searches were performed using a Mascot server version 1.9 (Matrix Science, London, U.K.) against a database containing all open reading frames (ORFs) of yeast34 with carbamidomethylation as a permanent cysteine modification and methionine oxidation as the only potential modification. The number of missed cleavages was set at 2.
The searches were repeated with tryptic (both peptide ends tryptically cleaved) and semitryptic (one peptide end tryptically cleaved) cleavages. MS/MS spectra with a Mascot score higher than the significant score, using a confidence interval (c.i.) greater than 95%, were assumed to be correct calls. When more than one peptide sequence was assigned to a spectrum with a significant score, the spectra were examined manually. The cases where the top-scoring peptide sequences had equal scores were discarded.
Standard Processing of MS and MS/MS Spectra. The MS spectra were calibrated using six external calibration spots on each MALDI plate using the manufacturer-specified procedure (plate model), resulting in a mass accuracy of ∼±50 ppm. The MS/MS spectra were centroided and deisotoped using the AB 4700 controlling software, based on a specified S/N and the intensities of the fragment ions, and then automatically stored by the AB 4700 controlling software. The fragment ion intensity was expressed as the area of the isotopic cluster (referred to below as the cluster area) that represented the sum of areas of all peaks in the isotopic cluster. Additional filtering provided by the standard software package was in terms of the number of peaks per m/z region, which was set to the 20 most intense peaks per 200 Da. The mass range for fragment ions was always from 70 Da to the precursor mass minus 50 Da. The final lists of fragment ions for all spectra were converted to the Mascot generic format using the Peak2Mascot tool (Applied Biosystems).
(31) Preston, L. M.; Murray, K. K.; Russell, D. H. Biol. Mass Spectrom. 1993, 22.
(32) Perera, I. K.; Perkins, J.; Kantartzoglou, S. Rapid Commun. Mass Spectrom. 1995, 9, 180-187.
(33) Andreev, V.; Rejtar, T.; Chen, H.; Moskovets, E.; Karger, B. Proceedings of the 52nd ASMS Conference on Mass Spectrometry and Allied Topics, Nashville, TN, 2004.
(34) Yeast database obtained from NCBI (www.ncbi.mln.nih.gov) in November 2003.
Enhanced MS and MS/MS Spectral Processing. The MS spectra were calibrated using closely placed mass calibrants resulting in mass accuracy of ∼±10 ppm. The enhanced processing of MS/MS spectra was performed by a computer program written in Visual Basic 6.0. The program performed several tasks on the raw MS/MS spectra: (1) fragment ion detection, centroiding, and deisotoping with an estimation of S/N and cluster area using an ActiveX library (part of DataExplorer, Applied Biosystems); (2) removal of fragment ions with values in the forbidden mass ranges; (3) generation of multiple Mascot generic files for each raw spectrum based on specified criteria such as minimum fragment ion intensity; (4) MS/MS spectral denoising using MATLAB (MathWorks, Natick, MA) wavelet toolbox; see Figure 2. Additionally, the program utilized a MySQL database ver. 4.0 (MySQL AB, Sweden) to catalog performed operations. In this way, the program could process a subset of spectra that, for example, did not yield a significant score. The software also utilized the improved mass accuracy achieved by the closely placed internal standard.
Estimation of False Positive Identifications. The percentage of false positives was estimated by searching the data set against a yeast protein database with reversed sequences.28,29 The reversed database was used as an approximation of a database where all the significant hits were expected to be incorrect. Note that the reversed database is equivalent to the normal database in the frequency of occurrence of amino acids and also of protein molecular weights.
Using a program written in Perl, it was estimated that only 1% of all peptides distinguishable by MS/MS analysis longer than five amino acids (K/Q and I/L are considered to be equivalent) were common between the normal and reversed databases. In addition, 90% of the common peptides had only six amino acids (average m/z ∼ 750) or consisted of repeating patterns of amino acids. This level of similarity between normal and reversed databases was considered adequate to estimate the false positive rate.
Data Storage. All peak lists, Mascot output files, and summary reports were stored in a MySQL database. The Mascot output files were converted into the MySQL database using a modified version of DBParser (J. Kowalak, National Institute of Mental Health). The results were evaluated using an in-house set of programs written in Perl.
RESULTS AND DISCUSSION
Ten SCX fractions (out of 20) of a yeast soluble protein digest, used as a model sample, were individually separated by reversed-phase nano-LC and deposited onto standard MALDI plates in the form of narrow streaks using a subatmospheric continuous deposition interface, as described elsewhere.30 The MS spectra were automatically acquired from an AB 4700 MALDI TOF/TOF MS instrument and mass calibrated using closely placed internal calibrants.24 The spectra were processed by the in-house algorithm for denoising (MEND), and the selection of precursor ions for MS/MS analysis was done with another in-house algorithm, PRESEL.33 Next, the MS/MS spectra were acquired for the selected peak positions and precursors, and the spectra were processed by the procedure described in this paper. This procedure included, among others, the generation of multiple peak lists per spectrum for database searching, the use of improved precursor ion mass accuracy, and wavelet denoising. The results of this enhanced processing were compared to those obtained by the standard software package supplied with the AB 4700 instrument with respect to the total number of identified peptides, unique peptides, proteins, and number of false positive identifications.
Standard Processing of MS/MS Spectra. All acquired spectra were first processed using AB 4700 controlling software to generate a set of peak lists. Figure 1 schematically shows the process of generation of the peak list from the high-resolution MS/MS spectra. In Figure 1A, a representative raw spectrum is shown using an 80-Da-wide window. The spectrum was first centroided, so that each full profile peak was represented by a single m/z value of the peak centroid; see Figure 1B. Next, the peaks were deisotoped, and the intensity of the deisotoped peaks was expressed as the sum of the area of peaks in a particular isotopic cluster (referred to below as the cluster area). Then, a single peak list was created by thresholding using default settings for peptide analysis, i.e., S/N greater than 5, cluster area greater than 100, and with up to 20 of the most intense peaks per 200 Da region of spectrum. Figure 1C presents the final peak list that was employed for the database searching. The searches were conducted using the Mascot server with full trypsin enzyme specificity. The precursor ion mass tolerance was set to ±50 ppm in order to approximate typical mass accuracy achieved with the AB 4700 instrument with manufacturer-recommended external calibration, using only six calibration spots for the entire MALDI plate.
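The default thresholding described above (S/N greater than 5, cluster area greater than 100, up to 20 of the most intense peaks per 200 Da, fragment range from 70 Da to the precursor mass minus 50 Da) can be sketched as follows. This is an illustrative reconstruction rather than the AB 4700 or Peak2Mascot code: the `Peak` record, the function names, and the use of fixed 200-Da bins starting at 0 are assumptions. The output follows the general layout of the Mascot generic format (BEGIN IONS, TITLE, PEPMASS, CHARGE, peak lines, END IONS).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    mz: float    # monoisotopic m/z after centroiding and deisotoping
    sn: float    # signal-to-noise ratio
    area: float  # cluster area (sum of areas in the isotopic cluster)


def select_peaks(peaks, precursor_mz, min_sn=5.0, min_area=100.0,
                 top_n=20, window=200.0, low_mz=70.0, margin=50.0):
    """Apply S/N, cluster-area, mass-range, and top-N-per-window filters."""
    high_mz = precursor_mz - margin
    bins = defaultdict(list)
    for p in peaks:
        if p.sn > min_sn and p.area > min_area and low_mz <= p.mz <= high_mz:
            # Assumption: windows are fixed bins [0, 200), [200, 400), ...
            bins[int(p.mz // window)].append(p)
    kept = []
    for members in bins.values():
        members.sort(key=lambda p: p.area, reverse=True)
        kept.extend(members[:top_n])
    return sorted(kept, key=lambda p: p.mz)


def to_mgf(title, precursor_mz, peaks, charge=1):
    """Format one spectrum as a Mascot generic format (MGF) block."""
    lines = ["BEGIN IONS", f"TITLE={title}",
             f"PEPMASS={precursor_mz:.4f}", f"CHARGE={charge}+"]
    lines += [f"{p.mz:.4f} {p.area:.1f}" for p in peaks]
    lines.append("END IONS")
    return "\n".join(lines) + "\n"
```

A charge of 1+ is used as the default because MALDI predominantly produces singly charged ions; this default is also an assumption of the sketch.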
The database search of a total of 9504 spectra from all 10 LC-MALDI MS runs resulted in 1777 identified peptides. To estimate the percentage of false positives, the search was repeated using a yeast database with reversed protein sequences. In total, 48 significant peptide identifications were found with the reversed database, corresponding to ∼2.7% of the significant hits in the normal database search, i.e., an estimate of the percentage of false positive identifications. Another database search using semitryptic enzyme specificity to account for peptides with only one tryptic cleavage was next performed. Since the AB 4700 software does not allow removal of already identified MS/MS spectra, the search was repeated with all MS/MS spectra, and both tryptic and semitryptic peptides were reported. However, there were many more semitryptic than tryptic peptides in the theoretical digest, resulting in a higher Mascot significant score. As a result, the total number of significant peptides decreased from 1777 to 1512 as some tryptic peptides were eliminated by the increased significant score. While not implemented in the standard procedure, a better approach would be to compare manually the list of identified peptides with that of strict tryptic cleavage and, in the second step, add new peptides corresponding to semitryptic cleavages. Nevertheless, below we will compare the above number of identified peptides including the percentage of false positives with that obtained with the new data processing procedure.
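The reversed-database estimate used above can be sketched in a few lines: reverse every protein sequence, and take the ratio of significant hits in the reversed search to those in the normal search (48/1777 ≈ 2.7%). The sketch also shows one way to check how many tryptic peptides are shared between the normal and reversed databases, treating I/L and K/Q as equivalent, as in the Experimental Section. It is not the authors' Perl program. The function names and the rule of no cleavage before proline are assumptions.

```python
def reverse_database(proteins):
    """Return a copy of {name: sequence} with every sequence reversed."""
    return {name: seq[::-1] for name, seq in proteins.items()}


def tryptic_peptides(seq, missed_cleavages=2, min_len=6):
    """In silico tryptic digest: cleave after K or R, not before P (assumed rule)."""
    sites = [0]
    for i, aa in enumerate(seq[:-1]):
        if aa in "KR" and seq[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(seq))
    peptides = set()
    for i in range(len(sites) - 1):
        last = min(i + missed_cleavages + 1, len(sites) - 1)
        for j in range(i + 1, last + 1):
            pep = seq[sites[i]:sites[j]]
            if len(pep) >= min_len:  # "longer than five amino acids"
                peptides.add(pep)
    return peptides


def ms_equivalent(peptide):
    """Collapse residues treated as indistinguishable by MS/MS (I/L, K/Q)."""
    return peptide.replace("I", "L").replace("Q", "K")


def shared_fraction(proteins):
    """Fraction of normal-database peptides also present in the reversed one."""
    fwd, rev = set(), set()
    for seq in proteins.values():
        fwd |= {ms_equivalent(p) for p in tryptic_peptides(seq)}
    for seq in reverse_database(proteins).values():
        rev |= {ms_equivalent(p) for p in tryptic_peptides(seq)}
    return len(fwd & rev) / len(fwd) if fwd else 0.0


def false_positive_percent(reversed_hits, normal_hits):
    """Estimated false positive percentage from the two search results."""
    return 100.0 * reversed_hits / normal_hits
```

For example, `false_positive_percent(48, 1777)` reproduces the ∼2.7% figure quoted in the text.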
Figure 1. Process of selection of fragment ions from a high-resolution MS/MS spectrum. (A) Region of raw MS/MS spectrum (m/z = 470-550); (B) the same m/z region after peak centroiding; (C) peak list after deisotoping and thresholding using the sum of peak areas in the isotopic cluster (cluster area) greater than 100.
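The centroiding and deisotoping steps illustrated in Figure 1 can be sketched as below. This is a deliberately simplified illustration, not the algorithm used by the AB 4700 software or the DataExplorer library. It assumes singly charged ions (isotope peaks spaced by about 1.00335 Da), a fixed intensity threshold to delimit profile peaks, the sum of intensities as a crude peak area, and a ±0.2 Da matching tolerance. All function names are hypothetical.

```python
ISOTOPE_SPACING = 1.00335  # approximate 13C - 12C mass difference in Da


def split_profile(profile, threshold):
    """Split (mz, intensity) profile points into runs above the threshold."""
    runs, current = [], []
    for mz, intensity in profile:
        if intensity > threshold:
            current.append((mz, intensity))
        elif current:
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs


def centroid(points):
    """Return (intensity-weighted mean m/z, summed intensity) of one peak."""
    total = sum(i for _, i in points)
    mz = sum(m * i for m, i in points) / total
    return mz, total


def deisotope(centroids, tol=0.2):
    """Greedy deisotoping for singly charged ions.

    Each cluster starts at the lowest unassigned centroid and absorbs peaks
    found one isotope spacing above the previously absorbed peak. Returns a
    list of (monoisotopic m/z, cluster area).
    """
    centroids = sorted(centroids)
    used = [False] * len(centroids)
    clusters = []
    for i, (mz, area) in enumerate(centroids):
        if used[i]:
            continue
        used[i] = True
        cluster_area = area
        expected = mz + ISOTOPE_SPACING
        for j in range(i + 1, len(centroids)):
            mj, aj = centroids[j]
            if mj > expected + tol:
                break
            if not used[j] and abs(mj - expected) <= tol:
                used[j] = True
                cluster_area += aj
                expected = mj + ISOTOPE_SPACING
        clusters.append((mz, cluster_area))
    return clusters
```

A typical use would be `deisotope([centroid(run) for run in split_profile(profile, threshold)])`, after which the cluster areas can be thresholded as described in the text.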
Enhanced Processing of the MS/MS Spectra. The strategy for enhanced data processing of MS/MS spectra consisted of several sequential steps schematically shown in Figure 2 and summarized here. First, all peaks in an MS/MS spectrum were detected and values of S/N and cluster area for all peaks determined. Then, only peaks with a S/N and cluster area above specific thresholds (see below) were further processed. The procedure could also select a specific number of the most intense peaks in specified m/z intervals from each spectrum or eliminate fragments corresponding to forbidden m/z values. Three separate peak lists were generated for each MS/MS spectrum, based on an increasing cluster area threshold. These three peak lists were next submitted for database searching using ±10 ppm mass tolerance, achieved by incorporating a closely placed external standard streak over the entire deposited chromatographic run. Next, after removal of identified spectra, the search using semitryptic specificity was conducted for all three cluster area thresholds, and the combined results from all searches were stored. Following this, the spectra not yet identified were denoised using wavelet transformation. After peak detection, the database searching was performed with ±10 ppm mass tolerance. While not performed in this paper, the MS/MS spectra that still were not identified could be further examined by de novo sequencing. We next explore in detail the individual steps of the scheme in Figure 2.
Figure 2. Scheme of the overall data processing procedure. Step 1: Raw MS/MS spectra are denoised, deisotoped, and three different peak lists, based on variable cluster areas, are generated. Step 2: The database search is carried out for all three peak lists using ±10 ppm mass tolerance (c.i. 95%). All significant identifications are excluded, and the search is then repeated with ±30 ppm mass tolerance to account for incorrectly calibrated peaks (c.i. 99%). In addition, database searching using other than fully tryptic specificity or additional peptide modifications can be performed. Step 3: MS/MS spectra that did not yield significant score are denoised using wavelet transformation. Step 4: The database search is repeated using denoised spectra with ±10 ppm mass tolerance. Step 5: After selection of spectra with moderate to high intensity of fragmentation, de novo sequencing can in principle be performed.
Selection of Fragment Ion Peaks from the Raw MS/MS Spectrum. Initially, the peak lists generated for each acquired MS/MS spectrum were examined using AB 4700 controlling software. The influence of various parameters on Mascot score, e.g., intensity and number of random peaks in a given spectrum, was studied in order to optimize the selection of fragment ions from the MS/MS spectrum. The influence of noise or peaks with random m/z values in the MS/MS spectrum on the value of the significant Mascot score was first simulated using a model peptide MS/MS peak list. Theoretical mass values of six consecutive y-ions for a model peptide FAHEGGYYIVPLSSK were calculated, and the intensities of the y-ions were normalized to 1. Then, 1-19 peaks with random m/z values and 10-fold lower intensity (0.1) than the y-ion series were added to the peak list. The resulting 20 peak lists were subjected to database searching using Mascot. The Mascot score was found to decrease systematically from the maximum value of 59, for no random peaks, to 28, when 19 random peaks were added at 10-fold lower intensity. This decrease in the Mascot score demonstrates that random peaks (or noise), even with significantly lower intensity than fragment ions, can nevertheless still have a negative effect on the Mascot score. Therefore, to maximize the possibility of peptide identification, it is clear that random peaks should be excluded from the peak list wherever possible.
A strategy that we first tested to reduce the number of random peaks was to eliminate certain m/z values from inclusion in the data analysis. It is well known that specific m/z values cannot occur due to the elemental composition of peptides (i.e., forbidden regions).35 The fragment ions obviously have the same elemental composition constraints as the parent peptide ions; thus, similar forbidden zones will result. To calculate the specific intervals of allowed masses for various m/z values, all the theoretical tryptic peptides from the ORFs of yeast were generated and the masses of their corresponding b- and y-ion fragments calculated. The intervals for the lowest and the highest masses of the observed fragment ions for each nominal mass in the mass range from 70 to 3000 Da were next calculated. Then, the interval of allowed fragment masses was extended by ±0.2 Da to account for the mass accuracy of the AB 4700 in the MS/MS mode. The resulting band of allowed m/z values for the yeast digest represented ∼70% of all potential masses. Thus, theoretically up to 30% of the random MS/MS peaks could be eliminated based on forbidden masses.
(35) Wehofsky, M.; Hoffmann, R.; Hubert, M.; Spengler, B. Eur. J. Mass Spectrom. 2001, 7, 39-46.
The experimental peak lists generated by the standard AB 4700 software for all acquired MS/MS spectra were then processed to remove peaks from the forbidden mass ranges. Approximately 7% of the total number of masses were found to have been removed by this procedure. This low percentage of eliminated peaks using actual data indicated that most of the peaks selected by the standard program corresponded to fragment ions. Database searching was repeated with omitted masses in the forbidden regions, resulting in a roughly 4% increase in the number of peptide identifications. To estimate the influence of this processing on the percentage of false positive identifications, the same search was performed using the reversed database, and it was found that the removal of random peaks in the forbidden ranges resulted in a negligible increase in the number of false positive identifications
Analytical Chemistry, Vol. 76, No. 20, October 15, 2004
from 2.7 to 2.8%. In summary, this study revealed that removal of MS/MS peaks outside the allowed m/z intervals slightly increased the number of peptides identified with only a minor increase in the false positive rate. This procedure was nevertheless employed in all further studies, since the additional computer time was negligible.
Influence of the Multiple Peak Lists per MS/MS Spectrum. The intensities of fragment ions, as well as the background signal, varied between individual MS/MS spectra. Using a fixed threshold value for peak selection could help select fragment ions in some MS/MS spectra but, at the same time, could also result in increased background noise for other spectra. Thus, a single common threshold represents a compromise. To overcome this compromise, we developed a new strategy based on the use of multiple threshold values for each MS/MS spectrum. First, all MS/MS spectra were centroided and deisotoped, and only peaks with an S/N above 3 were selected regardless of their cluster areas, with peaks in the forbidden regions removed. To estimate appropriate cluster area thresholds for fragment ion selection, a histogram of the occurrence of cluster areas (i.e., peak intensities) from all MS/MS spectra was generated, as shown in Figure 3. It can be seen that there is a 2-fold increase in the number of peaks with a cluster area at 100, relative to 200. This result suggests, not surprisingly, that, for most spectra, the background noise peaks have a cluster area below 200. From the results in Figure 3, three levels of peak cluster area (100, 300, 500) were chosen for generating three separate peak lists for each MS/MS spectrum. As described above, peaks in the MS/MS spectrum with S/N greater than 3 were preselected first, followed by peaks with m/z values in the forbidden mass
Figure 3. Histogram of occurrence of ion cluster areas in MS/MS spectra. T1, T2, and T3 indicate the cluster area cutoffs used for generating three peak lists for each MS/MS spectrum.
ranges being removed, and then the 20 most intense peaks in each 200-Da mass interval were selected. This step was followed by the generation of the three peak lists with fragment ion intensities greater than the selected threshold cluster area values. For the purpose of comparison with the standard processing procedure, the database search was performed for all three lists with ±50 ppm mass accuracy. As seen in Figure 3, ∼70% of all significant MS/MS spectra attained a significant score (c.i. 95%) for all three peak lists, 20% for two peak lists, and 10% for only one peak list per spectrum. Importantly, compared to the standard processing procedure, the three peak list approach resulted in 201 (11.3%) additional identifications. It should be noted that, in this initial study, all three peak lists for each spectrum were submitted for the database search; however, it is also possible to submit the peak list with a higher threshold only if the spectrum had not yet been identified with the lower threshold.
The influence of the multiple peak list procedure on the percentage of false positive identifications was next estimated by searching the same data set using all three peak lists per MS/MS spectrum against the reversed yeast database. The search resulted in a total of 60 significant hits, corresponding to 3.2% of all identifications in the normal database. This level represented a slight increase in the false positive identifications compared to the standard procedure (single cluster area threshold) where 2.7% false positives were found.
An example of a successful identification of a low-intensity MS/MS spectrum from this three cluster area approach is shown in Figure 4A. Peptide QLYSFDLEGFWMDVGQPK was identified with a highly significant Mascot score (56) using a peak list generated with a cluster area threshold of 100. For a cluster area threshold set to 300, the resultant Mascot score decreased to only 6, due to availability of only a few fragment ions of sufficient intensity.
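The peak selection described in this section (S/N preselection, removal of forbidden masses, the 20 most intense peaks per 200-Da window, then three cluster area cutoffs) can be sketched as follows. This is an illustrative sketch, not the authors' code: the `Peak` record, the function names, and the choice of the integer part of m/z as the nominal mass are assumptions, and the allowed bands are built from whatever list of theoretical fragment masses is supplied.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Peak(NamedTuple):
    mz: float
    area: float  # cluster area (intensity)
    sn: float    # signal-to-noise ratio


def allowed_bands(theoretical_mz: Iterable[float],
                  pad: float = 0.2) -> Dict[int, Tuple[float, float]]:
    """Per nominal mass, the [lowest - pad, highest + pad] band of theoretical fragments."""
    bands: Dict[int, Tuple[float, float]] = {}
    for mz in theoretical_mz:
        nominal = int(mz)  # assumption: nominal mass = integer part of m/z
        lo, hi = bands.get(nominal, (mz, mz))
        bands[nominal] = (min(lo, mz), max(hi, mz))
    return {n: (lo - pad, hi + pad) for n, (lo, hi) in bands.items()}


def is_allowed(mz: float, bands: Dict[int, Tuple[float, float]]) -> bool:
    """True if mz falls in a padded band; neighbors are checked because of the padding."""
    for nominal in (int(mz) - 1, int(mz), int(mz) + 1):
        if nominal in bands and bands[nominal][0] <= mz <= bands[nominal][1]:
            return True
    return False


def three_peak_lists(peaks: List[Peak], bands: Dict[int, Tuple[float, float]],
                     thresholds: Tuple[float, ...] = (100.0, 300.0, 500.0),
                     min_sn: float = 3.0, top_n: int = 20,
                     window: float = 200.0) -> List[List[Peak]]:
    """One peak list per cluster area threshold, after the shared preselection."""
    kept = [p for p in peaks if p.sn > min_sn and is_allowed(p.mz, bands)]
    by_window: Dict[int, List[Peak]] = defaultdict(list)
    for p in kept:
        by_window[int(p.mz // window)].append(p)
    selected: List[Peak] = []
    for group in by_window.values():
        selected.extend(sorted(group, key=lambda p: p.area, reverse=True)[:top_n])
    selected.sort(key=lambda p: p.mz)
    return [[p for p in selected if p.area > t] for t in thresholds]
```

A weak spectrum keeps enough fragments only in the list built with the lowest cutoff, while a noisy spectrum is cleaner in the list built with the highest cutoff, which is the effect illustrated in Figure 4.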
The accuracy of identification of this peptide was confirmed by a mass difference of only 2.2 ppm compared to the theoretical m/z of the peptide (database precursor ion tolerance set to ±50 ppm) and also by the identification of five additional peptides from the same protein in the yeast sample. The example in Figure 4A illustrates that lowering the intensity threshold for peak selection can indeed increase the Mascot score in specific cases. However, if the same low-intensity peak selection parameter (cluster area 100) were used for all MS/MS spectra, the increased level of noise could result in a decrease of the
Mascot score in some cases. For example, Figure 4B presents an MS/MS spectrum that resulted in a significant score only with a cluster area threshold of 500. The accuracy of this peptide identification, EVSDGVIAPGYEPEALAILSK, was confirmed by the presence of an additional 15 unique peptides, corresponding to the same protein. Figure 4 demonstrates that multiple-intensity thresholds increase the number of correctly identified peptide sequences from MS/MS spectra.
Precursor Ion Mass Accuracy. In addition to the different intensity thresholds, another factor examined in this work was the influence of precursor ion mass accuracy on peptide identification. Roughly 1000 MS/MS spectra from the analysis of a single SCX fraction were sequentially searched by Mascot employing an increasing mass tolerance for the precursor ion. Figure 5 shows the results of this study in terms of the dependence of the value of the Mascot significant score (c.i. 95%) on the precursor mass tolerance. In Figure 5A, the search was performed against the entire MSDB database36 for all organisms, while in Figure 5B, all ORFs of the yeast genome alone were used. As has been already discussed, the Mascot algorithm calculates the significant score from the number of theoretical peptides within the mass tolerance of the precursor ion,13 and thus, the Mascot significant score became larger with the increasing mass tolerance up to roughly ±100 ppm. Then, the value of the significant score plateaued since most of the theoretically predicted peptides with m/z below 4000 Da could be found within the ±100 ppm range. Higher absolute values of the significant score for the MSDB database were observed due to a much greater number of theoretical peptides generated from the all-species database compared to a single organism. The dependence of the significant score on the precursor ion mass tolerance can be utilized to increase the number of identified peptides.
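The link between tolerance and significance threshold can be made concrete with a small calculation. The sketch below assumes the commonly described form of the Mascot identity threshold, T = -10 log10(p / N), where N is the number of candidate peptides inside the precursor tolerance window and p is the accepted error rate; this paper does not state the formula, so both the form and the function names are assumptions.

```python
import math


def identity_threshold(n_candidates: float, p: float = 0.05) -> float:
    """Score above which a match is significant at level p (assumed form)."""
    if n_candidates <= 0 or not 0 < p < 1:
        raise ValueError("n_candidates must be positive and p must be in (0, 1)")
    return -10.0 * math.log10(p / n_candidates)


def candidates_for_threshold(score: float, p: float = 0.05) -> float:
    """Inverse relation: candidate count implied by a given threshold score."""
    return p * 10.0 ** (score / 10.0)


# A wider tolerance admits more candidates, so the threshold rises.
for n in (2.5, 10, 100, 1000):
    print(n, round(identity_threshold(n), 1))
```

Under this assumption, the thresholds of 23 and 17 quoted in this section for the yeast database correspond to roughly 10 and 2.5 candidate peptides per precursor, and a fixed candidate count gives a higher threshold at 99% confidence than at 95%.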
Figure 4. Example of identification of a peptide using different cluster area thresholds. (A) MS/MS spectrum that can be assigned to peptide QLYSFDLEGFWMDVGQPK with a significant score only when the fragment ions are selected using a cluster area threshold of 100. (B) MS/MS spectrum that can be assigned to peptide EVSDGVIAPGYEPEALAILSK with a significant score only when the fragment ions are selected using a cluster area threshold of 500.
When the mass accuracy in the MS mode is improved, the database search can be performed with narrower precursor ion mass tolerance, resulting in a lower value of the significant score and thus more significant hits. Mass accuracy typically achieved with the AB 4700 instrument, using external calibration, was ±50 ppm, corresponding to a Mascot significant score of 23. Internal calibration with two or more mass calibrants could provide a mass accuracy better than ±10 ppm with a Mascot significant score of 17 (refs 37, 38). However, the direct addition of mass standards to the MALDI matrix solution can cause ion suppression of analytes. Recently, we introduced a technique that utilizes a closely placed external standard containing five calibrants across the full mass range for mass calibration. In this procedure, the spectra from the analyte trace and calibrant trace were acquired within one acquisition sequence, resulting in a mass accuracy of better than ±10 ppm without ion suppression.24 Then, the accuracy of the calibration was evaluated based on highly significant identifications (Mascot score >40) obtained by the standard MS/MS processing using ±50 ppm precursor mass tolerance. By comparison of the precursor m/z values to the identified peptides, a mass accuracy better than ±10 ppm was confirmed for ∼95% of the highly significant identifications. The remaining 5% of identifications were further examined, and it was found that these peaks corresponded to precursors with low S/N. In such cases, noise may influence the determination of the m/z peak centroid. In addition, some peaks were found
(36) ftp://ftp.ncbi.nih.gov/repository/MSDB/.
(37) Gobom, J.; Mueller, M.; Egelhofer, V.; Theiss, D.; Lehrach, H.; Nordhoff, E. Anal. Chem. 2002, 74, 3915-3923.
(38) Moskovets, E.; Karger, B. L. Rapid Commun. Mass Spectrom. 2003, 17, 229-237.
Table 1. Comparison of the Number of Significant Identifications in the Normal and Reversed Databases for Standard and Enhanced Processing(a)
processing step                    normal DB      reversed DB
standard processing, total(b)      1777           48
enhanced processing
  3 peak lists(c)                  201            12
  combined approach(d)             377            25
  semitryptic(e)                   96             2
  wavelet denoise(f)               178            21
total                              2428 (+37%)    96
additional IDs                     651            48
(a) Standard processing, commercial software on AB 4700 instrument. Enhanced processing, newly developed procedure to increase information content.
(b) Number of identified peptides using trypsin specificity with ±50 ppm precursor mass tolerance.
(c) Additional identifications after generation of three peak lists per MS/MS spectrum with ±50 ppm mass tolerance.
(d) Additional identifications using three peak lists, ±10 ppm and ±30 ppm mass tolerance.
(e) Additional significant peptides with only one tryptic end with ±10 ppm mass tolerance.
(f) Additional significant identifications after wavelet denoising using trypsin specificity, with ±10 ppm mass accuracy.
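The false positive percentages quoted in the text follow from Table 1 as the ratio of reversed-database hits to normal-database hits. A minimal sketch of that arithmetic, using the totals from Table 1 (the function name is an assumption):

```python
def false_positive_rate(reversed_hits: int, normal_hits: int) -> float:
    """Estimated false positive percentage: reversed-DB hits per normal-DB hit."""
    if normal_hits <= 0:
        raise ValueError("normal_hits must be positive")
    return 100.0 * reversed_hits / normal_hits


# (normal DB, reversed DB) totals taken from Table 1.
standard = (1777, 48)
enhanced = (2428, 96)

for label, (normal, reverse) in (("standard", standard), ("enhanced", enhanced)):
    print(f"{label}: {false_positive_rate(reverse, normal):.2f}% false positives")

# Identifications gained by the enhanced processing.
print("additional IDs:", enhanced[0] - standard[0])
```

Both estimates stay below the 5% error rate that corresponds to the 95% confidence level used in the search.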
Figure 5. Dependence of Mascot significant score (c.i. 95%) on mass tolerance of the precursor ion. (A) MSDB database using all organisms;36 (B) all ORFs of yeast.
to overlap with another peak, and this could also shift the peak centroid. The database search of the data set with the three peak lists per spectrum, corresponding to three different thresholds for cluster area, was then performed using ±10 ppm mass tolerance. In total, 2101 peptides were identified with a significant score (Mascot score >17). Rather than discard the ∼5% of the precursor peaks with mass accuracy outside the ±10 ppm interval, the program first excluded all significant hits with ±10 ppm and repeated the search with ±30 ppm mass tolerance. This time, to reduce false positive identifications, only hits with 99% confidence, corresponding to a Mascot significant score of 26.5, were accepted. This processing resulted in 19 additional significant identifications.
The comparison of peptide identifications with enhanced data processing to that using AB 4700 standard software is presented in Table 1. The combination of the three peak lists as a function of intensity plus the improved mass accuracy resulted in an additional 377 identifications (21.2%), relative to the standard processing. Importantly, as can be seen from the reversed database results, the percentage of false positive identifications increased from 2.7% to only 3.6%, which is still below the 5% level, corresponding to 95% confidence. The combination of the multiple sequential steps, from three peak lists per spectrum and the improved mass accuracy, allowed identification of additional peptides. Moreover, improved mass accuracy can help in excluding false positive identifications. As an example, Figure 6 shows an MS/MS spectrum for a precursor ion with m/z = 820.449. Mascot analysis assigned two different
peptides to this spectrum for a peak list generated with a cluster area greater than 100, both with scores higher than the significance threshold for ±50 ppm precursor mass tolerance (22). The precursor ion matched the first candidate peptide, INFGIEK, with a mass difference of -8.9 ppm. The second candidate, LNMoxAVEK, differed from the precursor ion by 31.3 ppm. Since the mass accuracy of the precursor ion was expected to be within ±10 ppm, the second match could be ruled out. In addition, the first peptide belonged to pyruvate kinase, which was a highly abundant protein identified by an additional 44 unique peptides, whereas the second candidate was the only peptide that belonged to protein Ygr126wp. Thus, mass accuracy of the precursor ion plays an important role in increasing the number of peptide identifications, while excluding many false positive identifications.
Additional Processing. Although sequencing grade trypsin was used for the digestion, it is known that peptides without fully tryptic termini are present in complex cell extract samples due to, for example, the presence of nonspecific proteases. As shown in Figure 2, MS/MS spectra that did not yield a significant identification with the improved mass accuracy based on full tryptic digestion were next selected (i.e., identified spectra were excluded) and submitted to database searching with semitryptic enzyme specificity. This search resulted in an additional 96 (5%) peptide identifications (Mascot score >35) that were added to the list of identified peptides. The search using the reversed database yielded only two significant identifications, indicating the high selectivity of the Mascot algorithm. At this stage of the data processing, ∼75% of all acquired MS/MS spectra were still unidentified. To increase the number of identifications, the database search could be repeated and specific posttranslational modifications could be considered.
In this study, however, another generic procedure to enhance the number of identified peptides was employed.
Wavelet Denoising. The wavelet transformation is known to be a very effective denoising procedure,39 and thus, it was selected for processing of all MS/MS spectra that were not identified up to this point. The denoising was employed in the final stage of the processing in order to minimize the impact of potential spectral distortion by this procedure, which could lead to false negative identifications. Initially, the optimization of wavelet denoising was performed with a subset of 200 MS/MS spectra by varying parameters such as the wavelet base functions, order, and level
(39) Shao, X. G.; Leung, A. K.; Chau, F. T. Acc. Chem. Res. 2003, 36, 276-283.
Figure 6. Importance of mass accuracy in peptide identifications. MS/MS spectrum that can be assigned to two peptide sequences with significant Mascot scores. Based on the mass accuracy of the precursor ion, peptide LNMoxAVEK was excluded as a false positive.
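The mass differences quoted for the two candidates in Figure 6 can be reproduced from standard monoisotopic residue masses. The sketch below assumes a singly protonated precursor, [M+H]+, and uses hypothetical helper names; it is an illustration of the check, not the software used in this work.

```python
# Monoisotopic residue masses (Da) for the 20 common amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276
OXIDATION = 15.994915  # one added oxygen, e.g., oxidized methionine


def mh_plus(sequence: str, n_oxidized_met: int = 0) -> float:
    """Theoretical m/z of the singly protonated peptide (assumed charge state)."""
    residues = sum(MONO[aa] for aa in sequence)
    return residues + WATER + PROTON + n_oxidized_met * OXIDATION


def ppm_error(observed: float, theoretical: float) -> float:
    """Signed mass error of the observed value in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


observed = 820.449
first = ppm_error(observed, mh_plus("INFGIEK"))
second = ppm_error(observed, mh_plus("LNMAVEK", n_oxidized_met=1))
print(f"INFGIEK: {first:.1f} ppm, LNMoxAVEK: {second:.1f} ppm")
```

Only the first candidate falls inside a ±10 ppm window, which is the basis for rejecting the second match.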
of decomposition.40 Finally, based on this study, it was decided to use the Sym8 function for denoising with the level of decomposition equal to five. After applying the denoising procedure, a single peak list per MS/MS spectrum was created using an S/N of 15. Next, the database searching with ±10 ppm mass tolerance was performed with full tryptic enzyme specificity. This search resulted in an additional 178 (10%) peptide identifications. An example of a denoised spectrum is shown in Figure 7, where an almost complete y-ion series was recovered after applying the wavelet denoising, without a decrease in the quality of the MS/MS spectrum. The peptide AVDDFLISLDGTANK became highly significant with a Mascot score of 47, compared to the original score of only 17. This example demonstrates further that application of proper signal enhancement techniques can increase the number of significant identifications. In addition, the level of false positives was estimated by searching against the yeast reversed database, and an additional 21 significant identifications were found. Nevertheless, the total level of false positives (3.9%) was still below the 5% error rate corresponding to Mascot's 95% confidence, and thus, the wavelet denoising was incorporated into the enhanced strategy.
Comparison of Standard and Enhanced Processing. Taking into account all processing steps, the overall comparison of identifications versus standard AB 4700 processing is summarized in Table 1. The standard processing, assuming only tryptic specificity, resulted in identification of 1777 peptides with 48 (2.7%) false positives. The combined identifications after the enhanced processing resulted in 2428 with 96 (3.9%) false positive identifications; thus, the enhanced processing added a total of 651 peptides to the original 1777. The rate of false positive identifications increased, but it was still less than the 5% error rate defined by Mascot. Note that protein identification was based on at least two significant peptides, which greatly improved confidence of the identified proteins. Table 2 summarizes the detailed results for identification of peptides, unique peptides, and proteins using both processing methods. The number of unique peptides represented different peptide sequences; for example, oxidation of methionine was not considered as a new identification. The number of unique proteins consisted of proteins that differed by at least one identified peptide. When the same peptides matched more than one protein, the proteins were grouped together and counted only once as a unique hit. The complete enhanced processing allowed identification of 33% additional unique peptides. Out of the total number of identified peptides, ∼40% were redundant (the same peptide identified more than once) using either processing method. We next investigated the sources of peptide redundancy for enhanced processing. First, the peptide redundancy within the fractions was calculated. It was found that on average only 6% of peptides were identified more than once in the same SCX fraction. This finding demonstrated the inherent advantage of the off-line precursor selection approach compared to the on-line data-dependent analysis (e.g., ESI), where several MS/MS spectra are typically acquired for the same peptide. Then, the peptide redundancy between fractions was investigated. It was found that
(40) Mallat, S. A wavelet tour of signal processing; Academic Press: London, 2001.
Figure 7. Denoising of MS/MS spectrum by wavelet transformation. (A) Original MS/MS spectrum of the precursor with m/z = 1578.78 with a score below the significance threshold. (B) The same spectrum after denoising, now highly significant with a score of 47. Note the increase in the intensity of the y-ion fragments.
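The general idea of wavelet denoising (decompose, shrink small detail coefficients, reconstruct) can be sketched without external libraries. Several choices here are substitutions and should not be read as the procedure used in this work: the sketch uses the Haar wavelet instead of Sym8 so that the filters fit in a few lines, and it applies soft thresholding with a universal threshold estimated from the finest-scale coefficients, whereas the paper does not specify its thresholding rule. A wavelet library such as PyWavelets would be the practical way to obtain the Sym8 filters.

```python
import math
from statistics import median
from typing import List, Tuple

SQRT2 = math.sqrt(2.0)


def haar_forward(x: List[float]) -> Tuple[List[float], List[float]]:
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail


def haar_inverse(approx: List[float], detail: List[float]) -> List[float]:
    """Exact inverse of haar_forward."""
    out: List[float] = []
    for a, d in zip(approx, detail):
        out.extend(((a + d) / SQRT2, (a - d) / SQRT2))
    return out


def soft(value: float, threshold: float) -> float:
    """Soft thresholding: shrink toward zero by the threshold."""
    return math.copysign(max(abs(value) - threshold, 0.0), value)


def denoise(signal: List[float], levels: int = 5) -> List[float]:
    """Haar wavelet denoising with a universal soft threshold (illustrative)."""
    n = len(signal)
    if n == 0 or n % (1 << levels) != 0:
        raise ValueError("length must be a positive multiple of 2**levels")
    approx, details = list(signal), []
    for _ in range(levels):
        approx, d = haar_forward(approx)
        details.append(d)
    # Noise level from the median absolute finest-scale detail coefficient.
    sigma = median(abs(v) for v in details[0]) / 0.6745
    threshold = sigma * math.sqrt(2.0 * math.log(n))
    for d in reversed(details):
        approx = haar_inverse(approx, [soft(v, threshold) for v in d])
    return approx
```

In this toy form, low-amplitude alternating noise is removed while a broad, intense peak survives with slightly reduced height; a smoother wavelet such as Sym8 is better suited to real peak shapes.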
the major source of peptide redundancy was the low separation resolution of the SCX column, resulting in rather broad chromatographic zones, in agreement with the findings of others.29 As further seen in Table 2, the enhanced processing enabled identification of 22% additional proteins with more than one significant peptide, compared to the standard AB 4700 processing. Importantly, the additional identifications were extracted from the already acquired data without the need for additional experiments, requiring only computer processing time. Finally, the protein lists from standard and enhanced processing procedures were compared, and it was found that all of the proteins identified using standard processing were also found after the enhanced processing. Thus, it was concluded that the enhanced processing allowed utilization of additional information in the data without compromising the data quality.
Table 2. Comparison of the Number of Identified Peptides and Proteins from the Standard and Enhanced Processing
number                       standard processing    enhanced processing    added IDs (%)
total peptides               1777                   2428                   36
unique peptides              1156                   1529                   33
unique proteins              461                    567                    23
protein IDs (>1 peptide)     221                    271                    22
CONCLUSIONS
We have developed an enhanced sequential method for processing of MS/MS spectra acquired by the AB 4700 instrument prior to database searching. The method combines several strategies including removal of fragment ion masses in forbidden ranges followed by selection of MS/MS ions based on several criteria. The database search was performed sequentially at different stages of processing to maximize the number of significant identifications. Further, the method utilized improved mass accuracy of the precursor ion, achieved by a closely placed external standard for calibration, and also denoising of the MS/MS spectra using wavelet transformation. The enhanced processing was compared with processing using standard software supplied with the AB 4700 instrument. A model data set consisting of 10 SCX fractions of a tryptic digest of yeast soluble proteins was used to demonstrate the enhanced processing. The application of the processing resulted in an increased number of unique peptide identifications by 33%, corresponding to 22% additional proteins identified with more than one significant peptide. The computer processing time for one SCX fraction was ∼20 min on a single-CPU Pentium 4 computer, resulting in a total analysis time of ∼3.5 h for all spectra. This processing time would dramatically decrease using a computer cluster. Even though the enhanced processing was applied to the MS/MS spectra acquired by the AB 4700 instrument, in principle, similar analysis could be
done with any high-resolution MS/MS spectra, for example, using q-TOF instruments in either the ESI or MALDI mode.
ACKNOWLEDGMENT
The authors gratefully acknowledge NIH for support of this work (GM 15847, HG 02033). The authors also thank Dr. S. Martin and Dr. P. Savickas, Applied Biosystems, for stimulating discussion and Dr. J. Kowalak, National Institute of Mental Health, for providing Mascot DBParser, S. Pillai, Barnett Institute, for help
with Perl scripts, and A. Pashkova, Barnett Institute, for assistance in sample preparation. Contribution no. 838 from the Barnett Institute.
Received for review May 21, 2004. Accepted July 29, 2004. AC049247V