Integrated Approach for Manual Evaluation of Peptides Identified by Searching Protein Sequence Databases with Tandem Mass Spectra

Yue Chen, Sung Won Kwon, Sung Chan Kim, and Yingming Zhao*

Department of Biochemistry, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas 75390-9038

Received December 23, 2004

Abstract: Quantitative proteomics relies on accurate protein identification, which often is carried out by automated searching of a sequence database with tandem mass spectra of peptides. When these spectra contain limited information, automated searches may lead to incorrect peptide identifications. It is therefore necessary to validate the identifications by careful manual inspection of the mass spectra. Not only is this task time-consuming, but the reliability of the validation varies with the experience of the analyst. Here, we report a systematic approach to evaluating peptide identifications made by automated search algorithms. The method is based on the principle that the candidate peptide sequence should adequately explain the observed fragment ions. Also, the mass errors of neighboring fragments should be similar. To evaluate our method, we studied tandem mass spectra obtained from tryptic digests of E. coli and HeLa cells. Candidate peptides were identified with the automated search engine Mascot and subjected to the manual validation method. The method found correct peptide identifications that were given low Mascot scores (e.g., 20-25) and incorrect peptide identifications that were given high Mascot scores (e.g., 40-50). The method comprehensively detected false results from searches designed to produce incorrect identifications. Comparison of the tandem mass spectra of synthetic candidate peptides to the spectra obtained from the complex peptide mixtures confirmed the accuracy of the evaluation method. Thus, the evaluation approach described here could help boost the accuracy of protein identification, increase the number of peptides identified, and provide a step toward developing a more accurate next-generation algorithm for protein identification.

Keywords: protein identification • manual evaluation • automated database search

* To whom correspondence should be addressed. Tel: (214) 648-7947. Fax: (214) 648-2797. E-mail: [email protected].

Journal of Proteome Research 2005, 4, 998-1005. Published on Web 05/24/2005.

1. Introduction

Quantitative proteomics is an emerging approach for identifying proteins that change with respect to their expression,
modifications, subcellular localizations, or interactions in response to alterations of cellular environment. Tandem mass spectrometry (MS/MS) has become indispensable for identifying and quantifying proteins, largely due to its unparalleled sensitivity and the speed at which fragment mass fingerprints can be generated.1 Combining tandem mass spectrometry with multidimensional liquid chromatography (LC) enables acquisition of thousands of MS/MS spectra in a few hours.2,3 In a typical protein identification experiment, proteins of interest are digested with a proteolytic enzyme, usually trypsin, and the resulting peptides are subjected to LC-MS/MS analysis. Each resulting MS/MS spectrum contains the masses of the parent peptide and its fragment ions. This information is used in an automated search of a protein sequence database to find the peptide that most closely matches the observed spectrum. Searching a collection of sequences with tandem mass spectrometry information is performed by mimicking the experiment: each protein sequence in the collection is theoretically digested according to the cleavage specificity of the enzyme used in the experiment. The masses of each resulting peptide and its potential MS/MS fragments are calculated, and a theoretical tandem mass spectrum is constructed. The measured MS/MS spectrum is then compared to the theoretical MS/MS spectrum and a score is calculated representing the degree of correlation. This procedure is repeated for each protein in the sequence collection. Finally, the proteins in the database are ranked according to the calculated scores. The most commonly used search engines that employ tandem mass spectrometry data are SEQUEST,4 PepSea,5 Mascot,6 Sonar,7 ProbID,8 Popitam,9 and Tandem.10 These search engines use different scoring methods. 
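The "mimic the experiment" step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the algorithm of Mascot or any other engine: the function names are hypothetical, cleavage follows the common trypsin rule (after K or R, but not before P), only singly charged b- and y-ions are generated, and the scoring step is left out entirely. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O
PROTON = 1.00728   # mass of a proton


def tryptic_digest(protein, missed_cleavages=1):
    """Theoretical digest: cleave after K or R unless the next residue is P.

    Hypothetical helper; returns every peptide containing up to
    `missed_cleavages` internal cleavage sites.
    """
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        if is_last or (aa in "KR" and protein[i + 1] != "P"):
            pieces.append(protein[start:i + 1])
            start = i + 1
    peptides = []
    for i in range(len(pieces)):
        last_j = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last_j + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


def fragment_ions(peptide):
    """Singly charged b- and y-ion m/z values for a peptide sequence."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y_ions = [sum(masses[-i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions, y_ions
```

A search engine would compare each such theoretical fragment list with the measured spectrum and assign a score; that comparison is where the engines listed above differ.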
10.1021/pr049754t CCC: $30.25 © 2005 American Chemical Society

In addition to natural amino acid sequences, information about protein modification and isotopic labeling (e.g., isotope-coded affinity tags (ICAT) or stable isotope labeling with amino acids in cell culture (SILAC)) can be incorporated into the database search.11-15 Moreover, in some cases, de novo peptide sequencing has been attempted based on tandem mass spectrometry combined with database searching.5,16-18 Although automated search engines dramatically improve the efficiency of protein identification, they always yield both false and true identifications due to random matching between the experimental and theoretical data.7,19-24 When only small amounts of protein are available for protein digestion, data quality is often poor, making the problem of random matching more serious. In such situations, and during analysis of highly complex peptide mixtures, protein identification may rely on only one MS/MS spectrum. Thus, a robust and accurate method is needed to ensure that protein identifications are of high quality. Distinguishing correct peptide assignments from incorrect assignments in database search results can be achieved, but requires expertise in interpretation of mass spectra. The process involves time-consuming manual inspection of the correlation between the observed MS/MS spectra and the theoretical fragment pattern of the candidate peptides identified by the database search algorithm. A few recent studies have aimed to improve the accuracy and sensitivity of protein identifications made by automated sequence database searching.23,25-27,35,36 Unfortunately, a systematic study of a large set of MS/MS spectra has not yet been carried out to guide manual evaluation of protein identification. Such a study would also provide insights for the design of next-generation database search algorithms, and would provide useful information for efficient use of current software.

Here, we report an approach for the evaluation of peptide identifications derived from searching protein sequence databases. To test the feasibility of this approach for distinguishing correct peptide results from incorrect ones, we analyzed 1389 peptide identifications from tryptic digests of E. coli and HeLa cell extracts. Correct peptide identifications had Mascot scores ranging from 20 to 117, while incorrect identifications had scores ranging from 20 to 60, suggesting that a clear cutoff Mascot score does not exist. MS/MS analyses of synthetic peptides with low Mascot scores (22 to 24) suggest that our manual evaluation method can reliably establish correct peptide results even when Mascot gives a low score and the MS/MS spectrum is of moderate quality. Manual analysis of protein identification results from “cross-species” database searches (e.g., searching the human sequence database with MS/MS spectra of peptides derived from E.
coli proteins), and searches of databases in which the protein sequences have been reversed,33,34 resulted in detection of all incorrect protein identifications. Our investigation clearly demonstrates the critical role of manual evaluation when using the current version of the Mascot search algorithm. The same approach could be applied to evaluating protein identifications derived from other database search algorithms.
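A reversed-sequence database of the kind mentioned above is simple to construct. The sketch below is an assumption-laden illustration: the function name and the `REV_` prefix are hypothetical, and the input is taken to be a plain mapping from protein name to sequence rather than any particular database format.

```python
def reverse_database(proteins):
    """Build decoy entries by writing each sequence from C-terminus to N-terminus.

    `proteins` maps a protein name to its amino acid sequence. The "REV_"
    prefix is a hypothetical naming convention used here only to keep decoy
    entries distinguishable from the originals.
    """
    return {f"REV_{name}": sequence[::-1] for name, sequence in proteins.items()}
```

Because the reversed sequences preserve length and amino acid composition but are not expected to occur in the sample, peptide matches against them can be treated as incorrect by construction.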

2. Materials and Methods

2.1. Materials. Fetal bovine serum (FBS), trypsin, Dulbecco’s modified Eagle’s medium (DMEM), penicillin/streptomycin, and Luria-Bertani (LB) medium were from Life Technologies, Inc. (Carlsbad, CA). Dulbecco’s phosphate buffered saline was purchased from Sigma (Saint Louis, MO). Urea, thiourea, CHAPS, ammonium bicarbonate, and dithiothreitol (DTT) were bought from Fisher Scientific Corp. (Pittsburgh, PA). Trifluoroacetic acid (TFA) was from Fluka (Buchs, Switzerland). Sequencing-grade trypsin was from Promega (Madison, WI). µC18 ZipTips were from Millipore Corp. (Bedford, MA). Luna C18 resin was from Phenomenex (Torrance, CA).

2.2. Methods. 2.2.1. Preparation of Cell Lysates from HeLa Cells and E. coli. One dish (10 cm) of HeLa cells was grown in DMEM supplemented with 10% fetal bovine serum and 1% penicillin/streptomycin in a humidified CO2 atmosphere at 37 °C. When the cells reached 80-90% confluence, they were washed with cold Dulbecco’s phosphate buffered saline twice. To the resulting cell pellet was added 200 µL of cell lysis buffer (6 M urea, 2 M thiourea, 4% CHAPS, 50 mM Tris-HCl, pH 8.0)
to lyse the cells. The cell lysate was harvested and sonicated three times for 5 s each with 20-s intervals between sonications using a 550 Sonic Dismembrator (Fisher Scientific Corp, CA). The lysate was centrifuged at 4 °C for 1 h at 21 000 × g. The debris was discarded while the supernatant was divided into aliquots and stored at -80 °C until use. E. coli DH5 was grown aerobically in LB medium at 37 °C. The cultured cells were harvested at log phase by centrifugation at 4500 × g for 10 min and washed twice by resuspension of the pellet in ice-cold PBS buffer (0.1 M Na2HPO4, 0.15 M NaCl, pH 7.2). The cells were resuspended in chilled lysis buffer (50 mM Tris-HCl, pH 7.5, 100 mM NaCl, 5 mM DTT) and then sonicated with 12 short bursts of 10 s followed by intervals of 30 s for cooling. Unbroken cells and debris were removed by centrifugation at 4 °C for 30 min at 21 000 × g. The supernatant was divided into aliquots and stored at -80 °C until use.

2.2.2. Protein Digestion. HeLa cell lysate solution was diluted with four volumes of 50 mM ammonium bicarbonate buffer (pH 8.0) to reduce the urea concentration. Trypsin in 50 mM ammonium bicarbonate buffer was added to the HeLa cell lysate at an enzyme-to-substrate ratio of 1:50. After overnight incubation at 37 °C, peptide solutions were dried in a SpeedVac (ThermoSavant Corp, Holbrook, NY) and reconstituted in 0.1% (v/v) TFA solution. E. coli cell lysate was digested in a similar fashion. µC18 ZipTips were used to wash the tryptic peptides according to the manufacturer’s directions before nano-HPLC/mass spectrometry.

2.2.3. Nano-HPLC Mass Spectrometry Analysis. HPLC-MS/MS analysis was performed in an LCQ DECA XP ion-trap mass spectrometer (ThermoFinnigan, San Jose, CA) equipped with a nano-electrospray ionization source. The source was coupled online to an Agilent 1100 series nano flow LC system (Agilent, Palo Alto, CA).
A total of 2 µL of the peptide solution in buffer A (2% acetonitrile/97.9% water/0.1% acetic acid, v/v/v) was manually injected and separated in a capillary HPLC column (50 mm length × 75 µm I.D., 5 µm particle size, 300 Å pore diameter) packed in-house with Luna C18 resin. Peptides were eluted from the column with a 60-min gradient of 5% to 80% buffer B (90% acetonitrile/9.9% water/0.1% acetic acid, v/v/v) in buffer A. The eluted peptides were electrosprayed directly into the LCQ DECA XP ion-trap mass spectrometer. The normalized energy for collision-induced dissociation was 35%. Each MS/MS spectrum was obtained by averaging three micro-scans, with a maximum injection time of 110 ms for each micro-scan. The MS/MS spectra were acquired in a data-dependent mode, such that the masses and fragmentation patterns of the three strongest ions in each MS scan were determined. All spectra were acquired in centroid mode.

2.2.4. Protein Sequence Database Search. Tandem mass spectra were used to search the NCBI-nr database with the Mascot search engine (version 1.9, Matrix Science, London, UK). Trypsin was specified as the proteolytic enzyme. Oxidation of methionine residues (+16 Da) and 1 missed cleavage site per peptide were taken into account. The maximum allowable mass error was set to ±4 Da for parent ion masses and ±0.5 Da for fragment ion masses. Charge states of +1, +2, or +3 were considered for parent ions. If more than one spectrum was assigned to a peptide, then each was given a Mascot score and only the spectrum with the highest score was used for manual analysis. Peptides identified with a Mascot score higher than 20 were considered to be potential positive identifications and each was manually verified by the method described in the Results section.
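The selection of candidates for manual analysis described in Section 2.2.4 (one spectrum per peptide, the highest-scoring one, above a score threshold) can be sketched as follows. The function name and the tuple layout of `hits` are hypothetical. The comparison is written as inclusive (score of at least 20) because the reported score ranges start at 20; that reading is an assumption.

```python
def select_for_manual_review(hits, min_score=20):
    """Keep, for each peptide, only its highest-scoring spectrum.

    `hits` is an iterable of (peptide, spectrum_id, mascot_score) tuples,
    a hypothetical layout chosen for this sketch. Hits scoring below
    `min_score` are dropped before the best spectrum is chosen.
    """
    best = {}
    for peptide, spectrum_id, score in hits:
        if score < min_score:
            continue
        if peptide not in best or score > best[peptide][1]:
            best[peptide] = (spectrum_id, score)
    return best
```

Each entry in the returned mapping is one candidate identification to be checked by hand against the rules in the Results section.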

3. Results

3.1. Method for Evaluation of Peptide Identification. We have established three rules to evaluate protein identifications made by automated protein sequence database searches. The rules are based on our accumulated experience manually analyzing MS/MS spectra and on the principle that a correct result should explain all the major mass spectrometric peaks in the MS/MS spectrum, except peaks resulting from electronic sparks that occasionally occur during data acquisition. The rules were also trained with reversed sequence databases and cross-species sequences. We consider peptide identifications that meet the rules to be “Correct” identifications. Specifically, the following criteria were used to evaluate protein identifications:

Rule I: Normal rule for validation of peptide candidates of doubly charged ions
1. Only y-, b-, or a-ions or associated peaks arising due to water or amine loss are considered as daughter ions of a parent peptide. At least 5 isotopically resolved, independent fragment peaks must match theoretical peptide fragments.
2. All isotopically resolved peaks with intensities higher than 5% of the maximum intensity and m/z values larger than that of the doubly charged parent ion must match theoretical peptide fragments.
3. All isotopically resolved peaks with intensities higher than 20% of the maximum intensity and m/z values between one-third of the parent m/z value and the parent m/z value must match theoretical peptide fragments.
4. The difference in the mass errors of neighboring fragment peaks that are within 200 Da of each other must be lower than 0.4 Da.

Fragment peaks that have no matched companion peak differing by a water or amine loss, and no matched doubly charged counterpart, are considered independent peaks. Only independent peaks are added to the total number of matched peaks (≥5).
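Criteria 2 through 4 of Rule I are mechanical enough to express as code. The following is a minimal sketch, not the authors' software: the function names and input layouts are hypothetical, the input is assumed to contain only isotopically resolved peaks that have already been compared with the theoretical fragments, "maximum intensity" is taken as the largest intensity among the supplied peaks, and "neighboring" is interpreted as adjacent in m/z order.

```python
def passes_intensity_criteria(peaks, parent_mz):
    """Check Rule I criteria 2 and 3.

    `peaks` is a list of (mz, intensity, matched) tuples for isotopically
    resolved peaks; `matched` says whether the peak agrees with a
    theoretical fragment. `parent_mz` is the m/z of the doubly charged
    parent ion.
    """
    if not peaks:
        return False
    max_intensity = max(intensity for _, intensity, _ in peaks)
    for mz, intensity, matched in peaks:
        relative = intensity / max_intensity
        # Criterion 2: above the parent m/z, every peak over 5% must match.
        if mz > parent_mz and relative > 0.05 and not matched:
            return False
        # Criterion 3: between one-third of the parent m/z and the parent
        # m/z, every peak over 20% must match.
        if parent_mz / 3 <= mz <= parent_mz and relative > 0.20 and not matched:
            return False
    return True


def neighbor_errors_consistent(matches, window=200.0, max_spread=0.4):
    """Check Rule I criterion 4.

    `matches` is a list of (observed_mz, theoretical_mz) pairs. Adjacent
    matched fragments closer than `window` Da must have mass errors that
    differ by less than `max_spread` Da.
    """
    ordered = sorted(matches)
    for (obs_a, theo_a), (obs_b, theo_b) in zip(ordered, ordered[1:]):
        if obs_b - obs_a < window:
            if abs((obs_a - theo_a) - (obs_b - theo_b)) >= max_spread:
                return False
    return True
```

Criterion 1 (at least 5 independent matched peaks) is omitted here because deciding which peaks are independent requires the ion annotations themselves.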
A small fraction of peptides fragment by unusual pathways.28 Therefore, the masses of some daughter ions will not match those of y-, b-, or a-ions of the parent peptide. Alternatively, two peptides with close parent masses might be coeluted from the HPLC column; they will then be isolated and fragmented simultaneously in the mass spectrometer. To handle these situations, an alternate rule is used to evaluate the peptide identification.

Rule II: Alternate rule for validation of peptide candidates of doubly charged ions
1. Only y-, b-, or a-ions or associated peaks arising due to water or amine loss are considered as daughter ions of a parent peptide.
2. At least 7 isotopically resolved, independent fragment peaks must match theoretical peptide fragments. At least three fragments must be consecutive in the peptide sequence (e.g., y6, y7, and y8).
3. The difference in the mass errors of neighboring fragment peaks that are within 200 Da of each other must be lower than 0.4 Da.

In these rules, isotopically resolved peaks were emphasized because a single peak could come from an electronic spark or chemical noise. Single peaks are less likely to be relevant to peptide fragments. Also, noise peaks are less abundant in the high mass region than in lower mass regions. Therefore, peptide fragments with an m/z ratio higher than the m/z ratio of the doubly charged parent ion must be explained by the
peptide sequence. Otherwise, the peptide identification should be considered an incorrect identification. The criteria for low mass peaks (below the doubly charged parent m/z ratio) are less stringent because more significant noise peaks exist in the low mass region. The mass error of a fragment refers to the difference between the observed m/z value and the m/z value of the matched theoretical fragment. Mass errors of fragment ions with similar m/z values should be closely related. If mass errors of fragment peaks with mass difference less than 200 Da fluctuate by more than 0.4 Da, an incorrect assignment is suggested. Internal fragmentation of peptides sometimes occurs. Internal fragmentation should be considered when a spectrum contains at least 7 isotopically resolved peaks that match theoretical fragment masses. The internal fragment ions are likely to be derived from b- or y-ions that have strong intensities and contain basic residues. When y-, b-, or a-ions or associated peaks arising due to water or amine loss cannot explain the peaks in an MS/MS spectrum according to Rule I, internal fragment ions are considered and Rule II is applied. In this event, we run the Mascot search with MALDI-TOF-TOF selected as the instrument type, because this search includes internal fragment masses obtained using high-energy collision-induced dissociation. A very small percentage of peptides fragment with an atypical mechanism, such as peptide rearrangement as suggested by Vazquez and his colleagues.28 Rule II was also developed to handle atypically fragmenting peptides. In general, previously observed rules of peptide fragmentation should be followed. In recent years, attempts were made to systematically analyze peptide fragmentation rules, to facilitate evaluation of protein identification.29-31 These studies have dramatically increased our understanding of peptide behavior in the LCQ mass spectrometer.
For example, previous studies established that peptide bonds immediately N-terminal to proline residues and immediately C-terminal to aspartate and glutamate residues tend to break easily in ion trap mass spectrometers.29,30,32 Since these rules are not applicable to all peptide sequences, we use them only as extra parameters for confirmation. For singly charged peptides, MS/MS spectra usually have more unpredictable noise peaks with high intensity. Therefore, rules for manual evaluation of doubly charged peptides cannot normally be applied to singly charged ions. We use the following rule for evaluation of the identities of singly charged peptides:

Rule III: Validation of peptide candidates of singly charged ions
1. The Mascot score should be equal to or above the identity score threshold of the peptide.
2. For peptides ending with an arginine or lysine residue, both the b-ion and the y-ion series should confirm at least 3 consecutive amino acids in the peptide sequence.
3. For C-terminal peptides without a C-terminal arginine or lysine residue, either the b-ion or the y-ion series should confirm at least 3 consecutive amino acids in the peptide sequence.

3.2. Using the Rules to Thoroughly Identify Incorrect Protein Identifications. To test the effectiveness of these Rules, we carried out nano-HPLC/LCQ mass spectrometric analyses of tryptic peptides derived from E. coli proteins and HeLa proteins (chromatograms and mass spectra in Supporting Figure 1). Three thousand ninety-nine MS/MS spectra were

Figure 2. Experimental verification of an incorrect peptide identification from a cross-species search of the human sequence database. (A) MS/MS spectrum of a tryptic peptide from E. coli that resulted in identification of peptide AQVVPPAR from the human sequence database, with MASCOT score 51. (B) MS/MS spectrum of synthetic peptide AQVVPPAR.

Figure 1. Distribution of Mascot scores for incorrect peptide identifications from cross-species and reversed protein database searches. Several database searches were conducted with the intent of obtaining incorrect peptide identifications. The Mascot score distributions for the resulting peptide identifications were plotted. The searches were as follows: (A) the human sequence database was searched with MS/MS spectra of E. coli peptides; (B) the E. coli sequence database was searched with MS/MS spectra of peptides derived from HeLa cells; (C) the reversed human sequence database was searched with MS/MS spectra of peptides derived from HeLa cells; and (D) the reversed E. coli sequence database was searched with MS/MS spectra of E. coli peptides.

acquired for E. coli peptides, and 5358 MS/MS spectra were collected for cytosolic HeLa peptides in two separate 2-hour nano-HPLC/mass spectrometric analyses. 3.3. Peptide Identifications Obtained by Cross-Species Protein Sequence Database Searching. To test whether the Rules for manual evaluation described above can exhaustively identify incorrect peptide identifications, we used the Mascot software to search human protein sequences in the NCBI-nr database with MS/MS spectra of tryptic peptides derived from E. coli proteins. We reasoned that all human peptide sequences identified using MS/MS spectra of E. coli tryptic peptides should be incorrect identifications, unless the peptide sequences are shared between human and E. coli. This cross-species search led to identification of 464 peptides with Mascot scores ranging from 20 to 51 (Figure 1A). Evaluation of peptide identification using our Rules established that all the identifications were false. To further confirm the falseness of these identifications, we synthesized the peptide AQVVPPAR, which had the highest Mascot score (51). This score is above the identity threshold given by Mascot for this particular peptide. The MS/MS spectrum of the synthetic peptide showed a different fragmentation pattern than that obtained during the LC-MS/MS analysis (Figure 2), showing that this peptide was incorrectly identified. In a parallel experiment, we searched the E. coli protein sequence database using the MS/MS spectra of tryptic peptides derived from HeLa cell extracts. One hundred eighty-six E. coli peptides were identified, with Mascot scores ranging from 20 to 50 (Figure 1B). Evaluation of peptide identification using our approach established that peptide IINEPTAAALAYGLDK from the E. coli protein database (Mascot score 50) was the only correct identification. The major peaks of the MS/MS spectrum could be explained by the theoretical mass fingerprint of the peptide. 
Comparison of the observed MS/MS spectrum with that of a synthetic peptide of the same sequence revealed that the two spectra were nearly identical, fulfilling Rule II (Figure 3). These results suggest that this sequence is a true peptide identification. A BLAST search using the sequence of this peptide showed that a homologous peptide is present in the 70-kDa human heat shock protein 5, with an isobaric amino acid substitution of leucine for isoleucine. We synthesized a peptide representing another identification from this experiment with a Mascot score of 38. Tandem mass spectrometry of the synthetic peptide confirmed that it was incorrectly identified (data not shown; see Supporting Figure 2).

Figure 3. Experimental verification of a correct peptide identification from a cross-species search of the E. coli sequence database. (A) MS/MS spectrum of a tryptic peptide from HeLa cells that resulted in identification of peptide IINEPTAAALAYGLDK from the E. coli sequence database. (B) MS/MS spectrum of synthetic peptide IINEPTAAALAYGLDK.

3.4. Peptide Identifications Obtained from Searching Protein Databases with Reversed Protein Sequences. Next, we used the MS/MS spectra of tryptic peptides derived from HeLa cell extracts to search a sequence database composed of all human proteins with their sequences in the reverse order (i.e., from C-terminus to N-terminus). The search led to 408 peptide identifications with Mascot scores ranging from 20 to 44 (Figure 1C). Similarly, MS/MS spectra obtained from tryptic peptides of E. coli proteins were used to search a reversed E. coli protein sequence database, resulting in identification of 214 peptide candidates with Mascot scores ranging from 20 to 43 (Figure 1D). Manual evaluation with our Rules suggested that all these identifications were incorrect. Taken together, these data demonstrate the following: (1) incorrectly identified peptides can have Mascot scores of up to 50; and (2) false protein identifications were comprehensively removed by our manual evaluation method.

3.5. Using the Rules to Identify Correct Peptide Identifications with Low Mascot Score. Next, we used the MS/MS spectra of the tryptic peptides derived from HeLa cells to search the human protein sequence database. The search led to the identification of 745 peptides with Mascot scores ranging from 20 to 117 (Figure 4A). Based on our previous experience, few peptide identifications with scores below 20 can be correlated with MS/MS spectra. Therefore, peptides given scores below 20 were not analyzed. Manual evaluation suggested that 376 of the 745 candidates were true identifications while 369 were incorrect (Figure 4A).
To test whether we had mistakenly considered an incorrect identification to be correct, we synthesized two peptides with Mascot scores of 22 and 24, well below the homology threshold given by Mascot for these peptides. The MS/MS spectrum of each synthetic peptide contained the same fragment signatures as the spectrum obtained in the HPLC-MS/MS analysis, confirming each of these peptide identifications (Figure 5).

Figure 4. Distribution of Mascot scores for true and false peptide identifications of E. coli and HeLa cytosolic peptides. MS/MS spectra of E. coli and HeLa tryptic peptides were used to search the protein sequence databases of the appropriate species. All peptide identifications with Mascot score of at least 20 were manually verified. The Mascot score distributions of true and false peptide identifications from analysis of (A) HeLa proteins and (B) E. coli proteins are shown.

Protein identification and manual evaluation were also carried out for MS/MS spectra of E. coli tryptic peptides, searching the E. coli sequence database. Five hundred eighty-nine peptide candidates were identified with Mascot scores between 20 and 100. Manual evaluation showed that 322 were true identifications while 267 were incorrect (Figure 4B). MS/MS analysis of two synthetic peptides representing correct identifications with low Mascot scores (20, 27) again confirmed that the identifications were indeed correct (data not shown; see Supporting Figure 3). Collectively, these experiments demonstrated that (1) correctly identified peptides could have Mascot scores as low as 20 (lower than the corresponding homology score), again indicating that there is no definitive threshold Mascot score for true peptide identifications, and (2) our Rules were able to evaluate correct peptide identifications with low scores. By default, the Mascot software considers a match significant when its score has less than a 5% probability of occurring by chance (http://www.matrixscience.com/help/scoring_help.html). In our analyses, this value corresponded to a Mascot score of 36 for the E. coli database search and 42 for the human database search. Given this information, a Mascot score of 40 might be considered a reasonable threshold for correct identification in our experiments. We found that ∼99% of human peptide identifications and ∼98% of E. coli peptide identifications with Mascot scores above 40 were correct peptide identifications. However, 44% of human peptide identifications and 38% of E. coli peptide identifications with scores from 20 to 39 were also correct identifications (Figure 6), meaning that applying a strict cutoff value of 40 would result in discarding a considerable number of correct identifications.
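The comparison above amounts to splitting manually evaluated identifications at a score cutoff and computing the fraction judged correct on each side. A small sketch of that tabulation follows. The function name and input layout are hypothetical, and placing a score of exactly 40 in the upper group is an assumption, made because the lower range is given as 20 to 39.

```python
def correct_fraction_by_cutoff(evaluated, cutoff=40):
    """Fraction of correct identifications below and at-or-above a cutoff.

    `evaluated` is a list of (mascot_score, is_correct) pairs from manual
    evaluation. Returns (fraction_below, fraction_at_or_above); a side
    with no identifications yields None.
    """
    below = [ok for score, ok in evaluated if score < cutoff]
    at_or_above = [ok for score, ok in evaluated if score >= cutoff]

    def fraction(flags):
        return sum(flags) / len(flags) if flags else None

    return fraction(below), fraction(at_or_above)
```

Applied to the overall counts reported in the text, the same arithmetic gives 376/745 ≈ 50.5% correct for the HeLa search and 322/589 ≈ 54.7% correct for the E. coli search across all scores of 20 and higher.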


Figure 5. Experimental verification of correct peptide identifications with low Mascot scores. Shown are MS/MS spectra of tryptic peptides from HeLa cells that resulted in identification of peptides NPEPELLVR (A) and HSQDLAFLSMLNDIAAVPATAMPFR (C) from the human sequence database, with Mascot scores of 22 and 24, respectively. Also shown are the MS/MS spectra of the corresponding synthetic peptides NPEPELLVR (B) and HSQDLAFLSMLNDIAAVPATAMPFR (D).

3.6. Mass Errors of Parent Peptides and their Fragment Ions in LCQ Mass Spectrometry. Masses of parent peptides typically have large mass errors due to inconsistent space charge that arises because of variation in the number of ions trapped in the LCQ mass analyzer.42 To survey the mass errors of parent peptides and fragment ions, we analyzed the spectra of 100 manually verified peptide identifications. Mass errors of parent peptides varied from -2 to +4 Da, depending on ion intensity (Figure 7A). Typically, spectra with higher total ion counts produced larger mass shifts due to more significant space charge. For this reason, 4 Da is used in our Rules as the largest allowable mass error for parent ions during automated sequence database searches. Conversely, fragment ions have low mass errors. On the basis of our experience, a recently calibrated LCQ will have fragment ion mass errors of less than 0.5 Da. Mass errors of fragment ions from 6 manually verified peptides are shown in Figure 7B. Fragment ions have low mass errors because relatively few targeted ions are isolated and trapped during MS/MS analysis, resulting in removal of a large number of background ions. Thus, 0.5 Da is used as the largest allowable mass error for fragment ions during automated sequence database searches.
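The two tolerances discussed above can be applied with very little arithmetic. The sketch below is illustrative only: the function names are hypothetical, the parent comparison is assumed to be made on neutral masses derived from the observed m/z and charge, and the proton mass is the standard value.

```python
PROTON = 1.00728  # mass of a proton in Da

PARENT_TOLERANCE = 4.0    # Da, largest allowable parent ion mass error
FRAGMENT_TOLERANCE = 0.5  # Da, largest allowable fragment ion mass error


def neutral_mass(mz, charge):
    """Neutral mass implied by an observed m/z value and charge state."""
    return (mz - PROTON) * charge


def within_tolerance(observed, theoretical, tolerance):
    """True when the absolute mass error does not exceed the tolerance."""
    return abs(observed - theoretical) <= tolerance
```

For example, a doubly charged parent observed at m/z 501.5 corresponds to a neutral mass of about 1000.99 Da, which would still be accepted against a 1000.0 Da candidate under the 4 Da parent tolerance, whereas a fragment that far from its theoretical value would be rejected.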


Figure 6. Distribution of Mascot scores for correct and incorrect peptide identifications. The numbers of true and false peptide identifications in three Mascot score ranges are shown for the nano-HPLC-MS/MS analyses of HeLa peptides (A) and E. coli peptides (B). Also shown are the distributions of positive peptide identifications in the HeLa cell (C) and E. coli (D) analyses if a Mascot score of 40 is selected as the threshold for correct identification.

If a peptide hit is correct, the mass errors of the fragment ions are correlated due to systematic errors associated with the calibration of the instrument. For example, neighboring peaks (e.g., with mass difference