Mass Measurement Errors of Fourier-Transform Mass Spectrometry (FTMS): Distribution, Recalibration, and Application Jiyang Zhang,†,‡,§ Jie Ma,†,§ Lei Dou,† Songfeng Wu,† Xiaohong Qian,† Hongwei Xie,‡ Yunping Zhu,*,† and Fuchu He*,† State Key Laboratory of Proteomics, Beijing Proteome Research Center, Beijing Institute of Radiation Medicine, Beijing 102206, China, and School of Mechanical Engineering and Automatization, National University of Defense Technology, Changsha 410073, China Received April 30, 2008
The hybrid linear trap quadrupole Fourier-transform (LTQ-FT) ion cyclotron resonance mass spectrometer, an instrument with high accuracy and resolution, is widely used in the identification and quantification of peptides and proteins. However, time-dependent errors in the system may lead to deterioration of the accuracy of these instruments, negatively influencing the determination of the mass error tolerance (MET) in database searches. Here, a comprehensive discussion of LTQ/FT precursor ion mass error is provided. On the basis of an investigation of the mass error distribution, we propose an improved recalibration formula and introduce a new tool, FTDR (Fourier-transform data recalibration), that employs a graphic user interface (GUI) for automatic calibration. It was found that the calibration could adjust the mass error distribution to more closely approximate a normal distribution and reduce the standard deviation (SD). Consequently, we present a new strategy, LDSF (Large MET database search and small MET filtration), for database search MET specification and validation of database search results. As the name implies, a large-MET database search is conducted and the search results are then filtered using the statistical MET estimated from high-confidence results. By applying this strategy to a standard protein data set and a complex data set, we demonstrate the LDSF can significantly improve the sensitivity of the result validation procedure. Keywords: bioinformatics • SEQUEST • Mascot • LTQ/FT • mass error tolerance • statistical distribution • recalibration
* To whom correspondence should be addressed. Fuchu He: tel, +8610-68271208; e-mail, [email protected]. Yunping Zhu: tel, +86-10-80705225; e-mail, [email protected]. † Beijing Institute of Radiation Medicine. ‡ National University of Defense Technology. § These authors contributed equally to this paper. 10.1021/pr8005588 CCC: $40.75 2009 American Chemical Society. Journal of Proteome Research 2009, 8, 849–859. Published on Web 12/22/2008
Introduction The linear trap quadrupole Fourier-transform (LTQ-FT) mass spectrometer is a state-of-the-art instrument with high accuracy, sensitivity, and dynamic range.1 It is an ideal platform for high-throughput proteomic analysis of complex samples, such as urine.2 The high accuracy and resolution also make the LTQ-FT useful for quantitative proteomics3-6 and for proteomic analysis of protein modifications.7-9 A commonly used workflow to achieve accurate precursor mass measurement (mass error is generally less than 5 ppm) and high-throughput MS/MS spectra generation in proteomics research10 entails performing a survey MS scan in the FT-ion cyclotron resonance (ICR) cell, fragmenting selected precursor ions, and obtaining the MS/MS spectra in the LTQ component. The high accuracy of the determination of precursor ion mass reduces the occurrence of false-positive matches in results of database searches.2,6,11-17 Quantitative proteomics based on isotopic labeling has also benefited from the high resolution of FT-ICR, which makes it possible to accurately distinguish isotopic peaks.4
However, the mass accuracy of FT-ICR can change over time and the LTQ-FT platform must be calibrated regularly; the manufacturer recommends that this be done every 4 days. We found that the measured m/z error distribution might drift from zero even immediately after calibration. The current database search algorithms do not take this drift into account and assume that the m/z error is distributed symmetrically around zero. Drift outside the limits of the user-defined mass error tolerance (MET) would prevent some MS/MS spectra from being assigned to the correct peptides. With some database search procedures, some MS/MS spectra will not be matched with the correct peptides if too small a MET is used. With the most commonly used algorithms, SEQUEST and Mascot, too large a MET may lead to an increase in candidate peptides for each MS/MS spectrum, reducing the discriminating power of the database scores and impeding validation of results.18,19 On the other hand, if a statistical MET smaller than the database search MET could be established for each MS/MS data set, the obviously random matches (those with a large mass error) could be discarded immediately. With fewer incorrect matches in the results, the power of the database search scores to discriminate
research articles
Zhang et al.
between correct and incorrect assignments can be improved. Thus, it is beneficial to filter the database search results using the statistical MET and then validate the results using database scores.
The systematic error of FTICR-MS is related to both the measured m/z and the ion space charge effect; this has been observed experimentally and inferred from the principle of FT-ICR mass measurement by many researchers,20-31 and some recalibration laws have been proposed accordingly.10,20,22,23,29,30 Statistical methods-based recalibration can reduce the distribution range of the m/z error and determine the tolerance for random error.29 Wilhelm et al.10 took into account the cyclotron frequency of ion detection and the total ion current (TIC) of the MS spectrum to recalibrate the FT-MS data. Here, a new factor was included in the recalibration formula, based on the observation that the m/z error drifted with retention time (RT). The recalibration formula used in this study is:

$$ m/z_t = \frac{a}{f} + \frac{b}{f^2} + \frac{c \cdot \mathrm{TIC}}{f^2} + d \cdot t + e \qquad (1) $$
where m/z_t is the calculated mass-to-charge ratio; f is the frequency of the ICR cell for measurement of m/z_t; TIC is the total ion current of the MS scan, which accounts for the space charge effect; t is the retention time of the MS scan; a, b, c, and d are the calibration coefficients; and e is a constant that represents the drift of the m/z error. In practice, 1/f is directly proportional to the observed m/z value (m/z_e), while TIC and t can be obtained from the *.RAW file.
In this study, we investigated the parent ion mass error distribution of the LTQ-FT mass spectrometer and applied a recalibration procedure to determine the statistical MET of different data sets. Recalibration was based on robust multivariate linear regression (RLS), which effectively suppressed the effect of noise in the observations and performed better than ordinary linear regression (OLS) on one of the data sets that showed interference from noise. It was found that the standard deviation (SD) of the relative mass error could be reduced after recalibration and that the shape of the mass error distribution curve was closer to that of the ideal Gaussian distribution curve.
On the basis of data recalibration and a robust mass error distribution-fitting algorithm (RDF), a new strategy (LDSF: large MET database search followed by small MET filtration) was proposed to address the determination of the database search MET and validation of the database search results. First, a relatively large MET was specified and the database search results were prefiltered using the cutoff method based on the target-decoy database strategy. The high-confidence results extracted (false discovery rate, FDR ≤ 0.01) were then used to acquire the mass error recalibration coefficients and estimate the statistical MET (3σ; σ is the SD of the mass error distribution) using the RDF algorithm. Finally, the mass errors were recalibrated and all the database search results were filtered using the statistical MET.
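Because Formula 1 is linear in its coefficients and 1/f is proportional to the observed m/z, applying a fitted calibration reduces to a matrix-vector product. The following is a minimal sketch of that step, assuming NumPy; the function names `design_matrix` and `recalibrate` are hypothetical and are not taken from FTDR, and the coefficients would have to come from a fit on high-confidence matches.

```python
import numpy as np

def design_matrix(mz_obs, tic, rt):
    """Build one row per precursor: (m/z_e, (m/z_e)^2, TIC*(m/z_e)^2, t, 1).

    This mirrors Formula 1 with 1/f replaced by the observed m/z,
    to which it is proportional.
    """
    mz_obs = np.asarray(mz_obs, dtype=float)
    tic = np.asarray(tic, dtype=float)
    rt = np.asarray(rt, dtype=float)
    return np.column_stack(
        [mz_obs, mz_obs**2, tic * mz_obs**2, rt, np.ones_like(mz_obs)]
    )

def recalibrate(mz_obs, tic, rt, coeffs):
    """Return calibrated m/z values for coefficients (a, b, c, d, e)."""
    return design_matrix(mz_obs, tic, rt) @ np.asarray(coeffs, dtype=float)
```

With the identity coefficients (1, 0, 0, 0, 0) the observed values are returned unchanged, which is a convenient sanity check before a fitted model is applied.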
When the LDSF strategy is applied to a control data set, approximately 14% additional matches can be confirmed compared with the strategy that recalibrates the data first and then uses a small MET in the database search. LDSF can effectively discard false-positive matches by “luring” these MS/MS spectra to match with the peptides outside of the statistical MET, improving the sensitivity of the cutoff-based method. Additionally, an automated tool (FTDR; Fourier-transform data recalibration) was used to perform the data recalibration. FTDR is a multithreaded program and can work in batch mode, which is useful for high-throughput shotgun proteomics, where large amounts of data are produced. FTDR can
Journal of Proteome Research • Vol. 8, No. 2, 2009
process database search results in both the SEQUEST32 *.out and Mascot33 *.htm formats and can use Thermo *.RAW format, mzXML, or mzData as the raw data source. FTDR requires Xcalibur to provide the COM interface for reading MS/MS RAW files when using the Thermo *.RAW file format. FTDR is implemented on the GNU Scientific Library (GSL; http://www.gnu.org/software/gsl/). The program and a set of test data can be downloaded free from http://www.hupo.org.cn/soft/ms/FTDR.
Materials and Methods BSA Data Set. A total of 100 µg of bovine serum albumin (BSA; Sigma, St. Louis, MO) was dissolved in 100 µL of 50 mM NH4HCO3, pH 8.5, and denatured at 95 °C for 5 min. After cooling to room temperature, the protein solution was added to 10 mM DTT and maintained at 37 °C for 1 h to reduce disulfide bonds. Then, 25 mmol/L IAA was added and the alkylation reaction was allowed to proceed for 2 h at room temperature in the dark. Samples were digested with trypsin and then analyzed on an LTQ-FT mass spectrometer (Thermo Electron, San Jose, CA) coupled with an Agilent 1100 nanoflow liquid chromatography system as previously described.34 The database search was performed using SEQUEST (TurboSEQUEST v.27). The fixed modification carboxyamido-methylation on Cys (57.024 Da) and variable modification oxidation (15.999 Da) on Met were set. Tryptic cleavage at Lys or Arg only was selected and up to two miscleavages were allowed. The database used was IPI BovineV3.10 (ftp://ftp.ebi.ac.uk/ pub/databases/IPI/old/BOVIN/ipi.BOVIN.v3.10.fasta.gz). The database search of MS/MS spectra before calibration was performed with a MET of 10 ppm. Only the first-rank matches were accepted and all SEQUEST search results were validated according to the target-decoy database search strategy.35 Only matches assigned to the BSA proteins were considered possible positive results, while others were interpreted as random matches. The cutoff values of Xcorr and ∆Cn were determined using the method described by Elias et al.35 with an FDR of 0.01. Control Protein Data Set. A data set previously described by Klimek et al.36 was generated by analyzing 18 purified proteins on the LTQ-FT platform. The database was constructed using the method proposed by us in Jiyang et al.37 and was based on IPI Human 3.23 (ftp://ftp.ebi.ac.uk/pub/ databases/IPI/old/HUMAN/ipi.HUMAN.v3.23.fasta.gz), the 18 control proteins, and common contaminants. 
The raw files were searched against the modified target-decoy database using the local SEQUEST server and then using Mascot Daemon v2.1. The fragment ion mass tolerance was 1.0 Da for the SEQUEST process and 0.4 Da for the Mascot process. The precursor ion mass tolerance settings are described in detail below. The other database search parameters were the same as for the BSA data set.
Complex Sample Data Set. An unpublished LTQ-FT data set, generated from human liver, was provided by the Beijing Proteome Research Center (BPRC). Strong cation exchange (SCX) chromatography was performed on the treated protein mixtures. Forty-three fractions were analyzed by the LTQ-FT platform,34 and data from 15 of these fractions were analyzed here. The raw files were searched against the modified target-decoy database of IPI Human 3.23 using the local SEQUEST server. The fragment ion mass tolerance was set to 1.0 Da and the precursor ion mass tolerance was set to 3.0 Da and 10 ppm,
respectively, to process large- and small-MET database searches. All other database search parameters were as described above.
LDSF Strategy Workflow. A flowchart of the LDSF strategy is shown in Figure 1. First, the raw MS/MS data collected by Xcalibur are used for a database search with a large (e.g., 3.0 Da) MET. The results are then validated by the cutoff-based method and the resulting high-confidence level matches (e.g., having FDR = 0.01) are used to determine the calibration coefficients in Formula 1 using RLS. The derived calibration model is then applied to all MS/MS precursors and the statistical MET is estimated using the RDF algorithm. Finally, the database search results are filtered using the statistical MET and then using the cutoff-based method.
Figure 1. LDSF strategy. FTDR calculations are outlined in red.
In this study, both the mass error and the m/z error were involved. The term “m/z error” is used in the calibration procedure because it is directly related to the MS measurement. The mass error expressed in ppm or Da is commonly used in the database searches. The absolute mass error (Da) is not equal to the m/z error if the charge state of the parent ions is not +1. But the relative mass error (ppm) is equal to the relative m/z error, so we do not distinguish them in the following data processing. Zubarev and Mann38 had provided a comprehensive discussion of mass accuracy in proteomics, and the maximum mass deviation (MMD) allowed in a database search was used as the MET in this paper.
Validation of Database Search Results Using the Cutoff-Based Method. The cutoff-based method was used to select high-confidence matches for determining the statistical MET of the parent ions after the database search with a relatively large prespecified parent MET. An exhaustive search procedure was performed to determine the optimized Xcorr/∆Cn pair. Optimization of the values maximized the validated matches and maintained the FDR estimated by Formula 2 at a lower value than the expected (i.e., 0.01). In Formula 2, Nn and Nr are the number of target and decoy matches, respectively, which passed the filter rules: Xcorr > XC and ∆Cn > DC, where XC and DC are the cutoff values.

$$ \mathrm{FDR}_{Est} = \frac{2N_r}{N_n + N_r} \qquad (2) $$
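The exhaustive search over Xcorr/∆Cn cutoffs can be illustrated with a short sketch. It assumes NumPy, a user-supplied grid of candidate cutoffs, and a boolean decoy flag per match; the names `estimate_fdr` and `best_cutoffs` are hypothetical, and accepting an estimated FDR equal to the limit (`<=`) is an assumption made here for simplicity.

```python
import numpy as np

def estimate_fdr(n_target, n_decoy):
    """Formula 2: FDR_Est = 2*Nr / (Nn + Nr)."""
    total = n_target + n_decoy
    return 0.0 if total == 0 else 2.0 * n_decoy / total

def best_cutoffs(xcorr, delta_cn, is_decoy, xc_grid, dc_grid, max_fdr=0.01):
    """Return (XC, DC, n_target) maximizing target matches at the FDR limit.

    Returns None when no cutoff pair on the grid satisfies the limit.
    """
    xcorr = np.asarray(xcorr, dtype=float)
    delta_cn = np.asarray(delta_cn, dtype=float)
    is_decoy = np.asarray(is_decoy, dtype=bool)
    best = None
    for xc in xc_grid:
        for dc in dc_grid:
            passed = (xcorr > xc) & (delta_cn > dc)
            n_decoy = int(np.sum(passed & is_decoy))
            n_target = int(np.sum(passed & ~is_decoy))
            if estimate_fdr(n_target, n_decoy) > max_fdr:
                continue
            if best is None or n_target > best[2]:
                best = (xc, dc, n_target)
    return best
```

Ties are resolved in favor of the first cutoff pair encountered on the grid; a real implementation might prefer the stricter pair instead.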
Assuming the optimized pair of Xcorr and ∆Cn to be (XCo, DCo), the following criteria were used to filter the target database matches: (1) Xcorr > XCo; (2) ∆Cn > DCo; (3) PLen > 6, where PLen is the peptide length; (4) PNum > 20, where PNum is the number of peaks in the MS/MS spectrum.
Robust Multivariate Linear Regression Algorithm. Generally, the peptide matches after validation include a few false-positive results (typically, the FDR is ≤ 0.05). False-positive matches with large mass error (also called outliers) can significantly affect the recalibration model. The RLS algorithm can achieve more precise modeling than the OLS algorithm when the observed data set contains outliers.39 We take X = (m/z_e, (m/z_e)^2, TIC·(m/z_e)^2, t, 1)^T to represent the input vector, y = m/z_t to represent the expected value, and C = (a, b, c, d, e) to represent the parameters to be estimated; the calibration model can then be denoted as y = CX. If there are N observations in the data set (X_i, y_i), i = 1...N, OLS aims to minimize the objective function

$$ S = \sum_{i=1}^{N} r_i^2 $$

where r_i = y_i - ŷ_i is the residual for the ith data point and ŷ_i is the fitted response value. Each observation contributes to the objective function with the same weight. The outliers that bias the position of the center of the data set may significantly affect the estimate. RLS attempts to limit the influence of outliers by replacing the square of the residuals with a less rapidly increasing loss function of the residuals; this can be implemented by assigning different data points different weights in the objective function

$$ S' = \sum_{i=1}^{N} w_i r_i^2 $$

The weights are assigned using an iteratively reweighted least-squares algorithm according to the following procedure:
(1) Fit the model by weighted least-squares; initial weights are all set to one.
(2) Compute and standardize the adjusted residuals. The adjusted residuals are given by:

$$ r_{adj,i} = \frac{r_i}{\sqrt{1 - h_i}} \qquad (3) $$

where r_i is the ith least-squares residual and h_i is the ith leverage.40 This adjusts the residuals by down-weighting the high-leverage data points, which have a large effect on the least-squares fit. The standardized adjusted residuals are given by:

$$ u_i = \frac{r_{adj,i}}{K s} \qquad (4) $$

where K is a tuning constant and s is the robust estimate of the standard deviation given by MAD/0.6745, where MAD is the median absolute deviation of the residuals.41
(3) Compute the robust weights as a function of u_i. Here, a function proposed by Welsch et al.41 is used (we designate it the “Welsch function”, Formula 5), where K is equal to 2.985:

$$ w_i = e^{-u_i^2} \qquad (5) $$

(4) Perform the iteration of the fitting procedure by repeating it beginning with step (2) until the change in the estimated parameters is less than a predefined value (e.g., 10^-6).
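The four steps above can be sketched as a compact iteratively reweighted least-squares loop. This is an illustrative sketch assuming NumPy, not the FTDR implementation (which the paper states is built on GSL); the function name `robust_fit` is hypothetical, the leverages are taken from the unweighted design matrix, and the MAD is computed about the median of the raw residuals, both of which are assumptions.

```python
import numpy as np

def robust_fit(X, y, k=2.985, tol=1e-6, max_iter=50):
    """Welsch-weighted IRLS; returns (coefficients, final weights)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Leverages h_i are the diagonal of the hat matrix, obtained from a thin QR.
    q, _ = np.linalg.qr(X)
    h = np.minimum(np.sum(q**2, axis=1), 0.9999)
    w = np.ones(len(y))  # step (1): all weights start at one
    coef = None
    for _ in range(max_iter):
        sw = np.sqrt(w)
        new_coef = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        # step (4): stop when the coefficients no longer change
        if coef is not None and np.max(np.abs(new_coef - coef)) < tol:
            coef = new_coef
            break
        coef = new_coef
        r = y - X @ coef
        r_adj = r / np.sqrt(1.0 - h)                      # Formula 3
        s = np.median(np.abs(r - np.median(r))) / 0.6745  # robust scale
        if s == 0.0:                                      # exact fit, nothing to reweight
            break
        u = r_adj / (k * s)                               # Formula 4
        w = np.exp(-u**2)                                 # Formula 5
    return coef, w
```

In use, `X` would be the five-column input matrix described above and `y` the theoretical m/z values of the high-confidence matches; points that end with weights near zero are the ones treated as outliers.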
Estimation of the Statistical MET. Because the Xcorr algorithm does not use the mass error of the parent ion directly,32 it should be independent of the parent mass error; we can therefore filter the database search results by the cutoff-based method and obtain an unbiased estimation of the MET. However, we cannot exclude all the incorrect matches using the cutoff-based method, and we found that even a very small number of false positives in the resulting data set significantly affected the distribution fitting; we therefore estimated the statistical MET using a normal-distribution RDF procedure.
It is assumed that the observations filtered by the cutoff-based method (estimated FDR = 0.01) contain some random matches, with the mass error distributed evenly within the range of the database search MET. Generally, the database search MET was larger than the statistical MET to be estimated. Thus, we can describe the distribution of the mass errors as

$$ f(x) = \frac{p_{\omega_1}}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} + \frac{p_{\omega_2}}{I} \qquad (6) $$

where p_ω2 = 1 - p_ω1, with p_ω1, p_ω2 ≥ 0; µ and σ are the mean and SD of the normal distribution; I is the length of the prespecified database MET interval; and ω1 and ω2 denote the normal distribution and the uniform distribution, respectively. The parameter estimation can be carried out by the iterative maximum likelihood estimation (MLE) proposed by Richard et al.42 The iterative MLE procedure is conducted as follows:
(1) Input the observation vector X with n observations.
(2) Initialize the parameters to be estimated:

$$ I = \max(X) - \min(X), \quad \mu_0 = \frac{1}{n}\sum_{i} x_i, \quad \sigma_0 = \sqrt{\frac{1}{n}\sum_{i} (x_i - \mu_0)^2} $$

$$ p_{\omega_1} = \frac{|\{x_i : \mu_0 - 2\sigma_0 \le x_i \le \mu_0 + 2\sigma_0\}|}{n}, \quad p_{\omega_2} = 1 - p_{\omega_1} $$

where |A| denotes the number of elements in the set A.
(3) Begin iteration. Calculate the class-conditional probability densities:

$$ p(x_k|\omega_1) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{(x_k-\mu_i)^2}{2\sigma_i^2}}, \quad p(x_k|\omega_2) = \frac{1}{I} $$

and determine the posterior probabilities using the Bayesian formula:

$$ p(\omega_1|x_k) = \frac{p(x_k|\omega_1)\,p_{\omega_1}}{p(x_k|\omega_1)\,p_{\omega_1} + p(x_k|\omega_2)\,p_{\omega_2}}, \quad p(\omega_2|x_k) = \frac{p(x_k|\omega_2)\,p_{\omega_2}}{p(x_k|\omega_1)\,p_{\omega_1} + p(x_k|\omega_2)\,p_{\omega_2}} $$

(4) Adjust the parameters:

$$ \mu_{i+1} = \frac{\sum_{k=1}^{n} p(\omega_1|x_k)\,x_k}{\sum_{k=1}^{n} p(\omega_1|x_k)}, \quad \sigma_{i+1}^2 = \frac{\sum_{k=1}^{n} p(\omega_1|x_k)\,(x_k - \mu_i)^2}{\sum_{k=1}^{n} p(\omega_1|x_k)} $$

and

$$ p_{\omega_1} = \frac{1}{n}\sum_{k=1}^{n} p(\omega_1|x_k), \quad p_{\omega_2} = 1 - p_{\omega_1} $$

where i denotes the step of the iterative procedure.
(5) Repeat steps (3) and (4) until e = |µ_{i+1} - µ_i| + |σ_{i+1} - σ_i| < 10^-6 or i = 100.
Data Calibration by FTDR. FTDR is a GUI software package designed to perform data recalibration. As shown in Figure 2, FTDR provides an interface for specifying the files and paths of the database search results. MS/MS data to be calibrated, the corresponding RAW data, the output path, and thread number can be selected using the GUI. The progress bar and message box show the status of FTDR. In large-scale shotgun proteomics experiments, the protein mixtures in each treatment sample are always separated by SDS-PAGE or 2D-electrophoresis to reduce sample complexity prior to introduction into the MS/MS system and the digested peptides are always divided into several fractions before online RPLC (reversed-phase liquid chromatography) separation,43 generating many RAW files and database search result files for each experiment. To process these files one at a time would be very time-consuming; FTDR performs this function in a more efficient batch mode. The target files or paths stored in one directory can be added to the list in batch. Users can select the number of threads, allowing FTDR to multitask on the system by using several CPUs at one time. The user-defined filter criteria for the selection of high-confidence matches can also be specified in an individual pop-up box. A set of commonly used parameters can be established within a reasonable range to maintain the quality of data used to build the calibration model.
FTDR begins recalibration with selection of data files or paths. The user adds the SEQUEST output path or the Mascot *.htm files to the “database search results” list one at a time or in batch, while the corresponding MS/MS data files, RAW files (mzXML format or Raw format), or their paths must be entered in the same order in the “MS/MS data” and “Raw data” lists. After selection of the output path and the thread number, the recalibration is started. When recalibration is complete, a message box will appear and a report (“FTDR_report.txt”) will be available in the output directory to document the details of data processing. The calibration formula and the recommended mass error tolerance are documented in the report file for each task. More details about FTDR can be found in the supplied file “A user guide for FTDR” in Supporting Information.
Results and Discussion
Characterization of the Measured m/z Error Distribution. For the six individual runs of the BSA tryptic digest, the measured m/z error increased with retention time (Figure 3); we added a factor to the data recalibration model (Formula 1) that was directly proportional to the retention time of the MS scan. The high-order effect of the retention time was disregarded to simplify the model. The centers of the measured m/z error distribution were not equal to zero and the degree of drift was different in the six runs (Table 1). Because the drift was time-dependent, a constant was added to Formula 1; data calibration should be performed on each run individually. The standard deviation (SD) of m/z error in the six runs ranged from 1.3 to 1.7; the MET should be set to 5 ppm (approximately 3 times the value
Figure 2. FTDR’s graphical user interface. The GUI displays the lists of database search results, MS/MS data files, and RAW data files. The data files or paths can be added one at a time or in batch. More than 10 threads can be selected. The output path and filter criteria should be selected before starting the program. The progress bar and message box show the status of the process.
of the maximal SD in the six runs) in the database search after calibration. The sum squared error (SSE) of the fitting decreased by approximately 20.0% and the SD decreased by approximately 9.3% when the RT-related factor was added to the calibration formula for this data set. For the control protein data set, the initial mass error distribution of the validated results was asymmetric (Figure 4A). After calibration, the distribution was closer to the symmetrical normal distribution, facilitating determination of the MET. By applying the distribution fitting technique, we can determine the statistical MET (2 ppm for the control protein data set; Figure 4B), which is equal to three times the value of the SD for the normal distribution. Calibration of the SEQUEST Database Search Results. A two-step database search strategy was used to process the control protein data set. First, all raw MS/MS spectra were searched against the constructed database with a “relaxed” MET of 3.0 Da and the statistical MET was determined by RDF using records with Xcorr > 2.5 and ∆Cn > 0.2. The statistical MET was approximately 3 ppm and ranged from -1 to 5 ppm (Figure 4A). The calibration procedure was then performed using FTDR and the resulting MS/MS spectra were searched against the same database with both a large (3.0 Da) and a small (2 ppm) MET; the small MET is the statistical MET after calibration (Figure 4B). In the large MET search, 39 745 matches
were collected. After calibration of the parent mass error, only 12 442 records had a parent mass error smaller than the statistical MET (2 ppm), and of these, 11 706 MS/MS spectra matched those of the control peptides. The classification and overlap of the results of searches with large and small METs are shown in Figure 5. With the use of the target-decoy database strategy and the cutoff validation method and with 1% as the estimated FDR, we observed 10 920 validated peptides for the large MET database search; this was 14.3% more than for the small MET database search, which yielded 9550 validated assignments.
An alternative approach for the two-step database search would be to search the constructed database with a large MET of 3.0 Da only once. By calibrating the mass errors of the parent ions and then filtering the SEQUEST matches directly according to the statistical MET, the LDSF strategy can be executed with a smaller computational burden. We applied this approach to the control data set and observed 10 919 validated matches (FDR = 0.01), which was close to the result obtained with the two-step database search strategy.
Other METs can also be used to filter the database search results in the LDSF strategy. It appears that the cutoff threshold of the database search scores can “adjust” to the chosen MET to guarantee the data quality. However, use of the statistical
Figure 3. Relationship between the measured m/z error and retention time of the BSA data sets. Panels 1-6 represent the results of six independent runs. The fifth-order polynomial fitting curves (red) in each panel show that the measured m/z errors (Dm) increase with retention time (RT).
Table 1. Means and Standard Deviations (SD) of Measured m/z Error (ppm) from Six Independent Runs of the BSA Sample
run    1       2       3       4       5       6
Mean   3.5162  3.9221  3.6116  3.4714  3.3558  3.1227
SD     1.3835  1.5673  1.3385  1.2704  1.7056  1.6578
MET is recommended because this approach yields the highest number of validated peptides (Figure 6).
Most spectra matched with different peptides in the large and small MET searches are incorrect because it is assumed that the database search algorithm will always assign the correct peptide as the best match to its MS/MS spectrum, regardless of how large the range of the MET is. However, this is not true in practice because noise, limitations of the database search engine, and other factors may cause a good MS/MS spectrum to be matched with different peptides in different MET searching approaches, and one of them will be the correct match. Nevertheless, following the rule “reject all uncertainty,” we chose to discard all matches of spectra with more than one peptide in the large and small MET searches. Of the 12 345 records assigned to a single peptide in the large and small MET searches, 11 294 were matched with a peptide in the target database; of these, 10 863 matches were from the control sequences. For large MET searches, 13 777 records were matched to control peptides and 11 783 (85.53%) had a mass error of less than 2 ppm. These observations indicate that not all matches assigned to control sequences are correct because some matched with different peptides in the large and small MET searches and some MS/MS spectra were randomly matched with control peptides when there were no MET limitations.
In contrast, we found that some assigned peptides not derived from the control sequences were identified multiple times (seven were identified at least five times). For example, peptide “GHHEAEIKPLAQSHATK” was identified 50 times; the Xcorr reached 4.455 and all had a ∆Cn = 0. By inspecting the *.out files generated by SEQUEST, we observed that all of the second-rank peptides were peptide “GHHEAELKPLAQSHATK,” a peptide derived from control protein sp|P02188|MYG_HORSE. The only difference between these two peptide sequences is the seventh amino acid (L/I), a difference that results in no change in mass. Thus, validating the matches with control peptides and rescuing the correct matches caused by substitution of amino acids indistinguishable by mass (e.g., L/I and K/Q) is necessary for control data set processing.
While manually inspecting the matches generated with a large MET but having high database scores, we found that some correct matches were erroneously discarded because incorrect m/z ions had been assigned to the parent ions by the MS/MS extraction routine Extract_MSn.exe; Figure 7 shows such an example. The parent m/z provided by Extract_MSn.exe was 753.00, which is the second and highest isotopic peak in the MS scan. The online instrument control software prefers to select the ions with highest intensity to perform the MS/MS scan, and therefore, this ion and not the monoisotopic ion was assigned to the MS/MS spectrum. This spectrum would not match with the correct peptide in the small-MET database search, while the large-MET search allowed the additional isotopic peak (1.007825 Da) to be selected and the misassigned parent ion to appear as a match. To address this situation, search results are prefiltered with the statistical MET and the matches whose parent mass is larger than the established MET by the mass of one or two hydrogen atoms are kept for further validation. In this way, our LDSF strategy should prevent misassignment of monoisotopic ions.
Relationship between Database Search Scores and MET. We investigated the difference in the database search scores of MS/MS spectra searched with the large and small MET (MET = 3.0 Da and 2 ppm, respectively); the database search score distributions are shown in Figure 8.
There were 12 345 records matched with the same peptides (rank 1) in the two different MET database searches (red dots in Figure 8).
Figure 4. Distribution of the mass error of the control proteins data set before (A) and after (B) calibration. Following calibration, the distribution was closer to the normal distribution. The red line is a fit of Gaussian distribution (µ ) 0.008494, σ ) 0.640781). The blue line is a fit by the robust normal fitting algorithm (µ ) 0.00757, σ ) 0.58089, pw1 ) 0.97239), which provides a better fit than the general fitting algorithm.
Figure 5. Comparison of the database search results obtained using large and small METs.
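The robust normal fit reported in Figure 4 corresponds to the normal-plus-uniform mixture of Formula 6, estimated by the iterative MLE procedure described under “Estimation of the Statistical MET”. A minimal sketch of that procedure is given here, assuming NumPy; the function name `robust_normal_fit` is hypothetical, and capping the initial p_ω1 just below 1 is a numerical guard added for this sketch rather than part of the published procedure.

```python
import numpy as np

def robust_normal_fit(x, tol=1e-6, max_iter=100):
    """Fit a normal core plus a uniform background; returns (mu, sigma, p1)."""
    x = np.asarray(x, dtype=float)
    interval = x.max() - x.min()       # I: length of the error interval
    mu, sigma = x.mean(), x.std()      # mu_0, sigma_0
    # Initial weight of the normal component: fraction within mu +/- 2*sigma.
    p1 = min(np.mean(np.abs(x - mu) <= 2.0 * sigma), 1.0 - 1e-9)
    for _ in range(max_iter):
        # Step (3): class-conditional densities and posterior of the normal class.
        dens1 = np.exp(-(x - mu) ** 2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)
        dens2 = 1.0 / interval
        post1 = dens1 * p1 / (dens1 * p1 + dens2 * (1.0 - p1))
        # Step (4): update the parameters (sigma uses the previous mean).
        new_mu = np.sum(post1 * x) / np.sum(post1)
        new_sigma = np.sqrt(np.sum(post1 * (x - mu) ** 2) / np.sum(post1))
        p1 = np.mean(post1)
        change = abs(new_mu - mu) + abs(new_sigma - sigma)
        mu, sigma = new_mu, new_sigma
        if change < tol:               # step (5): convergence test
            break
    return mu, sigma, p1
```

The statistical MET then follows as three times the returned `sigma`; for the values in the Figure 4 caption, 3 × 0.58089 ≈ 1.74 ppm, consistent with the 2 ppm used in the text.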
The spectra assigned to different peptides tend to have a larger Xcorr score in the large-MET database search than in the small-MET database search, while the Xcorr scores of the spectra assigned to the same peptides are similar (Figure 8, Xcorr panel). In contrast, most of the matches assigned to the same peptides have a larger ∆Cn in the small-MET database search than in the large-MET database search (Figure 8, ∆Cn panel). The Sp shows a trend similar to that of Xcorr (Figure 8, Sp panel). This indicates that, as the MET increases, Xcorr, which is an indicator of the quality of the assignment, increases for the random matches and remains unchanged for the correct ones, whereas ∆Cn, which is an indicator of the significance of a match, decreases because more candidate peptides are provided for each spectrum. We can therefore expect more overlap of the database search scores between the correct and incorrect matches in a “relaxed” MET search than in a “tight” MET search. Consequently, use of a larger MET alone does not improve validation of the search results unless the statistical MET is first used to filter the database search results. Figure 9 shows the accepted regions on the Xcorr-∆Cn plane of the control data set results. There are significantly fewer negative results in the large-MET search plane after calibration (blue dots in right panel) because most of the false-positive matches are assigned to peptides outside of the statistical MET and are eliminated by the MET filtration. The database search scores (Xcorr and ∆Cn) thus become more powerful in distinguishing the peptide identifications, which improves the sensitivity of the cutoff-based method.
Figure 6. Validated numbers of peptides and corresponding actual FPRs of the results of the SEQUEST search of the control protein data set using prefiltration with different METs. The cutoff-based method can adapt the threshold scores to different filtered METs to maintain a fixed FDR. However, the highest number of validated matches is observed when the statistical MET is used for prefiltering (see the black square).
Journal of Proteome Research • Vol. 8, No. 2, 2009 855
Figure 7. An example of a misassigned parent ion mass to charge ratio (m/z) and the corresponding MS/MS matches.
Figure 8. Comparison of database search scores for MS/MS spectra searched using large and small METs. There were 12345 records matched with the same peptides in the two MET database searches (red). Spectra matched to different peptides with different MET are shown in blue.
Another interesting result is that most of the matches have a mass error distributed around multiples of 1.0 Da even when the MET is large (Figure 8, dm panel).
Application of the LDSF Strategy to Mascot Database Search Results. Mascot, another widely used proteomics database search tool, uses a probability scoring mechanism for peptide identification.33 Mascot provides a probability threshold (e.g., 0.01 or 0.05) that bounds the probability that an identified match occurred by chance alone, giving users a defined level of confidence in the output results. Mascot may be more appropriate than SEQUEST for interpreting high-accuracy MS/MS data, such as QSTAR spectrum data.35 We applied the LDSF strategy to the control protein data set searched using Mascot. The two-step database search process was performed as described above; the numbers of identified peptides and the corresponding actual false positive rates (actual FPRs) are shown in Table 2. To make a comprehensive comparison, Table 2 also lists the results of a database search using a large MET without parent ion calibration as well as searches using the commonly used MET setting of 10 ppm, with and without calibration. The data set D3, which was searched with a large parent ion MET after calibration of the input MS/MS data, showed the highest number of validated peptides at 1% expected FDR.
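As a worked check on how the actual FPR values in Table 2 relate to the total/correct counts, the following minimal sketch assumes that the actual FPR is the share of validated matches that are incorrect, (total - correct) / total. This definition is an assumption, not stated in the text, although it reproduces the D1 entries of Table 2; the function name is hypothetical.

```python
def actual_fpr_percent(total, correct):
    """Percentage of validated matches that are not correct.

    Assumed definition: (total - correct) / total * 100.
    """
    return 100.0 * (total - correct) / total


# D1 rows of Table 2 at 5% expected FDR (total/correct counts from the table).
for label, total, correct in [("D1, cutoff method", 8798, 8409),
                              ("D1, Mascot", 5106, 4969)]:
    print(label, round(actual_fpr_percent(total, correct), 2))
# D1, cutoff method 4.42
# D1, Mascot 2.68
```

Both values agree with the corresponding actual FPR entries reported in Table 2.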
Figure 9. The preserved regions of the database search results (+2 charge) of the control protein data set using mass calibration with small-MET (left) and large-MET (right) strategies. The hatched areas indicate the accepted identified assignments under 0.01 and 0.05 FDR, respectively. Positive matches are indicated in red; negative matches are indicated in blue.

Table 2. Performance of the LDSF Strategy on Mascot Database Searches of the Control Protein Data Set

| data set (a) | method (b) | expected FDR = 5%: total/correct | expected FDR = 5%: actual FPR % | expected FDR = 1%: total/correct | expected FDR = 1%: actual FPR % |
|---|---|---|---|---|---|
| D1 | Cutoff method | 8798/8409 | 4.42 | 5677/5512 | 2.91 |
| D1 | Mascot | 5106/4969 | 2.68 | 3726/3617 | 2.93 |
| D2 | Cutoff method | 11061/10589 | 2.27 | 8337/8079 | 3.09 |
| D2 | Mascot | 9025/8737 | 2.97 | 7248/7035 | 2.94 |
| D3 | Cutoff method | -- (c) | -- (c) | 10046/9729 | 3.16 |
| D3 | Mascot | 4840/4707 | 2.83 | 3552/3469 | 2.69 |
| D4 | Cutoff method | 11040/10578 | 4.18 | 8317/8061 | 3.08 |
| D4 | Mascot | 8980/8688 | 3.25 | 7235/7023 | 2.93 |
| D5 | Cutoff method | 11574/11066 | 4.39 | 8396/8137 | 3.08 |
| D5 | Mascot | 10417/10065 | 3.38 | 8651/8381 | 3.05 |

(a) Data set: D1 (3.0 Da, 0.4 Da) without calibration; D2 (10 ppm, 0.4 Da) without calibration; D3 (3.0 Da, 0.4 Da) with calibration; D4 (10 ppm, 0.4 Da) with calibration; D5 (2 ppm, 0.4 Da) with calibration. (b) Method: Mascot, validating the results with Mascot identity threshold; cutoff method, validating the results based on cutoff filtering method. (c) --: indicates the estimated FDR of the data set is less than the expected FDR.

Table 3. Performance of the LDSF Strategy on a Complex Data Set

| data set (a) | charge | expected FDR = 5%: number validated | expected FDR = 5%: cutoff threshold | expected FDR = 1%: number validated | expected FDR = 1%: cutoff threshold |
|---|---|---|---|---|---|
| D1 | +1 | 1516 | Xcorr > 0.665, ∆Cn > 0.100 | 1112 | Xcorr > 0.765, ∆Cn > 0.197 |
| D1 | +2 | 29576 | Xcorr > 0.000, ∆Cn > 0.098 | 26848 | Xcorr > 0.000, ∆Cn > 0.184 |
| D1 | +3 | 1628 | Xcorr > 2.820, ∆Cn > 0.090 | 1434 | Xcorr > 3.030, ∆Cn > 0.117 |
| D2 | +1 | 1278 | Xcorr > 1.300, ∆Cn > 0.240 | 1071 | Xcorr > 1.690, ∆Cn > 0.237 |
| D2 | +2 | 20313 | Xcorr > 1.880, ∆Cn > 0.270 | 18101 | Xcorr > 2.180, ∆Cn > 0.332 |
| D2 | +3 | 643 | Xcorr > 2.995, ∆Cn > 0.000 | 603 | Xcorr > 2.990, ∆Cn > 0.334 |

(a) Data set: D1 (3.0 Da, 1.0 Da) with calibration; D2 (10 ppm, 1.0 Da) with calibration.

All “hits” in data set D3 could be considered valid at an expected FDR of 5% because the data set had been prefiltered with the statistical MET, giving an estimated FDR of 3.56% (less than 5%; approximately 12 420 validated matches in all). As Table 2 shows, Mascot’s identity threshold may offer high-quality results with a relatively low actual FPR at the cost of some loss of sensitivity; this suggests that Mascot’s identity threshold would result in a relatively high negative predictive value.44 Consequently, the total number of correct matches validated by Mascot’s identity threshold was much lower than that of the filtering method in all data sets (Table 2).

Application of LDSF Strategy to a Complex Sample Data Set. To evaluate how well the LDSF strategy generalizes to complex data sets, we tested it on a human sample data set using the two-step search strategy. The raw MS/MS spectra were searched against the target-decoy database with a “relaxed” MET of 3.0 Da, and the statistical MET was determined by RDF using the high-confidence records. The mass error spanned approximately 30 ppm, ranging from -10 to 20 ppm (Supplementary Figure S1 in Supporting Information); the center of the range was about 6 ppm. After calibration by FTDR, the statistical MET decreased to 10 ppm (Supplementary Figure S1 in Supporting Information), a reduction of approximately 33%. We then calibrated the parent ion error, filtered the search results with the statistical MET, and compared the validated hits with those of the small-MET search. The numbers of identified peptides and the corresponding FDRs are shown in Table 3. Following input MS/MS calibration, the large-MET database search generated 32 720 and 29 394 matches at 5% and 1% expected FDR, respectively (approximately 47% and 49% more matches, respectively, than the 22 234 and 19 775 matches of the small-MET search). Far fewer random matches (blue dots in Supplementary Figure S2 in Supporting Information) appear in the results of the large-MET database search strategy, because they find a better match in the larger search space and are then rejected by the statistical MET filtration. Consequently, the threshold scores are lower and more positive results are preserved. These results are consistent with those observed for the control data sets.
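The second step of the LDSF strategy, filtering large-MET search results with a statistical MET estimated from high-confidence matches, can be sketched as follows. This is a minimal illustration under stated assumptions, not the FTDR implementation: the window is taken as mean ± k·SD of the high-confidence mass errors with k = 3, and the function names, the `error_ppm` field, and the example values are all hypothetical.

```python
from statistics import mean, stdev


def statistical_met(high_conf_errors_ppm, k=3.0):
    """Return an assumed (low, high) ppm window: mean +/- k * SD."""
    mu = mean(high_conf_errors_ppm)
    sd = stdev(high_conf_errors_ppm)
    return mu - k * sd, mu + k * sd


def filter_by_met(matches, window):
    """Keep only matches whose precursor mass error lies inside the window."""
    low, high = window
    return [m for m in matches if low <= m["error_ppm"] <= high]


# Hypothetical mass errors (ppm) of high-confidence matches after calibration.
high_conf = [-1.0, 0.0, 1.0, 0.5, -0.5]
window = statistical_met(high_conf)

# Hypothetical large-MET search results; the last entry is roughly 1 Da off.
matches = [
    {"peptide": "PEPTIDEA", "error_ppm": 0.3},
    {"peptide": "PEPTIDEB", "error_ppm": -2.0},
    {"peptide": "PEPTIDEC", "error_ppm": 5.0},
    {"peptide": "PEPTIDED", "error_ppm": 1000.0},
]
kept = filter_by_met(matches, window)
print([m["peptide"] for m in kept])
```

With these illustrative values, only the two matches inside the window survive; the score cutoffs (Xcorr, ∆Cn) would then be chosen on the filtered set.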
Conclusion
In this study, we conducted a comprehensive investigation of the distribution of precursor ion mass error for the LTQ-FT platform and applied the LDSF strategy to recalibrate the MS/MS data and improve peptide identification. On the basis of the observation, made during six runs of the BSA data set, that the m/z error increased with retention time and that the center of the error range was not zero, the original calibration formula was modified by the addition of a factor related to retention time and by the inclusion of a constant.
An automatic GUI software tool, FTDR, was developed for the recalibration of LTQ-FT MS data. FTDR will also be useful for large-scale shotgun proteomics because it can be operated in batch mode and can process multiple tasks simultaneously by employing more than one CPU.
When MS/MS spectra were searched against the same database using different METs, high-confidence database search matches remained the same, while random matches tended to improve by matching spectra with the best peptides. Thus, the LDSF strategy adopted a “relaxed” MET database search strategy and calculated the statistical MET on prefiltered, high-confidence results using FTDR; the statistical MET was then used to filter the calibrated MS data. These findings suggest that database search result validation might be further refined using multiple searches following improvements elsewhere in the system, such as reduction of noise and use of different enzyme specifications to modify the properties of the protein and peptide spectra.
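To illustrate the idea of a recalibration that includes a retention-time term and a constant, the sketch below fits a simple linear model, error_ppm ≈ a·m/z + b·RT + c, by ordinary least squares on high-confidence matches and subtracts the fitted error. This linear form is an assumption chosen for illustration only; it is not the calibration formula used by FTDR, and all function names and data values are hypothetical.

```python
def solve_linear(a, b):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_error_model(mz, rt, err_ppm):
    """Least-squares fit of err_ppm ~ a*mz + b*rt + c via the normal equations."""
    rows = [(m, t, 1.0) for m, t in zip(mz, rt)]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, err_ppm)) for i in range(3)]
    return solve_linear(ata, aty)


def recalibrate(mz, rt, err_ppm, coeffs):
    """Subtract the modeled systematic error from each observed error."""
    a, b, c = coeffs
    return [e - (a * m + b * t + c) for m, t, e in zip(mz, rt, err_ppm)]


# Synthetic example: error drifts with retention time and has a nonzero offset.
mz = [400.0, 600.0, 800.0, 1000.0, 1200.0, 500.0]
rt = [10.0, 30.0, 20.0, 50.0, 40.0, 60.0]
err = [0.002 * m + 0.05 * t + 3.0 for m, t in zip(mz, rt)]

coeffs = fit_error_model(mz, rt, err)
residuals = recalibrate(mz, rt, err, coeffs)
print([round(c, 4) for c in coeffs])
```

On real data the residuals would not vanish; their spread is what a statistical MET would be estimated from.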
Acknowledgment. We thank Dr. Jianqi Li from Beijing Proteome Research Center for his useful suggestion about the software development. This work is supported by Chinese Ministry of Science and Technology (2006CB910803, 2006CB910706, 2006AA02A312, 2006AA02Z334), Creative Research Group Science Foundation of China (30621063), and Beijing Municipal Key Research Program (H030230280590).
Supporting Information Available: Distribution of the mass error of the complex data set, the preserved regions of the database search results (+2 charge) of the complex data set using mass calibration, and a user guide for FTDR. This material is available free of charge via the Internet at http://pubs.acs.org.
References
(1) Domon, B.; Aebersold, R. Mass spectrometry and protein analysis. Science 2006, 312, 212–217.
(2) Adachi, J.; Kumar, C.; Zhang, Y.; Olsen, J. V.; Mann, M. The human urinary proteome contains more than 1500 proteins, including a large proportion of membrane proteins. Genome Biol. 2006, 7, R80.
(3) Rifai, N.; Gillette, M. A.; Carr, S. A. Protein biomarker discovery and validation: the long and uncertain path to clinical utility. Nat. Biotechnol. 2006, 24, 971–983.
(4) Faca, V.; Coram, M.; Phanstiel, D.; Glukhova, V.; Zhang, Q.; Fitzgibbon, M.; McIntosh, M.; Hanash, S. Quantitative analysis of acrylamide labeled serum proteins by LC-MS/MS. J. Proteome Res. 2006, 5, 2009–2018.
(5) Andreev, V. P.; Li, L.; Rejtar, T.; Li, Q.; Ferry, J. G.; Karger, B. L. New algorithm for 15N/14N quantitation with LC-ESI-MS using an LTQ-FT mass spectrometer. J. Proteome Res. 2006, 5, 2039–2045.
(6) Everley, P. A.; Bakalarski, C. E.; Elias, J. E.; Waghorne, C. G.; Beausoleil, S. A.; Gerber, S. A.; Faherty, B. K.; Zetter, B. R.; Gygi, S. P. Enhanced analysis of metastatic prostate cancer using stable isotopes and high mass accuracy instrumentation. J. Proteome Res. 2006, 5, 1224–1231.
(7) Meng, F.; Forbes, A. J.; Miller, L. M.; Kelleher, N. L. Detection and localization of protein modifications by high resolution tandem mass spectrometry. Mass Spectrom. Rev. 2005, 24, 126–134.
(8) Norbeck, A. D.; Monroe, M. E.; Adkins, J. N.; Anderson, K. K.; Daly, D. S.; Smith, R. D. The utility of accurate mass and LC elution time information in the analysis of complex proteomes. J. Am. Soc. Mass Spectrom. 2005, 16, 1239–1249.
(9) Wang, G.; Wu, W. W.; Zeng, W.; Chou, C. L.; Shen, R. F. Label-free protein quantification using LC-coupled ion trap or FT mass spectrometry: reproducibility, linearity, and application with complex proteomes. J. Proteome Res. 2006, 5, 1214–1223.
(10) Haas, W.; Faherty, B. K.; Gerber, S. A.; Elias, J. E.; Beausoleil, S. A.; Bakalarski, C. E.; Li, X.; Villen, J.; Gygi, S. P. Optimization and use of peptide mass measurement accuracy in shotgun proteomics. Mol. Cell. Proteomics 2006, 5, 1326–1337.
(11) Pilch, B.; Mann, M. Large-scale and high-confidence proteomic analysis of human seminal plasma. Genome Biol. 2006, 7, R40.
(12) Res, J. P. Differential analysis of membrane proteins in mouse fore- and hindbrain using a label-free approach. J. Proteome Res. 2006, 5, 2701–2710.
(13) de Souza, G. A.; Godoy, L. M.; Mann, M. Identification of 491 proteins in the tear fluid proteome reveals a large number of proteases and protease inhibitors. Genome Biol. 2006, 7, R72.
(14) Krijgsveld, J.; Gauci, S.; Dormeyer, W.; Heck, A. J. In-gel isoelectric focusing of peptides as a tool for improved protein identification. J. Proteome Res. 2006, 5, 1721–1730.
(15) Schroeder, M. J.; Webb, D. J.; Shabanowitz, J.; Horwitz, A. F.; Hunt, D. F. Methods for the detection of paxillin post-translational modifications and interacting proteins by mass spectrometry. J. Proteome Res. 2005, 4, 1832–1841.
(16) Dieguez-Acuna, F. J.; Gerber, S. A.; Kodama, S.; Elias, J. E.; Beausoleil, S. A.; Faustman, D.; Gygi, S. P. Characterization of mouse spleen cells by subtractive proteomics. Mol. Cell. Proteomics 2005, 4, 1459–1470.
(17) Denison, C.; Rudner, A. D.; Gerber, S. A.; Bakalarski, C. E.; Moazed, D.; Gygi, S. P. A proteomic strategy for gaining insights into protein sumoylation in yeast. Mol. Cell. Proteomics 2005, 4, 246–254.
(18) Green, M. K.; Johnston, M. V.; Larsen, B. S. Mass accuracy and sequence requirements for protein database searching. Anal. Biochem. 1999, 275, 39–46.
(19) Sleno, L.; Volmer, D. A.; Marshall, A. G. Assigning product ions from complex MS/MS spectra: The importance of mass uncertainty and resolving power. J. Am. Soc. Mass Spectrom. 2005, 16, 183–198.
(20) Ledford Jr, E. B.; Rempel, D. L.; Gross, M. L. Space charge effects in Fourier transform mass spectrometry. Mass calibration. Anal. Chem. 1984, 56, 2744–2748.
(21) Masselon, C.; Tolmachev, A. V.; Anderson, G. A.; Harkewicz, R.; Smith, R. D. Mass measurement errors caused by “local” frequency perturbations in FTICR mass spectrometry. J. Am. Soc. Mass Spectrom. 2002, 13, 99–106.
(22) Muddiman, D. C.; Oberg, A. L. Statistical evaluation of internal and external mass calibration laws utilized in Fourier transform ion cyclotron resonance mass spectrometry. Anal. Chem. 2005, 77, 2406–2414.
(23) Palmblad, M.; Bindschedler, L. V.; Gibson, T. M.; Cramer, R. Automatic internal calibration in liquid chromatography/Fourier transform ion cyclotron resonance mass spectrometry of protein digests. Rapid Commun. Mass Spectrom. 2006, 20, 3076–3080.
(24) Wu, S.; Kaiser, N. K.; Meng, D.; Anderson, G. A.; Zhang, K.; Bruce, J. E. Increased protein identification capabilities through novel tandem MS calibration strategies. J. Proteome Res. 2005, 4, 1434–1441.
(25) Kaiser, N. K.; Anderson, G. A.; Bruce, J. E. Improved mass accuracy for tandem mass spectrometry. J. Am. Soc. Mass Spectrom. 2005, 16, 463–470.
(26) Bruce, J. E.; Anderson, G. A.; Brands, M. D.; Pasa-Tolic, L.; Smith, R. D. Obtaining more accurate Fourier transform ion cyclotron resonance mass measurements without internal standards using multiply charged ions. J. Am. Soc. Mass Spectrom. 2000, 11, 416–421.
(27) Duan, L.; Chan, T. W. D. A modified internal lock-mass method for calibration of the product ions derived from sustained off-resonance irradiation collision-induced dissociation using a Fourier transform mass spectrometer. Rapid Commun. Mass Spectrom. 2004, 18, 1286–1294.
(28) Belov, M. E.; Zhang, R.; Strittmatter, E. F.; Prior, D. C.; Tang, K.; Smith, R. D. Automated gain control and internal calibration with external ion accumulation capillary liquid chromatography-electrospray ionization Fourier transform ion cyclotron resonance. Anal. Chem. 2003, 75, 4195–4205.
(29) Yanofsky, C. M.; Bell, A. W.; Lesimple, S.; Morales, F.; Lam, T. K. T.; Blakney, G. T.; Marshall, A. G.; Carrillo, B.; Lekpor, K.; Boismenu, D. Multicomponent internal recalibration of an LC-FTICR-MS analysis employing a partially characterized complex peptide mixture: systematic and random errors. Anal. Chem. 2005, 77, 7246–7254.
(30) Tolmachev, A. V.; Monroe, M. E.; Jaitly, N.; Petyuk, V. A.; Adkins, J. N.; Smith, R. D. Mass measurement accuracy in analyses of highly complex mixtures based upon multidimensional recalibration. Anal. Chem. 2006, 78, 8374–8385.
(31) Kruppa, G.; Schnier, P. D.; Tabei, K.; Van Orden, S.; Siegel, M. M. Multiple ion isolation applications in FT-ICR MS: exact-mass MSn internal calibration and purification/interrogation of protein-drug complexes. Anal. Chem. 2002, 74, 3877–3886.
(32) Eng, J. K.; McCormack, A. L.; Yates, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5, 976–989.
(33) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551–3567.
(34) Jiyang, Z.; Jianqi, L.; Xin, L.; Hongwei, X.; Yunping, Z.; Fuchu, H. A nonparametric model for quality control of database search results in shotgun proteomics. BMC Bioinf. 2008, 9, 29.
(35) Elias, J. E.; Haas, W.; Faherty, B. K.; Gygi, S. P. Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations. Nat. Methods 2005, 2, 667–675.
(36) Klimek, J.; Eddes, J. S.; Hohmann, L.; Jackson, J.; Peterson, A.; Letarte, S.; Gafken, P. R.; Katz, J. E.; Mallick, P.; Lee, H. The standard protein mix database: a diverse data set to assist in the production of improved peptide and protein identification software tools. J. Proteome Res. 2008, 7, 96–103.
(37) Jiyang, Z.; Jianqi, L.; Xin, L.; Hongwei, X.; Yunping, Z.; Fuchu, H. A new strategy to filter out false positive identifications of peptides in SEQUEST database search results. Proteomics 2007, 7, 4036–4044.
(38) Zubarev, R.; Mann, M. On the proper use of mass accuracy in proteomics. Mol. Cell. Proteomics 2007, 6, 377.
(39) Levenberg, K. A method for the solution of certain problems in least squares. Q. Appl. Math. 1944, 2, 164–168.
(40) Goodall, C. R. Computation using the QR decomposition. In Handbook in Statistics, V. 9. Statistical Computing; Rao, C. R., Ed.; Elsevier: Amsterdam, 1993.
(41) Holland, P. W.; Welsch, R. E. Robust regression using iteratively reweighted least-squares. Commun. Stat: Theory Methods 1977, 6, 813–827.
(42) Duda, R. O.; Hart, P. E.; Stork, D. G. Pattern Classification, Second ed.; Wiley-Interscience: Hoboken, NJ, 2001; Chapter 10, pp 3-13.
(43) Issaq, H. J. Application of separation technologies to proteomics research. Adv. Protein Chem. 2003, 65, 249–269.
(44) Rudnick, P. A.; Wang, Y.; Evans, E.; Lee, C. S.; Balgley, B. M. Large-scale analysis of MASCOT results using a Mass Accuracy-based THreshold (MATH) effectively improves data interpretation. J. Proteome Res. 2005, 4, 1353–1360.
PR8005588