
Anal. Chem. 2003, 75, 6314-6326

A Universal Denoising and Peak Picking Algorithm for LC-MS Based on Matched Filtration in the Chromatographic Time Domain

Victor P. Andreev, Tomas Rejtar, Hsuan-Shen Chen, Eugene V. Moskovets, Alexander R. Ivanov, and Barry L. Karger*

Barnett Institute and Department of Chemistry, Northeastern University, Boston, Massachusetts 02115

A new denoising and peak picking algorithm (MEND, matched filtration with experimental noise determination) for analysis of LC-MS data is described. The algorithm minimizes both random and chemical noise in order to determine MS peaks corresponding to sample components. Noise characteristics in the data set are experimentally determined and used for efficient denoising. MEND is shown to enable low-intensity peaks to be detected, thus providing additional useful information for sample analysis. The process of denoising, performed in the chromatographic time domain, does not distort peak shapes in the m/z domain, allowing accurate determination of MS peak centroids, including those of low-intensity peaks. MEND has been applied to denoising of LC-MALDI-TOF-MS and LC-ESI-TOF-MS data for tryptic digests of protein mixtures. MEND is shown to suppress chemical and random noise and baseline fluctuations, as well as filter out false peaks originating from the matrix (MALDI) or mobile phase (ESI). In addition, MEND is shown to be effective for protein expression analysis by allowing selection of a large number of differentially expressed ICAT pairs, due to increased signal-to-noise ratio and mass accuracy.

LC-MS, using ESI or MALDI, has become a standard method for analysis of complex biological samples consisting of large numbers of components, e.g., proteome digests (ref 1). For identification of the species analyzed by LC-MS, a list of m/z values corresponding to the centroids of MS peaks must be generated. This procedure, called peak picking, is based on various criteria, the most common being peak intensity, signal-to-noise ratio (S/N), or both. Noise in LC-MS is of two types: random and chemical. In MALDI-MS, chemical noise results mainly from matrix clusters (ref 2) and in ESI-MS from mobile-phase impurities (ref 3). Noise can cause either false negative or false positive identifications of sample components by masking or mimicking the signal. In addition, noise can reduce mass accuracy by shifting centroids of MS peaks.

* Corresponding author. E-mail: [email protected].
(1) Yates, J. R., III. J. Mass Spectrom. 1998, 33, 1-19.
(2) Krutchinsky, A. N.; Chait, B. T. J. Am. Soc. Mass Spectrom. 2001, 13, 129-134.
(3) Windig, W.; Phalp, J. M.; Payne, A. W. Anal. Chem. 1996, 68, 3602-3606.

6314 Analytical Chemistry, Vol. 75, No. 22, November 15, 2003

Chemical noise is more difficult to remove than random noise because it has a pattern in the m/z domain similar to that of the signal. Unlike random noise, chemical noise cannot be minimized by averaging or increasing the number of subspectra (e.g., increasing the number of shots in MALDI-MS) added to yield one spectrum. As shown in this paper, it is possible to minimize chemical noise in LC-MS by exploiting the difference between the pattern of this noise and that of the signal in the chromatographic time domain.

The complexity of the composition and the broad range of concentrations in real biological samples require the application of powerful algorithms for denoising of LC-MS data in order to detect low-abundant component peaks. Accurate detection of low-level components can be important for various applications. For example, in proteomics, such detection could aid in identification of low-abundant proteins through peptide mass fingerprinting and could also be useful when identification is based on MS/MS analysis. In addition, in protein expression analysis, peaks of low intensity can provide important information in selection of MS/MS candidates. For example, in the case of the analysis of isotope coded affinity tag (ICAT) labeled samples (ref 4), the ability to detect the lower intensity peak of the pair can increase the number of differentially expressed ICAT pairs selected, with subsequent MS/MS performed on the higher intensity member of the pair. In this and other applications, it is obviously important that the data processing procedure not distort the shape of the MS peaks or compromise the mass accuracy.

Several papers have introduced approaches for denoising and baseline subtraction in LC-MS, taking advantage of the two-dimensional nature of the data, e.g., CODA (ref 3), sequential paired covariance (SPC) (refs 5, 6), and a windowed mass selection method (ref 7). Possible distortion of MS peak shapes by use of SPC was discussed in ref 7.
Obviously, such distortion could shift the centroid of the peak, compromise mass accuracy, and ultimately lead to false positive and negative identifications. More recently, an algorithm based on cross-correlation with the second derivative of the Gaussian in the chromatographic time domain was presented for denoising of LC-ESI-MS data (ref 8). The approach was shown to improve S/N and remove low-frequency oscillations of the baseline, but at the price of some distortion of mass spectra.

(4) Gygi, S. P.; Rist, B.; Gerber, S. A.; Turecek, F.; Gelb, M. H.; Aebersold, R. Nat. Biotechnol. 1999, 17, 994.
(5) Muddiman, D. C.; Rockwood, A. L.; Gao, Q.; Severs, J. C.; Udseth, H. R.; Smith, R. D. Anal. Chem. 1995, 67, 4371-4375.
(6) Muddiman, D. C.; Huang, B. M.; Anderson, G. A.; Rockwood, A. L.; Hofstadler, S. A.; Weir-Lipton, M. S.; Proctor, A.; Wu, Q.; Smith, R. D. J. Chromatogr., A 1997, 771, 1-7.
(7) Fleming, C. M.; Kowalski, B. R.; Apffel, A.; Hancock, W. S. J. Chromatogr., A 1999, 849, 71-85.

10.1021/ac0301806 CCC: $25.00 © 2003 American Chemical Society. Published on Web 10/21/2003

The high accuracy determination of MS peak centroids is necessary for reliable identification of sample components. The importance of high mass accuracy for protein identification by peptide mass fingerprinting is well known (refs 9, 10). The knowledge of exact precursor ion mass is also beneficial for increased confidence in protein identification through MS/MS analysis (ref 11). Thus, an important goal is to develop a denoising and peak picking algorithm that enables low-intensity and low-S/N peaks to be accurately determined.

The present paper introduces a new denoising and peak picking algorithm: matched filtration with experimental noise determination (MEND). The MEND algorithm minimizes noise in the m/z domain by signal processing in the chromatographic time domain, in which the shape and width of the chromatographic peaks and the characteristics of the noise are employed. Experimental determination of noise characteristics minimizes not only random but chemical noise as well. MEND selects low-intensity MS peaks without distortion of their shapes. Comparison of MEND with other denoising routines (e.g., spectra averaging, matched filtration with assumption of white noise, cross-correlation with the second derivative of the Gaussian (ref 8)) will be presented. The algorithm will be shown to be effective for both LC-MALDI-TOF-MS and LC-ESI-TOF-MS. The denoising and peak picking power of the algorithm will be especially illustrated in the analysis of LC-MALDI-TOF-MS data sets. An improvement in S/N by a factor of 5-8, as well as the ability to minimize the detection of false positive peaks, will be demonstrated.
The importance of detecting low-intensity peaks will be illustrated by the analysis of an ICAT labeled sample where the number of detected ICAT pairs was significantly (up to 20%) increased by detection of the low-intensity labeled peptides of the differentially expressed ICAT pairs. The application of the MEND algorithm for denoising of LC-ESI-MS data will also be presented.

(8) Danielsson, R.; Bylund, D.; Markides, K. E. Anal. Chim. Acta 2002, 454, 167-184.
(9) Conrads, T. P.; Anderson, G. A.; Veenstra, T. D.; Pasa-Tolic, L.; Smith, R. D. Anal. Chem. 2000, 72, 3349-3354.
(10) Goodlett, D. R.; Bruce, J. E.; Anderson, G. A.; Rist, B.; Pasa-Tolic, L.; Fiehn, O.; Smith, R. D.; Aebersold, R. Anal. Chem. 2000, 72, 1112-1118.
(11) Clauser, K. R.; Baker, P.; Burlingame, A. L. Anal. Chem. 1999, 71, 2871-2882.

THEORY

Minimization of Noise in LC-MS. A mass spectrum includes not only peaks corresponding to the sample components but also physical noise resulting from limited ion statistics, instabilities of the ion source, thermal noise, and spikes. Especially important is the presence in the spectrum of chemical noise from the MALDI matrix (MALDI-MS, ref 2) or mobile phase (ESI-MS, ref 3). Noise can influence the accuracy of m/z values in the peak list in two ways. First, MS peaks representing components of solvent (ESI) or matrix (MALDI) or their contaminants, or some intense spikes, could be mistaken for sample ions, potentially causing false positive identifications. Second, when the S/N of the sample peak is low, the centroid of the peak can be shifted, resulting in an inaccurate m/z value. To decrease the likelihood of false positives, the S/N threshold for peak selection could be increased. Unfortunately, this strategy will also increase the likelihood of false negatives, i.e., missing low-abundant components of the sample. An alternative strategy, used in this paper, to decrease the extent of false positives without increasing false negatives, is to take advantage of chromatographic characteristics (the shape and width of the chromatographic peaks) as the basis for both peak picking and denoising of mass spectra.

Denoising by Matched Filtration. It is well known in signal processing that, to extract signal from noise, the shape of the signal and the characteristics of the noise must be known (ref 12). If the input X(t) can be represented as a sum of the signal of the known shape, described by a function S(t), and a random function N(t) representing noise, then the maximum S/N can be obtained when the input data are processed by a matched filter having the transfer function H(f)

$$ H(f) = \frac{S^*(f)}{P_{NN}(f)} \tag{1} $$

where $S^*(f) = \int_{-\infty}^{\infty} S(t)\,\exp(j2\pi ft)\,dt$ is the complex conjugate of the Fourier transform of the signal S(t), $P_{NN}(f) = \int_{-\infty}^{\infty} R_{NN}(t)\,\exp(-j2\pi ft)\,dt$ is the power density spectrum of the noise, RNN is the autocorrelation function of the noise, t is time, and f is frequency. In the case of white noise (i.e., noise with the power density uniformly distributed in the frequency domain), matched filtration is equivalent to the calculation of the cross-correlation function of the input data X(t) with the function representing the known shape of the signal S(t). On the other hand, for colored noise (the general signal processing term for noise with a nonuniform power density distribution), information on noise characteristics is required to derive the relevant transfer function for matched filtration according to eq 1. In the cases of LC-MALDI-MS and LC-ESI-MS, colored noise is represented by chemical noise.

The application of a matched filtration approach to denoising in chromatography (with UV detection) was demonstrated over a decade ago (refs 13-16), where the Gaussian function was assumed to characterize the chromatographic peak shape. The cases of both white noise (PNN = const; refs 13, 14) and colored noise (PNN proportional to 1/f; refs 15, 16) were examined. Cross-correlation with the Gaussian was shown to be robust to errors in the presumed peak width (ref 14) (e.g., a 2-fold error in peak width resulted in only a 10% decrease in S/N) and to errors in the noise model (ref 16). The gain G in S/N due to the matched filtration of a Gaussian peak is equal to (ref 8)

$$ G = \sqrt{\frac{n\sqrt{\pi}}{4}} \approx 0.67\sqrt{n} \tag{2} $$

where n is the number of data points per chromatographic peak.

(12) Peebles, P. Z. Probability, Random Variables and Random Signal Principles; McGraw-Hill: New York, 2001.
(13) Van den Heuvel, E. J.; Van Malssen, K. F.; Smit, H. C. Anal. Chim. Acta 1990, 235, 343-353.
(14) Van den Heuvel, E. J.; Van Malssen, K. F.; Smit, H. C. Anal. Chim. Acta 1990, 235, 355-365.
(15) Van den Bogaert, B.; Boelens, H. F. M.; Smit, H. C. Anal. Chim. Acta 1993, 274, 71-85.
(16) Van den Bogaert, B.; Boelens, H. F. M.; Smit, H. C. Anal. Chim. Acta 1993, 274, 87-95.
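As a quick numerical check of eq 2, the following Python sketch evaluates the gain for a few values of n. The function name and the sample values of n are illustrative assumptions and are not taken from the original work.

```python
import math


def gaussian_matched_filter_gain(n_points: float) -> float:
    """Eq 2: S/N gain for a Gaussian peak described by n_points data points."""
    return math.sqrt(n_points * math.sqrt(math.pi) / 4.0)


# Hypothetical sampling densities, chosen only for illustration.
for n_points in (5, 12, 30):
    gain = gaussian_matched_filter_gain(n_points)
    print(f"n = {n_points:2d}  ->  G = {gain:.2f}")
```

Because the gain scales with the square root of n, quadrupling the number of data points per chromatographic peak doubles the expected gain.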


Denoising of LC-MS data by matched filtration is an even more powerful noise reduction approach than the above matched filtration for chromatography with UV detection. A series of mass spectra generated during LC-MS can be expressed as a two-dimensional array, with m/z being the first dimension and chromatographic retention time the second dimension. In principle, matched filtration can be performed in either or both dimensions of the 2D array, but as will be discussed, it is advantageous (results in higher S/N and mass accuracy) to perform the filtration first in the chromatographic time domain and then transfer the resultant denoised data set to the m/z domain for further analysis.

The use of cross-correlation in the chromatographic time domain for denoising of mass spectra in LC-ESI-MS was described previously in ref 8. Cross-correlation with the second derivative of the Gaussian resulted in significant suppression of the low-frequency fluctuations of the background. However, the gain in S/N was only about half (a factor of 1/√3 ≈ 0.58) that of cross-correlation with the Gaussian (ref 8; compare to eq 2):

$$ G = \sqrt{\frac{n\sqrt{\pi}}{12}} \approx 0.38\sqrt{n} \tag{3} $$

In addition, as noted earlier, cross-correlation with the second derivative of the Gaussian can produce negative artifact peaks on both sides of the real MS peaks, leading to peak shape distortion and decreased mass accuracy.

Description of the Algorithm. The main feature of the MEND algorithm is the ability to minimize both random and chemical noise by the use of the shapes and widths of the peaks in the chromatographic time and m/z domains, as well as the characteristics of noise in the chromatographic time domain. MEND first analyzes extracted ion chromatograms (EICs) for each m/z value and then examines the mass spectra. (The typical data set analyzed in this paper consisted of 3000 mass spectra with 130 000 m/z data points in the range between 700 and 4000.) Matched filtration causes a slight broadening of chromatographic peaks that is not crucial for peak picking but, importantly, as will be shown later, does not distort peaks in the m/z domain. Another advantage of performing matched filtration in the chromatographic time domain is that, in general, there are more data points per chromatographic peak than per MS peak. Thus, it is easier to distinguish signal from noise in the chromatographic time domain than in the m/z domain, and the corresponding gain in S/N will be higher (see eq 2).

MEND assumes the shape of the chromatographic peak to be Gaussian. For the given LC gradient separation, the width (fwhm) of the peak is assumed to vary by no more than a factor of 10 and is estimated by a preliminary analysis of several extracted ion chromatograms. Cross-correlation with the Gaussian is known to be relatively insensitive to errors in the estimation of peak width (ref 14) (see previous section). In addition, it was shown that cross-correlation with the Gaussian is not greatly affected by moderate asymmetry of the chromatographic peaks. For an exponentially modified Gaussian peak with a large asymmetry factor of 5, the shift of the peak maximum resulting from cross-correlation with the Gaussian was as low as 3% of the peak width (fwhm).
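The statement that filtering in the chromatographic time domain leaves m/z peak shapes intact can be illustrated with a minimal NumPy sketch. It assumes white noise (plain cross-correlation with a Gaussian, not the full MEND filter), a data array laid out as scans by m/z points, and purely synthetic data; the function names, the kernel truncation at 4 sigma, and all numbers are hypothetical.

```python
import numpy as np


def gaussian_kernel(fwhm_points: float) -> np.ndarray:
    """Unit-area Gaussian kernel; fwhm_points is the assumed peak width in scans."""
    sigma = fwhm_points / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    half_width = int(np.ceil(4.0 * sigma))
    t = np.arange(-half_width, half_width + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    return kernel / kernel.sum()


def filter_along_time(data: np.ndarray, fwhm_points: float) -> np.ndarray:
    """Cross-correlate every EIC (column) of a (n_scans, n_mz) array with a Gaussian.

    The kernel is symmetric, so convolution and cross-correlation coincide.
    """
    kernel = gaussian_kernel(fwhm_points)
    return np.apply_along_axis(
        lambda eic: np.convolve(eic, kernel, mode="same"), 0, data
    )


# Synthetic example: one component with a Gaussian elution profile and a
# Gaussian MS peak (all numbers are illustrative, not taken from the paper).
scans = np.arange(200)
mz_index = np.arange(40)
elution = np.exp(-0.5 * ((scans - 100) / 5.0) ** 2)
ms_peak = np.exp(-0.5 * ((mz_index - 20) / 2.0) ** 2)
data = np.outer(elution, ms_peak)

filtered = filter_along_time(data, fwhm_points=12.0)

# Compare the normalized m/z profile at the apex scan with the input profile.
apex_spectrum = filtered[100] / filtered[100].max()
print(np.allclose(apex_spectrum, ms_peak / ms_peak.max()))
```

Because the same linear filter is applied to every EIC, each filtered spectrum is a weighted sum of neighboring spectra, so the m/z profile of an isolated component is only rescaled.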

Figure 1. Flowchart of the MEND algorithm.

The flowchart of the algorithm is shown in Figure 1. Experimental determination of the noise characteristics is achieved in the chromatographic time domain, where the algorithm selects “vacant” extracted ion chromatograms that contain no chromatographic peaks. A power density spectrum of these “vacant” extracted ion chromatograms is calculated and averaged for a number of ion chromatograms (∼500); then, the transfer function H(f) for matched filtration is determined according to eq 1. The noise characteristics have been found to be m/z dependent (for both LC-MALDI-TOF-MS and LC-ESI-TOF-MS), so transfer functions are separately determined for a number of consecutive regions (∼200) covering the whole range of m/z. As we have noted, the dependence of noise characteristics on the m/z value is consistent with the results of ref 2 for MALDI-MS, where it was shown that chemical noise was produced mainly by the clusters of matrix ions, and with the results of ref 3, where chemical noise in LC-ESI-MS was attributed to LC mobile-phase and buffer impurities. In the next step, the main procedure of the algorithm, matched filtration of the EICs for each m/z value (∼130 000 EICs) with the experimentally determined transfer functions, is performed. A new 2D array in which the original extracted ion chromatograms are substituted by EICs, denoised by matched filtration, is constructed. This new array can then be analyzed in both the chromatographic and mass spectral dimensions, allowing highly reliable peak picking even for peaks with low signal-to-noise ratio (S/N ∼3). Next, peak picking is performed, based on comparison of scores generated for each m/z value, with a certain threshold Tp. Scores are generated according to the set of rules described below. 
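The noise-estimation and filtering steps described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' C++ implementation: the noise power density is estimated as an averaged periodogram of synthetic "vacant" EICs, the filter of eq 1 is applied through the FFT (so the filtering is circular), and the function names, the small floor on the power density, and all numerical values are hypothetical.

```python
import numpy as np


def estimate_noise_psd(vacant_eics: np.ndarray) -> np.ndarray:
    """Average periodogram of peak-free EICs, array of shape (n_eics, n_scans)."""
    n_scans = vacant_eics.shape[1]
    periodograms = np.abs(np.fft.rfft(vacant_eics, axis=1)) ** 2 / n_scans
    psd = periodograms.mean(axis=0)
    # Small floor (an assumption of this sketch) so H(f) never divides by zero.
    return np.maximum(psd, 1e-12 * psd.max())


def matched_filter(eic: np.ndarray, template: np.ndarray,
                   noise_psd: np.ndarray) -> np.ndarray:
    """Apply H(f) = S*(f) / P_NN(f) (eq 1) to one EIC via the FFT."""
    # ifftshift moves the template centre to index 0 so the output is not shifted.
    s_f = np.fft.rfft(np.fft.ifftshift(template))
    h_f = np.conj(s_f) / noise_psd
    return np.fft.irfft(np.fft.rfft(eic) * h_f, n=eic.size)


# --- synthetic demonstration (all values are illustrative assumptions) ---
rng = np.random.default_rng(0)
n_scans = 512
t = np.arange(n_scans)


def synthetic_noise(n_eics: int) -> np.ndarray:
    """Baseline offset + slow oscillation (stand-in for colored noise) + white noise."""
    phase = rng.uniform(0.0, 2.0 * np.pi, size=(n_eics, 1))
    slow = np.sin(2.0 * np.pi * 2.0 * t / n_scans + phase)
    return 5.0 + slow + 0.3 * rng.standard_normal((n_eics, n_scans))


sigma = 10.0 / 2.3548  # Gaussian sigma for an assumed fwhm of 10 scans
template = np.exp(-0.5 * ((t - n_scans // 2) / sigma) ** 2)

noise_psd = estimate_noise_psd(synthetic_noise(200))
peak = 2.0 * np.exp(-0.5 * ((t - 300) / sigma) ** 2)
eic = synthetic_noise(1)[0] + peak

denoised = matched_filter(eic, template, noise_psd)
apex = int(np.argmax(denoised))
print("peak apex found at scan", apex)
```

Dividing by the measured power density down-weights the frequencies where the vacant EICs carry most of their power (here the baseline offset and the slow oscillation), which is the difference from plain cross-correlation with a Gaussian. In the text, separate transfer functions are determined for consecutive m/z regions; the sketch uses a single one for brevity.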
Currently, the scoring rules include several parameters that are determined by trial and error; later, it is planned to apply machine learning algorithms (ref 17) and large training data sets in order to determine the optimum values of both the score parameters and the threshold Tp for the given LC-MS system. A final score Scf combining the results of the evaluation of peak candidates in the chromatographic time and m/z domains is calculated as

$$ Sc_f = Sc \cdot K_V \cdot K_I \tag{4} $$

where Sc is an initial score determined by examination of the peak in the chromatographic time domain, while KV and KI are obtained for each peak candidate by examination of the MS peak shape and the ratios of the peak intensities in the isotopic cluster, as described in the Appendix. It is important to note that denoising of the data set greatly aids the determination of an accurate ratio of the isotope peak heights. Only peak candidates with final score values that exceed the threshold Tp are included in the peak list.

In this paper, parameters of the peak picking routine of the MEND algorithm were specifically optimized for a high-resolution MALDI-MS instrument (AB 4700 TOF/TOF, Applied Biosystems) providing ∼12 data points per MS peak (Rs ≈ 15 000). The ranges of KV and KI were from 0.1 to 10, the higher value (KVmax, KImax; see Appendix for details) corresponding to close agreement between the experimentally observed and theoretically predicted shapes for MS peaks and isotopic clusters (ref 18). The optimum value of the threshold Tp was determined to be equal to 120, again by trial and error, using ∼30 LC-MALDI-TOF-MS data sets. To obtain the optimum value of the threshold, Tp was varied, and the resultant peak lists were compared. New peaks that appeared in the peak list due to a decrease in Tp were examined manually. If the new peaks were real (observed both in the denoised spectrum and the extracted ion chromatogram), then Tp was further decreased. Importantly, the optimum Tp value appeared to be essentially independent of the sample concentration and MS instrument settings. This independence is an advantage of the MEND algorithm, where peak selection is based on the examination of peak shape and width. It would not hold if peak picking were based on intensity or S/N thresholding.
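The scoring rule of eq 4 and the thresholding step can be written compactly as below. This is only a sketch: how Sc, KV, and KI are computed is described in the Appendix and is not reproduced here, so the candidate values are hypothetical, as are the class and function names. The threshold of 120 is the value reported in the text.

```python
from dataclasses import dataclass

T_P = 120.0  # threshold Tp reported in the text for the AB 4700 data sets


@dataclass
class PeakCandidate:
    mz: float
    sc: float  # initial score Sc from the chromatographic time domain
    kv: float  # MS peak shape factor KV (range 0.1 to 10)
    ki: float  # isotopic cluster factor KI (range 0.1 to 10)

    def final_score(self) -> float:
        """Eq 4: Scf = Sc * KV * KI."""
        return self.sc * self.kv * self.ki


def pick_peaks(candidates, threshold=T_P):
    """Keep only candidates whose final score exceeds the threshold."""
    return [c for c in candidates if c.final_score() > threshold]


# Hypothetical candidates, for illustration only.
candidates = [
    PeakCandidate(mz=1347.74, sc=40.0, kv=2.0, ki=2.0),  # Scf = 160, kept
    PeakCandidate(mz=1350.10, sc=60.0, kv=0.5, ki=1.0),  # Scf = 30, rejected
]
peak_list = pick_peaks(candidates)
print([c.mz for c in peak_list])
```

Because the three factors are multiplied, a candidate with a strong chromatographic score can still be rejected when its MS peak shape or isotope ratios agree poorly with prediction.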
As a final step, the monoisotopic peaks were selected from the isotopic clusters (deisotoping; ref 18), and then peaks corresponding to sodium and potassium adducts were determined and eliminated from the peak list (deadducting). It is important to emphasize that deisotoping and deadducting were performed after denoising, at the stage of peak picking. They do not refer to the removal of peaks from mass spectra, but to the removal of centroids of nonmonoisotopic and adduct peaks from the peak list. As an option, additional removal of matrix cluster peaks from the peak lists could be performed based on the prediction of m/z regions prohibited for peptides (ref 19).

(17) Duda, R. O.; Hart, P. E.; Stork, D. G. Pattern Classification; Wiley-Interscience: New York, 2001.
(18) Wehofsky, M.; Hoffmann, R.; Hubert, M.; Spengler, B. Eur. J. Mass Spectrom. 2001, 7, 39-46.
(19) Zubarev, R. A.; Hakansson, P.; Sundqvist, B. Anal. Chem. 1996, 68, 4060-4063.

EXPERIMENTAL SECTION

MS Instrumentation. The MALDI mass spectra were acquired in the MS and MS/MS modes using an AB 4700 TOF/TOF instrument (Applied Biosystems, Framingham, MA). ESI analysis was carried out with a Mariner orthogonal extraction TOF MS instrument (Applied Biosystems).

Table 1. List of Model Peptides Used for Sample 1 and as Internal Standards

peptide                                        use       m/z
angiotensin II human, fragment 1-7             sample     899.4739
angiotensin III human                          sample     931.5149
Sar1-angiotensin II                            sample    1002.5525
Tyr8-bradykinin                                sample    1223.6326
substance P                                    sample    1347.7360
syntide 2                                      sample    1507.9324
dynorphin A fragment 1-13                      sample    1603.9900
renin substrate porcine                        sample    1759.9300
big endothelin fragment 23-39 bovine           sample    1895.9616
epidermal growth factor fragment 19-36         sample    2318.2832
des-Arg1-bradykinin                            std        904.4670
angiotensin I human                            std       1296.6853
Glu1-fibrinopeptide B                          std       1570.6670
adrenocorticotropic hormone, fragment 1-17     std       2093.0867
adrenocorticotropic hormone, fragment 18-39    std       2465.1990

Table 2. Model Mixture of 10 ICAT Labeled Proteins and Quantitation Results

                   amt of ICAT labeled (fmol)
protein            heavy (H)   light (L)   H/L, theor   no. of correspd peptides (a)   H/L, expl
β-galactosidase    150         50          0.33         12                             0.38 ± 0.11
lysozyme C         50          50          1.00         21                             0.95 ± 0.17
β-lactoglobulin    250         50          0.20         9                              0.21 ± 0.03
ovalbumin          50          250         5.00         26                             5.1 ± 0.9
BSA                50          150         3.00         37                             3.4 ± 0.6
α-lactalbumin      50          500         10.00        11                             11.0 ± 2.8
transferrin        50          50          1.00         39                             1.16 ± 0.32
aldolase           500         50          0.10         3                              0.12 ± 0.04
pepsinogen         50          50          1.00         8                              0.93 ± 0.23
phosphorylase B    150         50          0.33         7                              0.39 ± 0.12

(a) Based on the theoretical digest of the 10 proteins (20 ppm mass tolerance).

Samples. Sample 1 was a model mixture of 10 peptides (Table 1) from Sigma (St. Louis, MO) with equal concentrations of 40 nM. Sample 2 was a tryptic digest of a model mixture of seven proteins (horse cytochrome c, horse myoglobin, human histone, bovine trypsinogen, bovine α-casein, human α-lactalbumin, bovine β-lactoglobulin) obtained from Sigma and tryptically digested separately according to the Promega protocol (Promega, Madison, WI). After digestion, the samples were mixed in equal molar ratios (estimated equivalent protein concentration in the final sample was 40 nM). Sample 3 was the strong cation exchange chromatography intermediate fraction (eighth out of 23) of a tryptic digest of yeast lysate (Saccharomyces cerevisiae strain HFY1200), kindly provided by Applied Biosystems. Sample 4 was a tryptic digest of the mixture of 10 cICAT labeled proteins. The composition of the mixture of 10 proteins is presented in Table 2. Protein samples were labeled with cICAT and processed according to the manufacturer's instructions (Applied Biosystems).

Chromatographic Analysis and Off-Line Continuous Deposition. The sample was analyzed on an UltiMate LC system (Dionex, San Francisco, CA) using a precolumn (300-µm i.d., 1-mm length) packed with 3-µm C18 stationary phase of 100-Å pore size (PepMap, Dionex) connected to a nano LC column (75-µm i.d., 15-cm length). Solvent A consisted of 2% ACN (v/v) and 0.1% TFA (v/v), and solvent B was 85% ACN (v/v), 5% 2-propanol (v/v), and 0.1% TFA (v/v). The stream of eluent from the LC column was mixed with the MALDI matrix solution (10 mM α-cyano-4-hydroxycinnamic acid, 25 mM (NH4)2SO4, 80% (v/v) methanol/water, and 0.1% (v/v) trifluoroacetic acid, from Sigma), delivered by a syringe pump (model 101, KD Scientific, New Hope, PA) in a microTee (Upchurch, Oak Harbor, WA), and deposited on the MALDI plate using the continuous deposition interface (ref 20) in the form of a continuous streak of ∼250-µm width. During the deposition, the MALDI plate moved at 0.5 mm/s using an X-Y translation stage. Sample 1 (1 µL injected) was separated using a 15-min linear HPLC gradient. Samples 2-4 (10 µL injected) were separated using a 40-min linear HPLC gradient. Sample 2 was the only sample also analyzed by LC-ESI-TOF-MS, using the identical LC system coupled to the Mariner orthogonal extraction TOF instrument via a home-built nanoelectrospray interface.

Mass Spectrometric Analysis. The deposited streaks were analyzed by the MALDI-TOF-TOF-MS instrument in a manner similar to that described elsewhere (ref 21). The MS signal was acquired every 500 µm along the deposited streak, corresponding to 1 s of chromatography. In total, 150 laser shots were accumulated from each position (laser repetition rate 200 Hz, laser spot diameter ∼150 µm). The MS/MS signal was acquired in the CID mode using 1000 laser shots for each precursor ion.

Estimation of AB 4700 Mass Accuracy Using Internal Calibration. The mass accuracy across the MALDI target was estimated using a standard internal calibration procedure.
A mixture of five peptide standards (see Table 1) with the MALDI matrix solution was deposited as a serpentine streak across the whole plate by continuous deposition using direct infusion from a syringe pump. The MS signal was acquired from 1000 positions along the deposited streak. Each MS spectrum was internally calibrated with four-point calibration (m/z = 904.47, 1296.68, 2093.09, 2465.20; see Table 1) using Data Explorer 4.3 (Applied Biosystems). The fifth peptide in the mixture (m/z = 1570.68) was employed to calculate the variation of the mass accuracy. The average mass difference for this peptide was 1.4 ppm with a standard deviation of 3.6 ppm.

Data Analysis. MS spectra obtained from individual wells (each well representing 1 s of chromatography) on the MALDI plate were combined in one file. All spectra were aligned by interpolation to obtain common m/z values, which enabled the data to be processed as individual extracted ion chromatograms. For experiments where mass calibration was necessary, the individual MS spectra were internally calibrated prior to the m/z alignment. The resulting data files were analyzed with C++ software based on the MEND algorithm described in the Theory section. Typical data sets for standard peptide mixtures consisted of 1400 spectra with 100 000 data points/spectrum. The size of the data set for the yeast lysate sample was ∼3000 spectra with 130 000 data points/spectrum. CPU time for analysis of the yeast lysate data set was roughly 40 min with a Pentium 4 (2 GHz) computer. To determine S/N in the regions of interest of denoised and original spectra, a MATLAB program was developed. All peaks and their isotopes picked by MEND were first eliminated from the given spectrum (i.e., substituted by the adjacent values of the baseline) in order to determine the noise level properly. Then, for the calculation of the power density spectrum of the noise, standard procedures of the MATLAB Signal Processing Toolbox were used (ref 22).

(20) Rejtar, T.; Hu, P.; Juhasz, R.; Campbell, J. M.; Vestal, M. L.; Preisler, J.; Karger, B. L. J. Proteome Res. 2002, 1, 171-179.
(21) Moskovets, E.; Chen, H.-S.; Pashkova, A.; Rejtar, T.; Andreev, V.; Karger, B. L. Rapid Commun. Mass Spectrom. 2003, 17, 2177-2187.

RESULTS AND DISCUSSION

The denoising power of the MEND algorithm is first illustrated by comparison of processed and original spectra from LC-MALDI-TOF-MS of some model mixtures. It will be demonstrated that MEND can significantly decrease both chemical and random noise, thus increasing S/N and enabling detection of low-intensity peaks. It will be further shown that MEND does not distort MS peak shape, thus resulting in the potential of high mass accuracy. Examples will be presented of additional differentially expressed ICAT pairs picked by MEND relative to the algorithm that only minimized white noise. Finally, it will be demonstrated that MEND can be effectively used for denoising of data from LC-ESI-TOF-MS.

Minimization of Noise by the MEND Algorithm in LC-MALDI-TOF-MS Data Sets. As discussed earlier, suppression of chemical and random noise is important for the generation of accurate peak lists. MEND experimentally determines properties of the noise by calculating the Fourier transform of the vacant EICs and subsequently the frequency dependence of the power density of the noise PNN(f). Figure 2 illustrates the noise characteristics of an LC-MALDI-TOF-MS analysis of a mixture of 10 standard peptides (sample 1). Figure 2A presents PNN(f) of extracted ion chromatograms for (1) low (700.0 < m/z < 707.4), (2) medium (1531.7 < m/z < 1542.8), and (3) high (2264.6 < m/z < 2278.1) masses.
The noise patterns of EICs for different m/z values have some common features; e.g., although the low-frequency component of PNN(f) (f < 0.01 Hz), representing slow variations of EIC baselines, dominates, there are certain frequencies (e.g., ∼0.02 Hz) where noise is rather intense. These fluctuations of intensity have a time period commensurate with the chromatographic time interval corresponding to the distance from the top to the bottom of the MALDI plate. Higher frequency fluctuations of noise intensity can be caused by suppression of matrix cluster ions by analyte ions. Figure 2B illustrates the m/z dependence of the mean power density of the noise, MPNN. The intensity of the noise varies with m/z by at least a factor of 6, being the highest in the low-mass region where matrix cluster ions are frequent. Since this complicated behavior of noise can vary from plate to plate and from sample to sample (among other reasons, due to ion suppression), it is clearly more practical to measure the properties of the noise than to develop models predicting noise properties for the given instrument, plate, and sample. For this reason, MEND measures not only the intensity of the noise for the given m/z value but also the noise pattern, i.e., the variation of noise along the plate, and uses this information for denoising by matched filtration.

(22) www.mathworks.com.

Figure 2. Illustration of noise properties in LC-MALDI-TOF-MS. (A) Frequency dependence of the power density of noise for EICs from three different m/z regions: (1) (700.0 < m/z < 707.4), (2) (1531.7 < m/z < 1542.8), and (3) (2264.6 < m/z < 2278.1). (B) Mean power density $M_{PNN} = \int_{f_{min}}^{f_{max}} P_{NN}(f)\,df$, with fmin = 0 and fmax = 1 Hz (highest frequency corresponds to the 1-s interval between data points in EIC). Sample: mixture of 10 standard peptides (sample 1, see Experimental Section and Table 1).

Figure 3 presents an example of denoising of a mass spectrum from an LC-MALDI-TOF-MS data set produced by the analysis of a mixture of 10 standard peptides (sample 1), mentioned above. (Here and elsewhere in the paper, for illustration purposes, only a small m/z section of the spectrum is presented. Only monoisotopic peaks of the isotopic clusters are labeled.) Figure 3A shows a representative original spectrum corresponding to 1 s of chromatography, as a result of 150 laser shots. Figure 3B is the result of averaging 10 consecutive spectra from the original data set (corresponding to 10 s of chromatography and 1500 laser shots). Figure 3C presents denoising by cross-correlation with the Gaussian (matched filtration with the assumption of white noise), and Figure 3D shows the spectrum denoised by matched filtration according to the MEND algorithm (experimentally determined characteristics of the colored noise). It can be seen that matched filtration, based on the experimental determination of noise characteristics (power density spectrum of the noise) and performed according to MEND, not only improves S/N ∼5-fold (S/N = 35.6 in comparison with S/N = 7.5 for the original spectrum) but also suppresses the background, unlike the averaging of 10 spectra and cross-correlation with the Gaussian. The similarity of the results of 10 spectra averaging and cross-correlation with the Gaussian is understandable, since both approaches are able to minimize only white random noise, while MEND minimizes colored chemical noise as well. The average gain in S/N due to denoising by MEND for all peptides of sample 1 was 4-fold.

It is possible that 10 spectra averaging, while reducing random noise, can even enhance chemical noise. This is illustrated in Figure 4, in which an example of efficient denoising by MEND of the data set from LC-MALDI-TOF-MS of the tryptic digest of a yeast cation exchange fraction (sample 3) is presented. For this example, S/N is improved by MEND (Figure 4C) by a factor of 8 compared with the original spectrum (Figure 4A) and by a factor of 2.5 compared with averaging of 10 spectra (Figure 4B). In the averaged spectrum (Figure 4B), the 1-Da periodicity of the chemical noise can be seen. Matched filtration with the MEND algorithm suppresses this chemical noise to a significant extent and thus not only improves the shapes of MS peaks, and as a result the accuracy of centroid determination, but also reduces the likelihood of false positives. As noted in the Theory section, reduction of chemical noise by MEND aids in the measurement of the real ratio of the peak intensities in the MS isotopic cluster and thus aids in the

Figure 3. Comparison of spectra from LC-MALDI-TOF-MS denoised by various algorithms. Sample: mixture of 10 standard peptides (sample 1, see Experimental Section and Table 1). Vicinity of the MS peak corresponding to peptide Tyr8-bradykinin. (A) Original spectrum, S/N ) 7.5; (B) 10 spectra averaged, S/N ) 12.4, (C) spectrum denoised by cross-correlation with Gaussian (white noise assumption), S/N ) 14.3; (D) spectrum denoised by matched filtration (MEND), S/N ) 35.6.

Figure 4. Chemical noise suppression by MEND in spectra from LC-MALDI-TOF-MS of complex mixture. Sample: SXC fraction of tryptic digest of yeast lysate (sample 3, see Experimental Section). (A) Original spectrum, S/N ) 5.5; (B) 10 spectra averaged, S/N ) 18.6; (C) spectrum denoised by matched filtration (MEND), S/N ) 46.2.

determination of the monoisotopic peaks. Figure 5 illustrates how MEND improves the assignment of the monoisotopic peak in the case of a complex mixture (sample 3) in the presence of a high 6320

Analytical Chemistry, Vol. 75, No. 22, November 15, 2003

level of chemical noise. Figure 5A presents the result of 10 spectra averaging, and Figure 5B presents the spectrum denoised by MEND. A high level of chemical noise with 1-Da periodicity,

Figure 5. Improved determination of monoisotopic peaks due to denoising by MEND of LC-MALDI-TOF-MS spectra. Sample: SXC fraction of tryptic digest of yeast lysate (sample 3, see Experimental Section). (A) Ten spectra averaged; (B) spectrum denoised by matched filtration (MEND). Peaks selected as monoisotopic are labeled.

combined with the presence of two overlapping isotopic clusters (Figure 5A), totally confused the deisotoping routine that assigned (labeled) several second isotopes as monoisotopic peaks and missed the real first isotope (m/z ) 1803.88). Denoising by MEND clearly demonstrated (Figure 5B) the presence of two overlapping isotopic clusters with the monoisotopic peaks being m/zm/ z1799.90 and 1803.88 that were successfully picked by the program. Thus, in the given example, MEND eliminated six false positives and one false negative peak. The MEND algorithm is also able to suppress undesirable peaks originating from internal standards added to the matrix, as illustrated by Figure 6. The sample analyzed by LC-MALDI-TOFMS was the mixture of 10 model peptides (sample 1). Five standards were added to the matrix for calibration purposes. The spectrum in Figure 6A was from the averaging of 10 consecutive spectra, while Figure 6B was a result of the MEND algorithm. It can be seen that the peak corresponding to Tyr-bradykinin (m/z ) 1223.64) was much more intense in the case of matched filtration than for 10 spectra averaging. Even more important, it can be seen that there are peaks (e.g., m/z ) 1233.61, 1252.67) in the averaged spectrum (Figure 6A) that are not present in the spectrum resulting from matched filtration (Figure 6B). Analysis of the corresponding matched filtered EICs at m/z ) 1223.64 (Figure 6C), 1233.61 (Figure 6D), and 1252.67 (Figure 6E) reveals that a chromatographic peak is present only for m/z ) 1223.64. Panels D and E of Figure 6 indicate oscillating behavior with maximums roughly 5-8 times smaller than in Figure 6C. It is evident that the ions, m/z ) 1233.61 and 1252.67, are present in essentially all spectra and must therefore arise from the MALDI matrix. It can be further noted from Figure 6A that the isotopes in the m/z ) 1233.61 cluster are separated not by 1 Da but by 0.5 Da. 
Therefore, it is a doubly charged ion, and it clearly arises from the internal standard peptide, ACTH 18-39, with a molecular weight of 2465.20. The MEND algorithm is able to eliminate such matrix peaks. If not filtered out by the algorithm, these matrix peaks could cause false identifications. Obviously, it is possible to remove these false peaks by examining extracted ion chromatograms for each picked peak and determining whether it represents a real chromatographic peak or not. However, in MEND, these peaks are suppressed simultaneously with the denoising.

Improved Mass Accuracy of LC-MALDI-TOF-MS Data Sets Denoised by the MEND Algorithm. As discussed in the introduction, minimization of distortion of MS peak shape is an important requirement for any LC-MS denoising algorithm. MEND does not compromise mass accuracy even for peaks with low S/N, as demonstrated by the following experiment. The tryptic digest of a model mixture of seven proteins (sample 2) was analyzed by LC-MALDI-TOF-MS. The peak list was generated by MEND and used for selection of precursor ions for MS/MS analysis; the results of the MS/MS analysis were submitted to MASCOT (ref 23), and 39 peptides were identified, so that their exact theoretical m/z values were known. At the second stage of the experiment, sample 2 was diluted 5 times in order to model a data set with low S/N peaks. For mass calibration, five standards were added to the MALDI matrix (see Experimental Section). As a result of the combined effects of dilution and ion suppression caused by the standards, the intensities of the MS peaks were significantly reduced, and only 12 of the peptides identified in the more concentrated sample were detected in the diluted sample. For those detected, S/N varied from 3 to 14.5. Mass accuracy ∆m/m was calculated as the difference between the experimentally observed and theoretical m/z values divided by the theoretical value:

$$\Delta m/m = \frac{(m/z)_{EXP} - (m/z)_{THEO}}{(m/z)_{THEO}} \qquad (5)$$
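As a worked illustration of eq 5, the short sketch below converts pairs of observed and theoretical m/z values into relative mass errors expressed in ppm and summarizes them by their mean and standard deviation. The function name and the numerical values are hypothetical and serve only to show the arithmetic; they are not data from this study.

```python
import numpy as np

def mass_accuracy_ppm(mz_exp, mz_theo):
    """Relative mass error of eq 5, expressed in parts per million."""
    mz_exp = np.asarray(mz_exp, dtype=float)
    mz_theo = np.asarray(mz_theo, dtype=float)
    return (mz_exp - mz_theo) / mz_theo * 1e6

if __name__ == "__main__":
    # Hypothetical centroids and theoretical values, for illustration only.
    observed = [1000.005, 1500.000, 2000.012]
    theoretical = [1000.000, 1500.006, 2000.000]
    errors = mass_accuracy_ppm(observed, theoretical)
    print("per-peptide errors (ppm):", errors)
    print("mean (ppm):", errors.mean())
    print("standard deviation (ppm):", errors.std(ddof=1))
```

Whether the summary statistics are taken over signed or absolute errors is a choice left to the user of such a helper; the sketch reports signed values.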

The ∆m/m values from the data set denoised by MEND were found to be similar to those from 10 spectra averaging. For MEND, the mean value of ∆m/m was 3.5 ppm and the standard deviation was 4.1 ppm; for 10 spectra averaging, the values were 3.7 and 4.7 ppm, respectively. Examination of the quality of mass calibration (see Experimental Section) showed that the mass accuracy of the calibration was at the same level (standard deviation 3.6 ppm), thus indicating that the mass accuracy observed with MEND is mainly determined by the accuracy of calibration. To minimize the influence of the accuracy of mass calibration, experimental m/z values of peak centroids from the same data file processed by MEND and by 10 spectra averaging were compared:

$$\delta m/m = \frac{(m/z)_{MEND} - (m/z)_{10AVR}}{(m/z)_{MEND}} \qquad (6)$$

Clearly, δm/m accounts for the portion of the mass accuracy determined by the presence of the chemical noise, and ∆m/m accounts for the combined effects of both accuracy of calibration and chemical noise. The average value of δm/m was 1.0 ppm and the standard deviation 1.5 ppm. Thus, in this example, the observed mass accuracy for detected peptides was limited not by the chemical noise and the method of its reduction but by the accuracy of mass calibration.

Figure 6. Removal of matrix-related peaks in spectra from LC-MALDI-TOF-MS by MEND. Sample: mixture of 10 standard peptides (sample 1, see Experimental Section and Table 1). Vicinity of the MS peak corresponding to the peptide Tyr8-bradykinin. (A) Ten spectra averaged; (B) spectrum denoised by matched filtration (MEND); (C) EIC at m/z = 1223.6; (D) EIC at m/z = 1233.6; (E) EIC at m/z = 1252.7. m/z = 1233.6 corresponds to [M + 2H]2+ from the internal standard; m/z = 1252.7, matrix cluster.

Figure 7. Influence of denoising algorithms on mass accuracy in LC-MALDI-TOF-MS. Sample: SCX fraction of tryptic digest of yeast lysate (sample 3, see Experimental Section). (A) Ten spectra averaged; (B) spectrum denoised by second-derivative Gaussian filtering (ref 8); (C) spectrum denoised by matched filtration (MEND). Distances (∆m/z) between isotopes in the isotopic clusters are presented instead of m/z values. Theoretically, the distance between the first two isotopes in the isotopic cluster is equal to 1.003 (ref 24).

(23) www.matrixscience.com.

(24) Yergey, J. A. Int. J. Mass Spectrom. Ion Phys. 1983, 52, 337-349.

If the mass calibration is improved, MEND can provide higher mass accuracy than 10 spectra averaging. Figure 7 illustrates how denoising algorithms influence mass accuracy. The data set was produced by LC-MALDI-TOF-MS analysis of an intermediate (eighth of 23) SCX fraction of the tryptic digest of yeast lysate (sample 3). Figure 7A presents the spectrum produced by 10 spectra averaging, Figure 7B is the result of second-derivative Gaussian filtering according to ref 8, and Figure 7C illustrates the spectrum denoised by the MEND algorithm. To exclude the influence of mass calibration on mass accuracy, the distances ∆m/z between the isotopes in the isotopic clusters were examined. It is known that the theoretical distance between the first two isotopes in the isotopic cluster should be equal to 1.003 (ref 24). The closeness of the experimentally determined spacing between the peaks in the isotopic clusters to the theoretically predicted value is a good indication of the quality of denoising and the resulting mass accuracy. For the given example, matched filtration with MEND (Figure 7C) resulted in a mass accuracy of better than 1 ppm (calculated on the basis of the spacing between the first and second isotopes of the isotopic clusters), while the second-derivative Gaussian filtering (Figure 7B) resulted in a mass accuracy of 5.5 ppm, and the 10 spectra averaging (Figure 7A) led to a mass accuracy of 5.6 ppm. It can also be seen that the shapes of MS peaks are more symmetrical for the case of matched filtration (Figure 7C), indicating an improved reduction of chemical noise.

Increase of the Number of Selected ICAT Pairs due to the Detection of Low-Intensity Peaks by MEND. As noted in the introduction, the MEND algorithm is also useful in protein expression analysis; e.g., in the case of ICAT labeled samples, MEND should increase the number of detected differentially expressed ICAT pairs. To illustrate this important advantage of the MEND algorithm, the data set from LC-MALDI-TOF-MS of the tryptic digest of the mixture of 10 ICAT labeled proteins (sample 4) was processed both by MEND and by cross-correlation with the Gaussian (assumption of white noise). In both cases, denoising and peak picking were performed by the MEND software, but in the case of the white noise assumption, the routine for the experimental determination of noise characteristics was not employed. The numbers of detected ICAT pairs were 490 with MEND and 382 with the white noise assumption. Comparison of these two peak lists with the theoretical digest of the ICAT labeled 10-protein mixture (mass tolerance was set to 20 ppm) resulted in the identification of 178 peptides from the MEND list and 150 peptides from the list generated with the white noise assumption. Thus, MEND resulted in 28 additional identifications (19%). The numbers of identified peptides for each protein and the results of

quantitation are given in Table 2. Good agreement between theoretical and experimental relative abundances for each of the 10 proteins was found, even in the case of a ratio of 10:1 or 1:10. Examples of 2 of the 28 additional pairs detected by MEND are presented in Figures 8 and 9. Figure 8 illustrates the case where MEND enabled the ICAT pair to be picked by suppressing peaks originating from the internal standard. Peaks at m/z = 1592.7 and 1593.7 belong to the sodium adduct of the standard peptide Glu1-fibrinopeptide B with m/z = 1570.67. In the original spectrum (Figure 8A) and the spectrum denoised with the white noise assumption (Figure 8B), these adduct peaks mask the heavy member of the ICAT pair at m/z = 1594.86. MEND (Figure 8C) suppressed peaks originating from the internal standard that were present in essentially all spectra. Thus, the ICAT pair was detected and then successfully identified by MS/MS with a MASCOT score equal to 40. Figure 9 demonstrates how reduction of chemical noise by MEND improves mass accuracy and enables the detection of ICAT pairs. In Figure 9A, the spectrum was denoised with the white noise assumption; here the light member of the ICAT pair (m/z = 1724.8) overlapped with the chemical noise, resulting in a significant shift of the peak centroid and a distance between neighboring monoisotopic peaks equal to 8.97. MEND (Figure 9B) reduced chemical noise (i.e., suppressed the shoulder of the m/z = 1724.8 peak), resulting in a distance between the light and heavy peaks of the ICAT pair, ∆m/z = 9.00, close to the expected value of 9.03 and, thus, enabling the detection of this pair. The accuracy of determination of the distance between the monoisotopic peaks was 35 ppm in the case of the white noise assumption and 17 ppm in the case of denoising by MEND (including chemical noise suppression).

Figure 8. ICAT pair selected by MEND due to the suppression of the sodium adduct of the internal standard in the LC-MALDI-TOF-MS spectra. Sample: tryptic digest of the mixture of 10 ICAT labeled proteins (sample 4, see Table 2). Vicinity of the ICAT pair corresponding to the peptide SVIPSDGPSVACVK from human transferrin. Light, m/z = 1585.83; heavy, m/z = 1594.86. (A) Original spectrum; (B) spectrum denoised by cross-correlation with Gaussian (white noise assumption); (C) spectrum denoised by matched filtration (MEND). Peak at m/z = 1592.8 is a sodium adduct of the internal standard peptide Glu1-fibrinopeptide B (m/z = 1570.67).

Figure 9. ICAT pair selected by MEND due to the reduction of chemical noise in the LC-MALDI-TOF-MS spectra and improved mass accuracy. Sample: tryptic digest of the mixture of 10 ICAT labeled proteins (sample 4, see Table 2). Vicinity of the ICAT pair corresponding to the peptide DDPHACYST from BSA. Light, m/z = 1724.76; heavy, m/z = 1733.79. (A) Spectrum denoised by cross-correlation with Gaussian (white noise assumption); (B) spectrum denoised by matched filtration (MEND). Denoising by MEND reduces the shoulder of the light peak, thus bringing the distance between the MS clusters closer to the theoretical value of 9.03.

The pair in Figure 9A can be selected without denoising by MEND by increasing the mass tolerance for detection of ICAT pairs from 20 to 40 ppm; however, this increase could lead to selection of false positive

Figure 10. Representative example of denoising of LC-ESI-TOF-MS data by MEND. Sample: tryptic digest of the mixture of seven proteins (sample 2, see Experimental Section). (A) Original spectrum; (B) 10 spectra averaged; (C) spectrum denoised by matched filtration (MEND); (D) EIC at m/z = 413.8; (E) EIC at m/z = 426.0; (F) EIC at m/z = 428.0.

pairs. Accurate denoising performed by MEND increases the “contrast” and enables the discrimination of true peaks and pairs from false ones.

Minimization of Noise by the MEND Algorithm in LC-ESI-TOF-MS Data Set. To briefly illustrate the universality and nondistorting character of the algorithm, MEND was also applied to denoising of data sets from LC-ESI-TOF-MS. No change in the denoising strategy of the MEND algorithm was necessary for LC-ESI-TOF-MS, because MEND automatically adapted to the new conditions by determining the noise characteristics directly from the data set. Figure 10 presents an example of the application of the algorithm to the analysis of LC-ESI-TOF-MS data. In the case of 10 spectra averaging (Figure 10B), random noise was suppressed, but the chemical noise with 1-Da periodicity was even higher than in the original spectrum (Figure 10A). On the other hand, the spectrum denoised by MEND (Figure 10C) was practically free of random and chemical noise. Peaks at m/z = 426.01 and 428.00 are present in both the original and the 10 spectra averaged spectrum but are minimized in the spectrum denoised by MEND. As can be seen from the EICs for these m/z values presented in Figure 10E and F, these MS peaks were suppressed because they do not correspond to any chromatographic peaks but result from chemical noise due to impurities of the eluent and are therefore present over a broad chromatographic time region. The MS peak at m/z = 413.83 is present in the denoised spectrum because, as can be seen from the EIC in Figure 10D, it corresponds to a real chromatographic peak. Thus, the ability of the algorithm to distinguish sample components from mobile-phase impurities is seen. In addition, examination of the values of centroids calculated for the original spectrum (Figure 10A), averaged spectrum (Figure 10B), and denoised spectrum (Figure 10C) reveals that averaging has a tendency to shift centroids to the right in comparison with the original values. Centroids for the denoised spectrum are much closer to the original ones. Thus, for LC-ESI-TOF-MS data, the MEND algorithm demonstrates performance similar to that for LC-MALDI-TOF-MS data: both chemical and random noise are suppressed, sample components are distinguished from mobile-phase impurities, and mass accuracy is not compromised. Optimization of the peak picking portion of the MEND algorithm for LC-ESI-TOF-MS was not performed and will be the topic of a future study.

CONCLUSIONS

A new algorithm for denoising and peak picking, based on matched filtration in the chromatographic time domain, was developed. By using the information on the shape and width (fwhm) of the chromatographic peak, as well as the shape and width of the m/z peaks and the structure of the isotopic cluster, the algorithm is able to successfully pick peaks with low intensities and low S/N. MEND experimentally determines the characteristics of chemical and random noise present in LC-MS data sets and uses this information for improved denoising. Thus, the algorithm significantly increases S/N in comparison to algorithms that only remove random noise. In principle, MEND has an advantage over approaches such as CODA (ref 3), where only EICs with high S/N are retained and other EICs are eliminated. Matched filtration performed by the MEND algorithm in the chromatographic time domain denoises all EICs and does not distort the shapes of MS peaks, even those with low S/N. The algorithm is also able to filter out MS peaks corresponding to the clusters of matrix ions and to internal standards introduced into the matrix for calibration purposes. In addition, MEND is shown to be useful for protein expression analysis by enabling picking of a high number of differentially expressed ICAT pairs due to increased S/N and mass accuracy. Additional pairs are then found due to reduction of chemical noise and suppression of matrix-related peaks. The universality of the MEND approach has been shown in the application to denoising of both LC-MALDI-TOF-MS and LC-ESI-TOF-MS data sets. MEND could also be quite advantageous in the case of LC-MALDI-QTOF, where chemical noise can be significant due to the cooling of ions in the quadrupole. Furthermore, MEND should be applicable to analysis of any 2-D data sets where one dimension represents a separation step and another dimension represents a spectrum, e.g., data sets from CE-ESI-MS, CE-MALDI-MS, LC (CE)-NMR, GC/MS, and LC (CE)-diode array UV-visible detection. The results of the use of MEND in proteomic studies will be presented separately (ref 25).

ACKNOWLEDGMENT

Financial support from the National Institutes of Health (GM 15847 and HG 02033) is gratefully acknowledged. Samples provided by Applied Biosystems, Framingham, MA, are gratefully acknowledged. Contribution 826 from the Barnett Institute.

APPENDIX. DETAILS OF THE SCORING PROCEDURE

The initial score Sc is calculated as the ratio of the maximum and mean values of the matched filtered extracted ion chromatogram and thus characterizes the signal-to-noise ratio for the largest peak in the given EIC:

$$\mathrm{Sc} \approx (S/N)_m\,G \qquad (A1)$$

where (S/N)m is the signal-to-noise ratio for the largest peak in the EIC and G is the gain due to the matched filtration determined by eq 2. A high value of Sc indicates the presence of a chromatographic peak, while a value of Sc ≤ 2 corresponds to the case where the maximum and mean values are similar, and therefore, no chromatographic peak is assumed to be present. To enable scoring and selection of more than one peak in each EIC, the largest chromatographic peak is eliminated (substituted by the adjacent value of the baseline) after being detected. Then, the matched filtration routine is repeated and the next largest peak in the EIC is detected and scored. This procedure is repeated M times, where M is equal to the expected maximum number of chromatographic peaks in the EIC and depends on the complexity of the analyzed mixture and the resolution of the MS instrument (the higher the complexity of the mixture, the higher the M value; the higher the resolution, the lower the M value). For the data sets analyzed in this paper (e.g., LC-MS of SCX fractions of a yeast digest), the number of peptides possessing equal m/z values and consequently present in the same EIC was never higher than 5, so M was set equal to 5, but it can easily be doubled if needed for more complicated mixtures.

As a result of denoising and scoring in the chromatographic time domain, a number of candidates (typically several thousand) are preselected for the peak list. Thus, the number of data points that need to be further analyzed is significantly decreased relative to the initial several hundred million data points (130 000 × 3000 for the typical data set). For each peak candidate, the m/z value, spectrum number (chromatographic time point where the peak apex was observed), and intensity are known. Next, to examine the MS peak shape, the intensity of the candidate is compared with the intensities at the neighboring m/z values. Five neighboring data points at lower and at higher m/z values relative to the peak candidate are examined. The coefficient KV is generated at this stage, as described below. The highest value of the coefficient, KVmax = 10, is assigned to the peak candidate when the intensities at all five lower m/z neighboring data points monotonically increase, while the intensities at the corresponding higher m/z neighboring data points monotonically decrease. For each deviation from the above behavior, the value of KV is decreased by a factor of dv (in this work dv = 1.25). As a result, the highest values of KV are assigned to the peak candidates representing MS peaks with smooth shapes.

Next, the ratios of the peak heights of the isotopes in the isotopic clusters are compared with the theoretically predicted values (ref 18). The closer the peak height ratios are to the theoretically predicted ratios, the higher is the coefficient KI generated at this stage. The maximum value KImax = 10 is assigned if all peaks in the isotopic cluster have peak height ratios in agreement with the theoretically predicted values. For each isotope with a peak height different from the theoretically predicted value, the KI value is decreased by a factor of dI (in this work dI = 1.5). It is important to note that the optimum values of KVmax, KImax, and their rates of decrease dv and dI depend on the type of MS instrument (both ion source and mass analyzer). For example, the mass resolution will influence not only the widths of the MS peaks but the observed ratios of the isotopes in the isotopic clusters as well. For a lower resolution instrument, either the experimentally observed signals from the isotopic clusters must be deconvoluted before comparison with the theoretically predicted ones, or the examination of the shape of the isotopic cluster must be made more tolerant (smaller dI) to account for possible deviations from theoretical predictions. As already noted, it is planned as a next stage of MEND to apply machine learning algorithms in order to determine optimum peak picking parameters for the given MS instrument.

Received for review May 5, 2003. Accepted September 4, 2003. AC0301806

(25) Rejtar, T.; Chen, H.-S.; Hu, P.; Andreev, V. P.; Ivanov, A. R.; Zang, L.; Moskovets, E. V.; Karger, B. L., to be published.
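The shape coefficient KV and the isotope coefficient KI described in the Appendix can be sketched as follows. This is an illustrative reading of the text rather than the MEND source code: the text does not specify how a "deviation" is counted or how close a peak height ratio must be to count as agreeing with theory, so counting non-monotonic steps between adjacent data points and the relative tolerance `rel_tol` are assumptions of this example, as are the function names.

```python
import numpy as np

def shape_coefficient(intensities, idx, kv_max=10.0, dv=1.25, half_width=5):
    """KV: kv_max divided by dv once per deviation from a smooth peak shape.

    Assumption: a deviation is any step between adjacent data points that does
    not rise on the low-m/z side or does not fall on the high-m/z side.
    """
    x = np.asarray(intensities, dtype=float)
    if idx < half_width or idx + half_width >= x.size:
        raise ValueError("candidate too close to the edge of the spectrum")
    steps = np.diff(x[idx - half_width: idx + half_width + 1])
    rising, falling = steps[:half_width], steps[half_width:]
    deviations = int(np.sum(rising <= 0) + np.sum(falling >= 0))
    return kv_max / dv ** deviations

def isotope_coefficient(observed, predicted, ki_max=10.0, di=1.5, rel_tol=0.2):
    """KI: ki_max divided by di once per isotope whose height ratio disagrees.

    Heights are normalized to the first (monoisotopic) peak; rel_tol is a
    hypothetical tolerance, not a value taken from the paper.
    """
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    obs_ratio, pred_ratio = obs / obs[0], pred / pred[0]
    mismatches = int(np.sum(np.abs(obs_ratio - pred_ratio) > rel_tol * pred_ratio))
    return ki_max / di ** mismatches

if __name__ == "__main__":
    smooth = [0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0]
    bumpy = [0, 1, 2, 1.5, 4, 5, 4, 3, 2, 1, 0]
    print("KV smooth:", shape_coefficient(smooth, 5))
    print("KV bumpy:", shape_coefficient(bumpy, 5))
    print("KI:", isotope_coefficient([100, 90, 20], [1.0, 0.6, 0.2]))
```

Dividing by dv or dI for every deviation makes the coefficients fall off geometrically, so a smaller dI gives the more tolerant behavior suggested above for lower resolution instruments.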