Automated Intensity Descent Algorithm for Interpretation of Complex

Jun 3, 2006 - This paper describes a new automated intensity descent algorithm for analysis of complex high-resolution mass spectra. The algorithm has...
0 downloads 0 Views 304KB Size
Anal. Chem. 2006, 78, 5006-5018

Automated Intensity Descent Algorithm for Interpretation of Complex High-Resolution Mass Spectra Li Chen,† Siu Kwan Sze,*,‡ and He Yang*,†

Bioinformatics Institute, 30 Biopolis Street, Singapore 138671, and Genome Institute of Singapore, 60 Biopolis Street, Singapore 138672

This paper describes a new automated intensity descent algorithm for analysis of complex high-resolution mass spectra. The algorithm has been successfully applied to interpret Fourier transform mass spectra of proteins; however, it should be generally applicable to complex high-resolution mass spectra of large molecules recorded by other instruments. The algorithm locates all possible isotopic clusters by a novel peak selection method and a robust cluster subtraction technique according to the order of descending peak intensity after global noise level estimation and baseline correction. The peak selection method speeds up charge state determination and isotopic cluster identification. A Lorentzian-based peak subtraction technique resolves overlapping clusters in high peak density regions. A noise flag value is introduced to minimize false positive isotopic clusters. Moreover, correlation coefficients and matching errors between the identified isotopic multiplets and the averagine isotopic abundance distribution are the criteria for real isotopic clusters. The best fitted averagine isotopic abundance distribution of each isotopic cluster determines the charge state and the monoisotopic mass. Three high-resolution mass spectra were interpreted by the program. The results show that the algorithm is fast in computational speed, robust in identification of overlapping clusters, and efficient in minimization of false positives. In ∼2 min, the program identified 611 isotopic clusters for a plasma ECD spectrum of carbonic anhydrase. Among them, 50 new identified isotopic clusters, which were missed previously by other methods, have been discovered in the high peak density regions or as weak clusters by this algorithm. As a result, 18 additional new bond cleavages have been identified from the 50 new clusters of carbonic anhydrase. With the advent of electrospray ionization (ESI)1 and matrixassisted laser desorption ionization (MALDI),2 mass spectrometry * Address correspondence to either author. (S.K.S.) Phone: 065-64788111. Fax: 065-64789060. E-mail: [email protected]. (H.Y.) Phone: 065-64788268. Fax: 065-64789048. E-mail: [email protected]. † Bioinformatics Institute. ‡ Genome Institute of Singapore. (1) Fenn, J. B.; Mann, M.; Meng, C. K.; Wong, S. F.; Whitehouse, C. M. Mass Spectrom. Rev. 1990, 9, 37-70. (2) Karas, M.; Hillenkamp, F. Anal. Chem. 1988, 60, 2299-2301.

5006 Analytical Chemistry, Vol. 78, No. 14, July 15, 2006

has been widely used for characterization of macromolecules. In particular, mass spectrometry becomes the most powerful technique for identifying protein and its posttranslational modifications.3-5 It also provides a basis for identifying novel biomarkers, pathogenic factors, and protein interactions in networks and pathways. Specifically, compared to MALDI, ESI greatly extends the capability of mass spectrometry for measuring the masses of large biomolecules by generating multiply charged ions. Fourier transform mass spectrometry (FTMS)6,7 is currently the instrument that is capable of achieving high-resolution and precise mass measurement simultaneously. Electrospray ionization combined with Fourier transform mass spectrometry (ESIFTMS) can provide a high-resolution spectrum of large biomolecules (>10 kDa) with an accurate mass determination of ∼1 ppm.8 The development of tandem mass spectrometry techniques has greatly extended the capability of the system for characterization of large protein molecules. For example, electron capture dissociation (ECD) of multiply charged ions discovered by McLafferty and co-workers9-11 is the most powerful and elegant tandem MS technique for characterization of intact proteins. ECD generates extensive backbone bond cleavage of large proteins. Such cleavage provides tremendous sequence information and can be used for identification of posttranslational modifications.12 It becomes an indispensable method for analysis of biomolecules in the “top-down” approaches.13-15 However, tandem mass spectra of large biomolecules are usually very complicated because (3) Aebersold, R.; Goodlett, D. R. Chem. Rev. 2001, 101, 269-296. (4) Palmblad, M.; Buijs, J.; Hakansson, P. J. Am. Soc. Mass Spectrom. 2001, 12, 1153-1162. (5) Pandey, A.; Mann, M. Nature 2000, 405, 837-846. (6) Marshall, A.; Hendrickson, C.; Jackson, G. Mass Spectrom. Rev. 1998, 17, 1-35. (7) Makarov, A.; Denisov, E.; Kholomeev, A.; Balschun, W.; Lange, O.; Strupat, K.; and Horning, S. Anal. Chem. 2006, 78, 2113-2120. (8) McLafferty, F. W. Acc. Chem. Res. 1994, 27, 379-386. (9) Zubarev, R. A.; Kelleher, N. L.; McLafferty, F. W. J. Am. Chem. Soc. 1998, 120, 3265-3266. (10) McLafferty, F. W.; Horn, D. M.; Breuker, K.; Ge, Y.; Lewis, M. A.; Cerda, B.; Zubarev, R. A.; Carpenter, B. K. J. Am. Soc. Mass Spectrom 2001, 12, 245-249. (11) Sze, S. K.; Ge, Y.; Oh, H. B.; McLafferty, F. W. Anal. Chem. 2003, 75, 1599-1603. (12) Sze, S. K.; Ge, Y.; Oh, H. B.; McLafferty, F. W. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 1774-1779. (13) Kelleher, N.; Lin, H. Y.; Valaskovic, G. A.; Aaserud, D. J.; Fridriksson, E. K.; and McLafferty, F. W. J. Am. Chem. Soc. 1999, 121, 806-812. (14) Ge, Y.; Lawhorn, B. G.; El Naggar, M.; Strauss, E.; Park, J. H.; Begley, T. P.; McLafferty, F. W. J. Am. Chem. Soc. 2002, 124, 672-678. 10.1021/ac060099d CCC: $33.50

© 2006 American Chemical Society Published on Web 06/03/2006

thousands of isotopic peaks from a mixture of ions are packed in a small m/z window from 500 to 18 000 Da. As a result, the mass spectral interpretation for determination of monoisotopic mass of each isotopic cluster is a very demanding task. Manual interpretation is rather tedious, and it is often impossible to exhaustively identify all of the overlapped isotopic clusters. The development of an automated spectral interpretation method is necessary for resolving isotopic multiplets for ions with large masses and charges in high-resolution mass spectra. Several algorithms have been proposed to automate the procedure of mass spectral interpretation. Mann et al.16 developed the averaging and deconvolution algorithms for extracting molecular mass information from spectra showing sequences of peaks due to ions with varying numbers of charges. McLafferty and coworkers17 presented an automated method for assignment of charge states in ESI-FT-ICR mass spectra based on a combination of the Fourier and Patterson transforms of the m/z values of isotopic clusters. They also suggested a computational method for an average amino acid called averagine, which has been used for determination of the monoisotopic mass and ion population of resolved isotopic distribution of unknown proteins.18 Zhang et al.19,20 applied maximum entropy-based deconvolution to enhance the effective resolution of mass spectra of high-mass biomolecules. They also proposed a universal algorithm (Zscore) for speedy and automated charge state deconvolution of both low- and highresolution electrospray mass spectra. Currently, THRASH21 (thorough high-resolution analysis of spectra, by Horn) is the most powerful and widely used algorithm for automated reduction and interpretation of high-resolution electrospray mass spectra of large molecules. This method employs the Fourier transform/Patterson method for primary charge determination and locates all isotopic clusters by a least-squares fitting technique in which “averagine” peaks are used to derive the theoretical isotopic abundance distribution. The main computational effort of THRASH is focused on the application of least-squares fit of averagine to experimental isotopic clusters in each moving window, which is shifted gradually by one m/z data point from the lower m/z limit to the upper m/z limit. The calculated peak list can be further analyzed by submitting to web-based database search application (e.g., ProSight PTM22 based on shotgun annotation15) for top-down protein characterization. Here, we present a novel analysis algorithm (AID-MS) for efficient interpretation of complex high-resolution mass spectra. The AID-MS algorithm implements novel and speedy methods for both charge state determination and isotopic cluster identification. Robust techniques have been introduced to minimize false (15) Pesavento, J.; Kim, Y.; Taylor, G. K.; Kelleher, N. L. J, Am. Chem. Soc. 2004, 126, 33860-3387. (16) Mann, M.; Meng, C. K.; Fenn, J. B. Anal. Chem. 1989, 61, 1702-1708. (17) Senko, M. W.; Beu, S.C.; McLafferty, F. W. J. Am. Soc. Mass Spectrom. 1995, 6, 52-56. (18) Senko, M. W.; Beu, S. C.; McLafferty, F. W. J. Am. Soc. Mass Spectrom. 1995, 6, 229-233. (19) Zhang, Z. Q.; Guan, S. H.; Marshall, A. G. J. Am. Soc. Mass Spectrom. 1997,8, 659-670, (20) Zhang, Z. Q.; Marshall, A. G. J. Am. Soc. Mass Spectrom. 1998, 9, 225233. (21) Horn, D. V.; Zubareu, R. A.; McLafferty, F. W. J. Am. Soc. Mass Spectrom. 2000, 11, 320-332. (22) LeDuc, R. D.; Taylor, G. K.; Kim, Y. B.; Januszyk, T. E.; Bynum, L. H.; Sola, J. V.; Garavelli, J. S.; Kelleher, N. L. Nucleic Acids Res. 2004, 32, W340W345.

Figure 1. Three high-resolution mass spectra: (a) spectrum A, FTICR mass spectrum of a mixture of proteins; (b) spectrum B, plasma ECD spectrum of ubiquitin; and (c) spectrum C, plasma ECD spectrum of carbonic anhydrase.

positive isotopic clusters and to accurately locate the monoisotopic mass by reducing the possibility of 1- or 2-Da mismatch to polyaveragine18 isotopic distributions. In addition, a dynamic peak subtraction method extracts the overlapping clusters in a crowded spectral region. The performance of the AID-MS algorithm was gauged by analyzing three FT-ICR mass spectra of different complexities, and the results were compared with those obtained by THRASH. EXPERIMENTAL SECTION In this paper, AID-MS was applied to interpret three highresolution FTMS spectra of different complexities (Figure 1). Spectrum A is an FTICR mass spectrum (40 scans, 512 k data points, m/z 289-3000 Da, recorded by Bruker Daltonics 9.4T Q-FTMS) of a liquid chromatography-separated fraction of a commercial acid-extracted histone mixture (Sigma Chemicals). The fraction comprises a mixture of large, intact proteins (mainly histone H2A and H2B) and their truncated products. The other two spectra were selected from previously published plasma ECD of large proteins recorded by Cornell 6T FTMS.23 Spectrum B is a plasma ECD spectrum of ubiquitin (100 scans, 256 k data points, m/z 500-2200 Da), and spectrum C is a plasma ECD spectrum of carbonic anhydrase (200 scans, 256 k data points, m/z 5002200 Da). The time domain spectra were processed by two zerofill and no apodization, and converted to ASCII text file format (m/z vs intensity). (23) Sze, S. K.; Ge, Y.; Oh, H.; McLafferty, F. W. Anal. Chem. 2003, 75, 15991603.

Analytical Chemistry, Vol. 78, No. 14, July 15, 2006

5007

Figure 2. Estimation of standard deviation (SD) of the 24th subwindow in spectrum C (a) and necessary baseline recorrection for cluster 2 located at the shoulder of cluster 1 with S/N ) 144 (MaxI ) 16.456, NL ) 0.114) in the baseline-corrected spectrum A (b). In plot a, the original (star) and smoothed (solid) distribution curves of standard deviations using every n ) 31 data points are depicted. The Savitzky-Golay filter uses the first polynomial order and frame size ) 5. SD ) 0.662 is obtained from the m/z value at the maximum position of the smoothed curve.

The source codes of the AID-MS algorithm were developed in MATLAB. After having complied by MATLAB Complier, an executable program was generated for the analysis of highresolution mass spectra. The program reads in a mass spectrum in the ASCII data file format, and therefore, it can be utilized to analyze any high-resolution mass spectra, independent of instrument types and manufacturers. The AID-MS program and the supplementary information can be downloaded from the URL: http://www.bii.a-star.edu.sg/paper/chenli/. ALGORITHM The AID-MS algorithm consists of three parts: (1) estimation of noise level and global baseline correction, (2) identification of isotopic clusters, and (3) assignment of monoisotopic masses to peptide ion fragments for proteins with known sequences. Part 1: Estimation of Noise Level and Global Baseline Correction. A new technique has been proposed to estimate the noise level and subtract the baseline for the whole spectrum. The noise level usually varies in different m/z ranges of a complex high-resolution mass spectrum. Therefore, the whole spectrum is divided into a few subwindows with equal m/z ranges using a user-defined size (N Da). In each subwindow, the noise level is assumed to be uniform, and the intensities of every n data points are used to calculate the standard deviation. The distribution of all standard deviations is first determined and then smoothed by a Savitzky-Golay filter.24 Because noise signals are recorded most frequently, the maximum point in the smoothed curve can be assumed to be the standard deviation (SD) of noise signals. Thus, the noise level (NL)25 can be calculated as follows:

NL ) 2 × SD

(1)

In each subwindow, the baseline is assumed to be flat. To decide whether the ith point is a baseline point, a rectangle with a width covering M data points is constructed with its center on the point.25 If the difference between the maximum and minimum of these M data points does not exceed three times of the noise level, the ith point is considered to belong to the baseline. The mean value of all baseline points within a subwindow is set as (24) Orfanidis, S. J. Introduction to Signal Processing; Prentice-Hall: Englewood Cliffs, NJ, 1996. (25) Golotvin, S.; Williams, A. J. Magn. Reson. 2000, 146, 122-125.

5008

Analytical Chemistry, Vol. 78, No. 14, July 15, 2006

the baseline value of this subwindow. A Savitzky-Golay filter is applied to smooth the noise levels and baseline values of all subwindows. The smoothed noise level and baseline value of each subwindow are first assigned to its center point, and then all of data points are connected by straight lines to generate the noise level curve and baseline curve for the whole spectrum. The baseline-corrected mass spectrum and noise level are used as the inputs for the next process: identification of isotopic clusters. Nevertheless, during identification of isotopic clusters, when an isotopic cluster is located on the shoulder of an identified cluster with a high-intensity (S/N > 30), local baseline correction should be performed using a 4-Da window21 around the maximum peak of the cluster by the same baseline-correction technique in the global baseline-corrected spectrum. The plasma ECD spectrum of carbonic anhydrase with the m/z range from 500 to 2000 Da was demonstrated for noise level estimation and global baseline correction. Using N ) 50 Da as the subwindow size, the whole spectrum was divided into 30 subwindows. Figure 2a shows the estimation of standard deviation of the 24th subwindow. The noise level was estimated by eq 1, and the baseline value was obtained from the mean value of all baseline points (M ) 61). A Savitzky-Golay filter with a 5-point window and the first-order polynomial was applied to smooth the noise levels and baseline values of all subwindows. Finally, the noise level curve and the global baseline-corrected spectrum were generated. Figure 2b shows an example of necessary baseline recorrection for cluster 2, which is located at the shoulder of cluster 1 with S/N ) 144. Local baseline correction can improve the averagine fitting results of cluster 2 in the next section. Part 2: Identification of Isotopic Clusters. Identification of isotopic clusters is the core part of the AID-MS algorithm, and the flowchart is depicted in Figure 3. The algorithm sequentially locates all isotopic clusters in the order of descending maximum peak intensity. First, it finds the maximum peak of the whole spectrum and generates a 1.2-Da processing window. The peak selection method is applied to select possible isotopic peaks, while the charge state or the searching range is estimated using the interpeak spacing of the selected peaks. A noise flag value is used to remove noise signals. Next, the algorithm checks whether the constraints for searching isotopic clusters with charge g 3 are fulfilled. If the constraints are satisfied, the averagine peaks are fitted to the selected isotopic peaks, and the correlation coefficient

Table 1. The Look-up Table of Averagine Isotopic Abundances for a Series of Monoisotopic Masses peak index

Figure 3. Flowchart of identification of isotopic clusters in the AIDMS algorithm.

(CC) and matching error (ME) are calculated. If the CC and ME meet the requirements of a real isotopic cluster, an isotopic cluster is identified and subtracted from the spectrum using a subtracting technique. However, if the selected peaks do not satisfy the constraints of charge g 3 or the CC and ME do not meet the requirements of a real isotopic cluster, the algorithm will reselect a bigger window size of 2.7 Da and check whether it meets the requirements of an isotopic cluster with charge ) 1 or 2. If an isotopic cluster is identified for charge ) 1 or 2, the identified cluster will be subtracted using the same subtracting technique. However, in the case that there is not any isotopic cluster to be found, the maximum peak will be removed as a noise signal. The above procedure is repeated until all peak intensities are below threshold intensity posterior to subtraction of both identified isotopic clusters and noise signals. The threshold intensity (TI) can be determined as follows:

TI ) AIT × NL

(2)

where AIT is called algorithm initiation threshold or S/N threshold for algorithm initiation. The signal-to-noise ratio can be calculated from

S/N )

MaxI NL

(3)

where MaxI denotes the maximum peak intensity within the processing window. In the next few sections, this part containing the following processes will be discussed in more details: (a) averagine peaks and monoisotopic mass, (b) selection of initial window size, (c)

mass

monoisotopic

(Da) 50 100 74 950 75 000

(MPI) 1 1 48 48

intensity

most array abundance size 1 1 17 17

1 2 34 34

1st

2nd

3rd

34th

100.000 100.000 6.070 5.810 8.480 11.910 5.750 5.750 8.410 11.810 5.820

peak selection, (d) charge state estimation, (e) noise flag limit for eliminating noise signals, (f) averagine fitting, (g) subtraction of the identified isotopic cluster, and (h) searching isotopic clusters for different charges. The user-defined parameters in this part mainly include algorithm initiation threshold, correlation coefficient limit (CCL), matching error limit (MEL), and noise flag limit (NFL). Averagine Peaks and Monoisotopic Mass. Since it is rather timeconsuming to determine the theoretical isotopic abundance using the polynomial method,26 AID-MS adopts the THRASH’s look-up table of the averagine isotopic abundance method with modifications to minimize extra computation. In the look-up table, masses span from 50 to 75 000 Da with intervals of 50 Da, and indexing is used for positioning both the monoisotopic peak and the most abundant peak, as shown in Table 1. Due to such a small interval, no interpolation is needed for the masses within the intervals. Once an isotopic cluster is found, the mass can be determined using the m/z of the most abundance peak and the charge state. The closest mass in the look-up table is used for generating the averagine peaks. After the best-fitted averagine peaks have been identified, the monoisotopic mass (monomass) can be derived from the monoisotopic peak index (MPI) in the table, the mass obtained from the m/z of the best-fitted, most-abundance peak, and the charge state (charge) as follows:

monomass ) mass - (MPI - 1) × 1.00235 - charge × 1.007 276 49 (4)

where 1.002 35 Da is the interval of adjacent averagine peaks and 1.007 276 49 is the molecular weight of hydrogen. Selection of Initial Window Size. Identification of isotopic clusters is based on window-wise searching. In each search, the AID-MS algorithm first finds the maximum peak across the whole spectrum and then selects a processing window around this maximum peak. In the following text, left is for the lower m/z side, and right means the upper m/z side in the processing window. Two initial window sizes are defined for isotopic clusters with low and high charge states when searching the best match between experimental isotopic peaks and averagine peaks. The two initial window sizes are (1) when charge g 3, a 1.2Da window (left 0.5, right 0.7) is used; (2) when charge ) 1 or 2, a 2.7-Da window (left 0.6, right 2.1) is selected. For charge g 3, the reason for selection of 0.7 Da (right) is that the m/z range should exceed 0.67 Da to find at least two peaks in the right side, (26) Yergey, J. A. Int. J. Mass Spectrom. Ion Phys. 1983, 52, 337-349.

Analytical Chemistry, Vol. 78, No. 14, July 15, 2006

5009

whereas the 0.5 Da (left) is to make sure that at least one peak can be found in the left side. For charge ) 1 or 2, the selection of 2.1 Da (right) is based on the fact that the m/z range should exceed 2 Da to find at least two peaks in the right side, whereas the 0.6 Da (left) is used for any possible peaks at left 0.5 Da when charge ) 2. Figure 3 deciphers when to select 1.2 or 2.7 Da as the initial window size. It should be noted that the initial window size will be adjusted to contain all averagine peaks for calculation of correlation coefficient and matching error during identification of isotopic clusters. Peak Selection. A novel peak selection algorithm has been applied to find possible isotopic peaks in a processing window. First, all peak maximums above the threshold intensity are sequentially picked in the descending order of their intensities using the peak-finding method proposed by Chen and Garland.27 Since a large number of noise peaks have also been likely picked, it is necessary to remove those noise peaks prior to selection of isotopic peaks. From the descending peak maximum list, the weak peaks within the m/z range of neighboring peak maximums are gradually removed as noise signals if their intensities are less than one-half of the smaller intensity of neighboring peak maximums. After noise removal, the remaining peaks are used as the candidates for isotopic peaks. Their m/z distances (d1, d2, ...) to the first maximum peak are calculated, where d1 is obviously equal to 0 and other m/z distances can have positive or negative signs depending on their peak positions relative to the first maximum peak. Then the m/z distance list can be normalized as follows

[d1, d2,..., dn] i ) 2, ..., n di

(5)

where n is the number of peaks. In each normalization procedure, the peaks are selected as possible isotopic peaks when their corresponding normalized and sorted distances are close to continuous integers (i.e., ..., -3, -2, -1, 0, 1, 2, 3, ...). Thereafter, the number of the selected isotopic peaks and the summation of the selected peak intensities are used to evaluate the most possible isotopic cluster that should have the top-ranked values for both number and the summation value. Figure 4a-d shows the application of the peak selection algorithm to a selected processing window taken from the baseline-corrected spectrum C. First, all of the peak maximums were picked in descending order (peak 1, peak 2, peak 3, ...) as shown in Figure 4b. Next, weak peak maximums between the neighboring peak maximums (peak 1 vs peak 2, peak 2 vs peak 3, ...) were removed as noise peaks. For example, there were several peak maximums between peak 1 and peak 2. Only one peak had the intensity higher than the half magnitude of peak 2, and thus, the peak was retained, and the others were removed. Figure 4c shows the remaining 18 peak maximums as the candidates of isotopic peaks. The m/z distance (d1-d18) of all peak maximums to the highest peak (peak 1) were calculated. In each normalization procedure from [d1, ..., d18]/d2 to [d1, ..., d18]/d18, the peaks within the tolerance (difference to the nearest integers < 0.2) were picked as possible isotopic peaks. As shown in Table 2a, the normalization procedure ([d1, ..., d18]/d10) yielded the (27) Chen, L.; Garland, M. Appl. Spectrosc. 2003, 57, 323-330.

5010

Analytical Chemistry, Vol. 78, No. 14, July 15, 2006

maximum peak number and the maximum summation value of peak intensities, and thus, the corresponding peaks were selected as the most possible isotopic peaks in this window. Figure 4e-h is described in Subtraction of the Identified Isotopic Cluster. Charge State Estimation. The charge state or the searching range of charge states is estimated using the interpeak spacing (m/z distance of neighboring peaks) of the selected peaks. For a theoretical isotopic cluster, the interpeak spacing should be equidistant, and the closest integer of the reciprocal of the interpeak spacing is equal to the charge state. However, it is often observed that the interpeak spacing of an experimental isotopic cluster is not perfectly equidistant. Since the reciprocals of interpeak spacing of those significant peaks in an isotopic cluster are much closer to the real charge state, as compared to those weaker peaks, only the four reciprocals of the interpeak spacing of the five isotopic peaks around the maximum peak are used for charge estimation (charge g 3). If the closest integers of these four reciprocals are the same number, this number is assigned as the charge state. If the closest integers are not the same, the minimum and maximum numbers are used as the searching range of charge states. The mean value of this searching range of charge states is called the average charge. This situation often occurs for those isotopic clusters with high charges, since the difference between 1/charge and 1/(charge +1) is minimal at high charges, and the reciprocal of the interpeak spacing is too sensitive to estimate accurate charge. Thus, for each set of isotopic peaks, either the charge state is determined or the searching range is produced. It should be noted that all isotopic peaks will be used for charge estimation if their number is less than 5. In addition, for the isotopic clusters with charge ) 1 or 2, we use correlation coefficient and matching error to determine the charge state instead of the above charge estimation method. Table 2b shows an example of the charge estimation for the selected isotopic peaks in Figure 4d using the five isotopic peaks around the maximum peak (italic rows). Noise Flag Limit for Eliminating Noise Signals. As mentioned earlier, the peaks in a theoretical isotopic cluster are equally spaced. For those noise signals in coincident with approximately equal m/z distance and S/N g AIT, they can be falsely identified as isotopic peaks by the above peak selection algorithm. However, the intensity distribution of the peaks originated from noise is usually very different from that of real isotopic peaks. Thus, a noise flag value (NFV) is proposed to gauge the identification of false isotopic clusters as follows

NFV )

NP NSP

(6)

where NSP is the number of selected peaks obtained by the peak selection method, and NP is the number of peaks whose intensities are higher than half magnitude of the neighboring m/zsorted selected peaks. A higher NFV indicates a higher possibility for the peaks to be originated from noise signals, and a smaller NFV presents a higher possibility of a real isotopic cluster. Therefore, a noise flag limit is proposed to filter noise signals with NFV > NFL before triggering the averagine fitting. Because the averagine fitting is the most time-consuming step, ignoring noise signals speeds up the performance of the program tremendously. For a lower charge state, the interpeak spacing of isotopic peaks

Figure 4. Explanation of peak selection and subtraction of an isotopic cluster in spectrum C: (a) 1.2-Da processing window; (b) all of the picked peak maximums (star); (c) remaining peak maximums after noise removal (star); (d) selected isotopic peaks (star); (e) fitting of the averagine peaks (square); (f) subtraction of the identified isotopic cluster; (g) real maximum peak (solid), along with simulated peaks using both Lorentzian (dashed, fwhh ) 0.0207) and Gaussian models (dotted, fwhh ) 0.0185); and (h) heuristic factor (ratio of SPW to fwhh) versus noise/height.

is broader so that NFL is set higher, whereas for a higher charge state, only a smaller NFL is used due to the narrow interpeak spacing. NFL is a user-defined parameter. In the program, the default value is NFL ) 1 with average charge g 9 and NFL ) 2 with 3 e average charge < 9. Figure 5 shows the noise flag values of four different isotopic clusters in spectrum A. For the isotopic clusters with high S/N

and obvious isotopic peaks, their noise flag values were very close to 0 (Figure 5a). Two isotopic clusters with significant noise are shown in Figure 5b,c. Figure 5d shows an example of rejecting a noise cluster with S/N ) 5 at a high noise flag value (NFV ) 2.71). Averagine Fitting. Both correlation coefficient (CC) and matching error (ME) are used as the criteria for distinguishing real Analytical Chemistry, Vol. 78, No. 14, July 15, 2006

5011

Table 2. Selection of Isotopic Peaks and Charge State Estimation for the Spectrum in Figure 4c peaks 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

m/z

intensity

Part A m/z distance (d1 ∼ d18)

873.017 21.239 d1 ) 0 873.124 17.739 d2 ) 0.107 873.458 16.638 d3 ) 0.441 872.808 16.140 d4 ) -0.209 872.736 15.140 d5 ) -0.281 873.237 14.338 d6 ) 0.220 872.879 13.440 d7 ) -0.138 872.951 12.740 d8 ) -0.066 872.593 12.441 d9 ) -0.424 872.903 12.440 d10 ) -0.114 872.790 10.440 d11 ) -0.227 873.097 10.439 d12 ) 0.080 873.605 10.237 d13 ) 0.588 872.665 9.141 d14 ) -0.352 873.348 9.038 d15 ) 0.331 872.575 3.241 d16 ) -0.442 873.658 2.837 d17 ) 0.641 873.673 2.637 d18 ) 0.656 no. of selected isotopic peaks summation of the selected peak intensities

[d1, ..., d18]/d2

[d1, ..., d18]/d3

[d1, ..., d18]/d10a

0 1 4.12 -1.95 -2.63 2.06 -1.29 -0.62 -3.96 -1.06 -2.12 0.75 5.50 -3.29 3.09 -4.13 5.99 6.13 7 107.572

0 0.24 1 -0.47 -0.64 0.50 -0.31 -0.15 -0.96 -0.26 -0.51 0.18 1.33 -0.80 0.75 -1.0 1.45 1.49 3 41.091

0 -0.94 -3.87 1.83 2.46 -1.93 1.21 0.58 3.72 1 1.99 -0.70 -5.16 3.09 -2.90 3.88 -5.62 -5.75 10 124.491

m/z

intensity

Part B interpeak spacing

reciprocal

872.575 872.665 872.790 872.903 873.017 873.124 873.237 873.348 873.458 873.605

3.241 9.141 10.440 12.440 21.239 17.739 14.338 9.038 16.638 10.237

0.090 0.135 0.113 0.114 0.113 0.113 0.111 0.110 0.147

11.11f 11 7.41 f 7 8.85 f 9 8.77 f 9 8.85 f 9 8.85 f 9 9.00 f 9 9.09 f 9 6.80 f 7

charge estimationb 9

a The selected peaks in the normalization procedure ([d , ..., d ]/d ) are the optimal isotopic peaks as shown in Figure 4d. b Estimation of the 1 18 10 charge state of the isotopic cluster in Figure 4d using the five isotopic peaks around the maximum peak are shown in italics.

isotopic cluster from noise signals in the averagine fitting. CC measures the similarity28 of the selected peak profile and the averagine peak profile. The peak profile is formed by connecting the neighboring peak maximums by straight lines. ME is quantified by the average value of both the profile error and the shifting error. The profile error is estimated by the difference between the selected peak profile and the averagine peak profile, whereas the shifting error is estimated by the differences of m/z and intensity between selected peaks and averagine peaks. Using the average charge and the m/z value of the most abundance selected peak, the mass can be estimated, and then, the averagine isotopic peaks can be generated from the averagine look-up table. During the averagine fitting, the most abundant averagine peak position may not exactly locate at the most abundance position of the selected peaks because of noise interference. In the AID-MS algorithm, any peak positions within the m/z range taken by the first three highest selected peaks are used as the possible positions for the most abundant averagine peak. As shown in Figure 4d, peaks 1, 2, and 3 are the top three peaks, and peaks 6 and 15 are located between. Thus, the positions of these five peaks (peaks 1, 2, 6, 15, and 3) are the possible positions for the most abundant averagine peak. This step ensures the locations of the real most abundant and monoisotopic peaks. (28) Signal Processing Toolbox (function corrcoef), MATLAB User’s Guide.

5012 Analytical Chemistry, Vol. 78, No. 14, July 15, 2006

From charge state estimation, either the charge state or a searching range of charge states has been generated. In the case that the searching range of charge states is given, for each possible charge state, averagine peaks are matched with the selected isotopic peak by fitting the most abundant averagine peak into all possible positions. The size of the processing window is adjusted to contain all averagine peaks, and then the CC and ME are calculated for each possible match. The highest CC with acceptable ME yields the optimal match, leading to determination of the best-fitted averagine peaks and the optimal charge state. Figure 6a,b shows an example for adjustment of the window size and estimation of CC and the profile error using the peak profiles. As a result of window size readjustment, more isotopic peaks can be identified after window adjustment (Figure 6b). In addition to noise removal using NFL, the correlation coefficient limit and matching error limit are proposed to further check whether the isotopic peaks are originated from a real isotopic cluster or noise signals. If an isotopic cluster has a high correlation coefficient (CC g CCL) and acceptable matching error (ME e MEL), it is considered as a real isotopic cluster. Otherwise, it is assumed that no isotopic cluster can be found in the processing window. Both CCL and MEL are user-defined parameters. For higher charge states, MEL is set to a smaller value because much more isotopic peaks should be involved in the

Figure 5. Noise flag values of four windows in spectrum A. Stars denote the selected isotopic peaks by the peak selection method. (a) NFV ) 0, (b) NFV ) 0.86, (c) NFV ) 1.86, and (d) NFV ) 2.71.

fitting. In the program, the default CCL is set to 0.85, and MEL is 0.15 (charge g 20) and 0.25 (charge < 20). Subtraction of the Identified Isotopic Cluster. To search for further isotopic clusters posterior to identification of an isotopic cluster, the identified isotopic cluster must be subtracted from the spectrum. First, the AID-MS algorithm checks any existing overlapping clusters by the peak selection algorithm and then employs a novel Lorentzian-based subtracting technique to estimate the width of each isotopic peak. In general, the simulated Lorentzian peak profile shows a better approximation to an experimental isotopic peak than the Gaussian peak profile (Figure 4g). Using the Lorentzian model,29 the peak intensity (Ai) and the full-width-at-half-height (fwhhi) at the ith m/z (xi) can be estimated as follows:

Ai )

H xi - x 0 4 fwhh

(

fwhhi )

)

2

(7) +1

2(xi - x0)

x

(8)

H -1 Ai

where H is the peak height, and x0 is the m/z at the peak center position. The fwhh of a peak can be estimated by the mean value of all fwhhi. When the Lorentzian peaks were simulated using the same fwhh (fwhh ) 0.2) but different peak heights (H ) 16, 8, 4, 2, 1, 0.5, 0.25), we observed that the peak width becomes broader with increasing peak height. Thus, the subtracting peak width (SPW) should be related to the peak intensity and fwhh. According to the ratios of the baseline (using noise level for an (29) Chen, L.; Garland, M. Appl. Spectrosc. 2003, 57, 331-337.

experimental peak) to the peak height, the heuristic factor of SPW to fwhh is depicted in Figure 4h for the noise-to-height ratio (noise/height) ranging from 0.01 to 0.2. When noise/height is out of the range between 0.01 and 0.2, this factor is constrained to 10 or 1, respectively. For an identified isotopic cluster, the data (H, xi, and x0) of the most abundant peak are used to determine fwhh, which will be employed for all peaks in the isotopic cluster. Next, the ratio of the noise level to each peak height is estimated, and accordingly, the SPW of each peak is obtained from the product of fwhh and the heuristic factor extracted from Figure 4h. In Figure 4g, the ratio of the noise level to the peak height was around 0.1, and it can be observed that the predicted width (SPW ) 0.062) is quite close to the real width. In the case that there is an overlapping cluster peak within the subtracting peak width, the subtracting peak width should be replaced by the half distance between the identified isotopic peak and the overlapping peak. Finally, the peak intensities within the subtracting peak width are subtracted from the spectrum using the best-fitted averagine abundance distribution. For those intensities lower than averagine abundance, they are suppressed to zero. Figure 4e shows the fitting of the averagine peaks to the selected peaks, and the monoisotopic mass and assigned peptide ion fragment of this identified isotopic cluster are shown in Table 5 (Figure 7-4, 9+). Figure 4f shows an example for the subtraction of an identified isotopic cluster in consideration of coexistence of an overlapping cluster. Although the smallest m/z distance (∆d ) 0.018) between the identified isotopic peaks and the overlapping isotopic peaks was much less than 0.05 Da, the overlapping isotopic peaks were successfully retained by this subtracting technique. Searching Isotopic Clusters with Different Charges. As shown in Figure 3, searching of isotopic clusters was performed separately in two different charge ranges: charge g 3 and charge ) 1 or 2. Analytical Chemistry, Vol. 78, No. 14, July 15, 2006

5013

Figure 6. Adjustment of the window size when charge g 3 (a, b) and identification of an isotopic cluster with charge ) 2 (c, d) (spectrum C): (a) initial window of 1.2 Da (stars, selected peaks; squares, averagine peaks); (b) adjusted window for containing all averagine peaks; the averagine peak profile (squares and solid) and the selected peak profile (stars and dashed) are used to calculate both the correlation coefficient and the profile error (fragment, z68; mass difference, 0.372 Da); (c) isotopic cluster with charge ) 2 found in the 2.7-Da window (fragment, c9; mass difference, 0.025 Da); and (d) subtraction of the identified isotopic cluster using the peak subtracting technique. Table 3. Comparison of Features between THRASH and AID-MS Algorithms comparison

THRASH

AID-MS

searching direction

from the lower m/z limit to the upper m/z limit

in the order of descending intensity of the maximum peak

noise level estimation and baseline correction

4-Da window

(1) global noise level estimation and baseline-correction (2) baseline is recorrected when the studied isotopic cluster is located on the shoulder of an isotopic cluster with S/N > 30

window size

1-Da moving window (high charge) 2-Da moving window (105), isotopic clusters are often extensively overlapped due to the presence of hundreds to thousands of isotopic peaks crowded into a narrow range. For example, the ECD spectrum of a large protein can produce more than 500 fragments, and large fragments can comprise more than 20 isotopic peaks. The peaks from different clusters are usually intermingled in the spectral region from 500 to 1800 Da. To identify these overlapping isotopic clusters, particularly the weak clusters buried under the strong clusters, successive subtraction of the identified strong isotopic peaks is essential. Because the resolution of mass spectral peaks depends on many parameters, a fixed subtracting peak width cannot universally work for every isotopic cluster. The AID-MS algorithm employs a Lorentzian-based subtracting technique to ensure that the identified peaks are exactly and completely subtracted from the spectrum in the overlapping region without leaving any

residues, since the incomplete removal residues may influence the identification of the remaining clusters. Combining this robust peak-subtracting method with the AID global isotopic clustersearching method, all isotopic clusters above the user-defined AIT can be exhaustively identified. Figure 7 shows four spectral regions with different degrees of complexity in the carbonic anhydrase plasma ECD spectrum. In Figure 7a, the best-fitted averagine peaks of all isotopic clusters were plotted with the baseline-corrected mass spectrum using AID-MS. The best-fitted averagine peaks were automatically recorded during the implementation of AID-MS. Figure 7b shows the best-fitted averagine peaks with the original mass spectrum using THRASH. The bestfitted averagine peaks of all isotopic clusters were reconstructed using the averagine look-up table and the peak list (charge and monoisotopic mass) exported from the MIDAS analysis software. The intensity of the most abundant averagine peak was estimated by the average intensity of isotopic peaks at the m/z positions of the top three averagine peaks. Table 5 shows the monoisotopic masses, charges and peptide ion fragments for the identified isotopic clusters in Figure 7. As shown in Figure 7, the AID and peak subtraction methods employed by AID-MS are more robust in identification of complicated overlapping isotopic clusters than the moving window and fixed-peak-width-subtraction method. To highlight the efficiency of the AID-MS in identification of overlapping isotopic clusters, four examples are discussed in the following study. The first example (Figure 7-1) shows that AIDMS is able to extract a weak isotopic cluster, even in the presence of a strong overlapping noise spike. The strong noise spike and the weak isotopic cluster were decoupled at different steps during program execution. Although the noise spike was removed in the earlier stage according to the order of its intensity among other peaks in the whole spectrum, the relatively weak 6+ cluster was successfully identified without the interference of the strong spike in the later stage. Examples 2-4 (Figure 7-2-4) again demonstrate the advantage of AID and the peak-subtraction method to exhaustively identify all isotopic clusters in an extensively overlapped spectral region, whereas some weak isotopic clusters were missed by the moving window method (Figure 7b). The plasma ECD spectra of ubiquitin (spectrum B) and carbonic anhydrase (spectrum C) that had been analyzed by using a combination of THRASH and manual analysis were reinterpreted by AID-MS. Owing to its robust capability to resolve overlapping isotopic clusters, the AID-MS algorithm is able to discover a number of new isotopic clusters with significant S/N that have been missed in previous analysis. Forty-nine new isotopic clusters in spectrum B were assigned to ubiquitin ECD fragments, and 50 new isotopic clusters in spectrum C were successfully assigned to carbonic anhydrase ECD fragments. Minimization of False Positives. Manual validation of the isotopic clusters, picked up by an automated analysis program from a complicated high-resolution mass spectrum, is very tedious and time-consuming. We implemented several techniques in different stages to minimize false positive isotopic clusters in the final results. In addition to the methods such as estimation of noise levels, selection of isotopic peaks, and implementation of the noise flag limit, which can exclude noise spikes prior to triggering the averagine fitting, a second layer of criteria, such as correlation coefficient limit and matching error limit, have been introduced Analytical Chemistry, Vol. 78, No. 14, July 15, 2006

5017

to measure the degree of the matching of experimental peaks to the averagine pattern. These two layers of criteria firmly safeguard against false positives originated from the noise background. Since MEL and NFV are dependent upon charge states, different MEL and NFL values have been used for different charge states. Default values used in AID-MS to minimize false positives are CCL ) 0.85, MEL ) 0.15 for charge g 20 or MEL ) 0.25 for charge < 20, and NFL ) 1 for charge g 9 or NFL ) 2 for 3 e charge < 9. It should be noted that all of these values can be user-adjustable to fit special situations. The averagine fitting is initialized by the most abundant peak in an isotopic cluster. Owing to overlapping isotopic peaks and random noise superimposed on signals, sometimes the strongest peak in the spectrum is not the most abundant peak in the corresponding isotopic cluster. Even though the whole pattern shifts by 1 or 2 peaks (Da) from the correct position, it still can pass the criteria for a real isotopic cluster because most of the shifted isotopic peaks fall on the top of averagine isotopic peaks. Incorrect location of the most abundant peak in an isotopic cluster results in shifting of the final monoisotopic mass by an integer (usually 1 or 2 Da) value. To reduce such a mass error, our program matches all of the possible positions on the basis of the three most abundant isotopic peaks to the most abundant averagine peak. The best match is then chosen for determination of the monoisotopic mass. CONCLUSION In this paper, we have presented a new algorithm for automated analysis of complex high-resolution mass spectra. The AIDMS algorithm employs several novel techniques for selection of

5018

Analytical Chemistry, Vol. 78, No. 14, July 15, 2006

isotopic peaks, subtraction of identified isotopic clusters, and noise elimination to optimize overall speed of interpretation, to exhaustively identify all isotopic clusters, and to minimize false positives. The algorithm has been successfully applied to analyze three highresolution Fourier transform mass spectra; however, it should be generally applicable to complex high-resolution mass spectra of large molecules recorded by other instruments. Reanalysis of the two previous published ubiquitin and carbonic anhydrase plasma ECD spectra by AID-MS has located and assigned 49 and 50 new isotopic clusters to the corresponding ECD fragments of ubiquitin and carbonic anhydrase, respectively. Eighteen new bond cleavages have been identified from the 50 new clusters of carbonic anhydrase. ACKNOWLEDGMENT We thank David Horn for invaluable help and discussion and the Singapore Agency for Science Technology and Research (A*Star) for its generous financial support. SUPPORTING INFORMATION AVAILABLE Data for 611 identified clusters for carbonic anhydrase; spectra for 86 new isotopic clusters for carbonic anhydrase, found by AIDMS; and assignment of 439 peptide ion fragments for carbonic anhydrase. This material is available free of charge via the Internet at http://pubs.acs.org.

Received for review January 13, 2006. Accepted May 2, 2006. AC060099D