Signal Detection in High-Resolution Mass Spectrometry Data

Jan 4, 2008 - signal-to-noise ratio, normalization, and quantification of peak intensities. ..... the Virginia Prostate Center Tissue and Body Fluid B...
0 downloads 0 Views 1MB Size
Signal Detection in High-Resolution Mass Spectrometry Data Dale F. McLerran,† Ziding Feng,† O. John Semmes,‡ Lisa Cazares,‡ and Timothy W. Randolph*,† Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, and Eastern Virginia Medical School, Norfolk, Virginia 23507 Received October 4, 2007

Mass spectrometry data from high-resolution time-of-flight instruments often contain a vast number of noninformative background-ion peaks whose signal is similar to that of peptide peaks. Consequently, seeking peptide signal in these spectra based on a signal-to-noise ratio will remove signal peaks as well as noise. This work characterizes the background as a precursor to seeking peptide-related features. Robust-regression methods are used to estimate distributions for null (background) peak intensities and locations. Defining signal peaks as outliers with respect to these distributions leads to more precision in detecting the isotopic envelope of peaks from low-abundance peptides in high-resolution spectra. Keywords: Mass spectrometry • high resolution • background signal • robust regression, isotopic peaks • wavelet

1. Introduction Algorithms for extracting peptide information (e.g., peaks) from mass spectrometry (MS) data are now common and widely varied in their approaches. These preprocessing methods perform a variety of tasks necessary for researchers to subsequently use MS spectra either for classification of serum and tissue samples or for identification of protein content via tandem MS. The challenges posed by data from MALDI-based [Matrix Assisted Laser Desorbtion/Ionization; this includes the Surface-Enhanced Laser Desorbtion/Ionization (SELDI)] timeof-flight (TOF) instrumentation, as highlighted by Yasui et al.,1 spurred a flurry of interest in improving processing methods, including spectrum alignment/registration, smoothing/denoising, adjusting for noninformative baseline, estimation of a signal-to-noise ratio, normalization, and quantification of peak intensities. All of these efforts are aimed at extracting peptideinduced signal that can be consistently identified across a large (e.g., 102 to 103) collection of samples. In low-resolution instrumentation, peptide signal appears as noisy peaks in the raw spectra and corresponds to local maxima (at a certain scale); a peak is assumed to represent a single peptide, and each detectable peptide produces a single peak. The present work considers data from high-resolution instrumentation for which the challenges are similar but the signal from each peptide appears as a sequence of peaks from several isotopes and, more problematically, is typically accompanied by a vast number of potentially noninformative “background” peaks whose signal strength can be similar to, and even greater than, informative peaks. These data, such as output from an electrospray ion source or the reflector mode of MALDI-based instruments, have different characteristics than the lower-resolution, linear-mode MALDI spectra. The * To whom correspondence should be addressed. E-mail: trandolp@ fhcrc.org. † Fred Hutchinson Cancer Research Center. ‡ Eastern Virginia Medical School.

276 Journal of Proteome Research 2008, 7, 276–285 Published on Web 01/04/2008

process of extracting relevant peaks that make up such a profile is, in many ways, harder in these higher-resolution spectra and has a larger impact on the subsequent analysis. Figure 1 exhibits a raw spectrum in a region that is further analyzed for signal in Section 4. Superimposed on it is a processed version (as described in Section 3.5) along with tickmarks that indicate potential peptide features as determined by the methods developed in Section 3.3. One method for defining signal in low-resolution MS spectra proceeds within the framework of modeling each spectrum, x, as a sum of peptide signal, s, a low-frequency baseline, b, and random noise, ; that is, x(t) ) s(t) + b(t) + (t), where t denotes either a time-of-flight (TOF) or mass-per-charge (m/z) index. This model may also be combined with a warping function, w, so that an alignment t′ ) w(t) can be estimated along with the baseline and signal. This statistically rigorous approach can provide valuable insight to the nature of variation in the process of defining signal. The data we consider here, however, is substantially less smooth, has many nonsignal peaks, and contains additional idiosyncrasies that make this elegant model inappropriate. In particular, a model of noninformative signal, x(t) - s(t), would need to be more complex than just lowfrequency trend plus Gaussian noise since features arising from a background of chemical ions are similar to those arising from peptide-related ions. The processing of low-resolution spectra is less sensitive to a precise definition of peptide signal since the scales of noninformative features are at the extremes (finescale high-frequency noise or wide long-term trends) with the scale of peptide-related peaks in between. A detailed description of the noise characteristics for the high-resolution data we consider is given in Section 2, but we note here that peptide peaks may be buried within a highly regular background of ion peaks from nonprotein sources, and the characteristics of these background-ion peaks vary as a function of m/z. The problem is illustrated by Figure 1 which shows that simply identifying peaks is not sufficient for extracting peptide information. 10.1021/pr700640a CCC: $40.75

 2008 American Chemical Society

Signal Detection in High-Resolution MS Data

research articles

Figure 1. (Top curve) A portion of one spectrum from the reflector mode of a MALDI-TOF instrument. (Bottom curve) An extracted component based on a certain scale of features in the spectrum. Tickmarks identify locations of features deemed not part of the background in the data set. Cf., Figure 5.

Rather than pursue a model-based processing, we extract all features in a manner similar to Randolph and Yasui2 and use robust-regression methods to estimate a distribution for peak intensities that are nonsignal. In addition, we similarly estimate a distribution of nonsignal peak locations. The latter is possible since there exists a pervasive set of “null” peaks which appear in a nearly periodic manner throughout the spectra. As a result, the signal-based peaks are those that are outliers in a null (i.e., background) distribution of intensity values or a null distribution of m/z location values (modulo a frequency, ω) of background peaks.

2. Peptide-Ion Signal and Background-Ion Signal The problem addressed here is similar to that considered by Kast et al.3 whose methods aim to enable a rapid identification of precursor ions and their charge states for further analysis by tandem MS. The goal of the filtering method proposed there is to identify peptide–ion peaks when their signal is very similar to that of chemical-ion peaks. For additional background on the physical and chemical nature of certain types of nonpeptide signal, we refer to a series of columns discussing chemical noise in mass spectrometry by K. Busch.4 Although Kast et al.3 consider data from electrospray ionization (ESI) TOF instruments, similar properties of the chemical noise are inherent in output from MALDI Q-TOF and reflectormode platforms. These instruments, when used for the purpose of identifying precursor ions for further analysis by tandem MS, may rely on software that picks out the peak having highest intensity and then sequentially runs through the list of decreasing intensities, processing peaks until a noise threshold is reached. When the spectra are used for classification of case versus control samples, most preprocessing/peak-selection methods also rely on a threshold defined in terms of a signalto-noise ratio. The definition of such a threshold is a key point in our developmentswhat defines signal and what defines noise? The concept implicitly assumes a discernible separation between informative and noninformative signal. As seen in Figure 1, this cannot be done by simply comparing peak intensities. The general problem of extracting low-intensity peptide signal from data of this type is less well-developed than for their low-resolution counterparts and, due to the sensitivity to both peptide signal and chemical noise, requires somewhat different methods. In particular, the fact that a single peptide gives rise to several isotopic peaks provides information that

can be used to extract non-noise peaks. In particular, for a peptide species having charge z, its isotopic peaks are separated by 1/z Da. Peptides seen in spectra from MALDI-based instruments primarily have a single charge, z ) 1, although each spectrum may contain a few doubly charged peptides. With the goal of identifying mass regions having elevated intensity across multiple isotopic peaks, a smooth enveloping curve which groups adjacent isotopic peaks can be formed so that each peptide is represented by a single peak. This is proposed by Yu et al.5 in which an envelope-detection process then precedes a search for those peaks whose local intensity exceeds a threshold. This approach transforms high-resolution data into an approximation of low-resolution data in a process that may mask low-intensity peptide signal. Aiming at a more subtle use of isotopic peaks, the pattern of their intensities was studied empirically in Gay et al.6 and exploited by Gras et al.7 in which an m/z-dependent template of this pattern is used to extract peptide information. Here, we use a Poisson distribution to model this pattern, as proposed by Breen et al.8 and Bellew et al.9 An additional useful property of peptide-related peaks is the phenomenon of peptide mass clustering (initially observed by Mann10) which describes the fact that the atomic composition of peptides implies that the list of all potential masses of peptides is not smoothly distributed across the range of m/z values but, instead, due to the limited elemental composition of 20 common amino acids, clusters into high-density regions that are separated by approximately 1 Da, and peaks in this density tail off as a function of m/z (see Figure 4 in Gay et al.6). Since a peak in a mass spectrum that does not occur within one of these high-density regions has a low probability of being peptide-related, this phenomenon has been used, either via an empirical description7 or a rigorous analytical model,11 to aid in the identification of potential peptide peaks. The focus here is on extracting potential peptide signal from spectra that contain background-ion peaks in addition to finescale noise and low-frequency trends. We note, in particular, that prior to matching a set of peaks to an isotopic template or computing whether the mass of a peak reasonably matches the distribution of peptide mass clusters, one must start with a peak. The aforementioned methods seek peaks with intensities above a threshold of some sort; the exception is Kast et al.3 in which a Fourier filtering method is used to remove periodic background signal, revealing low-intensity peptide–ion peaks. Without a threshold or filtering step, the process of Journal of Proteome Research • Vol. 7, No. 01, 2008 277

research articles template matching may be overwhelmed by noninformative peaks that dominate at low intensities. However, there is apparently peptide signal at or below the intensity of features coming from this background. Increasing sensitivity to extract this signal is the goal of Kast et al.3 and is our focus as well. Our approach is, however, somewhat reversed: rather than exploit the theoretical properties of peptide signal, we consider a statistical characterization of the noise with the aim of identifying signal as those peaks which are outliers in the distribution of background signal. In fact, the properties of background peaks can be characterized by the distributions of two components of the signal: peak intensity and peak location (modulo a frequency). These distributions are estimated using an entire data set, and so signal is defined with the aid of the central limit theorem, rather than an attempt to denoise individual spectra. A final introductory remark is that the focus here is on extracting signal from a data set of one-dimensional MS spectra as opposed to individual two-dimensional samples, as obtained from a time-course collection of spectra in liquid chromatography–mass-spectrometry (LC-MS) experiments. In those data, one can monitor a peak across multiple spectra as an aid to determining whether it comes from a peptide–ion source. In lieu of exploiting a chromatography dimension (peptides eluting over time), we use a collection of spectra to characterize noise and gain power in detecting signal. One may, however, adapt the ideas to LC-MS data or to individual spectra for subsequent analysis by tandem MS. 2.1. Characteristics and Consequences of BackgroundIon Signal. The background, or noninformative, structure in each spectrum contains peaks that are separated by approximately 1 Da. Although this frequency is nearly constant over the entire m/z range, the quality of the signal degrades and a closer look at the characteristics reveals a nonstationary signal with the following properties: (i) Peak locations are periodic with a frequency ω slightly more than 1 Da. (ii) Peak intensities exhibit some variation locally, and an overall decrease globally as a function of m/z. (iii)Peak-width-at-half-height (PWHH) increases with m/z (however, this width is constant with respect to the t scale). (iv)Signal of all types degrades with intensity. In particular, the smoothness of background decreases with increasing m/z in the sense that the lag-1 autocorrelation decreases. (v) Background peaks often share locations with peptide peaks, and the shapes of all peaks are similar. An additional property supported by empirical observation is the additivity of ion signal. That is, the intensity of a peak at m Da equals the intensity of any background ion at m plus the intensity of any peptide–ion at m. The consequence of this is that a peptide whose location coincides with a background peak will exhibit an intensity greater than the background. In addition, peptide–ion peaks that are offset from the periodic background peaks can be detected even when the peptide peak intensity is at or below that of the background. With regard to the properties of peak location, recall that a calibrated m/z axis is one that has been converted from t (or TOF, the instrument-output equally spaced ‘clock-ticks’) to m/z by an approximate physical model m/z ) (at + b)2, t > 0, which is estimated by a least-squares fit (see, e.g., ref 12). This typically uses an external calibration spectrum, although the intrinsic properties of peptide mass clustering have also been used.13,14 278

Journal of Proteome Research • Vol. 7, No. 01, 2008

McLerran et al. Item (iii) observes that the number of TOF measurements per PWHH is independent of t, whereas PWHH increases as a function of m/z. For this reason, all initial processing that relates to the scale or width of features (Section 3.5) will be performed on the t axis. However, inference about peak location (Section 3.3) is a Dalton-based concept and uses the m/z axis. As an aside, note the property that background peaks are periodic relative to the m/z axis but not the t axis indicates a (bio)chemical rather than instrument source. The most common approach for dealing with periodic noise is to use Fourier analysis to estimate its frequency and filter out this component from each spectrum. J. Kast and coauthors3 adeptly illustrate that Fourier analysis does, indeed, provide a workable solution for these types of data. The authors point out, however, that a simple Fourier filtering is insufficient due to the m/z-dependent characteristics of the signal (see (i)-(v) above), and so several modifications are required. Among them is the implementation of a windowed Fourier transform and a second filter applied in the Fourier domain, along with an initial insertion of imputed intensity values to overcome the heterogeneous sampling of features (item (iii) above). In the end, it seems that Fourier analysis is not the optimal tool for this purpose. Although the spectra do contain features that occur at regular intervals (on the m/z scale), there is no purely periodic source such as exists, for example, in many electronic, mechanical, or astrophysical signals. In an alternate, but similar, approachsthe one taken heresthese features have characterizable statistical properties that can lead to the identification of peptide–ion peaks as outliers among this background signal.

3. Materials and Methods 3.1. Samples, Sample Preparation, and Instrumentation. 3.1.1. QC Samples. Serum samples were obtained from the Virginia Prostate Center Tissue and Body Fluid Bank. Serum procurement, data management, and blood collection protocols were approved by the Eastern Virginia Medical School Institutional Review Board. Blood sample were collected into a 10 cm3 Serum Separator Vacutainer tubes and centrifuged 30 min later after clotting at 3000 rpm for 10 min. The serum was distributed into 0.5 mL aliquots and stored frozen at -80 °C. A quality-control (QC) serum sample was prepared by pooling equal amounts of serum from each specimen of an age-matched normal group and storing 0.1 mL aliquots at -80 °C. Spectra were generated from 64 aliquots. 3.1.2. Phase II Samples. Serum samples were obtained as part of a phase II study on prostate cancer performed by the National Cancer Institute’s Early Detection Research Network (EDRN). Seven participating biorepositories identified potential samples and provided demographic, diagnostic, and specimen information for each potential sample. The EDRN’s data management coordinating center identified samples that met basic eligibility requirements, including specimen storage at -70/-80 °C within 4 h of sample collection specimen and storage time less than 3 years with at most one previous thaw after being placed into storage. Each sample was placed into one of five diagnostic groups. Only one of these groups was chosen for presentation here, although that being presented appears to be representative of all five groups. This group is a secondary control group of 83 individuals with no history of prostate cancer but a history of some other cancer. A random sample of 50 eligible specimens was selected under restrictions imposed to achieve as much balance as possible across

Signal Detection in High-Resolution MS Data diagnostic groups for age and race. Three aliquots were formed from each specimen. These spectra were collected secondarily to a larger study,15 and due to limits of sample amounts, only a total of 76 aliquots were available. Sample from each of these aliquots was spotted twice on a MALDI plate to produce a total of 152 spectra. 3.1.3. Sample Preparation. For each analysis, 20 µL of serum was incubated with 10 µL of weak-cation exchange magnetic beads for 10 min on the ClinProt robotic platform as per instructions of the instrument manufacturer, Bruker Daltonics. Unbound proteins were discarded, and each sample was washed twice in binding buffer. Bound proteins were eluted and spotted in duplicate on an AnchorChip sample target platform (384 spots), after mixing 1:10 with cyano-4-hydroxycinnamic acid in an acetone ethanol mixture of 1:2. Samples were assayed randomly and blinded to the operator. Between the evaluation of the 64 QC samples and the 152 phase II samples, there was a change in the manufacturing process of the AnchorChip platform. This change produced some subtle alterations in the spectra. 3.1.4. Instrumentation. Profile spectra were acquired from an average of 400 laser shots in reflectron mode on an Ultraflex I MALDI-TOF instrument manufactured by Bruker Daltonics. Spectra were calibrated externally using a standards mixture containing peptides of known mass in the range of 1046-3147 Da. 3.2. Statistical Characterization of Background Signal. The proposed procedure is in the spirit of simple hypothesis testing: for any given peak, its properties of intensity and/or location are either indistinguishable from those of the background-ion peaks, or they are outliers; the latter are defined to be nonbackground peaks. This requires, of course, a statistical characterization of the bulk of peaks that make up “background signal”. For this, we characterize distributions of the two defining properties of the background: peak location relative to the periodic nature and peak intensity. To estimate the distributions that determine the location outliers and intensity outliers, that is, to extract non-noise peaks, a preprocessing of the raw spectra is needed, one that leads to a collection of all candidate peaks in all spectra. This collection of admissible peaks should include all backgroundion and peptide–ion peaks. It should not include local maxima that arise from fine-scale noise, and it should quantify intensities in a way that is not subject to low-frequency background trend. For this, our presentation relies on a scale-based processing using a wavelet decomposition. The implementation of this is described in Section 3.5 with details in the Appendix, but so as not to distract from the main focus we first describe the process that identifies potential peptide–ions (Subsection 3.3). The essence of this process is simple: since the spectra are dominated by background peaks, a peptide peak may be identified by an intensity that is significantly different from background peak intensities. Similarly, if ω is the periodic frequency of the background peaks, then a peak is not attributable to noise if its location is significantly different from 0 (mod ω). The notation used here is x ≡ y (mod ω) if x - y is an integer-multiple of ω. At issue is what is meant by “significantly different”. One reasonable approach is to simply compute the mean and variance of intensities within each spectrum and then define informative peaks as those that differ from the mean by a few standard deviations. Our approach is similar but uses a more

research articles careful consideration of the background characteristics (Section 2.1). For example, since peak intensities decrease, a constant mean is replaced by a local mean estimate that also decreases (linearly) as a function of m/z. Note, too, that the measure of variance typically used, ∑(yi - y)2, is a function of signal of all types, background as well as peptide-induced. But peptide signal may be identifiable as signal which differs from that of background. Using a robust-regression estimation procedure16,17 for this alleviates the influence of the high-intensity peptideion peaks and provides a better estimate of both expectation and variance of background signal. In addition, subtle shifts in location can be as informative as intensity and so this variability is also quantified. Finally, our admissible-peak detection process also adapts to signal degradation at higher m/z by a continuous change in bandwidth used to define peaks, as described in section 3.5.1. 3.3. Robust Regression: Density Estimation for Background Intensity and Location. We begin by assuming that a collection of admissible peaks has been defined. Specifics of the method used here to define this set are given in the next section, although other methods may be used. Of importance is that this collection of admissible peaks contains a list of locations and intensities for all peaks in all spectra that represent, as closely as possible, all background and peptide– ions. The process of extracting potential peptide signal from background consists of fitting M-type robust regression models aimed at characterizing the bulk of background features (both location and intensity) in the presence of a sparser set of peptide peaks. The first step is preliminary in the sense that its purpose is to prune the admissible set of peaks by removing those having very small intensity. This is analogous to removing peaks that fall below a given signal-to-noise ratio. In this analogy, however, both ‘signal’ and ‘noise’ refer to peak intensities (as opposed to total spectrum variability), and the ratio used here is 15-20 times smaller than thresholds typically used. The second step quantifies the variability in the location of a peak relative to the periodic structure. Let ω denote the frequency; for the QC data exhibited here, ω is estimated to be 1.00035 Da. Of interest is not the actual m/z value of a peak, but rather its offset relative to the periodic background. Therefore, we will identify a peak’s location M with the value of M (mod ω). That is, we identify M with the equivalence class of all peak locations that differ by an integer multiple of ω; we use the notation [M] ) {admissible peak locations N: N ≡ M (mod ω)}. The set M : ) {[M]: there is an admissible peak at mass M} contains all peak location offsets. These are offsets from the period ω where 0 e [M] < ω, and [M] ) [0] if and only if M is an integer multiple of ω. In particular, the mean µM of the set M should be approximately zero since the spectra are dominated by background peaks. For convenience, we will shift everything by 0.5ω so that µM ≈ 0.5. We conjecture that if the spectra are calibrated precisely, and if the frequency m/z is estimated correctly, then the mean offset will be independent of ω. Lacking this, then the mean offset will (locally) be a linear function of m/z, and so we model it as µM(m/z) ) a + b(m/z). In general, allowing for a nonconstant mean is desirable as it will make the estimation process more robust. Figure 2a plots the location offsets as a function of m/z in the QC data for ω ) 1.00035. A third step quantifies variability in intensity among the remaining admissible peaks. Since the second step removes Journal of Proteome Research • Vol. 7, No. 01, 2008 279

research articles

McLerran et al.

Figure 2. (a) Location offsets of background peaks in the QC data plotted as a function of m/z with ω ) 1.00035. (b) Log of intensities of all background peaks plotted as a function of m/z.

peaks whose locations are outliers relative to the periodic background, the remaining admissible peak set provides an improved estimate of the distribution of background peak intensity. Figure 2b plots the resulting intensities as a function of m/z in the QC data. These three steps are summarized as follows. For notational clarity, m/z will be denoted simply by m when used as an independent variable in a function or equation. 0. Initially declare all peaks as background. Using the entire set of intensities, I, fit a robust regression model to estimate the expected background intensity as a linear function of m/z: µI(m) ) a + bm. Retain as admissible peaks those having intensities exceeding 0.2µI. 1. Location distribution: Fit a robust M-type regression model for the set of offsets M among the admissible peaks. This produces a distribution for M with a mean modeled as µM(m) ) a + bm and a standard deviation σM(m). If the studentized residual of an admissible peak has absolute value exceeding a specified critical value, it is a location outlier and considered to be nonbackground; it is labeled as such and removed from the set of admissible peaks. (The critical value 2.5 is used here.) 2. Intensity distribution: Fit a robust M-type regression model as in (0) for the set of intensities among the remaining admissible peaks. This produces a refined set of intensities I with a mean modeled as µI(m) ) a + bm, and a standard deviation σI (m). A peak with studentized residual exceeding a specified critical value is identified as a (high-) intensity outlier and removed from the set of admissible peaks. (The critical value 2.35 is used here.) After both location and intensity outliers are removed, the set of admissible peaks should primarily consist of background peaks. Of course, we do not know what this set is when seeking peptide peaks, but the robust-regression process in each step aids in estimating its distribution. The estimation is improved at each step as potential peptide peaks are removed. In summary, a potentially informative peak is distinguished from background by deciding whether its location or intensity is different from background, as determined by the two distributions constructed here. In each case robust-regression studentized residuals18 are used to identify peaks which differ from the expected location or expected intensity of the background. Studentized, rather than simple, residuals are used ^ for the following reason. Simple residuals ri ) yi - xiTβ have variance σ2(1 - hi) (for hi ) xi(XTX)-1xi) which depends not only on σ but also on the distance |xi - x|; that is, the variance of the simple residual is not constant. However, Studentized ^

residuals, ri ⁄(σ√1-hi), have constant (unit) variance. Conse280

Journal of Proteome Research • Vol. 7, No. 01, 2008

quently, a single universal critical value may be used which is not appropriate with simple residuals. 3.4. Specifics on Admissible Peaks in a Collection of Spectra. In practice, rather than quantifying spectrum properties globall, the procedure of the previous section is performed locally on overlapping segments of prespecified width; a width of 200 Da is used here. In particular, if the spectra are miscalibrated, then the mean will vary as a function of m/z, but locally, this variation is minimal and accounted for (partially) by allowing µM to be a linear function of m/z. Moreover, in view of items (ii) and (iv) in Section 2.1, the variance of background-peak location offsets {[M]: M is the mass of an admissible background peak} increases as a function of m/z but slowly enough that it is relatively constant locally, for example, within a 200 Da window. Also in practice, when analyzing groups of spectra with the goal of detecting disease-class information, we have chosen to implement the procedure of Section 3.3 on a single mean n spectrum X(m) ) 1⁄n∑ i)1 Xi(m) obtained from the collection of all aligned (and possibly otherwise processed) spectra{Xi}ni)1. (In fact, we use processed versions Di obtained from the Xi as described in the Appendix.) By the central limit theorem, the mean intensity at a peak location m0 is asymptotically normally distributed with mean µ(m0) and variance σ2(m0). For all background peaks in a neighborhood of m0, it is reasonable to assume that the variance is constant and the mean intensity is linearly related to m. Therefore, among all background-peak K locations {mk}k)1 (K ≈ 200/ω) across a 200-Da width window, we assume X(mk)∼N(µI,σ(m)2) ) N(a + bm,σ2). In fact, normality is not used, merely the common mean and variance for the background peak intensities. A similar justification can be made for the estimation of background-peak location offsets by using a 200-Da width window of a mean spectrum. Figure 3 illustrates these ideas using the QC data set. The gray curve shows the density histograms of location offsets for all background peaks as identified by Steps 0, 1, 2; that is, the null distribution of location offsets for the set of admissible peaks after both location and intensity outliers are removed. This is contrasted with the blue curve which is the density of location offsets for peaks identified as location outliers by Step 1. Also shown, in black, is the density of location offsets for peaks that are not only location outliers but also intensity outliers, as identified by Step 2. Parts a, b, and c exhibit these in three 200-Da width windows, [700, 900], [2000, 2200], and [3300, 3500], respectively. In Figure 4, the gray curve shows the density histogram of log intensities for all background peaks; that is, the null distribution of intensities for the same set of peaks that

research articles

Signal Detection in High-Resolution MS Data

Figure 3. Density histograms of location offsets in three 200-Da regions of the spectra. Gray, density of location offsets for all background peaks as identified by Steps 0, 1, 2; blue, for peaks identified as location outliers by Step 1; black, for peaks that are both location outliers and intensity outliers as identified by Steps 1 and 2.

Figure 4. Densities of peak intensities in three 200-Da regions of the spectra. Gray, density of intensities for the same set of peaks that contribute to the gray curve in Figure 3; red, for peaks identified as intensity outliers by Step 2; black, for same set of peaks that contribute to the black curve in Figure 3.

contribute to the gray curve in Figure 3. This is contrasted with the red curve which is the density of intensities for peaks identified as intensity outliers by Step 2 (hence these peaks are not location outliers). Also shown, in black, is the density of intensities for same set of peaks that contribute to the black curve in Figure 3, peaks that are both location and intensity outliers, as identified by Steps 1 and 2. In an alternate view of these, the location offset estimates are plotted as a function of m/z in Figure 2a. The increasing variance of the location offsets is evidenced here by the widening of the location offset estimates. The drift of these location offsets away from 0.5 (the downward trend) may indicate that these data are slightly miscalibrated. See, for example, the discussion of peptide mass clusters in Section 2. Similarly, the log of the background-peak intensities for the QC data set is plotted as a function of m/z in Figure 2b. The decay in these intensities, even within any 200 Da window, indicates that modeling the mean µI as a function of m/z is preferable to a constant mean. 3.5. Defining Admissible Peaks: Preprocessing. The primary focus of this paper is on filtering out background rather than preprocessing or peak detection. However, a successful application of the methods will depend on having a set of admissible peaks that includes all background-ion and peptide–ion peaks but does not include local maxima that arise from fine-scale noise. The quantification of peak intensity should be independent of background signal and also incorporate peak area as opposed to merely measuring raw height. This section outlines the method used for constructing the set of admissible peaks. We will assume that the collection of spectra has been aligned so that peaks in different spectra that correspond to a single peptide species occur at approximately the same location. The characteristics described in Section 2.1 will influence how peak detection is implemented by this, or any, method.

An assumption made here is that features relevant to characterizing both background signal and peptide signal may be characterized by a limited range of scales. As noted, the scale or width of peaks remains relatively constant with respect to the t axis, and so by preprocessing and extracting peaks with respect to t, any change in scale across the spectrum is reduced. Rather than smoothing the spectra, we begin by simply decomposing each spectrum X into a collection of constituent functions, each representing a different scale of signal content. This is done by a wavelet analysis in which X is decomposed into a sum J

X )

∑D j)1

j

+ SJ

(1)

where each of the detail functions Dj reflects changes in X of a certain scale, as indexed by j; Sj represents the smooth remainder after the detail functions have been extracted. The decomposition is done at dyadic scales so that local change of width 2j corresponds to the scale index j. The phrase “fine-scale noise” we have used throughout refers to signal content at scale j ) 1. The maximal-overlap discrete wavelet transform (MODWT) is used because of its translation-invariant property.19 The set of peak locations is defined by using the locations of local maxima in Dj. Since the Dj reflect local changes rather than local averages in X, modeling a baseline trend is not needed. In fact, local maxima in a Dj correspond not only to peaks but also inflections or “shoulders” in X. We will continue to refer to all of these features as peaks. Figure 1 shows an example of a detail function superimposed on one raw spectrum from which it was computed. Specifics on how peak locations are defined in this paper are provided in the Appendix. The basic idea is that t0 may be viewed as a peak location (of scale j) for X if the scale-j wavelet transform of Dj has a zero-downcrossing at t0. The Journal of Proteome Research • Vol. 7, No. 01, 2008 281

research articles wavelet transform acts as a scale-based derivative, and so this process is similar to a variety of other methods that use a smoothed spectrum and derivative estimate to define peaks as local maxima. For a brief illustrated description of this concept as applied to linear-mode MALDI spectra, see Randolph et al.20 These ideas can be extended to include multiple scales. In fact, we implement a t -dependent subset of scales that is chosen based on signal strength. That is, rather than use a single scale Dj, we allow for a range of scales which may vary k cι(t)Dι(t), for some coefwith t, defining D ≡ Dj,k(t) ) ∑ ι)j ficients functions cι. We implement this so as to use no more than two adjacent scales at any t: D: ) cjDj + cj+1Dj+1, where cj+ cj+1 ) 1 and these coefficients vary with t so as to emphasize broader scales when signal strength is low (large t). The resulting D reflects only a subtle change in scale as a function of t, but it helps increase the power to detect lowintensity peptide peaks by providing a more accurate description of background, even when signal strength is weak. Specifics on how D is chosen are provided in the next section and a summary of all preprocessing steps used to define admissible peaks is given in the Appendix. 3.5.1. Estimating Scale. In spite of the property that peak widths are relatively constant with respect to t, the smoothness of a peak (either background or peptide peak) degrades as intensity increases. For essentially all background and most peptide signal, intensity decreases as t increases, and so the smoothness of the entire spectrum degrades. This affects the ability to estimate both the existence of a peak and its location: for high-intensity peaks the signal is stronger and more easily estimated even at a relatively narrow scale. Moreover, observations indicate that in the smallest m/z regions as many as four features per Dalton are resolved, while above 3500 m/z at most one may be resolved. Therefore, even though a peak in either region is sampled by the same number of t measurements (hence are of the same scale with respect to t), the fact that background and isotopic peaks occur at approximately 1 per Dalton means the precision of a peak-location estimate will be poor and benefit from a wider bandwidth at higher m/z. See Section 3.3 and Figure 3a versus Figure 3c. The problem that needs to be addressed is: how should bandwidth be adapted across the spectrum so that peaks may be confidently detected? Our somewhat ad hoc solution is based on the empirical observation that signal strength decays exponentially as a function of mass; see Figure 2b. We use this observation to help choose a combination of adjacent scales: D ) cjDj + cj + 1Dj + 1 where cj + cj+1 ) 1 and vary with m. Specifically, expressing the background signal intensity as g(m) ) k0 exp(-k1m) define a bandwidth function S as a linear function of signal strength by assuming linear change across the mass range: S(m) ) a + b log(g(m)) ) a + b log(k0e-k1m) ) a′ + b′m To obtain a′ and b′ in the data exhibited here, we use the constraints that S(680) ) 14.25 and S(3500) ) 25. These come from the observations that at 680 m/z there are 57 TOF measurements per Dalton and up to four features per Dalton are easily resolved (14.25 ) 57/4), while at 3500 m/z, there are 25 TOF measurements per Dalton and one feature per Dalton appears to be resolved. Finally, since Dj reflects changes corresponding to width 2j this can be used to choose the detail 282

Journal of Proteome Research • Vol. 7, No. 01, 2008

McLerran et al.

Figure 5. Gray curve, mean spectrum; black curve, mean detail function; dashed line, µI ) mean background-peak intensity; blue tickmarks, location outliers; red ticmarks, intensity outliers. Insets are expanded views, as indicated, with the Poisson density plotted in green.

Figure 6. Gray curve, mean spectrum; black curve, mean detail function; dashed line: µI ) mean background-peak intensity; red ticmarks, intensity outliers. Insets are expanded views, as indicated, with the Poisson density plotted in green.

function D. As an illustration, the bandwidth in Figures 1 and 5 is calculated to be S(1290) ) 24.05 which indicates use of a scale slightly larger than j ) 4. Therefore, we use most of D4 and a little bit of D5: D ) 0.95D4 + 0.05D5. Similarly, in Figure 6, we use D ) 0.66D4 + 0.34D5. As mentioned, this scaleadaptive detail function D varies only subtly, but it helps to adjust to the transition in signal strength and feature resolution that is observed across the spectrum. 3.5.2. Template Matching To Identify Isotopic Peaks. So far, peaks have often been referred to as either ‘background’ or ‘peptide’ peaks. This is not fully accurate, however, since at this point the categorization has merely focused on background versus nonbackground. A subsequent step is needed to determine whether the detected nonbackground signal matches the pattern of isotopic peaks that a peptide is expected to exhibit. For this, the observed intensities may be further analyzed for their ability to match naturally occurring peptide isotopes, as predicted from a simple Poisson distribution.8 Typically used to model rare events, the Poisson distribution is used here to model the rate of occurrence of heavy isotopes: given a peptide with monoisotopic mass m, the probability that the peptide has 0, 1, 2, . . . occurrences of the stable isotope carbon-13 may be modeled as y -λm P(Y ) y|λm) ) λm e ⁄y!

(y ) 0, 1, 2, ...)

(2)

An implementation of this is carried out in Bellew et al.9 in which the Poisson rate of λm ) m/1800 was used along with a

Signal Detection in High-Resolution MS Data Kullback-Leibler deviance score for comparing closeness of the observed and expected isotopic distributions; the Poisson rate was based on its fit to theoretical isotopic distributions calculated from tryptic peptides in the human proteome sequence database (see Supporting Information of ref 9). We have borrowed both these ideas for use here. We refer also to Breen et al.8 (Section 3.2) where the idea of Poisson modeling of isotopic distributions was first introduced and a Poisson rate of λm ) m/1683 - 0.03091 was estimated from digests of some specific proteins. In our implementation, the intensity of each peak is quantified as the value of the detail function at the location of the peak. See the Appendix for a summary on how peak locations and intensities are calculated.

4. Results The proposed method is designed to increase sensitivity in detecting low-intensity peptide signal against a background that can easily mask the desired peaks. The ability of the method to do this without sacrificing specificity relies on a careful characterization of the background signal. This section provides evidence that this has been achieved. No definitive biological outcome is claimed, however, since true peptide information has not been determined for these data.

Figure 7. Phase II data spectrum, the same portion as shown in Figure 5.

Figures 5-7 illustrate results when the methods are applied to two data sets. In the QC data, all spectra arise from the same pool of sera, and so a common signal is expected. The second set, the phase II data, is a nonhomogeneous set containing potentially more diverse signal. In describing these figures, Steps 0, 1, and 2 refer to the process described in Section 3.3. In each of these figures, the gray curve is an average spectrum computed by taking the mean over all aligned spectra. The black curve at the bottom is the mean of the detail functions D that were computed for each of the aligned spectra. The dotted line is µI as estimated in Step 2. Tickmarks identify peak locations that were deemed not part of the background: a blue tickmark if it was determined to be a location outlier in Step 1, or a red tickmark if it was determined to be an intensity outlier in Step 2. Insets in each figure provide an expanded view of some peptide peaks along with the expected isotopic distribution (from) plotted in green. In Figure 5, the peptides near 1282 and 1292 are location outliers which would be attributed to background noise if peptide peaks were defined by a simple threshold based on a signal-to-noise ratio; see also Figure 1. For the peptide near 1296, the first two isotopic peaks were identified in Step 1 as

research articles location outliers, while the second two peaks were found by Step 2 to be intensity outliers. As suggested by the Poisson distribution (shown in green in the inset), these four peaks appear to correspond to a single peptide. One would then expect that these four peaks should all have the same location offset. Here, however, the latter two peaks are dominated by background peaks which share nearly the same locations, making the subtle difference in location difficult to detect visually. On the other hand, the intensities of these two are apparently elevated above background due to the sum of background and peptide signal. This example suggests that the location test is quite powerful in its ability to detect even the most subtle shift in location of peptide signal. In Figure 6, the peptide near 2560 is an intensity outlier which is not obviously different from background. However, the variance of the background intensities is quite small, as seen by comparing the detail function peak intensities with the dashed line, and so these three peaks are detected as rising above the background. The inset figures show the extent to which the two peptides identified here match the expected template of a Poisson distribution at this mass. Figure 7 shows the same region as Figure 5 for the controlswith-cancer data. The lack of a peptide near 1292 should not be considered biologically significant as none of the groups from these data (healthy controls, prostate cancer, etc.) show a peak at this location.

Figure 8. A doubly charged peptide with features identified by the tickmarks. The height of the magenta lines represent peak intensities. The location-outlier peaks are measured from zero, while the intensity-outlier peaks are measured from the background mean, µI. The inset compares these to the expected Poisson distribution of isotopic peaks (green).

Figure 8 provides another case of potential ambiguity in peptide versus background. Three red tickmarks indicate peaks that are intensity outliers relative to the background in this region. Interspersed among these are three blue tickmarks indicating peaks that appear as location outliers. A question arises as to whether these two sets of peaks represent two different peptides or a single doubly charged peptide. Or, possibly, one set of peaks corresponds to a peptide, and the others are merely background. Here, as elsewhere, quantification of peak intensities uses the detail function and incorporates the concept of additivity of background-ion plus peptide– ion intensities. The magenta lines represent intensities computed as to whether the peaks are intensity or location outliers: the latter are measured from zero and the former are measured by the extent to which they exceed µI. The inset reproduces these magenta intensities plotted on a common baseline and plotted alongside the expected Poisson density of isotopic peaks Journal of Proteome Research • Vol. 7, No. 01, 2008 283

research articles (equation) for a one-charge peptide having monoisotopic peak at 4051.3 ) 2 * 2026.15 - 1(m/z). The strong correlation (high Kullback-Liebler score) suggests that this is a single, doubly charged peptide. This is confirmed by the observation that a singly charged peptide occurs at 4051.5, essentially twice the Dalton location (not shown). This is consistent, up to the expected calibration accuracy for these data, with the location of the monoisotopic peak at 2026.15 m/z shown in Figure 8.

5. Discussion The paper provides a statistically motivated definition of peptide signal in data from high-resolution time-of-flight instruments. This is in contrast to methods that distinguish peptide peaks from other features by relying on a signal-tonoise ratio, a concept that requires an a priori designation of signal and noise. We have proceeded by attempting to characterize the background-ion portion of the spectra and defining informative features as those that are not background. Implicit in this is the assumption that peptide peaks are sparse relative to the background peaks. In our implementation, the use of a moving 200-Da window is, ironically, subject to the potential problem that a region contains lots of signal and not much noise. In particular, in regions where the sampling rate (number of TOF measurements per peak) is low and the estimation of background properties diminished (see, e.g., above 2500 m/z in Figure 2), it is conceivable that the window contains a region rich in peptide–ions that coincide with background ions. This would degrade the method’s ability to identify subtle peptide peaks. A potential solution would be to implement a moving window whose width increases as a function of m/z or as a function of peptide content, possibly estimated using peak heights. We also note that the method as applied here uses the mean of a set of (processed) spectra rather than identifying peptides in individual spectra. It is possible to apply the method to individual spectra, but some power is lost in couple of ways. First, the variance of the background-ion signal (intensity and location) will be larger, and hence, the power to detect subtle outliers is diminished. Second, one cannot appeal to the central limit theorem to guide the selection of a critical value that determines the location and intensity outliers. As implemented here, a large set of spectra allows us to be guided by the normal distribution (via the central limit theorem) for choosing critical values that define intensity and location outliers. The critical values of 2.5 for location outliers and 2.35 for intensity outliers are based on tail areas of approximately 0.5% and 1%, respectively. We use two-sided and one-sided tests, respectively, for location and intensity outliers so the probability of removing peaks from the admissible set that are truly background is the same in each case. One of the primary contributions presented here is the use of peak location, relative to the background, in identifying peptide signal. Indeed, this serves two purposes: (1) extracting peaks (location outliers) which would be missed if the focus were solely on intensity, and (2) distilling the collection of peak intensities so that a better estimate of background intensity variance can be obtained and hence the detection of more subtle intensity outliers. In particular, the ordering of Steps 1 and 2 in Section 3.3 is important. Although the proposed ideas are not instrument-dependent, any implementation requires care and attention to the specifics of data from that instrument. Finally, we reiterate the point that this presentation is not focused on preprocessing and peak detection but certainly 284

Journal of Proteome Research • Vol. 7, No. 01, 2008

McLerran et al. success of the proposed method is subject to these procedures. We have, however, described a peak-detection method that appears to work well for these purposes. It is implemented it in a way that attempts to adapt to decreasing signal strength and sampling rate, an idea that may be of independent interest.

6. Appendix: Summary of Preprocessing Used To Define Admissible Peaks We summarize the process of producing a set of admissible peaks beginning with a raw set of spectra {Xi}ni)1. This set is the starting point for Section 3.3. Although the alignment of spectra is an important part of any preprocessing, it is not discussed, since the ideas proposed here are independent of the alignment method used. For reference, the spectra were aligned using the method described in Randolph and Yasui.2 On the other hand, the concept of scale or bandwidth appears throughout the discussion and is closely tied with our definition of a set of admissible peaks. Since the methods and results are based on this set, we provide specifics on how one might construct this set in a way that is most closely related to the ideas in this paper. We emphasize, however, that a variety of other peak-detection methods could be used to construct this set. In fact, we describe two methods here, one a slight modification of the other. The first describes how to simply define features (locations and intensities) in terms of a mean computed from all spectra. In a second method, the location of an admissible peak is determined by the mode of the histogram of locations obtained from each spectrum. The first method is more straightforward to visualize and compute. The latter may provide improved estimates of location in the higher m/z regions, and is the method used in this paper. It is assumed here that the length (i.e., the number of intensity measurements) of each spectrum X is the same; denote this by T. A wavelet decomposition involves a sequence of functions {Dj}Jj)1, each containing features representing local changes in X that occur at scale 2j: D1 corresponds to changes across a 21 -unit domain and S1: ) X - D1 is the remaining signal after D1 is removed. Repeating this process for Sj, instead of X, for j ) 1, ..., j (J e log2 T)), results in the decomposition 1. Method 1. First estimate the scales of relevant features, S; this is a function of m/z, as described in the section on Estimating Scale. Then for each spectrum, Xi(t) (represented with respect to the t axis): (i) Decompose Xi as in eq 1, and form Di(t) ) cjDi,j(t) + cj+1Di,j+1(t). (ii) Normalize each Di by dividing by the square root of its variance. (Since Di has mean zero, this is equivalent to normalizing by its L2 norm and adjusts only for the scale(s) of signal contained in Di.) n (iii) Form the mean details function D ) 1⁄n∑ i-1 Di (see Figure 5). (iv) Using the Haar wavelet family, compute the MODWT of D for all scales j ) 1, ..., j. This gives a family of transforms of D, {Wj}Jj ) 1 whose local maxima may be viewed as scale-j peaks in the data. Using the same combination of scales as in (i), form W: ) cjWj + cj+1Wj+1. (v) Find the zero-downcrossings of W. Since W acts as a scale-based derivative operator, the location of a zerodowncrossing corresponds to a local maximum in D. This produces a list of the admissible peak locations {tk}Kk)1.

research articles

Signal Detection in High-Resolution MS Data We quantify the intensity of a peak at location tk by the amount that D(tk) extends beyond the background. In Figure 8, for example, this is the height of the black curve as measured at a blue tickmark or the height of the black curve above the dashed line as measured at a red tickmark. The idea is not to obtain a precise measure of absolute peptide abundance, but to quantify the relative size of a feature for comparison across the spectrum. In particular, this incorporates shape, not merely height, of a peak into the intensity it is assigned. A dilution experiment for MALDI spectra provides some evidence that this is a useful measure of peak intensity.20 Method 2. Repeat the process in Method 1 through item (iii), except now also compute a wavelet transform Wi of each Di, i ) 1, ..., n. Item (v) also remains the same after modifying item (iv) as follows: (iv*) Find the zero-downcrossings in each Wi. This produces n lists of (potential) peak locations, one list from each spectrum. Use this to compute a histogram of peak locations found in all n spectra. Compute a localaveraging smooth of the histogram. The locations of local minima in this smooth are used to delimit bins in which peak locations aggregate in high density. Within each bin, peaks from different spectra are treated as the same peak. The mode of the histogram within each bin defines the list of admissible peak locations, {tk}Kk ) 1.

Acknowledgment. This work was supported by NIH grants from the NCI and NIGMS: R01-CA126205, U01-CA086368, and K25-GM67211. The authors thank two anonymous reviewers for their careful reading and helpful comments. Supporting Information Available: Summary data from the aligned and processed spectra, including the mean spectrum and the mean details function; a documented Matlab21 function that performs the analysis; a file with Matlab commands that loads these data, runs the function, and produces Figure 2 and portions of Figure 8. This information is available free of charge via the Internet at https://proteomics.

fhcrc.org/CPAS. Navigate to Published Experiments, then select “McLerran et al.” JPR 2007.

References (1) Yasui, Y.; Pepe, M.; Thompson, M. L.; Adam, B. L.; Wright, G. L.; Qu, Y.; Potter, J. D.; Winget, M.; Thornquist, M.; Feng, Z. Biostatistics 2003, 4, 449–463. (2) Randolph, T. W.; Yasui, Y. Biometrics 2006, 62, 589–597. (3) Kast, J.; Gentzel, M.; Wilm, M.; Richardson, K. J. Am. Soc. Mass Spectrom. 2003, 14, 766–776. (4) Busch, K. Spectroscopy 2004, 19, 44–52. (5) Yu, W.; Wu, B.; Lin, N.; Stone, K.; Williams, K.; Zhao, H. Comput. Biol. Chem. 2006, 30, 27–38. (6) Gay, S.; Binz, P.; Hochstrasser, D. F.; Appel, R. D. Electrophoresis 1999, 20, 3527–3534. (7) Gras, R.; Müller, M.; Gasteiger, E.; Gay, S.; Binz, P.; Bienvenut, W.; nad, J. C.; Sanchez, C. H.; Bairoch, A.; Hochstrasser, D.; Appel, R. Electrophoresis 1999, 20, 3535–3550. (8) Breen, E.; Hopwood, F.; Williams, K.; Wilkins, M. Electrophoresis 2000, 21, 2243–2251. (9) Bellew, M.; Coram, M.; Igra, M.; Fitzgibbon, M.; Randolph, T.; Wang, P.; Eng, J.; Lin, C.; Goodlett, D.; Fang, R.; Detter, A.; Zhang, H.; Whiteaker, J.; Paulovich, A.; McIntosh, M. Bioinformatics 2006, 22, 1902–1909. (10) Mann, M. Proceedings of the 43rd ASMS Conference on Mass Spectrometry and Allied Topics, 1996; p 639. (11) Wolski, W.; Farrow, M.; Emde, A.-K.; Lehrach, H.; Lalowski, M.; Reinert, K. Proteome Science 2006, 4, 18. (12) Gobom, J.; Mueller, M.; Egelhofer, V.; Theiss, D.; Lehrach, H.; Nordhoff, E. Anal. Chem. 2002, 74, 3915–3923. (13) Wolski, W.; Lalowski, M.; Jungblut, P.; Reinert, K. BMC Bioinf. 2005, 6, 203. (14) Wool, A.; Smilansky, Z. Proteomics 2002, 2, 1365–1373. (15) McLerran, D.; Grizzle, W. E.; Feng, Z.; Thompson, I. M.; Bigbee, W. L.; Cazares, L. H.; Chan, D. W.; Dahlgren, J.; Diaz, J.; Kagan, J.; Lin, D.; Malik, G.; Oelschlager, D.; Partin, A.; Randolph, T.; Sokoll, L.; Srivastava, S.; Srivastava, S.; Thornquist, M.; Troyer, D.; Wright, G. L.; Zhang, Z.; Zhu, L.; Semmes, O. J. Clin. Chem. 2008, 54, DOI: 10.1373/clinchem.2007.091496. (16) Huber, P. J. Ann. Stat. 1973, 1, 799–821. (17) Hampel, F.; Ronchetti, E.; Rousseeaw, P.; Stahel, W. Robust Statistics, The Approach Based on Influence Functions; John Wiley and Sons: New York, 1986. (18) Cook, R. D. Residuals and Influence in Regression; Chapman and Hall: New York, 1982. (19) Percival, D. B.; Walden, A. T. Wavelet Methods for Time Series Analysis; Cambridge University Press: Cambridge, U.K., 2000. (20) Randolph, T. W.; Mitchell, B. L.; McLerran, D. F.; Lampe, P. D.; Feng, Z. Mol. Cell. Proteomics 2005, 4, 1990–1999. (21) The Mathworks Inc., Matlab version 7.5; 2007.

PR700640A

Journal of Proteome Research • Vol. 7, No. 01, 2008 285