ac webworks
MassSpectator: Fully automated peak picking and integration
A Web-based tool for locating mass spectral peaks and calculating their areas without user input.
William E. Wallace, Anthony J. Kearsley, and Charles M. Guttman
Should measures of quantity be solely matters of opinion? For example, would you be satisfied with “guesstimates” of the purity of your food, water, or medications? How would you decide between two conflicting opinions on critical measures? MS is used in making such decisions in a host of scientific fields, from environmental and pharmaceutical science to forensics and standards. But when it comes down to peak integration in making quantitative determinations from mass spectra, the opinion of the analyst can carry more weight than any other aspect of the measurement. Clearly, an unbiased method would lend confidence to quantitative measures based on MS.

MassSpectator is a novel suite of numerical algorithms that accurately locates peaks in real mass spectral data and calculates the area beneath them by using only reproducible mathematical operations. It requires no user-selected parameters but does require a background spectrum to determine instrumental noise. Such a fully automated algorithm is useful for rapid and repeatable processing of mass spectral data containing hundreds of peaks. By working without any user input, it saves operator time and eliminates intentional or unintentional operator bias. The first benefit is desirable when processing large amounts of data, for example, in proteomics research. The second is necessary in fields such as forensics or standards, in which operator bias in the data analysis cannot be tolerated.

We hope MassSpectator begins a dialog between analytical chemists and mathematicians on the development of unbiased approaches to data analysis. To this end, a publicly accessible Web-server application for online, real-time application of the method can be found at www.nist.gov/maldi. The e-mail address for comments is [email protected].

© 2004 American Chemical Society. ANALYTICAL CHEMISTRY / MAY 1, 2004, 183 A
MassSpectator locates peaks and calculates their areas in three steps: statistical characterization of both the data set and an analyte-free background spectrum; segmentation of the data set to determine “strategic points” using a new time series segmentation method, which was developed at the National Institute of Standards and Technology (NIST) and is described below; and deflation of the number of strategic points, guided by the statistical properties of both data sets, to distill the spectrum to its essential features. After those three steps, a polygonal fitting routine is used to calculate relative peak area.

The output file contains one line per peak found. Each line consists of seven entries: the x and y coordinates for the beginning of the peak, the x and y coordinates for the center of the peak, the x and y coordinates for the end of the peak, and the relative peak area. Note that for closely spaced peaks, the strategic point that defines the end of one peak may also define the beginning of the next.

Time series segmentation algorithms are numerical methods used to subdivide functions into conjoined line segments. In doing so, these algorithms reduce the number of points used to define the function, often by a factor of 1000 or more. It is critical for such methods to identify and preserve the most important features of the function while replacing the subtle features with straight lines; in the case of mass spectra, the features of greatest importance are the peaks. What is new in MassSpectator is its ability to adjust these line segments to best fit the data set by using a method similar to a least-squares fit.

Additional strengths of this method are that it requires no knowledge of peak shape and no preprocessing of the data, such as smoothing or baseline correction, which typically distorts peak areas. However, the method does require a blank (analyte-free) spectrum to calibrate instrument background noise.
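To make the idea of segmentation concrete, the following is a minimal sketch of a generic top-down segmentation scheme, followed by a polygonal area calculation. It is not the NIST algorithm: it omits the least-squares-like adjustment of the line segments and the deflation of strategic points. The function names (`noise_level`, `segment`, `polygon_area`), the use of the blank spectrum's standard deviation as the noise measure, and the factor of 3 applied to it are all assumptions made for illustration.

```python
import statistics

def noise_level(blank):
    """Standard deviation of the intensities of an analyte-free spectrum."""
    return statistics.pstdev(blank)

def segment(x, y, threshold):
    """Return sorted indices of 'strategic points' by top-down splitting.

    Each segment is split at its worst-fitting interior point until every
    point lies within `threshold` (vertical distance) of its line segment.
    """
    keep = {0, len(x) - 1}
    stack = [(0, len(x) - 1)]
    while stack:
        lo, hi = stack.pop()
        if hi - lo < 2:          # no interior points left to test
            continue
        slope = (y[hi] - y[lo]) / (x[hi] - x[lo])
        # Largest vertical deviation of an interior point from the chord.
        dev, idx = max(
            (abs(y[i] - (y[lo] + slope * (x[i] - x[lo]))), i)
            for i in range(lo + 1, hi)
        )
        if dev > threshold:      # still above the noise: keep splitting
            keep.add(idx)
            stack.extend([(lo, idx), (idx, hi)])
    return sorted(keep)

def polygon_area(points):
    """Shoelace formula for the area of a simple polygon of (x, y) vertices."""
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# Hypothetical data: one triangular peak on a flat baseline.
x = list(range(11))
y = [0, 0, 0, 0, 5, 10, 5, 0, 0, 0, 0]
blank = [0.1, -0.1, 0.1, -0.1]

threshold = 3 * noise_level(blank)   # factor of 3 is an assumed choice
points = segment(x, y, threshold)    # [0, 3, 5, 7, 10]
begin, center, end = points[1], points[2], points[3]
area = polygon_area([(x[i], y[i]) for i in (begin, center, end)])  # 20.0
```

In this toy case the eleven data points reduce to five strategic points, and the peak is described by its beginning, center, and end, the same three coordinate pairs that appear in each line of the output file. Setting the threshold to zero would drive the splitting toward a segment between every pair of adjacent points, which is why some measure of background noise is needed as a stopping rule.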
This blank spectrum tells the method when to cease segmenting the data set; otherwise, the method would run to its logical conclusion and simply insert line segments between each pair of adjacent data points. This, of course, would neither reduce the size of the data set nor extract any useful features.

[Figure: Representation of a time series segmentation algorithm. The ideal measurement response is shown (a) by itself and (b) with added random noise. (c–e) Depiction of the first three segmentation steps. (f) Segmentation result superimposed on the ideal response.]

MassSpectator represents a first attempt to create an unbiased method using time series segmentation to analyze mass spectral data. It will no doubt lead to more innovation in the rich and varied field of numerical approaches to time series segmentation. We have just begun to investigate the many facets of this approach to mass spectral analysis. Moreover, its application to other spectroscopies (e.g., IR spectroscopy) is untested. Is a unified approach to peak identification and integration for all spectroscopies used by
the analytical chemist possible? What is the essential logic that must be obeyed in all cases? What biases do smoothing and background subtraction impose on the data? These questions remain unanswered, for now.

Wallace, Kearsley, and Guttman are researchers at the National Institute of Standards and Technology in Gaithersburg, Md. Further information about MassSpectator can be found in this issue of Analytical Chemistry (pp 2446–2452) and in Applied Mathematics Letters, “A Numerical Method for Mass Spectral Data Analysis”, in press.